Message boards : Number crunching : Many instances of MiniRosetta put computer "out of memory"
Author | Message |
---|---|
Hungarian National Philharmonic Orchestra and Choir Send message Joined: 12 Mar 09 Posts: 5 Credit: 1,804,715 RAC: 0 |
Hello all! I have several Windows computers that run the Rosetta@home project, and it regularly clogs them by gradually increasing the number of minirosetta executables. Only a few of these (as many as there are CPU cores) seem to be active, but the remaining ones take up memory as well. After several days, as the number of instances grows, the computers become unusable. I have tried updating BOINC, but the behavior was the same through several BOINC and MiniRosetta versions, so I stopped the BOINC service on all computers and haven't experimented with it for a while. I would love to solve this problem. All computers are on the internet all the time. Here are the most significant settings: "Use GPU" is ON; switch between applications every 60 minutes; use at most 75% CPU time; network usage: no limit set; connect about every 0.1 days; additional work buffer: 0.3 days; tasks checkpoint to disk every 60 seconds; use at most 37% of RAM (same for in use and idle); "Leave applications in memory while suspended" is OFF. The computer is a Core2 Quad with 4 GB RAM, a large HDD, and an NVidia 8600 GPU card. Please see the Task Manager screenshot here: www.astris.com/boincerror.jpg I guess it must be a configuration error, I have tried to change things, but the very same things happened: after a lengthy run, minirosetta instances piled up. Thank you in advance, Laszlo, Budapest, Hungary |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I am surprised to see that your setting for leaving applications in memory was OFF, because leaving them in memory is the only way I can think of that you should see any in the task list that are not actively running on a CPU. I don't see any quad cores in the list of hosts for the user ID that you posted this message from. I haven't heard of such a thing either. But I could picture it happening if you run lots of Rosetta threads at the same time (such as one per CPU), limit the amount of memory available for Rosetta to run (such as 37%), and then leave the applications in memory when they get suspended because BOINC is not allowed enough memory to run them. It sees too much memory in use, stops a task, and begins another one hoping it will use less memory, which it does as it starts up; then, as that task requires more, the cycle continues. If you have 4 CPUs, then I would suggest that rather than running 75% of CPU time, you instead run with 75% of the available CPUs. Or limit the number of CPUs to 3. This will help BOINC see that you would prefer it to run at less than full capacity and in less memory. So it will only try to start 3 tasks rather than the default of 4, which means one task's worth less memory is required. I would suggest you increase the amount of memory you allow BOINC to use, but you have not stated why you set it so low, so I assume you have other reasons for that. The above will at least reduce the number of active tasks and therefore the memory they try to use. I would also suggest you install the current version of BOINC, to ensure that any problem with it NOT removing tasks from memory when it was supposed to would have a fix installed. Rosetta Moderator: Mod.Sense |
Adam Gajdacs (Mr. Fusion) Send message Joined: 26 Nov 05 Posts: 14 Credit: 3,213,406 RAC: 266 |
Does the BOINC manager show Rosetta tasks in "Waiting for memory" state, when you see those idle Rosetta processes in the Task Manager? I'm not sure if this actually happens when "leave in memory" is off, but it may be caused by a long standing well known issue between BOINC and Rosetta, affecting multi-core systems the most. When a Rosetta WU starts running but hits the global memory limit specified in BOINC, it gets switched to the "Waiting for memory" state and a new WU is started. If that one runs into the memory boundary, it also begins to wait for memory and a new one is started, and so on, until finally a WU is started which actually fits within the memory constraints and can run its course. This "WU cascade" can use up all the physical memory, and bloat the pagefile, eventually pretty much killing the system. If this is what affects you, the only thing that might help is to increase the memory use limit to a value that results in about 400-500MB memory per CPU, so about 45-55% at least, but more like 60%, in your case. Edit: I should've checked that image before posting all this. Those appear to be dead Rosetta instances, caused by some error that prevents them from running, and they are not terminated properly either. Maybe some file permission issues? Hardware (memory) fault? Some bad project data file (you could try disabling work fetch, draining your WU cache, reset the project or even detach, then allow work fetch again so that everything is redownloaded)? |
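The "WU cascade" described in the post above can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not BOINC's actual scheduler, the function name `simulate_cascade` is hypothetical, the per-task memory figures are invented, and it assumes that a task parked as "waiting for memory" stays resident (which, as the post itself notes, may not hold when "leave in memory" is off).

```python
def simulate_cascade(peaks_mb, limit_mb, cores):
    """Toy model of the "WU cascade" (not BOINC's real scheduler).

    Tasks are started in queue order. A task that would push the running
    total over the limit is parked as "waiting for memory" but, in this
    model, stays resident. Starting stops once every core is busy or the
    queue is empty.
    """
    running, waiting = [], []
    for peak in peaks_mb:
        if len(running) == cores:
            break
        if sum(running) + peak <= limit_mb:
            running.append(peak)
        else:
            waiting.append(peak)
    resident_mb = sum(running) + sum(waiting)
    return running, waiting, resident_mb


# Hypothetical per-task peak memory figures in MB, invented for illustration.
queue = [450, 500, 480, 520, 300, 510, 80]

for pct in (37, 60):
    limit = 4096 * pct / 100  # memory limit on a nominal 4 GB machine
    running, waiting, resident = simulate_cascade(queue, limit, cores=4)
    print(f"{pct}% limit: {len(running)} running, {len(waiting)} waiting, "
          f"{resident} MB resident")
```

With these made-up numbers, the tighter 37% limit ends with three parked tasks and more memory resident overall than the 60% limit, which is the counterintuitive effect the post describes.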
Jochen Send message Joined: 6 Jun 06 Posts: 133 Credit: 3,847,433 RAC: 0 |
If you look at his computer list, there are 38 copies of one computer. Looking at the one with the most tasks, they all crashed because of an access violation. I would run some system diagnostics. Looks like the RAM or CPU might be faulty. Or it's a driver issue... Hard to tell from a distance. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Jochen wrote: "If you look at his computer list, there are 38 copies of one computer. Looking at the one with the most tasks, they all crashed, because of an access violation." There are only two active hosts. Neither of them matches the description in the original post, so I am assuming there are more hosts under another user ID somewhere. But perhaps the details of the wrong host were posted. All those errors are sure trying to tell us something about this host. Rosetta Moderator: Mod.Sense |
Hungarian National Philharmonic Orchestra and Choir Send message Joined: 12 Mar 09 Posts: 5 Credit: 1,804,715 RAC: 0 |
Hello everyone, first of all, thank you very much for all the replies. A couple of additions: 1. We are talking about more than 30 office computers, which are mostly similar (hardware-wise), plus my home computer, which is entirely different hardware. They all belong to the same team. Both my home computer and several of the office computers produced this problem, several times. (For most office computers I did not wait for the memory to fill up; I just stopped BOINC on them.) So we cannot be talking about a hardware failure here. 2. Most office computers have 2 or 3 GB of RAM (dual-core); my home computer (quad-core) has 4 GB of RAM. So, when we mention 37% of RAM, we are actually talking about 670 MB, 1 GB, and 1.5 GB of RAM dedicated to Rosetta (that is about 350-500 MB per core). Anyway, at the beginning, the memory percentage was set a lot higher, and when the problems began, I lowered it to 37%, hoping to prevent BOINC from eating too much of our RAM. It did not work. 3. When restarting the computer, these "dead" (?) processes are still there until I kill them one by one. (I have done that on several computers, several times.) 4. How can I tell if a task is in the "Waiting for memory" state? I have never seen that on any computer. What should I check to avoid file permission problems? On ALL computers, BOINC is (or, rather, WAS) running as a service. 5. By the way, I HAVE tried aborting all tasks, detaching Rosetta, uninstalling BOINC, and reinstalling, and not much later the same view welcomed me in my Windows Task Manager. Best regards, Laszlo |
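The memory arithmetic in point 2 can be checked with a short calculation. This is a minimal sketch, assuming nominal RAM sizes (1 GB = 1024 MB) and an even split across cores; the function name `ram_budget_mb` is hypothetical. The results come out somewhat different from the rounded figures quoted in the post.

```python
def ram_budget_mb(total_gb, limit_pct, cores):
    """Return (total MB allowed, MB per core) for a given RAM limit.

    Assumes nominal RAM (1 GB = 1024 MB) shared evenly between cores.
    """
    total_mb = total_gb * 1024
    budget_mb = total_mb * limit_pct / 100
    return budget_mb, budget_mb / cores


# Machine types mentioned in the post: (label, RAM in GB, cores).
machines = [
    ("office, dual-core", 2, 2),
    ("office, dual-core", 3, 2),
    ("home, quad-core", 4, 4),
]

for label, gb, cores in machines:
    budget, per_core = ram_budget_mb(gb, 37, cores)
    print(f"{label}, {gb} GB: {budget:.0f} MB allowed, {per_core:.0f} MB per core")
```

Changing the `37` to another percentage shows how much each core would get under a different limit.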
Hungarian National Philharmonic Orchestra and Choir Send message Joined: 12 Mar 09 Posts: 5 Credit: 1,804,715 RAC: 0 |
P.S. After a long break, I have been experimenting with my home computer for about 3 hours now, and can answer a few questions. In the BOINC task list, I see 4 tasks "ready to report" and another 4 "running". In the Windows Task Manager, I see 8 instances of minirosetta_2.14_windows_x86_64.exe, 4 of them taking about 220-260 MB of RAM, the others from 4 MB to 19 MB. Only one of them seems to be active (CPU activity and slightly changing RAM consumption). In the Task Manager, from time to time I see some dead instances disappear and another one (or the same one, restarted?) take its place in the list, with 0 CPU usage. I really hope you will figure this out. :-) Laszlo |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2474 Credit: 46,499,576 RAC: 3,223 |
Quoting the original post: "I guess it must be a configuration error, I have tried to change things, but the very same things happened: after a lengthy run, minirosetta instances piled up." Have you tried any of Mod.Sense's suggestions yet? I had some of your problems (nowhere near as bad) when I first got my quad-core machine. The solution involved setting "Use at most xx% CPU time" to 100% and "Leave applications in memory while suspended" to ON. If you only run Rosetta then there's no GPU used, so you may as well set that to OFF as well. Those are the most obvious problem areas. Report back on the result and we can take it from there. Getting these settings wrong causes some crazy symptoms, so just concentrate on the cause rather than the symptoms for now - you'll probably find a lot of them disappear once those 2 settings are corrected. |
Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
mikey Send message Joined: 5 Jan 06 Posts: 1898 Credit: 12,723,752 RAC: 682 |
One other thing to do is to check the setting 'While processor usage is less than ___ percent (0 means no restriction)'. It is in the BOINC Manager under Advanced, Preferences, Processor usage. The default is 25; change it to zero and BOINC should stop suspending units based on processor usage. Also, did you say you installed BOINC as a 'Service'? If so, uninstall BOINC and do not install it that way. There was a problem with that (I do not remember what anymore), and just installing BOINC normally seems to work better. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I believe Chilean intended to say "it shouldn't even slightly slow down". First things first. If you will be studying this on your home machines, what are the host IDs? If you have 4 tasks ready to report and 4 running, then there should only be 4 instances of rosetta in the task list. So then one would want to see the outcome of the 4 that completed. Did they report exceptions? Or run to completion and end normally? The "waiting for memory" is one of the statuses you might see in the same column where you saw "ready to report" and "running". So, it doesn't sound like they are waiting for memory. Rosetta Moderator: Mod.Sense |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2474 Credit: 46,499,576 RAC: 3,223 |
While awaiting the OP's return, I see there's some success among the failures now, with the failures showing our dear friend "Can't acquire lockfile - exiting". As I recall, these are remnants of when the problems were happening, but I can't remember how to fix it. Something to do with locked files in C:\Users\All Users\BOINC\slots, requiring BOINC to be shut down in Task Manager and all services stopped, boinc_lockfile deleted in each of the rogue folders (numbered 0, 1, 2, 3 etc.), then rebooting. Can someone confirm the correct procedure or point to a link? |
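The cleanup recalled in the post above could be sketched roughly as follows. This is an illustration only, not a confirmed procedure: the slots location and the `boinc_lockfile` name are taken from the post as recalled by its author, the function names are hypothetical, and BOINC and its service are assumed to be fully stopped before anything is deleted. The sketch defaults to a dry run that only lists what it finds.

```python
from pathlib import Path


def find_lockfiles(slots_dir):
    """List boinc_lockfile entries inside numbered slot folders (0, 1, 2, ...)."""
    slots = Path(slots_dir)
    return sorted(
        p for p in slots.glob("*/boinc_lockfile")
        if p.parent.name.isdigit() and p.is_file()
    )


def remove_lockfiles(slots_dir, dry_run=True):
    """Delete stale lockfiles; with dry_run=True nothing is deleted.

    Assumes BOINC and its service have already been stopped.
    Returns the list of lockfiles that were found.
    """
    found = find_lockfiles(slots_dir)
    if not dry_run:
        for lockfile in found:
            lockfile.unlink()
    return found
```

A cautious way to use it is to call `remove_lockfiles(r"C:\Users\All Users\BOINC\slots")` first, check the returned list, and only then repeat the call with `dry_run=False`; the path shown is the one from the post and may differ on other installations.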
Hungarian National Philharmonic Orchestra and Choir Send message Joined: 12 Mar 09 Posts: 5 Credit: 1,804,715 RAC: 0 |
Today I set my home PC and one work PC to 100% CPU usage, raised the RAM limit, set "use GPU" to OFF (since I run nothing else but Rosetta), and set "leave apps in memory" to ON. BOINC is still running as a service. I will get back to you with the results in 24 hours or so. I am a bit concerned about the 100% CPU because of possible heating issues. (I have never had any, but it worries me a bit.) I have NEVER seen "waiting for memory" in BOINC Manager's task list. Again, thank you all for trying to help. Will be back soon. Regards, Laszlo |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Depending on your part of the world, heat will become a welcome byproduct of crunching very soon :) Running at 100% on 75% of the available CPUs should reduce the heat and yet avoid the problem some have seen when running less than 100% of the time. Rosetta Moderator: Mod.Sense |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2474 Credit: 46,499,576 RAC: 3,223 |
Looking at the 2 visible computers there are a lot more successes now, but still a few stray "can't acquire lockfile - exiting" messages in those few tasks with errors. Can someone help with that issue specifically - I know someone helped me solve it a long time ago, so it can be done. On CPU time, mine is always 100% and has never been the cause of an issue. Note that 75% CPU time is actually 100% for 75% of the time, then 0% for 25% of the time, while Mod.Sense's suggestion is 3 cores at 100% and the 4th at 0%. I don't know which is better, but no option allows 75% activity 100% of the time (sounds confusing, but I think I wrote that correctly). Now, about that lockfile issue again... |
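The difference between the two throttling options compared above can be put into numbers. This is a small sketch under assumptions: the function names are hypothetical, the "% of CPUs" value is assumed to round down to whole cores with a minimum of one, and every started task is assumed to stay in memory.

```python
def by_cpu_time(cores, cpu_time_pct):
    """'Use at most X% CPU time': all cores busy, but only X% of the time."""
    return {
        "tasks_in_memory": cores,
        "duty_cycle": cpu_time_pct / 100,
        "core_equivalents": cores * cpu_time_pct / 100,
    }


def by_cpu_count(cores, cpu_count_pct):
    """'Use at most X% of the CPUs': fewer cores, each busy all the time.

    Assumption: the core count rounds down, with a minimum of one core.
    """
    active = max(1, int(cores * cpu_count_pct / 100))
    return {
        "tasks_in_memory": active,
        "duty_cycle": 1.0,
        "core_equivalents": float(active),
    }


print(by_cpu_time(4, 75))
print(by_cpu_count(4, 75))
```

On a quad-core, both options deliver the same three core-equivalents of work, but the CPU-count option keeps only three tasks in memory and never pauses them, which is the reason given in the thread for preferring it.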
mikey Send message Joined: 5 Jan 06 Posts: 1898 Credit: 12,723,752 RAC: 682 |
Sid Celery wrote: "Looking at the 2 visible computers there are a lot more successes now, but still a few stray "can't acquire lockfile - exiting" messages in those few tasks with errors. Can someone help with that issue specifically - I know someone helped me solve it a long time ago, so it can be done." Try this: http://www.boinc-wiki.info/Can%27t_acquire_lockfile_-_exiting |
Hungarian National Philharmonic Orchestra and Choir Send message Joined: 12 Mar 09 Posts: 5 Credit: 1,804,715 RAC: 0 |
Hi everyone, I think it is not too early to draw some conclusions: setting the CPU to 100% and increasing the memory limit eliminated the problem; there are no more dead processes in my Windows Task Manager. Since I changed 2 things at once, I will now reduce CPU usage to 85%, leaving the memory limit high, to see if the problem comes back. (One Rosetta process took 560 MB of RAM, uuhhh.) Answering one thing here: now at 100% I DO feel some sluggishness in some not-so-important areas, namely in Flash animations (e.g. FarmVille) and when accessing the computer remotely. Still, running a CPU at 100% all the time worries me. It feels like running a car engine at 5000 RPM for a long time. I know there are no moving parts here, but some form of material wear could (?) shorten the life of CPUs. (This, by the way, is a known problem with flash memory cards: after a lot of reads and rewrites, the microscopic data-storing structures show some kind of aging, so it seems that SSDs are not a good idea. But I don't know if anything similar applies to CPUs.) I will come back with my new findings. Thank you, Laszlo |
Send message Joined: 3 Nov 05 Posts: 1834 Credit: 124,260,318 RAC: 9 |
It's unlikely that Rosetta will have any noticeable effect on the life of your CPUs - they'll become obsolete long before they start to fail, as long as the temperatures are reasonable. I've seen it suggested that damage is more likely without Rosetta (or any other CPU-intensive program) because the fluctuating temperatures cause the core to contract and expand, but I don't know whether that's valid or not. If it's any comfort, I've been running Rosetta (and FAD and UD-Think before it) for around 10 years now on lots of computers (probably in the region of 100 CPUs) in lots of configurations and locations, and I've never had a CPU fail while running. I've damaged a few while overclocking or by breaking pins off when handling, but I've never had one die from running Rosetta... |
Adam Gajdacs (Mr. Fusion) Send message Joined: 26 Nov 05 Posts: 14 Credit: 3,213,406 RAC: 266 |
The aging of flash-based storage devices is a completely different thing; that technology is not used in CPUs. Running a CPU at 100% 24/7 has no noticeable impact on its life expectancy (well, unless you plan to use it for several decades, by which time the erosion that the flow of electrons causes on the pathways will indeed become a factor to consider) as long as there's adequate cooling (or maybe even without, since modern CPUs won't let themselves overheat; they throttle back their internal clocks to lower dissipation). |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I've been told by friends on the manufacturing side of things that multi-layer chips and circuit boards tend to crack or delaminate over time due to temperature CHANGES, not necessarily high heat levels. This would typically be caused by turning the machine off and on frequently. And that is why there is a camp of people who feel that leaving it on all the time will yield better longevity than turning it off every day. If the box is designed with adequate ventilation, and the room is at a reasonable temperature, there should not be a problem. dcdc, the suspected problem previously was apparently caused by running less than 100% of the time and introducing those pauses. That was my reason for suggesting a method of heat reduction that can be achieved while still running 100% of the time. It will cause the machine to begin only 3 tasks, saving memory (which the configuration settings indicated was a key resource), and run them 100% of the time. Rosetta Moderator: Mod.Sense |
©2025 University of Washington
https://www.bakerlab.org