Posts by Adam Gajdacs (Mr. Fusion)

1) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 70299)
Posted 9 May 2011 by Adam Gajdacs (Mr. Fusion)
Post:
I am Ray, a graduate student in the Baker lab. I will be taking care of the issues caused by "FOLD_N_DOCK" related jobs. As Greg_BE said, this is really not likely that these jobs could run out of all those 3.24GB of RAM.

They definitely can. I just noticed that one of my two rigs started thrashing like hell. Turned out, a single one of these FOLD_N_DOCK WUs (http://boinc.bakerlab.org/rosetta/result.php?resultid=421379634) was using 1.45GB VM on a system with only 1GB physical memory; it was effectively running from the disk. The other core was idle because there was no memory left to run another WU on it, but if there had been, the total would've been about 3GB.
2) Message boards : Number crunching : Rosetta ignoring memory usage limits while not idle (Message 68109)
Posted 16 Oct 2010 by Adam Gajdacs (Mr. Fusion)
Post:
Okay, I worded my question poorly. I didn't mean the graphics, what I meant to ask is if BOINC manager could be installed as a normal app which one could start and stop manually, or if it had to be run as a "Background app," starting and stopping automatically, which sounds like what Oran is describing.


There should be two ways to install BOINC on Windows too: either as a service (which runs in the background all the time regardless of the currently logged in OS user, if any), or as a regular application that you need to start manually, but can also shut down completely at any time to reliably free up any and all used resources if need be.
It's been some time since I last installed a new BOINC client, but I suppose these two installation options should still exist.

I prefer the second option myself, and have been using BOINC in that mode from the start.
3) Message boards : Number crunching : Many instances of MiniRosetta put computer "out of memory" (Message 67688)
Posted 10 Sep 2010 by Adam Gajdacs (Mr. Fusion)
Post:

Still, running a CPU 100% all the time worries me. It feels like running a car engine at 5000 RPM for a lengthy time. I know there are no moving parts here, but some form of material wear could (?) shorten the life of CPUs. (This, by the way, is a known problem with flash memory cards: after a lot of reads and rewrites, certain microscopic data storing materials show some kind of aging, so it seems that SSDs are not a good idea. But I don't know if anything similar applies to CPUs.)

The aging of flash-based storage devices is a completely different thing; that technology is not used in CPUs.
Running a CPU at 100% 24/7 has no noticeable impact on its life expectancy (well, unless you plan to use it for several decades, over which time the erosion that the flow of electrons causes on the pathways will indeed become a factor to consider) as long as there's adequate cooling (or maybe even without, since modern CPUs won't let themselves overheat; they throttle back their internal clocks to lower dissipation).
4) Message boards : Number crunching : Many instances of MiniRosetta put computer "out of memory" (Message 67621)
Posted 7 Sep 2010 by Adam Gajdacs (Mr. Fusion)
Post:
Does the BOINC manager show Rosetta tasks in "Waiting for memory" state, when you see those idle Rosetta processes in the Task Manager?

I'm not sure if this actually happens when "leave in memory" is off, but it may be caused by a long-standing, well-known issue between BOINC and Rosetta, affecting multi-core systems the most.

When a Rosetta WU starts running but hits the global memory limit specified in BOINC, it gets switched to the "Waiting for memory" state and a new WU is started. If that one runs into the memory boundary, it also begins to wait for memory and a new one is started, and so on, until finally a WU is started which actually fits within the memory constraints and can run its course.
This "WU cascade" can use up all the physical memory, and bloat the pagefile, eventually pretty much killing the system.

If this is what affects you, the only thing that might help is to increase the memory use limit to a value that results in about 400-500MB memory per CPU, so about 45-55% at least, but more like 60%, in your case.
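As a rough sketch of the arithmetic behind that kind of estimate: the percentage to set is just cores times per-WU memory, divided by total RAM. The 4-core, 4096MB machine below is a made-up example, not the system discussed in this thread, and the function name is only illustrative.

```python
import math

def min_memory_limit_percent(total_ram_mb, cores, per_task_mb):
    """Smallest whole percentage of RAM that leaves per_task_mb for every core."""
    needed_mb = cores * per_task_mb
    # A limit above 100% makes no sense, so cap it there.
    return min(100, math.ceil(100 * needed_mb / total_ram_mb))

# Hypothetical machine: 4 cores, 4096MB of physical memory.
for per_task_mb in (400, 500):
    percent = min_memory_limit_percent(4096, 4, per_task_mb)
    print(f"{per_task_mb}MB per core -> set the limit to at least {percent}%")
```

If the result comes out at 100%, the machine simply doesn't have enough RAM to give every core that much memory, and running fewer WUs in parallel is the more realistic fix.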

Edit:
I should've checked that image before posting all this. Those appear to be dead Rosetta instances, caused by some error that prevents them from running, and they are not terminated properly either.
Maybe some file permission issues? A hardware (memory) fault? Some bad project data file (you could try disabling work fetch, draining your WU cache, resetting the project or even detaching, then allowing work fetch again so that everything is redownloaded)?
5) Message boards : Number crunching : error - exited with zero status but no 'finished' file (Message 66070)
Posted 11 May 2010 by Adam Gajdacs (Mr. Fusion)
Post:
What FFT table size did you set in P95 to use for testing? You should go for at least 75% of the total physical memory by using the custom settings, as the default torture test uses only relatively small table sizes IIRC.
I just recently ran into memory problems that memtest86 completely failed to detect, and hardly ever affected P95 as long as it was set to use only smaller FFT tables.

The USB stick could also be the cause if it's started to develop cell faults because of aging or intense use (or was simply defective from the start). You could try to move the whole BOINC installation to a hard drive to see if the errors persist. I think simply copying it to somewhere and manually starting the client from there should work, for testing purposes at least (don't quote me on it tho :).
6) Message boards : Number crunching : lr8_combine_smooth_torsion_it00 - All Errors? (Message 64208)
Posted 25 Nov 2009 by Adam Gajdacs (Mr. Fusion)
Post:
Also confirming this, just got all 3 of such WUs I had in my work cache bombing out on me after less than a minute of runtime:
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=272874405
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=272875389
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=272880900

Apparently I had one yesterday too but I only noticed it now:
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=272869210

That's 4 out of 4 for me, as far as I can tell, while other WUs process without errors.
7) Message boards : Number crunching : Memory requirement? (Message 56877)
Posted 12 Nov 2008 by Adam Gajdacs (Mr. Fusion)
Post:
Yes, WUs belonging to this project may require up to 300-400Mbytes of physical memory per core, and I've been getting mostly these kinds of WUs lately.

Normal (or at least different kind of) WUs still usually use about 100-200Mbytes of PM at most.
8) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 56820)
Posted 11 Nov 2008 by Adam Gajdacs (Mr. Fusion)
Post:
1hzh_1u9p_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_97_0 using minirosetta version 140 (Wu ID: 188064180)

Yesterday this task had been running for over 13 hours against a 4-hour target CPU time. It was stuck on model 1, step 79500, where the step did not change for over an hour (the protein display did, however, once every 15-20 seconds or so). Progress was increasing at the rate of roughly 0.001% per 15-20 seconds at 98.6% or so.

I don't run my system 24/7 (that's why I have a relatively short runtime specified), so I had shut it down yesterday for the night, and today it's started over from 0%; looks like it didn't checkpoint even once in all those 13+ hours. So I'm considering aborting this (and any similar) WU at this point.

In general, the memory use of 1.40 has skyrocketed again: it fluctuates between 100-350 Mbytes of physical memory and commits about 300-350 Mbytes of virtual memory. Once again, this tends to fill up all available PM+VM on multi-core systems, as the Rosetta WUs started in parallel hit the combined memory limit within seconds and get suspended to the "Waiting for memory" state, and then a new WU gets started only to hit the memory limit again. I usually have at least 3-4 "stuck" Rosetta WUs in memory, each holding 200-300 Mbytes of VM (and a similar amount of PM until the system is forced to completely page them out).
9) Message boards : Number crunching : Rosetta Mini with new score terms 1.02 (Message 56554)
Posted 31 Oct 2008 by Adam Gajdacs (Mr. Fusion)
Post:
Got two WUs with this new client version recently, both died within 5 seconds after starting up.

31/10/2008 17:35:06|rosetta@home|Starting task 1a19A_BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1a19A-_4662_408_0 using minirosetta_split_terms version 102
31/10/2008 17:35:10|rosetta@home|Computation for task 1a19A_BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1a19A-_4662_408_0 finished
31/10/2008 17:35:10|rosetta@home|Output file 1a19A_BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1a19A-_4662_408_0_0 for task 1a19A_BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1a19A-_4662_408_0 absent

31/10/2008 18:36:18|rosetta@home|Starting task 1ten__BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1ten_-_4662_1966_0 using minirosetta_split_terms version 102
31/10/2008 18:36:21|rosetta@home|Reason: Unrecoverable error for result 1ten__BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1ten_-_4662_1966_0 (Incorrect function. (0x1) - exit code 1 (0x1))
31/10/2008 18:36:21|rosetta@home|Computation for task 1ten__BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1ten_-_4662_1966_0 finished
31/10/2008 18:36:21|rosetta@home|Output file 1ten__BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1ten_-_4662_1966_0_0 for task 1ten__BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1ten_-_4662_1966_0 absent
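For anyone wanting to pull runtimes out of messages like these, here is a small sketch. It assumes the `date|project|message` layout and the dd/mm/yyyy timestamps seen in the lines above; other client versions or locales may format things differently, and the helper names are my own.

```python
from datetime import datetime

def parse_line(line):
    """Split one message line into (timestamp, project, message)."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message

def runtime_seconds(lines, task):
    """Seconds between 'Starting task' and 'Computation ... finished' for a task."""
    start = end = None
    for line in lines:
        when, _, message = parse_line(line)
        if message.startswith(f"Starting task {task} "):
            start = when
        elif message == f"Computation for task {task} finished":
            end = when
    if start is None or end is None:
        return None
    return (end - start).total_seconds()

# Task names rebuilt from the log lines above (split up only to keep lines short).
SUFFIX = "_BOINC_CASP8_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--"
task_1 = "1a19A" + SUFFIX + "1a19A-_4662_408_0"
task_2 = "1ten_" + SUFFIX + "1ten_-_4662_1966_0"
app = "using minirosetta_split_terms version 102"
log = [
    f"31/10/2008 17:35:06|rosetta@home|Starting task {task_1} {app}",
    f"31/10/2008 17:35:10|rosetta@home|Computation for task {task_1} finished",
    f"31/10/2008 18:36:18|rosetta@home|Starting task {task_2} {app}",
    f"31/10/2008 18:36:21|rosetta@home|Computation for task {task_2} finished",
]
for task in (task_1, task_2):
    print(task[:5], runtime_seconds(log, task), "seconds")
```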
10) Message boards : Number crunching : Rosetta Checkpointing (Message 54828)
Posted 2 Aug 2008 by Adam Gajdacs (Mr. Fusion)
Post:
Suspending BOINC/projects only temporarily stops them from running (and thus using CPU time), but nothing else.

What would probably work tho is:
- set project/BOINC preferences to leave applications in memory when preempted
- do not exit BOINC when you want to turn the computer off
- instead of shutting down your system, use hibernate (which is usually the preferred method for laptops anyway), which will save a snapshot of the system memory (needs at least as much free hard disk space as you have physical memory), including the state of processes, in a way that they will be restored to the exact same state next time you power up the system, meaning that workunits, checkpointed or not, should continue processing from the point where they were before you hibernated the system
11) Message boards : Number crunching : Memory Usage in Beta 5.80 (Message 46348)
Posted 16 Sep 2007 by Adam Gajdacs (Mr. Fusion)
Post:
There is a recurring problem, however, in connection with multi-core systems and these "high memory requirement" WUs, which would probably need some co-operation between RAH and the folks at BOINC to resolve by modifying the scheduler logic and/or introducing more WU flags regarding expected memory use.

I'm talking about the situation where a Rosetta WU runs on one core and another should be started on the other core, but the combined memory use hits the memory boundary set by user preferences for the project, so the second WU is put into the "Waiting for memory" (or simply "Waiting to run") status. The scheduler then starts another Rosetta WU, which happens to be a high-memory one as well; it hits the memory use cap too and is put into the "Waiting for memory" status, so a third needs to be started, and so on.

Right now I have 3 Rosetta WUs in memory, two using 370MB VM each and a third with 310MB, but only one is able to run because of the memory preferences. The other two are more or less deadlocking each other: they keep their committed VM in use without being able to actually run, so neither can finish and be cleared from memory to give the other a chance to be processed.
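To make the effect concrete, here is a toy model of that cascade. It is only a sketch of the behaviour as described in this post, not BOINC's actual scheduler logic; the 512MB limit and the 2-core setup are assumed values, while the WU sizes are the ones quoted above.

```python
def simulate_cascade(task_mem_mb, limit_mb, cores):
    """Start WUs in order; a WU that would exceed the limit waits but keeps its memory."""
    running, waiting = [], []
    for mem in task_mem_mb:
        if len(running) == cores:
            break  # every core is busy, nothing more gets started
        if sum(running) + mem <= limit_mb:
            running.append(mem)
        else:
            # "Waiting for memory": not running, but its VM stays committed.
            waiting.append(mem)
    committed_mb = sum(running) + sum(waiting)
    return running, waiting, committed_mb

# WU sizes from the post; limit and core count are assumptions for illustration.
running, waiting, committed_mb = simulate_cascade([370, 370, 310], limit_mb=512, cores=2)
print("running:", running)
print("waiting:", waiting)
print("committed:", committed_mb, "MB against a limit of 512 MB")
```

In this model one WU runs, two wait, and the committed total ends up at roughly twice the configured limit.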

I wonder if there would be a way for the project or the scheduler to check the following conditions:
- Does the client asking for new work have any high-memory WUs assigned already?
- Is it a multi-CPU system with more than one CPU enabled for BOINC?
- Would the combined memory use for "CPUs allowed to be used" x mem requirement of one WU be higher than the allowed memory use?

If all of these are true, then only low-memory WUs should be sent/requested/accepted as new work for that client for the project until it has returned the current high-memory one. This would stop the scheduler from repeatedly starting and then suspending high-memory WUs (since it has nothing else to process for this specific project) and filling up available memory with "zombie" WUs that cannot run because of each other.
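The three conditions could be sketched as a single check like the one below. `ClientState` and its fields are made up for illustration; nothing here corresponds to real BOINC scheduler or project-server code.

```python
from dataclasses import dataclass

@dataclass
class ClientState:
    # Hypothetical summary of what the server would need to know about a client.
    has_high_memory_wu: bool  # a high-memory WU is already assigned
    cpus_enabled: int         # CPUs the user allows BOINC to use
    allowed_memory_mb: int    # memory limit from the user's preferences

def should_send_only_low_memory(client, wu_mem_mb):
    """True when another high-memory WU would just end up waiting for memory."""
    return (
        client.has_high_memory_wu
        and client.cpus_enabled > 1
        and client.cpus_enabled * wu_mem_mb > client.allowed_memory_mb
    )

# Assumed example: 2 CPUs enabled, 512MB allowed, 370MB per high-memory WU.
client = ClientState(has_high_memory_wu=True, cpus_enabled=2, allowed_memory_mb=512)
print(should_send_only_low_memory(client, wu_mem_mb=370))
```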
12) Message boards : Number crunching : Problems with Rosetta version 5.67 (Message 41465)
Posted 26 May 2007 by Adam Gajdacs (Mr. Fusion)
Post:
5.67 gone mad here too. HDD started thrashing like crazy, so I tried to figure out what's up. Turned out to be Rosetta, which just got the client refreshed to 5.67 and downloaded two new WUs from the gp04__BOINC_SYMM_FOLD_AND_DOCK_SUBSYSTEM-gp04_-delC126... batch.
Both of them are using over 1GB virtual memory each (HT CPU, both threads enabled), but have only about 35-37MB working set.
Available physical memory stabilized over the course of a few mins after the jobs started, but commit charge is at 95% at the moment.
13) Message boards : Number crunching : Report problems with Rosetta version 5.34 (Message 30185)
Posted 28 Oct 2006 by Adam Gajdacs (Mr. Fusion)
Post:
Got four of these: 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES... currently cached on my client, which has been struggling to finish just the first one for at least a day now, stuck at 48.4%, Model 5, AB Initio (jumping). It's not actually stuck, it progresses about 1 step every few minutes, but at this rate it doesn't seem to be able to reach the next checkpoint in the 8-10 hours during which my computer is powered on on an average day, so it's effectively stuck. My CPU time preference is at 3 hours, and yet one of those WUs ran for about 7 hours just today without moving an inch ahead.
Seeing the same warnings in the stdout.txt as netwraith in a few posts below.

I guess I'll have to discard them eventually.





