Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Trotador Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 0 |
"endo_ae__ results cause (and suffer from) BOINC heartbeat problems and they do not checkpoint properly on one of my boxes, my guess is that they have very high RAM requirements (my internet PC with only 2GB RAM, having Firefox nearly always running, one Rosetta task plus 3 projects with very low RAM requirements). They should probably be limited to boxes with more than 3GB physical RAM." Yes, I'm having many errors in these tasks too. They are reported as "Validate error", and all the copies sent out end with the same error on every computer. I do not think it is caused by a lack of RAM, because my PCs have plenty of it (18 or 32 GB). When a task starts to report errors, the log always looks like this (extract): ================ .... ERROR: can't open file: minirosetta_database//sampling/filtered.vall.dat.2006-05-05 ERROR:: Exit from: src/core/fragment/picking_old/vall/vall_io.cc line: 63 # cpu_run_time_pref: 7200 ERROR: aFrame->nr_frags() ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197 ERROR: aFrame->nr_frags() ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197 "repeated 98 times" ERROR: aFrame->nr_frags() ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197 ====================================================== DONE :: 99 starting structures 1201 cpu seconds This process generated 99 decoys from 99 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"endo_ae__ results cause (and suffer from) BOINC heartbeat problems and they do not checkpoint properly on one of my boxes, my guess is that they have very high RAM requirements (my internet PC with only 2GB RAM, having Firefox nearly always running, one Rosetta task plus 3 projects with very low RAM requirements). They should probably be limited to boxes with more than 3GB physical RAM." I managed to muddle through this series of tasks with 4 GB of physical memory and 2 GB of virtual memory. I got only 1 decoy done in 4 hrs on one task and got minimal credit for it. You almost need a desktop supercomputer to handle those tasks. |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
The task named d6587_ab_4Jan2014_117685_485_0 (Task id 628550437) gave an early computation error. The lengthy error log consisted of the following, repeated multiple times: BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. command: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.48_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro -frag3 d6587.200.3mers -in:file:native 52nc2_dis20-68.des.pdb -silent_gz 1 -frag9 d6587.200.9mers -out:file:silent default.out ex1 -abinitio::rsd_wt_loop 0.5 -relax::default_repeats 15 -abinitio::use_filters false -abinitio::increase_cycles 10 -abinitio::rsd_wt_helix 0.5 -abinitio::rg_reweight 0.5 -in:file:boinc_wu_zip d6587.ab.4Jan2014.zip -out:file:silent default.out -silent_gz -mute all -nstruct 10000 -cpu_run_time 10800 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2952613 Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached ERROR: ERROR: Unused "free" argument specified: ex1 [ |
AMDave Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
I installed v7.2.33 and it ran smoothly for 10+ days. Then, around Jan 06, this message started appearing at a rate of over 90%: "Task ... exited with zero status but no 'finished' file", followed by "If this happens repeatedly you may need to reset the project." I reset the project, then changed some preferences (I had planned these changes anyway): Target CPU run time: from 4 hrs to 6 hrs; Disk and memory usage > Use at most: from less than 1 GB to 2 GB; Disk and memory usage > Write to disk at most every: from 70 sec to 90 sec. The issue remains. Interestingly, the Event Log shows the error message occurring every 3 hours (almost to the minute). I haven't noticed any similarities in the WU names. Is there a file that Rosetta is looking for and not finding, a flawed batch of WUs, or some other cause? Suggestions please. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1997 Credit: 9,726,790 RAC: 10,636 |
Please, restart Ralph server.... |
AMDave Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
"Please, restart Ralph server...." Issue remains. This particular WU has restarted 3 TIMES, at 3-hour intervals. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=572101126 If the above suggestion actually means resetting the project in the BOINC Mgr, that's not a solution, because: 1) it's been done and it didn't resolve anything, and 2) it addresses a symptom, not the cause. Is the cause bad WUs, or Rosetta 3.48? Also, what is the solution? Thanks |
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
"Please, restart Ralph server...." Hi, Dave, boboviz's post wasn't in response to yours; he was trying to alert an admin that ralph@home was down. The "exited with zero status but no 'finished' file" message occurs when some other task on your computer prevents the science app from communicating with BOINC. It is usually safe to ignore, as it has to happen 100 times to a task before the task gives up and errors out. Since it's happening to you at such regular intervals, I suspect you recently set some scan to run regularly in the background. On the BOINC forum, Jord (Ageless) makes the following suggestions: Possible causes of the "Task exited with zero status but no 'finished' file" syndrome: This is obviously not a Rosetta-specific issue; it shows up on just about every project board at some time or another. Gary Roberts, the patient prince of einstein@home, explains what's happening in this post and the BOINC FAQ Service entry is here. Hope this helps. Snags |
AMDave Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
Huge thanks, Snags. Re: the list 1) already done 2) already done 3) already done 4) check 5) already done 6) check 7) already done. The problem persists. A new WU restarted 3 times, again at 3-hour intervals. It's odd that everything was fine for about 10 days after installing the latest BOINC Mgr, and then this issue cropped up. When this happens and the WU is restarted, does the computation begin anew for the WU, or does it pick up near the point where it exited? |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,788,163 RAC: 14,920 |
"When this happens and the WU is restarted, does the computation begin anew for the WU, or does it pick up near the point where it exited?" Some types of task are able to checkpoint at certain intervals, and all tasks will checkpoint when a model is completed. A task (as you see them in BOINC Manager) can contain multiple decoys/models, so at the end of each of these there will effectively be a checkpoint. So, if your computer reaches either a checkpoint, or completes a model and moves on to another one (still within the same task), then it will pick up from that point when Rosetta restarts. If it doesn't reach one of those points before being stopped then it will restart the task from the beginning. |
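The restart behaviour described in the post above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: `resume_point`, `steps_per_model`, and `checkpoint_every` are hypothetical names, work is counted in abstract steps, and nothing here is taken from Rosetta or BOINC code.

```python
def resume_point(steps_done, steps_per_model, checkpoint_every=None):
    """Return how many steps of work survive a restart (illustrative only).

    Assumptions (not from Rosetta's source):
    - every completed model acts as a checkpoint;
    - if checkpoint_every is given, the task also checkpoints at that
      interval inside a model; otherwise only model boundaries count.
    """
    # Work up to the last completed model is always preserved.
    model_start = (steps_done // steps_per_model) * steps_per_model
    if checkpoint_every is None:
        return model_start
    # Inside the current model, fall back to the last interval checkpoint.
    within_model = steps_done - model_start
    return model_start + (within_model // checkpoint_every) * checkpoint_every


# Stopped before the first model finished, no interval checkpoints:
# the task starts again from the beginning.
print(resume_point(3, steps_per_model=10))
# Stopped partway through the second model: resumes at the model boundary.
print(resume_point(13, steps_per_model=10))
# Same, but with interval checkpoints every 3 steps inside a model.
print(resume_point(14, steps_per_model=10, checkpoint_every=3))
```

Under these assumptions, a task that is interrupted at a fixed interval shorter than the time to reach its first checkpoint would keep restarting from step 0, which is one way a regular three-hour interruption could waste work.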
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
"When this happens and the WU is restarted, does the computation begin anew for the WU, or does it pick up near the point where it exited?" I see Danny has answered your question, so I'll just chime back in to say this doesn't cause a problem for rosetta@home; it just increases the computer cycles per workunit, causing a bit of inefficiency on your end. Eventually you will see a task error out when a model can't complete (after a hundred tries), but I doubt it will happen very often. What else changed around the time you updated BOINC? Maybe I'm fixated on the three-hour interval, but it seems most likely to be caused by Windows or some software other than BOINC. As I don't run Windows, I don't know how you can see what's happening every three hours. If no one here has a suggestion, I would post on the BOINC message boards, where both BOINC and Windows gurus hang out, and see if they have some useful ideas. You might want to use "Task ... exited with zero status but no 'finished' file" as the message title, and be sure to give them the details of your troubleshooting efforts (Jord's checklist) in your first post. Good luck. Let us know what you discover. Snags |
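One way to confirm the three-hour pattern before posting elsewhere is to measure the gaps between the messages in a saved copy of the event log. The Python sketch below is a rough aid, not a BOINC tool: the function name `restart_intervals_hours` is made up, and the timestamp layout (`YYYY-MM-DD HH:MM:SS` at the start of each line) is an assumption, so `fmt` and `stamp_len` would need adjusting to match the real log.

```python
from datetime import datetime

# Text of the message quoted in this thread.
MARKER = "exited with zero status but no 'finished' file"


def restart_intervals_hours(lines, fmt="%Y-%m-%d %H:%M:%S", stamp_len=19):
    """Return the gaps, in hours, between consecutive matching log lines.

    Assumption: each line begins with a timestamp of stamp_len characters
    in the layout given by fmt. Adjust both for the actual log.
    """
    times = []
    for line in lines:
        if MARKER in line:
            times.append(datetime.strptime(line[:stamp_len], fmt))
    times.sort()
    return [
        round((later - earlier).total_seconds() / 3600, 2)
        for earlier, later in zip(times, times[1:])
    ]


# Hypothetical log lines, invented purely for illustration.
sample = [
    "2014-01-06 03:00:10 [rosetta@home] Task A exited with zero status but no 'finished' file",
    "2014-01-06 04:12:00 [rosetta@home] Sending scheduler request",
    "2014-01-06 06:00:10 [rosetta@home] Task A exited with zero status but no 'finished' file",
    "2014-01-06 09:00:10 [rosetta@home] Task B exited with zero status but no 'finished' file",
]
print(restart_intervals_hours(sample))
```

If the gaps cluster tightly around one value, the times themselves can then be compared against anything else scheduled on the machine.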
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
If a task finds itself restarting from the same point more than 5 times, it is ended for you. This might occur if you have a number of reboots to install operating system or application fixes, or have a long-running task on a day when your machine doesn't run very long before you power off, or if you do not keep tasks in memory and suspend a task repeatedly. But otherwise, this check is there to help ensure that if a task doesn't seem to be running properly on your machine, that it doesn't get hung up. snagles mentioned 100 tries, I just wanted to reassure folks that it only takes 5. I can't recall exactly if the 5th try is the one that aborts, or if five restarts would actually be detected on a 6th start attempt. The idea was to let a task survive some of the normal PC activities of installing fixes etc. but to get it killed if it's not running properly. Rosetta Moderator: Mod.Sense |
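The five-restart safeguard can be sketched as a simple counter. This Python fragment is a guess at the shape of such a check, not Rosetta's implementation: `should_abort` and its arguments are hypothetical, and, as the post itself says, whether the task is ended on the fifth or sixth attempt is uncertain. The sketch arbitrarily ends it once the same point has been resumed more than `limit` times in a row.

```python
def should_abort(resume_points, limit=5):
    """Decide whether a task keeps restarting from the same place.

    resume_points: the checkpoint position recorded at each restart, in order.
    Assumption: the task is ended once the same position has been resumed
    more than `limit` times in a row; the real off-by-one may differ.
    """
    last_point = None
    repeats = 0
    for point in resume_points:
        if point == last_point:
            repeats += 1
        else:
            # Progress was made (or this is the first restart): reset the count.
            last_point = point
            repeats = 1
        if repeats > limit:
            return True
    return False


# Five restarts from the same point are tolerated under this assumption.
print(should_abort([0, 0, 0, 0, 0]))
# A sixth restart from that point ends the task.
print(should_abort([0, 0, 0, 0, 0, 0]))
# Restarts that land on a newer checkpoint reset the count.
print(should_abort([0, 0, 0, 10, 10, 10]))
```

The reset on progress is what lets a task survive ordinary reboots and suspensions while still being stopped when it is stuck.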
SBF-GODS-STONE Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0 |
Just a note to let you know I have found a small error in the processing of your tasks. I run with a 6 GB work disk for BOINC tasks, and I have noticed that when things slow down it tends to be due to a "Fragmented Work Disk". I have isolated it to these tasks rather than any additional projects. Every 5 hours I defrag this work disk and off it goes at full speed (AND 4.7 GHZ 4 Core CPU). Thanks... |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag? 6GB doesn't sound like much when you have 4 CPUs. Would it behave better if the work disk were larger? It would seem as though all of the storage used for a given BOINC slots folder would be freed when the task is completed. Rosetta Moderator: Mod.Sense |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
"Just a note to let you know I have found a small error in the processing of your tasks. I run with a 6 GB work disk for BOINC tasks, and I have noticed that when things slow down it tends to be due to a 'Fragmented Work Disk'. I have isolated it to these tasks rather than any additional projects. Every 5 hours I defrag this work disk and off it goes at full speed (AND 4.7 GHZ 4 Core CPU)." Out of curiosity, is your swap file on the same partition as your BOINC data directory? You appear to be running Windows XP with 3GB of memory, so that is about 750MB per core (not including any memory reserved for system processes). As Rosetta tasks are heavily dependent on memory, you may experience shortages, which the system will try to alleviate by putting more pressure on the swap file. Frequent changes to your swap file may in turn be encouraging fragmentation in the rest of the partition. Also, if you are hitting the limit of memory with 4 Rosetta tasks at the same time, that will be another reason for the slowdown. If other projects are working fine, that lends weight to the memory theory, as Rosetta is the heaviest memory user of the BOINC projects I am familiar with. I'd suggest cutting down to 2 or 3 cores for a day or two to see if there is any significant reduction in fragmentation. |
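The memory-per-core estimate in the post above is simple arithmetic, and the effect of the suggested reduction to 2 or 3 cores can be worked out the same way. In this Python sketch the function name and the `reserved_mb` parameter are illustrative assumptions, and 1 GB is treated as 1000 MB to match the "about 750MB" figure in the post.

```python
def memory_per_task_mb(total_gb, running_tasks, reserved_mb=0):
    """Rough share of physical memory available to each running task.

    Assumptions: memory is split evenly, 1 GB is counted as 1000 MB
    (matching the post's rounding), and reserved_mb is whatever the
    operating system and other programs keep for themselves.
    """
    usable_mb = total_gb * 1000 - reserved_mb
    return usable_mb / running_tasks


# 3 GB shared by 4 tasks, ignoring system overhead: the post's figure.
print(memory_per_task_mb(3, 4))
# Dropping to 3 or 2 concurrent tasks, as suggested in the post.
print(memory_per_task_mb(3, 3))
print(memory_per_task_mb(3, 2))
```

Any non-zero `reserved_mb` lowers these figures further, which is why the per-core number in the post is best read as an upper bound.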
SBF-GODS-STONE Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0 |
"Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag?" NO, THE BOINC DISK IS SEPARATE FROM THE SYSTEM SWAP DISK (CHANNELS AND DISKS) |
SBF-GODS-STONE Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0 |
"Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag?" JUST FOR THE INFO SIDE, I HAVE ENLARGED THE BOINC DATA WORK DISK TO 10+ GB |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
"WHAT IS THE REASON TO WRITE TO DISK AT A GIVEN TIME? CAN YOU SAY GET A MEMORY BUFFER FOR THIS PROCESS AND REPLACE THE WRITE TO DISK ?" You may not be aware, but writing entirely in capitals is considered to be the equivalent of shouting on most internet forums. As I am a volunteer like you, I choose not to interact with shouting people most of the time, as it usually just stresses me out for no benefit. However, I will assume that you aren't aware of the shouting issue. I am not an expert, but I believe Rosetta will write to disk for several reasons: 1. When data is downloaded ready to start the task. 2. When using the swap file. 3. When saving progress at a checkpoint so your work isn't lost in the case of a system failure or shutdown. 4. When the task is completed and ready to upload to the server. You mentioned in one of your posts that you have fast memory; however, as I pointed out before, you don't have enough memory. Rosetta tasks frequently use hundreds of MB of memory per task, so if you have 4 tasks running at once you have no spare capacity. At best you will be using your swap file almost constantly. At worst Rosetta will be checkpointing and suspending a task while it waits for more memory to become available. |
SBF-GODS-STONE Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0 |
"WHAT IS THE REASON TO WRITE TO DISK AT A GIVEN TIME? CAN YOU SAY GET A MEMORY BUFFER FOR THIS PROCESS AND REPLACE THE WRITE TO DISK ?" No, I was not aware of the shouting. I do not post unless I have to let someone who needs to know what is happening. Thanks much for the info. Now, about the memory: the CPUs don't use the system memory for processing instructions. This CPU chip has 1 and 2 gigs of on-chip memory for L2 and L3 cache storage. All other activity is done with system resources. I am only assuming that the writing the developers are doing is for computation and temporary storage between tasks. If they used memory (if possible) for this type of operation, the tasks would run more effectively. The overhead of disk writing taxes the process very heavily in general. It may not be possible, but I thought I would mention it. I wrote system software for years and have a little idea of what happens in systems. |
SBF-GODS-STONE Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0 |
"Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag?" Thanks for the e-mail. The system, yours or mine, hung and somehow double-posted that message; I was unaware. I did raise the work disk (BOINC) to 10+ GB. The result so far is good; you may have pointed out that BOINC should be installed on bigger disks. Have a good day. |
SBF-GODS-STONE Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0 |
"If a task finds itself restarting from the same point more than 5 times, it is ended for you. This might occur if you have a number of reboots to install operating system or application fixes, or have a long-running task on a day when your machine doesn't run very long before you power off, or if you do not keep tasks in memory and suspend a task repeatedly. But otherwise, this check is there to help ensure that if a task doesn't seem to be running properly on your machine, that it doesn't get hung up." Just saw this. I run this machine 24 hours a day, 7 days a week. I schedule work for two projects at a time because it runs way faster. Also, running four units at once causes no real problems with output. Thanks. |
©2024 University of Washington
https://www.bakerlab.org