Problems and Technical Issues with Rosetta@home

Author	Message
Greg_BE Send message Joined: 30 May 06 Posts: 5774 Credit: 6,139,760 RAC: 0	Message 76298 - Posted: 19 Dec 2013, 0:51:36 UTC - in response to Message 76058. Last modified: 19 Dec 2013, 0:53:46 UTC
endo_ae__ results cause (and suffer from) BOINC heartbeat problems and they do not checkpoint properly on one of my boxes. My guess is that they have very high RAM requirements (my internet PC has only 2GB RAM, with Firefox nearly always running, one Rosetta task plus 3 projects with very low RAM requirements). They should probably be limited to boxes with more than 3GB physical RAM. Unfortunately I could not catch/spy on one just before it crashed, so the RAM thing is only a guess. After the crash the RAM history is lost with the PID so I cannot check the maximum usage. Other result types seem not to be affected.
Indeed. The endo_ae tasks are terrible: 1. The first checkpoint takes a number of hours. 2. I have at least three which have crashed after a few minutes. 3. The credit from them is poor. In one example over 8 hours for only 20 points.
Yes I'm having many errors in these tasks, they are reported as "Validate error" and all the copies sent end with the same error on all the computers. I do not think it is caused by lack of RAM because my PCs have plenty of it (18 or 32 GB). The log always looks this way (extract) when it starts to report errors:
================ ....
ERROR: can't open file: minirosetta_database//sampling/filtered.vall.dat.2006-05-05
ERROR:: Exit from: src/core/fragment/picking_old/vall/vall_io.cc line: 63
# cpu_run_time_pref: 7200
ERROR: aFrame->nr_frags()
ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197
ERROR: aFrame->nr_frags()
ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197
"repeated 98 times"
ERROR: aFrame->nr_frags()
ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197
======================================================
DONE :: 99 starting structures 1201 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================
BOINC :: WS_max 0
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
I managed to muddle through this series of tasks with 4 gigs of physical memory and 2 gigs of virtual memory. I got only 1 decoy done in 4hrs on one task and got minimal credit for it. You almost need a desktop supercomputer to handle those tasks.
ID: 76298 · Rating: 0 · rate: /

svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0	Message 76327 - Posted: 7 Jan 2014, 2:06:37 UTC This task named d6587_ab_4Jan2014_117685_485_0 (Task id 628550437) gave an early computation error. A lengthy error log looked like the following repeated multiple times BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. command: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.48_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro -frag3 d6587.200.3mers -in:file:native 52nc2_dis20-68.des.pdb -silent_gz 1 -frag9 d6587.200.9mers -out:file:silent default.out ex1 -abinitio::rsd_wt_loop 0.5 -relax::default_repeats 15 -abinitio::use_filters false -abinitio::increase_cycles 10 -abinitio::rsd_wt_helix 0.5 -abinitio::rg_reweight 0.5 -in:file:boinc_wu_zip d6587.ab.4Jan2014.zip -out:file:silent default.out -silent_gz -mute all -nstruct 10000 -cpu_run_time 10800 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2952613 Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached ERROR: ERROR: Unused "free" argument specified: ex1 [ ID: 76327 · Rating: 0 · rate: /

AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0	Message 76336 - Posted: 9 Jan 2014, 18:57:49 UTC
Installed v7.2.33, running smoothly for 10+ days, then around Jan 06 there's been an over 90% rate of this message: "Task ... exited with zero status but no 'finished' file", followed by "If this happens repeatedly you may need to reset the project." I reset the project, then changed some preferences (I had planned these changes anyway):
Target CPU run time: from 4hrs to 6hrs
Disk and memory usage > Use at most: from less than 1GB to 2GB
Disk and memory usage > Write to disk at most every: from 70 sec to 90 sec
The issue remains. Interestingly, the Event Log shows the error message occurring every 3 hours (almost to the minute). I haven't noticed any similarities in the WU names. Is there a file that Rosetta is looking for and not finding, a flawed batch of WUs, or some other cause? Suggestions please.
ID: 76336 · Rating: 0 · rate: /

[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 76337 - Posted: 9 Jan 2014, 22:35:03 UTC Please, restart Ralph server.... ID: 76337 · Rating: 0 · rate: /

AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0	Message 76338 - Posted: 11 Jan 2014, 18:03:18 UTC - in response to Message 76337. Please, restart Ralph server.... Issue remains. This particular WU has restarted 3 TIMES, at 3 hour intervals. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=572101126 If the above suggestion actually means resetting the project in the BOINC Mgr, that's not a solution, because: 1) It's been done and it didn't resolve anything, and 2) It addresses a symptom, not the cause Is the cause bad WUs, or Rosetta 3.48? Also, what is the solution. Thanks ID: 76338 · Rating: 0 · rate: /

Snags Send message Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0	Message 76343 - Posted: 12 Jan 2014, 16:23:52 UTC - in response to Message 76338.
Please, restart Ralph server....
Issue remains. This particular WU has restarted 3 TIMES, at 3 hour intervals. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=572101126 If the above suggestion actually means resetting the project in the BOINC Mgr, that's not a solution, because: 1) It's been done and it didn't resolve anything, and 2) It addresses a symptom, not the cause Is the cause bad WUs, or Rosetta 3.48? Also, what is the solution. Thanks
Hi, Dave, boboviz's post wasn't in response to yours; he was trying to alert an admin that ralph@home was down. The "exited with zero status but no 'finished' file" message occurs when some other task on your computer prevents the science app from communicating with BOINC. It is usually safe to ignore it, as it will have to happen 100 times to a task before the task will give up and error out. Since it's happening to you at such regular intervals, I suspect you recently set some scan to occur regularly in the background. On the BOINC forum Jord (Ageless) makes the following suggestions:
Possible causes of the "Task exited with zero status but no 'finished' file" syndrome:
1. Make sure you exclude the BOINC directory and all subdirectories (or the BOINC Data directory and all subdirectories in BOINC 6 and 7) from being actively scanned by anti-virus and anti-spyware software. Only scan when you have exited BOINC.
2. Don't defrag your disk with BOINC on.
3. Don't run Scandisk with BOINC on.
4. Disable Drive Indexing.
5. Update your motherboard chipset drivers, specifically those for your IDE or SATA controllers.
6. Disable the Time synchronization in Windows XP/Vista. Normally found under the clock (double click it in the system tray), third tab (Internet in English), uncheck the sync option.
7. When you use BOINC's CPU throttling function, you can run into the too many exit(0)s error. The advice here is to disable the BOINC throttling (set it to 100%) and reduce the number of CPUs/cores for BOINC to use.
** Use at most 100.0 percent of CPU time.
* In BOINC 7.0, this is done through the option On multiprocessors, use at most xxx% of the processors.
This is obviously not a Rosetta-specific issue; it shows up on just about every project board at some time or another. Gary Roberts, the patient prince of einstein@home, explains what's happening in this post and the BOINC FAQ Service entry is here. Hope this helps. Snags
ID: 76343 · Rating: 0 · rate: /
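
To check whether the restarts really do recur on a fixed schedule (which would point at a background scan or scheduled job), the gaps between the messages can be measured. The following is a minimal sketch, not a BOINC tool: it assumes each Event Log line starts with a `DD-Mon-YYYY HH:MM:SS` timestamp and an English month abbreviation, and the sample lines and the task name `example_wu_0` are invented for illustration.

```python
from datetime import datetime

# Text of the message discussed in this thread.
MARKER = "exited with zero status but no 'finished' file"


def restart_intervals(lines):
    """Return the hours between consecutive log lines that contain MARKER."""
    stamps = []
    for line in lines:
        if MARKER not in line:
            continue
        # Assumption: the first two whitespace-separated fields are date and time.
        date_part, time_part = line.split()[:2]
        stamps.append(
            datetime.strptime(f"{date_part} {time_part}", "%d-%b-%Y %H:%M:%S")
        )
    return [(b - a).total_seconds() / 3600 for a, b in zip(stamps, stamps[1:])]


# Invented sample lines (hypothetical task name and timestamps).
sample = [
    "11-Jan-2014 09:00:05 [rosetta@home] Task example_wu_0 exited with zero status but no 'finished' file",
    "11-Jan-2014 09:00:05 [rosetta@home] If this happens repeatedly you may need to reset the project.",
    "11-Jan-2014 12:00:07 [rosetta@home] Task example_wu_0 exited with zero status but no 'finished' file",
    "11-Jan-2014 15:00:04 [rosetta@home] Task example_wu_0 exited with zero status but no 'finished' file",
]

for hours in restart_intervals(sample):
    print(f"{hours:.2f} h since previous restart")
```

If the printed gaps cluster tightly around one value, comparing that value with the schedules of anti-virus scans, backups, or indexing jobs is a reasonable next step.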

AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0	Message 76348 - Posted: 14 Jan 2014, 19:46:41 UTC - in response to Message 76343.
Huge thanks Snags. Re: the list 1) already done 2) already done 3) already done 4) check 5) already done 6) check 7) already done
Problem persists. A new WU restarted 3 times. Interestingly, at 3 hour intervals. It's odd that everything was fine for about 10 days after installing the latest BOINC Mgr. Then this issue cropped up. When this happens and the WU is restarted, does the computation begin anew for the WU, or does it pick up near the point where it exited?
ID: 76348 · Rating: 0 · rate: /

dcdc Send message Joined: 3 Nov 05 Posts: 1836 Credit: 124,981,563 RAC: 0	Message 76349 - Posted: 14 Jan 2014, 21:18:12 UTC - in response to Message 76348. Last modified: 14 Jan 2014, 21:18:54 UTC When this happens and the WU is restarted, does the computation begin anew for the WU, or does it pick up near the point where it exited? Some types of task are able to checkpoint at certain intervals, and all tasks will checkpoint when a model is completed. A task (as you see them in BOINC Manager) can contain multiple decoys/models, so at the end of each of these there will effectively be a checkpoint. So, if your computer reaches either a checkpoint, or completes a model and moves on to another one (still within the same task), then it will pick up from that point when Rosetta restarts. If it doesn't reach one of those points before being stopped then it will restart the task from the beginning. ID: 76349 · Rating: 0 · rate: /
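
The resume behaviour described above can be sketched as a small simulation. This is an illustration of the description only, not Rosetta's actual code: the function name, the idea of counting "steps" within a model, and all the numbers are assumptions made for the example.

```python
def resume_point(steps_per_model, checkpoint_every, stop_after):
    """Simulate a run interrupted after `stop_after` steps in total.

    Returns (model_index, step_index): where a restart would pick up.
    checkpoint_every=None models a task type that cannot checkpoint
    mid-model and only saves when a model (decoy) is completed.
    """
    saved = (0, 0)  # nothing saved yet: a restart begins from scratch
    done = 0
    model = 0
    while True:
        for step in range(1, steps_per_model + 1):
            if done == stop_after:
                return saved
            done += 1
            if checkpoint_every and step % checkpoint_every == 0 and step < steps_per_model:
                saved = (model, step)  # mid-model checkpoint
        model += 1
        saved = (model, 0)  # completing a model always acts as a checkpoint


# Stopped halfway through the first model, no mid-model checkpoints:
print(resume_point(100, None, 50))   # restart from the very beginning
# Stopped halfway through the third model, no mid-model checkpoints:
print(resume_point(100, None, 250))  # restart at the start of the third model
# Same stop point, but with a checkpoint every 30 steps:
print(resume_point(100, 30, 250))    # restart from the last mid-model checkpoint
```

The first case is the costly one: everything computed since the task started is redone, which matches the inefficiency being asked about.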

Snags Send message Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0	Message 76357 - Posted: 17 Jan 2014, 14:43:35 UTC - in response to Message 76349.
When this happens and the WU is restarted, does the computation begin anew for the WU, or does it pick up near the point where it exited?
Some types of task are able to checkpoint at certain intervals, and all tasks will checkpoint when a model is completed. A task (as you see them in BOINC Manager) can contain multiple decoys/models, so at the end of each of these there will effectively be a checkpoint. So, if your computer reaches either a checkpoint, or completes a model and moves on to another one (still within the same task), then it will pick up from that point when Rosetta restarts. If it doesn't reach one of those points before being stopped then it will restart the task from the beginning.
I see Danny has answered your question, so I'll just chime back in to say this doesn't cause a problem for rosetta@home; it just increases the computer cycles per workunit, causing a bit of inefficiency on your end. Eventually you will see a task error out when a model can't complete (after a hundred tries) but I doubt it will happen very often. What else changed around the time you updated BOINC? Maybe I'm fixated on the three hour interval, but it seems most likely to be caused by Windows or some software other than BOINC. As I don't run Windows I don't know how you can see what's happening every three hours. If no one here has a suggestion I would post on the BOINC message boards, where both BOINC and Windows gurus hang out, and see if they don't have some useful ideas. You might want to use "Task ... exited with zero status but no 'finished' file" as the message title, and be sure to give them the details of your troubleshooting efforts (Jord's checklist) in your first post. Good luck. Let us know what you discover. Snags
ID: 76357 · Rating: 0 · rate: /

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 76363 - Posted: 19 Jan 2014, 15:24:49 UTC If a task finds itself restarting from the same point more than 5 times, it is ended for you. This might occur if you have a number of reboots to install operating system or application fixes, or have a long-running task on a day when your machine doesn't run very long before you power off, or if you do not keep tasks in memory and suspend a task repeatedly. But otherwise, this check is there to help ensure that if a task doesn't seem to be running properly on your machine, that it doesn't get hung up. snagles mentioned 100 tries, I just wanted to reassure folks that it only takes 5. I can't recall exactly if the 5th try is the one that aborts, or if five restarts would actually be detected on a 6th start attempt. The idea was to let a task survive some of the normal PC activities of installing fixes etc. but to get it killed if it's not running properly. Rosetta Moderator: Mod.Sense ID: 76363 · Rating: 0 · rate: /
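
The restart limit described above can be sketched as a simple counter. This is an illustration only, not the project's implementation: the post itself is unsure whether the fifth or sixth attempt triggers the abort, so the sketch assumes "more than 5 restarts from the same point", and the function and constant names are hypothetical.

```python
# Figure quoted in the post; the exact off-by-one behaviour is an assumption.
MAX_RESTARTS_FROM_SAME_POINT = 5


def should_abort(start_points):
    """Decide whether a task would be ended.

    start_points lists the position the task began from on each start,
    in order. The first entry is the initial start, not a restart.
    """
    repeats = 0
    previous = None
    for point in start_points:
        if point == previous:
            repeats += 1      # restarted from the same place again
        else:
            repeats = 0       # progress was made, so the counter resets
            previous = point
        if repeats > MAX_RESTARTS_FROM_SAME_POINT:
            return True
    return False


print(should_abort([0] * 6))                # initial start plus 5 restarts
print(should_abort([0] * 7))                # initial start plus 6 restarts
print(should_abort([0, 0, 0, 1, 1, 1, 2]))  # frequent reboots, but progress each time
```

Resetting the counter whenever the resume point moves is what lets a task survive ordinary reboots while still ending one that never gets anywhere.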

SBF-GODS-STONE Send message Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0	Message 76781 - Posted: 29 May 2014, 22:07:26 UTC
Just a note to let you know I have found a small error in processing of your tasks. I run with a 6 GB work disk for BOINC tasks and I have noticed that when things slow down it tends to be "Fragmented Work Disk". I have isolated it to these tasks rather than any additional projects. Every 5 hours I defrag this work disk and off it goes at full speed (AMD 4.7 GHz 4-core CPU). Thanks...
ID: 76781 · Rating: 0 · rate: /

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 76782 - Posted: 30 May 2014, 4:07:21 UTC Last modified: 30 May 2014, 4:08:23 UTC
Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag? 6GB doesn't sound like much when you have 4 CPUs. Would it behave better if the work disk were larger? It would seem as though all of the storage used for a given BOINC slots folder would be freed when the task is completed. Rosetta Moderator: Mod.Sense
ID: 76782 · Rating: 0 · rate: /

Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0	Message 76784 - Posted: 30 May 2014, 22:15:12 UTC - in response to Message 76781.
Just a note to let you know I have found a small error in processing of your tasks. I run with a 6 GB work disk for BOINC tasks and I have noticed that when things slow down it tends to be "Fragmented Work Disk". I have isolated it to these tasks rather than any additional projects. Every 5 hours I defrag this work disk and off it goes at full speed (AMD 4.7 GHz 4-core CPU). Thanks...
Out of curiosity, is your swap file on the same partition as your BOINC data directory? You appear to be running Windows XP with 3GB of memory, so that is about 750MB per core (not including any memory reserved for system processes). As Rosetta tasks are heavily dependent on memory, you may experience shortages which the system will try to alleviate by putting more pressure on the swap file. Frequent changes to your swap file may in turn be encouraging fragmentation in the rest of the partition. Also, if you are hitting the limit of memory with 4 Rosetta tasks at the same time, that will be another reason for the slowdown. If other projects are working fine, that lends weight to the memory theory, as Rosetta is the heaviest memory user of the BOINC projects I am familiar with. I'd suggest cutting down to 2 or 3 cores for a day or two to see if there is any significant reduction in fragmentation.
ID: 76784 · Rating: 0 · rate: /
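
The per-core figure above is plain division, and the same arithmetic shows what cutting down to 2 or 3 cores would buy. A minimal sketch, assuming a round 3000 MB of RAM; the `reserved_mb` parameter is a hypothetical allowance for the operating system and other programs, not a measured value.

```python
def memory_per_task_mb(total_ram_mb, running_tasks, reserved_mb=0):
    """Average RAM available to each running task, in MB."""
    return (total_ram_mb - reserved_mb) / running_tasks


# 3 GB machine taken as 3000 MB, with nothing reserved for the system.
for tasks in (4, 3, 2):
    print(f"{tasks} tasks: {memory_per_task_mb(3000, tasks):.0f} MB each")

# Same machine with a hypothetical 500 MB kept back for the OS and a browser.
print(f"4 tasks: {memory_per_task_mb(3000, 4, reserved_mb=500):.0f} MB each")
```

With any realistic reservation the four-task case drops well below 750 MB per task, which is why fewer concurrent tasks is the suggested experiment.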

SBF-GODS-STONE Send message Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0	Message 76786 - Posted: 31 May 2014, 6:48:25 UTC - in response to Message 76782.
Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag?
WHAT IS THE REASON TO WRITE TO DISK AT A GIVEN TIME? CAN YOU, SAY, GET A MEMORY BUFFER FOR THIS PROCESS AND REPLACE THE WRITE TO DISK? OR DO YOU WRITE TO DISK FOR A SEPARATE TASK TO FIND??? I.E. TEMP WRITE.... If you write to a buffer the system will push it around in memory or send it to the swap file, and whatever function requires it, the system will return it to memory. WRITING TO DISK IS VERY TIME AND PROCESS TIME HEAVY.... MY memory is very fast compared to disk writing and reading (this is true in all SYSTEMS).
6GB doesn't sound like much when you have 4 CPUs. Would it behave better if the work disk were larger? It would seem as though all of the storage used for a given BOINC slots folder would be freed when the task is completed.
NO, THE BOINC DISK IS SEPARATE FROM THE SYSTEM SWAP DISK (CHANNELS AND DISKS)
ID: 76786 · Rating: 0 · rate: /

SBF-GODS-STONE Send message Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0	Message 76788 - Posted: 1 Jun 2014, 18:03:01 UTC - in response to Message 76786.
Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag?
WHAT IS THE REASON TO WRITE TO DISK AT A GIVEN TIME? CAN YOU, SAY, GET A MEMORY BUFFER FOR THIS PROCESS AND REPLACE THE WRITE TO DISK? OR DO YOU WRITE TO DISK FOR A SEPARATE TASK TO FIND??? I.E. TEMP WRITE.... If you write to a buffer the system will push it around in memory or send it to the swap file, and whatever function requires it, the system will return it to memory. WRITING TO DISK IS VERY TIME AND PROCESS TIME HEAVY.... MY memory is very fast compared to disk writing and reading (this is true in all SYSTEMS).
6GB doesn't sound like much when you have 4 CPUs. Would it behave better if the work disk were larger? It would seem as though all of the storage used for a given BOINC slots folder would be freed when the task is completed.
NO, THE BOINC DISK IS SEPARATE FROM THE SYSTEM SWAP DISK (CHANNELS AND DISKS)
JUST FOR THE INFO SIDE, I HAVE ENLARGED THE BOINC DATA WORK DISK TO 10+ GB
ID: 76788 · Rating: 0 · rate: /

Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0	Message 76789 - Posted: 1 Jun 2014, 22:33:18 UTC - in response to Message 76788. Last modified: 1 Jun 2014, 22:33:47 UTC
WHAT IS THE REASON TO WRITE TO DISK AT A GIVEN TIME? CAN YOU, SAY, GET A MEMORY BUFFER FOR THIS PROCESS AND REPLACE THE WRITE TO DISK?
NO, THE BOINC DISK IS SEPARATE FROM THE SYSTEM SWAP DISK (CHANNELS AND DISKS)
JUST FOR THE INFO SIDE, I HAVE ENLARGED THE BOINC DATA WORK DISK TO 10+ GB
You may not be aware, but writing entirely in capitals is considered to be the equivalent of shouting on most internet forums. As I am a volunteer like you, I choose not to interact with shouting people most of the time as it usually just stresses me out for no benefit. However, I will assume that you aren't aware of the shouting issue. I am not an expert but I believe Rosetta will write to disk for several reasons:
1. When data is downloaded ready to start the task.
2. When using the swap file.
3. When saving progress at a checkpoint so your work isn't lost in the case of a system failure or shutdown.
4. When the task is completed and ready to upload to the server.
You mentioned in one of your posts that you have fast memory; however, as I pointed out before, you don't have enough memory. Rosetta tasks frequently use hundreds of MB of memory per task, so if you have 4 tasks running at once you have no spare capacity. At best you will be using your swap file almost constantly. At worst Rosetta will be checkpointing and suspending a task while it waits for more memory to become available.
ID: 76789 · Rating: 0 · rate: /

SBF-GODS-STONE Send message Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0	Message 76791 - Posted: 2 Jun 2014, 17:54:15 UTC - in response to Message 76789.
WHAT IS THE REASON TO WRITE TO DISK AT A GIVEN TIME? CAN YOU, SAY, GET A MEMORY BUFFER FOR THIS PROCESS AND REPLACE THE WRITE TO DISK?
NO, THE BOINC DISK IS SEPARATE FROM THE SYSTEM SWAP DISK (CHANNELS AND DISKS)
JUST FOR THE INFO SIDE, I HAVE ENLARGED THE BOINC DATA WORK DISK TO 10+ GB
You may not be aware, but writing entirely in capitals is considered to be the equivalent of shouting on most internet forums. As I am a volunteer like you, I choose not to interact with shouting people most of the time as it usually just stresses me out for no benefit. However, I will assume that you aren't aware of the shouting issue. I am not an expert but I believe Rosetta will write to disk for several reasons:
1. When data is downloaded ready to start the task.
2. When using the swap file.
3. When saving progress at a checkpoint so your work isn't lost in the case of a system failure or shutdown.
4. When the task is completed and ready to upload to the server.
You mentioned in one of your posts that you have fast memory; however, as I pointed out before, you don't have enough memory. Rosetta tasks frequently use hundreds of MB of memory per task, so if you have 4 tasks running at once you have no spare capacity. At best you will be using your swap file almost constantly. At worst Rosetta will be checkpointing and suspending a task while it waits for more memory to become available.
No, I was not aware of the shouting. I do not blog unless I have to let someone know what is happening. Thanks much for the info. Now about the memory: the CPUs don't use the system memory for processing instructions. This CPU chip has, for L2 and L3 cache storage, 1 and 2 gigs of memory on the chip. All other activity is done with system resources. The writing that the developers are doing is, I am only assuming, for computation and temporary storage between tasks. If they used memory (if possible) for this type of operation, the tasks would run more effectively. The overhead of disk writing taxes the process very heavily in general. It may not be possible, but I thought I would mention it. I wrote system software for years and have a little idea of what happens in systems.
ID: 76791 · Rating: 0 · rate: /

SBF-GODS-STONE Send message Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0	Message 76792 - Posted: 2 Jun 2014, 18:00:52 UTC - in response to Message 76782.
Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag? 6GB doesn't sound like much when you have 4 CPUs. Would it behave better if the work disk were larger? It would seem as though all of the storage used for a given BOINC slots folder would be freed when the task is completed.
Thanks for the e-mail. The system, yours or mine, hung and somehow double-wrote that message; I was unaware. I did raise the work disk (BOINC) to 10+ GB. The result so far is good; you may have pointed out that BOINC should be installed on bigger disks. Have a good day.
ID: 76792 · Rating: 0 · rate: /

SBF-GODS-STONE Send message Joined: 6 Nov 05 Posts: 15 Credit: 44,784 RAC: 0	Message 76793 - Posted: 2 Jun 2014, 18:07:43 UTC - in response to Message 76363.
If a task finds itself restarting from the same point more than 5 times, it is ended for you. This might occur if you have a number of reboots to install operating system or application fixes, or have a long-running task on a day when your machine doesn't run very long before you power off, or if you do not keep tasks in memory and suspend a task repeatedly. But otherwise, this check is there to help ensure that if a task doesn't seem to be running properly on your machine, that it doesn't get hung up. snagles mentioned 100 tries, I just wanted to reassure folks that it only takes 5. I can't recall exactly if the 5th try is the one that aborts, or if five restarts would actually be detected on a 6th start attempt. The idea was to let a task survive some of the normal PC activities of installing fixes etc. but to get it killed if it's not running properly.
Just saw this. I run this machine 24 hrs, 7 days a week. I schedule work for two projects at a time because it runs way faster. Also, running four units at once has no real problems with output. Thanks.
ID: 76793 · Rating: 0 · rate: /

P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0	Message 76795 - Posted: 2 Jun 2014, 23:08:29 UTC Last modified: 2 Jun 2014, 23:22:06 UTC
Hi SBF-GODS-STONE. You seem to be mixing up CPU cache with system RAM; your system is showing about 3 gigabytes of RAM ( Memory 2989.52 MB ). Rosetta can use up to and over 1 gig of RAM per task/WU sometimes, and there is no way that can fit in the CPU's cache.
You said - This CPU chip has, for L2 and L3 cache storage, 1 and 2 gigs of memory on the chip.
====================================================
See the specs for your CPU from the CPU-World site.
AMD FX-4350
Frequency: 4200 MHz
Turbo frequency: 4300 MHz
Level 1 cache size: 2 x 64 KB shared instruction caches, 4 x 16 KB data caches
Level 2 cache size: 2 x 2 MB shared exclusive caches
Level 3 cache size: 8 MB shared cache
Memory controller: The number of controllers: 1
Memory channels: 2
Supported memory: DDR3-1866
======================================================
i.m.h.o. - Sorry, you need more RAM!
ID: 76795 · Rating: 0 · rate: /
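
The gap between cache and RAM can be made concrete by adding up the cache sizes listed above and comparing the total with a task that uses 1 GB. This is a worked arithmetic sketch only; the 1 GB figure is the "up to and over 1 gig" mentioned in the post, taken as a round number, and binary units (1 MB = 1024 KB) are assumed.

```python
KB_PER_MB = 1024

# Cache sizes from the quoted AMD FX-4350 spec, all converted to KB.
l1_kb = 2 * 64 + 4 * 16      # 2 x 64 KB instruction + 4 x 16 KB data
l2_kb = 2 * 2 * KB_PER_MB    # 2 x 2 MB
l3_kb = 8 * KB_PER_MB        # 8 MB

total_cache_kb = l1_kb + l2_kb + l3_kb

# One Rosetta task using about 1 GB of RAM, in KB.
task_kb = 1 * 1024 * KB_PER_MB
ratio = task_kb / total_cache_kb

print(f"total cache: {total_cache_kb / KB_PER_MB:.2f} MB")
print(f"a 1 GB task is about {ratio:.0f} times larger than all caches combined")
```

All cache levels together come to roughly 12 MB, so the working set of even one large task has to live in system RAM, and four such tasks cannot fit in about 3 GB.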