Report stuck & aborted WU here please

[DPC]Division_Brabant~OldButNotSoWise Joined: 23 Jan 06 Posts: 42 Credit: 371,797 RAC: 0	Message 14372 - Posted: 22 Apr 2006, 13:41:26 UTC Last modified: 22 Apr 2006, 13:42:01 UTC
What should I do with this one? It is at 1.6% after 17:30:00 hours of crunching, but the graphics are still very active. If it's not an error or a stuck WU, I don't mind that it takes its time :)
http://members.lycos.nl/oldbutnotsowise/fora/rosetta_wu.png

Rebel Alliance Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0	Message 14391 - Posted: 22 Apr 2006, 16:54:21 UTC
https://boinc.bakerlab.org/rosetta/result.php?resultid=17824571 Aborted after 12 hours
https://boinc.bakerlab.org/rosetta/result.php?resultid=17825321 7 hours for this one

Runaway1956 Joined: 5 Nov 05 Posts: 19 Credit: 535,400 RAC: 0	Message 14393 - Posted: 22 Apr 2006, 17:06:11 UTC
4/22/2006 11:59:27 AM|rosetta@home|Pausing result TRUNCATE_TERMINI_FULLRELAX_1enh__433_178_0 (left in memory)
After this post, I'm going to abort this one. It seems to have run for two days before I caught it and restarted BOINC to see what would happen. It just hung at 1.something percent, and the remaining time climbed past 30 hours. I SHOULD have copied the messages concerning this WU before resetting BOINC; all were gone when it restarted. Sorry about that.

Grutte Pier [Wa Oars]~Ytsmabeer Joined: 10 Nov 05 Posts: 2 Credit: 100,205 RAC: 0	Message 14403 - Posted: 22 Apr 2006, 18:08:20 UTC
Reporting a WU which I aborted because it had been running for 17 hours and because of what I read about the HBLR type.
HBLR_1.0_1ogw_420_8424
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13422021
It had been running for 17 hours and was 14% complete.

Rebel Alliance Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0	Message 14455 - Posted: 23 Apr 2006, 6:02:38 UTC
Just aborted 4 work units from 4 different machines. The longest had been running close to 10 hours and was at 5%; the shortest had run 6 hours and was at 1%.
#1 from 2700xp
Result ID: 17772227
Name: HBLR_1.0_1mky_420_9630_1
Workunit: 13428053
Created: 20 Apr 2006 21:42:41 UTC
Sent: 21 Apr 2006 4:22:49 UTC
Received: 23 Apr 2006 5:53:20 UTC
Server state: Over
Outcome: Client error
Client state: Computing
Exit status: -197 (0xffffff3b)
Computer ID: 148992
Report deadline: 5 May 2006 4:22:49 UTC
CPU time: 32013.537868
#2 from 1800 xp
Result ID: 17805638
Name: NO_TERM_STRAND_1ogw_423_6947_2
Workunit: 13496532
Created: 21 Apr 2006 5:49:41 UTC
Sent: 21 Apr 2006 8:05:02 UTC
Received: 23 Apr 2006 5:52:38 UTC
Server state: Over
Outcome: Client error
Client state: Computing
Exit status: -197 (0xffffff3b)
Computer ID: 105489
Report deadline: 5 May 2006 8:05:02 UTC
CPU time: 24477.506926
#3 from 2000 xp
Result ID: 17748958
Name: FACONTACTS_RECENTER_NOFILTERS_1ig5A_448_551_1
Workunit: 14550587
Created: 20 Apr 2006 16:34:25 UTC
Sent: 20 Apr 2006 22:38:14 UTC
Received: 23 Apr 2006 5:51:22 UTC
Server state: Over
Outcome: Client error
Client state: Computing
Exit status: -197 (0xffffff3b)
Computer ID: 106748
Report deadline: 4 May 2006 22:38:14 UTC
CPU time: 25011.984375
#4 from 2500 xp
Result ID: 17786001
Name: HBLR_1.0_1n0u_ROT_TRIALS_TRIE_449_5_0
Workunit: 14630032
Created: 21 Apr 2006 1:00:11 UTC
Sent: 21 Apr 2006 3:09:30 UTC
Received: 23 Apr 2006 5:50:36 UTC
Server state: Over
Outcome: Client error
Client state: Computing
Exit status: -197 (0xffffff3b)
Computer ID: 107679
Report deadline: 5 May 2006 3:09:30 UTC
CPU time: 22721.8125
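
A note on reading the result details above: every one of them reports `Exit status -197 (0xffffff3b)`, and the two forms are the same number. The hex value is simply -197 written as an unsigned 32-bit (two's-complement) integer, and the CPU time field is in seconds. The short Python sketch below only illustrates that arithmetic; the function names are made up for this example and are not part of BOINC or Rosetta.

```python
def exit_status_to_hex(status: int) -> str:
    """Render a signed exit status as a 32-bit two's-complement hex string."""
    # Masking with 0xFFFFFFFF maps negative values into the unsigned 32-bit range.
    return f"0x{status & 0xFFFFFFFF:08x}"


def cpu_seconds_to_hours(seconds: float) -> float:
    """Convert a reported CPU time in seconds to hours."""
    return seconds / 3600.0


if __name__ == "__main__":
    print(exit_status_to_hex(-197))  # 0xffffff3b
    # CPU times copied from the four results listed in the post.
    for cpu in (32013.537868, 24477.506926, 25011.984375, 22721.8125):
        print(f"{cpu_seconds_to_hours(cpu):.1f} h of CPU time")
```

Read this way, the four CPU times are all between roughly 6 and 9 hours, which is in line with the wall-clock times mentioned at the top of the post.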

Runaway1956 Joined: 5 Nov 05 Posts: 19 Credit: 535,400 RAC: 0	Message 14518 - Posted: 24 Apr 2006, 4:06:36 UTC
What to do about upload errors? This isn't the first one I've seen, but this is the first 600-point upload error, lol.
4/23/2006 22:55:04 PM||Benchmark results:
4/23/2006 22:55:04 PM|| Number of CPUs: 1
4/23/2006 22:55:04 PM|| 2931 double precision MIPS (Whetstone) per CPU
4/23/2006 22:55:04 PM|| 9825 integer MIPS (Dhrystone) per CPU
4/23/2006 22:55:04 PM||Finished CPU benchmarks
4/23/2006 22:55:05 PM|rosetta@home|Resuming computation for result 7521_largescale_large_fullatom_relax_dec7521_1_09_2.pdb_437_69_1 using rosetta version 498
4/23/2006 22:55:05 PM||Resuming computation
4/23/2006 22:55:05 PM||Rescheduling CPU: Resuming computation
4/23/2006 22:55:05 PM||Using earliest-deadline-first scheduling because computer is overcommitted.
4/23/2006 22:56:06 PM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/275/7515_largescale_large_fullatom_relax_dec7515_1_66_1.pdb_436_146_0_0 35688 bytes != offset 0 bytes
Most of those errors have been on the slower machines, before I set my prefs to run for a whole day.
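
For anyone who wants to sift through a long message log for errors like the one quoted above, here is a rough Python sketch. It assumes only what the quoted lines show: three pipe-separated fields (timestamp, project, message) and the wording of the upload error. `LogLine`, `parse_log_line` and `parse_upload_error` are names invented for this example, not BOINC tools.

```python
import re
from typing import NamedTuple, Optional


class LogLine(NamedTuple):
    timestamp: str
    project: str  # empty string for client-wide messages
    message: str


def parse_log_line(line: str) -> LogLine:
    """Split one 'timestamp|project|message' line into its three fields."""
    # Tolerate backslash-escaped pipes, which show up in some pasted logs.
    line = line.replace("\\|", "|")
    timestamp, project, message = line.split("|", 2)
    return LogLine(timestamp.strip(), project.strip(), message.strip())


# Pattern based solely on the error text quoted in the post.
UPLOAD_ERROR = re.compile(
    r"Error on file upload: length of file (?P<path>\S+) "
    r"(?P<length>\d+) bytes != offset (?P<offset>\d+) bytes"
)


def parse_upload_error(message: str) -> Optional[dict]:
    """Return path, length and offset if the message is an upload length error."""
    match = UPLOAD_ERROR.search(message)
    if match is None:
        return None
    return {
        "path": match["path"],
        "length": int(match["length"]),
        "offset": int(match["offset"]),
    }
```

Applied to the last log line in the post, this would report a length of 35688 bytes against an offset of 0 bytes, which is the mismatch the error message is describing.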

JZ-power Joined: 9 Nov 05 Posts: 1 Credit: 374,157 RAC: 0	Message 14553 - Posted: 24 Apr 2006, 22:41:59 UTC
I have 3 WUs, all on version 4.98. I ended them because they got stuck at 1.04%:
TRUNCATE_TERMINI_FULLRELAX_2tif__433_230_0 ResultID: 16980143
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_219_1 ResultID: 16991986
TRUNCATE_TERMINI_FULLRELAX_1enh__433_303_0 ResultID: 16987980

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 14609 - Posted: 25 Apr 2006, 18:51:24 UTC - in response to Message 14207.
> I like the idea below of not passing on bad jobs to another client when they fail, so only 1 computer will have the problem, not 4. I'm running this idea by David Baker and David Kim now. Unlike other BOINC projects, it's not critical for every single workunit to get processed. It's way more important to keep bad workunits from causing trouble!
What's the status on the idea to set max results to 1? Any decision taken yet?

surrealchereal Joined: 6 Nov 05 Posts: 23 Credit: 243,559 RAC: 0	Message 14658 - Posted: 26 Apr 2006, 11:10:24 UTC
I had one stuck on 1.04% also, but now it's gone and so is everything. I can't connect to the server now either. What should I do?
Come BOINC with me! USALUG !!

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 14676 - Posted: 26 Apr 2006, 15:34:46 UTC - in response to Message 14609.
> What's the status on the idea to set max results to 1? Any decision taken yet?
With the current version being tested in Ralph, if the watchdog aborts a WU it is considered "valid" and so it's not sent out again.

Slaughtercult Joined: 4 Nov 05 Posts: 1 Credit: 4,399,556 RAC: 0	Message 14702 - Posted: 26 Apr 2006, 21:03:30 UTC Last modified: 26 Apr 2006, 21:04:07 UTC
I aborted WU 13416703 (HBLR_1.0_1mky_420_7360) after 12.5 hours at 2%. A few hours earlier it was at 3.x%.
Greetings

Bommer Joined: 26 Nov 05 Posts: 3 Credit: 4,603,378 RAC: 0	Message 15088 - Posted: 30 Apr 2006, 17:19:00 UTC Last modified: 30 Apr 2006, 17:21:30 UTC
What should I do with this one? FARELAX_NOFILTERS_1rnbA_413_201_3 is at 4.97% after 26:27:13 hours of crunching, but the graphics are still very active. If it's not an error or a stuck WU, I don't mind that it takes its time :)
RESULT ID 18302618
WORKUNIT ID 12816946
I haven't aborted it. The deadline is 10 May 2006. The WU is on hold.
Thanks, Bommer

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 15098 - Posted: 30 Apr 2006, 19:05:10 UTC - in response to Message 15088.
> What should I do with this one? FARELAX_NOFILTERS_1rnbA_413_201_3 is at 4.97% after 26:27:13 hours of crunching, but the graphics are still very active. If it's not an error or a stuck WU, I don't mind that it takes its time :)
If your computer is still using rosetta version 5.01, then the WU is probably in an infinite loop and should be aborted. Version 5.07 has a watchdog thread, and it's best to let the watchdog do any aborting (if needed), as it will then send back information that is useful to the project.
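
To make the watchdog idea a little more concrete, below is a generic Python sketch of a watchdog thread. It is not Rosetta's implementation, whose internals are not described in this thread. The stall rule used here (no change in a progress value for `stall_seconds`) is an assumption made for the example, and `Watchdog`, `get_progress`, `stall_seconds` and `poll_seconds` are hypothetical names.

```python
import threading
import time


class Watchdog:
    """Flags a job as stuck when its progress value stops changing."""

    def __init__(self, get_progress, stall_seconds, poll_seconds=1.0):
        self._get_progress = get_progress    # callable returning current progress
        self._stall_seconds = stall_seconds  # how long without change counts as stuck
        self._poll_seconds = poll_seconds
        self.stuck = threading.Event()       # set once a stall is detected
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        """Stop polling; call only after start()."""
        self._stop.set()
        self._thread.join()

    def _run(self):
        last_value = self._get_progress()
        last_change = time.monotonic()
        # Event.wait returns False on timeout, so this loops once per poll interval.
        while not self._stop.wait(self._poll_seconds):
            value = self._get_progress()
            now = time.monotonic()
            if value != last_value:
                last_value, last_change = value, now
            elif now - last_change >= self._stall_seconds:
                # A real client would end the job and report diagnostics here.
                self.stuck.set()
                return
```

The point of running the check in its own thread is the one made in the post: the job can be ended by the application itself, with a report attached, rather than by the user pressing abort.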

Bommer Joined: 26 Nov 05 Posts: 3 Credit: 4,603,378 RAC: 0	Message 15141 - Posted: 1 May 2006, 7:42:55 UTC Last modified: 1 May 2006, 8:18:39 UTC
Hello @Moderator9: My computers are now shown on the web site. The WU is using rosetta version 5.01. The current processor time is 40 hours at 7%. The WU is now on hold.
RESULT ID 18302618
WORKUNIT ID 12816946
My computer is an AMD X2 4600+ with Win XP Prof Service Pack 2.
Greets, Bommer

belldandy from pleiades Joined: 2 Nov 05 Posts: 6 Credit: 102,731 RAC: 0	Message 15144 - Posted: 1 May 2006, 8:55:19 UTC
2 WUs that I aborted because they were taking way too much time (the usual is 2-3 hours). They didn't hang, though.
https://boinc.bakerlab.org/rosetta/result.php?resultid=17827510 FACONTACTS_NOFILTERS_1r69__441_248_1
https://boinc.bakerlab.org/rosetta/result.php?resultid=17773776 HBLR_1.0_2tif_420_9927_1
Version for both is 5.01.
Campeones everywhere!

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 15147 - Posted: 1 May 2006, 9:54:47 UTC - in response to Message 15141.
> Hello @Moderator9: My computers are now shown on the web site. The WU is using rosetta version 5.01. The current processor time is 40 hours at 7%. The WU is now on hold. RESULT ID 18302618 WORKUNIT ID 12816946 My computer is an AMD X2 4600+ with Win XP Prof Service Pack 2. Greets, Bommer
Abort! It's a faulty WU which won't get aborted by 5.01 in time.

Bommer Joined: 26 Nov 05 Posts: 3 Credit: 4,603,378 RAC: 0	Message 15180 - Posted: 1 May 2006, 16:39:16 UTC
Hello, now my last question: how many credits do I get for the aborted WU? Here is the link: https://boinc.bakerlab.org/rosetta/result.php?resultid=18302618
Greets, Bommer

cduk Joined: 10 Dec 05 Posts: 3 Credit: 27,710 RAC: 0	Message 15187 - Posted: 1 May 2006, 16:54:25 UTC Last modified: 1 May 2006, 16:55:40 UTC
One stuck at 1.04% I'm afraid... Link here
Will the fact that I have "Leave applications in memory while preempted" set to "no" have any bearing on this?
CD

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 15193 - Posted: 1 May 2006, 17:41:33 UTC - in response to Message 15187. Last modified: 1 May 2006, 17:42:15 UTC
> One stuck at 1.04% I'm afraid... Link here Will the fact that I have "Leave applications in memory while preempted" set to "no" have any bearing on this? CD
How long has it been running?
What is your Rosetta preference for runtime? (looks like the default of 2hrs)
What Rosetta application version is shown in the Work tab?
Yes, if you change your General Preference to leave applications in memory, you will produce more work for your projects. Your PC just swaps Rosetta out to the paging file on disk while it is not running, so "leave in memory" is kind of a poor word choice. It just means that the application isn't completely ended. This allows it to pick up where it left off, except when you power down your PC. The new, more frequent checkpointing does much the same thing, which is especially important on these large proteins.
Those CASP WUs are "large" proteins, and it takes them much longer to complete each model. If it is truly "stuck", the watchdog will find it and end it. Each WU must complete at least one model, regardless of your time preference. So, if you have a short (2-4 hr) preference, a single model may still take 6 hours to complete. Once it does complete, it will see the time preference is exceeded, zip to 100% progress, and report back the result.
Please let it run at least 10 hrs before you abort it. If the steps are progressing, you've probably got a normal one there; it's just large. The user time preference is not an absolute thing. You have to crunch one model in order to have any results to report. You will find that 10-24hr runtimes are very predictable from one WU to the next. It's when the runtime preference is short and the WU is large that disparities occur.
Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/
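
The rule described in the post above (a started model always finishes, and the time preference is only looked at between models) can be pictured with a small simulation. This Python sketch is a simplification built from that description, not Rosetta's actual scheduling code. `run_work_unit` and its parameters are hypothetical, and the model durations are invented inputs.

```python
def run_work_unit(model_durations_hours, preferred_runtime_hours):
    """Simulate one work unit; return (models_completed, total_hours).

    model_durations_hours: hypothetical time each successive model would take.
    preferred_runtime_hours: the user's runtime preference.
    """
    elapsed = 0.0
    completed = 0
    for duration in model_durations_hours:
        # A model that has been started always runs to completion.
        elapsed += duration
        completed += 1
        # The preference is only checked between models.
        if elapsed >= preferred_runtime_hours:
            break
    return completed, elapsed


if __name__ == "__main__":
    # Large protein, short preference: one 6-hour model despite a 2-hour preference.
    print(run_work_unit([6.0, 6.0, 6.0], 2.0))
    # Small protein, same preference: several models, finishing on time.
    print(run_work_unit([0.5] * 100, 2.0))
```

With 6-hour models and a 2-hour preference, the simulated WU still runs for 6 hours, which is the disparity the post warns about. With a long preference the overshoot is at most one model, so runtimes look far more predictable.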

cduk Joined: 10 Dec 05 Posts: 3 Credit: 27,710 RAC: 0	Message 15203 - Posted: 1 May 2006, 18:59:02 UTC - in response to Message 15193.
> How long has it been running?
So far, nearly two hours.
> What is your Rosetta preference for runtime? (looks like the default of 2hrs)
"Not selected" (default: 4 hours?)
> What Rosetta application version is shown in the Work tab?
5.07. It was stuck at 1.044x (steps creeping up slowly). While I was writing this reply, it completed and reported successfully...? Confusing. Sorry if I've wasted your time, but many thanks indeed for your explanation.
