Message boards : Number crunching : Report stuck & aborted WU here please - II
Author | Message |
---|---|
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
It'd still be nice to have WUs that are only causing problems with, say, the Windows clients be sent out only to the Linux/Darwin users, instead of requiring 3 failures on Windows clients before they're shelved. And if BOINC/Rosetta sends any data back and forth during network connections (when we're looking to see if we need new work, returning a finished WU, or checking to see if there's a Rosetta update), a list of problem WUs that should be nuked could be piggybacked on that traffic, so a bad WU would only be left on the machine until the next connection to the server. Although, between the mentioned changes and hopefully better pre-testing on Ralph, we can hope that the problem would not crop up any more. :) |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
"It'd still be nice to have WUs that are only causing problems with say.. the Windows clients, only be sent out to the Linux/Darwin users, instead of requiring 3 failures on Windows clients before they're shelved." Or even a user-selected option for the client to report back to the servers every 3 to 6 hours. That could give them a lot of alpha info to see what works better and how any upgrades are working. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
I haven't yet got the watchdog thread into Rosetta 5.01, but we have very high hopes for it! It was a great idea from this message board. It should go into the next update, probably early next week, if the Windows build cooperates. (We're trying not to do updates during the weekend -- we seem to have had bad luck in the past!) I'm paying attention to the ideas about reverse trickle, keeping contact between client and server, etc. -- these are nice suggestions. As I explained below, those will likely require some changes in the BOINC code, and we'll need help from the BOINC crew. They've been pretty occupied with their upcoming release. I like the idea below of not passing on bad jobs to another client when they fail -- so only 1 computer will have the problem, not 4. I'm running this idea by David Baker and David Kim now. Unlike other BOINC projects, it's not critical for every single workunit to get processed. It's way more important to keep bad workunits from causing trouble! One final note: we just went through and granted credits to errored jobs in our database. I'm trying to code the watchdog so that it will gracefully abort, including the valid output of data, so that the job will automatically get credit (but will be tagged for us as a premature abort). Quoted message: "AGAIN my point here is the Rosetta system needs to have in place a method to remove bad work on servers and clients. To think this will not happen again is unwise." |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
Thank you, Rhiju, for listening to our needs and taking steps to fix or improve a very frustrating problem. If any of my words were at all harsh, please forgive me. It was not my intent. I just want to get my point across, and words do not come easily to me. I have checked all my nodes and not one is on 1.4, so I guess the bad ones are at an end. Again, thank you. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Your comments have been really helpful -- please continue to make suggestions. Hopefully by next week we can ensure that these stupid stuck-at-1.04% jobs never show up again on your computers. Thanks for hanging in there! Quoted message: "Thank you Rhiju" |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
Another HUGE amount of CPU time wasted!!!! https://boinc.bakerlab.org/rosetta/result.php?resultid=17734977 CPU time 42670.640625 Claimed credit 145.838794071523 I had to abort this one as it was caught in a loop. Action done around 6 AM AST. stderr out: <core_client_version>5.2.13</core_client_version> <message>aborted via GUI RPC </message> <stderr_txt> # cpu_run_time_pref: 21600 # random seed: 1509912 # cpu_run_time_pref: 21600 # Exception caught in nstruct loop ii=1 i=7 # num_decoys:6 attempts:7 cpu_run_time:30500.1 # Exception caught in nstruct loop ii=1 i=7 # num_decoys:6 attempts:8 cpu_run_time:33366.1 # Exception caught in nstruct loop ii=1 i=7 # num_decoys:6 attempts:9 cpu_run_time:34263 </stderr_txt> What irks me is that I was the second computer to receive this WU. I just hope that the third one that receives it is wise enough and aborts it before a lot of his CPU time is wasted. So don't gang up on me when I say ARGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!!!!!! PS: Ah, at least the new version doesn't wait too long to go the error way. On that one I will report on the 5.01 thread :( "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
"ANother HUGE ammount of CPU time wasted!!!! ..." Jose, your time is not wasted. Look at this post. According to that statement, the results are used and you will be granted credit. So perhaps not so much ARGH but more like AHHH! Regards, Phil |
Steven Purvis Joined: 17 Sep 05 Posts: 1 Credit: 7,977,160 RAC: 1,779 |
I've just aborted about 6 work units for Rosetta 4.98 with names starting 7486_largescale_large_full_atom_relax_XXXXXXXXXXXX. They all seemed to be stuck, getting to about 1.4% but no higher. I have the "don't remove workunits from memory" option enabled, so that shouldn't cause a problem. The work unit results were: 17191225, 17191227, 17191336, 17191339, 17191352, 17191374. Hope this is useful in some way. |
Joined: 4 Dec 05 Posts: 5 Credit: 118,303 RAC: 0 |
PROD_ABINITIO_FAST_1tul__447_32515 That one got aborted by BOINC. Claimed credit 251, hope 2 see that one day ;) |
[DPC]Division_Brabant~OldButNotSoWise Joined: 23 Jan 06 Posts: 42 Credit: 371,797 RAC: 0 |
What should I do with this one? 1.6% after 17:30:00 hours of crunching, but still very active with the graphics. If it's not an error or a stuck WU, I don't mind that it takes its time :) http://members.lycos.nl/oldbutnotsowise/fora/rosetta_wu.png |
Rebel Alliance Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=17824571 Aborted after 12 hours https://boinc.bakerlab.org/rosetta/result.php?resultid=17825321 7 hours for this one |
Joined: 5 Nov 05 Posts: 19 Credit: 535,400 RAC: 0 |
4/22/2006 11:59:27 AM|rosetta@home|Pausing result TRUNCATE_TERMINI_FULLRELAX_1enh__433_178_0 (left in memory) After this post, I'm going to abort this one. It seems to have run for two days before I caught it and restarted BOINC to see what would happen. It just hung at 1.something percent, and the remaining time climbed past 30 hours. I SHOULD have copied the messages concerning this WU before resetting BOINC; all were gone when it restarted. Sorry about that. |
Joined: 10 Nov 05 Posts: 2 Credit: 100,205 RAC: 0 |
Reporting a WU which I aborted because it had been running for 17 hours and after reading about the HBLR type: HBLR_1.0_1ogw_420_8424 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13422021 It had been running for 17 hours and was 14% complete. |
Rebel Alliance Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0 |
Just aborted 4 work units from 4 different machines. The longest had been running close to 10 hours and was at 5%; the shortest, 6 hours and at 1%. #1, from 2700 XP: Result ID 17772227; Name HBLR_1.0_1mky_420_9630_1; Workunit 13428053; Created 20 Apr 2006 21:42:41 UTC; Sent 21 Apr 2006 4:22:49 UTC; Received 23 Apr 2006 5:53:20 UTC; Server state Over; Outcome Client error; Client state Computing; Exit status -197 (0xffffff3b); Computer ID 148992; Report deadline 5 May 2006 4:22:49 UTC; CPU time 32013.537868. #2, from 1800 XP: Result ID 17805638; Name NO_TERM_STRAND_1ogw_423_6947_2; Workunit 13496532; Created 21 Apr 2006 5:49:41 UTC; Sent 21 Apr 2006 8:05:02 UTC; Received 23 Apr 2006 5:52:38 UTC; Server state Over; Outcome Client error; Client state Computing; Exit status -197 (0xffffff3b); Computer ID 105489; Report deadline 5 May 2006 8:05:02 UTC; CPU time 24477.506926. #3, from 2000 XP: Result ID 17748958; Name FACONTACTS_RECENTER_NOFILTERS_1ig5A_448_551_1; Workunit 14550587; Created 20 Apr 2006 16:34:25 UTC; Sent 20 Apr 2006 22:38:14 UTC; Received 23 Apr 2006 5:51:22 UTC; Server state Over; Outcome Client error; Client state Computing; Exit status -197 (0xffffff3b); Computer ID 106748; Report deadline 4 May 2006 22:38:14 UTC; CPU time 25011.984375. #4, from 2500 XP: Result ID 17786001; Name HBLR_1.0_1n0u_ROT_TRIALS_TRIE_449_5_0; Workunit 14630032; Created 21 Apr 2006 1:00:11 UTC; Sent 21 Apr 2006 3:09:30 UTC; Received 23 Apr 2006 5:50:36 UTC; Server state Over; Outcome Client error; Client state Computing; Exit status -197 (0xffffff3b); Computer ID 107679; Report deadline 5 May 2006 3:09:30 UTC; CPU time 22721.8125. |
Joined: 5 Nov 05 Posts: 19 Credit: 535,400 RAC: 0 |
What to do about upload errors? This isn't the first one I've seen, but this is the first 600 point upload error, lol: 4/23/2006 22:55:04 PM||Benchmark results: 4/23/2006 22:55:04 PM|| Number of CPUs: 1 4/23/2006 22:55:04 PM|| 2931 double precision MIPS (Whetstone) per CPU 4/23/2006 22:55:04 PM|| 9825 integer MIPS (Dhrystone) per CPU 4/23/2006 22:55:04 PM||Finished CPU benchmarks 4/23/2006 22:55:05 PM|rosetta@home|Resuming computation for result 7521_largescale_large_fullatom_relax_dec7521_1_09_2.pdb_437_69_1 using rosetta version 498 4/23/2006 22:55:05 PM||Resuming computation 4/23/2006 22:55:05 PM||Rescheduling CPU: Resuming computation 4/23/2006 22:55:05 PM||Using earliest-deadline-first scheduling because computer is overcommitted. 4/23/2006 22:56:06 PM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/275/7515_largescale_large_fullatom_relax_dec7515_1_66_1.pdb_436_146_0_0 35688 bytes != offset 0 bytes Most of those errors have been on the slower machines, before I set my prefs to run for a whole day. |
JZ-power Joined: 9 Nov 05 Posts: 1 Credit: 374,157 RAC: 0 |
|
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
What's the status on the idea to set max results to 1? Any decision taken yet? |
Joined: 6 Nov 05 Posts: 23 Credit: 243,559 RAC: 0 |
I had one stuck on 1.04% also, but now it's gone and so is everything. I can't connect to the server now either. What should I do? Come BOINC with me! USALUG !! |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
"What's the status on the idea to set max results to 1? Any decision taken yet?" With the current version being tested in Ralph, if the watchdog aborts a WU it is considered "valid" and so it's not sent out again. |
Slaughtercult Joined: 4 Nov 05 Posts: 1 Credit: 4,152,652 RAC: 7,295 |
I aborted WU 13416703 (HBLR_1.0_1mky_420_7360) after 12.5 hours at 2%. A few hours before it was at 3.x%. Greetings |
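
For anyone who wants to spot a looping WU from its stderr before burning another 12 hours, here is a rough sketch in Python. It is not the Rosetta watchdog Rhiju describes and not part of BOINC; it only reads the kind of lines quoted in Jose's post (`cpu_run_time_pref`, `num_decoys`, `attempts`, `cpu_run_time`). The function names, the assumption that each `#` entry sits on its own line, and the 1.25x overrun threshold are all made up for illustration.

```python
import re

# Patterns for the stderr lines quoted in Jose's post. The helper names and
# the threshold below are hypothetical, not anything Rosetta or BOINC ships.
PREF_RE = re.compile(r"cpu_run_time_pref:\s*(\d+(?:\.\d+)?)")
PROGRESS_RE = re.compile(
    r"num_decoys:(\d+)\s+attempts:(\d+)\s+cpu_run_time:(\d+(?:\.\d+)?)"
)


def parse_stderr(text):
    """Return (preferred run time in seconds, list of progress records)."""
    pref_match = PREF_RE.search(text)
    pref = float(pref_match.group(1)) if pref_match else None
    records = [
        {"decoys": int(d), "attempts": int(a), "cpu_time": float(t)}
        for d, a, t in PROGRESS_RE.findall(text)
    ]
    return pref, records


def looks_stuck(pref, records, overrun_factor=1.25):
    """Assumed heuristic: CPU time is past the preferred run time by
    overrun_factor AND no new decoys appeared between the first and the
    last progress record."""
    if pref is None or len(records) < 2:
        return False
    over_budget = records[-1]["cpu_time"] > overrun_factor * pref
    no_new_decoys = records[-1]["decoys"] == records[0]["decoys"]
    return over_budget and no_new_decoys


# stderr text from result 17734977, one entry per line (assumed layout).
SAMPLE = """# cpu_run_time_pref: 21600
# random seed: 1509912
# cpu_run_time_pref: 21600
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:7 cpu_run_time:30500.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:8 cpu_run_time:33366.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:9 cpu_run_time:34263
"""

pref, records = parse_stderr(SAMPLE)
print(pref, len(records), looks_stuck(pref, records))
```

On that sample the decoy count stays at 6 while attempts climb from 7 to 9 and CPU time runs well past the 21600 second preference, so the check flags it. A real watchdog would have to run inside the application and watch progress live; this only looks at the log after the fact.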