Report stuck & aborted WU here please - II




BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 14192 - Posted: 20 Apr 2006, 18:55:07 UTC

It'd still be nice if WUs that only cause problems on, say, the Windows clients were sent out only to the Linux/Darwin users, instead of requiring 3 failures on Windows clients before they're shelved.

And since BOINC/Rosetta already sends data back and forth during network connections (when we're checking whether we need new work, returning a finished WU, or checking whether there's a Rosetta update), it would be nice to have a list of problem WUs that should be nuked piggybacked onto those connections, so that the file would be left on the machine until the next connection to the server.

Although, between the mentioned changes and hopefully better pre-testing on Ralph, we can hope that the problem won't crop up any more.. :)
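
To make the suggestion above concrete, here is a rough Python sketch of per-platform failure tracking with a "nuke list" a client could pick up at its next contact. It is only an illustration: the class `WorkUnitTriage`, its methods, and the threshold `MAX_FAILURES_PER_PLATFORM` are made up for this example and are not part of BOINC or Rosetta.

```python
from collections import defaultdict

# Hypothetical threshold: failures on one platform before a WU is
# withheld from that platform (not a real BOINC setting).
MAX_FAILURES_PER_PLATFORM = 1


class WorkUnitTriage:
    """Tracks failures per (work unit, platform) and filters dispatch."""

    def __init__(self, platforms):
        self.platforms = set(platforms)
        self.failures = defaultdict(int)  # (wu_name, platform) -> count

    def report_failure(self, wu_name, platform):
        """Record one failed result for this work unit on this platform."""
        self.failures[(wu_name, platform)] += 1

    def allowed_platforms(self, wu_name):
        """Platforms that may still receive this work unit."""
        return {
            p for p in self.platforms
            if self.failures.get((wu_name, p), 0) < MAX_FAILURES_PER_PLATFORM
        }

    def nuke_list(self, platform):
        """Work units a client on `platform` should drop at next contact."""
        return sorted(
            wu for (wu, p), count in self.failures.items()
            if p == platform and count >= MAX_FAILURES_PER_PLATFORM
        )
```

With this, one Windows failure would keep the WU going to Linux/Darwin only, and the same table doubles as the list of WUs to abort on Windows machines that already hold a copy.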
ID: 14192 · Rating: 0
Laurenu2

Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 14193 - Posted: 20 Apr 2006, 19:38:01 UTC - in response to Message 14192.  

It'd still be nice if WUs that only cause problems on, say, the Windows clients were sent out only to the Linux/Darwin users, instead of requiring 3 failures on Windows clients before they're shelved.

And since BOINC/Rosetta already sends data back and forth during network connections (when we're checking whether we need new work, returning a finished WU, or checking whether there's a Rosetta update), it would be nice to have a list of problem WUs that should be nuked piggybacked onto those connections, so that the file would be left on the machine until the next connection to the server.

Although, between the mentioned changes and hopefully better pre-testing on Ralph, we can hope that the problem won't crop up any more.. :)


Or even a user-selected option for the client to report back to the servers every 3 to 6 hours. That could give them a lot of alpha info to see what works better and how any upgrades are working.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 14193 · Rating: 0
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 14207 - Posted: 21 Apr 2006, 0:21:01 UTC - in response to Message 14188.  

I haven't yet got the watchdog thread into Rosetta 5.01, but we have very high hopes for it! It was a great idea from this message board. It should go into the next update, probably early next week, if the Windows build cooperates. (We're trying not to do updates during the weekend -- we seem to have had bad luck in the past!)

I'm paying attention to the ideas about reverse trickle, keeping contact between client and server, etc. -- these are nice suggestions. As I explained below, those will likely require some changes in the BOINC code, and we'll need help from the BOINC crew. They've been pretty occupied with their upcoming release.

I like the idea below of not passing on bad jobs to another client when they fail -- so only 1 computer will have the problem, not 4. I'm running this idea by David Baker and David Kim now. Unlike other BOINC projects, it's not critical for every single workunit to get processed. It's way more important to keep bad workunits from causing trouble!

One final note: we just went through and granted credits to errored jobs in our database. I'm trying to code the watchdog so that it will gracefully abort, including the valid output of data, so that the job will automatically get credit (but will be tagged for us as a premature abort).


AGAIN, my point here is that the Rosetta system needs to have in place a method to remove bad work on servers and clients. To think this will not happen again is unwise.


Lauren,

I just read in the other threads, posts by Rhiju, Bin Qian and David Kim that:


1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04%, and we've been testing the fixes on Ralph since yesterday.

2. We've coded up Rosetta to do more frequent checkpointing in the modeling process. Now, for the large jobs, we are expecting less than 30 minutes between two checkpoints. This code has been tested locally, and will be tested on Ralph within a couple of days.

3. Rhiju has coded a watchdog thread for Rosetta which will terminate stuck jobs and return the intermediate results. See his post in this thread. This will be tested on Ralph within a couple of days too.

4. One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap projects without leaving work in memory.
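
For readers wondering what items 3 and 4 might look like in practice, here is a minimal Python sketch of a watchdog thread plus a restart cap. It is not Rosetta's actual code: the names `Watchdog`, `heartbeat`, `should_give_up`, and the `stall_seconds` parameter are assumptions made for illustration. Only the limit of 5 restarts comes from the post.

```python
import threading
import time

MAX_RESTARTS = 5  # restart limit quoted in item 4


class Watchdog:
    """Calls `on_stall` once if no heartbeat arrives within `stall_seconds`."""

    def __init__(self, stall_seconds, on_stall, poll_seconds=1.0):
        self.stall_seconds = stall_seconds
        self.on_stall = on_stall
        self.poll_seconds = poll_seconds
        self._last_progress = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def heartbeat(self):
        """The worker calls this whenever it makes real progress."""
        self._last_progress = time.monotonic()

    def stop(self):
        self._stop.set()

    def _run(self):
        # Event.wait returns False on timeout, True once stop() was called.
        while not self._stop.wait(self.poll_seconds):
            if time.monotonic() - self._last_progress > self.stall_seconds:
                self.on_stall()  # e.g. write intermediate results, then exit
                return


def should_give_up(restart_count, max_restarts=MAX_RESTARTS):
    """True once a WU has restarted from scratch `max_restarts` times."""
    return restart_count >= max_restarts
```

The worker would call `heartbeat()` after each completed step. If the calls stop, the watchdog fires its callback instead of letting the job spin forever.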


If I understand the new features quoted above correctly, #3 and #4 should ensure that no WUs get stuck forever, until a human operator aborts them via BOINC.

Hopefully, the BOINC server code (developed by Berkeley Univ as open source) will eventually implement the other features, most notably the capability / preference flags (e.g. >512M mem, BigWU etc), so that big WUs are sent only to PCs which are capable / willing to process them.

PS: So, in light of recent changes, I'd say that a method to cancel bad WUs from volunteer PCs is much less of an issue, as bad WUs will take care of themselves.


ID: 14207 · Rating: 0
Laurenu2

Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 14224 - Posted: 21 Apr 2006, 3:58:30 UTC

Thank you, Rhiju,
for listening to our needs and taking steps to fix or improve a very frustrating problem.
If any of my words were at all harsh, please forgive me. It was not my intent.
I just want to get my point across, and words do not come easily to me.

I have checked all my nodes and not one is on 1.4, so I guess the bad ones are at an end.
Again, thank you.


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 14224 · Rating: 0
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 14227 - Posted: 21 Apr 2006, 5:45:07 UTC - in response to Message 14224.  

Your comments have been really helpful -- please continue to make suggestions. Hopefully by next week we can ensure that these stupid stuck-at-1.04% jobs never show up again on your computers. Thanks for hanging in there!

Thank you, Rhiju,
for listening to our needs and taking steps to fix or improve a very frustrating problem.
If any of my words were at all harsh, please forgive me. It was not my intent.
I just want to get my point across, and words do not come easily to me.

I have checked all my nodes and not one is on 1.4, so I guess the bad ones are at an end.
Again, thank you.



ID: 14227 · Rating: 0
Jose

Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 14253 - Posted: 21 Apr 2006, 10:24:04 UTC
Last modified: 21 Apr 2006, 10:55:05 UTC

Another HUGE amount of CPU time wasted!!!!


https://boinc.bakerlab.org/rosetta/result.php?resultid=17734977
CPU time 42670.640625
Claimed credit 145.838794071523
I had to abort this one as it was caught in a loop. Action done around 6 AM AST.

stderr out
<core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1509912
# cpu_run_time_pref: 21600
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:7 cpu_run_time:30500.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:8 cpu_run_time:33366.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:9 cpu_run_time:34263

</stderr_txt>
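
The pattern in the log above (attempts going up while `num_decoys` stays at 6) is easy to spot mechanically. A small Python sketch, assuming the stderr line format is exactly as pasted here; the function names `parse_attempts` and `looks_stuck` and the `min_repeats` threshold are invented for this example:

```python
import re

# Matches lines such as "# num_decoys:6 attempts:7 cpu_run_time:30500.1"
ATTEMPT_RE = re.compile(
    r"num_decoys:(\d+)\s+attempts:(\d+)\s+cpu_run_time:([\d.]+)"
)


def parse_attempts(stderr_text):
    """Return (num_decoys, attempts, cpu_run_time) tuples in log order."""
    return [
        (int(decoys), int(attempts), float(cpu_time))
        for decoys, attempts, cpu_time in ATTEMPT_RE.findall(stderr_text)
    ]


def looks_stuck(records, min_repeats=3):
    """True if the last records show attempts rising with no new decoys."""
    if len(records) < min_repeats:
        return False
    tail = records[-min_repeats:]
    decoys = {d for d, _, _ in tail}
    attempts = [a for _, a, _ in tail]
    strictly_rising = attempts == sorted(set(attempts))
    return len(decoys) == 1 and strictly_rising
```

Fed the three status lines from this result, `looks_stuck` returns True: three attempts in a row, zero new decoys.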

What irks me is that I was the second computer to receive this WU. I just hope that the third one that receives it is wise enough to abort it before a lot of his CPU time is wasted.

So don't gang up on me when I say ARGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!!!!!!

PS

Ah, at least the new version doesn't wait too long to go the error way. I will report on that one in the 5.01 thread :(






“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
ID: 14253 · Rating: 0
Snake Doctor

Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 14267 - Posted: 21 Apr 2006, 14:59:49 UTC - in response to Message 14253.  

Another HUGE amount of CPU time wasted!!!! ...



Jose,

Your time is not wasted. Look at this post. According to that statement, the results are used and you will be granted credit.

So perhaps not so much ARGH but more like AHHH!

Regards
Phil
ID: 14267 · Rating: 0
Steven Purvis

Joined: 17 Sep 05
Posts: 1
Credit: 7,977,160
RAC: 1,779
Message 14280 - Posted: 21 Apr 2006, 17:21:33 UTC

I've just aborted about 6 work units for Rosetta 4.98 with names starting 7486_largescale_large_full_atom_relax_XXXXXXXXXXXX

They all seemed to get stuck at about 1.4% and go no higher. I have the "don't remove workunits from memory" option enabled, so that shouldn't cause a problem.

The work unit results were:
17191225
17191227
17191336
17191339
17191352
17191374

Hope this is useful in some way.
ID: 14280 · Rating: 0
[DPC]FOKschaap~_mcintosh_

Joined: 4 Dec 05
Posts: 5
Credit: 118,303
RAC: 0
Message 14318 - Posted: 21 Apr 2006, 23:10:15 UTC

PROD_ABINITIO_FAST_1tul__447_32515

That one got aborted by BOINC. Claimed credit 251, hope to see that one day ;)
ID: 14318 · Rating: 0
[DPC]Division_Brabant~OldButNotSoWise
Joined: 23 Jan 06
Posts: 42
Credit: 371,797
RAC: 0
Message 14372 - Posted: 22 Apr 2006, 13:41:26 UTC
Last modified: 22 Apr 2006, 13:42:01 UTC

What should I do with this one?
1.6% after 17:30:00 hours of crunching, but still very active with the graphics.
If it's not an error or a stuck WU, I don't mind that it takes its time :)

http://members.lycos.nl/oldbutnotsowise/fora/rosetta_wu.png
ID: 14372 · Rating: 0
Rebel Alliance

Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14391 - Posted: 22 Apr 2006, 16:54:21 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=17824571
Aborted after 12 hours

https://boinc.bakerlab.org/rosetta/result.php?resultid=17825321
7 hours for this one
ID: 14391 · Rating: 0
Runaway1956

Joined: 5 Nov 05
Posts: 19
Credit: 535,400
RAC: 0
Message 14393 - Posted: 22 Apr 2006, 17:06:11 UTC

4/22/2006 11:59:27 AM|rosetta@home|Pausing result TRUNCATE_TERMINI_FULLRELAX_1enh__433_178_0 (left in memory)



After this post, I'm going to abort this one. It seems to have run for two days
before I caught it, and restarted BOINC to see what would happen. It just hung at
1.something percent, and the remaining time climbed past 30 hours.


I SHOULD have copied the messages concerning this WU before resetting BOINC - all were gone when it restarted - sorry about that.


ID: 14393 · Rating: 0
Grutte Pier [Wa Oars]~Ytsmabeer

Joined: 10 Nov 05
Posts: 2
Credit: 100,205
RAC: 0
Message 14403 - Posted: 22 Apr 2006, 18:08:20 UTC

Reporting a WU which I aborted because it had been running for 17 hours, and after reading about the HBLR type

HBLR_1.0_1ogw_420_8424
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13422021

It had been running for 17 hours and was 14% complete.
ID: 14403 · Rating: 0
Rebel Alliance

Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14455 - Posted: 23 Apr 2006, 6:02:38 UTC

Just aborted 4 work units from 4 different machines. The longest had been running close to 10 hours and was at 5%; the shortest had run 6 hours and was at one percent.
#1 from 2700 XP
Result ID 17772227
Name HBLR_1.0_1mky_420_9630_1
Workunit 13428053
Created 20 Apr 2006 21:42:41 UTC
Sent 21 Apr 2006 4:22:49 UTC
Received 23 Apr 2006 5:53:20 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 148992
Report deadline 5 May 2006 4:22:49 UTC
CPU time 32013.537868

#2 from 1800 XP
Result ID 17805638
Name NO_TERM_STRAND_1ogw_423_6947_2
Workunit 13496532
Created 21 Apr 2006 5:49:41 UTC
Sent 21 Apr 2006 8:05:02 UTC
Received 23 Apr 2006 5:52:38 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 105489
Report deadline 5 May 2006 8:05:02 UTC
CPU time 24477.506926

#3 from 2000 XP
Result ID 17748958
Name FACONTACTS_RECENTER_NOFILTERS_1ig5A_448_551_1
Workunit 14550587
Created 20 Apr 2006 16:34:25 UTC
Sent 20 Apr 2006 22:38:14 UTC
Received 23 Apr 2006 5:51:22 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 106748
Report deadline 4 May 2006 22:38:14 UTC
CPU time 25011.984375

#4 from 2500 XP
Result ID 17786001
Name HBLR_1.0_1n0u_ROT_TRIALS_TRIE_449_5_0
Workunit 14630032
Created 21 Apr 2006 1:00:11 UTC
Sent 21 Apr 2006 3:09:30 UTC
Received 23 Apr 2006 5:50:36 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 107679
Report deadline 5 May 2006 3:09:30 UTC
CPU time 22721.8125
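
A side note on reading the result pages above: the two numbers in "Exit status -197 (0xffffff3b)" are the same 32-bit value, once as a signed integer and once as unsigned hex, and the CPU time is in seconds. A short Python sketch of both conversions (the helper names are made up for this example):

```python
def as_unsigned32(status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed integer."""
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def cpu_hours(cpu_seconds):
    """Convert the 'CPU time' field (seconds) to hours."""
    return cpu_seconds / 3600.0


print(format(as_unsigned32(-197), "#010x"))  # 0xffffff3b
print(as_signed32(0xFFFFFF3B))               # -197
print(round(cpu_hours(32013.537868), 2))     # 8.89
```

So result #1 above had used just under 9 hours of CPU time when it was aborted.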
ID: 14455 · Rating: 0
Runaway1956

Joined: 5 Nov 05
Posts: 19
Credit: 535,400
RAC: 0
Message 14518 - Posted: 24 Apr 2006, 4:06:36 UTC

What to do about upload errors? This isn't the first one I've seen - but this is the first 600 point upload error, lol


4/23/2006 22:55:04 PM||Benchmark results:
4/23/2006 22:55:04 PM|| Number of CPUs: 1
4/23/2006 22:55:04 PM|| 2931 double precision MIPS (Whetstone) per CPU
4/23/2006 22:55:04 PM|| 9825 integer MIPS (Dhrystone) per CPU
4/23/2006 22:55:04 PM||Finished CPU benchmarks
4/23/2006 22:55:05 PM|rosetta@home|Resuming computation for result 7521_largescale_large_fullatom_relax_dec7521_1_09_2.pdb_437_69_1 using rosetta version 498
4/23/2006 22:55:05 PM||Resuming computation
4/23/2006 22:55:05 PM||Rescheduling CPU: Resuming computation
4/23/2006 22:55:05 PM||Using earliest-deadline-first scheduling because computer is overcommitted.
4/23/2006 22:56:06 PM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/275/7515_largescale_large_fullatom_relax_dec7515_1_66_1.pdb_436_146_0_0 35688 bytes != offset 0 bytes
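
If you want to pull upload errors like the last line out of a long message log, something like the following Python sketch would do. It assumes the `timestamp|project|message` layout and the exact error wording shown in the log above; `split_log_line` and `find_upload_errors` are names invented for this example, not BOINC functions.

```python
import re

# Wording copied from the log line above; other client versions may differ.
UPLOAD_ERR_RE = re.compile(
    r"Error on file upload: length of file (\S+) (\d+) bytes"
    r" != offset (\d+) bytes"
)


def split_log_line(line):
    """Split 'timestamp|project|message' into its three fields."""
    timestamp, project, message = line.split("|", 2)
    return timestamp, project, message


def find_upload_errors(lines):
    """Return one dict per upload-length error found in the log lines."""
    errors = []
    for line in lines:
        if line.count("|") < 2:
            continue  # not a message-log line
        timestamp, project, message = split_log_line(line.rstrip("\n"))
        match = UPLOAD_ERR_RE.search(message)
        if match:
            errors.append({
                "timestamp": timestamp,
                "project": project,
                "path": match.group(1),
                "file_bytes": int(match.group(2)),
                "offset_bytes": int(match.group(3)),
            })
    return errors
```

Note that the project field is empty for client-wide messages such as the benchmark lines, which is why the split keeps all three fields rather than dropping blanks.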



Most of those errors have been on the slower machines, before I set my prefs to run for a whole day.


ID: 14518 · Rating: 0
JZ-power

Joined: 9 Nov 05
Posts: 1
Credit: 374,157
RAC: 0
Message 14553 - Posted: 24 Apr 2006, 22:41:59 UTC

I have 3 WUs, all on version 4.98.
I ended them because they got stuck at 1.04%.

TRUNCATE_TERMINI_FULLRELAX_2tif__433_230_0 ResultID: 16980143

TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_219_1 ResultID: 16991986

TRUNCATE_TERMINI_FULLRELAX_1enh__433_303_0 ResultID: 16987980

ID: 14553 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14609 - Posted: 25 Apr 2006, 18:51:24 UTC - in response to Message 14207.  


I like the idea below of not passing on bad jobs to another client when they fail -- so only 1 computer will have the problem, not 4. I'm running this idea by David Baker and David Kim now. Unlike other BOINC projects, it's not critical for every single workunit to get processed. It's way more important to keep bad workunits from causing trouble!


What's the status on the idea to set max results to 1? Any decision taken yet?
ID: 14609 · Rating: 0
surrealchereal

Joined: 6 Nov 05
Posts: 23
Credit: 243,559
RAC: 0
Message 14658 - Posted: 26 Apr 2006, 11:10:24 UTC

I had one stuck on 1.04 % also but now it's gone and so is everything.
I can't connect to the server now either. What should I do?
Come BOINC with me!

USALUG !!
ID: 14658 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 14676 - Posted: 26 Apr 2006, 15:34:46 UTC - in response to Message 14609.  

What's the status on the idea to set max results to 1? Any decision taken yet?

With the current version being tested in Ralph, if the watchdog aborts a WU it is considered "valid" and so it's not sent out again.

ID: 14676 · Rating: 0
Slaughtercult

Joined: 4 Nov 05
Posts: 1
Credit: 4,152,652
RAC: 7,295
Message 14702 - Posted: 26 Apr 2006, 21:03:30 UTC
Last modified: 26 Apr 2006, 21:04:07 UTC

I aborted WU 13416703 (HBLR_1.0_1mky_420_7360) after 12.5 hours at 2%. A few hours before, it was at 3.x%.

greetings


ID: 14702 · Rating: 0



©2025 University of Washington
https://www.bakerlab.org