Report stuck & aborted WU here please - II

Cureseekers~Kristof
Joined: 5 Nov 05
Posts: 80
Credit: 689,603
RAC: 0
Message 14184 - Posted: 20 Apr 2006, 17:14:20 UTC
Last modified: 20 Apr 2006, 17:15:29 UTC

After more than 30 hours runtime, and stuck for hours at the same percentage, I aborted the job:
https://boinc.bakerlab.org/rosetta/result.php?resultid=17454155

260 credits lost...
Member of Dutch Power Cows

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14186 - Posted: 20 Apr 2006, 17:41:56 UTC - in response to Message 14131.  

Are there any other conditions under which you think we should abort? Looking forward to some more advice. I think we're on the way to finally bringing an end to these stuck jobs.


What about when the deadline is approaching? Sometimes people crank their preference straight from 4 hrs to 24 hrs, and all of a sudden 10 WUs cannot be completed before the deadline. So, if the deadline is "near" (how near?), then just finish the current model and end this WU so it can be reported in time. I don't know if you would call that an "abort". It's more of a normal end, in advance of the target runtime.
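
A minimal sketch of that "finish the current model and stop early" check, in Python. The function name, its parameters and the 6-hour reporting margin are all assumptions made for illustration; nothing here comes from the actual Rosetta code.

```python
def should_start_next_model(now, deadline, avg_model_seconds,
                            elapsed_seconds, target_runtime_seconds,
                            report_margin_seconds=6 * 3600):
    """Decide whether a WU should begin another model (hypothetical helper).

    All times are plain seconds on the same clock. The reporting margin
    is an assumed answer to the "how near?" question.
    """
    # Normal end: the user's target runtime has already been reached.
    if elapsed_seconds >= target_runtime_seconds:
        return False
    # Early end: the next model would finish too close to the deadline
    # to leave room for uploading and reporting the result.
    projected_finish = now + avg_model_seconds
    return projected_finish + report_margin_seconds <= deadline
```

With a check like this the WU ends normally after the current model, so it would be reported as a success rather than as an abort.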

Great progress!

Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 14188 - Posted: 20 Apr 2006, 18:03:56 UTC - in response to Message 14172.  
Last modified: 20 Apr 2006, 18:13:58 UTC

AGAIN, my point here is that the Rosetta system needs to have in place a method to remove bad work on servers and clients. To think this will not happen again is unwise.


Lauren,

I just read in the other threads, posts by Rhiju, Bin Qian and David Kim that:


1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04% and have been testing the fixes on ralph since yesterday.

2. We've coded up rosetta to do more frequent checkpointing in the modeling process. Now for the large jobs, we are expecting less than 30 minutes between two checkpoints. This code has been tested locally, and will be tested on ralph within a couple of days.

3. Rhiju has coded a watchdog thread for rosetta which will terminate stuck jobs and return the intermediate results. See his post in this thread. This will be tested on ralph within a couple of days too.

4. One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap projects without leaving applications in memory.
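
To make item 3 concrete, here is a rough Python sketch of the general watchdog-thread pattern: a background thread checks how long it has been since the worker last reported progress and fires a callback when that exceeds a timeout. The class and method names are hypothetical, and the real Rosetta watchdog may work quite differently.

```python
import threading
import time


class Watchdog:
    """Calls on_stuck(idle_seconds) once if no progress is reported in time."""

    def __init__(self, timeout_seconds, on_stuck, poll_seconds=1.0):
        self.timeout_seconds = timeout_seconds
        self.on_stuck = on_stuck
        self.poll_seconds = poll_seconds
        self._last_progress = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def report_progress(self):
        # The worker calls this whenever it completes a step.
        with self._lock:
            self._last_progress = time.monotonic()

    def stop(self):
        # Call from the main thread, not from inside on_stuck.
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self.poll_seconds):
            with self._lock:
                idle = time.monotonic() - self._last_progress
            if idle > self.timeout_seconds:
                self.on_stuck(idle)
                return
```

In a real application the callback would be the place to write out intermediate results and end the job; here it is left to the caller.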


If I understand the new features quoted above correctly, #3 and #4 should ensure that no WUs get stuck forever, until a human operator aborts them via BOINC.
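
The restart limit in #4 could be sketched roughly as below, assuming a small state file kept next to the WU. The file layout and the function name are made up for illustration, and this simplified version counts every start, not only restarts that lost in-memory progress.

```python
import json
from pathlib import Path

MAX_RESTARTS = 5  # the limit quoted above


def register_start(state_file: Path) -> bool:
    """Record one more start; return True if the WU may keep running."""
    state = {"starts": 0}
    if state_file.exists():
        state = json.loads(state_file.read_text())
    state["starts"] += 1
    state_file.write_text(json.dumps(state))
    # The very first start is not a restart.
    restarts = state["starts"] - 1
    return restarts < MAX_RESTARTS
```

When the function returns False, the application would stop and output whatever result it has at that point.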

Hopefully, the BOINC server code (developed by Berkeley Univ as open source) will eventually implement the other features, most notably the capability / preference flags (e.g. >512M mem, BigWU etc), so that big WUs are sent only to PCs which are capable / willing to process them.

PS: So, in light of recent changes, I'd say that a method to cancel bad WUs from volunteer PCs is much less of an issue, as bad WUs will take care of themselves.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity

tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14190 - Posted: 20 Apr 2006, 18:29:47 UTC - in response to Message 14156.  

That WU was a nasty one. The first person who got it wasted 57 hours of CPU on it before noticing and aborting it, then the second person wasted another 14 hours. The third person let it sit in the queue until it went past deadline. Then you got it. At least now it's had too many errors and won't be sent out any more.


Perhaps it is a good idea to turn off the mechanism of resending failed WUs for the moment (i.e. by setting the number of failures until cancellation of a WU to 1). With such a setting, a bad WU would only bother one participant.
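
A back-of-the-envelope version of that argument in Python, using the hours from the quoted post (57, 14, and 0 for the copy that expired in the queue). It assumes that every host receiving the WU counts as one failure, including the missed deadline; the function name is hypothetical.

```python
def wasted_cpu_hours(hours_per_host, max_errors):
    """CPU hours a WU that always fails burns before it is cancelled."""
    # Only the first max_errors hosts ever receive the WU.
    return sum(hours_per_host[:max_errors])


history = [57, 14, 0]  # hours per host, from the quoted post
print(wasted_cpu_hours(history, max_errors=3))
print(wasted_cpu_hours(history, max_errors=1))
```

A limit of 1 does not save the first participant's time, but it stops the same WU from costing anyone else.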

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 14192 - Posted: 20 Apr 2006, 18:55:07 UTC

It'd still be nice to have WUs that are only causing problems with say.. the Windows clients, only be sent out to the Linux/Darwin users, instead of requiring 3 failures on Windows clients before they're shelved.

And if BOINC/Rosetta sends any data back and forth during network connections that a file could be piggybacked on (when we're checking whether we need new work, returning a finished WU, or checking for a Rosetta update), it would be nice to have a list of problem WUs sent out that should be nuked. The file would be left on the machine until the next connection to the server.

Although, between the mentioned changes, and hopefully better pre-testing on Ralph, we can hope that the problem would not crop up any more.. :)
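
On the client side, acting on such a list could be as simple as the Python sketch below. The function name and the idea of matching on name prefixes are assumptions, and the example names are only borrowed from this thread as sample data, not as an actual list of bad batches.

```python
def results_to_abort(local_results, problem_prefixes):
    """Return the local result names that match any problem-WU prefix."""
    prefixes = tuple(problem_prefixes)
    return [name for name in local_results if name.startswith(prefixes)]


# Illustrative data only.
local = [
    "HBLR_1.0_1ogw_420_8424",
    "PROD_ABINITIO_FAST_1tul__447_32515",
]
print(results_to_abort(local, ["HBLR_1.0_"]))
```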

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 14193 - Posted: 20 Apr 2006, 19:38:01 UTC - in response to Message 14192.  

Or even a user-selected option for the client to report back to the servers every 3 to 6 hrs. That could give them a lot of alpha info to see what works better and how any upgrades are working.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 14207 - Posted: 21 Apr 2006, 0:21:01 UTC - in response to Message 14188.  

I haven't yet got the watchdog thread into Rosetta 5.01, but we have very high hopes for it! It was a great idea from this message board. It should go into the next update, probably early next week, if the Windows build cooperates. (We're trying not to do updates during the weekend -- we seem to have had bad luck in the past!)

I'm paying attention to the ideas about reverse trickle, keeping contact between client and server, etc. -- these are nice suggestions. As I explained below, those will likely require some changes in the BOINC code, and we'll need help from the BOINC crew. They've been pretty occupied with their upcoming release.

I like the idea below of not passing on bad jobs to another client when they fail -- so only 1 computer will have the problem, not 4. I'm running this idea by David Baker and David Kim now. Unlike other BOINC projects, it's not critical for every single workunit to get processed. It's way more important to keep bad workunits from causing trouble!

One final note: we just went through and granted credits to errored jobs in our database. I'm trying to code the watchdog so that it will gracefully abort, including the valid output of data, so that the job will automatically get credit (but will be tagged for us as a premature abort).



Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 14224 - Posted: 21 Apr 2006, 3:58:30 UTC

Thank you Rhiju
For listening to our needs and taking steps to fix or improve a very frustrating problem.
If any of my words were at all harsh, please forgive me. It was not my intent.
I just want to get my point across, and words do not come easily to me.

I have checked all my nodes and not one is on 1.4, so I guess the bad ones are at an end.
Again Thank You


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Rhiju
Volunteer moderator
Send message
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 14227 - Posted: 21 Apr 2006, 5:45:07 UTC - in response to Message 14224.  

Your comments have been really helpful -- please continue to make suggestions. Hopefully by next week we can ensure that these stupid stuck-at-1.04% jobs never show up again on your computers. Thanks for hanging in there!


Moderator9
Volunteer moderator
Send message
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 14238 - Posted: 21 Apr 2006, 7:24:45 UTC
Last modified: 21 Apr 2006, 7:29:49 UTC


Rhiju (and other development team members),

I have opened a "sticky" here for you and the development team to post Rosetta application release information as new versions are deployed. It might help people find the information and they can subscribe to the thread so they can be notified when you post something there.

Could you post the details on Version 5.01 to kick this off? I know a lot of people would like to see this done regularly.

Moderator9
ROSETTA@home FAQ
Moderator Contact

Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 14253 - Posted: 21 Apr 2006, 10:24:04 UTC
Last modified: 21 Apr 2006, 10:55:05 UTC

Another HUGE amount of CPU time wasted!!!!


https://boinc.bakerlab.org/rosetta/result.php?resultid=17734977
CPU time 42670.640625
Claimed credit 145.838794071523
I had to abort this one as it was caught in a loop. Action done around 6 AM AST.

stderr out
<core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1509912
# cpu_run_time_pref: 21600
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:7 cpu_run_time:30500.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:8 cpu_run_time:33366.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:9 cpu_run_time:34263

</stderr_txt>
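
For anyone who wants to check a log like this without reading it line by line, here is a small Python sketch that pulls the numbers out of the stderr text above. The regular expressions are written against the lines shown in this post only; the function name and the "stalled" rule (attempts go up while `num_decoys` stays the same) are assumptions.

```python
import re

ATTEMPT_RE = re.compile(
    r"num_decoys:(\d+)\s+attempts:(\d+)\s+cpu_run_time:([\d.]+)"
)
PREF_RE = re.compile(r"cpu_run_time_pref:\s*(\d+)")

LOG = """\
# cpu_run_time_pref: 21600
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:7 cpu_run_time:30500.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:8 cpu_run_time:33366.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:9 cpu_run_time:34263
"""


def summarize(stderr_text):
    """Summarize attempt lines and compare run time with the preference."""
    pref = int(PREF_RE.search(stderr_text).group(1))
    rows = [(int(d), int(a), float(t))
            for d, a, t in ATTEMPT_RE.findall(stderr_text)]
    decoys = {d for d, _, _ in rows}
    return {
        "pref_seconds": pref,
        "attempts_logged": len(rows),
        # Several attempts logged, but the decoy count never moved.
        "decoys_stalled": len(rows) > 1 and len(decoys) == 1,
        # How far the last logged run time is past the preference.
        "over_pref_seconds": rows[-1][2] - pref if rows else 0.0,
    }


print(summarize(LOG))
```

On this log the decoy count stays at 6 across attempts 7 to 9, and the last logged run time is already well past the 21600-second preference.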

What irks me is that I was the second computer to receive this WU. I just hope that the third one that receives it is wise enough and aborts it before a lot of his CPU time is wasted.

So don't gang up on me when I say ARGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!!!!!!

PS

Ah, at least the new version doesn't wait too long before erroring out. I will report on that one in the 5.01 thread :(

“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 14267 - Posted: 21 Apr 2006, 14:59:49 UTC - in response to Message 14253.  

Another HUGE amount of CPU time wasted!!!! ...



Jose,

Your time is not wasted. Look at this post. According to that statement, the results are used and you will be granted credit.

So perhaps not so much ARGH but more like AHHH!

Regards
Phil

Steven Purvis
Joined: 17 Sep 05
Posts: 1
Credit: 6,419,158
RAC: 115
Message 14280 - Posted: 21 Apr 2006, 17:21:33 UTC

I've just aborted about 6 work units for rosetta 4.98 with names starting 7486_largescale_large_full_atom_relax_XXXXXXXXXXXX

They all seemed to get stuck at about 1.4% and go no higher. I have the "don't remove workunits from memory" option enabled, so that shouldn't cause a problem.

The work unit results were:
17191225
17191227
17191336
17191339
17191352
17191374

Hope this is useful in some way.

[DPC]FOKschaap~_mcintosh_
Joined: 4 Dec 05
Posts: 5
Credit: 118,303
RAC: 0
Message 14318 - Posted: 21 Apr 2006, 23:10:15 UTC

PROD_ABINITIO_FAST_1tul__447_32515

That one got aborted by BOINC. Claimed credit 251, hope 2 see that one day ;)

[DPC]Division_Brabant~OldButNotSoWise
Joined: 23 Jan 06
Posts: 42
Credit: 371,797
RAC: 0
Message 14372 - Posted: 22 Apr 2006, 13:41:26 UTC
Last modified: 22 Apr 2006, 13:42:01 UTC

What should I do with this one?
1.6% after 17:30:00 hours of crunching, but still very active in the graphics.
If it's not an error or a stuck WU, I don't mind that it takes its time :)

http://members.lycos.nl/oldbutnotsowise/fora/rosetta_wu.png

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 14384 - Posted: 22 Apr 2006, 15:34:58 UTC - in response to Message 14372.  

It looks like you may have a problem WU. I looked at your system but I cannot tell from the list which WU you are running. There was a batch that was identified for aborting here.

If it is one of those, I would abort it. I see it is at 1.6%. In the display, the percent should be shown with 4 decimal places (1.xxxx %). Before you abort it, make a note of the full value of the percent and include that in your report, and provide a link to the result on your stats page.


(Nice Belgian Sheepdog by the way)
Moderator9
ROSETTA@home FAQ
Moderator Contact

Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14391 - Posted: 22 Apr 2006, 16:54:21 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=17824571
Aborted after 12 hours

https://boinc.bakerlab.org/rosetta/result.php?resultid=17825321
7 hours for this one

Runaway1956
Joined: 5 Nov 05
Posts: 19
Credit: 535,400
RAC: 0
Message 14393 - Posted: 22 Apr 2006, 17:06:11 UTC

4/22/2006 11:59:27 AM|rosetta@home|Pausing result TRUNCATE_TERMINI_FULLRELAX_1enh__433_178_0 (left in memory)



After this post, I'm going to abort this one. It seems to have run for two days before I caught it and restarted BOINC to see what would happen. It just hung at 1.something percent, and the remaining time climbed past 30 hours.

I SHOULD have copied the messages concerning this WU before resetting BOINC. All were gone when it restarted. Sorry about that.



Grutte Pier [Wa Oars]~Ytsmabeer
Joined: 10 Nov 05
Posts: 2
Credit: 100,205
RAC: 0
Message 14403 - Posted: 22 Apr 2006, 18:08:20 UTC

Reporting a WU which I aborted because it had been running for 17 hours and because of what I read about the HBLR type

HBLR_1.0_1ogw_420_8424
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13422021

It had been running for 17 hours and was 14% complete

Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14455 - Posted: 23 Apr 2006, 6:02:38 UTC

Just aborted 4 work units from 4 different machines. The longest had been running close to 10 hours and was at 5%; the shortest, 6 hours and at one percent.
#1 from 2700xp
Result ID 17772227
Name HBLR_1.0_1mky_420_9630_1
Workunit 13428053
Created 20 Apr 2006 21:42:41 UTC
Sent 21 Apr 2006 4:22:49 UTC
Received 23 Apr 2006 5:53:20 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 148992
Report deadline 5 May 2006 4:22:49 UTC
CPU time 32013.537868

#2 From 1800 xp
Result ID 17805638
Name NO_TERM_STRAND_1ogw_423_6947_2
Workunit 13496532
Created 21 Apr 2006 5:49:41 UTC
Sent 21 Apr 2006 8:05:02 UTC
Received 23 Apr 2006 5:52:38 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 105489
Report deadline 5 May 2006 8:05:02 UTC
CPU time 24477.506926

#3 from 2000 xp
Result ID 17748958
Name FACONTACTS_RECENTER_NOFILTERS_1ig5A_448_551_1
Workunit 14550587
Created 20 Apr 2006 16:34:25 UTC
Sent 20 Apr 2006 22:38:14 UTC
Received 23 Apr 2006 5:51:22 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 106748
Report deadline 4 May 2006 22:38:14 UTC
CPU time 25011.984375

#4 from 2500 Xp
Result ID 17786001
Name HBLR_1.0_1n0u_ROT_TRIALS_TRIE_449_5_0
Workunit 14630032
Created 21 Apr 2006 1:00:11 UTC
Sent 21 Apr 2006 3:09:30 UTC
Received 23 Apr 2006 5:50:36 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 107679
Report deadline 5 May 2006 3:09:30 UTC
CPU time 22721.8125
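
A quick note on reading these fields: the hex value next to the exit status is just the same number as an unsigned 32-bit integer, and the CPU time is in seconds. A short Python sketch of both conversions (the function names are made up for this example):

```python
def to_signed32(value):
    """Interpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def cpu_hours(seconds):
    """Convert a CPU time in seconds to hours."""
    return seconds / 3600


print(to_signed32(0xFFFFFF3B))             # exit status of all four results
print(round(cpu_hours(32013.537868), 2))   # CPU time of result #1, in hours
```

So 0xffffff3b and -197 are the same status, which all four results share, consistent with all of them having been aborted by hand.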



©2024 University of Washington
https://www.bakerlab.org