Report stuck & aborted WU here please - II

Message boards : Number crunching : Report stuck & aborted WU here please - II

Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13575 - Posted: 12 Apr 2006, 20:28:58 UTC

Hi guys... thanks very much for reporting these errors. Work units with 1b3a, 1enh, 2tif, and 1ptq appear to be wreaking havoc throughout BOINC. Sorry for the trouble -- this won't happen again, as we are increasing the stringency of our local tests that precede submission to BOINC. Please MANUALLY ABORT work units with 1b3a, 1enh, 2tif, or 1ptq in the title!

Tallguy-13088
Joined: 14 Dec 05
Posts: 9
Credit: 843,378
RAC: 0
Message 13581 - Posted: 12 Apr 2006, 21:34:58 UTC
Last modified: 12 Apr 2006, 21:45:57 UTC

Hi,
I just aborted TRUNCATE_TERMINI_FULLRELAX_2tif_433_796_0 after 11+ hours (11:50:43) at 1.042% complete. Stage=Full Atom Relaxation, Model=1, and Step=245292. The links are:

RESULT: https://boinc.bakerlab.org/result.php?resultid=17040622
WORKUNIT: https://boinc.bakerlab.org/workunit.php?wuid=13969890

Unfortunately, I was unable to download the 4.98 PDB for Windows, so I can't help you there, but this was running under Win2K (v5.0 Build 2195, SP4) on a Pentium 4 3.20GHz machine. As noted earlier, Rosetta was 4.98.

OOPS! Just caught the blurb about 4.83 and 4.98 being identical!

Hope this helps you find the little bugger!

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13582 - Posted: 12 Apr 2006, 21:45:47 UTC - in response to Message 13581.  

Thanks for the post -- the percent complete is particularly interesting. The reports of 1.04%, 1.042%, 1.17% are telling us that similar work units are getting stuck at rather different points along their simulations. It's helping us focus on where to look for the bug. Please keep posting information on stuck jobs!

Tallguy-13088
Joined: 14 Dec 05
Posts: 9
Credit: 843,378
RAC: 0
Message 13586 - Posted: 12 Apr 2006, 22:21:00 UTC - in response to Message 13582.  

As for the post, no problem! We are all exploring our own little corner of the universe and sharing info is just another way of getting us closer to a better understanding of the "bigger picture". Being an M/F Sys Prog makes it easier for me to understand what you guys need to find the "speed bumps". <grin>. I can even relate to the "smack" sound that will occur when you do eventually find it.

Not knowing anything about your algorithm, I have to rely on an intuitive guess as to where the problem might be. I would imagine that you are probably working on some kind of descent down a "decision tree" and when you get to a dead end, you have to climb back up a level and pursue the next "branch". My guess is that the program is getting "into a bind" when the structure that is being analyzed is complex enough that the process "loses track" of its previous choices and gets into a loop re-analyzing the same sequence of molecules. Just my guess.

Anyways, good luck in finding it. I suspect that you are "almost there".


AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13588 - Posted: 12 Apr 2006, 22:38:38 UTC - in response to Message 13551.  

the only change is that we increased the default run time from 2 hours to 4 hours ...


I doubt that's the problem. I think part of the problem is that there have been some bad WUs released recently (the ones Rhiju posted about).

Another problem is that the bug requiring "keep in memory" has been fixed. That means a lot of people are setting "keep in memory" to "no". There are places in some WUs that require more than an hour to get to the next checkpoint, so with the default switching time of one hour the WU will keep dropping back to the last checkpoint indefinitely.
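
To illustrate the rollback effect described above, here is a minimal Python sketch. It is a toy model, not BOINC or Rosetta code: the function name `slices_to_finish`, its parameters, and the assumption of evenly spaced checkpoints are all made up for illustration.

```python
def slices_to_finish(switch_min, checkpoint_gap_min, total_min,
                     keep_in_memory=False, max_slices=1000):
    """Count the time slices a task needs to finish, or return None if it
    never finishes within max_slices.

    Toy model (assumption): checkpoints are written every
    checkpoint_gap_min minutes of work, and the task runs for
    switch_min minutes before the client switches projects.
    """
    progress = 0  # minutes of work completed so far
    for slice_number in range(1, max_slices + 1):
        progress += switch_min
        if progress >= total_min:
            return slice_number
        if not keep_in_memory:
            # Swapped out of memory: restart from the last checkpoint,
            # losing any work done since then.
            progress = (progress // checkpoint_gap_min) * checkpoint_gap_min
    return None


# 60-minute switching, 90 minutes between checkpoints: never finishes.
print(slices_to_finish(60, 90, 240))
# Same task kept in memory while pre-empted: finishes normally.
print(slices_to_finish(60, 90, 240, keep_in_memory=True))
```

In this model the task makes headway only if the switch interval is at least as long as the gap between checkpoints, or if it stays in memory while pre-empted.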

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13591 - Posted: 12 Apr 2006, 22:46:21 UTC

Found a bug! David Baker and I just tracked down the problem with these 4 workunits. It's a stupid infinite loop that only occurs with proteins with lengths of exactly 44 residues using one particular mode of Rosetta -- somehow no one in our group had ever looked at a protein exactly that size! So TallGuy-13088, you predicted right ...

Please do abort these workunits (below); otherwise, your client will continue to crunch the jobs until it times out (about 48 hours on a Windows machine). The good news is that we will give credit to all the jobs that time out, and are increasing the rigor of in-house testing to prevent this from happening in the future. And this little adventure helped us track down a pernicious bug in our code. Unfortunately, we don't yet have fixes for *all* the stuck jobs, though -- please continue to post info on other jobs that stop moving. It helps!

Jobs that should be aborted:
TRUNCATE_TERMINI_FULLRELAX_1enh__433
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433
TRUNCATE_TERMINI_FULLRELAX_1ptq__433
TRUNCATE_TERMINI_FULLRELAX_2tif__433
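
For anyone with many tasks to check, the list above can be matched against task names with a few lines of Python. This is only a sketch: `should_abort` is a hypothetical helper, not part of BOINC, and collapsing repeated underscores is an assumption made because names are sometimes posted in this thread with a single underscore.

```python
import re

# The four batch names from the list above.
BAD_BATCHES = (
    "TRUNCATE_TERMINI_FULLRELAX_1enh__433",
    "TRUNCATE_TERMINI_FULLRELAX_1b3aA_433",
    "TRUNCATE_TERMINI_FULLRELAX_1ptq__433",
    "TRUNCATE_TERMINI_FULLRELAX_2tif__433",
)


def _normalize(name):
    """Collapse runs of underscores so '__433' and '_433' compare equal."""
    return re.sub(r"_+", "_", name)


_BAD = tuple(_normalize(batch) for batch in BAD_BATCHES)


def should_abort(task_name):
    """Return True if task_name belongs to one of the listed batches."""
    name = _normalize(task_name)
    # Require an exact match or a following '_' so that a longer batch
    # number (for example 4330) is not matched by accident.
    return any(name == bad or name.startswith(bad + "_") for bad in _BAD)


print(should_abort("TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_297_0"))
print(should_abort("FA_RLXpt_hom004_1ptq__361_308_2"))
```

Note that a name such as FA_RLXpt_hom004_1ptq__361_308_2 is not matched, because it does not begin with one of the four listed names even though it contains 1ptq.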

charmed
Joined: 2 Nov 05
Posts: 11
Credit: 1,780,440
RAC: 0
Message 13592 - Posted: 12 Apr 2006, 23:08:12 UTC
Last modified: 12 Apr 2006, 23:10:23 UTC

About to abort WU FA_RLXpt_hom004_1ptq__361_308_2. It's stuck at 50.242%: Stage full atom relax, Model 9, Step 205901. It's at 7 hours 43 minutes and counting. Here it is: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11760498
Using Win XP Home Edition Service Pack 2, client ID 62881.


Tallguy-13088
Joined: 14 Dec 05
Posts: 9
Credit: 843,378
RAC: 0
Message 13602 - Posted: 12 Apr 2006, 23:50:57 UTC - in response to Message 13591.  
Last modified: 12 Apr 2006, 23:51:36 UTC

CONGRATS! Just remember, given a choice between "lucky" and "good", ALWAYS choose LUCKY ... with enough luck, you may eventually get good <grin>!

cwangersky
Joined: 6 Nov 05
Posts: 6
Credit: 325,556
RAC: 0
Message 13605 - Posted: 13 Apr 2006, 0:39:58 UTC

Here's an odd one...

Rosetta 4.98, WU 7449_largescale_large_fullatom_relax_dec7449_1_08_2.pdb_431_25_0 running with BOINC 5.2.13 on Windows XP 64-bit SP1 on an Athlon 64 3200+ with 512MB RAM. I also have SETI@home on that machine.

Starts up, 50% done, 2 hours CPU time used, runs for about an hour, at the end of that time it's still about 50% done, but has 3 hours CPU time; swaps out... SETI runs for an hour and swaps out... and then Rosetta swaps in again, 50% done, 2 hours (!) CPU time used. Caught this one because the accepted protein shape is pretty uncommon (looks sort of like a lollipop).

Shall I kill it or do you want me to keep watching it for a while? It's been on here for three days now, which means ballpark 36 hours, but I think I have only 2 hours credit for it...

Robert Everly
Joined: 8 Oct 05
Posts: 27
Credit: 665,094
RAC: 0
Message 13606 - Posted: 13 Apr 2006, 0:46:06 UTC

Very good news. Keep up the good work everyone!

Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13607 - Posted: 13 Apr 2006, 1:07:19 UTC - in response to Message 13605.  
Last modified: 13 Apr 2006, 1:08:59 UTC

cwangersky, these are very big WUs which take a loooong time per model. On some P4s they might even take more than 2hr PER MODEL, so unless you have "Leave in mem when pre-empted"=YES, the PC can't complete even 1 model in 2hr before Rosetta gets swapped out to run SETI, and your PC starts the WU from 0 again...

Solution: increase "time between swaps" to e.g. 4hr, or (if your PC has lots of RAM and/or runs few BOINC projects) set "leave in mem when preempted"=YES. I would choose the latter.

This very example is why Rosetta needs a BigWU flag in preferences IMHO...

AMD's explained it in a previous comment:

Another problem is that the bug requiring "keep in memory" has been fixed. That means a lot of people are setting "keep in memory" to "no". There are places in some WUs that require more than an hour to get to the next checkpoint, so with the default switching time of one hour the WU will keep dropping back to the last checkpoint indefinitely.

Dan Wulff
Joined: 17 Sep 05
Posts: 3
Credit: 25,833,721
RAC: 35
Message 13610 - Posted: 13 Apr 2006, 1:53:37 UTC
Last modified: 13 Apr 2006, 1:57:57 UTC

aborted wu

After over 9.5 hours this one was still at 1.04% and showing 16 more hours to go. I manually aborted this unit.

Result ID 16987331
Name TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_297_0
Workunit 13923431

Kevin
Joined: 15 Jan 06
Posts: 21
Credit: 109,496
RAC: 0
Message 13612 - Posted: 13 Apr 2006, 2:13:21 UTC

Glad to see the Truncate_Termini units were cancelled. I just noticed one of my machines was working on one of those units for a lil more than 29 hours. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13918811


msr-berlin
Joined: 28 Nov 05
Posts: 2
Credit: 8,058
RAC: 0
Message 13630 - Posted: 13 Apr 2006, 11:35:51 UTC

Aborted the following work unit, TRUNCATE_TERMINI_FULLRELAX_2tif__433_417, after running for 34 hours.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13934643

StephenYavorsky
Joined: 24 Mar 06
Posts: 9
Credit: 87,195
RAC: 0
Message 13644 - Posted: 13 Apr 2006, 14:18:40 UTC - in response to Message 13331.  

This thread is for reporting Workunits that have hung (1% error), or that have been manually aborted for some reason. Please include the type of error in your report, and a link to the RESULT in your stats page. This thread replaces part one which is located here.



Application: Rosetta 4.98
Project: TRUNCATE_TERMINI_FULLRELAX_1ptq__433_36_0
Stuck at 1.04% after 10h36m28s

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 13650 - Posted: 13 Apr 2006, 14:35:11 UTC - in response to Message 13331.  

This thread is for reporting Workunits that have hung (1% error), or that have been manually aborted for some reason. Please include the type of error in your report, and a link to the RESULT in your stats page. This thread replaces part one which is located here.

Pardon my ignorance, but how does one technically do a link? Thanks!

Regards,
Bob P.

JDHalter
Joined: 3 Nov 05
Posts: 13
Credit: 722,679
RAC: 0
Message 13653 - Posted: 13 Apr 2006, 15:16:18 UTC
Last modified: 13 Apr 2006, 15:21:37 UTC

I had 2 or 3 1% hangs today...this one was for one machine, and hung at 1.04% for somewhere between 18-22 hrs...(can't remember which one this machine was...), I aborted the other ones, but didn't note the % complete before...sorry.

http://www.boinc.bakerlab.org/rosetta/result.php?resultid=17018657

Here's a second wu that had the 1% hang error (don't know what %), but here's the result link to it:

http://www.boinc.bakerlab.org/rosetta/result.php?resultid=17014563

There may be a few more, as I've got another 2 machines that haven't checked in for 12+ hrs...usually they've been checking in every 4-6. If I get some time, I'll get their result links too.

Hope it helps,
JDHalter

Robin2
Joined: 6 Nov 05
Posts: 8
Credit: 119,665
RAC: 0
Message 13654 - Posted: 13 Apr 2006, 15:18:23 UTC

I had a day's worth of work units HBLR_1.0_...... fail, showing Client error. I aborted the remaining units of that type which had not yet been run, and subsequent units (TRUNCATE_TERMINI_FULLRELAX..... and HB_BARCODE_30....) have been fine.
Robin

adrianxw
Joined: 18 Sep 05
Posts: 662
Credit: 12,167,519
RAC: 0
Message 13656 - Posted: 13 Apr 2006, 15:44:35 UTC
Last modified: 13 Apr 2006, 15:48:26 UTC

Pardon my ignorance, but how does one technically do a link? Thanks!

In BBCode you use the opening and closing "square brackets" characters, "[" and "]".

I can't show you exactly because, obviously, it would create a link, but: type an open square bracket, then type url=, then paste in the URL you want (open the page in your browser and copy the contents from the address line), then a closing square bracket.

What you type next will be the "highlighted text" of your link.

Then put another open square bracket followed by /url and a final closing square bracket.

That's a link.

17128664

That one has an open square bracket, "url=https://boinc.bakerlab.org/rosetta/result.php?resultid=17128664", then a close square bracket. It has 17128664 next as "highlighted text", then the open square bracket, "/url", and a close square bracket.
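
The same steps can be written out as a small Python sketch. The helper names `bbcode_link` and `result_link` are made up for illustration; the URL pattern is the one from the example above.

```python
# Result page URL pattern, taken from the example link in this post.
RESULT_URL = "https://boinc.bakerlab.org/rosetta/result.php?resultid={}"


def bbcode_link(url, text):
    """Build a BBCode link: [url=<address>]<highlighted text>[/url]."""
    return "[url=" + url + "]" + text + "[/url]"


def result_link(result_id):
    """Build a BBCode link to a result page, using the ID as the link text."""
    return bbcode_link(RESULT_URL.format(result_id), str(result_id))


# Paste the printed line into the message editor.
print(result_link(17128664))
```

Typing the printed text into a post should produce a link like the 17128664 one above.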

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13659 - Posted: 13 Apr 2006, 16:04:07 UTC - in response to Message 13650.  

Pardon my ignorance, but how does one technically do a link? Thanks!


Find a post that does a link, then click on "reply to this post" for that post. Look at the quoted text in the editing window and it will show how they did it.