Report stuck & aborted WU here please - II

Message boards : Number crunching : Report stuck & aborted WU here please - II

msr-berlin
Joined: 28 Nov 05
Posts: 2
Credit: 8,058
RAC: 0
Message 13630 - Posted: 13 Apr 2006, 11:35:51 UTC

Aborted the following work unit, TRUNCATE_TERMINI_FULLRELAX_2tif__433_417, after running for 34 hours.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13934643

ID: 13630 · Rating: 0
StephenYavorsky
Joined: 24 Mar 06
Posts: 9
Credit: 87,195
RAC: 0
Message 13644 - Posted: 13 Apr 2006, 14:18:40 UTC - in response to Message 13331.  

This thread is for reporting Workunits that have hung (1% error), or that have been manually aborted for some reason. Please include the type of error in your report, and a link to the RESULT in your stats page. This thread replaces part one which is located here.



Application: Rosetta 4.98
Project: TRUNCATE_TERMINI_FULLRELAX_1ptq__433_36_0
Stuck at 1.04% after 10h36m28s
ID: 13644 · Rating: 0
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 13650 - Posted: 13 Apr 2006, 14:35:11 UTC - in response to Message 13331.  

This thread is for reporting Workunits that have hung (1% error), or that have been manually aborted for some reason. Please include the type of error in your report, and a link to the RESULT in your stats page. This thread replaces part one which is located here.

Pardon my ignorance, but how does one technically do a link? Thanks!

Regards,
Bob P.
ID: 13650 · Rating: 0
JDHalter
Joined: 3 Nov 05
Posts: 13
Credit: 722,679
RAC: 0
Message 13653 - Posted: 13 Apr 2006, 15:16:18 UTC
Last modified: 13 Apr 2006, 15:21:37 UTC

I had two or three 1% hangs today. This one was on one machine and hung at 1.04% for somewhere between 18 and 22 hrs (can't remember which machine it was). I aborted the other ones, but didn't note the % complete first...sorry.

http://www.boinc.bakerlab.org/rosetta/result.php?resultid=17018657

Here's a second wu that had the 1% hang error (don't know what %), but here's the result link to it:

http://www.boinc.bakerlab.org/rosetta/result.php?resultid=17014563

There may be a few more, as I've got another 2 machines that haven't checked in for 12+ hrs...usually they've been checking in every 4-6. If I get some time, I'll get their result links too.

Hope it helps,
JDHalter
ID: 13653 · Rating: 0
Robin2
Joined: 6 Nov 05
Posts: 8
Credit: 119,665
RAC: 0
Message 13654 - Posted: 13 Apr 2006, 15:18:23 UTC

I had a day's worth of work units HBLR_1.0_...... fail, showing Client error. I aborted the remaining units of that type which had not yet been run, and subsequent units (TRUNCATE_TERMINI_FULLRELAX..... and HB_BARCODE_30....) have been fine.
Robin
ID: 13654 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13656 - Posted: 13 Apr 2006, 15:44:35 UTC
Last modified: 13 Apr 2006, 15:48:26 UTC

Pardon my ignorance, but how does one technically do a link? Thanks!

In BBCode you use the opening and closing "square brackets" characters, "[" and "]".

I can't show you exactly because, obviously, it would create a link. Type an open square bracket, then type url= and paste in the URL you want (open the page in your browser and copy the contents of the address line), then a closing square bracket.

What you type next will be the "highlighted text" of your link.

Then put another open square bracket followed by /url and a final closing square bracket.

That's a link.

17128664

That one has an open square bracket, "url=https://boinc.bakerlab.org/rosetta/result.php?resultid=17128664", then a close square bracket. It has 17128664 next as the "highlighted text", then an open square bracket, "/url", and a close square bracket.
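
Spelled out as code, the steps above are plain string assembly. Here is a minimal Python sketch; the helper name bbcode_link is made up for illustration, and the result ID is the one from the example above.

```python
def bbcode_link(url: str, text: str) -> str:
    """Build a BBCode link: open tag with the URL, the visible text, close tag."""
    return f"[url={url}]{text}[/url]"


# Result ID taken from the example in this post.
result_id = 17128664
url = f"https://boinc.bakerlab.org/rosetta/result.php?resultid={result_id}"

# Prints the text you would type into the message box.
print(bbcode_link(url, str(result_id)))
```

Pasting the printed line into a post should render the result ID as a clickable link, assuming the board has BBCode enabled.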
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13656 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13659 - Posted: 13 Apr 2006, 16:04:07 UTC - in response to Message 13650.  

Pardon my ignorance, but how does one technically do a link? Thanks!


Find a post that does a link, then click on "reply to this post" for that post. Look at the quoted text in the editing window and it will show how they did it.
ID: 13659 · Rating: 0
dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 13661 - Posted: 13 Apr 2006, 16:11:33 UTC
Last modified: 13 Apr 2006, 16:13:03 UTC

12 hours, 1% - Linux
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13955810
TRUNCATE_TERMINI_FULLRELAX_1enh__433_645

10 hours, 1% - Windoz
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13960661
TRUNCATE_TERMINI_FULLRELAX_1ptq__433_697
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
ID: 13661 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13666 - Posted: 13 Apr 2006, 17:10:14 UTC

Find a post that does a link, then click on "reply to this post" for that post. Look at the quoted text in the editing window and it will show how they did it.

You're right, it does. I'd not noticed that before!
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13666 · Rating: 0
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 13668 - Posted: 13 Apr 2006, 17:22:38 UTC - in response to Message 13659.  
Last modified: 13 Apr 2006, 17:23:56 UTC

Pardon my ignorance, but how does one technically do a link? Thanks!


Find a post that does a link, then click on "reply to this post" for that post. Look at the quoted text in the editing window and it will show how they did it.


In BBCode you use the opening and closing "square brackets" characters, "[" and "]".

I can't show you exactly because, obviously, it would create a link. Type an open square bracket, then type url= and paste in the URL you want (open the page in your browser and copy the contents of the address line), then a closing square bracket.

What you type next will be the "highlighted text" of your link.

Then put another open square bracket followed by /url and a final closing square bracket.

That's a link.

17128664

That one has an open square bracket, "url=https://boinc.bakerlab.org/rosetta/result.php?resultid=17128664", then a close square bracket. It has 17128664 next as the "highlighted text", then an open square bracket, "/url", and a close square bracket.


Thank you both very much! I have saved these responses for my future reference!


Regards,
Bob P.
ID: 13668 · Rating: 0
JDHalter
Joined: 3 Nov 05
Posts: 13
Credit: 722,679
RAC: 0
Message 13671 - Posted: 13 Apr 2006, 18:20:13 UTC

Here's another 1% hang...again at 1.04%...on a 3rd machine.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17036887
ID: 13671 · Rating: 0
cwangersky
Joined: 6 Nov 05
Posts: 6
Credit: 325,556
RAC: 0
Message 13672 - Posted: 13 Apr 2006, 18:38:50 UTC - in response to Message 13607.  

Here's an odd one...

Rosetta 4.98, WU 7449_largescale_large_fullatom_relax_dec7449_1_08_2.pdb_431_25_0 (deletia)


cwangersky, these are very big WUs which take a loooong time per model, on some P4s they might even take more than 2hr PER MODEL, so unless you have "Leave in mem when pre-empted"=YES, the PC can't complete even 1 model in 2hr before Rosetta gets swapped out to run SETI and your PC starts the WU from 0 again...

Solution: increase "time between swaps" to e.g. 4hr (deletia)
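
To make the arithmetic in that advice concrete, here is a rough Python sketch. It assumes, as the quoted explanation implies, that a WU swapped out without "Leave in mem when pre-empted" loses its work on the current model. The function name and the example timings are made up for illustration and are not part of BOINC or Rosetta.

```python
def model_can_finish(model_minutes: float,
                     switch_minutes: float,
                     leave_in_memory: bool) -> bool:
    """Rough check: can one model complete before its progress is lost?

    Assumption (taken from the quoted post, not from BOINC documentation):
    when the task is swapped out and not kept in memory, the current
    model starts over.
    """
    if leave_in_memory:
        # Progress survives a swap, so the model finishes eventually.
        return True
    # Otherwise the whole model must fit inside one switch interval.
    return model_minutes <= switch_minutes


# Hypothetical timings: a 130-minute model on a slow P4.
print(model_can_finish(130, 120, leave_in_memory=False))  # False
print(model_can_finish(130, 240, leave_in_memory=False))  # True
print(model_can_finish(130, 120, leave_in_memory=True))   # True
```

Under these assumptions, either raising the switch interval above the per-model time or keeping the application in memory is enough on its own.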


Thank you -- I'll give that a try.
ID: 13672 · Rating: 0
Robert J
Joined: 7 Oct 05
Posts: 3
Credit: 397,467
RAC: 0
Message 13674 - Posted: 13 Apr 2006, 19:13:51 UTC
Last modified: 13 Apr 2006, 19:16:20 UTC

This work unit was stuck at 1.04% for over six hours. Windows XP SP2.

TRUNCATE_TERMINI_FULLRELAX_1ptq__433_663_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=17026441
ID: 13674 · Rating: 0
RC
Joined: 27 Sep 05
Posts: 13
Credit: 262,048
RAC: 0
Message 13677 - Posted: 13 Apr 2006, 19:25:37 UTC

This unit stuck at 1.04% for 5.5 hours on Linux with Rosetta 4.98:
TRUNCATE_TERMINI_FULLRELAX_1enh__433_593_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=17018950
ID: 13677 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13685 - Posted: 13 Apr 2006, 21:39:21 UTC - in response to Message 13591.  

Just a reminder to those who are posting stuck WUs -- please abort the 4 work units below. We know why they're hanging, are not sending out any more of them, and are giving credit for any of these jobs that have timed out! Thanks.

Found a bug! David Baker and I just tracked down the problem with these 4 workunits. It's a stupid infinite loop that only occurs with proteins with lengths of exactly 44 residues using one particular mode of Rosetta -- somehow no one in our group had ever looked at a protein exactly that size! So TallGuy-13088, you predicted right ...

Please do abort these workunits (below); otherwise, your client will continue to crunch the jobs until it times out (about 48 hours on a Windows machine). The good news is that we will give credit to all the jobs that time out, and are increasing the rigor of in-house testing to prevent this from happening in the future. And this little adventure helped us track down a pernicious bug in our code. Unfortunately, we don't yet have fixes for *all* the stuck jobs, though -- please continue to post info on other jobs that stop moving. It helps!

Jobs that should be aborted:
TRUNCATE_TERMINI_FULLRELAX_1enh__433
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433
TRUNCATE_TERMINI_FULLRELAX_1ptq__433
TRUNCATE_TERMINI_FULLRELAX_2tif__433



ID: 13685 · Rating: 0
Bill Hepburn
Joined: 18 Sep 05
Posts: 14
Credit: 14,908,579
RAC: 2,909
Message 13694 - Posted: 13 Apr 2006, 23:07:49 UTC - in response to Message 13331.  
Last modified: 13 Apr 2006, 23:12:04 UTC

This one stuck at 1.04% for over 13 hours.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13967978
ID: 13694 · Rating: 0
RC
Joined: 27 Sep 05
Posts: 13
Credit: 262,048
RAC: 0
Message 13704 - Posted: 14 Apr 2006, 1:49:59 UTC - in response to Message 13677.  

Another one (almost 6 hours at 1.04% on Mac OS X) - aborted.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17050793

ID: 13704 · Rating: 0
Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13719 - Posted: 14 Apr 2006, 13:19:12 UTC - in response to Message 13685.  

Just a reminder to those who are posting stuck WUs -- please abort the 4 work units below. We know why they're hanging, are not sending out any more of them, and are giving credit for any of these jobs that have timed out! Thanks.



I read the above as saying that no credit is granted when we actually abort the stuck units, although credit is granted when we leave them to time out. Is that right? Will this now be the standard way of dealing with all workunit timeouts, or only in this case?
ID: 13719 · Rating: 0
K1100LTSE
Joined: 28 Feb 06
Posts: 7
Credit: 192,387
RAC: 0
Message 13724 - Posted: 14 Apr 2006, 15:24:15 UTC
Last modified: 14 Apr 2006, 15:29:07 UTC

abort by gui
Windows
20 hours, 1.043%
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13969518
ID: 13724 · Rating: 0
K1100LTSE
Joined: 28 Feb 06
Posts: 7
Credit: 192,387
RAC: 0
Message 13731 - Posted: 14 Apr 2006, 16:29:20 UTC

abort by gui
windows
20.15 Hour, 1.042%
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13977888
ID: 13731 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org