Report stuck & aborted WU here please - II

dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 13661 - Posted: 13 Apr 2006, 16:11:33 UTC
Last modified: 13 Apr 2006, 16:13:03 UTC

12 hours, 1% - Linux
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13955810
TRUNCATE_TERMINI_FULLRELAX_1enh__433_645

10 hours, 1% - Windoz
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13960661
TRUNCATE_TERMINI_FULLRELAX_1ptq__433_697
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

adrianxw
Joined: 18 Sep 05
Posts: 662
Credit: 12,167,519
RAC: 0
Message 13666 - Posted: 13 Apr 2006, 17:10:14 UTC

Find a post that does a link, then click on "reply to this post" for that post. Look at the quoted text in the editing window and it will show how they did it.

You're right, it does. I'd not noticed that before!
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 13668 - Posted: 13 Apr 2006, 17:22:38 UTC - in response to Message 13659.  
Last modified: 13 Apr 2006, 17:23:56 UTC

Pardon my ignorance, but how does one technically do a link? Thanks!


Find a post that does a link, then click on "reply to this post" for that post. Look at the quoted text in the editing window and it will show how they did it.


In BBCode you use the opening and closing "square brackets" characters, "[" and "]".

I can't show you exactly because, obviously, it would create a link, but type an open square bracket, then type url=, then paste in the URL you want (open the page in your browser and copy the contents from the address line), then a closing square bracket.

What you type next will be the "highlighted text" of your link.

Then put another open square bracket followed by /url and a final closing square bracket.

That's a link.

17128664

That one has an open square bracket, "url=https://boinc.bakerlab.org/rosetta/result.php?resultid=17128664", then a close square bracket. It has 17128664 next as "highlighted text", then the open square bracket, "/url" and a close square bracket.
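
The steps described above can also be sketched in code, which avoids the problem of the forum turning the example into a live link. This is only an illustration in Python; the helper name `bbcode_link` is made up for this example and is not part of any forum software.

```python
def bbcode_link(url: str, text: str) -> str:
    """Build a BBCode link: [url=ADDRESS]highlighted text[/url]."""
    return f"[url={url}]{text}[/url]"


# The example link from the post: the result URL, with the result ID
# as the highlighted text.
link = bbcode_link(
    "https://boinc.bakerlab.org/rosetta/result.php?resultid=17128664",
    "17128664",
)
print(link)
```

Pasting the printed string into the message editor should produce the same kind of link as the one shown above.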


Thank you both very much! I have saved these responses for my future reference!


Regards,
Bob P.

JDHalter
Joined: 3 Nov 05
Posts: 13
Credit: 722,679
RAC: 0
Message 13671 - Posted: 13 Apr 2006, 18:20:13 UTC

Here's another 1% hang...again at 1.04%...on a 3rd machine.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17036887

cwangersky
Joined: 6 Nov 05
Posts: 6
Credit: 325,556
RAC: 0
Message 13672 - Posted: 13 Apr 2006, 18:38:50 UTC - in response to Message 13607.  

Here's an odd one...

Rosetta 4.98, WU 7449_largescale_large_fullatom_relax_dec7449_1_08_2.pdb_431_25_0 (deletia)


cwangersky, these are very big WUs which take a loooong time per model, on some P4s they might even take more than 2hr PER MODEL, so unless you have "Leave in mem when pre-empted"=YES, the PC can't complete even 1 model in 2hr before Rosetta gets swapped out to run SETI and your PC starts the WU from 0 again...

Solution: increase "time between swaps" to e.g. 4hr (deletia)
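
The reasoning in the advice above can be illustrated with a small, deliberately simplified model. It assumes that progress is saved only when a model finishes and that a pre-empted task not kept in memory restarts its current model from scratch; the thread does not describe Rosetta's real checkpointing in more detail than that. The function name and the numbers are hypothetical.

```python
def models_completed(model_hours, switch_hours, leave_in_memory, rosetta_hours):
    """Count models finished after `rosetta_hours` of Rosetta CPU time.

    Simplifying assumptions (not taken from Rosetta itself):
    - progress is only saved when a whole model finishes;
    - a pre-empted task that is NOT kept in memory restarts its current model.
    """
    if leave_in_memory:
        # Work carries over across swaps, so only total CPU time matters.
        return int(rosetta_hours // model_hours)
    # Otherwise every time slice starts the current model from zero.
    models_per_slice = int(switch_hours // model_hours)
    full_slices = int(rosetta_hours // switch_hours)
    leftover = rosetta_hours - full_slices * switch_hours
    return full_slices * models_per_slice + int(leftover // model_hours)


# A model needing 2.5 h of CPU, 20 h of Rosetta time in total:
print(models_completed(2.5, 2, False, 20))  # 2 h swaps, not kept in memory
print(models_completed(2.5, 4, False, 20))  # 4 h swaps, not kept in memory
print(models_completed(2.5, 2, True, 20))   # 2 h swaps, kept in memory
```

Under these assumptions the first case never finishes a model, which matches the "starts the WU from 0 again" behaviour described above, while either a longer swap interval or keeping the task in memory lets work complete.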


Thank you -- I'll give that a try.

Robert J
Joined: 7 Oct 05
Posts: 3
Credit: 397,467
RAC: 0
Message 13674 - Posted: 13 Apr 2006, 19:13:51 UTC
Last modified: 13 Apr 2006, 19:16:20 UTC

This work unit was stuck at 1.04% for over six hours. Windows XP SP2.

TRUNCATE_TERMINI_FULLRELAX_1ptq__433_663_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=17026441

RC
Joined: 27 Sep 05
Posts: 13
Credit: 262,048
RAC: 0
Message 13677 - Posted: 13 Apr 2006, 19:25:37 UTC

This unit stuck at 1.04% for 5.5 hours on Linux with Rosetta 4.98:
TRUNCATE_TERMINI_FULLRELAX_1enh__433_593_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=17018950

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13685 - Posted: 13 Apr 2006, 21:39:21 UTC - in response to Message 13591.  

Just a reminder to those who are posting stuck WU's -- please abort the 4 work units below. We know why they're hanging, are no longer sending them out, and are giving credits to any of these jobs that have timed out! Thanks.

Found a bug! David Baker and I just tracked down the problem with these 4 workunits. It's a stupid infinite loop that only occurs with proteins with lengths of exactly 44 residues using one particular mode of Rosetta -- somehow no one in our group had ever looked at a protein exactly that size! So TallGuy-13088, you predicted right ...

Please do abort these workunits (below); otherwise, your client will continue to crunch the jobs until it times out (about 48 hours on a Windows machine). The good news is that we will give credit to all the jobs that time out, and are increasing the rigor of in-house testing to prevent this from happening in the future. And this little adventure helped us track down a pernicious bug in our code. Unfortunately, we don't yet have fixes for *all* the stuck jobs, though -- please continue to post info on other jobs that stop moving. It helps!

Jobs that should be aborted:
TRUNCATE_TERMINI_FULLRELAX_1enh__433
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433
TRUNCATE_TERMINI_FULLRELAX_1ptq__433
TRUNCATE_TERMINI_FULLRELAX_2tif__433
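
For anyone with many tasks queued, a quick way to check names against this list might look like the sketch below. It assumes that the names shown in the BOINC client are one of the four batch names above followed by a suffix (as in TRUNCATE_TERMINI_FULLRELAX_1ptq__433_663_0, reported earlier in this thread); the function name `should_abort` is made up for this example.

```python
# The four batch names listed above.
BAD_PREFIXES = (
    "TRUNCATE_TERMINI_FULLRELAX_1enh__433",
    "TRUNCATE_TERMINI_FULLRELAX_1b3aA_433",
    "TRUNCATE_TERMINI_FULLRELAX_1ptq__433",
    "TRUNCATE_TERMINI_FULLRELAX_2tif__433",
)


def should_abort(wu_name: str) -> bool:
    """Return True if the work unit name starts with one of the bad batch names."""
    # str.startswith accepts a tuple of prefixes.
    return wu_name.startswith(BAD_PREFIXES)


print(should_abort("TRUNCATE_TERMINI_FULLRELAX_1ptq__433_663_0"))
print(should_abort("7449_largescale_large_fullatom_relax_dec7449_1_08_2.pdb_431_25_0"))
```

The check only identifies candidates; the abort itself still has to be done by hand in the BOINC client.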




Bill Hepburn
Joined: 18 Sep 05
Posts: 14
Credit: 14,975,271
RAC: 0
Message 13694 - Posted: 13 Apr 2006, 23:07:49 UTC - in response to Message 13331.  
Last modified: 13 Apr 2006, 23:12:04 UTC

This one stuck at 1.04% for over 13 hours.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13967978

RC
Joined: 27 Sep 05
Posts: 13
Credit: 262,048
RAC: 0
Message 13704 - Posted: 14 Apr 2006, 1:49:59 UTC - in response to Message 13677.  

Another one (almost 6 hours at 1.04% on Mac OS X) - aborted.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17050793


K1100LTSE
Joined: 28 Feb 06
Posts: 7
Credit: 192,387
RAC: 0
Message 13724 - Posted: 14 Apr 2006, 15:24:15 UTC
Last modified: 14 Apr 2006, 15:29:07 UTC

abort by gui
Windows
20 hours, 1.043%
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13969518

K1100LTSE
Joined: 28 Feb 06
Posts: 7
Credit: 192,387
RAC: 0
Message 13731 - Posted: 14 Apr 2006, 16:29:20 UTC

abort by gui
windows
20.15 Hour, 1.042%
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13977888

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 13737 - Posted: 14 Apr 2006, 17:24:02 UTC - in response to Message 13719.  

Just a reminder to those who are posting stuck WU's -- please abort the 4 work units below. We know why they're hanging, are no longer sending them out, and are giving credits to any of these jobs that have timed out! Thanks.



I read the above as saying, no credit granted when we actually abort the stuck units although credit is granted when we leave them to timeout? Will this now be the standard way of dealing with all workunit timeouts or only in this case?


Random, I suppose.
I let a WU run on (the graphics were moving), so it either timed out or aborted itself, and I got no credits.
I can't be positive about this project anymore.
I have run it for about 5 months; another fortnight and then it's over.


Chilcotin
Joined: 5 Nov 05
Posts: 15
Credit: 16,969,500
RAC: 0
Message 13743 - Posted: 14 Apr 2006, 18:15:57 UTC
Last modified: 14 Apr 2006, 18:17:57 UTC

Workunit aborted after 23 hours. Stuck at 1.04 %.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13940837

Link


edit: looks like this may be one of the 4 already flagged in the postings above ...

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13780 - Posted: 14 Apr 2006, 21:29:03 UTC

The 1.4 stalls are still coming. I am very tired of aborting them and losing the tens of thousands of points that are NOT granted for wasted CPU time.
If this project is going to keep letting out BAD WUs, Rosetta needs to find a way to purge these bad WUs from their servers when they are found to cause problems like these have, and/or send commands to the user's client to delete or abort the bad WUs on any upload/download to the Rosetta servers.
To keep all the bad WUs in the system or on the Rosetta servers, forcing us to run them to purge them from the Rosetta system, is unfair to us and damages the project's reputation.
If this continues without relief, people will start to abandon this project.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13782 - Posted: 14 Apr 2006, 21:46:52 UTC - in response to Message 13780.  

The 1.4 stalls are still coming. I am very tired of aborting them and losing the tens of thousands of points that are NOT granted for wasted CPU time.
If this project is going to keep letting out BAD WUs, Rosetta needs to find a way to purge these bad WUs from their servers when they are found to cause problems like these have, and/or send commands to the user's client to delete or abort the bad WUs on any upload/download to the Rosetta servers.
To keep all the bad WUs in the system or on the Rosetta servers, forcing us to run them to purge them from the Rosetta system, is unfair to us and damages the project's reputation.
If this continues without relief, people will start to abandon this project.


again, we are very sorry for the problems of the recent days. we have spent most of today taking steps to ensure that these problems do not occur again. all the problem work units have been cancelled, and everything should be back to normal very soon (once the jobs that have already been downloaded have left your machines).

since CASP is starting soon, and many of the proteins will be larger, we wanted to do some calculations on a broader range of sizes. before pursuing this much further, we need some way of ensuring that these jobs are only sent out to machines appropriate for them, which is difficult with the current BOINC setup; we hope Rom can help us with this.

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 13784 - Posted: 14 Apr 2006, 22:06:15 UTC - in response to Message 13782.  

since CASP is starting soon, and many of the proteins will be larger, we wanted to do some calculations on a broader range of sizes. before pursuing this much further, we need some way of ensuring that these jobs are only sent out to machines appropriate for them, which is difficult with the current BOINC setup; we hope Rom can help us with this.

Beta 5.00 under Ralph@home preliminarily seems to be successfully processing work units that had previously failed under earlier versions of Rosetta. So it may not be the machines that are at fault, but the underlying Rosetta software itself (which seems to be on the way to being cleaned up if these early successes continue to hold up).

Regards,
Bob P.

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13797 - Posted: 14 Apr 2006, 23:53:59 UTC - in response to Message 13782.  

since CASP is starting soon, and many of the proteins will be larger, we wanted to do some calculations on a broader range of sizes. before pursuing this much further, we need some way of ensuring that these jobs are only sent out to machines appropriate for them, which is difficult with the current BOINC setup; we hope Rom can help us with this.

When you upload information to the server, does it verify who it's coming from, or just blindly accept it, and then process it to see if it came from an actual machine running Rosetta?
If Boinc sends a request from hostid=121218 for another workunit, can't the amount of ram (and cpu speed) be looked up from the database that displays this info: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=121218
Oops.. it doesn't list speed.. just the text cpuID. (speed would thus be based on the floating point and integer ratings..)

And then use something like this to determine what to send to each machine?
(Ram=Ram/number of cpu cores)
If hostid(121218).Ram > 750 Megs, then send EvenBiggerRamWU.
If hostid(121218).Ram > 500 Megs, then send BigRamWU.
If hostid(121218).Ram > 225 Megs, then send NormalWU.
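
The rule of thumb above could be written out roughly as follows. This is one reading of the pseudocode, not anything the BOINC scheduler is known to do: it picks the largest class the host qualifies for, the function name and the class labels are taken from or modelled on the post, and returning nothing for very small hosts is an added assumption.

```python
from typing import Optional


def pick_workunit_class(total_ram_mb: float, cpu_cores: int) -> Optional[str]:
    """Pick a work unit class from RAM per CPU core (thresholds from the post)."""
    # Ram = Ram / number of cpu cores; guard against a reported core count of 0.
    ram_per_core = total_ram_mb / max(cpu_cores, 1)
    if ram_per_core > 750:
        return "EvenBiggerRamWU"
    if ram_per_core > 500:
        return "BigRamWU"
    if ram_per_core > 225:
        return "NormalWU"
    # Assumption: hosts below the lowest threshold get no work of these kinds.
    return None


# Hypothetical hosts: (total RAM in MB, number of cores)
for ram, cores in [(2048, 2), (1024, 2), (512, 2), (256, 2)]:
    print(ram, cores, pick_workunit_class(ram, cores))
```

Checking the thresholds from largest to smallest matters here; tested in the order written in the post, a host with plenty of RAM would match all three conditions.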


Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13798 - Posted: 15 Apr 2006, 0:12:13 UTC

But what about the removal of bad WUs from your servers? You must set up a way to stop the resending of the BAD WUs. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto abort bad WUs on the client side. To let bad WUs run on your system or ours is a BAD THING.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13799 - Posted: 15 Apr 2006, 1:43:52 UTC - in response to Message 13798.  

But what about the removal of bad WUs from your servers? You must set up a way to stop the resending of the BAD WUs. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto abort bad WUs on the client side. To let bad WUs run on your system or ours is a BAD THING.


The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.


©2026 University of Washington
https://www.bakerlab.org