WUs that die in 30 seconds..

Message boards : Number crunching : WUs that die in 30 seconds..

BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 11925 - Posted: 12 Mar 2006, 5:15:16 UTC

I got one; and I've seen mention of a number of them in one of the reporting threads. One of the fellows on the Teddies team managed to get enough erroring WUs that he hit the max number of errors allowed in a day. (You're only allowed to upload 48 WUs a day?)

There've been several batches of these terminal WUs in the last few months, so I ask the following questions: Do they have a 100% failure rate on Linux and Windows? And even if not, why aren't all these WUs having 1 model generated on a fast Windows machine there in the labs before being released to us?


David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11926 - Posted: 12 Mar 2006, 5:23:06 UTC - in response to Message 11925.  

I got one; and I've seen mention of a number of them in one of the reporting threads. One of the fellows on the Teddies team managed to get enough erroring WUs that he hit the max number of errors allowed in a day. (You're only allowed to upload 48 WUs a day?)

There've been several batches of these terminal WUs in the last few months, so I ask the following questions: Do they have a 100% failure rate on Linux and Windows? And even if not, why aren't all these WUs having 1 model generated on a fast Windows machine there in the labs before being released to us?



You are right: we should be able to avoid this by always testing first on Ralph. I will check to see why this is still happening.
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11945 - Posted: 12 Mar 2006, 17:30:55 UTC

This thread has been moved from the science forum as off topic.
Moderator9
ROSETTA@home FAQ
Moderator Contact
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 11953 - Posted: 12 Mar 2006, 20:29:54 UTC

...so should we be reporting them in here, or do you have enough info already?
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11956 - Posted: 12 Mar 2006, 21:13:00 UTC - in response to Message 11953.  
Last modified: 12 Mar 2006, 21:18:31 UTC

...so should we be reporting them in here, or do you have enough info already?


Good point. I will make a sticky thread in this forum for DOA Work Units.

{EDIT: The new sticky is Here }

Moderator9
ROSETTA@home FAQ
Moderator Contact
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 11963 - Posted: 13 Mar 2006, 0:01:53 UTC

HOMSdt_homDB004_1dtj__340_50_0 failed 3 times. HOMSdt_homDB004 appears in a number of the reports on this series of WU failures. Did any of the HOMSdt_homDB WUs get processed successfully?



Darren
Joined: 6 Oct 05
Posts: 27
Credit: 43,535
RAC: 0
Message 11964 - Posted: 13 Mar 2006, 1:02:00 UTC - in response to Message 11963.  

Did any of the HOMSdt_homDB WUs get processed successfully?


I have this HOMSdt_homDB005_1dtj__352_78 that processed successfully.

BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 11966 - Posted: 13 Mar 2006, 3:43:16 UTC

Dang.. so some of them did succeed.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10401327
HOMSdt_homDB027_1dtj__352_1364

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10268921
HOMSdt_homDB003_1dtj__352_40

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10310727
HOMSdt_homDB009_1dtj__352_458

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10386127
HOMSdt_homDB027_1dtj__352_1212

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10337611
HOMSdt_homDB011_1dtj__352_727

All five of the above failed multiple times.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10683881
HOMSti_homDB017_1tif__352_1174

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10439262
HOMSmk_homDB013_1mkyA_352_1147

Both of these failed at first, but finally succeeded on the last machine.

This shows that while these WUs have an incredibly high failure rate, some systems are able to process them.
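
For anyone who wants to tally reports like the ones above by protein target, here is a rough Python sketch. It assumes, purely from the names quoted in this thread, that the third underscore-separated field of a WU name (1dtj, 1tif, 1mkyA) is the target. That naming rule is a guess and is not confirmed by the project, and `target_of` is a made-up helper name.

```python
from collections import Counter

# WU names copied from the post above.
FAILED_MULTIPLE_TIMES = [
    "HOMSdt_homDB027_1dtj__352_1364",
    "HOMSdt_homDB003_1dtj__352_40",
    "HOMSdt_homDB009_1dtj__352_458",
    "HOMSdt_homDB027_1dtj__352_1212",
    "HOMSdt_homDB011_1dtj__352_727",
]
FAILED_THEN_SUCCEEDED = [
    "HOMSti_homDB017_1tif__352_1174",
    "HOMSmk_homDB013_1mkyA_352_1147",
]


def target_of(wu_name):
    """Return the third underscore-separated field (assumed to be the target)."""
    return wu_name.split("_")[2]


# Count how many reported WUs belong to each assumed target.
counts = Counter(target_of(n) for n in FAILED_MULTIPLE_TIMES + FAILED_THEN_SUCCEEDED)

for target, n in counts.most_common():
    print(target, n)
```

A tally like this only covers the WUs people happened to report here, so it says nothing about the real failure rate of a batch.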

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11967 - Posted: 13 Mar 2006, 4:42:18 UTC - in response to Message 11963.  

HOMSdt_homDB004_1dtj__340_50_0 failed 3 times. HOMSdt_homDB004 appears in a number of the reports on this series of WU failures. Did any of the HOMSdt_homDB WUs get processed successfully?




Only a subset failed. Divya identified the problem and fixed it. For experts: the "-termini" option adds a proton to the N terminus, but for proline there is no place to put the proton, and a subset of the 1dtj homologues had an N-terminal proline. This is the sort of mistake that only gets made once, and it has now been fixed.
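
As an illustration of the kind of pre-release check that would catch the N-terminal proline case described above, here is a minimal Python sketch. It is not Rosetta code and not the fix Divya made. The function names and the example sequences are hypothetical, and it assumes each homologue is available as a one-letter amino acid sequence.

```python
def has_n_terminal_proline(sequence):
    """Return True when the first residue is proline (one-letter code 'P')."""
    if not sequence:
        raise ValueError("empty sequence")
    return sequence[0].upper() == "P"


def split_by_n_terminal_proline(homologues):
    """Split {name: sequence} into (safe, flagged) lists of names.

    'flagged' holds the entries that start with proline, i.e. the ones
    for which adding an N-terminal proton is the problem case.
    """
    safe, flagged = [], []
    for name, seq in homologues.items():
        (flagged if has_n_terminal_proline(seq) else safe).append(name)
    return safe, flagged


# Made-up sequences, for illustration only.
example = {
    "homologue_a": "MKTAYIAKQR",
    "homologue_b": "PEVLKDGSTA",
}
safe, flagged = split_by_n_terminal_proline(example)
print("safe:", safe)
print("flagged:", flagged)
```

A check along these lines would only flag this one failure mode. It does not replace running a model on a test project first.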

BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 12006 - Posted: 14 Mar 2006, 8:35:39 UTC

I was wondering if running a single model of these WUs in the lab would have uncovered the problem - assuming that there was a 100% failure rate. But Darren proved that theory wrong.

And as dgeiser posted (see dgeiser's sub-minute failures post), it's not limited to a subset of 1dtj WUs.

Any explanation for why the 1tif and 1mkyA WUs (someone else's failures, listed in my last message) failed with less than 0.12 credits on the first machine or two, but managed to succeed on the last machine? Do you have any idea what was different? (hardware/software/starting code?)


