Information on Ver 4.97 errors

BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13421 - Posted: 10 Apr 2006, 20:23:27 UTC

Jeff Gilchrist spoketh: distributed folding?

Yep.. that's the one.
ID: 13421 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13437 - Posted: 11 Apr 2006, 4:38:28 UTC - in response to Message 13300.  

I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away?


Please don't throw them away if they run fine--I'm very curious about the results!
ID: 13437 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13438 - Posted: 11 Apr 2006, 5:38:40 UTC

I just got back into town an hour ago and have not yet been able to pinpoint the source of the recent problems. But I want to apologize in any event; the scale of the problems was certainly my fault.
Here is what happened:

I wanted to test the effects of an improvement in sampling alternative sidechain conformations during the high-resolution stage of the search. Tests on our in-house computers showed that this improvement resulted in consistently lower-energy structures being found, and there were absolutely no signs of any run-time problems. David K. sent out the new version of the code to RALPH on Thursday, and we submitted some test jobs. Friday afternoon we talked, and as there seemed to be no problems on RALPH, and the code change was relatively minor, David sent the new version out to Rosetta@home.
I was very eager to see how the improvement in sampling would affect the searches I had been carrying out in the HBLR_1.0 series of runs you all had been doing over the past month, and as I was going out of town for a few days I submitted a large number of jobs Friday evening so that there would be a clear picture when I returned. You can imagine my horror on checking up on Rosetta and RALPH in the few minutes before leaving Saturday morning! It was clear by Saturday that the test jobs I had sent out on RALPH had a high error rate on Windows, and that I had totally jumped the gun by sending out the very large set of runs on Rosetta on Friday. I'm very sorry that I did this, and about the waste of resources and confusion it caused, and I definitely learned my lesson: always make sure the RALPH tests are complete and 100% positive before submitting large-scale runs on Rosetta.
ID: 13438 · Rating: 0
Jose

Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 13439 - Posted: 11 Apr 2006, 7:10:02 UTC - in response to Message 13438.  

All I know is that almost 2 days of my computer time have resulted in errors of the kind you describe. To wit:

16811046  13764140  9 Apr 2006 10:36:23 UTC  11 Apr 2006 7:01:19 UTC  Over  Client error  Computing  12,238.19  37.94  ---
16697013  13665863  8 Apr 2006 21:09:49 UTC  9 Apr 2006 3:20:07 UTC   Over  Client error  Computing  18,578.25  57.60  ---
16613497  13627278  8 Apr 2006 13:05:47 UTC  8 Apr 2006 22:54:25 UTC  Over  Client error  Computing  25,537.47  79.18  ---
16564691  13587556  8 Apr 2006 5:48:01 UTC   8 Apr 2006 15:46:50 UTC  Over  Client error  Computing  23,689.95  73.45

To say the least, it has been frustrating.

“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
ID: 13439 · Rating: 0
anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 13440 - Posted: 11 Apr 2006, 7:25:29 UTC

Did the "rosetta_4.97_windows_intelx86.pdb" file give you any useful information about what happened?

Anders n
ID: 13440 · Rating: 0
XS_Duc
Joined: 30 Dec 05
Posts: 17
Credit: 310,471
RAC: 0
Message 13441 - Posted: 11 Apr 2006, 9:34:27 UTC - in response to Message 13438.  

I just got back into town an hour ago and have not yet been able to pinpoint the source of the recent problems. But I want to apologize in any event; the scale of the problems was certainly my fault.
Here is what happened:

I wanted to test the effects of an improvement in sampling alternative sidechain conformations during the high-resolution stage of the search. Tests on our in-house computers showed that this improvement resulted in consistently lower-energy structures being found, and there were absolutely no signs of any run-time problems. David K. sent out the new version of the code to RALPH on Thursday, and we submitted some test jobs. Friday afternoon we talked, and as there seemed to be no problems on RALPH, and the code change was relatively minor, David sent the new version out to Rosetta@home.
I was very eager to see how the improvement in sampling would affect the searches I had been carrying out in the HBLR_1.0 series of runs you all had been doing over the past month, and as I was going out of town for a few days I submitted a large number of jobs Friday evening so that there would be a clear picture when I returned. You can imagine my horror on checking up on Rosetta and RALPH in the few minutes before leaving Saturday morning! It was clear by Saturday that the test jobs I had sent out on RALPH had a high error rate on Windows, and that I had totally jumped the gun by sending out the very large set of runs on Rosetta on Friday. I'm very sorry that I did this, and about the waste of resources and confusion it caused, and I definitely learned my lesson: always make sure the RALPH tests are complete and 100% positive before submitting large-scale runs on Rosetta.


Those who are free of sin may now pick up a stone and throw it...
We lost some time and resources, so what? It happened before and will certainly happen again, I guess.
Nothing is flawless, and mistakes will always be made... but they shall be forgiven and forgotten in the long run towards success.

The weak shall perish...
ID: 13441 · Rating: 1
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 14020 - Posted: 18 Apr 2006, 8:50:18 UTC - in response to Message 13298.  

Sounds like "reset project" from the projects tab. This basically aborts any WUs and reloads the application code.


I know it is too late for this thread, but I'd like to correct this please, Feet1st.

Reset is not the same as abort and reload. Reset does a forget and reload.

Often the abort is useful to a project, as the error file may contain some useful info. It also allows the WU to be released to another user. For that latter reason, with a dodgy WU a reset is often more useful, as it does not force a re-issue before the team have had a chance to stop the WU being issued.

So both have their uses, but they are not the same.

Where a project wants the error reports, the short procedure is to go to the work tab and abort each existing work unit, and let it report in due course.

The full procedure, if you also want to force a reload, is quite complicated, as you have to force through the flushing of the aborted work.

1) set No New Work for that project
2) abort all WU separately from the Work tab
3) suspend all other projects from the projects tab to force the aborted WU to run (sounds contradictory, but this is where each WU generates the error report)
4) in the unlikely event that these get stuck, resume then suspend one of the other projects - sometimes you'll find you need to do this as many times as you have aborted WU
5) update this project
6) wait for aborted WU to disappear from work tab
7) *now* reset project if required
8) set allow new work
9) resume all other projects from the projects tab.
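
For anyone who would rather script this than click through the manager, the nine steps above can be sketched roughly as below. This is only an illustration under assumptions, not something from the project: it assumes the `boinccmd` command-line tool with its `--project URL <op>` and `--task URL NAME <op>` forms (check the operation names against your own client version), and the function name `flush_and_reset_plan`, the `Step` class, and the example URLs and task names are made up for the sketch. It only builds the list of commands in order and runs nothing; steps 4 and 6 stay manual because they depend on watching the Work tab.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    number: int                      # step number from the procedure in the post
    note: str                        # short description of the step
    argv: List[str] = field(default_factory=list)  # empty list means "manual step"


def flush_and_reset_plan(project_url: str,
                         task_names: List[str],
                         other_project_urls: List[str],
                         reset: bool = True) -> List[Step]:
    """Build (but do not run) the ordered commands for the abort-then-reset procedure.

    Assumption: the client is driven with `boinccmd`; verify the operation
    names below against your installed version before using them.
    """
    def project(url: str, op: str) -> List[str]:
        return ["boinccmd", "--project", url, op]

    plan: List[Step] = []
    plan.append(Step(1, "set No New Work", project(project_url, "nomorework")))
    for name in task_names:
        # Each WU is aborted separately, as in step 2.
        plan.append(Step(2, f"abort {name}",
                         ["boinccmd", "--task", project_url, name, "abort"]))
    for url in other_project_urls:
        plan.append(Step(3, f"suspend {url}", project(url, "suspend")))
    plan.append(Step(4, "if aborted WUs get stuck, resume then suspend another project"))
    plan.append(Step(5, "update this project", project(project_url, "update")))
    plan.append(Step(6, "wait until the aborted WUs disappear from the Work tab"))
    if reset:
        plan.append(Step(7, "reset project", project(project_url, "reset")))
    plan.append(Step(8, "allow new work", project(project_url, "allowmorework")))
    for url in other_project_urls:
        plan.append(Step(9, f"resume {url}", project(url, "resume")))
    return plan


if __name__ == "__main__":
    # Hypothetical URLs and task names, for illustration only.
    for step in flush_and_reset_plan("https://example.org/proj/",
                                     ["wu_a", "wu_b"],
                                     ["https://example.org/other/"]):
        command = " ".join(step.argv) if step.argv else "(manual)"
        print(f"{step.number}) {step.note}: {command}")
```

Keeping the manual steps in the plan as entries with no command is deliberate: the reset in step 7 should not be issued until step 6 has actually been confirmed by eye.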

It is a lot to ask users to do - which is why a project may well just ask for a reset instead - a larger percentage of users will actually do it! But it still is not the same.

River~~
ID: 14020 · Rating: -1


