600,000 second/165 Hour/7 day WU!!!

Message boards : Number crunching : 600,000 second/165 Hour/7 day WU!!!

cloaked_chaos

Joined: 9 Nov 05
Posts: 14
Credit: 80,818
RAC: 0
Message 10837 - Posted: 17 Feb 2006, 5:51:47 UTC

I was looking through my recent BOINC history because I haven't checked it in a while, and noticed that one of the WUs took almost 600,000 seconds to complete!!! It stopped with a "max cpu time exceeded" error. This WU should have stopped LONG before the almost 7 whole days it took.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8103040

The weirdest part about it is that another person completed this WU in 2,434.05 seconds with no error.

I would like to know if I am going to get my 2,175.86 credit that was claimed by this WU. :(
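
For scale, the figures quoted in this post can be converted with a few lines of arithmetic. This is only a sketch: 600,000 seconds is the rounded figure from the post ("almost 600,000"), not the exact run time, and the function names are made up for illustration.

```python
# Figures quoted in the post. 600,000 s is an approximation ("almost 600,000").
LONG_RUN_SECONDS = 600_000
NORMAL_RUN_SECONDS = 2_434.05  # run time reported by the other host


def seconds_to_hours(seconds: float) -> float:
    """Convert seconds to hours."""
    return seconds / 3600


def seconds_to_days(seconds: float) -> float:
    """Convert seconds to days."""
    return seconds / 86_400


hours = seconds_to_hours(LONG_RUN_SECONDS)
days = seconds_to_days(LONG_RUN_SECONDS)
ratio = LONG_RUN_SECONDS / NORMAL_RUN_SECONDS

# Prints: 166.7 hours, 6.94 days, 246.5x the other host's run time
print(f"{hours:.1f} hours, {days:.2f} days, {ratio:.1f}x the other host's run time")
```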
ID: 10837 · Rating: 0
Profile Keck_Komputers
Joined: 17 Sep 05
Posts: 211
Credit: 4,246,150
RAC: 0
Message 10844 - Posted: 17 Feb 2006, 11:32:43 UTC

I don't know why this happened, but the task errored out so there will be no credit granted.
BOINC WIKI

BOINCing since 2002/12/8
ID: 10844 · Rating: 0
cloaked_chaos

Joined: 9 Nov 05
Posts: 14
Credit: 80,818
RAC: 0
Message 10961 - Posted: 19 Feb 2006, 20:18:52 UTC

Can I please get a professional opinion on this? I would really like to know why it happened.
ID: 10961 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 10963 - Posted: 19 Feb 2006, 20:25:43 UTC
Last modified: 19 Feb 2006, 20:26:34 UTC

It looks like an MCTE (max CPU time exceeded) problem, so I assume you could report it here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1008

Perhaps it's time for D.B. to say something about this problem?
Credits or not????
ID: 10963 · Rating: 0
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 10965 - Posted: 19 Feb 2006, 20:27:38 UTC

It was most likely the 1% bug. It can also happen if you do not keep the app in memory when preempted and your client switches projects before the work unit is able to make its first prediction. If this was the case, you can prevent it by setting "Leave applications in memory while preempted?" to yes in your general preferences, or by setting "Switch between applications every" to at least two hours or even more.
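
The preemption scenario above can be illustrated with a toy model. This is only a sketch under stated assumptions, not BOINC's or Rosetta's actual scheduling code: the function name, the 90-minute time to the first checkpoint, and the rule that all un-checkpointed progress is lost when the app is removed from memory are assumptions made for illustration.

```python
from typing import Optional


def minutes_until_first_checkpoint(
    work_needed: int,
    switch_every: int,
    leave_in_memory: bool,
    max_minutes: int = 10_000,
) -> Optional[int]:
    """Return CPU minutes spent before the first checkpoint, or None if it is
    not reached within max_minutes.

    Toy model (assumption): the task is preempted every `switch_every` minutes,
    and if it is not kept in memory, progress made before the first checkpoint
    is discarded at each preemption.
    """
    if work_needed <= 0 or switch_every <= 0:
        raise ValueError("work_needed and switch_every must be positive")

    progress = 0  # minutes of work currently retained
    spent = 0     # total CPU minutes consumed so far
    while spent < max_minutes:
        slice_len = min(switch_every, work_needed - progress)
        progress += slice_len
        spent += slice_len
        if progress >= work_needed:
            return spent
        # Preempted before the first checkpoint was written.
        if not leave_in_memory:
            progress = 0  # un-checkpointed work is lost
    return None


# Hypothetical task that needs 90 minutes to reach its first checkpoint.
print(minutes_until_first_checkpoint(90, 60, leave_in_memory=False))   # None
print(minutes_until_first_checkpoint(90, 60, leave_in_memory=True))    # 90
print(minutes_until_first_checkpoint(90, 120, leave_in_memory=False))  # 90
```

In this model, either of the two settings mentioned in the post is enough on its own: keeping the app in memory preserves the progress, and a switch interval longer than the time to the first checkpoint avoids the preemption entirely.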

I'll grant credit for this extreme circumstance. We may consider granting credit for all time out errors in the future.
ID: 10965 · Rating: 0
cloaked_chaos

Joined: 9 Nov 05
Posts: 14
Credit: 80,818
RAC: 0
Message 10966 - Posted: 19 Feb 2006, 20:34:04 UTC - in response to Message 10965.  

I'll grant credit for this extreme circumstance. We may consider granting credit for all time out errors in the future.

Thank you very much.
ID: 10966 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 10968 - Posted: 19 Feb 2006, 20:47:31 UTC - in response to Message 10965.  


ID: 10968 · Rating: 0
cloaked_chaos

Joined: 9 Nov 05
Posts: 14
Credit: 80,818
RAC: 0
Message 10981 - Posted: 20 Feb 2006, 2:23:03 UTC

I wonder, is there any way to tell whether or not I have the record for longest WU?
ID: 10981 · Rating: 0
Darren
Joined: 6 Oct 05
Posts: 27
Credit: 43,535
RAC: 0
Message 11024 - Posted: 20 Feb 2006, 19:20:51 UTC - in response to Message 10968.  
Last modified: 20 Feb 2006, 19:23:26 UTC

Had it on a machine only running R@H 24/7 (no switching or something else) and still don't know why it happened.


Keep in mind that if you don't leave applications in memory, even with only one project they still get removed when BOINC automatically runs the benchmarks. Perhaps that's what got you on that one.


ID: 11024 · Rating: 0
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11028 - Posted: 20 Feb 2006, 20:02:23 UTC - in response to Message 11024.  

Had it on a machine only running R@H 24/7 (no switching or something else) and still don't know why it happened.


Keep in mind that if you don't leave applications in memory, even with only one project they still get removed when BOINC automatically runs the benchmarks. Perhaps that's what got you on that one.




This is correct. As far as having the record for longest WU, I am afraid not. There have been longer ones. Usually this happens only on "launch and forget" systems. Systems that are attended do not have this problem often, because people intervene. The new application should help prevent this.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 11028 · Rating: 0
cloaked_chaos

Joined: 9 Nov 05
Posts: 14
Credit: 80,818
RAC: 0
Message 12901 - Posted: 1 Apr 2006, 5:29:19 UTC - in response to Message 10966.  

I'll grant credit for this extreme circumstance. We may consider granting credit for all time out errors in the future.

I am really wondering why I never actually received credit for this work unit, even after being promised it would be granted to me...
ID: 12901 · Rating: 1
James

Joined: 27 Mar 06
Posts: 4
Credit: 23,809
RAC: 0
Message 12955 - Posted: 2 Apr 2006, 18:24:33 UTC - in response to Message 12901.  
Last modified: 2 Apr 2006, 18:25:14 UTC

I'll grant credit for this extreme circumstance. We may consider granting credit for all time out errors in the future.

I am really wondering why I never actually received credit for this work unit, even after being promised it would be granted to me...


Change your max timeout settings, perhaps using tux's xml script (not the OPTIMIZED client, the 'calibration' client that won't artificially inflate your benchmarks) that comes with his boinc client. This should have been 'killed' way before 600k seconds.

For example, Rosetta runs 120 minute work units. I 'kill' all WUs that do not complete after 145 minutes. You can 'tweak' your preferences:)

As for the credit issue, I have sympathy because I have participated in the climate projects and had unrecoverable errors at 50+ percent (you know, the MASSIVE, as in WEEKS/MONTHS, WUs). I did get credit though.

Change your settings so you don't have it happen again.

This part isn't addressed to you:

Credit should be granted for 'real' processor usage. Rosetta, unlike say Einstein, does not calibrate WU times. It's getting to be pretty sickening in general because there are 3800s/2.+ GHz machines that are claiming massive amounts of credit based upon unreal benchmarks. I overclock my 4800 from a stock 2.4 GHz to 2.7 GHz for each core and I know that a 3800 can't get 3 times my floating-point and integer scores:) The same is true for the 2 GHz Intels that are doing the same thing.

I'm not necessarily upset about the 'cheating' but it encourages others to do the same and it creates almost amusing benchmarks on the top computers pages.

The 1 percent error is annoying - so is the fact that Rosetta has yet to incorporate a calibration feature like, say, Einstein that grants credit where credit is deserved, not manipulated artificially.
ID: 12955 · Rating: 0
Profile Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 12958 - Posted: 2 Apr 2006, 20:39:50 UTC - in response to Message 12955.  
Last modified: 2 Apr 2006, 20:45:19 UTC

For example, Rosetta runs 120 minute work units. I 'kill' all WUs that do not complete after 145 minutes. You can 'tweak' your preferences:)


James, Rosetta has adjustable run-time WUs, where it keeps creating as many new "predicted models" from the same "raw protein data" as will fit in your time settings.

Currently, default is only 2hr, but the max WU runtime is 24hr (and used to be 4 CPU days = 96 hours, before the project reduced it, so they can give a time-to-live of 24hr for every WU).

While 120min is the default, many people run every WU for much longer, e.g. I use 8hr myself.
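
As a rough illustration of how an adjustable run time translates into the number of models per WU, here is a toy calculation. It is a sketch, not Rosetta's actual logic: the function name, the fixed 0.5 hours per model, and the rule that at least one model is always completed are assumptions made for this example.

```python
import math


def models_in_target(target_hours: float, hours_per_model: float) -> int:
    """Number of predicted models that fit in the target run time (toy model).

    Assumptions: every model takes the same time, and at least one model is
    always completed even if it alone exceeds the target.
    """
    if target_hours <= 0 or hours_per_model <= 0:
        raise ValueError("both arguments must be positive")
    return max(1, math.floor(target_hours / hours_per_model))


HOURS_PER_MODEL = 0.5  # hypothetical value, for illustration only

for target in (2, 8, 24):  # default, the poster's setting, and the maximum
    print(target, "h ->", models_in_target(target, HOURS_PER_MODEL), "models")
```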

Also, I have to add that I've been crunching Rosetta on 3 P4 PCs for the past 3 months and I've had just ONE case of the 1% bug so far on WinXP (plus some problems 2.5 months ago on a massively underspec'ed Linux, which have since been solved).


Credit should be granted for 'real' processor usage. Rosetta, unlike say Einstein, does not calibrate WU times.

The 1 percent error is annoying - so is the fact that Rosetta has yet to incorporate a calibration feature like, say, Einstein that grants credit where credit is deserved, not manipulated artificially.


Since you keep mentioning Einstein as a model to follow, where did you read that they do this kind of calibration? (web address please). My BOINC clients massively underclaim credit for Einstein (using akosf's app, my PCs complete a WU in 1/4th of the time it used to take). From looking at my results, Einstein just uses a quorum of 3 and grants the credit of the middle claim, e.g. wu6428418.
My BOINC's claim was for 13.99 credits, someone else's was 56, and all 3 of us received the middle one of 41 credits.
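
The "middle claim" rule described above amounts to taking the median of the three claims. A minimal sketch, assuming the third claim was the 41 credits that ended up being granted (the post only states two claims explicitly); the function name is hypothetical and this is not the project's actual validator code.

```python
import statistics
from typing import Sequence


def granted_credit(claims: Sequence[float]) -> float:
    """Grant every host in a quorum of 3 the middle (median) claim."""
    if len(claims) != 3:
        raise ValueError("this sketch only models a quorum of 3")
    return statistics.median(claims)


# 13.99 and 56 are from the post; 41 is inferred as the middle claim.
claims = [13.99, 56.0, 41.0]
print(granted_credit(claims))  # 41.0
```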

A project using a quorum of 3, 4, etc. is effectively cutting the usable CPU speed to 1/3rd or 1/4th of what was donated. I see this as the ultimate waste of donated resources, and personally I have stopped crunching for projects that did this just to appease credit-obsessed people, unless there was a valid scientific reason.
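
The throughput argument can be written out as simple arithmetic. This is a sketch with hypothetical numbers (1,200 returned results is an invented figure), and it ignores any scientific benefit of redundant computation.

```python
def effective_fraction(quorum: int) -> float:
    """Fraction of donated CPU time that yields distinct results."""
    if quorum < 1:
        raise ValueError("quorum must be at least 1")
    return 1 / quorum


def distinct_work_units(results_returned: int, quorum: int) -> int:
    """Distinct work units completed when each one is computed `quorum` times."""
    if quorum < 1:
        raise ValueError("quorum must be at least 1")
    return results_returned // quorum


# Hypothetical: 1,200 results returned by volunteers.
for q in (1, 3, 4):
    print(f"quorum {q}: {distinct_work_units(1200, q)} distinct WUs "
          f"({effective_fraction(q):.0%} of donated CPU time)")
```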

Anyway, afaik the "credit calibration" feature you mentioned is used in SETI-Beta and I hope Rosetta and other projects will use it as soon as it goes mainstream.

Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 12958 · Rating: 0
Michael Kirberger

Joined: 16 Dec 05
Posts: 1
Credit: 11,041
RAC: 0
Message 13715 - Posted: 14 Apr 2006, 10:44:09 UTC

Hello,

I think I have such a WU, too. After 2:28 h only 1.40 % of the work is done. If progress doesn't get faster, I will need between 500 and 600 h to complete this WU (I think yesterday I had 1.56 % after 5 h, but I did not reach a checkpoint, so the work started again; now I will run the computer 24 h a day until this WU is done). The WU is 7486_largescale_large_fullatom_relax_dec7486_1_02_9.pdb_435_36_0. Should I compute or abort this WU? If I compute, I want the credits for this work :-).

Bye

Michael Kirberger
ID: 13715 · Rating: 0
Profile Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 13723 - Posted: 14 Apr 2006, 15:17:52 UTC - in response to Message 13715.  

Hello,

I think I have such a WU, too. After 2:28 h only 1.40 % of the work is done. If progress doesn't get faster, I will need between 500 and 600 h to complete this WU (I think yesterday I had 1.56 % after 5 h, but I did not reach a checkpoint, so the work started again; now I will run the computer 24 h a day until this WU is done). The WU is 7486_largescale_large_fullatom_relax_dec7486_1_02_9.pdb_435_36_0. Should I compute or abort this WU? If I compute, I want the credits for this work :-).

Bye

Michael Kirberger


No, you don't have to abort this WU. Look in this thread about these big molecules. It will sit at about 1.5 - 2 % for a long time and then finish in a snap.

You'll need to let it stay in memory, and don't shut your computer down while it is running. I did that myself last night, and now I'm back to zero with the one I have in my cache at the moment. :-(


"I'm trying to maintain a shred of dignity in this world." - Me

ID: 13723 · Rating: 0



