Workunit error - check skipped?

Tom Philippart
Joined: 29 May 06
Posts: 183
Credit: 834,667
RAC: 0
Message 35752 - Posted: 30 Jan 2007, 13:19:47 UTC

Could anyone please explain to me what happened here?
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=50835993

This is my result:
https://boinc.bakerlab.org/rosetta/result.php?resultid=59077178

Thanks

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 35761 - Posted: 30 Jan 2007, 17:46:08 UTC

I haven't seen that one before. I'll look into it.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 35767 - Posted: 30 Jan 2007, 20:03:13 UTC

Tom,
We purged the logs last night so I couldn't track down the error. It may have been a corrupted result file. I granted the claimed credit for you.

Tom Philippart
Joined: 29 May 06
Posts: 183
Credit: 834,667
RAC: 0
Message 35774 - Posted: 30 Jan 2007, 21:19:47 UTC - in response to Message 35767.  

thanks!

Monkey
Joined: 14 Nov 06
Posts: 1
Credit: 1,001,689
RAC: 0
Message 35899 - Posted: 1 Feb 2007, 11:26:43 UTC

I seem to have the same problem. It seems that I was the only one that did the workunit.

Workunit:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=50393160

Result:
https://boinc.bakerlab.org/rosetta/result.php?resultid=60300613

Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 35913 - Posted: 1 Feb 2007, 14:29:56 UTC - in response to Message 35899.  
Last modified: 1 Feb 2007, 14:49:56 UTC

I seem to have the same problem. It seems that I was the only one that did the workunit.

Workunit:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=50393160

The "max # of total results" is set to 2, which was probably used up by the first two results that never got returned. I guess "No reply" results don't count as errors; if they did, the 3rd result would never have been sent out. If that's the case, I'd think a "max # of total results" of 2 is too low.

The WU in Tom Philippart's post (WU 50835993) went funny because the second result was not returned by its deadline, so a 3rd copy was sent out. Before the 3rd was returned, the second result came back late and passed validation. That hit the "max # of success results", so the 3rd was rejected.

IMHO, if results returned after their deadline are accepted, the "max # of success results" must be higher than 1. And "max # of total results" probably needs to be higher than 2.
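
The sequence described above for WU 50835993 can be sketched in a few lines of Python. This is only a simplified model of the limit, not the project's actual validator code: the `Result` class, the `grant_credit` function and the return times are all hypothetical, and the only thing taken from the thread is that "max # of success results" is 1.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    returned_at: float  # hours after the WU was created (hypothetical values)
    valid: bool = True


def grant_credit(results, max_success_results=1):
    """Return {result name: outcome}, processing results in order of return.

    Simplified model (an assumption, not real server code): once
    max_success_results valid results are in, every later result is
    rejected, even if it would have validated.
    """
    outcomes = {}
    successes = 0
    for r in sorted(results, key=lambda r: r.returned_at):
        if successes >= max_success_results:
            outcomes[r.name] = "rejected (success limit already hit)"
        elif r.valid:
            successes += 1
            outcomes[r.name] = "credited"
        else:
            outcomes[r.name] = "invalid"
    return outcomes


# The late second result comes back before the third copy is returned.
results = [
    Result("second (late)", returned_at=250.0),
    Result("third", returned_at=260.0),
]
print(grant_credit(results, max_success_results=1))
print(grant_credit(results, max_success_results=2))
```

With a limit of 1 the third result is rejected in this model; with a limit of 2 both results are credited, which is the change being suggested.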

Tom Philippart
Joined: 29 May 06
Posts: 183
Credit: 834,667
RAC: 0
Message 35914 - Posted: 1 Feb 2007, 15:06:26 UTC

not the same problem, but look at this:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=51477081

again the same computer, I hope it isn't on my side though :(

Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 35916 - Posted: 1 Feb 2007, 15:13:11 UTC - in response to Message 35914.  
Last modified: 1 Feb 2007, 15:13:26 UTC

not the same problem, but look at this:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=51477081

again the same computer, I hope it isn't on my side though :(

I've got loads of WUs I've returned today that are all stuck in Pending state. The credit for my team hasn't increased at all since before 0800 UTC.

alpha
Joined: 4 Nov 06
Posts: 27
Credit: 1,550,107
RAC: 0
Message 35928 - Posted: 1 Feb 2007, 19:07:30 UTC

I've got a validate error on this result, which is this work unit.

It seems to have appeared out of the blue on a very stable machine. The only thing I noticed is that the other two computers that were given this WU generated a "client error" and "unknown" outcome.

alpha
Joined: 4 Nov 06
Posts: 27
Credit: 1,550,107
RAC: 0
Message 36083 - Posted: 4 Feb 2007, 10:58:45 UTC

Bump.

Can someone advise why I wasn't credited for the above work unit?

Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 36086 - Posted: 4 Feb 2007, 11:11:30 UTC

The first result errored, which should have killed the entire WU, but two more results were sent out after that. When you returned your result (assuming it was valid), you didn't get any credit because the max # of error results had been hit.

The time period is close to when the validator server failed so maybe that's why the extra two results were sent out.


I still think the settings for max # of error/total/success results are set too low on all WUs. Is a project admin going to respond to my points in this post?
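
To make the error-limit argument concrete, here is a rough Python sketch. It is an assumption-laden model rather than the real server logic: `workunit_outcome` is a hypothetical helper, and the limit values of 1 and 3 are illustrative because the thread does not state the actual "max # of error results".

```python
def workunit_outcome(result_states, max_error_results=1):
    """Model how an error limit affects later results.

    result_states: list of "error" or "valid", in the order results return.
    Returns (wu_failed, granted), where granted[i] says whether result i
    would get credit in this simplified model.
    """
    errors = 0
    wu_failed = False
    granted = []
    for state in result_states:
        if state == "error":
            errors += 1
            if errors >= max_error_results:
                wu_failed = True  # the whole WU is marked as failed
            granted.append(False)
        else:
            # A valid result only earns credit if the WU has not failed yet.
            granted.append(not wu_failed)
    return wu_failed, granted


# First result errors, a later valid result comes back.
print(workunit_outcome(["error", "valid"], max_error_results=1))
print(workunit_outcome(["error", "valid"], max_error_results=3))
```

In this model a limit of 1 leaves the later valid result without credit, while a higher limit lets it through.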

anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 36087 - Posted: 4 Feb 2007, 12:40:47 UTC - in response to Message 36083.  

Bump.

Can someone advise why I wasn't credited for the above work unit?

If you look at the bottom of the result page, you see this text:

-Validate state Invalid
-Claimed credit 20.7750264779357
-Granted credit 20
-application version 5.45


As you see, you got 20 credits for the WU :)

Happy crunching

Anders n

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 36106 - Posted: 4 Feb 2007, 18:40:30 UTC

The 20 credits is the maximum the daily credit granting script allows. The fact that you see the credits on the result page, but not in the WU display, further confirms that these credits were granted by the daily script.
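
As a worked example of that cap, assuming the script simply grants the claimed credit up to the maximum (the thread only states the 20-credit maximum and the claimed and granted values; the function name here is hypothetical):

```python
DAILY_SCRIPT_CAP = 20.0  # maximum stated in this thread


def script_granted_credit(claimed, cap=DAILY_SCRIPT_CAP):
    """Assumed rule: grant the claimed credit, but never more than the cap."""
    return min(claimed, cap)


# Claimed credit from the result page quoted earlier in the thread.
print(script_granted_credit(20.7750264779357))
```

For the claimed value of 20.7750264779357 this gives 20, matching the granted credit shown on the result page.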

About the time of your reported problem, the project's validator went down, so I suspect that is why your result failed to validate properly.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 36120 - Posted: 4 Feb 2007, 20:41:28 UTC - in response to Message 36086.  

The first result errored, which should have killed the entire WU, but two more results were sent out after that. When you returned your result (assuming it was valid), you didn't get any credit because the max # of error results had been hit.

The time period is close to when the validator server failed so maybe that's why the extra two results were sent out.


I still think the settings for max # of error/total/success results are set too low on all WUs. Is a project admin going to respond to my points in this post?


The result in question was invalid because it may have been corrupted for some reason and/or the validator was not able to read the result file.

We set the max #s low because we like to keep the lifespan of work units to a minimum without having to decrease the delay bound (since users have requested a longer delay bound). It does seem odd to us that the scheduler may send more results than the max # of total results though. After we update the server this week, it may help to start using the reliable_time scheduler option, which attempts to send old results to reliable hosts. Maybe with this option, we could increase the max #s.
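
The trade-off between the max #s and the delay bound can be illustrated with some rough arithmetic. This sketch assumes the worst case, where each copy is only sent after the previous one has run out its full delay bound; the 10-day delay bound is a hypothetical value, not the project's real setting, and `worst_case_lifespan_days` is an illustrative helper.

```python
def worst_case_lifespan_days(max_total_results, delay_bound_days):
    """Upper bound on WU lifespan if every copy times out in turn.

    Assumes copies are issued one after another, each only after the
    previous copy has missed its deadline.
    """
    return max_total_results * delay_bound_days


# Hypothetical 10-day delay bound.
for max_total in (2, 3, 4):
    print(max_total, worst_case_lifespan_days(max_total, 10))
```

Under these assumptions, raising the max # of total results from 2 to 4 doubles the worst-case lifespan unless the delay bound is halved.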

Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 36122 - Posted: 4 Feb 2007, 21:18:00 UTC - in response to Message 36120.  

We set the max #s low because we like to keep the lifespan of work units to a minimum without having to decrease the delay bound (since users have requested a longer delay bound). It does seem odd to us that the scheduler may send more results than the max # of total results though. After we update the server this week, it may help to start using the reliable_time scheduler option, which attempts to send old results to reliable hosts. Maybe with this option, we could increase the max #s.

I can understand wanting to avoid WUs erroring out many times, but what about results that get returned just 1 hour after the deadline? The current settings will mean a second result will already have been sent out, but it won't get any credit because the late result gets it.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 36171 - Posted: 5 Feb 2007, 19:16:03 UTC

This is definitely an issue. It would be nice if the scheduler just didn't send out the third result. After the server update, we'll look into a fix. We may just modify the validator.



©2024 University of Washington
https://www.bakerlab.org