Credit not granted for reissued tasks

Message boards : Number crunching : Credit not granted for reissued tasks

tng*

Joined: 28 Oct 05
Posts: 14
Credit: 5,389,798
RAC: 0
Message 41047 - Posted: 16 May 2007, 0:02:21 UTC
Last modified: 16 May 2007, 0:06:23 UTC

I hope this doesn't mean my results will not contribute to the science here:


Workunit 69100041

If I understand things correctly, the first result errored out; the second missed its deadline, so the task was reissued to my machine. The second result was then returned late, and when my machine's result came back it was discarded due to the limit on total results.

If it hadn't been for the limit on total results, I suppose it would have been
discarded due to the limit on success results.

Please don't interpret this as a plea to have my credits granted -- I've got plenty already. It just seems a shame to discard that work.

ID: 41047 · Rating: 0

dcdc
Joined: 3 Nov 05
Posts: 1831
Credit: 119,536,330
RAC: 6,139
Message 41052 - Posted: 16 May 2007, 7:30:38 UTC

I don't think it means that too many were returned (although it suggests that!). It looks like there's an error in the task:

<core_client_version>5.2.8</core_client_version>
<message>process got signal 5
</message>
<stderr_txt>
dyld: rosetta_5.62_powerpc-apple-darwin Undefined symbols:
rosetta_5.62_powerpc-apple-darwin undefined reference to ___cxa_atexit expected to be defined in /usr/lib/libSystem.B.dylib

</stderr_txt>

Danny
ID: 41052 · Rating: 1

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 41065 - Posted: 16 May 2007, 13:47:28 UTC

dcdc is looking at the first result, the one that did receive an error. tng's result is this one, which was sent to them just 14 minutes after the deadline was missed.
Rosetta Moderator: Mod.Sense
ID: 41065 · Rating: 0

dcdc
Joined: 3 Nov 05
Posts: 1831
Credit: 119,536,330
RAC: 6,139
Message 41067 - Posted: 16 May 2007, 14:24:02 UTC

I looked at the wrong task - my bad! But tng's task does say:

Report deadline 24 May 2007 1:28:45 UTC

which hasn't been reached yet, but it also says

Validate state: Workunit error - check skipped

ID: 41067 · Rating: 0

tng*
Joined: 28 Oct 05
Posts: 14
Credit: 5,389,798
RAC: 0
Message 41084 - Posted: 17 May 2007, 0:04:54 UTC - in response to Message 41067.  

I looked at the wrong task - my bad! But tng's task does say:

Report deadline 24 May 2007 1:28:45 UTC

which hasn't been reached yet, but it also says

Validate state: Workunit error - check skipped


On the workunit display it says:

errors Too many total results

I'm presuming that this is the "Workunit error" referred to, and relates to the

max # of error/total/success results 1, 2, 1

line just above it. What's annoying is that I'm using a 24-hr runtime on my
machines, in order to spare the project servers frequent updates and hopefully increase the amount of real science by reducing the overhead, but this would seem to have resulted in my machine's work being discarded.

I don't know what the project's exact requirements are, but I would think that
a delay in reissuing results which miss deadlines might prevent this sort of
thing. Up to the project to make such decisions, of course.
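
As a rough illustration of how the two limits interact, here is a toy replay of the sequence described in this thread. The rules in the docstring are assumptions made for the example and are not taken from BOINC's actual server code; the host names are placeholders.

```python
def replay(events, max_total=2, max_success=1):
    """Replay (host, action) events for one workunit and return each host's outcome.

    Toy rules (assumptions for illustration, not BOINC's actual logic):
    - each "issue" creates one result and counts toward the total;
    - a "success" is credited while fewer than max_success have been credited;
    - any later success is discarded, and the reason reported depends on
      whether the total number of results already exceeds max_total.
    The error limit (the first of the three numbers) is not modelled.
    """
    total = 0
    successes = 0
    outcome = {}
    for host, action in events:
        if action == "issue":
            total += 1
        elif action == "error":
            outcome[host] = "error"
        elif action == "success":
            if successes < max_success:
                successes += 1
                outcome[host] = "credited"
            elif total > max_total:
                outcome[host] = "discarded: too many total results"
            else:
                outcome[host] = "discarded: too many success results"
    return outcome


# Sequence described in this thread (host names are placeholders).
events = [
    ("host1", "issue"), ("host1", "error"),
    ("host2", "issue"),       # misses its deadline
    ("host3", "issue"),       # reissued copy
    ("host2", "success"),     # returned late, but first
    ("host3", "success"),     # returned second
]
print(replay(events))
```

Raising `max_total` to 3 in this toy model only changes the reason given for the third host: its result is still discarded, this time by the success limit.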
ID: 41084 · Rating: 0

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 41322 - Posted: 22 May 2007, 20:27:40 UTC

from problems with 5.64 thread:

1) Work Unit ID 70636999. My computer completed the work and returned the results and received granted credit of 0.00 for over 9228 seconds of work. Another computer received 29.25 credit granted for this workunit. What goes on?

2) The person before you was assigned the unit but did not complete it within the scheduled time. As such, the unit was sent to you. However, the person before you finished it late but returned it before you returned yours, so he got the credit.

3) So user 2 just wasted a bunch of CPU time and got no credit for it and missed getting credit for another work unit that he could have been doing in that time? That sounds a bit unfair, but I guess that's the breaks.
ID: 41322 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 41326 - Posted: 22 May 2007, 20:51:39 UTC

Link to this second occurrence of the issue. In both cases there was a failure on the first host, and a return after the deadline by the second host, which is granted credit before the third host can return the task.

I've emailed DK asking about this. I've suggested we find a way to not reissue the task until enough time has passed that no credit will be granted for the second host which is returning the result past the deadline.
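
A minimal sketch of what such a policy could look like, assuming a single grace period after the deadline. The function names, the grace period, and the dates are all hypothetical and only illustrate the idea; they are not an existing BOINC setting.

```python
from datetime import datetime, timedelta


def should_reissue(now, deadline, returned, grace):
    """True if an overdue, still-missing result should go to another host."""
    return (not returned) and now >= deadline + grace


def late_result_gets_credit(returned_at, deadline, grace):
    """True if a late result arrives before the reissue would have happened."""
    return returned_at < deadline + grace


# Hypothetical deadline and grace period, chosen only for illustration.
deadline = datetime(2007, 5, 14, 1, 0, 0)
grace = timedelta(days=2)

# 14 minutes after the deadline: too early to reissue under this policy.
print(should_reissue(deadline + timedelta(minutes=14), deadline, False, grace))
# After the grace period the task is reissued, and a late return earns nothing.
print(should_reissue(deadline + timedelta(days=3), deadline, False, grace))
print(late_result_gets_credit(deadline + timedelta(days=3), deadline, grace))
```

Because the same cutoff decides both questions, a late host and a reissued host can never both be in line for credit on the same task.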
Rosetta Moderator: Mod.Sense
ID: 41326 · Rating: 0

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 41327 - Posted: 22 May 2007, 23:32:42 UTC - in response to Message 41326.  

Link to this second occurrence of the issue. In both cases there was a failure on the first host, and a return after the deadline by the second host, which is granted credit before the third host can return the task.

I've emailed DK asking about this. I've suggested we find a way to not reissue the task until enough time has passed that no credit will be granted for the second host which is returning the result past the deadline.



Perfectly logical thing to have written into your programming - thanks for the note.
ID: 41327 · Rating: 0

ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 41335 - Posted: 23 May 2007, 9:06:00 UTC - in response to Message 41326.  

For the record:
Rosetta does not seem to be consistent in such matters; cf. this task (I summarize the facts, since I assume that the task is no longer available on the internet), where both clients involved received their credits:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=65291520:

created 13 Apr 2007 7:58:26 UTC
name SEARCH_PAIRINGS_-1bk2_-_filters_1646_9159

Result ID 72841013
Computer 298408
Sent 13 Apr 2007 7:59:20 UTC
Time reported or deadline 24 Apr 2007 9:32:14 UTC [[more than 10 days]]
Server state Over
Outcome Success
Client state Done
CPU time (sec) 10,645.89
claimed credit 17.88 [[This process generated 19 decoys from 19 attempts]]
granted credit 24.65

Result ID 74661774
Computer 411681
Sent 23 Apr 2007 7:59:47 UTC
Time reported or deadline 25 Apr 2007 8:39:54 UTC
Server state Over
Outcome Success
Client state Done
CPU time (sec) 5,769.85
claimed credit 8.15 [[This process generated 4 decoys from 4 attempts]]
granted credit 5.19

-- R. A. Mostol
ID: 41335 · Rating: 0

dcdc
Joined: 3 Nov 05
Posts: 1831
Credit: 119,536,330
RAC: 6,139
Message 41337 - Posted: 23 May 2007, 9:53:43 UTC

Credit is assigned per decoy to ensure the computer's processing speed is credited sufficiently.

Granted credit / number of decoys:

comp1: 24.65/19 = 1.2974 credits per decoy
comp2: 5.19/4 = 1.2975 credits per decoy

The Pentium4 gets more credits per second as Rosetta is better suited to the P4 architecture at the moment.
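
For anyone who wants to check the arithmetic, here is a short sketch using the granted credit, decoy counts, and CPU times from the two results quoted earlier in the thread. The function name is made up for the example.

```python
def credit_rates(granted, decoys, cpu_seconds):
    """Return (credits per decoy, credits per CPU hour) for one result."""
    return granted / decoys, granted / cpu_seconds * 3600


# Figures from the two results quoted earlier in the thread:
# (granted credit, decoys, CPU seconds)
results = {
    "computer 298408": (24.65, 19, 10645.89),
    "computer 411681": (5.19, 4, 5769.85),
}

for name, (granted, decoys, cpu_seconds) in results.items():
    per_decoy, per_hour = credit_rates(granted, decoys, cpu_seconds)
    print(f"{name}: {per_decoy:.4f} credits/decoy, {per_hour:.2f} credits/CPU-hour")
```

The per-decoy figures come out nearly equal, while the per-hour figures differ because one machine produced its decoys faster.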
ID: 41337 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 41342 - Posted: 23 May 2007, 14:26:41 UTC

dcdc, I don't think the question was about the amount of credit issued. But the link doesn't bring up a WU anymore. I think the point was that in this case two people received credit, which doesn't seem consistent with the setting of only one successful result being allowed.
Rosetta Moderator: Mod.Sense
ID: 41342 · Rating: 0

dcdc
Joined: 3 Nov 05
Posts: 1831
Credit: 119,536,330
RAC: 6,139
Message 41344 - Posted: 23 May 2007, 15:01:15 UTC

I'm not doing too well on this thread! Didn't read it in context ;)

I always thought two success results for a given task was quite common... anyway tng's suggestion of a delay before reissuing would probably be wise, but why limit successes at all? Is that to stop cheating through repeat-reporting of the same result on multiple computers?
ID: 41344 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 41347 - Posted: 23 May 2007, 15:49:51 UTC

Actually, that brings us to a point I thought might be useful to make here anyway. A specific WU generated on the server has a specific random number seed incorporated into it. Normally, a specific random number seed will only be sent to a single machine. Having said that, many, many unique WUs are created for a given protein or RNA being studied.

So, normally every machine is crunching along a unique path. But, in cases where one host gets an error on a specific WU, it will be reissued to another host, with the same random number seed. This is useful to the project, because it can help them pinpoint if a given error was specific to a given operating system or CPU time, to a specific WU, or due to how a specific random number gets processed. This reissue of the identical WU and random seed can also occur when the first host it was sent to does not respond before the deadline.

So, in all of the cases in this thread, one of those events occurred and the WU was actually crunched by more than one machine, which is fairly rare. Yet we all work together to crunch thousands of different models for the protein, via thousands of unique random number seeds in unique WUs.

This is where we should all get more careful about using the word "result" as compared to "work unit". Since each work unit normally only has one result on Rosetta, we get careless. Look at the URL closely: it has the word "wuid" or "resultid" in it. The "result" is for a specific host. The "work unit" has the unique random number seed incorporated into it.
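
To make the distinction concrete, here is a small sketch that reads the query string of a link and says which kind of ID it carries. The workunit URL is the one posted earlier in this thread; the result URL (the `result.php` path in particular) is an assumed pattern used only for the example.

```python
from urllib.parse import urlparse, parse_qs


def describe(url):
    """Return ("work unit" | "result" | "unknown", id) based on the query string."""
    query = parse_qs(urlparse(url).query)
    if "wuid" in query:
        return ("work unit", int(query["wuid"][0]))
    if "resultid" in query:
        return ("result", int(query["resultid"][0]))
    return ("unknown", None)


# Workunit link as posted in this thread.
print(describe("https://boinc.bakerlab.org/rosetta/workunit.php?wuid=65291520"))
# Assumed URL shape for a single host's result (path is a guess).
print(describe("https://boinc.bakerlab.org/rosetta/result.php?resultid=72841013"))
```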
Rosetta Moderator: Mod.Sense
ID: 41347 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 41348 - Posted: 23 May 2007, 15:54:13 UTC - in response to Message 41344.  

...why limit successes at all? Is that to stop cheating through repeat-reporting of the same result on multiple computers?


Actually, that would not be possible. You cannot return results from a machine they were never issued to.

So, simply increasing the number of successes allowed would be another approach to resolve the problem. But that approach has a slight disadvantage because two hosts crunched exactly the same models. One might have followed a given random seed further (i.e. crunched for a longer runtime, or on a faster CPU) but the first models would all be identical. So the approach of not reissuing the WU so immediately after the deadline would be slightly better.
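
The overlap can be seen with any seeded pseudo-random generator. The sketch below uses Python's `random` module as a stand-in; `crunch` and the seed value are hypothetical, and the numbers it produces have nothing to do with real Rosetta models.

```python
import random


def crunch(seed, n_models):
    """Stand-in for a run: produce n_models pseudo-'models' from one seed."""
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(n_models)]


# Two hosts receive the same work unit, and therefore the same seed.
host_a = crunch(seed=1646, n_models=4)    # shorter run
host_b = crunch(seed=1646, n_models=19)   # longer run or faster CPU

# The longer run repeats everything the shorter run produced, then continues.
print(host_b[:4] == host_a)
print(len(host_b) - len(host_a), "models were new")
```

Only the models past the shorter run's stopping point add anything, which is why a second success on the same seed is worth less than a result from a fresh work unit.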
Rosetta Moderator: Mod.Sense
ID: 41348 · Rating: 0

Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 41377 - Posted: 24 May 2007, 8:00:13 UTC

This issue has come up before and was discussed back in February here; David Kim said they were going to look into it, but the settings still seem to be the same.


ID: 41377 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46296 - Posted: 15 Sep 2007, 18:29:44 UTC

There is a BOINC trac item for this issue.

DK, has any change been made to the validator specifically for Rosetta? You had mentioned perhaps you would do so here.
Rosetta Moderator: Mod.Sense
ID: 46296 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org