Rosetta@home

Credit not granted for reissued tasks

Message boards : Number crunching : Credit not granted for reissued tasks

tng

Joined: Oct 28 05
Posts: 14
ID: 7187
Credit: 5,245,667
RAC: 774
Message 41047 - Posted 16 May 2007 0:02:21 UTC
Last modified: 16 May 2007 0:06:23 UTC

I hope this doesn't mean my results will not contribute to the science here:


Workunit 69100041

If I understand things correctly, the first result errored out; the second missed the deadline, so the workunit was reissued to my machine; the second was then returned late; then my machine's result was returned, but was discarded due to the limit on total results.

If it hadn't been for the limit on total results, I suppose it would have been
discarded due to the limit on success results.

Please don't interpret this as a plea to have my credits granted -- I've got plenty already. It just seems a shame to discard that work.

____________

dcdc

Joined: Nov 3 05
Posts: 1160
ID: 8948
Credit: 3,331,476
RAC: 3,636
Message 41052 - Posted 16 May 2007 7:30:38 UTC

I don't think it means that too many were returned (although it suggests that!); it looks like there's an error in the task:

<core_client_version>5.2.8</core_client_version>
<message>process got signal 5
</message>
<stderr_txt>
dyld: rosetta_5.62_powerpc-apple-darwin Undefined symbols:
rosetta_5.62_powerpc-apple-darwin undefined reference to ___cxa_atexit expected to be defined in /usr/lib/libSystem.B.dylib

</stderr_txt>

Danny
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2200
ID: 106194
Credit: 0
RAC: 0
Message 41065 - Posted 16 May 2007 13:47:28 UTC

dcdc is looking at the first result, the one that did receive an error. tng's result is this one, which was sent to them just 14 min. after the deadline was missed.
____________
Rosetta Moderator: Mod.Sense

dcdc

Joined: Nov 3 05
Posts: 1160
ID: 8948
Credit: 3,331,476
RAC: 3,636
Message 41067 - Posted 16 May 2007 14:24:02 UTC

I looked at the wrong task - my bad! But tng's task does say:

Report deadline 24 May 2007 1:28:45 UTC

which hasn't been reached yet, but it also says

Validate state: Workunit error - check skipped

____________

tng

Joined: Oct 28 05
Posts: 14
ID: 7187
Credit: 5,245,667
RAC: 774
Message 41084 - Posted 17 May 2007 0:04:54 UTC - in response to Message ID 41067.

I looked at the wrong task - my bad! But tng's task does say:

Report deadline 24 May 2007 1:28:45 UTC

which hasn't been reached yet, but it also says

Validate state: Workunit error - check skipped


On the workunit display it says:

errors Too many total results

I'm presuming that this is the "Workunit error" referred to, and relates to the

max # of error/total/success results 1, 2, 1

line just above it. What's annoying is that I'm using a 24-hr runtime on my
machines, in order to spare the project servers frequent updates and hopefully increase the amount of real science by reducing the overhead, but this would seem to have resulted in my machine's work being discarded.

I don't know what the project's exact requirements are, but I would think that
a delay in reissuing results which miss deadlines might prevent this sort of
thing. Up to the project to make such decisions, of course.
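
A rough sketch of how the three limits quoted above could produce that message. This is a simplified, hypothetical model, not the BOINC server code: the names `WorkunitLimits` and `workunit_errors` are made up for illustration, and it assumes a limit is violated as soon as the corresponding count exceeds it.

```python
from dataclasses import dataclass


@dataclass
class WorkunitLimits:
    # "max # of error/total/success results" from the workunit page
    max_error: int
    max_total: int
    max_success: int


def workunit_errors(limits, outcomes):
    """Return the limit violations for a list of result outcomes.

    Each outcome is "error", "success", or "pending" (sent, not yet returned).
    Assumption: a limit is violated when the count exceeds it.
    """
    n_error = outcomes.count("error")
    n_success = outcomes.count("success")
    n_total = len(outcomes)  # every issued result counts toward the total

    errors = []
    if n_error > limits.max_error:
        errors.append("Too many error results")
    if n_total > limits.max_total:
        errors.append("Too many total results")
    if n_success > limits.max_success:
        errors.append("Too many success results")
    return errors


limits = WorkunitLimits(max_error=1, max_total=2, max_success=1)

# First result errored, second was late, third is the reissued copy.
print(workunit_errors(limits, ["error", "pending", "pending"]))
# Same workunit once both the late result and the reissued one are back.
print(workunit_errors(limits, ["error", "success", "success"]))
```

Under these assumptions the third issued result alone is enough to trip the total limit, and once both successes are in the success limit would be exceeded as well.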
____________

Greg_BE

Joined: May 30 06
Posts: 4185
ID: 85645
Credit: 618,929
RAC: 342
Message 41322 - Posted 22 May 2007 20:27:40 UTC

From the problems with 5.64 thread:

1) Work Unit ID 70636999. My computer completed the work, returned the results, and received granted credit of 0.00 for over 9228 seconds of work. Another computer received 29.25 credit granted for this workunit. What goes on?

2) The person before you was assigned the unit but did not complete it within the scheduled time. As such, the unit was sent to you. However, the person before you finished it late but returned it before you returned yours, so he got the credit.

3) So user 2 just wasted a bunch of CPU time, got no credit for it, and missed getting credit for another work unit that he could have been doing in that time? That sounds a bit unfair, but I guess that's the breaks.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2200
ID: 106194
Credit: 0
RAC: 0
Message 41326 - Posted 22 May 2007 20:51:39 UTC

Link to this second occurrence of the issue. In both cases there was a failure on the first host, and a return after the deadline by the second host, which is granted credit before the third host can return the task.

I've emailed DK asking about this. I've suggested we find a way to not reissue the task until enough time has passed that no credit will be granted for the second host which is returning the result past the deadline.
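
A minimal sketch of that suggestion, assuming the server keeps some grace window after the deadline during which a late result can still be credited. The function name `should_reissue`, the one-day grace period, and the dates are all hypothetical; the thread does not say what the real window is.

```python
from datetime import datetime, timedelta


def should_reissue(now, deadline, grace):
    """Reissue only once a late result could no longer be credited.

    Assumption: a late result is still eligible for credit until
    deadline + grace, so a replacement is held back until then.
    """
    return now >= deadline + grace


# Hypothetical deadline and grace period, for illustration only.
deadline = datetime(2007, 5, 10, 12, 0, 0)
grace = timedelta(days=1)

# 14 minutes after a missed deadline: hold the task back for now.
print(should_reissue(deadline + timedelta(minutes=14), deadline, grace))
# After the grace window has passed: safe to send to another host.
print(should_reissue(deadline + timedelta(days=1, minutes=1), deadline, grace))
```

With a rule like this the third host would never start work that a late second host can still make redundant.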
____________
Rosetta Moderator: Mod.Sense

Greg_BE

Joined: May 30 06
Posts: 4185
ID: 85645
Credit: 618,929
RAC: 342
Message 41327 - Posted 22 May 2007 23:32:42 UTC - in response to Message ID 41326.

Link to this second occurrence of the issue. In both cases there was a failure on the first host, and a return after the deadline by the second host, which is granted credit before the third host can return the task.

I've emailed DK asking about this. I've suggested we find a way to not reissue the task until enough time has passed that no credit will be granted for the second host which is returning the result past the deadline.



Perfectly logical thing to have written into your programming - thanks for the note.
____________

ramostol

Joined: Feb 6 07
Posts: 61
ID: 145835
Credit: 145,600
RAC: 363
Message 41335 - Posted 23 May 2007 9:06:00 UTC - in response to Message ID 41326.

For the record:
Rosetta does not seem to be consistent in such matters; cf. this task (I summarize the facts, since I assume that the task is no longer available on the internet), where both clients involved received their credits:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=65291520:

created 13 Apr 2007 7:58:26 UTC
name SEARCH_PAIRINGS_-1bk2_-_filters_1646_9159

Result ID 72841013
Computer 298408
Sent 13 Apr 2007 7:59:20 UTC
Time reported or deadline 24 Apr 2007 9:32:14 UTC [[more than 10 days]]
Server state Over
Outcome Success
Client state Done
CPU time (sec) 10,645.89
claimed credit 17.88 [[This process generated 19 decoys from 19 attempts]]
granted credit 24.65

Result ID 74661774
Computer 411681
Sent 23 Apr 2007 7:59:47 UTC
Time reported or deadline 25 Apr 2007 8:39:54 UTC
Server state Over
Outcome Success
Client state Done
CPU time (sec) 5,769.85
claimed credit 8.15 [[This process generated 4 decoys from 4 attempts]]
granted credit 5.19

-- R. A. Mostol

dcdc

Joined: Nov 3 05
Posts: 1160
ID: 8948
Credit: 3,331,476
RAC: 3,636
Message 41337 - Posted 23 May 2007 9:53:43 UTC

Credit is assigned per decoy to ensure the computer's processing speed is credited sufficiently.

Granted credit / number of decoys:

comp1: 24.65/19 = 1.2974 credits per decoy
comp2: 5.19/4 = 1.2975 credits per decoy

The Pentium4 gets more credits per second as Rosetta is better suited to the P4 architecture at the moment.
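
The same arithmetic as a short sketch, using the granted credit, decoy counts, and CPU times from the two results quoted above. The helper name `credit_per_decoy` is made up for illustration, and the script does not know which host is the Pentium 4.

```python
def credit_per_decoy(granted_credit, decoys):
    """Granted credit divided by the number of decoys produced."""
    return granted_credit / decoys


# (granted credit, decoys, CPU seconds) for the two results quoted earlier
results = {
    "Result 72841013": (24.65, 19, 10645.89),
    "Result 74661774": (5.19, 4, 5769.85),
}

for name, (granted, decoys, cpu_sec) in results.items():
    per_decoy = credit_per_decoy(granted, decoys)
    per_hour = granted / cpu_sec * 3600
    print(f"{name}: {per_decoy:.4f} credits/decoy, {per_hour:.2f} credits/CPU-hour")
```

Both hosts come out at about 1.297 credits per decoy, while the credit earned per CPU-hour differs because one host finishes each decoy much faster.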
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2200
ID: 106194
Credit: 0
RAC: 0
Message 41342 - Posted 23 May 2007 14:26:41 UTC

dcdc, I don't think the question was about the amount of credit issued. But the link doesn't bring up a WU anymore. I think the point was that in this case two people received credit, which doesn't seem consistent with the setting of only one successful result being allowed.
____________
Rosetta Moderator: Mod.Sense

dcdc

Joined: Nov 3 05
Posts: 1160
ID: 8948
Credit: 3,331,476
RAC: 3,636
Message 41344 - Posted 23 May 2007 15:01:15 UTC

I'm not doing too well on this thread! Didn't read it in context ;)

I always thought two success results for a given task was quite common... anyway tng's suggestion of a delay before reissuing would probably be wise, but why limit successes at all? Is that to stop cheating through repeat-reporting of the same result on multiple computers?
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2200
ID: 106194
Credit: 0
RAC: 0
Message 41347 - Posted 23 May 2007 15:49:51 UTC

Actually, that brings us to a point I thought might be useful to make here anyway. A specific WU generated on the server has a specific random number seed incorporated into it. Normally, a specific random number seed will only be sent to a single machine. Having said that, many, many unique WUs are created for a given protein or RNA being studied.

So, normally every machine is crunching along a unique path. But, in cases where one host gets an error on a specific WU, it will be reissued to another host, with the same random number seed. This is useful to the project, because it can help them pinpoint if a given error was specific to a given operating system or CPU time, to a specific WU, or due to how a specific random number gets processed. This reissue of the identical WU and random seed can also occur when the first host it was sent to does not respond before the deadline.

So, in all of the cases in this thread, one of those events occurred and the WU was actually crunched by more than one machine, which is fairly rare. Yet we all work together to crunch thousands of different models for the protein, via thousands of unique random number seeds in unique WUs.

This is where we should all get more careful about using the word "result" as compared to "work unit". Since each work unit normally only has one result on Rosetta, we get careless. Look at the URL closely: it has the word "wuid" or "resultid" in it. The "result" is for a specific host. The "work unit" has the unique random number seed incorporated into it.
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2200
ID: 106194
Credit: 0
RAC: 0
Message 41348 - Posted 23 May 2007 15:54:13 UTC - in response to Message ID 41344.

...why limit successes at all? Is that to stop cheating through repeat-reporting of the same result on multiple computers?


Actually, that would not be possible. You cannot return results from a machine they were never issued to.

So, simply increasing the number of successes allowed would be another approach to resolve the problem. But that approach has a slight disadvantage because two hosts crunched exactly the same models. One might have followed a given random seed further (i.e. crunched for a longer runtime, or on a faster CPU) but the first models would all be identical. So the approach of not reissuing the WU so immediately after the deadline would be slightly better.
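
A toy illustration of why the first models would be identical, using Python's `random` module as a stand-in for Rosetta's own generator. The function `crunch_models`, the seed value, and the idea of one random number per "model" are assumptions made for the sketch, not how Rosetta actually builds decoys.

```python
import random


def crunch_models(seed, n_models):
    """Produce n_models stand-in "models" from one random number seed.

    Assumption: each model is represented by a single random float; a real
    decoy is far more complex, but the seeding behaviour is the same idea.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_models)]


# Two hosts receive the same WU (same seed) but run for different lengths.
host_a = crunch_models(seed=1646, n_models=4)
host_b = crunch_models(seed=1646, n_models=19)

# The longer run simply continues further along the same path.
print(host_b[:4] == host_a)
```

The second host only adds new information past the point where the first one stopped, which is why a reissued duplicate contributes less than a fresh WU with its own seed.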
____________
Rosetta Moderator: Mod.Sense

Marky-UK

Joined: Nov 1 05
Posts: 73
ID: 8117
Credit: 1,294,380
RAC: 861
Message 41377 - Posted 24 May 2007 8:00:13 UTC

This issue has come up before and was discussed back in February here; David Kim said they were going to look into it, but the settings still seem to be the same.


____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2200
ID: 106194
Credit: 0
RAC: 0
Message 46296 - Posted 15 Sep 2007 18:29:44 UTC

There is a BOINC trac item for this issue.

DK, has any change been made to the validator specifically for Rosetta? You had mentioned perhaps you would do so here.
____________
Rosetta Moderator: Mod.Sense




Copyright © 2010 University of Washington

Last Modified: 3 Dec 2007 20:36:19 UTC