Message boards : Number crunching : Credit not granted for reissued tasks
Author | Message |
---|---|
tng* Send message Joined: 28 Oct 05 Posts: 14 Credit: 5,389,798 RAC: 0 |
I hope this doesn't mean my results will not contribute to the science here: Workunit 69100041 If I understand things correctly, the first result errored out, the second missed the deadline and was reissued to my machine, the second was then returned, then my machine's result was returned, but was discarded due to the limit on total results. If it hadn't been for the limit on total results, I suppose it would have been discarded due to the limit on success results. Please don't interpret this as a plea to have my credits granted -- I've got plenty already. It just seems a shame to discard that work. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,627,225 RAC: 11,586 |
I don't think it means that too many were returned (although it suggests that!)- it looks like there's an error in the task: <core_client_version>5.2.8</core_client_version> <message>process got signal 5 </message> <stderr_txt> dyld: rosetta_5.62_powerpc-apple-darwin Undefined symbols: rosetta_5.62_powerpc-apple-darwin undefined reference to ___cxa_atexit expected to be defined in /usr/lib/libSystem.B.dylib </stderr_txt> Danny |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
dcdc is looking at the first result, the one that did receive an error. tng's result is this one which was sent to them just 14min. after the deadline was missed. Rosetta Moderator: Mod.Sense |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,627,225 RAC: 11,586 |
I looked at the wrong task - my bad! But tng's task does say: Report deadline 24 May 2007 1:28:45 UTC, which hasn't been reached yet, but it also says: Validate state: Workunit error - check skipped |
tng* Send message Joined: 28 Oct 05 Posts: 14 Credit: 5,389,798 RAC: 0 |
Quoting dcdc: "i looked at the wrong task - my bad! But tng's task does say:" On the workunit display it says: "errors: Too many total results". I'm presuming that this is the "Workunit error" referred to, and that it relates to the "max # of error/total/success results: 1, 2, 1" line just above it. What's annoying is that I'm using a 24-hr runtime on my machines, in order to spare the project servers frequent updates and hopefully increase the amount of real science by reducing the overhead, but this would seem to have resulted in my machine's work being discarded. I don't know what the project's exact requirements are, but I would think that a delay in reissuing results which miss deadlines might prevent this sort of thing. Up to the project to make such decisions, of course. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
From the "problems with 5.64" thread: 3) So user 2 just wasted a bunch of CPU time and got no credit for it and missed getting credit for another work unit that he could have been doing in that time? That sounds a bit unfair, but I guess that's the breaks. 1) Work Unit ID 70636999. My computer completed the work and returned the results and received granted credit of 0.00 for over 9228 seconds of work. Another computer received 29.25 credit granted for this workunit. What goes on? 2) The person before you was assigned the unit but did not complete it within the scheduled time. As such, the unit was sent to you. However, the person before you finished it late but returned it before you returned yours, so he got the credit. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Link to this second occurrence of the issue. In both cases there was a failure on the first host, and a return after the deadline by the second host, which is granted credit before the third host can return the task. I've emailed DK asking about this. I've suggested we find a way to not reissue the task until enough time has passed that no credit will be granted for the second host which is returning the result past the deadline. Rosetta Moderator: Mod.Sense |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Link to this second occurrence of the issue. In both cases there was a failure on the first host, and a return after the deadline by the second host, which is granted credit before the third host can return the task. Perfectly logical thing to have written into your programming - thanks for the note. |
ramostol Send message Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
For the record: Rosetta does not seem to be consistent in such matters; cf. this task (I summarize the facts, since I assume that the task is no longer available on the internet), where both clients involved received their credits. Workunit https://boinc.bakerlab.org/rosetta/workunit.php?wuid=65291520 : created 13 Apr 2007 7:58:26 UTC; name SEARCH_PAIRINGS_-1bk2_-_filters_1646_9159. Result ID 72841013: Computer 298408; Sent 13 Apr 2007 7:59:20 UTC; Time reported or deadline 24 Apr 2007 9:32:14 UTC [[more than 10 days]]; Server state Over; Outcome Success; Client state Done; CPU time (sec) 10,645.89; claimed credit 17.88 [[This process generated 19 decoys from 19 attempts]]; granted credit 24.65. Result ID 74661774: Computer 411681; Sent 23 Apr 2007 7:59:47 UTC; Time reported or deadline 25 Apr 2007 8:39:54 UTC; Server state Over; Outcome Success; Client state Done; CPU time (sec) 5,769.85; claimed credit 8.15 [[This process generated 4 decoys from 4 attempts]]; granted credit 5.19. -- R. A. Mostol |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,627,225 RAC: 11,586 |
Credit is assigned per decoy to ensure the computer's processing speed is credited sufficiently. Granted credit / number of decoys: comp1: 24.65/19 = 1.2974 credits per decoy; comp2: 5.19/4 = 1.2975 credits per decoy. The Pentium4 gets more credits per second as Rosetta is better suited to the P4 architecture at the moment. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
dcdc, I don't think the question was about the amount of credit issued. But the link doesn't bring up a WU anymore. I think the point was that in this case two people received credit, which doesn't seem consistent with the setting of only one successful result being allowed. Rosetta Moderator: Mod.Sense |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,627,225 RAC: 11,586 |
i'm not doing too well on this thread! didn't read it in context ;) I always thought two success results for a given task was quite common... anyway tng's suggestion of a delay before reissuing would probably be wise, but why limit successes at all? Is that to stop cheating through repeat-reporting of the same result on multiple computers? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Actually that brings us to a point I thought might be useful to make here anyway. A specific WU generated on the server has a specific random number seed incorporated into it. Normally, a specific random number seed will only be sent to a single machine. Having said that, many, many unique WUs are created for a given protein or RNA being studied. So, normally every machine is crunching along a unique path. But, in cases where one host gets an error on a specific WU, it will be reissued to another host, with the same random number seed. This is useful to the project, because it can help them pinpoint if a given error was specific to a given operating system or CPU time, to a specific WU, or due to how a specific random number gets processed. This reissue of the identical WU and random seed can also occur when the first host it was sent to does not respond before the deadline. So, in all of the cases in this thread, one of those events occurred and the WU was actually crunched by more than one machine, which is fairly rare. Yet we all work together to crunch thousands of different models for the protein, via thousands of unique random number seeds in unique WUs. This is where we should all get more careful about using the word "result" as compared to "work unit". Since each work unit normally only has one result on Rosetta, we get careless. Look at the URL closely: it has the word "wuid" or "resultid" in it. The "result" is for a specific host. The "work unit" has the unique random number seed incorporated into it. Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
...why limit successes at all? Is that to stop cheating through repeat-reporting of the same result on multiple computers? Actually, that would not be possible. You cannot return results from a machine they were never issued to. So, simply increasing the number of successes allowed would be another approach to resolve the problem. But that approach has a slight disadvantage because two hosts crunched exactly the same models. One might have followed a given random seed further (i.e. crunched for a longer runtime, or on a faster CPU) but the first models would all be identical. So the approach of not reissuing the WU so soon after the deadline would be slightly better. Rosetta Moderator: Mod.Sense |
Marky-UK Send message Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
This issue has come up before and was discussed back in February here; David Kim said they were going to look into it, but the settings still seem to be the same. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
There is a BOINC trac item for this issue. DK, has any change been made to the validator specifically for Rosetta? You had mentioned perhaps you would do so here. Rosetta Moderator: Mod.Sense |
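
The race discussed in this thread (a late result arriving just after the task has been reissued) can be sketched as a toy timeline. This is only an illustration under stated assumptions, not how the BOINC server or the Rosetta validator is actually implemented: the function `simulate`, its parameters, and the "first result to arrive wins" rule are all hypothetical, chosen to mirror the one-success limit described above.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    reissued: bool    # was a replacement copy sent out?
    credited: list    # hosts whose results were accepted
    discarded: list   # hosts whose finished work was thrown away


def simulate(late_by_h, reissue_delay_h, runtime_h, max_success=1):
    """Toy timeline, measured in hours after the original host's deadline.

    late_by_h:       when the original (late) host finally reports
    reissue_delay_h: how long the server waits before reissuing (assumed knob)
    runtime_h:       how long the replacement host crunches before reporting
    max_success:     how many successful results the workunit may accept
    """
    returns = [(late_by_h, "late host")]
    # A replacement is only sent if the late result has not arrived
    # by the time the server decides to reissue.
    reissued = late_by_h > reissue_delay_h
    if reissued:
        returns.append((reissue_delay_h + runtime_h, "replacement host"))
    returns.sort()  # the server sees results in order of arrival
    credited = [host for _, host in returns[:max_success]]
    discarded = [host for _, host in returns[max_success:]]
    return Outcome(reissued, credited, discarded)


# Roughly the case from this thread: reissue about 14 minutes after the
# deadline, a 24-hour runtime, and a late report (the 6 h figure is made up).
print(simulate(late_by_h=6, reissue_delay_h=14 / 60, runtime_h=24))

# Same events with a hypothetical 12-hour wait before reissuing.
print(simulate(late_by_h=6, reissue_delay_h=12, runtime_h=24))

# Same events with the quick reissue but two successes allowed.
print(simulate(late_by_h=6, reissue_delay_h=14 / 60, runtime_h=24, max_success=2))
```

In this toy model, a longer reissue delay avoids the duplicate work entirely, while raising the success limit grants both hosts credit but still has two machines crunching the same random seed, which matches the trade-off Mod.Sense describes.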
©2024 University of Washington
https://www.bakerlab.org