Max # total results question

Author	Message
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,255,054 RAC: 0	Message 44709 - Posted: 5 Aug 2007, 16:10:53 UTC
Hi, I carefully read two good threads on what causes a result to be granted zero credit even when it was returned successfully and on time. The threads were:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3217
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2851
The workunit in question is: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=83885601
In this workunit, it appears that two results sent previously did not report back before the deadline. The server then sent my computer a result from this workunit to crunch. My computer reported back on time, but the previous computer reported back before me (though after its deadline). As a consequence, my computer received zero credit for the workunit.
I'm really not going to complain about a few credits, but I would like to help the situation. I believe the scheduler logic on the server is doing exactly what it should by re-issuing a workunit when there is no reply by the deadline. In the two threads above, someone introduced the idea of having a "grace period" after the deadline but before the workunit is resent. Has there been any further discussion or action on this topic?
Also, just to ask the simple question, was I granted no credit because:
1) of a workunit error (see link), even though Outcome = success, or
2) someone else returned results before me, or
3) something else?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 44711 - Posted: 5 Aug 2007, 16:36:40 UTC
The first issuance of the WU was never sent back. The second was returned after its deadline had passed, but perhaps they extended the deadlines a little because of the server outage during that time for the upgrade to the SAN file server. The third was issued to you just after the 10-day expiration date was actually reached (i.e. before the second person returned any result). So when they later did return the result, that was the first report of a successful completion. You then returned your result and were the second successful completion.
The settings are configured to accept only one successful completion. So the actual reason you were not issued credit was that the WU had already received one successful result, which is the maximum. The mistake was in issuing the task to you in the first place, when it was still possible for a completion report to be accepted. I believe there is a BOINC issue open to get this fixed in the server scheduling programs. (Someone please post a link if they find it.)
Thank you for your understanding and constructive approach to the problem.
Rosetta Moderator: Mod.Sense
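As an illustration of the sequence described above, here is a minimal Python sketch of a workunit that accepts only one successful result. It is not the actual BOINC server code: the `WorkUnit` class, the `report_success` method and the host names are hypothetical, and only the "one success maximum" rule comes from the post.

```python
from dataclasses import dataclass, field


@dataclass
class WorkUnit:
    # Assumption: models the "accept only one successful completion" setting.
    max_success_results: int = 1
    successes: list = field(default_factory=list)

    def report_success(self, host: str) -> bool:
        """Record a successful result; return True if it is granted credit."""
        if len(self.successes) >= self.max_success_results:
            return False  # quota already filled, so this result earns nothing
        self.successes.append(host)
        return True


wu = WorkUnit()
# The late host reports first (after its own deadline, but still accepted).
late_host_credited = wu.report_success("host_2_late")
# The re-issued task comes back on time, but second.
on_time_host_credited = wu.report_success("host_3_on_time")
print(late_host_credited, on_time_host_credited)  # True False
```

In this toy model the order of the reports decides who gets credit, which is why the re-issue, rather than either host, is the source of the problem.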

DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,255,054 RAC: 0	Message 44712 - Posted: 5 Aug 2007, 17:20:23 UTC - in response to Message 44711.
Thank you for the clarification. Occasionally, one of my RAH workunits will crash, so I've been keeping an eye on the project and posting on the sticky threads. Other projects don't have any problem. I just wanted to be sure my computer crunched that WU properly.

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 44739 - Posted: 6 Aug 2007, 14:10:43 UTC
Here is a link to the issue I opened on the BOINC trac system.
Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/

Marky-UK Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0	Message 44777 - Posted: 7 Aug 2007, 12:30:42 UTC - in response to Message 44711. Last modified: 7 Aug 2007, 12:35:31 UTC
Quoting Message 44711: "The mistake was in issuing the task to you in the first place, when it was still possible for a completion report to be accepted. I believe there is a BOINC issue open to get this fixed in the server scheduling programs. (Someone please post a link if they find it.)"
I still contend that the mistake is setting the "max # of success results" number to just 1, when it is a known fact that work can get reissued as soon as a deadline has passed. IMHO this should be changed to 2 (and the "max # of total results" should probably be changed to 3 as well).

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 44778 - Posted: 7 Aug 2007, 13:30:36 UTC
The reason for setting maximum successes to 1 is that the project would prefer that another WU be crunched instead of crunching the same one twice. You see, each WU generated on the server has embedded within it a random number seed, which is used to generate a unique series of starting points. This large number of unique starting points is what we are collectively exploring on our machines. So if a WU is completed by more than one machine, the second machine has simply duplicated the effort of the first. NOT reissuing the WU until results will no longer be accepted would therefore be preferable.
Rosetta Moderator: Mod.Sense
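The point about the embedded seed can be shown with a small Python sketch. This is only an analogy: it uses Python's standard `random` module rather than whatever generator Rosetta uses, and the `starting_points` function and the seed values are made up for illustration.

```python
import random


def starting_points(seed, count=3):
    """Derive a reproducible series of starting points from a seed."""
    rng = random.Random(seed)  # same seed -> same sequence, every time
    return [rng.random() for _ in range(count)]


# Two hosts crunching the same WU share its seed and explore the same points.
first_host = starting_points(seed=12345)
second_host = starting_points(seed=12345)

# A different WU carries a different seed and explores different points.
other_wu = starting_points(seed=67890)

print(first_host == second_host)  # True
print(first_host == other_wu)     # False
```

Under this assumption a second completion of the same WU adds no new samples, whereas a fresh WU does.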

DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,255,054 RAC: 0	Message 44788 - Posted: 7 Aug 2007, 19:40:41 UTC - in response to Message 44778.
A grace period before the transitioner (or scheduler?) re-issues the work makes much more sense, IMHO. I agree that it's better not to recrunch, because the researchers are treating each workunit as one sample in their study.
In my case, 48 hours would have been enough of a delay before re-issuing work that the late result would have been returned. If the research/technical team can deal with workunits possibly taking an extra 48 hours times the number of late machines to finish, then it would solve the problem. Worst case, all the deadlines could be moved up 48 hours to account for the grace period, in the event someone doesn't return the result on time.
What would be REALLY cool is if the Rosetta application knew the deadline. Then it would just "finish" if the deadline had passed. This would be similar to the preferred run-length parameter, but with an overriding parameter so it doesn't go past the deadline.
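The two suggestions above could be sketched roughly as follows in Python. This is a sketch under stated assumptions, not how the BOINC transitioner, the scheduler or the Rosetta application actually work: `should_reissue`, `capped_run_length` and `GRACE_PERIOD` are hypothetical names, and the example deadline is invented.

```python
from datetime import datetime, timedelta

# Hypothetical constant: the 48-hour delay suggested in the post above.
GRACE_PERIOD = timedelta(hours=48)


def should_reissue(deadline, now, result_returned, grace=GRACE_PERIOD):
    """Reissue only if nothing came back and the grace period has also expired."""
    if result_returned:
        return False
    return now > deadline + grace


def capped_run_length(preferred, now, deadline):
    """Limit the preferred run length so the task cannot run past its deadline."""
    remaining = max(deadline - now, timedelta(0))
    return min(preferred, remaining)


deadline = datetime(2007, 8, 1, 12, 0)  # made-up deadline for illustration

# One hour late: still inside the grace period, so no reissue yet.
reissue_one_hour_late = should_reissue(deadline, deadline + timedelta(hours=1), False)
# 49 hours late: the grace period has expired, so the work goes out again.
reissue_after_grace = should_reissue(deadline, deadline + timedelta(hours=49), False)
# Preferred run length of 8 hours, but only 3 hours remain before the deadline.
run_length = capped_run_length(timedelta(hours=8), deadline - timedelta(hours=3), deadline)
```

The trade-off is the one already noted in the post: every late machine can add up to one grace period to the time a workunit takes to finish.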

Greg_BE Joined: 30 May 06 Posts: 5774 Credit: 6,139,760 RAC: 0	Message 44795 - Posted: 7 Aug 2007, 20:13:32 UTC - in response to Message 44788.
Quoting Message 44788: "In my case, 48 hours would have been enough of a delay before re-issuing work that the late result would have been returned. If the research/technical team can deal with workunits possibly taking an extra 48 hours * # late machines to finish, then it would solve the problem. Worst case, all the deadlines could be moved up 48 hours to account for the grace period, in the event someone doesn't return the result on time."
(If RAH waits 48 hours after the deadline before reissuing the WU, that's plenty of time.)
Quoting Message 44788: "What would be REALLY cool is if the Rosetta application knew the deadline. Then it would just "finish" if the deadline had passed. Similar to the preferred run-length parameter, but have an overriding parameter so it doesn't go past the deadline."
(I also think this would be a good idea. I posted something similar in another thread, saying that if the deadline has passed, RAH should know and then issue an abort command to that computer for that specific WU. No CPU time is lost to crunching something that will not be used, and there is no extra power consumption.)