Max # total results question

DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 219
Message 44709 - Posted: 5 Aug 2007, 16:10:53 UTC

Hi,

I carefully read two good threads on what causes a result to be granted zero credit, even if it was returned successfully and on time.

The threads were:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3217

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2851

The workunit in question is:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=83885601

In this workunit, it appears two results sent previously did not report back before the deadline. Then, the server sent my computer a result from this workunit to crunch. My computer reported back on time, but the previous computer reported back before mine (though after its deadline). As a consequence, my computer received zero credit for the workunit.

I'm really not going to complain about a few credits, but I would like to help the situation. I believe the scheduler (logic) on the server is doing exactly as it should by re-issuing a workunit if there is no reply after the deadline. In the 2 threads above, someone introduced the idea of having a "grace period" after the deadline but before the workunit is resent. Has there been any further discussion or action on this topic?

Also, just to ask the simple question, was I granted no credit because:
1) workunit error (see link), even though Outcome = success, or
2) someone else returned results before me, or
3) other?
ID: 44709
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 44711 - Posted: 5 Aug 2007, 16:36:40 UTC

The first issuance of the WU was not sent back.
The second was returned after the deadline was reached. But perhaps they extended the deadlines a little because of the server outage during the upgrade to the SAN file server.
The third was issued to you just after the 10 day expiration date was actually reached (i.e. before the second person returned any result). And so when they later did return the result, that was the first report of a successful completion. You later returned the result and were the second successful completion. The settings are configured to accept only one successful completion. So the actual reason you were not issued credit was that the WU had already received one successful result, which is the maximum.
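
As a rough illustration of the rule described above, here is a minimal Python sketch. It is not BOINC's actual server code: the function name `grant_credit`, the data layout, and the hour values are made up for the example, and the parameter name `max_success_results` is only chosen to echo the setting discussed in this thread.

```python
def grant_credit(reports, max_success_results=1):
    """Decide which successful reports earn credit.

    reports: list of (host, report_time_hours) for successful results.
    Assumed model: reports are accepted in arrival order until
    max_success_results is reached; later ones receive no credit.
    """
    credited = {}
    accepted = 0
    for host, _time in sorted(reports, key=lambda r: r[1]):
        if accepted < max_success_results:
            credited[host] = True
            accepted += 1
        else:
            credited[host] = False
    return credited


# Made-up report times (hours since first issue): the late host
# reports before the host that received the re-issued task.
reports = [("reissued_host", 260.0), ("late_host", 250.0)]
result = grant_credit(reports)
```

With a limit of 1, only `late_host` is credited in this sketch; raising the limit to 2 would credit both hosts, at the cost of accepting duplicated work.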

The mistake was in issuing the task to you in the first place, when it was still possible for a completion report to be accepted. I believe there is a BOINC issue open to get this fixed in the server scheduling programs. (someone please post a link if they find it)

Thank you for your understanding and constructive approach to the problem.
Rosetta Moderator: Mod.Sense
ID: 44711
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 219
Message 44712 - Posted: 5 Aug 2007, 17:20:23 UTC - in response to Message 44711.  

Thank you for the clarification.

Occasionally, one of my RAH workunits will crash, so I've been keeping an eye on the project and posting on the sticky threads. Other projects don't have any problem. I just wanted to be sure my computer crunched that WU properly.
ID: 44712
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 44739 - Posted: 6 Aug 2007, 14:10:43 UTC

Here is a link to the issue I opened on the BOINC trac system.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 44739
Marky-UK

Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 44777 - Posted: 7 Aug 2007, 12:30:42 UTC - in response to Message 44711.  
Last modified: 7 Aug 2007, 12:35:31 UTC

"The mistake was in issuing the task to you in the first place, when it was still possible for a completion report to be accepted. I believe there is a BOINC issue open to get this fixed in the server scheduling programs. (someone please post a link if they find it)"

I still contend that the mistake is setting the "max # of success results" number to just 1, when it is a known fact that work can get reissued as soon as a deadline is passed. IMHO this should be changed to 2 (and probably change the "max # of total results" to 3 as well).
ID: 44777
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 44778 - Posted: 7 Aug 2007, 13:30:36 UTC

The reason for setting maximum successes to 1 is that the project would prefer that another WU were crunched instead of crunching the same one twice. You see, each WU generated on the server has embedded within it a random number seed which is used to generate a unique series of starting points. This large number of unique starting points is what we are collectively exploring on our machines. So if the WU is completed by more than one machine, they have duplicated the efforts of the other machine. So NOT reissuing the WU until results will no longer be accepted would be preferable.
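
To make the seed argument concrete, here is a small Python sketch. It uses Python's standard `random` module as a stand-in and does not reflect how Rosetta actually generates its starting points; the function name `starting_points` and the seed values are invented for illustration.

```python
import random


def starting_points(seed, count=3):
    """Generate a hypothetical series of starting points from a seed."""
    rng = random.Random(seed)  # generator fully determined by the seed
    return [rng.random() for _ in range(count)]


# Two hosts crunching the same WU (same seed) explore the same points.
host_a = starting_points(seed=12345)
host_b = starting_points(seed=12345)

# A different WU carries a different seed and explores different points.
other_wu = starting_points(seed=67890)
```

Here `host_a` and `host_b` are identical lists, which is the duplicated effort described above, while `other_wu` covers new ground.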
Rosetta Moderator: Mod.Sense
ID: 44778
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 219
Message 44788 - Posted: 7 Aug 2007, 19:40:41 UTC - in response to Message 44778.  

A grace period for the transitioner (or scheduler?) to re-issue the work makes much more sense, IMHO. I agree that it's better not to recrunch because the researchers are treating each workunit as one sample in their study.

In my case, 48 hours would have been enough of a delay before re-issuing work that the late result would have been returned. If the research/technical team can deal with workunits possibly taking an extra 48 hours * # late machines to finish, then it would solve the problem. Worst case, all the deadlines could be moved up 48 hours to account for the grace period, in the event someone doesn't return the result on time.
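
A minimal Python sketch of the grace-period idea, under stated assumptions: the names `may_reissue` and `worst_case_delay` and the example deadline are hypothetical, and this is not how the BOINC transitioner or scheduler is actually written. Only the 48-hour figure and the "48 hours * # late machines" estimate come from the paragraph above.

```python
from datetime import datetime, timedelta

GRACE = timedelta(hours=48)  # grace period suggested above


def may_reissue(now, deadline, grace=GRACE):
    """True once the deadline plus the grace period has passed."""
    return now >= deadline + grace


def worst_case_delay(late_hosts, grace=GRACE):
    """Extra wait if each of late_hosts misses its deadline in turn."""
    return grace * late_hosts


deadline = datetime(2007, 8, 1, 12, 0)  # made-up deadline for illustration
one_hour_late = deadline + timedelta(hours=1)
two_days_late = deadline + timedelta(hours=48)
```

In this sketch a task that is one hour overdue is not yet re-issued, so a slightly late result can still arrive and be the only accepted success.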

What would be REALLY cool is if the Rosetta application *knew* the deadline. Then it would just "finish" if the deadline has passed. Similar to the preferred run-length parameter, but have an overriding parameter so it doesn't go past the deadline.
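
The overriding parameter could look something like this in Python. This is only a sketch of the suggestion; the function name `effective_run_hours` and its arguments are hypothetical and do not correspond to any existing Rosetta or BOINC setting.

```python
def effective_run_hours(preferred_hours, hours_until_deadline):
    """Run for the preferred length, but never past the deadline.

    Returns 0.0 if the deadline has already passed.
    """
    return max(0.0, min(preferred_hours, hours_until_deadline))
```

For example, a preferred run length of 8 hours with only 3 hours left before the deadline would be cut to 3 hours.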
ID: 44788
Greg_BE

Joined: 30 May 06
Posts: 5664
Credit: 5,709,112
RAC: 1,939
Message 44795 - Posted: 7 Aug 2007, 20:13:32 UTC - in response to Message 44788.  


In my case, 48 hours would have been enough of a delay before re-issuing work that the late result would have been returned. If the research/technical team can deal with workunits possibly taking an extra 48 hours * # late machines to finish, then it would solve the problem. Worst case, all the deadlines could be moved up 48 hours to account for the grace period, in the event someone doesn't return the result on time.
(If RAH gives 48 hours after the deadline before reissuing the WU, that's plenty of time.)

What would be REALLY cool is if the Rosetta application *knew* the deadline. Then it would just "finish" if the deadline has passed. Similar to the preferred run-length parameter, but have an overriding parameter so it doesn't go past the deadline. (I also think this would be a good idea. I posted something similar in another thread, saying that if the deadline has passed, RAH should know and then issue an abort command to that computer for that specific WU. No CPU time lost to crunching something that will not be used, and no extra power consumption.)


ID: 44795




©2024 University of Washington
https://www.bakerlab.org