Expired deadline

Message boards : Number crunching : Expired deadline

pixie
Joined: 30 Aug 08
Posts: 1
Credit: 3,666,539
RAC: 0
Message 56085 - Posted: 29 Sep 2008, 17:05:20 UTC

I just noticed that I missed the deadline by a whole day. Do I abort them so the rest of the tasks get submitted on time, or do I just let them crunch, so it doesn't ruin the project?

TIA
ID: 56085 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,691,837
RAC: 1,806
Message 56088 - Posted: 29 Sep 2008, 20:16:55 UTC - in response to Message 56085.  

They have been reassigned to another user, from the looks of it.
You could let them run, but then the other user crunches and gets no credit.
Check each one; if the task has been reassigned, I personally would abort it so as not to hurt the new guy. Not sure whether you will take a hit in credit; I would think not, since they never ran.

Someone else might have a different idea on this.

I just noticed that I missed the deadline by a whole day. Do I abort them so the rest of the tasks get submitted on time, or do I just let them crunch, so it doesn't ruin the project?

TIA

ID: 56088 · Rating: 0
R.L. Casey
Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 56096 - Posted: 30 Sep 2008, 1:58:59 UTC - in response to Message 56085.  

I just noticed that I missed the deadline by a whole day. Do I abort them so the rest of the tasks get submitted on time, or do I just let them crunch, so it doesn't ruin the project?

TIA


pixie,
You should abort tasks that have passed the deadline. The WU has been reassigned because your computer did not finish before the deadline, which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit.

In order to alleviate this problem, you should greatly reduce, in your BOINC network preferences, the "Computer is connected to the Internet about every ___ days" value and the related one about keeping ___ additional days of work. Many of your computers have 400 to 500 WUs waiting to start -- too many, since you will see that many of them are approaching the deadline and have not even started crunching yet. I saw one computer with an average turnaround time of 9.94 days, so results are barely getting in before the deadline. No doubt some are not getting done in time.

Try cutting the numbers 'way down; high numbers of WUs waiting to run are not an advantage.
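
To put rough numbers on why a large queue works against you, here is a minimal Python sketch. The function names, the 3-hour run time per task and the 4-core host are assumptions made up for illustration; they are not values read from pixie's hosts or from BOINC itself. Only the ten-day deadline comes from this thread.

```python
def days_to_clear_queue(queued_tasks, hours_per_task, cores):
    """Estimate how many days a host needs to finish everything in its queue.

    Assumes every task takes the same time and all cores crunch 24 hours a day.
    """
    total_cpu_hours = queued_tasks * hours_per_task
    return total_cpu_hours / (cores * 24.0)


def queue_fits_deadline(queued_tasks, hours_per_task, cores, deadline_days=10.0):
    """True if the last task in the queue would still finish before the deadline."""
    return days_to_clear_queue(queued_tasks, hours_per_task, cores) <= deadline_days


# Hypothetical host: 450 queued tasks, 3 hours each, 4 cores.
print(days_to_clear_queue(450, 3.0, 4))   # 14.0625 days of work in the queue
print(queue_fits_deadline(450, 3.0, 4))   # False: the tail of the queue expires
print(queue_fits_deadline(100, 3.0, 4))   # True: a smaller queue fits easily
```

Under these assumed figures the queue holds about 14 days of work against a 10-day deadline, so the last tasks can never be returned in time no matter what the host does.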

And... VERY nice computer farm!! Thanks for crunching Rosetta!
ID: 56096 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 56113 - Posted: 30 Sep 2008, 14:40:04 UTC - in response to Message 56096.  

You should abort tasks that have passed the deadline.

A rule of thumb: you can immediately abort the task if it has not yet started crunching.

If it is already being crunched... then it depends. It is simpler to decide when the tasks take days and you need only the last few hours to finish. Then you can be sure that the reassigned task will finish later.

Rosetta's tasks are usually much shorter; a reassigned task can be finished at any moment (like a hidden thread :-) So yes - abort it. (Your nice farm will not notice it ;-)

You could let them run but then the other user crunches but gets no credit.
Check each one, if the task has been reassigned I personally would abort them so as not to hurt the new guy.

The WU has been reassigned because your computer did not finish before the deadline, which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit.

This should definitely not happen until the second guy's deadline has passed!! (If it does, it is a BOINC server-side error, which should get reported and repaired. The second guy is given a promise that he can crunch the reassigned WU until his new deadline.)

Peter
ID: 56113 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 56117 - Posted: 30 Sep 2008, 15:25:09 UTC - in response to Message 56113.  

You could let them run but then the other user crunches but gets no credit.
Check each one, if the task has been reassigned I personally would abort them so as not to hurt the new guy.

The WU has been reassigned because your computer did not finish before the deadline, which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit.

This should definitely not happen until the second guy's deadline has passed!! (If it does, it is a BOINC server-side error, which should get reported and repaired. The second guy is given a promise that he can crunch the reassigned WU until his new deadline.)

I'm taking my words back. It has nothing to do with BOINC server-side software, it is just Rosetta's tight and intolerant settings:

minimum quorum 1
initial replication 1
max # of error/total/success tasks 1, 2, 1


You are right: "poor second guy" ;-)

Peter
ID: 56117 · Rating: 0
R.L. Casey
Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 56118 - Posted: 30 Sep 2008, 15:35:41 UTC - in response to Message 56117.  

You could let them run but then the other user crunches but gets no credit.
Check each one, if the task has been reassigned I personally would abort them so as not to hurt the new guy.

The WU has been reassigned because your computer did not finish before the deadline, which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit.

This should definitely not happen until the second guy's deadline has passed!! (If it does, it is a BOINC server-side error, which should get reported and repaired. The second guy is given a promise that he can crunch the reassigned WU until his new deadline.)

I'm taking my words back. It has nothing to do with BOINC server-side software, it is just Rosetta's tight and intolerant settings:

minimum quorum 1
initial replication 1
max # of error/total/success tasks 1, 2, 1


You are right: "poor second guy" ;-)

Peter


I mentioned the 'anomaly' on the Rosetta 5.98 'problems' thread at
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4213&nowrap=true#55918
It would be nice, I think, if the original, over-deadline results were discarded in favor of the second (as yet unfinished) task, but it's probably a somewhat rare occurrence and not a top priority...
ID: 56118 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 56125 - Posted: 30 Sep 2008, 19:35:13 UTC
Last modified: 30 Sep 2008, 19:36:12 UTC

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 56125 · Rating: 0
R.L. Casey
Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 56131 - Posted: 30 Sep 2008, 20:49:38 UTC - in response to Message 56125.  

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


Thanks for the info, Feet1st! I've been on a crunching sabbatical for a year or so... plus I hadn't followed BOINC, anyway. They have an impressive list of items to assess. Thanks again; glad you're still around.
ID: 56131 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 56134 - Posted: 30 Sep 2008, 23:27:35 UTC - in response to Message 56125.  

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified

Changeset 276 in the BOINC trac describes a different case, where a third task is erroneously reissued although total=2.

Would the problem mentioned in this thread be solved using 2-2-2 limit settings?
minimum quorum 1
initial replication 1
max # of error/total/success tasks 2, 2, 2


(Surely it could take longer to discard the WU from server.)

Peter
ID: 56134 · Rating: 0
ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 56139 - Posted: 1 Oct 2008, 9:43:15 UTC - in response to Message 56096.  


Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit.


This discussion seems to be going a bit too fast.

I should like to see some documentation that Boinc/Rosetta gives credit to only one computer in cases where two computers have returned results for the same task.

I have used a slow but reliable computer for a year and a half. I see this situation quite often, receiving tasks that expired or crashed on other computers (and occasionally fighting time limits myself), and I have tried to follow what happens to these tasks.

I have never - repeat: never - observed that Rosetta has refused credits to one of two computers delivering valid results for the same task. Rosetta will use the first incoming result as a canonical result, but credits are delivered to everyone.

And I think this is a proper behaviour. Boinc is designed to run unattended, and participants should not need to worry unnecessarily about deadlines or task duplications.

There is of course one limitation. When Rosetta has received one valid result the project is satisfied. The task will stay as statistics for the successful cruncher for a limited period and then disappear from the server. And at that point no one will be able to return results and get credits.

As for aborting tasks that have passed their deadlines, I am in two minds. If you see that another computer has delivered a valid result then by all means abort your replication. But aborting a task weakens its overall chances of success. I have observed quite a few perfectly sound tasks being cancelled on the server because the first cruncher aborted for time reasons (which the server registers as a compute error) and the next receiver crashed or gave up (because some models are too lengthy, what do I know?). Anyhow, I now wonder if it is better for the project to register two successes for the same task than none at all.
ID: 56139 · Rating: 0
ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 56140 - Posted: 1 Oct 2008, 9:46:31 UTC - in response to Message 56134.  

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified

Changeset 276 in the BOINC trac describes a different case, where a third task is erroneously reissued although total=2.

Would the problem mentioned in this thread be solved using 2-2-2 limit settings?
minimum quorum 1
initial replication 1
max # of error/total/success tasks 2, 2, 2


(Surely it could take longer to discard the WU from server.)

Peter


My very personal theory is that the text "max # of error/total/success tasks" is misleading and should read "max # of error/total/success results". This situation may arise because Boinc somehow does not consider tasks exceeding the deadline ("Server state: Over; Outcome: No reply; Client state: New") as results. Then if a result is returned to the server after the deadline in a situation where three computers have received this task for computing, this increases the number of total results to the disadvantage of the third participant.

By the way, I doubt that we may conclude from "max # of error/total/success tasks" = 1, 2, 1 that Rosetta should not send out more than 2 replications of the same task. By the same interpretation we should have to conclude that Rosetta terminates a task upon receiving one result with "Client error/Compute error". It doesn't, it terminates upon receiving more than 1, that is 2, error results. And Rosetta is perfectly capable of accepting more than 1 (again = 2) success results under ordinary circumstances, giving proper credits to everyone.

ID: 56140 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 56141 - Posted: 1 Oct 2008, 10:57:55 UTC - in response to Message 56140.  

My very personal theory is that the text "max # of error/total/success tasks" is misleading and should read "max # of error/total/success results".


That's just a different wording. The term "task" was introduced at a later time, to describe what is assigned to and running on a host.

Previously there were just WUs, consisting of results (which were to be returned to server after being crunched). But it sounded weird if "a result was running on my host"...

Once the files are returned to the server, they are just plain "results of computation". But the official wording might indeed be "tasks" now.

By the way, I doubt that we may conclude from "max # of error/total/success tasks" = 1, 2, 1 that Rosetta should not send out more than 2 replications of the same task.

But we have to. The scientists use these values to set how the server should behave during the WU's lifetime.

By the same interpretation we should have to conclude that Rosetta terminates a task upon receiving one result with "Client error/Compute error". It doesn't,

Sure, it does not. Usually at that moment there are two results: one failed (1 error is fulfilled) and one just resent. (There should be no second additional resent task, because max is 2.)

it terminates upon receiving more than 1, that is 2, error results.

Exactly. If either 2 successful or 2 error results are back, suddenly it does not fit in the (1,2,1) form and the WU is declared as failed.
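
The reading of the (1,2,1) limits given here can be sketched in a few lines of Python. This is only a model of the interpretation discussed in this thread, not the actual BOINC server code; the function name, the state strings and the rule that in-progress tasks count toward the total are all assumptions for illustration.

```python
def workunit_state(errors, successes, in_progress,
                   max_error=1, max_total=2, max_success=1, quorum=1):
    """Classify a workunit under the limits as they are read in this thread.

    errors, successes: counts of results already returned to the server.
    in_progress: tasks sent out that have not reported yet.
    """
    total = errors + successes + in_progress
    # Exceeding any one of the three limits marks the workunit as failed.
    if errors > max_error or successes > max_success or total > max_total:
        return "failed"
    # Enough successful results to satisfy the quorum.
    if successes >= quorum:
        return "done"
    return "in progress"


print(workunit_state(errors=0, successes=1, in_progress=0))  # done
print(workunit_state(errors=1, successes=0, in_progress=1))  # in progress
print(workunit_state(errors=2, successes=0, in_progress=0))  # failed
print(workunit_state(errors=0, successes=2, in_progress=0))  # failed
```

The second call is the "one failed, one just resent" situation, and the last two are the cases that no longer fit the (1,2,1) form.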

And Rosetta is perfectly capable of accepting more than 1 (again = 2) success results under ordinary circumstances, giving proper credits to everyone.

That is up to the devs to comment on. Anyway, they are still able to grant (semi-manually?) credit to any successful result, regardless of the WU state. This is often done on beta projects.

Peter
ID: 56141 · Rating: 0
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 57887 - Posted: 15 Dec 2008, 11:48:35 UTC - in response to Message 56125.  

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate.
Bad luck for me, and a complete waste of CPU-time.

AdeB
ID: 57887 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,691,837
RAC: 1,806
Message 57896 - Posted: 15 Dec 2008, 18:40:11 UTC - in response to Message 57887.  
Last modified: 15 Dec 2008, 18:40:48 UTC

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate.
Bad luck for me, and a complete waste of CPU-time.

AdeB



I think it was still valid for it to be sent to your computer, as the other 2 systems did not reply for whatever reason, and no-replies do not count as errors because no results were ever returned. It is too bad you got stuck on a validate error and wasted CPU time. That is a real problem here sometimes.
ID: 57896 · Rating: 0
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 57899 - Posted: 15 Dec 2008, 20:35:28 UTC - in response to Message 57896.  

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate.
Bad luck for me, and a complete waste of CPU-time.

AdeB



I think it was still valid for it to be sent to your computer, as the other 2 systems did not reply for whatever reason, and no-replies do not count as errors because no results were ever returned. It is too bad you got stuck on a validate error and wasted CPU time. That is a real problem here sometimes.


I think that you are right that 'no reply' doesn't count as an error. But it should not be sent to a third computer, because then there will be a validate error as the number of tasks exceeds the maximum number of tasks:
max # of error/total/success tasks 1, 2, 1
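
The check being argued for here could look like the following minimal Python sketch. It shows the behaviour this post says the server should have, not what the BOINC scheduler actually does; the function name and the outcome strings are hypothetical.

```python
def may_issue_another(outcomes, max_total=2):
    """Return True if one more copy of the workunit may be sent out.

    `outcomes` holds one entry per task already created for the workunit,
    including tasks that timed out with "no reply".
    """
    return len(outcomes) < max_total


# Two hosts received the task and never replied before their deadlines.
history = ["no reply", "no reply"]
print(may_issue_another(history))        # False: a third copy would exceed total = 2
print(may_issue_another(["no reply"]))   # True: one reissue is still within the limit
```

With a guard like this, a 'no reply' would still not be counted as an error, but it would count toward the total, so the third host would never be handed a task that cannot validate.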



ID: 57899 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,691,837
RAC: 1,806
Message 57904 - Posted: 15 Dec 2008, 22:34:58 UTC - in response to Message 57899.  
Last modified: 15 Dec 2008, 22:39:32 UTC

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate.
Bad luck for me, and a complete waste of CPU-time.

AdeB



I think it was still valid for it to be sent to your computer, as the other 2 systems did not reply for whatever reason, and no-replies do not count as errors because no results were ever returned. It is too bad you got stuck on a validate error and wasted CPU time. That is a real problem here sometimes.


I think that you are right that 'no reply' doesn't count as an error. But it should not be sent to a third computer, because then there will be a validate error as the number of tasks exceeds the maximum number of tasks:
max # of error/total/success tasks 1, 2, 1




Yeah, I see that the human logic and the computer logic do not match. BOINC ticket 276 explains things pretty well. Surprised they haven't fixed this bug; it must be super low priority.
ID: 57904 · Rating: 0
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 57925 - Posted: 16 Dec 2008, 10:36:33 UTC - in response to Message 57904.  

This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate.
Bad luck for me, and a complete waste of CPU-time.

AdeB



I think it was still valid for it to be sent to your computer, as the other 2 systems did not reply for whatever reason, and no-replies do not count as errors because no results were ever returned. It is too bad you got stuck on a validate error and wasted CPU time. That is a real problem here sometimes.


I think that you are right that 'no reply' doesn't count as an error. But it should not be sent to a third computer, because then there will be a validate error as the number of tasks exceeds the maximum number of tasks:
max # of error/total/success tasks 1, 2, 1




Yeah, I see that the human logic and the computer logic do not match. BOINC ticket 276 explains things pretty well. Surprised they haven't fixed this bug; it must be super low priority.


Looks like someone stepped in and granted credit for the task. I hope it was also possible to save the results, because that's what it's all about.
ID: 57925 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org