WUs Advancing Together

Message boards : Number crunching : WUs Advancing Together

Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,860,027
RAC: 909
Message 66103 - Posted: 13 May 2010, 4:37:16 UTC

BOINC runs tasks in FIFO order - first in, first out - unless it thinks a task is in danger of missing its deadline. This has proven to be the most reliable way to ensure that tasks get returned on time.

For your preferred earliest-deadline-first (EDF) scheme, consider this scenario.
Assume, for the sake of simplicity:
.01 day cache
always on connection
Project A has 2 day deadlines and Project B has 4 day deadlines.
It takes your computer 1 day to complete a Project A task and 2 days to complete a Project B task.

May 1: You attach and download one task each from Project A and Project B.
Project A task is due May 3.
Project B task is due May 5.
Project A task #1 is begun.
May 2: Just before it finishes Project A task #1 BOINC asks for and receives another Project A task.
Project A task #2 is due May 4.
Project A task #1 finishes and Project A task #2 begins.
May 3: Just before it finishes Project A task #2 BOINC asks for and receives another Project A task.
Project A task #3 is due May 5.
Project A task #2 finishes. Finally the Project B task deadline is earlier (by a minute or so) than any Project A task and so begins running.
May 4: Project B running.
May 5: Project B task finishes, barely making its deadline or missing it by a few minutes. Project A task #3 starts but needing a whole day rather than the few minutes it has left, misses its deadline.
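
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the timeline above. It is not BOINC's scheduler: it assumes a single CPU and non-preemptive earliest-deadline-first, and it hard-codes the four task arrivals from the scenario instead of modeling work fetch. The names (Task, run_edf) and the one-minute gap between the Project B and Project A #3 deadlines are my own assumptions, picked to match the "by a minute or so" above.

```python
from dataclasses import dataclass

DAY = 24 * 60  # minutes per day; all times are minutes after the May 1 download


@dataclass
class Task:
    name: str
    arrival: int   # when the task is downloaded
    deadline: int  # when the result is due
    runtime: int   # CPU time needed to finish


def run_edf(tasks):
    """Non-preemptive earliest-deadline-first on one CPU.

    Returns a list of (task, start, finish) in the order the tasks ran.
    """
    pending = sorted(tasks, key=lambda t: t.arrival)
    clock = 0
    schedule = []
    while pending:
        ready = [t for t in pending if t.arrival <= clock]
        if not ready:
            clock = pending[0].arrival  # idle until the next download
            continue
        task = min(ready, key=lambda t: t.deadline)
        pending.remove(task)
        start = clock
        clock += task.runtime
        schedule.append((task, start, clock))
    return schedule


# Arrivals and deadlines copied from the scenario (assumed values, not measured).
scenario = [
    Task("A1", 0 * DAY, 2 * DAY, 1 * DAY),
    Task("B", 0 * DAY, 4 * DAY, 2 * DAY),
    Task("A2", 1 * DAY, 3 * DAY, 1 * DAY),
    Task("A3", 2 * DAY, 4 * DAY + 1, 1 * DAY),  # due one minute after B
]

schedule = run_edf(scenario)
for task, start, finish in schedule:
    status = "MISSED" if finish > task.deadline else "on time"
    print(f"{task.name}: runs day {start / DAY:.2f} to {finish / DAY:.2f}, "
          f"due day {task.deadline / DAY:.2f}, {status}")
```

Day 0 here is May 1, so a finish at day 4.00 corresponds to May 5. Changing the runtimes or adding more Project A arrivals is an easy way to see how quickly the longer-deadline task gets squeezed.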

With a larger cache, an extra Project A task would be downloaded on May 2, and the Project B task would be certain to miss its deadline.

Adding round robin task switching only delays the inevitable - eventually tasks with longer deadlines run out of time.

An individual user with only two projects could avoid missing deadlines by tweaking their resource share, but start adding more projects, all with different deadlines (Malaria Control with 3 days, CPDN with 6 months), and it becomes increasingly complicated and difficult to get right. Deadlines will be missed. Even with only two projects, that resource share tweaking takes time, accurate observations, and a bit of experimentation to get right. Deadlines will probably be missed. Some will become frustrated and give up. Novices will wonder what confusing, brain cell eating mess of a program they've gotten themselves into. Deadlines will be missed. Set and forget users can't, um, set and forget, not without missing deadlines. Decided to stick with a single project? Don't pick SETI or any project with different types of workunits, all with different deadline lengths. No resource share tweaking to help you with those.

To get yourself out of this mess you will need to add on code to identify and run some tasks earlier than a simple earliest deadline first scheme would allow. So some tasks with later deadlines would end up starting before some tasks with earlier deadlines. Exactly what you say you don't want to see.

Mikey, if you can prove that EDF would result in fewer deadlines missed by novice, and set and forget users (the majority) than FIFO then maybe you can convince the developers to rewrite the code. Good luck.

Best,
Snags
ID: 66103 · Rating: 0
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,860,027
RAC: 909
Message 66104 - Posted: 13 May 2010, 5:41:00 UTC

Ross,

Changing the target CPU time does not change the deadline, so a shorter target CPU time means the task needs to accumulate less CPU time in the same amount of wall-clock time.

Drastic changes to this preference will confuse BOINC and, in your situation, probably lead to overfetching, which in turn leads to more tasks running in High Priority and more tasks in varying stages of progress. A safe rule would be to change it by no more than one step (which is two hours) per two completed workunits.

By the way, have you just restarted running Rosetta after being away for a while or has it been running but unattended?

Best,
Snags
ID: 66104 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,846,971
RAC: 3,047
Message 66106 - Posted: 13 May 2010, 11:50:07 UTC - in response to Message 66103.  

BOINC runs tasks in FIFO order - first in, first out - unless it thinks a task is in danger of missing its deadline. This has proven to be the most reliable way to ensure that tasks get returned on time.

Mikey, if you can prove that EDF would result in fewer deadlines missed by novice, and set and forget users (the majority) than FIFO then maybe you can convince the developers to rewrite the code. Good luck.

Best, Snags


No, unfortunately your scenario clearly shows why EDF won't work all the time either. I am not sure there is a 'set and forget' way with BOINC unless a person only attaches to one project. Then the problem is that when it goes down they aren't crunching! Actually, I think there might be a 'set and forget' way, but it is not conducive to running multiple projects. I think it would work if the user was given a choice of a primary project and a backup project, and the backup only ran when the primary was out of work and they both had similar deadlines. That is beyond this discussion and probably not a good thing for all the different BOINC projects anyway. Limiting users is not what all the different projects want, although that would help with the missed deadlines. Educating users, as you did in your post, is probably the best way to make BOINC work the best it can when dealing with multiple projects and their different deadlines.
ID: 66106 · Rating: 0
Ross Parlette

Joined: 10 Nov 05
Posts: 32
Credit: 2,165,044
RAC: 0
Message 66110 - Posted: 13 May 2010, 21:21:31 UTC

I just found the FAQ and I read about the Target CPU time. Now I'm thinking about it while I watch and wait. I did intervene to make sure the May 14th WU completed on time. Everything is back to default now.

What I now perceive Target CPU time as meaning is, "I will process each WU for x hours on my pathetic / average / outstanding computer." The FAQ says that they pre-judge how many models will run. Since my last checkpoint (see below) is usually a few minutes old, I guess my PC takes no more than 5-10 minutes for a model. Target CPU time ranges from 1 hour to 1 day, not 4 days as stated in the FAQ.

I have two WU due May 16th (same time) and I'll see how they fare. So far one is at 69% (Running high priority) and the other is at 1.6%. I still have 6 Rosetta WU and all but one of them is progressing toward completion. I have 9 SETI and only one is progressing (the soonest).

I've seen a lot of talk about scheduling and I want to say that I would only discuss scheduling WITHIN a project.

WRT checkpointing, I see that Rosetta checkpoints whenever it completes a model. Apart from highlighting a WU, clicking on the Properties command bar, and comparing the CPU time vs. the CPU time at last checkpoint, is there any other way of tracking checkpoints for a WU? I see now the reason for the inverted, reversed, backwards setting, Write to disk at most every [60] seconds. Were I running a Cray 3000, I might not want to write to disk as often as the 3.5 seconds it takes to complete a model.
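
To put rough numbers on why "at most every N seconds" is phrased that way, here is a small Python sketch. It is a simplified model, not BOINC or Rosetta code: it assumes a checkpoint can only happen when a model completes, and that it is written only if the minimum interval has passed since the last write. The function name and the one-hour run length are made up for illustration; the 3.5-second and 10-minute model times and the 60-second setting come from the paragraph above and my earlier guess.

```python
def count_checkpoints(model_time, min_interval, run_time):
    """Return (models_completed, checkpoints_written) for one simulated run.

    All arguments are in tenths of a second so the arithmetic stays exact.
    A checkpoint is possible only when a model completes, and is written
    only if at least min_interval has passed since the previous write.
    """
    models = 0
    checkpoints = 0
    last_write = 0
    clock = model_time  # time at which the first model completes
    while clock <= run_time:
        models += 1
        if clock - last_write >= min_interval:
            checkpoints += 1
            last_write = clock
        clock += model_time
    return models, checkpoints


ONE_HOUR = 36000  # tenths of a second

# A very fast machine: one model every 3.5 seconds, 60-second setting.
fast = count_checkpoints(model_time=35, min_interval=600, run_time=ONE_HOUR)

# A slower machine: one model every 10 minutes, same 60-second setting.
slow = count_checkpoints(model_time=6000, min_interval=600, run_time=ONE_HOUR)

print("fast machine (models, checkpoints):", fast)
print("slow machine (models, checkpoints):", slow)
```

Under these assumptions the fast machine skips most of its possible checkpoints and writes about once a minute, while the slow machine writes after every model because its models already take longer than the interval.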

BOINC has been running 24/7 although I do re-boot when required to by Patch Tuesday. Some days I get busy and don't look here for messages.

Ross
ID: 66110 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66112 - Posted: 14 May 2010, 2:19:11 UTC

You can record messages about checkpoints taken, and therefore get a log over time of checkpoints. The messages list the WU names, so you can tell which of perhaps many in progress is checkpointing.

You activate it by using the cc_config.xml file and turning on the "checkpoint_debug" flag. You can get details about using this file here.
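
As a rough sketch, the Python snippet below builds the kind of cc_config.xml content this refers to, with checkpoint_debug switched on under log_flags. The helper name build_cc_config is hypothetical. The file belongs in the BOINC data directory, whose location depends on your install, so the snippet only prints the XML instead of guessing a path. If you already have a cc_config.xml, merge the flag into it by hand instead of overwriting the file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    # ET.indent only exists on Python 3.9+; without it the XML is still
    # valid, just printed on a single line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


config_xml = build_cc_config(["checkpoint_debug"])
print(config_xml)
```

After saving the output as cc_config.xml, the client has to re-read its config files (or be restarted) before the checkpoint messages start showing up in the log.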

Rosetta Moderator: Mod.Sense
ID: 66112 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org