Message boards : Number crunching : Rosetta WU delivery out of control
Author | Message |
---|---|
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted. I'm tired of babysitting this project. From now on I'll just let it send me hundreds of times more than I can crunch and they can sit there until they expire. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
With your hosts hidden, it is difficult to offer you any suggestions to improve your situation. It sounds like perhaps you hit a batch of work that failed rather immediately, and the BOINC Manager started to think that was a normal runtime. The project obviously needs to stop sending tasks that fail, and that will then avoid the side-effect you seem to be observing. Rosetta Moderator: Mod.Sense |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,762,257 RAC: 22,840 |
With your hosts hidden, it is difficult to offer you any suggestions to improve your situation.If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times. Grant Darwin NT |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,762,257 RAC: 22,840 |
Its queue was set to 0.5/0.1 days.Given the huge number of projects you are attached to, 0.25 + 0.02 would probably be a better cache setting (although it still would've had issues with the sheer number of faulty Tasks that were sent out). Grant Darwin NT |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times. If you could track down the specific project setting required, I'd be glad to suggest it to the Project Team. Rosetta Moderator: Mod.Sense |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,762,257 RAC: 22,840 |
I've not a clue, but put out a request for help to those that might.If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times.If you could track done the specific project setting required, I'd be glad to suggest it to the Project Team. Edit- no response to the call for help yet, but from what i've found it's really rather ugly as the Runtime estimation is a significant part of how Credit is calculated. And that gave me nothing but headaches when tying to make sense of it all in the past. From the looks of it, what is happening here at Rosetta shouldn't be happening- it appears you don't need to explicitly exclude Invalid or Error results. There is meant to be a function that excludes outlier results. eg a Task that has an Estimated completion time of 8 hours finishes in 7 hours will be used to calculate further Estimated completion times. Likewise one Estimated to take 8 hours & actually takes 9hrs will be used for further estimates. But one that goes 4 hours over the Estimated completion time, or one that finishes in less than half the time Estimated should be discarded from Estimated completion time calculations. Job Runtime estimation Credit New Grant Darwin NT |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,762,257 RAC: 22,840 |
If you could track done the specific project setting required, I'd be glad to suggest it to the Project Team.I've had a look, and come up with a WAG (Wild Arse Guess). For each Task, the project supplies an estimate of the FLOPs used by a job (wu.fpops_est) a limit on FLOPs, after which the job will be aborted (wu.fpops_bound).Rosetta allows for a 4 hour overrun from the Target CPU Time (this is a fixed time, regardless of Target CPU time? 2hr or 36hr Target CPU time, 4hrs overrun till the watchdog timer ends the Task?). So for Tasks of only 2hr Target time, the estimate of the FLOPs for that Task (wu.fpops_est) would be very small, but the limit on FLOPs after which the job will be aborted (wu.fpops_bound) would have to be very, very large to allow the 4 hour overrun before the Watchdog timer ends the Task. And it would have to be very, very, very, very large to allow for high clock speed CPUs- a lot more FLOPs done during that 4hrs than with a slower CPU. As near as i can tell, the wu.fpops_bound value is used for the Sanity check for Task size, estimated completion time & actual completion time used for keeping track of Runtimes & Estimated completion time. The extremely large wu.fpops_bound value (necessary for the 4 hour cutoff for the Watchdog timer) appears to break the Sanity check, so extremely short completion times (ie Tasks erroring out in seconds) are included in Estimated completion time calculations instead of being excluded. Does the project track how many tasks exceed their Target CPU time? By how much they exceed that time? Maybe that 4 hours could be reduced to 1hr, or even 30min? If my WAG is correct, that would then (maybe, hopefully) allow the Sanity check to work as intended to exclude extremely large outlier runtimes (ie Tasks erroring out in seconds or even minutes) and help reduce people getting more work than they can handle- at least when things go haywire (overly optimistic caches will of course still cause their own issues). I'll leave it to those with more of a clue as to how BOINC works to figure out if i'm barking up the wrong tree or not. Grant Darwin NT |
JAMES Send message Joined: 5 May 07 Posts: 8 Credit: 275,386 RAC: 0 |
Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted. Look at it this way, it could have been worse. You could have gotten 999 WU’s from ClimatePrediction. They come in at about 325 MB’s each. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,762,257 RAC: 22,840 |
Look at it this way, it could have been worse. You could have gotten 999 WU’s from ClimatePrediction. They come in at about 325 MB’s each.Some Rosetta Tasks can use up to 1GB of HDD space (actually it's probably more than that, as many Tasks use less- at one stage i had 12GB of HDD space in use by Rosetta with 12 Tasks running). Grant Darwin NT |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,762,257 RAC: 22,840 |
I've not a clue, but put out a request for help to those that might.If the project configures their settings so that only Validated work is considered for Runtime/Estimated time calculations, it will stop faulty Tasks that crash & burn early on/instantly from affecting the Estimated completion times.If you could track done the specific project setting required, I'd be glad to suggest it to the Project Team. Edit- no response to the call for help yet, but from what i've found it's really rather ugly as the Runtime estimation is a significant part of how Credit is calculated. And that gave me nothing but headaches when tying to make sense of it all in the past.[/quote]And help has arrived. The answer- The keyword to look for is "runtime outlier". We did have exactly this problem at SETI around 2011, and we pressurised David Anderson to implement a fix. It's done in the validator (which of course is project-specific code): in SETI's case, we look for the overflow marker Grant Darwin NT |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted. Good choice. While scheduling is a Boinc issue, not Rosetta, Rosetta's initial runtime setting for new program versions makes it worse. But with a 0.5+0.1 queue and 8hr runtime, Boinc will time out the initially wrongly sent tasks within 24hrs and give you until the 3-day deadline to send back and be credited for as many as can be completed by then even if you don't manually intervene. Intervening may even make matters worse, so save yourself the trouble. This is a start-up, one-time issue for new hosts and/or new program versions. The project isn't about the first few days but however long the host contributes, so no need to obsess over it in the first day or two. The host will contribute the maximum it can either way. |
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
It has nothing to do with WUs failing or the runtime estimate being wrong. I can crunch any Rosetta WU they send. Rosetta just simply does not respect BOINC settings and DLs far too many WUs. Client side fix is to set all computers to No New Work and abort a few thousand a day. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
It has nothing to do with WUs failing or the runtime estimate being wrong What is the <expected> runtime of your tasks now? Has it got closer to the runtime that you set? It will by now. If so, it'll only be grabbing what you can complete within the deadline with a maximum cache setting of 1.5 days. That's how it works. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
My machine, once I reset to the new project URL, got a new batch of v4.20 tasks and estimated they would take 1H 27M to complete. So, even with a small cache, that's easily waaaaayyyy too much work for my 24 hour runtime preference. So, it does happen. Small cache is helpful, but still doesn't address everything. Especially if your first tasks for the new application version come in when you are away from the machine. Rosetta Moderator: Mod.Sense |
Tomcat雄猫 Send message Joined: 20 Dec 14 Posts: 180 Credit: 5,386,173 RAC: 0 |
My machine, once I reset to the new project URL, got a new batch of v4.20 tasks and estimated they would take 1H 27M to complete. So, even with a small cache, that's easily waaaaayyyy too much work for my 24 hour runtime preference. So, it does happen. Small cache is helpful, but still doesn't address everything. Especially if your first tasks for the new application version come in when you are away from the machine. I decided to give Ralph a spin on my Mac. My cache setting is set to 0.1 + 0 days and it still somehow managed to download nearly too much work. That's because the estimated completion times on my tasks are 47 minutes and 39 seconds, which is a new low... *sigh* |
CIA Send message Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
You will find that most Ralph tasks run for about 1 hour or so regardless of the estimated completion time. Disregard anything about the % complete, or the time complete, they shoot to 100% suddenly without warning. Run a few, you'll see what I mean. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Ralph has a runtime preference as well. It also tests WUs sometimes that limit the number of models produced. Rosetta Moderator: Mod.Sense |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,762,257 RAC: 22,840 |
Last night Rosetta delivered 999 WUs to a quad-core computer with a total estimated time to complete of 265 days. Its queue was set to 0.5/0.1 days. This problem has gotten so bad I've had to set all my computers to accept no new work from Rosetta. Only when it runs out do I let it download way too much work requiring that most be aborted.Hence my suggestion to fix the problem with Estimated completion times for new hosts/applications. Of course the fact you have such a large cache setting while running so many projects just exacerbates the severity of your problem. Grant Darwin NT |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
i'm not too sure if this would help but i set zero task cache , store additional zero days of work my setup is 0.1 / 0 for now it seemed to work on Pi4 i'm not sure about the rest i'm not too sure if there is any boinc client configs that can further limit the number of tasks downloaded. in the most extreme it may take a custom boinc-client to fix it i'd think. how about make an entry in the boinc forums to see if they could help? perhaps provide a new option to limit work cache based on the number of tasks? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Scheduler changes are being tested on Ralph that will help avoid getting more work than can be completed within the deadline. Rosetta Moderator: Mod.Sense |
Message boards :
Number crunching :
Rosetta WU delivery out of control
©2024 University of Washington
https://www.bakerlab.org