Downloaded way too many tasks at once?

Message boards : Number crunching : Downloaded way too many tasks at once?

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,537,480
RAC: 776
Message 96212 - Posted: 7 May 2020, 10:25:53 UTC
Last modified: 7 May 2020, 10:32:48 UTC

Suddenly my laptop, which runs 3 cores on Rosetta with a 3+3 hour buffer and has worked perfectly for ages, downloaded about 250 tasks of 7.5 hours each. That is about 26 days of work, and the tasks have a 3 day deadline. Server hiccup? I've aborted enough that I now have only a 2 day queue.
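As a rough check of the 26 day figure, here is a minimal sketch. It assumes all 3 cores run Rosetta around the clock and that every task takes its full 7.5 hour estimate; the function name is made up for illustration.

```python
def days_to_clear_queue(task_count: int, hours_per_task: float, cores: int) -> float:
    """Wall-clock days needed to finish the queue when `cores` tasks run in parallel."""
    total_cpu_hours = task_count * hours_per_task
    return total_cpu_hours / cores / 24


# Numbers from the post: about 250 tasks, 7.5 hours each, 3 cores.
queued_days = days_to_clear_queue(task_count=250, hours_per_task=7.5, cores=3)
print(f"{queued_days:.1f} days of work against a 3 day deadline")
```

With those inputs the queue works out to roughly 26 days, far beyond both the deadline and the 6 hour buffer.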

After checking the logs, it seems it was downloading one at a time repeatedly for the last few hours. I shall reboot the machine in case Boinc's scheduler is mixed up, but has anyone else seen this happen? Can anyone check the tasks sent to my machine to see what happened? It's this one: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3792849

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,334,313
RAC: 16,402
Message 96213 - Posted: 7 May 2020, 11:08:03 UTC
Last modified: 7 May 2020, 11:08:39 UTC

It was just that one computer, and you didn't change any of the local settings (such as cache size)?
Even so, it still shouldn't have occurred, as the project recently changed the way Rosetta determines how much work to send. Instead of using the Estimated completion time, it's using the Target CPU Runtime.
So even if a new application is released (which has no processing history at all so a very, very low Estimated completion time) it should stop a system from getting more work than it can handle.
Grant
Darwin NT
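To see why the runtime figure used for sizing matters, here is a minimal sketch. It is not the project's actual server logic, just the arithmetic of filling a buffer; the function name and the 10 minute "new application" estimate are assumptions for illustration, while the 3+3 hour buffer, 3 cores and 7.5 hour figure come from the first post.

```python
def tasks_to_fill_buffer(buffer_minutes: int, cores: int, runtime_minutes: int) -> int:
    """Tasks needed to keep `cores` busy for `buffer_minutes` (rounded up)."""
    core_minutes = buffer_minutes * cores
    return (core_minutes + runtime_minutes - 1) // runtime_minutes


# 3 + 3 hour buffer on 3 cores, sized with a 7.5 hour per-task runtime.
sized_by_long_runtime = tasks_to_fill_buffer(360, 3, 450)

# Same buffer, sized with a hypothetical 10 minute estimate for a brand-new app.
sized_by_low_estimate = tasks_to_fill_buffer(360, 3, 10)

print(sized_by_long_runtime, sized_by_low_estimate)
```

A very low estimate inflates the request many times over, which is the failure the switch to Target CPU Runtime is meant to prevent. Neither case comes close to 250 tasks, so something else was going on here.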

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,537,480
RAC: 776
Message 96223 - Posted: 7 May 2020, 12:13:53 UTC - in response to Message 96213.  

It was just that one computer, and you didn't change any of the local settings (such as cache size)?
Even so, it still shouldn't have occurred, as the project recently changed the way Rosetta determines how much work to send. Instead of using the Estimated completion time, it's using the Target CPU Runtime.
So even if a new application is released (which has no processing history at all so a very, very low Estimated completion time) it should stop a system from getting more work than it can handle.


The other 3 computers didn't do anything unusual.

I changed nothing, apart from shutting it down to move it to another room yesterday (along with 2 others which didn't play up). At the most its local IP could have changed, but from the Rosetta end it probably only sees my public IP, of which I only have one.

All the tasks were showing an estimated time of 7.5 hours, as they usually do. I did notice some different task names beginning SR5A that I haven't seen before, but they still have the 7.5 hour estimate like the rest.

After aborting most of the tasks and rebooting it, it's not trying to get any more.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,537,480
RAC: 776
Message 96333 - Posted: 10 May 2020, 11:44:22 UTC - in response to Message 96213.  

It was just that one computer, and you didn't change any of the local settings (such as cache size)?
Even so, it still shouldn't have occurred, as the project recently changed the way Rosetta determines how much work to send. Instead of using the Estimated completion time, it's using the Target CPU Runtime.
So even if a new application is released (which has no processing history at all so a very, very low Estimated completion time) it should stop a system from getting more work than it can handle.


The same machine did it again, and I think for the same reason: it was shut down (uncleanly, if that matters) and rebooted.

It's this one: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3792849

Let me know if I can provide some logs to show the problem. I can probably make it do it again by crashing it.

MarkJ
Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 383
Message 96335 - Posted: 10 May 2020, 12:03:21 UTC

I had similar behaviour when we switched to the 4.20 app. Fortunately I noticed and set the machines to “no new tasks”. After they had returned their first batch of work the estimates had adjusted to the usual 8 hours run time.

Another possibility. Do you use an app_config file with a max concurrent statement? There was a bug with BOINC 7.16 where it would do as you described. It is supposedly fixed with 7.16.6, not sure if it made it into the Windows 7.16.5 version or not.
BOINC blog

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,537,480
RAC: 776
Message 96340 - Posted: 10 May 2020, 15:40:09 UTC - in response to Message 96335.  
Last modified: 10 May 2020, 15:42:16 UTC

I had similar behaviour when we switched to the 4.20 app. Fortunately I noticed and set the machines to “no new tasks”. After they had returned their first batch of work the estimates had adjusted to the usual 8 hours run time.


The estimates are fine, it's just downloading way too much - it shows an estimate of a month for tasks that have a 3 day deadline and my buffer is 3-6 hours.

It's only done it twice, and both times shortly after rebooting.

Another possibility. Do you use an app_config file with a max concurrent statement? There was a bug with BOINC 7.16 where it would do as you described. It is supposedly fixed with 7.16.6, not sure if it made it into the Windows 7.16.5 version or not.


Yes! And that's the only machine with one. The reason I did so is it has 4 cores and 8GB RAM. 4 Rosettas can sometimes make it run out of RAM and a core is (stupidly) left idle. I had to use the file to tell Boinc to limit Rosetta to 3 WUs, so it runs a Universe on the other core. You'd think it would work this out for itself. 0.5GB left, Rosetta is too big, so obviously try to fit in a smaller project's task?! Boinc really annoys me sometimes with its gross stupidity.
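For anyone wanting to reproduce the setup, here is a minimal sketch of what such a file can look like. The actual file on this machine isn't shown in the thread, so this is an assumption: it uses the project-wide `<project_max_concurrent>` element rather than a per-app `<max_concurrent>` (which would also need the app's short name). The Python helper is hypothetical and only renders and sanity-checks the XML.

```python
import xml.etree.ElementTree as ET

# Assumed layout of an app_config.xml that caps the whole project at N running tasks.
APP_CONFIG_TEMPLATE = """<app_config>
    <project_max_concurrent>{limit}</project_max_concurrent>
</app_config>
"""


def render_app_config(limit: int) -> str:
    """Return app_config.xml text limiting the project to `limit` concurrent tasks."""
    text = APP_CONFIG_TEMPLATE.format(limit=limit)
    ET.fromstring(text)  # raises ParseError if the XML is malformed
    return text


app_config_xml = render_app_config(3)
print(app_config_xml)
```

The resulting text would be saved as app_config.xml in the Rosetta project folder under the BOINC data directory.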

I'm using 7.16.5 in Windows 10. 4 CPU cores, no GPUs on that machine.

MarkJ
Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 383
Message 96356 - Posted: 11 May 2020, 8:23:27 UTC - in response to Message 96340.  

Another possibility. Do you use an app_config file with a max concurrent statement? There was a bug with BOINC 7.16 where it would do as you described. It is supposedly fixed with 7.16.6, not sure if it made it into the Windows 7.16.5 version or not.


Yes! And that's the only machine with one.

I’ll ask on the BOINC mailing list whether the fix made it into the Windows 7.16.5 build, or whether the fix didn’t work. I haven’t managed to get a 7.16.6 for Linux yet.
BOINC blog

floyd
Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 96359 - Posted: 11 May 2020, 12:01:55 UTC - in response to Message 96340.  

0.5GB left, Rosetta is too big, so obviously try to fit in a smaller project's task?! Boinc really annoys me sometimes with its gross stupidity.
Not to do what you find obvious could well have been a design decision. Not all people are idiots you know.

mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,557,574
RAC: 10,753
Message 96360 - Posted: 11 May 2020, 12:30:26 UTC - in response to Message 96356.  

I haven’t managed to get a 7.16.6 for Linux yet.


Try here: https://boinc.berkeley.edu/dl/?C=M;O=D

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,537,480
RAC: 776
Message 96377 - Posted: 11 May 2020, 21:10:19 UTC - in response to Message 96359.  

0.5GB left, Rosetta is too big, so obviously try to fit in a smaller project's task?! Boinc really annoys me sometimes with its gross stupidity.
Not to do what you find obvious could well have been a design decision. Not all people are idiots you know.


There can be no sensible reason for that. Imagine you have a cardboard box that you're going to use to transport some stuff somewhere, and you can fit 3 large objects into it. It's technically big enough for 3.5 objects, but you can't cut those objects in half. But you need to carry some smaller objects too, so why not put them in as well? Empty space is pointless.

MarkJ
Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 383
Message 96381 - Posted: 12 May 2020, 4:50:17 UTC - in response to Message 96360.  
Last modified: 12 May 2020, 5:01:35 UTC

I haven’t managed to get a 7.16.6 for Linux yet.

Try here: https://boinc.berkeley.edu/dl/?C=M;O=D

That is DA's attempt at building under Ubuntu. I'm not sure which version of Ubuntu, probably 18.04.

I'll wait for the Debian maintainers' version. They got it as far as Bullseye, but that has a newer Glibc and GCC than Buster (the current release of Debian).
BOINC blog

MarkJ
Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 383
Message 96382 - Posted: 12 May 2020, 4:57:59 UTC - in response to Message 96356.  
Last modified: 12 May 2020, 4:59:12 UTC

Yes! And that's the only machine with one.

Peter, can you give me a log with the sched_op_debug flag turned on? You'd have to allow work fetch. If it's the bug I raised, it will ask for work as the current tasks get close to completion, requesting one task at a time (repeatedly) until you turn off work fetch. Sending it via private message would be good, and I will pass it on to the mailing list. I only need a few work requests, no need for the whole log.
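For reference, the flag is set in cc_config.xml in the BOINC data directory, inside the `<log_flags>` section. Below is a minimal sketch that renders such a file; the helper name is hypothetical, and an existing cc_config.xml with other options would need the flag merged in rather than overwritten.

```python
import xml.etree.ElementTree as ET


def render_cc_config(log_flags: list[str]) -> str:
    """Return cc_config.xml text with each named log flag switched on."""
    lines = ["<cc_config>", "    <log_flags>"]
    for flag in log_flags:
        lines.append(f"        <{flag}>1</{flag}>")
    lines += ["    </log_flags>", "</cc_config>"]
    text = "\n".join(lines) + "\n"
    ET.fromstring(text)  # raises ParseError if the XML is malformed
    return text


cc_config_xml = render_cc_config(["sched_op_debug"])
print(cc_config_xml)
```

After the client re-reads its config files, the event log should show more detail for each scheduler request, which is the part being asked for here.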
BOINC blog

floyd
Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 96400 - Posted: 12 May 2020, 17:00:38 UTC - in response to Message 96377.  

There can be no sensible reason for that. Imagine you have a cardboard box
I understand your thoughts, I just disagree when you say there's no alternative to your suggested action. Let's get back to BOINC, your cardboard box example is a little too simplified.

So you have some tasks running and a big Rosetta task and some other smaller tasks waiting. The Rosetta task doesn't fit so you run a smaller one instead and have made sure there are no unused resources. But that's not the end of the story, the big one is still waiting. Now when a task ends and you get some free memory you could face the same situation and again you decide to do something else first. Can you be sure the big task will ever run? You rely on your luck there. You always do as much as possible but you can't be sure that all will be done eventually.

There is an alternative if you don't have enough memory: keep what you have and wait for more. That way you'll eventually have enough but you don't do as much work as you could. This is not ignoring the situation, there's a plan behind it but with other objectives than yours.

These are just two simple alternatives, both with their advantages and drawbacks. You could of course improve them while adding complexity. I'm not trying to start an argument about the best decision here, I just want to show there's more than one. If someone doesn't do what you think they should that doesn't necessarily mean they haven't thought about it. In fact they could have thought carefully about it and decided otherwise. None of us knows.
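The two alternatives can be compared in a toy model. This is only a sketch under made-up assumptions, not BOINC's scheduler: memory is counted in units, each small task takes 1 unit and 4 ticks, the big task needs 2 units, and small tasks are always available to start.

```python
def ticks_until_big_task_starts(fill_gaps: bool, capacity: int = 4, big_mem: int = 2,
                                small_ticks: int = 4, horizon: int = 50):
    """Return (start_tick, idle_unit_ticks) for the waiting big task.

    start_tick is None if the big task never found `big_mem` free units
    within `horizon` ticks.
    """
    # Small tasks already running, finishing one per tick (staggered).
    remaining = list(range(1, capacity + 1))
    idle_unit_ticks = 0
    for tick in range(horizon):
        free = capacity - len(remaining)
        if free >= big_mem:
            return tick, idle_unit_ticks
        if fill_gaps:
            # Policy 1: never leave memory unused, start small tasks instead.
            remaining += [small_ticks] * free
            free = 0
        # Policy 2 (fill_gaps=False): leave the free units idle and wait.
        idle_unit_ticks += free
        # Advance one tick; tasks on their last tick finish and free their unit.
        remaining = [t - 1 for t in remaining if t > 1]
    return None, idle_unit_ticks


always_fill = ticks_until_big_task_starts(fill_gaps=True)
wait_for_room = ticks_until_big_task_starts(fill_gaps=False)
print(always_fill, wait_for_room)
```

In this setup, always filling the gap wastes nothing but the big task never starts within the horizon, while waiting leaves one unit idle for a tick and starts the big task at tick 2. Different numbers would shift the balance, which is the point: each policy trades one objective for another.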

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,537,480
RAC: 776
Message 96404 - Posted: 12 May 2020, 20:54:13 UTC - in response to Message 96400.  

There can be no sensible reason for that. Imagine you have a cardboard box
I understand your thoughts, I just disagree when you say there's no alternative to your suggested action. Let's get back to BOINC, your cardboard box example is a little too simplified. So you have some tasks running and a big Rosetta task and some other smaller tasks waiting. The Rosetta task doesn't fit so you run a smaller one instead and have made sure there's no unused resources. But that's not the end of the story, the big one is still waiting. Now when a task ends and you get some free memory you could face the same situation and again you decide to do something else first. Can you be sure the big task will ever run? You rely on your luck there. You always do as much as possible but you can't be sure that all will be done eventually. There is an alternative if you don't have enough memory, keep what you have and wait for more. That way you'll eventually have enough but you don't do as much work as you could. This is not ignoring the situation, there's a plan behind it but with other objectives than yours. These are just two simple alternatives, both with their advantages and drawbacks. You could of course improve them while adding complexity. I'm not trying to start an argument about the best decision here, I just want to show there's more than one. If someone doesn't do what you think they should that doesn't necessarily mean they haven't thought about it. In fact they could have thought carefully about it and decided otherwise. We all don't know.


Every x minutes, running tasks are changed, as per: Boinc manager, options, computing preferences, switch between tasks every x minutes. So even if the RAM was full of small tasks and Rosetta would never get in, Boinc will reassess every x minutes, realise Rosetta is falling behind in the project weighting that's set, and stop all tasks, put Rosetta in first, then add others to refill the CPU.
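The "falling behind in the project weighting" idea can be sketched as follows. This is a simplified illustration of the reasoning in this post, not how the BOINC client actually computes priorities; the function name, the share values and the hours are all hypothetical.

```python
def project_most_behind(shares: dict[str, float], cpu_hours_done: dict[str, float]) -> str:
    """Pick the project whose CPU time is furthest below its weighted share."""
    total_share = sum(shares.values())
    total_done = sum(cpu_hours_done.values())

    def deficit(project: str) -> float:
        entitled = total_done * shares[project] / total_share
        return entitled - cpu_hours_done[project]

    return max(shares, key=deficit)


# Hypothetical weights and history: Rosetta has been squeezed out by smaller tasks.
next_project = project_most_behind(
    shares={"Rosetta": 75, "Universe": 25},
    cpu_hours_done={"Rosetta": 10, "Universe": 30},
)
print(next_project)
```

At each switch interval, a scheduler working this way would give the lagging project first claim on memory and then fill what is left with other tasks.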


