task swamping on multi-project host guidance requested

Message boards : Number crunching : task swamping on multi-project host guidance requested


Viktor

Joined: 7 Jul 08
Posts: 5
Credit: 3,281,899
RAC: 656
Message 101381 - Posted: 20 Apr 2021, 1:57:04 UTC

Howdy all,

I have a linux machine running boinc 24/7. I run Milkyway@home on 1 core/1 gpu, Einstein@home on 1 core/1 gpu, Rosetta@home on 4 cpu cores.

To accomplish this I have my rosetta app_config set to:
<project_max_concurrent>4</project_max_concurrent>


This works great except as soon as I accept tasks Rosetta@home feels the need to give me 1000 tasks which are due in 5 minutes. (Exaggeration, but not by much.) If I turn my cache to .01 - .01 which seems to be the overall preferred "fix" after much google action my gpu projects starve due to lack of cache.

Ideas?
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1670
Credit: 17,460,784
RAC: 24,759
Message 101384 - Posted: 20 Apr 2021, 6:56:48 UTC - in response to Message 101381.  

Howdy all,

I have a linux machine running boinc 24/7. I run Milkyway@home on 1 core/1 gpu, Einstein@home on 1 core/1 gpu, Rosetta@home on 4 cpu cores.

To accomplish this I have my rosetta app_config set to:
<project_max_concurrent>4</project_max_concurrent>


This works great except as soon as I accept tasks Rosetta@home feels the need to give me 1000 tasks which are due in 5 minutes. (Exaggeration, but not by much.) If I turn my cache to .01 - .01 which seems to be the overall preferred "fix" after much google action my gpu projects starve due to lack of cache.

Ideas?
Don't use project_max_concurrent.
With the number of cores/threads limited for Rosetta, the system will struggle to do enough work to meet your Resource share settings, as the GPU projects will always be outperforming the work done by CPU-only Rosetta. So in order to do enough Rosetta work to catch up with the GPU projects, it will need to stop doing GPU work. Give Rosetta more cores & threads, and the GPUs can continue to crunch without getting way ahead of Rosetta for work done.


Ideally, use an app_config.xml file to reserve a CPU core/thread to support your GPUs (if needed), but allow all projects to use all available CPU cores/threads that aren't being used to support a GPU. With more than one project, no cache is best as it will allow your Resource share settings to be met in a matter of days (or weeks) and not months (possibly many months).
As long as the Estimated completion time for any Rosetta Tasks you get is around 8 hours, and Rosetta can use all the available CPU core/threads (other than the 2 reserved to support the GPUs), with no cache things should settle down within 24hrs.
We did have a batch of work that was erroring out in a matter of seconds, and a couple of other batches that could error out after only an hour or 2, but they have been cleared up so things should settle down now.
Grant
Darwin NT
floyd

Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 101386 - Posted: 20 Apr 2021, 9:51:09 UTC

First, I agree with Grant's analysis of the underlying problem. Second, I'd like to suggest another course of action which may be more to your liking.

Some remarks first. Don't use project_max_concurrent, and if you do, make sure to adjust "use n% of CPUs" accordingly. Otherwise you can expect BOINC to fetch more tasks than you allow it to actually process. Don't insist on running 1 Milkyway, 1 Einstein and 4 Rosetta tasks at all times. Run each project CPU-only or GPU-only and use resource share to balance projects within each group; this will not work across groups. Keep your cache of work small, but you don't need to go as far as 0.01 days. Maybe 0.1 to 0.5 days is good.

Plan 1:
(This is mostly what Grant already suggested) Configure your GPU projects to reserve 1 CPU per task. Configure BOINC to use 100% of CPUs or whatever your preferred maximum is.
Pro: Will always fully load your CPU.
Con: May not run GPU work at all times.

Plan 2:
Set "use n% of CPUs" for CPU tasks only. Make sure there's one CPU left for any possible GPU task running concurrently. Configure the GPU projects to reserve 0.1 (or even less) CPUs per task so the total of all possible GPU tasks is less than 1 CPU.
Pro: Will always run GPU work if available.
Con: If there is no GPU work will not do more CPU work instead.

I think plan 2 is more like what you want.
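
A minimal sketch of what the per-task CPU reservation in Plan 2 could look like in an app_config.xml placed in the GPU project's directory. The app name `milkyway` is an assumption and has to match whatever name your client reports for the application; the 0.1 value is the one suggested above. The comments are for the reader only and can be dropped.

```xml
<!-- app_config.xml for a GPU project (sketch for Plan 2, not a tested config) -->
<app_config>
  <app>
    <!-- Assumed app name; check the name your client shows for this project -->
    <name>milkyway</name>
    <gpu_versions>
      <!-- One task per GPU -->
      <gpu_usage>1.0</gpu_usage>
      <!-- Budget a tenth of a CPU per GPU task, so two GPU tasks total less than 1 CPU -->
      <cpu_usage>0.1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

For Plan 1 the same file would carry `<cpu_usage>1.0</cpu_usage>` instead. Either way the "use n% of CPUs" preference still has to be set separately.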
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 101394 - Posted: 20 Apr 2021, 14:31:40 UTC - in response to Message 101381.  

This works great except as soon as I accept tasks Rosetta@home feels the need to give me 1000 tasks which are due in 5 minutes. (Exaggeration, but not by much.) If I turn my cache to .01 - .01 which seems to be the overall preferred "fix" after much google action my gpu projects starve due to lack of cache.

Recent (in the last couple of years) versions of BOINC have a strange problem due to a change in the scheduler, where they randomly go berserk and download too many work units.
I have posted on it in a number of forums. It will eventually correct itself, but in the meantime you can do some of the other fixes.
mikey

Joined: 5 Jan 06
Posts: 1895
Credit: 9,108,550
RAC: 6,406
Message 101407 - Posted: 20 Apr 2021, 21:42:34 UTC - in response to Message 101394.  

This works great except as soon as I accept tasks Rosetta@home feels the need to give me 1000 tasks which are due in 5 minutes. (Exaggeration, but not by much.) If I turn my cache to .01 - .01 which seems to be the overall preferred "fix" after much google action my gpu projects starve due to lack of cache.


Recent (in the last couple of years) versions of BOINC have a strange problem due to a change in the scheduler, where they randomly go berserk and download too many work units.
I have posted on it in a number of forums. It will eventually correct itself, but in the meantime you can do some of the other fixes.


AND it's important to remember that aborting unwanted tasks is an okay thing to do!!! JUST because you got sent a bazillion tasks doesn't mean you have to actually try and finish them; abort the unwanted ones.
Viktor

Joined: 7 Jul 08
Posts: 5
Credit: 3,281,899
RAC: 656
Message 101424 - Posted: 21 Apr 2021, 18:20:48 UTC - in response to Message 101407.  

Thank you guys for your thoughtful replies. I will tinker with the settings and see if I can get the desired behavior out of my setup. I like the second plan proposed. I do not want my GPUs idle and I need to hold back 2 cores for other non-BOINC work.
Viktor

Joined: 7 Jul 08
Posts: 5
Credit: 3,281,899
RAC: 656
Message 101445 - Posted: 22 Apr 2021, 13:59:28 UTC - in response to Message 101424.  

Thanks again all who helped. Asking for aid and then not giving updates is a dick'ish move.... thus:


* Gutted all my controls via app_config on projects
* Changed my prefs to use max of 75% of cores
* Kept my cc_config GPU exclusions to force certain GPU apps onto certain GPUs
* Verified .5 day cache with .01 additional
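
For readers who set these locally rather than on the project website, a rough sketch of how the preference values in the list above might look in global_prefs_override.xml. That the preferences were set through this file is an assumption; the numbers are the ones listed.

```xml
<!-- global_prefs_override.xml in the BOINC data directory (sketch) -->
<global_preferences>
  <!-- Use at most 75% of the CPUs -->
  <max_ncpus_pct>75.0</max_ncpus_pct>
  <!-- Store at least 0.5 days of work -->
  <work_buf_min_days>0.5</work_buf_min_days>
  <!-- Store up to an additional 0.01 days of work -->
  <work_buf_additional_days>0.01</work_buf_additional_days>
</global_preferences>
```

A GPU exclusion in cc_config.xml has roughly the following shape. The URL, device number and app name here are placeholders, not the values actually used on this host.

```xml
<!-- cc_config.xml in the BOINC data directory (sketch with placeholder values) -->
<cc_config>
  <options>
    <exclude_gpu>
      <!-- Placeholder: the project's master URL as the client knows it -->
      <url>https://example.org/project/</url>
      <!-- Placeholder: the GPU this project should NOT use -->
      <device_num>1</device_num>
      <!-- Optional: restrict the exclusion to one app (placeholder name) -->
      <app>some_app_name</app>
    </exclude_gpu>
  </options>
</cc_config>
```

Both files can be re-read without a restart via `boinccmd --read_global_prefs_override` and `boinccmd --read_cc_config`.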

Updated all projects and kickstarted it. Rosetta took 5 cores, GPU projects 1 per, 2 total.

* Changed my prefs to use max of 74% of cores because inclusive programmer math. Oops.

Updated all projects and kickstarted it. Rosetta took 4 cores, GPU projects 1 per. Rosetta has 4 tasks waiting in reserve which is perfect.
floyd

Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 101446 - Posted: 22 Apr 2021, 17:05:46 UTC - in response to Message 101445.  

* Gutted all my controls via app_config on projects
Please don't be so vague, undoing app_config settings is not trivial. Deleting the file is not enough, and neither is reloading the (now non-existent) configuration, updating the project or restarting BOINC: at least the CPU and GPU values persist. The project's original values only come back with new tasks, but I'm not sure to what extent they are applied then. I am however sure that the values displayed with old tasks are not updated without another client restart, so whatever you see there may be outdated.
When I want to revert app_config settings I first change them to the values used by the project, then reload the configuration, then delete it and restart the client. And I try to avoid app_config in the first place. Don't think of app_config as an easy and safe configuration tool for average users; it is a later add-on to BOINC which as far as I know has never been fully integrated. If you use it you can expect unexpected things to happen. I'm quite sure that getting many more tasks than you could finish was one such thing.

* Changed my prefs to use max of 75% of cores
At that point the event log will show you how many CPUs that translates to. Likely the correct six. I've seen BOINC schedule one CPU more than configured when in panic mode but that shouldn't be the case here with only 10 tasks in progress and nearly full time left.

Updated all projects and kickstarted it. Rosetta took 5 cores, GPU projects 1 per, 2 total.
Is that what the Manager showed you? Again, that may not be reality. Without different configuration I'd expect 1 core total scheduled for the GPU tasks and the remaining 5 of 6 for CPU tasks. Real usage will rather have been 2+5, more than you wanted. But if the Manager displayed just that in this case it was coincidence.

* Changed my prefs to use max of 74% of cores because inclusive programmer math.
I don't think so.
Either set 75% and configure the GPU projects to schedule 1 CPU and 1 GPU per task. That way up to 2 CPUs will be scheduled for (usually) 2 GPU tasks and the remaining 4-6 for CPU tasks. OR set 50% and 0.1 CPU + 1 GPU. Due to the way BOINC schedules CPUs it will not reserve any for GPU support (but the applications still use them) and you always have 4 for CPU tasks but never more. That's two simple suggestions, of course you can make things more complicated by running several tasks per GPU.
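
As a worked check of the two suggestions, here is the CPU budget for each. It assumes the 8-core host described later in the thread, two GPU tasks running, and that the client rounds the CPU count down to a whole number; none of this is taken from BOINC's source.

```latex
\begin{align*}
\text{Option 1 (75\%, 1 CPU per GPU task):}\quad
  & \lfloor 0.75 \times 8 \rfloor = 6, \quad 6 - 2 \times 1 = 4 \text{ CPU tasks} \\
\text{Option 2 (50\%, 0.1 CPU per GPU task):}\quad
  & \lfloor 0.50 \times 8 \rfloor = 4, \quad 2 \times 0.1 = 0.2 < 1 \;\Rightarrow\; 4 \text{ CPU tasks}
\end{align*}
```

In Option 1 the CPU task count can rise from 4 towards 6 whenever fewer GPU tasks are running; in Option 2 it stays at 4.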
Viktor

Joined: 7 Jul 08
Posts: 5
Credit: 3,281,899
RAC: 656
Message 101513 - Posted: 26 Apr 2021, 0:13:58 UTC - in response to Message 101446.  

Please don't be so vague, undoing app_config settings is not trivial.
Sure thing, and warning heeded. I was checking project status and found a private message from a user in 2020 offering help with the number of errors my client was throwing. I did a deep dive on how my Rosetta progress was going and noticed the flood of tasks, etc. mentioned in my initial post. I disallowed any new Rosetta tasks a week ago and let the existing ones run through to avoid giving the project any headaches. After I had no tasks left I posted on the forum and received help. Per recommendations I removed the 1 line present in my app_config, which was there to limit the concurrent tasks. As BOINC does not like a blank app_config, I deleted it from all projects, as I had only created them to help balance Rosetta vs the GPU projects. I issued the command to update the projects via "boinccmd --project (url of project) update". I restarted the BOINC service, which was when I ran into the 6 vs 7 problem. See below.

At that point the event log will show you how many CPUs that translates to. Likely the correct six. I've seen BOINC schedule one CPU more than configured when in panic mode but that shouldn't be the case here with only 10 tasks in progress and nearly full time left.
Is that what the Manager showed you? Again, that may not be reality. Without different configuration I'd expect 1 core total scheduled for the GPU tasks and the remaining 5 of 6 for CPU tasks. Real usage will rather have been 2+5, more than you wanted. But if the Manager displayed just that in this case it was coincidence.
I agree that in theory BOINC with 75% volunteered on an 8 core CPU should = 6 cores. With that allocation Rosetta wanted to run 5 processes and my two GPU projects wanted to run 2 total, resulting in 7 total used. 6=/=7. My amateur assumption was that I had run into a "counts from 0" issue. My solution was to volunteer 74%, which is confirmed as 5 cores via journalctl. 74% "cpus" volunteered on an 8 core is 5.9x.... so it makes no sense to me that this would produce my desired effect:
viktor@bender:~$ ps -u boinc
    PID TTY          TIME CMD
  84619 ?        00:00:16 boinc
  84673 ?        00:11:47 rosetta_4.20_x8
  84676 ?        00:11:42 rosetta_4.20_x8
  84678 ?        00:11:37 rosetta_4.20_x8
  84681 ?        00:11:32 rosetta_4.20_x8
  84746 ?        00:09:00 hsgamma_FGRPB1G
  84812 ?        00:00:40 milkyway_1.46_x


with boinc reporting:
max CPUs used: 5
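
The "max CPUs used: 5" line is what you would get if the client simply rounds the CPU count down. A worked check, assuming (this is not confirmed anywhere in the thread) that BOINC truncates 8 × pct / 100 to a whole number:

```latex
\[
N_{75} = \left\lfloor 8 \times \tfrac{75}{100} \right\rfloor = \lfloor 6.00 \rfloor = 6,
\qquad
N_{74} = \left\lfloor 8 \times \tfrac{74}{100} \right\rfloor = \lfloor 5.92 \rfloor = 5
\]
```

On that assumption there is no off-by-one: 74% is just 5 CPUs. The 5 Rosetta processes seen at 75% then match floyd's expectation of about 1 CPU scheduled for the two GPU tasks in total, leaving 5 of 6 for CPU tasks, and at 74% the same reasoning leaves 4 of 5.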


As to what the event manager thinks I can't help you. I could try to fire up a gui, but I can gather what info I need from logs/ps/nvidia-smi/etc.

Either set 75% and configure the GPU projects to schedule 1 CPU and 1 GPU per task. That way up to 2 CPUs will be scheduled for (usually) 2 GPU tasks and the remaining 4-6 for CPU tasks.


Ok, so it sounds like regardless of my current real life situation being what I am looking for, I came to it via an incorrect way. I am 100% down to keep working until it is done right. I will work with cpu_usage on gpu_versions of the GPU projects. I know I sound like a broken record, but thanks. The replies take me ~15 minutes to type out and those who are providing aid are doing so of their own free will. Much easier to click "next thread". I will report back when I bump my GPU apps to cpu_usage of 1 and see if rosetta takes the other 4 seats.
Viktor

Joined: 7 Jul 08
Posts: 5
Credit: 3,281,899
RAC: 656
Message 101515 - Posted: 26 Apr 2021, 0:43:44 UTC - in response to Message 101513.  
Last modified: 26 Apr 2021, 0:54:49 UTC

Either set 75% and configure the GPU projects to schedule 1 CPU and 1 GPU per task. That way up to 2 CPUs will be scheduled for (usually) 2 GPU tasks and the remaining 4-6 for CPU tasks.


Ok, so it sounds like regardless of my current real life situation being what I am looking for, I came to it via an incorrect way. I am 100% down to keep working until it is done right. I will work with cpu_usage on gpu_versions of the GPU projects. I know I sound like a broken record, but thanks. The replies take me ~15 minutes to type out and those who are providing aid are doing so of their own free will. Much easier to click "next thread". I will report back when I bump my GPU apps to cpu_usage of 1 and see if rosetta takes the other 4 seats.


Well, that did it. Forcing the GPU projects to eat 1 core each (2 total) and volunteering 75% of cores has resulted in 4 Rosetta tasks, 1 Milkyway and 1 Einstein. In hindsight this makes sense: if the GPU projects were each budgeted only a fraction of a CPU core before, the math works. Will report back in a week or so.
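
For anyone landing here later, a minimal sketch of the kind of app_config.xml this describes, here for Einstein@home. The app name `hsgamma_FGRPB1G` is taken from the ps output earlier in the thread; whether that is exactly the name the client expects is an assumption, and the Milkyway@home file would need its own app name.

```xml
<!-- app_config.xml in the Einstein@home project directory (sketch, not the poster's actual file) -->
<app_config>
  <app>
    <!-- Name as seen in the ps output above; verify against your client -->
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <!-- One task per GPU -->
      <gpu_usage>1.0</gpu_usage>
      <!-- Reserve a full CPU for each GPU task -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

With 75% of 8 cores volunteered, two such GPU tasks account for 2 of the 6 CPUs, which lines up with the 4 Rosetta tasks reported.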




©2024 University of Washington
https://www.bakerlab.org