Project scheduler has gone nuts (4.2 tasks)


Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,686
Message 102516 - Posted: 31 Aug 2021, 22:10:13 UTC

The project or BOINC has no idea what it's doing.
For 4.2 tasks I have 983!
I run 5 cores of my 16 dedicated to RAH.
I run only 16 hours a day and shut down at night.

I don't think I can even process the queue that is due tomorrow in time.
It looks like 30 or 40.
5 cores running 16 hours means 10 tasks for the day.
That leaves 20-30 that do not start by the deadline.
Also wonder how many degrader_site tasks will fail out of that batch.
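
For anyone checking the arithmetic above, here is a rough sketch. It assumes every task runs for the full 8 hour target from the web preference and that nothing fails early; the function names are made up for illustration.

```python
def tasks_per_day(cores: int, hours_per_day: float, task_runtime_hours: float) -> float:
    """Average number of tasks finished per day on the given cores."""
    # Core-hours available per day divided by core-hours needed per task.
    return cores * hours_per_day / task_runtime_hours


def left_over(queued: int, cores: int, hours_per_day: float,
              task_runtime_hours: float) -> float:
    """Tasks still unfinished after one day of crunching (never negative)."""
    done = tasks_per_day(cores, hours_per_day, task_runtime_hours)
    return max(0.0, queued - done)


if __name__ == "__main__":
    # Numbers from the post: 5 cores, 16 hours a day, 8 hour target run time.
    print(tasks_per_day(5, 16, 8))
    for queued in (30, 40):
        print(queued, left_over(queued, 5, 16, 8))
```

With a queue of 30 to 40 tasks this gives the 20 to 30 left over mentioned above.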

The rest of the queue is due 2 September, and that is all degrader stuff.
BTW...what is the run time on degrader?
I am set to run 8 hours.

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 102517 - Posted: 1 Sep 2021, 0:11:38 UTC - in response to Message 102516.  
Last modified: 1 Sep 2021, 0:15:02 UTC

I see that every once in a while. It has happened on several projects. I think it is a BOINC problem.
I saw it first on WCG a couple of years ago, and posted on it then, with no useful replies.

I doubt that anyone at the project will accept that it is something wrong on their end, even if you could find someone to communicate with.
I would just detach and try again later.

PS - Abort any remaining work units first so that the project knows that they are not going to be completed.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1497
Credit: 14,721,429
RAC: 16,104
Message 102520 - Posted: 1 Sep 2021, 6:10:39 UTC
Last modified: 1 Sep 2021, 6:17:18 UTC

Is the Remaining (estimated) completion time for non-started Tasks 8 hours?

I have vague memories about this occurring sometimes, but I'm pretty sure it was with an older BOINC version and was meant to have been fixed.


I doubt that anyone at the project will accept that it is something wrong on their end,
Because it's very likely not their problem, but a BOINC issue.
Hence it's best reported to the BOINC message board.


Edit: aha!
Here we go, from a previous occurrence:
Another possibility. Do you use an app_config file with a max concurrent statement? There was a bug with BOINC 7.16 where it would do as you described. It is supposedly fixed with 7.16.6, not sure if it made it into the Windows 7.16.5 version or not.
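
For reference, a max concurrent statement lives in an app_config.xml file in the project's folder under the BOINC data directory. The sketch below just generates a minimal one; the app name "rosetta" and the limit of 5 are assumptions for illustration, so check the actual app name on your own host before using anything like it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml text limiting one app to max_concurrent tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "rosetta" is an assumed app name, 5 is an example limit.
    print(build_app_config("rosetta", 5))
```

If a file like this exists for the project, that is the first thing to look at when the client fetches far more work than it can run.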


Previous "Downloaded way too many tasks at once?" thread.
Grant
Darwin NT

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,686
Message 102521 - Posted: 1 Sep 2021, 7:11:44 UTC - in response to Message 102520.  
Last modified: 1 Sep 2021, 7:17:46 UTC

Grant, since the web preference is set for 8 hours run time, I guess the tasks will run 8 hrs.
All new tasks show a run time of 8 hours.

I don't have app_config for RAH.

I'll have a look at that thread you linked to.

The project has aborted quite a few tasks lately because start deadlines were not met, so this annoys me.
The project or BOINC should know what I have allocated to RAH. 5 cores 16 hrs a day. No more no less.

I set the project to "no new tasks" to let this queue clear entirely, then I will start it up again.

"Repaired" BOINC with the downloaded software.

Maybe a max concurrent would solve this?

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1497
Credit: 14,721,429
RAC: 16,104
Message 102522 - Posted: 1 Sep 2021, 7:32:26 UTC - in response to Message 102521.  
Last modified: 1 Sep 2021, 7:50:19 UTC

The project or BOINC should know what I have allocated to RAH. 5 cores 16 hrs a day. No more no less.
The project knows nothing other than what the BOINC Manager tells it during the Scheduler contacts, when it returns work & updates the details for the work done for the project's applications.

The BOINC Manager is what asks for work, and the amount it asks for is based on system up time, time the system is able to do BOINC work while up, Resource share settings, Project deadlines and the system's resources.
If a system gets too much work, it's because it asked for it.
If it asked for it, then it's an issue that the BOINC developers need to figure out.



The project or BOINC should know what I have allocated to RAH. 5 cores 16 hrs a day. No more no less.
If you have allocated 5 cores, then you must have used max concurrent for Rosetta somewhere. That is the only way you can limit the number of cores/threads a particular project can use.
And as for the 16 hours a day for Rosetta, that is not how BOINC works.

You need to limit the amount of computing resources as a whole that BOINC uses due to FAH & your general work on the system so that single cores/threads aren't trying to do more than one thing at a time.
But imposing further hard fixed limits on what Resources are available to a particular project just makes meeting your Resource share settings that much more difficult to achieve.
BOINC doesn't allocate processing time to a project based on time (as such); it does so based on the work done for each project, which is tracked using REC (Recent Estimated Credit).

Which means Rosetta might get 24+ hours at some point, then nothing at all for the next 5 days, all depending on your Resource share settings & the other projects you are attached to & what work they do or don't have during that period.
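
To make the idea concrete, here is a toy model, not BOINC's actual code. It assumes recent work decays with a 10 day half-life and that the client favours whichever project has the biggest gap between its Resource share fraction and its fraction of recent work. All names and numbers are illustrative.

```python
from typing import Dict


def decayed(rec: float, days_elapsed: float, half_life_days: float = 10.0) -> float:
    """Exponentially decay a credit estimate so older work counts for less."""
    # half_life_days = 10 is an assumption made for this sketch.
    return rec * 0.5 ** (days_elapsed / half_life_days)


def priorities(shares: Dict[str, float], rec: Dict[str, float]) -> Dict[str, float]:
    """Resource share fraction minus recent-work fraction, per project."""
    total_share = sum(shares.values())
    total_rec = sum(rec.values())
    result = {}
    for name, share in shares.items():
        share_frac = share / total_share
        # With no recorded work at all, every project's work fraction is zero.
        rec_frac = rec.get(name, 0.0) / total_rec if total_rec > 0 else 0.0
        result[name] = share_frac - rec_frac
    return result


def pick_project(shares: Dict[str, float], rec: Dict[str, float]) -> str:
    """Project that is furthest behind its share and so should run next."""
    prio = priorities(shares, rec)
    return max(prio, key=prio.get)
```

In this model a project that has just done a lot of work drops down the list until the others catch up, whatever number of hours a day you had in mind for it.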



Maybe a max concurrent would solve this?
Using max concurrent is what has caused the problem in the past.
The more you limit what work can be done, the harder it is for the Scheduler to meet your Resource share settings. Adding in a bug that results in more work being downloaded than can possibly be done just makes things even worse.
Grant
Darwin NT

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 102525 - Posted: 1 Sep 2021, 20:41:49 UTC - in response to Message 102522.  

The project knows nothing other than what the BOINC Manager tells it during the Scheduler contacts, when it returns work & updates the details for the work done for the project's applications.

The BOINC Manager is what asks for work, and the amount it asks for is based on system up time, time the system is able to do BOINC work while up, Resource share settings, Project deadlines and the system's resources.

At least the resource share and project deadlines are sent from the project to BOINC. If those are not reported correctly, then BOINC will not ask for the right amount of work. I don't think it is quite so simple as you suggest.

However, it certainly started when BOINC did an update a few years ago and changed their scheduler in some undefined way (not that it was ever defined before.) I suggested on the WCG forum that the server and the new BOINC did not play well together. No one has volunteered to dig into it at either end that I know of.

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,686
Message 102527 - Posted: 1 Sep 2021, 21:59:38 UTC - in response to Message 102522.  

Ok..I had a fast look.
I do have a max_concurrent line in app_config.
Resource share is set at 125 because I am trying to recover from a 2 week dead spell when my cooked CPU gave out.

I suppose I should lower that? I didn't think it was related to the number of tasks downloaded, I thought it was more about CPU time and priority over the other projects.

I gave LHC 4 cores and that's all it gets from me because anything else crashes.
WCG is also 5 with RAH. But they flip with Amicable numbers (2 cores) and milkyway.

BOINC holds 1 core for GPU control.

I may not have listed everything, but it's 15 CPU for crunching (divided among the projects) and 1 CPU for GPU control.

Right now...4 LHC, 5 RAH, 2 AN, 4 WCG

Resource share 100% except LHC 200 (trying to get caught up in credits) and 125 RAH.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1497
Credit: 14,721,429
RAC: 16,104
Message 102528 - Posted: 2 Sep 2021, 6:38:40 UTC - in response to Message 102527.  

Ok..I had a fast look.
I do have a max_concurrent line in app_config.
So that may explain why you're getting all those Tasks: there is still a bug with BOINC & work fetch under some mysterious circumstances when max_concurrent is used.



I suppose I should lower that? I didn't think it was related to the number of tasks downloaded, I thought it was more about CPU time and priority over the other projects.
Your cache is what sets the limit to the number of Tasks you will get; however, the number of projects you do, your Resource share settings, and the number of compute resources you have (cores/threads & GPUs) will also determine how many Tasks for a given project you will have at any particular time.
The cache settings are meant to be a hard limit, but due to the use of max_concurrent and whatever it is that triggers the weird behaviour, the cache setting is being ignored.
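
To put numbers on how far past any sensible cache this is, a quick sketch. It uses the figures from earlier in the thread (983 tasks, 8 hour run time, 5 cores, 16 hours a day); the 1 day cache in the example is an assumption, not your actual setting.

```python
def days_of_work(queued_tasks: int, task_runtime_hours: float,
                 cores: int, hours_per_day: float) -> float:
    """Days needed to clear the queue at the stated crunching rate."""
    return queued_tasks * task_runtime_hours / (cores * hours_per_day)


def tasks_for_cache(cache_days: float, task_runtime_hours: float,
                    cores: int, hours_per_day: float) -> float:
    """Roughly how many tasks a cache of cache_days should hold."""
    return cache_days * cores * hours_per_day / task_runtime_hours


if __name__ == "__main__":
    print(days_of_work(983, 8, 5, 16))    # queue size reported in the first post
    print(tasks_for_cache(1.0, 8, 5, 16)) # hypothetical 1 day cache
```

983 tasks works out to roughly 98 days of work for 5 cores at 16 hours a day, which no cache setting should ever ask for.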


It's an issue for the BOINC developers to figure out & fix.
Grant
Darwin NT

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,686
Message 102530 - Posted: 2 Sep 2021, 7:06:20 UTC - in response to Message 102528.  

Ok..I had a fast look.
I do have a max_concurrent line in app_config.
So that may explain why you're getting all those Tasks: there is still a bug with BOINC & work fetch under some mysterious circumstances when max_concurrent is used.



I suppose I should lower that? I didn't think it was related to the number of tasks downloaded, I thought it was more about CPU time and priority over the other projects.
Your cache is what sets the limit to the number of Tasks you will get; however, the number of projects you do, your Resource share settings, and the number of compute resources you have (cores/threads & GPUs) will also determine how many Tasks for a given project you will have at any particular time.
The cache settings are meant to be a hard limit, but due to the use of max_concurrent and whatever it is that triggers the weird behaviour, the cache setting is being ignored.


It's an issue for the BOINC developers to figure out & fix.


I find a lot of weird things with the project settings and BOINC with the way I want to spread out things.
I've found in one project that, even though you uncheck a setting, you have to go to another page and change something there before it accepts the first setting.

As for BOINC, well, it seems that is a group that is overloaded and does not have enough time to make all the changes or explore all the bugs reported. I'll play around with things and see if I can get the balance I want. The other projects don't do this, so it's something specific to RAH. Must be a code thing. (shrug)

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1497
Credit: 14,721,429
RAC: 16,104
Message 102531 - Posted: 2 Sep 2021, 7:40:06 UTC - in response to Message 102530.  
Last modified: 2 Sep 2021, 7:40:39 UTC

The other projects don't do this, so it's something specific to RAH. Must be a code thing. (shrug)
The issue is with BOINC: it is requesting work when it shouldn't be. Once you've got several Tasks running, and maybe a few more for the cache settings, then it shouldn't ask for more work till you return some, and only if it still owes time to Rosetta relative to your other projects. There may be a tie-in with the Scheduler code running on Rosetta that makes it ignore the deadline/Resource share & cache setting limits. But it bases its decisions on what BOINC tells it.
Something in your settings concerning Rosetta, which is different to your settings for all the other projects, seems to find a way to unleash the bug.

It's definitely a code thing.
Grant
Darwin NT

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,686
Message 102648 - Posted: 17 Sep 2021, 16:36:36 UTC - in response to Message 102531.  

The other projects don't do this, so it's something specific to RAH. Must be a code thing. (shrug)
The issue is with BOINC: it is requesting work when it shouldn't be. Once you've got several Tasks running, and maybe a few more for the cache settings, then it shouldn't ask for more work till you return some, and only if it still owes time to Rosetta relative to your other projects. There may be a tie-in with the Scheduler code running on Rosetta that makes it ignore the deadline/Resource share & cache setting limits. But it bases its decisions on what BOINC tells it.
Something in your settings concerning Rosetta, which is different to your settings for all the other projects, seems to find a way to unleash the bug.

It's definitely a code thing.



It's done it again...880 now. Crazy. The server already force-aborted a bunch and there is still a huge batch more to go.

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,686
Message 102756 - Posted: 20 Sep 2021, 11:33:22 UTC

I am kicking back over 350 tasks from 4.20 that the scheduler gave me and I could not process.
So you guys will have some more tasks to work on.



©2024 University of Washington
https://www.bakerlab.org