Times for work units on new machine

Message boards : Number crunching : Times for work units on new machine


Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 61269 - Posted: 19 May 2009, 17:32:12 UTC

Added a new computer to my collection.

I have the computer set to my "Work" location. This location is set for Target CPU run time of 1 day; and a Maintain enough work for an additional setting of 3.5 days.

In BOINC Manager, the To completion time on all the pending work units is 03:44:15. A lot of work units were downloaded (more than it could finish by the initial deadline). Even after a project reset it seems to be repeating this behavior. Actual work done seems all over the map, but very few run just the small time shown for pending work units, and very few run a full day.

Can anybody spot something obvious that I'm doing wrong? All I want is a stable pipeline of full-day work units, not overbooking.

I did initially try and migrate Rosetta from the older machine that this one replaced, until I discovered I couldn't merge machines across a name change (which was required). Could that be contributing to the mess?

Thanks!
Hammeh

Joined: 11 Nov 08
Posts: 63
Credit: 211,283
RAC: 0
Message 61270 - Posted: 19 May 2009, 18:08:49 UTC

If this is a new PC and/or it has never run BOINC before, it will download enough work to run 24/7 until the manager works out all of its values; the same goes for the run time.
Just leave it for a week and it will work everything out itself. Until then I would recommend that you do not try to intervene and let BOINC fix itself.
JChojnacki

Joined: 17 Sep 05
Posts: 71
Credit: 9,942,378
RAC: 3,400
Message 61272 - Posted: 19 May 2009, 18:33:04 UTC
Last modified: 19 May 2009, 18:33:43 UTC

Congrats on the new machine. Nice little powerhouse of a machine. However, something to keep in mind is that, because of the power you are bringing to bear, you are going to see your runtimes go all over the place if you keep your runtime at 1 day.

A while back, the Rosetta team limited work units so they will only run 99 to 100 models per WU. If you look at the task details for your tasks, you’ll see you are hitting that limit on most of your work units.
Here is an example:
https://boinc.bakerlab.org/rosetta/result.php?resultid=249865051

So, even though you have a Target CPU run time of 1 day, if a WU hits that 99-100 model mark before the 24 hours are up, the WU will end.
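
The stopping rule described above can be sketched as a small calculation. This is only an illustration under stated assumptions: the function name, the per-model times, and the cap value of 99 are hypothetical inputs rather than anything taken from the Rosetta code, and a real task would finish its current model before stopping.

```python
def expected_runtime_hours(hours_per_model, target_hours=24.0, model_cap=99):
    """Hours a work unit runs before one of the two limits ends it.

    Simplified model: the task stops at the target CPU run time or when
    it has produced model_cap models, whichever comes first.
    """
    hours_to_reach_cap = hours_per_model * model_cap
    return min(target_hours, hours_to_reach_cap)


# Hypothetical per-model times on a fast machine.
for hours_per_model in (0.04, 0.1, 0.5):
    print(hours_per_model, expected_runtime_hours(hours_per_model))
```

On a fast machine the per-model time is small, so the model cap is reached well before a 24-hour target, which would be consistent with the scattered run times reported in this thread.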

Hope that helps some.

Joel

Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 61275 - Posted: 19 May 2009, 22:11:38 UTC

Hammeh:

Sorry, I was attempting to keep my post short, and left out information. Primary reason for the project reset was some early failures I saw when transitioning to this machine. I initially tried to run using 100% of the cores (all 4), with each core throttled to no more than 65% of the CPU time (i.e., BOINC was throttling the running work units).

While my foreground workstation performance was fine and the core temps were very stable (at least as far as SpeedFan was concerned), I kept finding "dead" Rosetta processes (work units sitting in memory but consuming no CPU, BOINC Manager showing them running but with no CPU time or progress). These failures (e.g., 249863482, 249863504, 249863499) reported as Compute Error although browsing through the log it seems more like some sort of resource contention bug. The corresponding Messages in BOINC Manager had wording along the lines of, "... If this keeps happening, a project reset may be necessary..."

I didn't really have time to research this board and find out if this is a known bug, so I selected No new tasks, waited until everything left on the queue wasn't going to finish by deadline, and Reset project. Before picking up more work I changed preferences to just use 50% of the cores (2) at 100% available CPU time. Without BOINC trying to throttle the work units I seem to be getting stable system behavior (i.e., no dead Rosetta processes).

I do understand that I need to let BOINC settle on Result duration correction factor (and whatever else goes into predicting and scheduling work). Just wanted to make sure I haven't created some unstable configuration that wouldn't tune, since the machine pulled a lot more work units than it could complete before deadline.

Thanks.
Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 61276 - Posted: 19 May 2009, 22:41:05 UTC

JChojnacki:

No, I either never knew about the model limit or had forgotten about it. That certainly explains the wide time variation.

I guess what worries me a bit is the machine's current set of tasks (downloaded after I reset). When I total them up, there are enough for 3.5 days of work for 2 x 100% of the cores if the To completion times average 3:44:15 (what BOINC Manager lists for all of them). Since many of the runs will be longer than this, it seems the machine could still be overbooked (have work it can't complete prior to deadline).

If it had only fetched enough work for 3.5 days of 1-day work units, then it might be underbooked (since some/many would finish "early" by hitting the 100 model limit), but that seems more efficient than missing deadlines (since the Rosetta servers won't have to dispatch the overdue WU to another machine).
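
The overbooking arithmetic in this post can be worked through in a few lines. This is a rough sketch, assuming the client simply fills the 3.5-day buffer using the 3:44:15 estimate; the real BOINC work-fetch logic is more involved, and the function names and the 12-hour and 24-hour actual run times are hypothetical.

```python
def hms_to_hours(hms):
    """Convert an 'HH:MM:SS' string to fractional hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


def tasks_fetched(buffer_days, cores, estimated_hours_per_task):
    """Whole tasks needed to fill the buffer if every task matched the estimate."""
    return int(buffer_days * 24 * cores // estimated_hours_per_task)


def days_to_clear(task_count, cores, actual_hours_per_task):
    """Days of wall-clock time to finish the queue at the real run time."""
    return task_count * actual_hours_per_task / cores / 24


estimate = hms_to_hours("03:44:15")      # about 3.74 hours
queued = tasks_fetched(3.5, 2, estimate)  # buffer of 3.5 days on 2 cores
print(queued)
print(days_to_clear(queued, 2, 12.0))    # assumed 12-hour actual run time
print(days_to_clear(queued, 2, 24.0))    # assumed 24-hour actual run time
```

Under these assumptions the queue holds 44 tasks, which would take 11 days to clear at 12 hours each and 22 days at a full 24 hours each, far more than the 3.5 days the buffer was meant to cover.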

Hammeh's response says BOINC will figure it all out. Would it settle faster if I turned Target CPU run time down so that most WUs end on CPU time, instead of number of models?

Obviously I haven't been paying enough attention to my Rosetta machines. The last time I was watching carefully, the target run time was consistently the limit for a job's duration.

Thanks!
LizzieBarry

Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 61277 - Posted: 19 May 2009, 22:53:50 UTC - in response to Message 61275.  

I do understand that I need to let BOINC settle on Result duration correction factor (and whatever else goes into predicting and scheduling work). Just wanted to make sure I haven't created some unstable configuration that wouldn't tune, since the machine pulled a lot more work units than it could complete before deadline.

Irrespective of how long you ask WUs to run (24hrs in your case) I believe the number that come down are based on the 3h 44m completion times that Boinc is currently showing. Until the DCF brings that up to nearer 24hrs (inevitably much less due to those WUs that complete after 100 decoys) you won't get the right number of WUs coming down.

Additionally, you could've taken the opportunity to upgrade Boinc itself - I think it's at 6.6.29 now. I've read that throttling and limiting CPUs wasn't handled very well until more recent versions, but can't talk about that in detail because I was never affected by it myself. Someone else will be along in a minute to talk about that aspect, no doubt.
Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 61280 - Posted: 20 May 2009, 4:10:30 UTC

Throttling is still not handled well... the recommendation is to leave it at 100% unless you absolutely must change it. My personal experience is that you can lose up to 40% of the tasks on Windows. It seems to affect the OS X version of BOINC less. No idea why.

Two problems are common: one is "Lock-file" and the other is "exited with no finish file", with the former more common ...
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 61289 - Posted: 20 May 2009, 13:30:05 UTC - in response to Message 61277.  
Last modified: 20 May 2009, 13:32:36 UTC

...the number that come down are based on the 3h 44m completion times that Boinc is currently showing. Until the DCF brings that up to nearer 24hrs (inevitably much less due to those WUs that complete after 100 decoys) you won't get the right number of WUs coming down.


Correct. The default runtime is 3hrs, and so new hosts are sort of set up initially with that assumption and BOINC has to adjust from there. It just has to complete a few tasks to understand that most run for 24hrs.
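
The adjustment described here can be pictured with a toy model. This is a sketch only: the smoothing weight and the update rule below are assumptions chosen for illustration and are not BOINC's actual duration correction factor algorithm; the 3-hour base estimate and 24-hour actual run time come from this thread.

```python
def corrected_estimate(base_hours, dcf):
    """Estimated run time after applying the correction factor."""
    return base_hours * dcf


def update_dcf(dcf, estimated_hours, actual_hours, weight=0.5):
    """Move dcf part of the way toward the value that would have been right.

    weight is a hypothetical smoothing parameter, not a BOINC setting.
    """
    observed = dcf * actual_hours / estimated_hours
    return dcf + weight * (observed - dcf)


base_hours = 3.0   # default run time the server assumes
dcf = 1.0          # a new host starts uncorrected
for completed in range(1, 6):
    estimate = corrected_estimate(base_hours, dcf)
    dcf = update_dcf(dcf, estimate, actual_hours=24.0)
    print(completed, round(corrected_estimate(base_hours, dcf), 2))
```

In this toy model the estimate climbs from 3 hours toward 24 hours over a handful of completed tasks, which is the behaviour the thread describes as letting BOINC settle.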

The 99 model limit was introduced in a recent edition of RosettaMini. It helps ensure more consistent file upload sizes but can result in tasks completing earlier than expected.
Rosetta Moderator: Mod.Sense
Sid Celery

Joined: 11 Feb 08
Posts: 1982
Credit: 38,465,751
RAC: 14,931
Message 61293 - Posted: 21 May 2009, 10:34:00 UTC - in response to Message 61275.  

Sorry, I was attempting to keep my post short, and left out information. Primary reason for the project reset was some early failures I saw when transitioning to this machine. I initially tried to run using 100% of the cores (all 4), with each core throttled to no more than 65% of the CPU time (i.e., BOINC was throttling the running work units).

While my foreground workstation performance was fine and the core temps were very stable (at least as far as SpeedFan was concerned), I kept finding "dead" Rosetta processes (work units sitting in memory but consuming no CPU, BOINC Manager showing them running but with no CPU time or progress). These failures (e.g., 249863482, 249863504, 249863499) reported as Compute Error although browsing through the log it seems more like some sort of resource contention bug. The corresponding Messages in BOINC Manager had wording along the lines of, "... If this keeps happening, a project reset may be necessary..."

I didn't really have time to research this board and find out if this is a known bug, so I selected No new tasks, waited until everything left on the queue wasn't going to finish by deadline, and Reset project. Before picking up more work I changed preferences to just use 50% of the cores (2) at 100% available CPU time. Without BOINC trying to throttle the work units I seem to be getting stable system behavior (i.e., no dead Rosetta processes).

When I first got my quad core I throttled down to 50% for fear of performance issues and while I didn't use a long run time I did get a lot of compute errors, the same as you. It was partially solved by running 4 cores at 100% but not completely solved until I upgraded BOINC itself past (I think) 6.4.5.

Thing is, when I stopped throttling I also found there was no performance hit at all. The priority the jobs run at is so low, anything I was working on went ahead of the Boinc jobs. Nothing to worry about at all as it turned out. Just go for it is my advice.

I guess what worries me a bit is the machine's current set of tasks (downloaded after I reset). When I total them up, there are enough for 3.5 days of work for 2 x 100% of the cores if the To completion times average 3:44:15 (what BOINC Manager lists for all of them). Since many of the runs will be longer than this, it seems the machine could still be overbooked (have work it can't complete prior to deadline).

If it had only fetched enough work for 3.5 days of 1-day work units, then it might be underbooked (since some/many would finish "early" by hitting the 100 model limit), but that seems more efficient than missing deadlines (since the Rosetta servers won't have to dispatch the overdue WU to another machine).

Hammeh's response says BOINC will figure it all out. Would it settle faster if I turned Target CPU run time down so that most WUs end on CPU time, instead of number of models?

Obviously I haven't been paying enough attention to my Rosetta machines. The last time I was watching carefully, the target run time was consistently the limit for a job's duration.

Just to get all your jobs to complete before deadline it looks like you need to reduce Target Run Time to about 12 hours. Select No New tasks until they're all completed, then update to the latest Boinc (the # of days work feature is wrong in early versions too, so it's really worth upgrading to sort this out as well as to get rid of the compute error problem). By the time these jobs are completed your DCF ought to be much more than 4-ish hours as well, then grab just a couple of days work until you're sure you're getting what you expect.

I don't think you're doing anything especially wrong, you're just a victim of circumstances at this precise moment. Hope that helps.
Chilean

Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 61303 - Posted: 21 May 2009, 17:30:53 UTC - in response to Message 61275.  

Sorry, I was attempting to keep my post short, and left out information. Primary reason for the project reset was some early failures I saw when transitioning to this machine. I initially tried to run using 100% of the cores (all 4), with each core throttled to no more than 65% of the CPU time (i.e., BOINC was throttling the running work units).

While my foreground workstation performance was fine and the core temps were very stable (at least as far as SpeedFan was concerned), I kept finding "dead" Rosetta processes (work units sitting in memory but consuming no CPU, BOINC Manager showing them running but with no CPU time or progress). These failures (e.g., 249863482, 249863504, 249863499) reported as Compute Error although browsing through the log it seems more like some sort of resource contention bug. The corresponding Messages in BOINC Manager had wording along the lines of, "... If this keeps happening, a project reset may be necessary..."

I didn't really have time to research this board and find out if this is a known bug, so I selected No new tasks, waited until everything left on the queue wasn't going to finish by deadline, and Reset project. Before picking up more work I changed preferences to just use 50% of the cores (2) at 100% available CPU time. Without BOINC trying to throttle the work units I seem to be getting stable system behavior (i.e., no dead Rosetta processes).

I do understand that I need to let BOINC settle on Result duration correction factor (and whatever else goes into predicting and scheduling work). Just wanted to make sure I haven't created some unstable configuration that wouldn't tune, since the machine pulled a lot more work units than it could complete before deadline.

Thanks.


Disable the throttling. BOINC jobs run with very low priority, so anything that needs CPU (such as Photoshop, opening Firefox, listening to music, opening a messenger client...) will have a higher priority and will "steal" CPU from BOINC jobs instantly, making the performance hit negligible, if there is one at all.




©2024 University of Washington
https://www.bakerlab.org