High priority jobs getting out of control

Message boards : Number crunching : High priority jobs getting out of control


Plasmon_attack

Joined: 2 May 10
Posts: 13
Credit: 15,451,384
RAC: 0
Message 70981 - Posted: 9 Aug 2011, 0:47:16 UTC

I've seen this behavior several times and am wondering what causes it. Everything will run smoothly for a while, but then out of nowhere a large number of WUs will start running at high priority. Weirdly, these jobs are usually not due for some time (in the current case at least a week) and are crunched before jobs with earlier due dates (in this case jobs due in 3 days). Which jobs have high priority can also change while a job is running. At the moment one of my machines has 110 partially completed work units (anywhere from 1% to 90% complete) that it has paused to run these high priority jobs. Here is an example for those who dig into the database:

Job:
2p9h_lac_sum_rest_LigDes_SAVE-ALL_OUT_29943_319_0 is hung up at 70% complete and due 8/11/2011 at 9:28 PM

Right above it is:
T0393_3d1l.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_29956_32531_0 which is now 85% complete and not due until 8/15/2011, 10:40 am.

This starts to kill RAC because jobs aren't completing and being uploaded. Often I can fix this by turning off network communication and letting it run down ALL the work before the first WUs expire and roll it up in one massive upload, or click-and-pause all the unstarted WUs and make it finish the partial ones before it can move on. Both are tedious.

Any idea what's going on here?
Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 70983 - Posted: 9 Aug 2011, 2:43:44 UTC - in response to Message 70981.  
Last modified: 9 Aug 2011, 2:48:12 UTC

Boinc Mgr thinks that these tasks will not complete by the deadline due to the additional tasks from other projects so it runs them first.

But you mention you have 110 jobs!?! You might consider reducing your buffer size (i.e. extra days of work) to a more manageable level. I keep between 3-5 days of work on my system and hardly ever run into high priority tasks and I even shut my system down for 8hrs while I am at work to conserve electricity.

So try reducing your additional days buffer and see what happens. I bet you will see an improvement if you do that.

Also Mod made a comment some time back about this: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5732&nowrap=true#70481

Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,798,775
RAC: 751
Message 70986 - Posted: 9 Aug 2011, 9:03:07 UTC

How much memory do you have for how many cores? What preferences have you set for memory use? Any messages from the messages tab (or log file, depending on your version of boinc) that say why computation has stopped on a particular workunit?

Some of the recent workunits have required quite a bit of memory and if they are bumping up against memory limits that could explain why you have so many sitting in a partially completed state. Do you have "Leave applications in memory while suspended" checked?


Best,
Snags
Plasmon_attack

Joined: 2 May 10
Posts: 13
Credit: 15,451,384
RAC: 0
Message 70991 - Posted: 9 Aug 2011, 17:04:46 UTC - in response to Message 70986.  

Thanks all...I don't think it's a memory issue. The buffer is set to about 5 days (which is pretty reasonable given the length of recent outages)....it's got 16 cores and 16 GB of RAM and isn't running ANY other projects in addition to Rosetta. The 'time to completion' for the high priority tasks is about the same as the ones that have paused. Rosetta has full access to the memory and it's only showing about 8-10 GB used at any time, and with my apps open we don't hit the limit. All 16 cores crunch pretty much all the time, just once in a while these get out of whack. If I stop network communications for a little bit it seems to even out. I've seen this happen on other machines (like a Mac mini) where things are a bit less extreme. There are no error messages around this, just, "Pausing task, xx task is high priority" so I figured this might be coming from the server somehow.
Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 70995 - Posted: 9 Aug 2011, 19:14:25 UTC

Server just dumps work on you and takes back your results.
The boinc program controls the high priority stuff; for some reason it thinks it will not get the work done on time.

What is your run time set to? Perhaps you should shorten that an hour or two and see if that clears up the problem. Also try reducing your queue to 4 days and see if that helps. One or both should take care of the problem I think.
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,798,775
RAC: 751
Message 70996 - Posted: 9 Aug 2011, 19:25:04 UTC

I think I've got it.

With a 5 day buffer on a fast 16 core machine I bet you usually have quite a few workunits ready to start. And I'll bet you got a bunch of those "flxdsgn" workunits that were running for only 15-60 minutes or so. With each one that you completed, boinc would have reduced the estimated time to completion of every other rosetta workunit on your computer. And then boinc would have asked for more workunits from rosetta to refill your cache. Boinc has no way of knowing the short run time of the flxdsgn workunits is an anomaly, so it would have collected more work based on its new, shorter estimates, resulting in more total workunits. But when the next workunit of a different type ran for the normal run time (default 3 hours cpu time), boinc would have immediately increased the estimated run time for all the other workunits and suddenly found itself struggling to meet the deadlines, thus putting everything in high priority.

Alternatively, instead of having a bunch of shorties you had a long-running model that pushed a task well (up to 4 hours) beyond the usual run time causing boinc to increase the estimates for all the tasks currently in your queue leading boinc to think they were now in deadline trouble.
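
A rough way to see the first scenario in numbers is the sketch below. It is a toy model only, not BOINC's actual work-fetch code: the two function names, the simple "cores times buffer divided by estimate" formula, and the 1 hour shortened estimate are all assumptions made for illustration. The 16 cores, 5 day buffer and 3 hour default run time are the figures mentioned in this thread.

```python
def tasks_to_fill_buffer(cores, buffer_days, est_hours_per_task):
    """Toy estimate of how many tasks keep every core busy for the buffer period."""
    return int(cores * buffer_days * 24 / est_hours_per_task)


def days_to_clear_queue(num_tasks, cores, actual_hours_per_task):
    """Days needed to crunch the whole queue if every task takes the actual run time."""
    return num_tasks * actual_hours_per_task / cores / 24


CORES = 16          # from the thread
BUFFER_DAYS = 5     # from the thread
NORMAL_HOURS = 3.0  # default run time mentioned in the thread
SHORT_HOURS = 1.0   # hypothetical estimate after a run of short workunits

# Cache size when the estimate matches the normal run time.
normal_queue = tasks_to_fill_buffer(CORES, BUFFER_DAYS, NORMAL_HOURS)

# Cache size when the estimate has been pulled down by short workunits.
inflated_queue = tasks_to_fill_buffer(CORES, BUFFER_DAYS, SHORT_HOURS)

# Real time needed once tasks go back to the normal run time.
days_needed = days_to_clear_queue(inflated_queue, CORES, NORMAL_HOURS)

print(normal_queue, inflated_queue, days_needed)
```

Under these assumptions a cache filled on the shortened estimate holds about three times the work the buffer was meant to cover, which is the kind of jump that could make deadlines suddenly look tight.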


Your computers are hidden so I can't see your task list to confirm my theory. But I feel pretty good about it : )


Oh, and one more thing. If you stop network communications this reduces the average amount of time boinc is connected to the internet which reduces the time window boinc thinks it has for returning completed workunits. In effect it shortens the deadline and since boinc already thinks it's in deadline trouble this probably isn't helping. The good news is that boinc won't download any more tasks from rosetta as long as rosetta tasks are running in high priority mode. You could abort a few tasks to ease the pressure but it's not necessary. I myself wouldn't abort any unless I was certain that deadlines were going to be missed otherwise.


Best,
Snags
Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 70997 - Posted: 10 Aug 2011, 12:06:47 UTC

Besides what Snags says, I thought the deadline was a guideline, but not an absolute firm deadline for results. I thought that even if you went past the deadline by a day or something, it was no big deal. You still get the credit.
Your results will still be considered. Is that correct Snags?
Plasmon_attack

Joined: 2 May 10
Posts: 13
Credit: 15,451,384
RAC: 0
Message 71009 - Posted: 10 Aug 2011, 22:40:05 UTC

Thanks all, interesting discussion. For the questions about settings I don't know what you mean by 'default run time' and how to raise or lower. This computer is on 24/7, has unrestricted access to the cores, and is rarely paused. It's the only project I run so there's no switching between apps. The idea about the estimated run times being off seems plausible. I did see a lot of the short workunits go by, and then I had a batch that was taking more like 6-7 hours to complete for a while, so I could see the estimator getting confused. It's just weird that it high prioritizes work units due LATER than the earlier ones. I would be worried about not finishing work units due on the 11th and prioritize them more than ones due on the 15th.

I'll just leave the network off until it stabilizes. It won't be much longer now. The queue might've been 5 days but I've found that it usually finishes them 2-3 days earlier than expected. I have a computer at home set to 10 days because when there is a workunit shortage it's usually out in 3-4.
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,798,775
RAC: 751
Message 71013 - Posted: 10 Aug 2011, 23:16:19 UTC - in response to Message 70997.  

Besides what Snags says, I thought the deadline was a guideline, but not an absolute firm deadline for results. I thought that even if you went past the deadline by a day or something, it was no big deal. You still get the credit.
Your results will still be considered. Is that correct Snags?


I don't know. I speculated in another thread that resends don't produce the exact same models as the initial copy of the workunit but that may just be crazytalk. Even if it's not true the late copy could still have produced more models than its replacement and those models would be unique. But I imagine that would be a pretty small number of models and it might not be worth the trouble to extract them (assuming my mad speculations are wrong and the first models produced by both workunits are the same).

At some point late workunits would be too late to include in the analysis but whether too late means two days or two months I have no idea.

Boinc standard procedure is that if the late workunit is returned before the resend, it will get credit; otherwise you're out of luck. I don't know if rosetta's credit granting script is applied in this situation. If Plasmon_attack has any workunits that fit that description (returned late and after the replacement copy has been returned) then he should check the task details page for that workunit about a day after he returns it. And let us know 'cause we (at least I) are (am) curious.
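
Written out as a small check, the rule as I understand it looks like the sketch below. This is only my reading of the standard behaviour, not something taken from the server code, and the function name and the example return times are made up for illustration. The deadline is the one quoted earlier in the thread.

```python
from datetime import datetime


def late_result_gets_credit(returned_at, deadline, resend_returned_at=None):
    """Return True if a result would get credit under the rule described above."""
    if returned_at <= deadline:
        # Returned on time: nothing special to decide.
        return True
    if resend_returned_at is None:
        # Late, but the replacement copy has not come back yet.
        return True
    # Late: credit only if it beat the replacement copy.
    return returned_at < resend_returned_at


deadline = datetime(2011, 8, 11, 21, 28)   # deadline quoted in the thread
late_return = datetime(2011, 8, 12, 9, 0)  # hypothetical late return time

# Hypothetical case: the resend came back before the late original.
beaten_by_resend = late_result_gets_credit(
    late_return, deadline, resend_returned_at=datetime(2011, 8, 12, 6, 0)
)

# Hypothetical case: the resend came back after the late original.
ahead_of_resend = late_result_gets_credit(
    late_return, deadline, resend_returned_at=datetime(2011, 8, 13, 0, 0)
)

print(beaten_by_resend, ahead_of_resend)
```

Whether rosetta's own credit script follows the same rule is exactly the open question.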


Best,
Snags
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 71015 - Posted: 11 Aug 2011, 0:11:32 UTC

I believe the OP is saying that BOINC has started all of the 100+ tasks he's talking about, and then not finished them.

Two things here... first is that there were older versions of BOINC Manager where it would scream "high priority" rather prematurely. It would fear deadline trouble and then not really have any. So, that flag can more or less be ignored unless you are having trouble actually missing deadlines.

The second thing, as was pointed out already by others is that some of the current tasks are consuming large amounts of memory and depending upon how much memory you've configured BOINC to live within, it can hit points in execution where memory usage grows to the point that BOINC feels it is crossing the line you've set in your preferences. It already knows the last task it suspended uses just as much memory and so it begins a new one in hopes that it will use less... and for a while it does use less and runs. Eventually, as the normal processing progresses, more memory is used and eventually BOINC starts feeling fat again.

So it will eventually come back to the ones it seems to be leaving behind. It may run on fewer than all allowed cores for a while to stay within the memory preference you have.
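
As a very simplified picture of the memory effect described above, consider the sketch below. It is a toy model and not how the BOINC client is actually implemented: the function name, the greedy queue-order selection, the task names and the memory figures are all assumptions chosen for illustration.

```python
def pick_runnable(tasks, mem_limit_mb, cores):
    """Greedily pick tasks in queue order while staying within the memory limit.

    tasks is a list of (name, current_memory_mb) pairs. Tasks that do not fit
    are skipped, which in this toy model means they stay suspended.
    """
    running = []
    used_mb = 0
    for name, mem_mb in tasks:
        if len(running) == cores:
            break  # every allowed core already has a task
        if used_mb + mem_mb <= mem_limit_mb:
            running.append(name)
            used_mb += mem_mb
    return running, used_mb


# Hypothetical queue: three tasks that have grown large, one fresh small task.
queue = [
    ("task_A", 1500),
    ("task_B", 1500),
    ("task_C", 300),
    ("task_D", 1500),
]

running, used_mb = pick_runnable(queue, mem_limit_mb=3500, cores=4)
print(running, used_mb)
```

Here task_D is left waiting even though a core is free, because starting it would cross the configured limit. As the small task grows, the same squeeze can repeat, and that would leave a trail of partially completed tasks.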

Again, try not to worry about it. Double check that you "leave tasks in memory when suspended", so you avoid losing any of the progress you've made on the tasks... except that this means in "VIRTUAL" memory, not RAM, and so your swap file may grow.

...it will work itself out. A higher memory setting may help BOINC run with less limitation. A newer BOINC version may help you get more meaningful messages and statuses on work in progress, and hopefully reasonable work-fetch for your preferences as well (although doesn't sound like that's been a problem).
Rosetta Moderator: Mod.Sense
Plasmon_attack

Joined: 2 May 10
Posts: 13
Credit: 15,451,384
RAC: 0
Message 71045 - Posted: 13 Aug 2011, 8:12:36 UTC - in response to Message 71015.  

Thanks all...Rosetta is allowed access to all the memory, it's left in memory while suspended, etc. I've seen it get hung up before with memory limits so I've opened the bore wide and the system has plenty of overhead. I am noticing a mix of 3 hour and 7 hour tasks going by so I think it's getting confused about how long things are going to take. The work units it's jumping to are the longer-running units, and it's skipping over the shorter units to do it. Also running on the most recent version of Boinc. I think it's just the switching around. Looks like it'll miss the deadline on a few so we'll be able to see what happens if they're returned slightly late.




©2024 University of Washington
https://www.bakerlab.org