Three very long tasks running

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,686,153
RAC: 7,085
Message 98077 - Posted: 14 Jul 2020, 19:14:21 UTC

I just noticed 2 of my computers have long running Rosetta tasks, is this normal? They normally limit to 8 hours, sometimes 10.

This machine https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3792849 has two long tasks, one at 22 hours 14 minutes (CPU time, not wall time) https://boinc.bakerlab.org/rosetta/result.php?resultid=1220450326, and one at 14 hours 50 minutes https://boinc.bakerlab.org/rosetta/result.php?resultid=1220685579.

This machine https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=4360598 has one long task, at 15 hours 40 minutes https://boinc.bakerlab.org/rosetta/result.php?resultid=1220618443.

Should I abort them? Have they broken or are they meant to run that long?

I notice they're all rgmjp tasks.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1481
Credit: 14,573,125
RAC: 14,201
Message 98082 - Posted: 14 Jul 2020, 19:46:25 UTC - in response to Message 98077.  

Mr P Hucker wrote:
I just noticed 2 of my computers have long running Rosetta tasks, is this normal?

It can be.
Sometimes it can take a long time for a model to complete, that's why they have the watchdog timer which is 10 hours.

So if a Task runs for 11 hours longer than its Target CPU time, then it is probably worth aborting. But until it's at least 10.5 hours over the Target CPU time (and that is CPU time, not Runtime, which can be way, way longer, particularly if a system is busy doing other things as well, or people have "Use at most 100 % of CPU time" set to anything less than 100%) i would just let it be.
Grant
Darwin NT
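
The rule of thumb in the post above can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the post: an 8 hour Target CPU time, a 10 hour watchdog that counts CPU time, and roughly one extra hour of slack before giving up. The function name and its parameters are made up for this example and are not part of BOINC or Rosetta.

```python
def worth_aborting(cpu_hours, target_hours=8.0, watchdog_hours=10.0, margin_hours=1.0):
    """Return True once a task's CPU time is past target + watchdog + margin.

    The defaults are assumptions taken from the post above, not confirmed
    project settings: 8 h target, 10 h watchdog, 1 h of extra slack.
    """
    hours_over_target = cpu_hours - target_hours
    return hours_over_target >= watchdog_hours + margin_hours


# CPU times reported in the first post, converted to hours.
reported = [("22h14m", 22 + 14 / 60), ("14h50m", 14 + 50 / 60), ("15h40m", 15 + 40 / 60)]
for label, cpu_hours in reported:
    print(label, worth_aborting(cpu_hours))
```

With these assumed numbers only the 22 hour task crosses the line. It is a rule of thumb, not a verdict on whether the task is actually stuck.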

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,686,153
RAC: 7,085
Message 98084 - Posted: 14 Jul 2020, 20:40:15 UTC - in response to Message 98082.  

So as I'm on the defaults, is that target 8 hours for whole work unit, + 10 hours maximum per model that it started just before 8 hours = 19 hours? One of them is now at 23 hours 17 minutes.

I'm not sure how this thing works. How many models are usually run in one work unit? I ask because it's almost always very close to 8 hours they finish at, which would indicate they run a large number of short models, or there would be a wider variance of run times.

Also, can I check somehow with the running task what it's currently doing?

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1481
Credit: 14,573,125
RAC: 14,201
Message 98086 - Posted: 14 Jul 2020, 21:04:20 UTC - in response to Message 98084.  
Last modified: 14 Jul 2020, 21:05:23 UTC

Mr P Hucker wrote:
So as I'm on the defaults, is that target 8 hours for whole work unit, + 10 hours maximum per model that it started just before 8 hours = 19 hours? One of them is now at 23 hours 17 minutes.

It is Target CPU time (as Runtime) + 10 hours (but i don't know if that is CPU time or Runtime).

However, as is the case with one of your systems- if a system is busy doing other things, an 8 hour Task can take 10 hours to process (i've seen systems where it can take 24hrs to do 8 hours worth of work). The 10 hour watchdog timer i'm not so sure about- if it is 10 Hours Runtime, or 10 hours CPU time (maybe Modsense can fill us in?).

If it were 10 hours Runtime, your 8 hour Target CPU time Tasks would end after 20 hours (because it takes 10 hours Runtime to do the 8 hours of CPU work, plus the extra 10 hours). If it is 10 hours CPU time, then i would expect it to take around 23 hours (once again, because it takes 10 hours to do 8 hours of CPU work, and roughly 12.5hrs to do the extra 10 hours of CPU work).
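
The arithmetic in the paragraph above can be written out as a small Python sketch. It assumes the figures used there: an 8 hour Target CPU time, a 10 hour watchdog, and a machine that needs 10 hours of Runtime per 8 hours of CPU work (a CPU share of 0.8). The function and parameter names are hypothetical.

```python
def expected_runtime_hours(target_cpu_hours, watchdog_hours, cpu_share, watchdog_counts_cpu_time):
    """Estimate wall-clock Runtime at which the watchdog would end a task.

    cpu_share is CPU time divided by Runtime (0.8 in the example above).
    Whether the watchdog counts CPU time or Runtime is unknown in this
    thread, so both cases are modelled.
    """
    base_runtime = target_cpu_hours / cpu_share
    if watchdog_counts_cpu_time:
        extra_runtime = watchdog_hours / cpu_share
    else:
        extra_runtime = watchdog_hours
    return base_runtime + extra_runtime


cpu_share = 8 / 10  # 8 hours of CPU work takes 10 hours of Runtime
for counts_cpu in (False, True):
    hours = expected_runtime_hours(8, 10, cpu_share, counts_cpu)
    print(f"watchdog counts CPU time: {counts_cpu} -> about {hours:.1f} hours Runtime")
```

The two cases come out at about 20 hours and about 22.5 hours, which matches the "20 hours" and "around 23 hours" figures in the post.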


If it's still going after 26 hours i'd say it's well and truly gone beyond its extended cut off time.
As long as its progress keeps increasing towards 100%, then it's probably still doing useful work. If it's no longer increasing (and/or the Estimated time keeps growing) then it's probably not actually doing anything useful.
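
One way to apply the "is progress still increasing" test is to note the percentage a few times and compare the readings. The Python sketch below is only an illustration: the function name, the minimum-gain threshold and the sample readings are all made up for the example, and the readings have to be collected by hand or by whatever tool you use to watch the task.

```python
def is_still_progressing(percent_samples, min_gain=0.0005):
    """Return True if progress rose by at least min_gain percentage points.

    percent_samples is a list of "percent done" readings in the order they
    were taken. With fewer than two readings there is nothing to compare,
    so the task gets the benefit of the doubt.
    """
    if len(percent_samples) < 2:
        return True
    return percent_samples[-1] - percent_samples[0] >= min_gain


# Illustrative readings only, not taken from a real task.
print(is_still_progressing([99.026, 99.027, 99.028]))  # slowly trickling forwards
print(is_still_progressing([99.028, 99.028, 99.028]))  # no movement at all
```

How far apart the readings should be is a judgment call; a task that gains 0.001% every so often can look stalled if the samples are only a few minutes apart.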


Mr P Hucker wrote:
Also, can I check somehow with the running task what it's currently doing?

In the BOINC Manager, Advanced view, Tasks tab, select the Task in question, then on the right in the command list, select Properties.
Grant
Darwin NT

Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,017,068
RAC: 435
Message 98088 - Posted: 14 Jul 2020, 21:19:58 UTC - in response to Message 98086.  

Are you sure it's 10 hours?
I could have sworn it was 4 hours.

I posted this a while back and Mod.Sense seemed to agree.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,686,153
RAC: 7,085
Message 98089 - Posted: 14 Jul 2020, 21:36:57 UTC - in response to Message 98086.  

Grant (SSSF) wrote:
It is Target CPU time (as Runtime) + 10 hours (but i don't know if that is CPU time or Runtime).

However, as is the case with one of your systems- if a system is busy doing other things, an 8 hour Task can take 10 hours to process (i've seen systems where it can take 24hrs to do 8 hours worth of work). The 10 hour watchdog timer i'm not so sure about- if it is 10 Hours Runtime, or 10 hours CPU time (maybe Modsense can fill us in?).


The case with one of my systems? What do you mean? Both the systems in question have a CPU time close to the runtime. The tasks are 24 hours vs 25.5 hours, 17 hours, vs 18.5 hours, 16.5 hours vs 18 hours.

Grant (SSSF) wrote:
If it were 10 hours Runtime, your 8 hour Target CPU time Tasks would end after 20 hours (because it takes 10 hours Runtime to do the 8 hours of CPU work, plus the extra 10 hours). If it is 10 hours CPU time, then i would expect it to take around 23 hours (once again, because it takes 10 hours to do 8 hours of CPU work, and roughly 12.5hrs to do the extra 10 hours of CPU work).


I'm only looking at CPU time. Does the Boinc manager even show this? I'm using Boinctasks, which puts the CPU time in brackets next to the runtime. This is handy for CPU tasks to see if it's getting the whole core, it's handy for multi-core tasks to see how many cores it's actually making use of, and it's handy for GPU tasks to see if the CPU is slowing the GPU down.

Grant (SSSF) wrote:
If it's still going after 26 hours i'd say it's well and truly gone beyond its extended cut off time.
As long as its progress keeps increasing towards 100%, then it's probably still doing useful work. If it's no longer increasing (and/or the Estimated time keeps growing) then it's probably not actually doing anything useful.


There's too many unknowns here. I'll just watch them and if either the progress stops (it's trickling forwards at the moment, they're at 99.308%, 99.033%, and 98.989%) or the deadline is exceeded, then I'll cancel them. I've got 66 cores altogether, 3 stuck isn't the end of the world.

Grant (SSSF) wrote:
In the BOINC Manager, Advanced view, Tasks tab, select the Task in question, Then on the right in the command list, select Properties.


Do you not know your right from your left? Or can that be swapped over? Maybe in other countries it matches the driving side! Oh, my mistake, it's your right and my left, you're the other side of the screen.

And that doesn't give me any information at all. I wanted to know what model it was running, when it last changed model, etc. I've seen that in LHC, but they're running in a Linux virtual box so you can actually see the program running and putting up some information as it progresses.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1481
Credit: 14,573,125
RAC: 14,201
Message 98090 - Posted: 14 Jul 2020, 21:43:29 UTC - in response to Message 98089.  

Grant (SSSF) wrote:
In the BOINC Manager, Advanced view, Tasks tab, select the Task in question, Then on the right in the command list, select Properties.
Mr P Hucker wrote:
Do you not know your right from your left?

Yep, but what i'm thinking and what i'm typing aren't always the same thing.


Mr P Hucker wrote:
The case with one of my systems? What do you mean?

https://boinc.bakerlab.org/rosetta/result.php?resultid=1220380852
Run time 10 hours  4 min 10 sec
CPU time  7 hours 57 min 55 sec

Just over 10 hours to do just under 8 hours of work.
Grant
Darwin NT
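
The comparison in the post above (Run time against CPU time) can be reproduced with a short Python sketch. It parses the two durations exactly as quoted and prints the share of wall-clock time the task actually spent on the CPU. The helper name is hypothetical, and the parsing only covers the "N hours N min N sec" wording seen in this thread.

```python
import re

# Seconds per unit for the duration wording quoted in this thread.
UNIT_SECONDS = {"day": 86400, "hour": 3600, "min": 60, "sec": 1}


def parse_duration(text):
    """Convert a string such as '10 hours  4 min 10 sec' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(day|hour|min|sec)", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


run_seconds = parse_duration("10 hours  4 min 10 sec")
cpu_seconds = parse_duration("7 hours 57 min 55 sec")
cpu_share = cpu_seconds / run_seconds
print(f"CPU time is {cpu_share:.0%} of Run time")
```

A share well below 100% means the task is being throttled or is competing with other work, so Run time alone overstates how much crunching has been done.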

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,686,153
RAC: 7,085
Message 98091 - Posted: 14 Jul 2020, 22:10:17 UTC - in response to Message 98090.  
Last modified: 14 Jul 2020, 22:12:12 UTC

Grant (SSSF) wrote:
Yep, but what i'm thinking and what i'm typing aren't always the same thing.


My mother doesn't know which is which. She'll say it's in the cupboard on the right. If I take a long time she shouts "the other right!"

I don't know if this is unusual, but I think aloud (well not really aloud, but verbally in my head). So I work out what I'm going to type, it's translated into the sounds of the words, then typed. Hence I always type the wrong their/there/they're and have to check on proofreading.

Grant (SSSF) wrote:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1220380852
Run time 10 hours  4 min 10 sec
CPU time  7 hours 57 min 55 sec

Just over 10 hours to do just under 8 hours of work.


Not much difference when we're deciding when to cancel things, as I'd leave it a bit longer anyway.

The reason for that is I use Tthrottle to stop things overheating. No matter how big the fan, things still get too hot, and that's in Scotland!

And the example you quoted is more extreme, because that's the machine I use, which means I want SILENCE! All the fans are limited to 50%, then it throttles after that.

Sid Celery
Joined: 11 Feb 08
Posts: 1981
Credit: 38,413,287
RAC: 12,777
Message 98092 - Posted: 14 Jul 2020, 22:31:47 UTC - in response to Message 98088.  

Falconet wrote:
Are you sure it's 10 hours?
I could have sworn it was 4 hours.

I posted this a while back and Mod.Sense seemed to agree.

It was 4hrs.
Then, in April, they created some tasks that used stacks of RAM and needed a very long time to complete the first decoy, so they upped the watchdog to 10hrs. Then they stopped the high-RAM tasks to approach things a different way. I asked if the 10hrs setting was still appropriate and was told it was. Then they did some work on allowing more frequent checkpoints, which seemed to solve task over-runs, and now the over-runs seem to have come back in a different form.

So, your guess is as good as mine.

tl;dr - no-one knows

Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,017,068
RAC: 435
Message 98101 - Posted: 15 Jul 2020, 10:33:11 UTC - in response to Message 98092.  

Thanks for the reply!

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,686,153
RAC: 7,085
Message 98102 - Posted: 15 Jul 2020, 11:00:32 UTC - in response to Message 98077.  
Last modified: 15 Jul 2020, 11:10:58 UTC

ARGH!!! F***ing windows update! I want to physically strangle the absolute moron at Microsoft who does this. The 1st machine above rebooted without my permission overnight for yet another bug fix to sloppy Windows 10 coding, and one of the tasks has gone back to the beginning: https://boinc.bakerlab.org/rosetta/result.php?resultid=1220685579 It's now showing 7 hours 36 minutes CPU time. Strangely the other one https://boinc.bakerlab.org/rosetta/result.php?resultid=1220450326 is still going, and is now at 1 day 9 hours 40 minutes. And so is the one on the other machine, which also rebooted, but I had to go press F1 because of Dell's moronic moaning about one of my RAM chips being suboptimal.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,686,153
RAC: 7,085
Message 98107 - Posted: 15 Jul 2020, 17:34:55 UTC - in response to Message 98102.  
Last modified: 15 Jul 2020, 17:35:58 UTC

While I was out swimming, the one that had restarted completed in 11.8 hours CPU time, which is quicker than it was showing before the reboot. The other two are still plodding away and slowly increasing the percentage done (99.570% and 99.364%). Curiouser and curiouser.

CIA
Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 98120 - Posted: 16 Jul 2020, 0:29:11 UTC - in response to Message 98107.  
Last modified: 16 Jul 2020, 0:30:32 UTC

All but one of my machines are set to 24 hour task times. Had one run a task for 32 hours, and it only completed 2 decoys. Guessing this is one of those monster tasks mentioned above.

/edit. Here is the task in question: https://boinc.bakerlab.org/rosetta/result.php?resultid=1220531809

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,686,153
RAC: 7,085
Message 98121 - Posted: 16 Jul 2020, 10:35:57 UTC - in response to Message 98120.  
Last modified: 16 Jul 2020, 10:37:00 UTC

All of my three finished in 2 days 2 hours CPU time. One of them states less than that, but it seemed to do half of it, then the other half after the windows update reboot [1] without acknowledging how much time it had already done, although it must have saved something, because the second half took less time than the first half. All three did only 1 decoy. So either very big decoys, or your computer was faster. I can't see which computer did the task you mentioned, as that task is no longer listed on the server. But it looks like you have a bunch of xeons similar to my four X5650s. I do love watching 24 tasks running per machine. They're not very efficient with electricity, but they were only £7 a chip!

[1] I've done this to hopefully stop it in the future: https://www.windowscentral.com/how-prevent-windows-10-rebooting-after-installing-updates

CIA
Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 98125 - Posted: 16 Jul 2020, 16:16:32 UTC - in response to Message 98121.  
Last modified: 16 Jul 2020, 16:17:04 UTC

I can't remember which of my three Xeon 12c/24t boxes got the task. 1 of them (OSX) is my only 8hr WU box, the other 2 are 24hr WU's, one Win10, the other OSX. (X5690, X5675, and X5670). I think it was my 2.93 box (the X5670, OSX) that got the long unit.

CIA
Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 98128 - Posted: 16 Jul 2020, 17:28:34 UTC
Last modified: 16 Jul 2020, 17:34:58 UTC

I appear to have another one. As I'm typing it hasn't finished yet but currently it stands at just over 17 hours crunch time. The machine it's crunching on is set to 8hr WU's.

Either that or it's a bad WU and something's up.

/edit. It's at 99.028% complete. Every so often it increases .001% but seems to have stalled. It does have the same naming convention as OP's long WU, so maybe it's the same deal. I'll leave it running and see what happens.

CIA
Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 98130 - Posted: 16 Jul 2020, 18:35:30 UTC - in response to Message 98128.  
Last modified: 16 Jul 2020, 18:36:35 UTC

Update, the above task finished after almost 18 hours, waaaaay over. 1 decoy produced.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,686,153
RAC: 7,085
Message 98133 - Posted: 16 Jul 2020, 19:03:24 UTC - in response to Message 98130.  

All mine completed. Just leave long ones running, nobody's has failed (or gone over the 3 day deadline) yet.

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 98226 - Posted: 19 Jul 2020, 8:47:27 UTC - in response to Message 98092.  

Sid wrote:
no-one knows

Tasks are delivered with a command-line option
-boinc::cpu_run_timeout 36000
which suggests it’s 10 hours
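
The conversion behind that reading is simple enough to check in Python. The sketch below assumes the value after the option is in seconds (the post does not say so, but 36000 seconds is exactly 10 hours) and that the option and its value arrive as two separate arguments. The function name is made up for this example.

```python
from typing import List, Optional


def cpu_run_timeout_hours(args: List[str]) -> Optional[float]:
    """Return the -boinc::cpu_run_timeout value in hours, or None if absent.

    Assumes the number following the flag is a count of seconds.
    """
    flag = "-boinc::cpu_run_timeout"
    for index, arg in enumerate(args[:-1]):
        if arg == flag:
            return int(args[index + 1]) / 3600
    return None


print(cpu_run_timeout_hours(["-boinc::cpu_run_timeout", "36000"]))
```

If the unit really is seconds, this agrees with the 10 hour watchdog figure discussed earlier in the thread.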

©2024 University of Washington
https://www.bakerlab.org