Never ending tasks and past tasks

äxl
Joined: 30 Dec 08
Posts: 11
Credit: 497,080
RAC: 0
Message 93255 - Posted: 3 Apr 2020, 18:03:28 UTC
Last modified: 3 Apr 2020, 18:15:19 UTC

I had 2 tasks that were running for days and they still showed about 20h left. I cancelled those manually a few days ago when they hit the deadline.

Now I've got 2 tasks that have been running for 1d15h and they still show 19-23h left. Yesterday they showed 15h left ...
What is wrong? Is it my system?

I also can't find the past tasks that I cancelled. That was no longer ago than last week.

https://boinc.bakerlab.org/rosetta/results.php?userid=294942
Running BOINC because:
1) I'm using 100% green energy (no certificates or other non-sense)
2) My computer runs mostly anyway (due to BT and other non-sense)
3) To help

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 93269 - Posted: 3 Apr 2020, 19:18:22 UTC - in response to Message 93255.  

Weird. If you click on the task and select properties, what does it show for CPU time and Elapsed time?
I'm wondering if it's running at all?

äxl
Message 93272 - Posted: 3 Apr 2020, 19:28:39 UTC - in response to Message 93269.  
Last modified: 3 Apr 2020, 19:30:39 UTC

Oh, cool. I didn't know about this feature.
One of them says:
CPU time 15:29:54
Elapsed time 1d 15:25:10
Estimated time remaining 22:53:40
Fraction done 63.260%
Progress rate 1.440% per hour


The other one is at 68%. So everything's okay I guess.
Thanks!

(I still wonder where the past cancelled tasks went. Didn't task history used to be longer?)

Sid Celery
Message 93294 - Posted: 3 Apr 2020, 21:58:50 UTC - in response to Message 93272.  

Well, it's definitely running, but it's getting interrupted quite a lot (24hr difference between the two)
Do you have "Suspend when computer is in use" checked?
What's the time since the last checkpoint? Has it checkpointed at all?
I'm guessing this must be one of those 16hr (cpu time) tasks otherwise the watchdog would have cut in already
It looks like another dodgy task - it's not looking good

Don't worry about old tasks too much. They do seem to be aging them off quite quickly, I agree

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93299 - Posted: 3 Apr 2020, 22:10:15 UTC - in response to Message 93294.  
Last modified: 3 Apr 2020, 22:10:46 UTC

> it's getting interrupted quite a lot (24hr difference between the two)


Isn't it a difference of 4m 54?
(DOH! I missed the "1d" there!)

Ignore the estimated time remaining. It is 63% done in 15.5 hours of CPU. It should complete at 24 hours of CPU.
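
A rough sketch of that arithmetic, assuming progress is roughly linear in CPU time (an assumption, not something BOINC guarantees). The helper names are hypothetical; the inputs are the figures from the task properties posted above.

```python
def parse_hms(text):
    """Parse 'HH:MM:SS' or 'Nd HH:MM:SS' (as shown in task properties) into seconds."""
    days = 0
    if "d" in text:
        day_part, text = text.split("d")
        days = int(day_part)
    hours, minutes, seconds = (int(p) for p in text.strip().split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def projected_total_cpu_hours(cpu_time, fraction_done):
    """Estimate total CPU hours, assuming progress is linear in CPU time."""
    return parse_hms(cpu_time) / fraction_done / 3600


total = projected_total_cpu_hours("15:29:54", 0.63260)
done = parse_hms("15:29:54") / 3600
print(f"projected total CPU time: {total:.1f} h, still to do: {total - done:.1f} h")
```

With these inputs the projection comes out at roughly 24.5 hours of CPU in total, which is in line with a 24-hour target runtime.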
Rosetta Moderator: Mod.Sense

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1682
Credit: 17,854,150
RAC: 18,215
Message 93301 - Posted: 3 Apr 2020, 22:33:58 UTC

A big difference between CPU time & Run time indicates an overcommitted system: some other programme or process is using up CPU time (Rosetta applications are set to Idle priority level to play nice with other applications, so pretty much any other application making use of the CPU will stop Rosetta from processing work).
It will also occur if you make use of "Use at most 100 % of CPU time" with any value less than 100%. If you haven't made use of this setting, I'd check your system for programmes/processes other than Rosetta that are making heavy use of the CPU.
Being a 2 core system, just a web browser with running scripts would have a big impact on Rosetta processing.
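
As a rough illustration, the ratio of CPU time to elapsed time can be worked out directly. This is only a sketch: the function name is hypothetical, and the inputs are the CPU time and Elapsed time figures posted earlier in the thread.

```python
from datetime import timedelta


def cpu_efficiency(cpu_time, elapsed_time):
    """Fraction of wall-clock time the task actually spent on the CPU."""
    return cpu_time / elapsed_time  # timedelta / timedelta gives a float


cpu = timedelta(hours=15, minutes=29, seconds=54)
elapsed = timedelta(days=1, hours=15, minutes=25, seconds=10)
eff = cpu_efficiency(cpu, elapsed)
print(f"{eff:.0%} of the elapsed time was CPU time")
```

For those figures the task was on the CPU for a bit under 40% of the elapsed time, so the remaining 60% or so went to throttling, suspension, or other processes.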
Grant
Darwin NT

äxl
Message 93346 - Posted: 4 Apr 2020, 4:59:04 UTC - in response to Message 93294.  

> Well, it's definitely running, but it's getting interrupted quite a lot (24hr difference between the two)
> Do you have "Suspend when computer is in use" checked?
> What's the time since the last checkpoint? Has it checkpointed at all?
> I'm guessing this must be one of those 16hr (cpu time) tasks otherwise the watchdog would have cut in already
> It looks like another dodgy task - it's not looking good
>
> Don't worry about old tasks too much. They do seem to be aging them off quite quickly, I agree


Here's the full output:
Application 			Rosetta 4.07 
Name 				3az4ii6b_jhr_design1_COVID-19_SAVE_ALL_OUT_903430_1
State 				Running
Received 			Sun 29 Mar 2020 05:38:35 CEST
Report deadline 		Mon 06 Apr 2020 05:38:34 CEST
Estimated computation size 	80,000 GFLOPs
CPU time 			17:05:56
CPU time since checkpoint 	00:04:11
Elapsed time 			1d 19:53:47
Estimated time remaining 	19:00:07
Fraction done 			69.789%
Virtual memory size 		1.32 GB
Working set size 		1019.19 MB
Directory 			slots/1
Process ID 			13540
Progress rate 			1.440% per hour
Executable 			rosetta_4.07_i686-pc-linux-gnu


And yes, work is interrupted quite a lot. But I've set it up that way cause I don't wanna fry my CPU.

Grant (SSSF)
Message 93347 - Posted: 4 Apr 2020, 5:05:55 UTC - in response to Message 93346.  
Last modified: 4 Apr 2020, 5:11:55 UTC

> And yes, work is interrupted quite a lot. But I've set it up that way cause I don't wanna fry my CPU.

What temperature is the CPU running at? As long as it's 70°C or lower, it's not an issue. From memory, even with the stock heatsink & fan, running Rosetta 24/7 shouldn't put its temperature over 70°C as long as the heatsink & fan are clean, along with the inlets & outlets of your case and the case fan(s).

The other option would be to let the Tasks run uninterrupted, but only use 1 core of your CPU. More processing would get done, and you'd still keep the CPU cool.
Use at most  50 % of the CPUs
Use at most 100 % of CPU time
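
For anyone who prefers a local override file to the web preferences, here is a minimal sketch that builds the same two settings as XML. It assumes the BOINC client reads a global_prefs_override.xml in its data directory with max_ncpus_pct and cpu_usage_limit elements, which is how I understand the client to work; check the BOINC documentation before relying on it. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_override(max_ncpus_pct, cpu_usage_limit):
    """Build override XML for 'use at most N% of the CPUs' and 'N% of CPU time'."""
    root = ET.Element("global_preferences")
    # Percentage of processors BOINC may use (50 on a dual core = 1 core).
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_ncpus_pct:.1f}"
    # Percentage of CPU time BOINC may use (100 = no throttling).
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_usage_limit:.1f}"
    return ET.tostring(root, encoding="unicode")


xml_text = build_override(50, 100)
print(xml_text)
```

The printed text would go into the override file, after which the client has to be told to re-read its preferences (or be restarted).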

Sid Celery
Message 93479 - Posted: 5 Apr 2020, 10:38:55 UTC - in response to Message 93346.  

Well, the task is running and it's checkpointing fine.
Is it right, what Mod.Sense worked out, that you've changed your preferred runtime to 24hrs rather than the default 8hrs?
Because if you've also set it to suspend running while in use, that's going to extend the runtime to exactly what you're seeing and sometimes you'll struggle to meet deadline.
The task has run successfully for 17hrs. Set a more appropriate preferred runtime (the default 8hr would suit you) and it should report in a time acceptable to you, before deadline.

You asked at the start if it's the task/project or you. It's your amended settings.

äxl
Message 93645 - Posted: 6 Apr 2020, 15:51:39 UTC - in response to Message 93347.  
Last modified: 6 Apr 2020, 16:36:13 UTC

> What temperature is the CPU running at? As long as it's 70°C or lower, it's not an issue. From memory, even with the stock heatsink & fan, running Rosetta 24/7 shouldn't put its temperature over 70°C as long as the heatsink & fan are clean, along with the inlets & outlets of your case and the case fan(s).
>
> The other option would be to let the Tasks run uninterrupted, but only use 1 core of your CPU. More processing would get done, and you'd still keep the CPU cool.
> Use at most  50 % of the CPUs
> Use at most 100 % of CPU time

Thanks for reminding me to clean my (stock) heatsink and fan. I didn't know it would make such a big difference. :/
Unfortunately the case fan is broken so I always keep the case open. So I will have to clean it more often. xD

I try to keep my temperature at around 60 °C. When I limit cores to one I can indeed reach 100% without going too far above 70 °C.
But I can also run BOINC on both cores and set usage safely to 70%. This is better, isn't it?

> Well, the task is running and it's checkpointing fine.
> Is it right, what Mod.Sense worked out, that you've changed your preferred runtime to 24hrs rather than the default 8hrs?
> Because if you've also set it to suspend running while in use, that's going to extend the runtime to exactly what you're seeing and sometimes you'll struggle to meet deadline.

Why? Wouldn't a 1d WU give me a farther-away deadline?

> The task has run successfully for 17hrs. Set a more appropriate preferred runtime (the default 8hr would suit you) and it should report in a time acceptable to you, before deadline.
>
> You asked at the start if it's the task/project or you. It's your amended settings.

True. xD

Now I've got a task running that hasn't been checkpointed since start. Is that bad?
CPU time			07:12:10
CPU time since checkpoint 	07:12:10

Grant (SSSF)
Message 93692 - Posted: 6 Apr 2020, 22:53:43 UTC - in response to Message 93645.  
Last modified: 6 Apr 2020, 22:59:11 UTC

> Thanks for reminding me to clean my (stock) heatsink and fan. I didn't know it would make such a big difference. :/
> Unfortunately the case fan is broken so I always keep the case open. So I will have to clean it more often. xD

A small desktop fan blowing into the system is your friend.
I needed this many years ago to keep a system's CPU below 80°C when working hard.
:-)


> I try to keep my temperature at around 60 °C. When I limit cores to one I can indeed reach 100% without going too far above 70 °C.
> But I can also run BOINC on both cores and set usage safely to 70%. This is better, isn't it?

Not really.
Making use of "Use at most x% of CPU time" to reduce the load on a CPU actually puts more stress on it, as the constant starting & stopping causes quite a bit of thermal stress: it gets hot, then cool, then hot, then cool, then hot, then cool. Expand, contract, expand, contract, expand, contract...


>> Is it right, what Mod.Sense worked out, that you've changed your preferred runtime to 24hrs rather than the default 8hrs?
>> Because if you've also set it to suspend running while in use, that's going to extend the runtime to exactly what you're seeing and sometimes you'll struggle to meet deadline.
> Why? Wouldn't a 1d WU give me a farther-away deadline?

Nope.
The deadline is fixed; it is the period of time in which to return a Task. If you have it set to run for 24 hours, and then make use of "Use at most x% of CPU time" to keep your CPU cool, then, as you have found, that increases the time it takes to finish the Task.
Hence going with the default Target CPU runtime, making use of just the 1 core & setting "Use at most x% of CPU time" to 100% would be your best option: keep the temperatures down, get plenty of work done, and not run into deadline problems.
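
To put numbers on that, here is a back-of-the-envelope sketch. It assumes the Task needs its full target runtime in CPU hours and only gets a fixed share of the elapsed time; the function name and the example shares (100%, 70%, and the roughly 39% seen on the Task above) are illustrative assumptions, not values read from BOINC.

```python
def wall_hours_needed(target_cpu_hours, cpu_share):
    """Elapsed hours for a Task needing target_cpu_hours of CPU
    while only getting cpu_share (0 < share <= 1) of the elapsed time."""
    if not 0 < cpu_share <= 1:
        raise ValueError("cpu_share must be in (0, 1]")
    return target_cpu_hours / cpu_share


for target in (8, 24):
    for share in (1.0, 0.7, 0.39):
        hours = wall_hours_needed(target, share)
        print(f"{target:>2} h target at {share:.0%} CPU share -> {hours:5.1f} h elapsed")
```

A 24 hour target at a 39% share works out to over 61 hours of elapsed time per Task, and that is before counting time spent waiting in the queue or with the computer switched off, all of which comes out of the same fixed deadline window.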


> Now I've got a task running that hasn't been checkpointed since start. Is that bad?
> CPU time			07:12:10
> CPU time since checkpoint 	07:12:10

My understanding is that Rosetta only checkpoints at the completion of a Decoy, so with a very slow CPU, and a Task that requires a lot of processing to produce a Decoy, it will take a long time before a checkpoint occurs.


Edit: although, looking at some of my tasks, I get this:
CPU time at last checkpoint 4:19:48
CPU time                    4:20:38
and that's on most Tasks which indicates it is checkpointing every few minutes (at least on these Tasks).

Mod.Sense
Volunteer moderator
Message 93693 - Posted: 6 Apr 2020, 23:00:23 UTC

It looks like some Linux machines (others?) are seeing WUs where they don't get past the first model on v4.12. Similar discussion here

äxl
Message 94839 - Posted: 19 Apr 2020, 7:53:45 UTC - in response to Message 93692.  
Last modified: 19 Apr 2020, 7:54:21 UTC

> A small desktop fan blowing into the system is your friend.
> I needed this many years ago to keep a system's CPU below 80°C when working hard.
> :-)

I measured the voltage on the fan outlet to check if it's the mainboard. I accidentally short-circuited the system and it crashed. I replugged the fan and now it runs at 100%. LOL (But: The fan is blowing outside. I didn't touch it physically.)
I can now safely run BOINC at 90% with both cores active.
>> I try to keep my temperature at around 60 °C. When I limit cores to one I can indeed reach 100% without going too far above 70 °C.
>> But I can also run BOINC on both cores and set usage safely to 70%. This is better, isn't it?
> Not really.
> Making use of "Use at most x% of CPU time" to reduce the load on a CPU actually puts more stress on it, as the constant starting & stopping causes quite a bit of thermal stress: it gets hot, then cool, then hot, then cool, then hot, then cool. Expand, contract, expand, contract, expand, contract...

This CPU is almost 13 years old so it wouldn't be too bad if it broke IMO.
Also isn't 2x70 more than 1x100?
>> Why? Wouldn't a 1d WU give me a farther-away deadline?
> Nope.
> The deadline is fixed; it is the period of time in which to return a Task. If you have it set to run for 24 hours, and then make use of "Use at most x% of CPU time" to keep your CPU cool, then, as you have found, that increases the time it takes to finish the Task.
> Hence going with the default Target CPU runtime, making use of just the 1 core & setting "Use at most x% of CPU time" to 100% would be your best option: keep the temperatures down, get plenty of work done, and not run into deadline problems.

Okay, I will lower preferred runtime in settings. But maybe I don't need to do it too much.

I am running this script BTW: https://gitlab.com/UMLAUTaxl/boinctemp/blob/master/boinctemp.sh
I guess I could rewrite it a bit to activate/deactivate cores between longer intervals instead of changing CPU usage every minute.
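
Something like this is what I have in mind for the core switching. It is only a sketch of the decision logic in Python, not taken from the script linked above: the 60 °C and 70 °C thresholds, the function name and the temperature readings are all assumptions for illustration, and actually reading the sensor and applying the setting is left out.

```python
def next_core_pct(current_pct, temp_c, high_c=70.0, low_c=60.0):
    """Pick the 'use at most N% of the CPUs' value for the next interval
    on a two-core machine, with a dead band between low_c and high_c."""
    if temp_c >= high_c:
        return 50       # too hot: drop to one core
    if temp_c <= low_c:
        return 100      # cool enough: use both cores
    return current_pct  # in between: keep the current setting


# Simulated readings taken once per (long) interval, starting with both cores.
readings = [58, 65, 71, 66, 62, 59]
setting = 100
history = []
for temp in readings:
    setting = next_core_pct(setting, temp)
    history.append(setting)
print(history)
```

The gap between the two thresholds is there so the setting doesn't flip back and forth every interval when the temperature hovers around a single limit.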

Grant (SSSF)
Message 94842 - Posted: 19 Apr 2020, 8:08:38 UTC - in response to Message 94839.  
Last modified: 19 Apr 2020, 8:09:58 UTC

> Also isn't 2x70 more than 1x100?

Maybe, but not always, because the CPU doesn't stop/slow down instantly, nor start/speed up instantly. More importantly, particularly with a dual core system, and I assume you're doing other things on it as well as processing BOINC work, those things will also reduce the amount of time BOINC processing is actually done.
So when it comes to limited cores & threads & CPU time, 1x100 can end up being more than 2x70.
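
A deliberately simple model of that trade-off, as a sketch only: it treats throughput as cores x CPU time fraction, reduced by a single overhead_loss factor standing in for start/stop inefficiency and competition from other programmes. The function names and the loss factor are hypothetical; nothing here is measured.

```python
def effective_throughput(cores, cpu_time_fraction, overhead_loss=0.0):
    """Core-equivalents of useful work under the simple model above."""
    return cores * cpu_time_fraction * (1.0 - overhead_loss)


def break_even_loss(cores, cpu_time_fraction, baseline=1.0):
    """Overhead loss at which the throttled setup only matches the baseline."""
    return 1.0 - baseline / (cores * cpu_time_fraction)


nominal = effective_throughput(2, 0.70)   # 2x70 with no losses at all
limit = break_even_loss(2, 0.70)          # compared against 1x100
print(f"2x70 nominal: {nominal:.2f} core-equivalents")
print(f"2x70 falls behind 1x100 once losses exceed {limit:.0%}")
```

Under this model 2x70 is nominally 1.4 core-equivalents, so it only loses to 1x100 if more than about 29% of that is lost to overhead. Whether that happens depends on what else the machine is doing.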

Mod.Sense
Volunteer moderator
Message 94879 - Posted: 19 Apr 2020, 14:32:45 UTC - in response to Message 94842.  
Last modified: 19 Apr 2020, 14:33:00 UTC

This would especially be likely if you have contention for L2/L3 cache when both cores are active.

©2024 University of Washington
https://www.bakerlab.org