Message boards : Number crunching : Never ending tasks and past tasks
Author | Message |
---|---|
äxl Joined: 30 Dec 08 Posts: 11 Credit: 497,080 RAC: 0 | I had 2 tasks that were running for days and still showed about 20h left. I cancelled those manually a few days ago when they hit the deadline. Now I've got 2 tasks that have run for 1d15h and still show 19-23h left. Yesterday they showed 15h left... What is wrong? Is it my system? I also can't find the past tasks that I cancelled, and that was no more than a week ago. https://boinc.bakerlab.org/rosetta/results.php?userid=294942 Running BOINC because: 1) I'm using 100% green energy (no certificates or other nonsense) 2) My computer runs mostly anyway (due to BT and other nonsense) 3) To help |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 | Quote: "I had 2 tasks that were running for days and they still showed about 20h left. I cancelled those manually a few days ago when they hit the deadline." Weird. If you click on the task and select Properties, what does it show for CPU time and Elapsed time? I'm wondering if it's running at all. |
äxl Joined: 30 Dec 08 Posts: 11 Credit: 497,080 RAC: 0 | Quote: "Weird. If you click on the task and select properties, what does it show for CPU time and Elapsed time?" Oh, cool. I didn't know this feature. One of them says: CPU time 15:29:54; Elapsed time 1d 15:25:10; Estimated time remaining 22:53:40; Fraction done 63.260%; Progress rate 1.440% per hour. The other one is at 68%. So everything's okay, I guess. Thanks! (I still wonder where the past cancelled tasks went. Didn't task history used to be longer?) |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 | Quote: "Weird. If you click on the task and select properties, what does it show for CPU time and Elapsed time?" Well, it's definitely running, but it's getting interrupted quite a lot (24hr difference between the two). Do you have "Suspend when computer is in use" checked? What's the time since the last checkpoint? Has it checkpointed at all? I'm guessing this must be one of those 16hr (CPU time) tasks, otherwise the watchdog would have cut in already. It looks like another dodgy task; it's not looking good. Don't worry about old tasks too much. They do seem to be aging them off quite quickly, I agree. |
Mod.Sense (Volunteer moderator) Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 | Quote: "it's getting interrupted quite a lot (24hr difference between the two)" Isn't it a difference of 4m 54? (DOH! I missed the "1d" there!) Ignore the estimated time remaining. It is 63% done in 15.5 hours of CPU. It should complete at 24 hours of CPU. Rosetta Moderator: Mod.Sense |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1682 Credit: 17,854,150 RAC: 18,215 | A big difference between CPU time and Run time indicates an overcommitted system: some other programme or process is using up CPU time (Rosetta applications are set to Idle priority level to play nice with other applications, so pretty much any other application making use of the CPU will stop Rosetta from processing work). It will also occur if you make use of "Use at most 100 % of CPU time" with any value less than 100%. If you haven't made use of this setting, I'd check your system for programmes/processes other than Rosetta that are making heavy use of the CPU. Being a 2-core system, just a web browser with running scripts would have a big impact on Rosetta processing. Grant Darwin NT |
äxl Joined: 30 Dec 08 Posts: 11 Credit: 497,080 RAC: 0 | Quote: "Well, it's definitely running, but it's getting interrupted quite a lot (24hr difference between the two)" Here's the full output: Application: Rosetta 4.07; Name: 3az4ii6b_jhr_design1_COVID-19_SAVE_ALL_OUT_903430_1; State: Running; Received: Sun 29 Mar 2020 05:38:35 CEST; Report deadline: Mon 06 Apr 2020 05:38:34 CEST; Estimated computation size: 80,000 GFLOPs; CPU time: 17:05:56; CPU time since checkpoint: 00:04:11; Elapsed time: 1d 19:53:47; Estimated time remaining: 19:00:07; Fraction done: 69.789%; Virtual memory size: 1.32 GB; Working set size: 1019.19 MB; Directory: slots/1; Process ID: 13540; Progress rate: 1.440% per hour; Executable: rosetta_4.07_i686-pc-linux-gnu. And yes, work is interrupted quite a lot. But I've set it up that way 'cause I don't wanna fry my CPU. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1682 Credit: 17,854,150 RAC: 18,215 | Quote: "And yes, work is interrupted quite a lot. But I've set it up that way 'cause I don't wanna fry my CPU." What temperature is the CPU running at? As long as it's 70 °C or lower, it's not an issue. From memory, even with the stock heatsink and fan, even running Rosetta 24/7 shouldn't put its temperature over 70 °C as long as the heatsink and fan are clean, along with the inlets and outlets of your case and the case fan(s). The other option would be to let the Tasks run uninterrupted, but only use 1 core of your CPU. More processing would get done, and you'd still keep the CPU cool. The settings for that would be "Use at most 50 % of the CPUs" and "Use at most 100 % of CPU time". |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 8,784 | Quote: "Well, it's definitely running, but it's getting interrupted quite a lot (24hr difference between the two)" Well, the task is running and it's checkpointing fine. Is it right, what Mod.Sense worked out, that you've changed your preferred runtime to 24hrs rather than the default 8hrs? Because if you've also set it to suspend running while in use, that's going to extend the runtime to exactly what you're seeing, and sometimes you'll struggle to meet the deadline. The task has run successfully for 17hrs. Set a more appropriate preferred runtime (and the default 8hr suits you) and it should report in a time acceptable to you, before the deadline. You asked at the start if it's the task/project or you. It's your amended settings. |
äxl Joined: 30 Dec 08 Posts: 11 Credit: 497,080 RAC: 0 | Quote: "What temperature is the CPU running at? As long as it's 70 °C or lower, it's not an issue. From memory, even with the stock heatsink and fan, even running Rosetta 24/7 shouldn't put its temperature over 70 °C as long as the heatsink and fan are clean, along with the inlets and outlets of your case and the case fan(s)." Thanks for reminding me to clean my (stock) heatsink and fan. I didn't know it would make such a big difference. :/ Unfortunately the case fan is broken so I always keep the case open. So I will have to clean it more often. xD I try to keep my temperature at around 60 °C. When I limit cores to one I can indeed reach 100% without going too far above 70 °C. But I can also run BOINC on both cores and set usage safely to 70%. This is better, isn't it? Quote: "Well, the task is running and it's checkpointing fine." Why? Wouldn't a 1d WU give me a farther-away deadline? Quote: "The task has run successfully for 17hrs. Set a more appropriate preferred runtime (and the default 8hr suits you) and it should report in a time acceptable to you, before deadline." True. xD Now I've got a task running that hasn't been checkpointed since start. Is that bad? CPU time 07:12:10; CPU time since checkpoint 07:12:10. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1682 Credit: 17,854,150 RAC: 18,215 | Quote: "Thanks for reminding me to clean my (stock) heatsink and fan. I didn't know it would make such a big difference. :/" A small desktop fan blowing in to the system is your friend. I needed this many years ago to keep a system's CPU below 80 °C when working hard. :-) Quote: "I try to keep my temperature at around 60 °C. When I limit cores to one I can indeed reach 100% without going too far above 70 °C." Not really. Making use of "Use at most x% of CPU time" to reduce the load on a CPU actually puts more stress on the CPU, as the constant starting and stopping puts quite a bit of thermal stress on it: it gets hot, then cool, then hot, then cool, then hot, then cool. Expand, contract, expand, contract, expand, contract... Quote: "Is it right, what Mod.Sense worked out, that you've changed your preferred runtime to 24hrs rather than the default 8hrs?" "Why? Wouldn't a 1d WU give me a farther-away deadline?" Nope. The deadline is fixed; it is the period of time in which to return a Task. If you have it set to run for 24 hours, and then make use of "Use at most x% of CPU time" to keep your CPU cool, as you have found, that increases the time it takes to finish the Task. Hence going with the default Target CPU runtime, making use of just the 1 core and setting "Use at most x% of CPU time" to 100% would be your best option: keep the temperatures down, get plenty of work done, and not run in to deadline problems. Quote: "Now I've got a task running that hasn't been checkpointed since start. Is that bad?" My understanding is that Rosetta only checkpoints at the completion of a Decoy, so with a very slow CPU, and a Task that requires a lot of processing to produce a Decoy, it will take a long time before a checkpoint occurs. Edit: although looking at some of my tasks I get this: CPU time at last checkpoint 4:19:48; CPU time 4:20:38. That's on most Tasks, which indicates it is checkpointing every few minutes (at least on these Tasks). |
Mod.Sense (Volunteer moderator) Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 | It looks like some Linux machines (others?) are seeing WUs where they don't get past the first model on v4.12. Similar discussion here |
äxl Joined: 30 Dec 08 Posts: 11 Credit: 497,080 RAC: 0 | Quote: "A small desktop fan blowing in to the system is your friend." I measured voltage on the fan outlet to check if it's the mainboard. I accidentally short-circuited the system and it crashed. I replugged the fan and now it runs at 100%. LOL (But: The fan is blowing outside. I didn't touch it physically.) I can now safely run BOINC at 90% with both cores active. Quote: "I try to keep my temperature at around 60 °C. When I limit cores to one I can indeed reach 100% without going too far above 70 °C." "Not really." This CPU is almost 13 years old so it wouldn't be too bad if it broke IMO. Also, isn't 2x70 more than 1x100? Quote: "Why? Wouldn't a 1d WU give me a farther-away deadline?" "Nope." Okay, I will lower the preferred runtime in settings. But maybe I don't need to lower it too much. I am running this script BTW: https://gitlab.com/UMLAUTaxl/boinctemp/blob/master/boinctemp.sh I guess I could rewrite it a bit to activate/deactivate cores between longer intervals instead of changing CPU usage every minute. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1682 Credit: 17,854,150 RAC: 18,215 | Quote: "Also isn't 2x70 more than 1x100?" Maybe, but not always, because the CPU doesn't stop or slow down instantly, nor start or speed up instantly. More importantly, particularly with a dual-core system, and I assume you are doing other things on it as well as processing BOINC work, those things will also reduce the amount of time BOINC processing is actually done. So when it comes to limited cores, threads and CPU time, 1x100 can end up being more than 2x70. |
Mod.Sense (Volunteer moderator) Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 | This would especially be likely if you have contention for L2/L3 cache when both cores are active. |
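
The arithmetic discussed in this thread (how much of the elapsed time a task actually spent on the CPU, and roughly how much CPU time it will need in total) can be sketched in Python. This is only an illustration, not anything BOINC or Rosetta provides: the function names are made up for this example, the time format is assumed to match what the task properties dialog shows in the posts above, and the projection assumes progress is linear in CPU time.

```python
def parse_duration(text: str) -> int:
    """Convert 'HH:MM:SS' or 'Nd HH:MM:SS' (as shown in task properties) to seconds."""
    text = text.strip()
    days = 0
    if "d" in text:
        # Split off the leading day count, e.g. '1d 15:25:10' -> '1' and ' 15:25:10'.
        day_part, text = text.split("d", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_share(cpu_time: str, elapsed_time: str) -> float:
    """Fraction of wall-clock time the task actually spent on the CPU."""
    return parse_duration(cpu_time) / parse_duration(elapsed_time)


def projected_total_cpu_hours(cpu_time: str, fraction_done: float) -> float:
    """Total CPU hours, assuming progress is linear in CPU time (an assumption)."""
    return parse_duration(cpu_time) / fraction_done / 3600


# Figures from the first set of task properties posted in the thread.
share = cpu_share("15:29:54", "1d 15:25:10")
total = projected_total_cpu_hours("15:29:54", 0.6326)
print(f"CPU share of elapsed time: {share:.0%}")
print(f"Projected total CPU time: {total:.1f} h")
```

With those figures the task was on the CPU for a bit under 40% of the elapsed time, and the linear projection lands close to the 24 hours of CPU time Mod.Sense mentions.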
©2024 University of Washington
https://www.bakerlab.org