Message boards : Number crunching : Very long running 12v1n_ task
Author | Message |
---|---|
awdorrin Joined: 2 Apr 20 Posts: 4 Credit: 18,986,927 RAC: 0 |
Hello - I am new to R@H, only crunching for a few weeks now. I have one task that I am not sure if I should abort or keep running, as it has been running for over 2 days. Here is what the task properties showed me: Application: Rosetta 4.15; Name: 12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_27; State: Running; Received: 4/15/2020 4:06:19 AM; Report deadline: 4/18/2020 4:06:19 AM; Estimated computation size: 80,000 GFLOPs; CPU time: 2d 07:26:47; CPU time since checkpoint: 2d 07:26:47; Elapsed time: 2d 06:10:15; Estimated time remaining: 00:09:46; Fraction done: 99.700%; Virtual memory size: 245.11 MB; Working set size: 26.15 MB; Progress rate: 1.800% per hour; Executable: rosetta_4.15_windows_x86_64.exe |
gFreezer Joined: 4 Aug 17 Posts: 4 Credit: 570,003 RAC: 0 |
I have a task called "12v1n_al_12mer_design_00027_001895_0001_SAVE_ALL_OUT_913636_158" that is approaching 3 days of runtime. I saw some users reporting similar runtimes for WUs starting with "12v1n_al_12mer_design_". Are these the very long-running work units mentioned in the OP? The problem is that this task has the same deadline as the tasks with a "normal" runtime. It is almost half a day late already. Is this a problem? |
gFreezer Joined: 4 Aug 17 Posts: 4 Credit: 570,003 RAC: 0 |
I have the same problem with a very similarly named task: Application: Rosetta 4.15; Name: 12v1n_al_12mer_design_00027_001895_0001_SAVE_ALL_OUT_913636_158; State: Running; Received: Thu 16 Apr 2020 07:04:50 AM CEST; Report deadline: Sun 19 Apr 2020 07:04:49 AM CEST; Estimated computation size: 80,000 GFLOPs; CPU time: 2d 16:41:35; CPU time since checkpoint: 2d 16:41:35; Elapsed time: 2d 17:46:58; Estimated time remaining: 00:10:10; Fraction done: 99.743%; Virtual memory size: 381.02 MB; Working set size: 259.35 MB; Directory: slots/10; Process ID: 17381; Progress rate: 1.440% per hour; Executable: rosetta_4.15_x86_64-pc-linux-gnu. I think it might be one of the huge work units mentioned in this thread, so it may be expected for these tasks to run much longer than others. I would hold off on aborting the task until the team clarifies. |
Holdolin Joined: 19 Mar 20 Posts: 4 Credit: 2,431,917 RAC: 0 |
Hello - I am new to R@H, only crunching for a few weeks now. I've had a couple of those. They finished OK, but I too raised an eyebrow when I checked in on the system and saw it had been crunching that WU for 2 days. Not even sure how I missed it, as I check my systems regularly. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I just saw another similar report on the same type of WU. The maximum CPU time a WU could be configured to run is the 36-hour maximum runtime preference plus the new 10-hour watchdog, 46 hours in total. That is less than the 2+ days of CPU time you are showing, so I have to take that as evidence that it is not running normally and should be aborted. Rosetta Moderator: Mod.Sense |
awdorrin Joined: 2 Apr 20 Posts: 4 Credit: 18,986,927 RAC: 0 |
Thanks for the feedback - I will leave them running for a while longer, to see what happens. However, I did notice that the deadline for both of these tasks was a day ago (24.5 hrs ago and 32.5 hrs ago). |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@Holdolin, can you link to the WUs where you saw the extreme runtimes? Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@gFreezer, and others, please link to completed WUs as they are reported back, so we can all see the result details. Rosetta Moderator: Mod.Sense |
Ace Casino Joined: 16 Jul 07 Posts: 18 Credit: 14,011,380 RAC: 14,523 |
I've been deleting them. If I see a work unit that has been running for 15 hours or 1+ days, with little to no progress as I watch it, it gets deleted. |
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
Yes, I recommend aborting these tasks. More details here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12554&postid=94905#94905 |
gFreezer Joined: 4 Aug 17 Posts: 4 Credit: 570,003 RAC: 0 |
@Mod.Sense It's WU 1036003152. Looking at the ..._check.txt file in the slot directory, it really seems to be a faulty WU:
$ tail /var/lib/boinc/slots/10/12v1n_al_12mer_design_00027_001895_0001_check.txt
LAST 2488 SUCCESS 0
LAST 2489 SUCCESS 0
LAST 2490 SUCCESS 0
LAST 2491 SUCCESS 0
LAST 2492 SUCCESS 0
LAST 2493 SUCCESS 0
LAST 2494 SUCCESS 0
LAST 2495 SUCCESS 0
LAST 2496 SUCCESS 0
LAST 2497 SUCCESS 0
For comparison, here's the check.txt of another Rosetta 4.15 WU that's been running for 2 hours:
$ tail /var/lib/boinc/slots/0/12v1n_al_12mer_design_00008_002192_0001_check.txt
LAST 7 SUCCESS 6
LAST 8 SUCCESS 7
LAST 9 SUCCESS 8
LAST 10 SUCCESS 9
LAST 11 SUCCESS 10
LAST 12 SUCCESS 11
LAST 13 SUCCESS 12
LAST 14 SUCCESS 13
LAST 15 SUCCESS 14
LAST 16 SUCCESS 15
I'm going to abort it now; it has timed out and been passed on to another host anyway. |
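
For anyone who wants to run the same check on their own host, here is a rough Python sketch that combines the two signals discussed in this thread: the 46-hour CPU ceiling Mod.Sense quoted (36 h runtime preference plus 10 h watchdog) and a check.txt whose SUCCESS count stays at 0. The function names, the `min_attempts` threshold, and the reading of LAST as "attempts" and SUCCESS as "accepted results" are my assumptions, not anything documented by the project; the line format is only what the `tail` output above shows.

```python
import re
from datetime import timedelta

# Ceiling quoted in the thread: 36 h maximum runtime preference + 10 h watchdog.
MAX_CPU_TIME = timedelta(hours=36) + timedelta(hours=10)

# Line shape taken from the tail output in this thread: "LAST <n> SUCCESS <m>".
# Treating LAST as attempts and SUCCESS as accepted results is an assumption.
CHECK_RE = re.compile(r"LAST\s+(\d+)\s+SUCCESS\s+(\d+)")


def parse_check(text):
    """Return (last, success) integer pairs found in check.txt-style text."""
    return [(int(a), int(b)) for a, b in CHECK_RE.findall(text)]


def looks_stuck(check_text, cpu_time, min_attempts=100):
    """Hypothetical heuristic: flag a task that is over the CPU ceiling, or
    that has logged at least min_attempts entries with zero successes."""
    pairs = parse_check(check_text)
    over_ceiling = cpu_time > MAX_CPU_TIME
    no_success = bool(pairs) and pairs[-1][0] >= min_attempts and pairs[-1][1] == 0
    return over_ceiling or no_success


# Sample lines copied from the two excerpts in this thread.
faulty = "LAST 2496 SUCCESS 0\nLAST 2497 SUCCESS 0\n"
healthy = "LAST 15 SUCCESS 14\nLAST 16 SUCCESS 15\n"

print(looks_stuck(faulty, timedelta(days=2, hours=16, minutes=41, seconds=35)))
print(looks_stuck(healthy, timedelta(hours=2)))
```

On the samples above, the first task is flagged on both counts and the second on neither. To use it on a real task, read the check.txt file from the slot directory into a string and pass in the CPU time shown in the task properties. This is only a way to decide which tasks deserve a closer look; the Admin's advice on whether to abort takes precedence.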
©2024 University of Washington
https://www.bakerlab.org