Very long running 12v1n_ task

Message boards : Number crunching : Very long running 12v1n_ task

awdorrin
Joined: 2 Apr 20
Posts: 4
Credit: 18,986,927
RAC: 0
Message 94882 - Posted: 19 Apr 2020, 14:48:49 UTC

Hello - I am new to R@H, only crunching for a few weeks now.
I have one task that I am not sure if I should abort or keep running, as it has been running for over 2 days.

Here is what the task properties showed me:

Application:                Rosetta 4.15
Name:                       12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_27
State:                      Running
Received:                   4/15/2020 4:06:19 AM
Report deadline:            4/18/2020 4:06:19 AM
Estimated computation size: 80,000 GFLOPs
CPU time:                   2d 07:26:47
CPU time since checkpoint:  2d 07:26:47
Elapsed time:               2d 06:10:15
Estimated time remaining:   00:09:46
Fraction done:              99.700%
Virtual memory size:        245.11 MB
Working set size:           26.15 MB
Progress rate:              1.800% per hour
Executable:                 rosetta_4.15_windows_x86_64.exe
ID: 94882 · Rating: 0
gFreezer

Joined: 4 Aug 17
Posts: 4
Credit: 570,003
RAC: 0
Message 94886 - Posted: 19 Apr 2020, 15:26:29 UTC

I have a task called "12v1n_al_12mer_design_00027_001895_0001_SAVE_ALL_OUT_913636_158" that is approaching 3 days of runtime. I saw some users reporting similar runtimes for WUs starting with "12v1n_al_12mer_design_". Are these the very long-running work units mentioned in the OP?

The problem is that this task has the same deadline as the tasks with a "normal" runtime. It is almost half a day late already. Is this a problem?
ID: 94886 · Rating: 0
gFreezer

Joined: 4 Aug 17
Posts: 4
Credit: 570,003
RAC: 0
Message 94890 - Posted: 19 Apr 2020, 15:39:12 UTC

I have the same problem with a very similarly named task:
Application                 Rosetta 4.15 
Name                        12v1n_al_12mer_design_00027_001895_0001_SAVE_ALL_OUT_913636_158
State                       Running
Received                    Thu 16 Apr 2020 07:04:50 AM CEST
Report deadline             Sun 19 Apr 2020 07:04:49 AM CEST
Estimated computation size  80,000 GFLOPs
CPU time                    2d 16:41:35
CPU time since checkpoint   2d 16:41:35
Elapsed time                2d 17:46:58
Estimated time remaining    00:10:10
Fraction done               99.743%
Virtual memory size         381.02 MB
Working set size            259.35 MB
Directory                   slots/10
Process ID                  17381
Progress rate               1.440% per hour
Executable                  rosetta_4.15_x86_64-pc-linux-gnu

I think it might be one of the huge work units mentioned in this thread, so maybe these tasks are expected to run much longer than others. I would hold off on aborting the task until the team clarifies.
ID: 94890 · Rating: 0
Holdolin

Joined: 19 Mar 20
Posts: 4
Credit: 2,431,917
RAC: 0
Message 94891 - Posted: 19 Apr 2020, 15:39:48 UTC - in response to Message 94882.  

I've had a couple of those. They finished OK, but I too raised an eyebrow when I checked in on the system and saw it had been crunching that WU for 2 days. Not even sure how I missed it, as I check my systems regularly.
ID: 94891 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94892 - Posted: 19 Apr 2020, 15:44:14 UTC

I just saw another similar report on the same type of WU.

The maximum CPU time a WU could be configured to run is the 36-hour maximum runtime preference plus the new 10-hour watchdog, 46 hours in total. That is less than the 2 days of CPU time you are showing, so I would have to take that as evidence that it is not running normally and should be aborted.
Rosetta Moderator: Mod.Sense
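
The arithmetic in the post above can be checked with a short script. This is only a sketch: the 36-hour and 10-hour figures are taken from the post, the CPU times are the ones reported earlier in this thread, and the helper names (`parse_cpu_time`, `exceeds_limit`) are made up for illustration and are not part of BOINC or Rosetta.

```python
from datetime import timedelta

# Figures quoted in the moderator's post (not read from any BOINC config).
MAX_RUNTIME_PREFERENCE = timedelta(hours=36)
WATCHDOG = timedelta(hours=10)
CPU_TIME_LIMIT = MAX_RUNTIME_PREFERENCE + WATCHDOG  # 46 hours


def parse_cpu_time(text):
    """Parse a duration shown as '2d 07:26:47' or '00:09:46'."""
    days = 0
    if "d" in text:
        day_part, text = text.split("d", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def exceeds_limit(cpu_time_text):
    """True if the reported CPU time is above the 46-hour ceiling."""
    return parse_cpu_time(cpu_time_text) > CPU_TIME_LIMIT


# CPU times as reported by the two posters in this thread.
reported = {
    "12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_27": "2d 07:26:47",
    "12v1n_al_12mer_design_00027_001895_0001_SAVE_ALL_OUT_913636_158": "2d 16:41:35",
}

for name, cpu_time in reported.items():
    over = parse_cpu_time(cpu_time) - CPU_TIME_LIMIT
    print(f"{name}: over limit = {exceeds_limit(cpu_time)}, excess = {over}")
```

Both reported tasks come out several hours above the 46-hour ceiling, which is consistent with the conclusion that they are not running normally.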
ID: 94892 · Rating: 0
awdorrin
Joined: 2 Apr 20
Posts: 4
Credit: 18,986,927
RAC: 0
Message 94894 - Posted: 19 Apr 2020, 15:47:12 UTC
Last modified: 19 Apr 2020, 15:48:20 UTC

Thanks for the feedback - I will leave them running for a while longer, to see what happens.
However, I did notice that the deadline for both of these tasks passed about a day ago (24.5 hrs ago and 32.5 hrs ago).
ID: 94894 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94896 - Posted: 19 Apr 2020, 15:50:00 UTC

@Holdolin, can you link to the WUs where you saw the extreme runtimes?
Rosetta Moderator: Mod.Sense
ID: 94896 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94897 - Posted: 19 Apr 2020, 15:54:29 UTC - in response to Message 94886.  
Last modified: 19 Apr 2020, 16:18:06 UTC

@gFreezer, and others, please link to completed WUs as they are reported back, so we can all see the result details.
Rosetta Moderator: Mod.Sense
ID: 94897 · Rating: 0
Profile Ace Casino

Joined: 16 Jul 07
Posts: 18
Credit: 13,972,287
RAC: 14,357
Message 94904 - Posted: 19 Apr 2020, 16:39:50 UTC

I've been deleting them.

If I see a work unit running for 15 hours or 1+ days, with little to no progress as I watch it, it gets deleted.
ID: 94904 · Rating: 0
Admin
Project administrator

Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 94907 - Posted: 19 Apr 2020, 17:00:33 UTC

Yes, I recommend aborting these tasks. More details here:

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12554&postid=94905#94905
ID: 94907 · Rating: 0
gFreezer

Joined: 4 Aug 17
Posts: 4
Credit: 570,003
RAC: 0
Message 94911 - Posted: 19 Apr 2020, 17:20:41 UTC - in response to Message 94897.  

@Mod.Sense It's WU 1036003152.

Looking at the ..._check.txt file in the slot directory, it really seems to be a faulty WU:
$ tail /var/lib/boinc/slots/10/12v1n_al_12mer_design_00027_001895_0001_check.txt
LAST    2488    SUCCESS 0
LAST    2489    SUCCESS 0
LAST    2490    SUCCESS 0
LAST    2491    SUCCESS 0
LAST    2492    SUCCESS 0
LAST    2493    SUCCESS 0
LAST    2494    SUCCESS 0
LAST    2495    SUCCESS 0
LAST    2496    SUCCESS 0
LAST    2497    SUCCESS 0

For comparison, here's the check.txt of another Rosetta 4.15 WU that's been running for 2 hours:
$ tail /var/lib/boinc/slots/0/12v1n_al_12mer_design_00008_002192_0001_check.txt
LAST    7       SUCCESS 6
LAST    8       SUCCESS 7
LAST    9       SUCCESS 8
LAST    10      SUCCESS 9
LAST    11      SUCCESS 10
LAST    12      SUCCESS 11
LAST    13      SUCCESS 12
LAST    14      SUCCESS 13
LAST    15      SUCCESS 14
LAST    16      SUCCESS 15

I'm going to abort it now; it has timed out and been passed on to another host anyway...
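
The comparison of the two check.txt files above could be automated roughly as follows. This is a sketch under assumptions: the thread does not document the file format, so reading the `LAST` column as an attempt counter and the `SUCCESS` column as a count of successful results is a guess based only on the two listings, and the function names are hypothetical.

```python
def parse_check_lines(lines):
    """Return (last, success) integer pairs from 'LAST n SUCCESS m' lines."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4 and fields[0] == "LAST" and fields[2] == "SUCCESS":
            pairs.append((int(fields[1]), int(fields[3])))
    return pairs


def looks_stalled(pairs):
    """Heuristic (an assumption, not a Rosetta rule): the LAST counter
    advances across the sample while SUCCESS stays at 0 on every line."""
    if len(pairs) < 2:
        return False
    last_advanced = pairs[-1][0] > pairs[0][0]
    no_success = all(success == 0 for _, success in pairs)
    return last_advanced and no_success


# Samples rebuilt from the two `tail` listings in the post above.
faulty = [f"LAST    {n}    SUCCESS 0" for n in range(2488, 2498)]
healthy = [f"LAST    {n}      SUCCESS {n - 1}" for n in range(7, 17)]

print("long-running WU stalled:", looks_stalled(parse_check_lines(faulty)))
print("2-hour WU stalled:", looks_stalled(parse_check_lines(healthy)))
```

To run it against a real slot, read the file with `open(path).read().splitlines()` and pass the last few lines to `parse_check_lines`. A task that has just started may also show zero successes, so the result is a hint to look closer rather than proof of a faulty WU.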
ID: 94911 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org