Message boards : Number crunching : WUs stuck at 99.50% and no progress
Author | Message |
---|---|
Chris Joined: 12 Apr 06 Posts: 6 Credit: 13,598,060 RAC: 0 |
I've got an old DL380 G7. A couple of days ago I installed Debian 10 on it and got BOINC running. It's been going pretty well, but today I noticed 2 tasks stuck with 600 s of work remaining, and they have been like that for several hours now. Can anyone give me a hint about what to do with them? I probably need to abort these, but I'd rather avoid this situation in the future. Task details: 1) ----------- name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310_0 WU name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310 project URL: https://boinc.bakerlab.org/rosetta/ received: Sat May 9 17:50:35 2020 report deadline: Tue May 12 17:50:35 2020 ready to report: no state: downloaded scheduler state: scheduled active_task_state: EXECUTING app version num: 420 resources: 1 CPU estimated CPU time remaining: 600.567115 CPU time at last checkpoint: 0.000000 current CPU time: 112797.000000 fraction done: 0.994708 swap size: 379 MB working set size: 305 MB 2) ----------- name: SR5AGU10_LVPlG_32_8261859_5mers_0001_0001_SAVE_ALL_OUT_927421_311_0 WU name: SR5AGU10_LVPlG_32_8261859_5mers_0001_0001_SAVE_ALL_OUT_927421_311 project URL: https://boinc.bakerlab.org/rosetta/ received: Sat May 9 17:50:35 2020 report deadline: Tue May 12 17:50:35 2020 ready to report: no state: downloaded scheduler state: scheduled active_task_state: EXECUTING app version num: 420 resources: 1 CPU estimated CPU time remaining: 600.760101 CPU time at last checkpoint: 0.000000 current CPU time: 112682.000000 fraction done: 0.994701 swap size: 378 MB working set size: 304 MB |
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,105,163 RAC: 5,753 |
I’d leave them running for a while. When you first start processing on a new machine or for a new project, it takes BOINC a while to settle in and work out how long WUs are likely to take. It could be a symptom of that. |
floyd Joined: 26 Jun 14 Posts: 23 Credit: 10,268,639 RAC: 0 |
31 hours of runtime and no checkpoint? Check whether those tasks cause any CPU load; I dare guess they haven't done any work at all and the 99.5% progress is just fake. You can turn LAIM (leave applications in memory) off, then suspend and resume the tasks to restart them from the beginning. You have enough time left, but check whether they work normally this time. Or abort them and leave them to somebody else. |
Chris Joined: 12 Apr 06 Posts: 6 Credit: 13,598,060 RAC: 0 |
"I’d leave them running for a while." The machine has been running for about 9 days now (uptime). Shouldn't that be enough for new tasks to settle? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The new watchdog will only kick in if the task has not ended ten hours past the runtime preference. What is your runtime preference set to? The default runtime is 8 hours. Ending them will lose all of the work they've done. Running them a second time will most likely bring you back to the same situation. Since the maximum runtime preference is 36 hours, it would be possible for them to be normal up through 46 hours of CPU time. If 31 hours is already more than 10 hours past your runtime preference, then the watchdog did not catch them for some reason, and I would abort them. Then provide links if you can, so we can see if the "wingman" does any better with them. Rosetta Moderator: Mod.Sense |
Chris Joined: 12 Apr 06 Posts: 6 Credit: 13,598,060 RAC: 0 |
"31 hours of runtime and no checkpoint? Check whether those tasks cause any CPU load; I dare guess they haven't done any work at all and the 99.5% progress is just fake. You can turn LAIM off, then suspend and resume the tasks to restart them from the beginning. You have enough time left, but check whether they work normally this time. Or abort them and leave them to somebody else." I also saw the checkpoint missing, but I simply don't know what that could mean. The first problematic task has PID 12258 and it is definitely doing something: boinc 12258 99.9 2.5 388488 311956 ? RNl May10 1981:22 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu This alternate view from boinctui shows a bit more info (I think). |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
I also have two of these "SR5AGU10" work units running long on one of my machines. WU1: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1056154721 WU2: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1056126717 The machine these are running on is set to the default 8-hour runtime. (Machine in question: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3752305) At present both of these WUs have been crunching for ~40 hours. One is stuck at 99.585%, the other at 99.578%. Edit: After about 15 minutes the percentage on both has increased minimally, by around 0.003, so they aren't dead in the water. |
floyd Joined: 26 Jun 14 Posts: 23 Credit: 10,268,639 RAC: 0 |
"I also saw the checkpoint missing, but I simply don't know what that could mean." I also don't know what that means in detail, but I would think that in all that time the task hasn't reached the first intermediate point where something is worth saving. "The first problematic task has PID 12258 and it is definitely doing something" Well, I'm surprised now. I've occasionally seen tasks with the clock ticking but nothing being done. Those continued fine after a restart. But I've never seen a task work that long without coming to a result. Someone with more detailed knowledge will have to tell us how we can know whether the task is actually making progress and will eventually come to an end. By the way, I also have three of those running. Only around two hours now and nothing suspicious, except that the displayed progress is quite high for that short time. Addendum: The first task finished after three hours. |
Energiequant Joined: 19 Sep 05 Posts: 1 Credit: 595,842 RAC: 0 |
I also got one of those, running under Windows 10: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1056151065 Application Rosetta 4.20 Name SR5AGU10_LVPlG_35_50416319_5mers_0001_0001_SAVE_ALL_OUT_927427_226 State Running Received 09/05/2020 23:34:21 Report deadline 12/05/2020 23:34:25 Estimated computation size 80,000 GFLOPs CPU time 1d 02:46:01 CPU time since checkpoint 1d 02:46:01 Elapsed time 1d 03:51:49 Estimated time remaining 00:10:24 Fraction done 99.381% Virtual memory size 248.93 MB Working set size 50.90 MB Directory slots/4 Process ID 1128 Progress rate 3.600% per hour Executable rosetta_4.20_windows_x86_64.exe So it has only seen one checkpoint, roughly one hour after it started. Process 1128 is still running (CPU is at 12.5%, so one complete hyperthread) but with very low memory consumption (20.4 MB) according to the task manager. I did not set "Target CPU run time", so I assume the watchdog should have aborted the WU 10 hours ago? Looks like a generic issue with that SR5AGU10 job? |
Chris Joined: 12 Apr 06 Posts: 6 Credit: 13,598,060 RAC: 0 |
After another (nearly) 24 hours the tasks are mostly unchanged, so I'm going to abort them. 1) ----------- name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310_0 WU name: SR5AGU10_LVPlG_44_42843164_5mers_0001_0001_SAVE_ALL_OUT_927439_310 project URL: https://boinc.bakerlab.org/rosetta/ received: Sat May 9 17:50:35 2020 report deadline: Tue May 12 17:50:35 2020 ready to report: no state: downloaded scheduler state: scheduled active_task_state: EXECUTING app version num: 420 resources: 1 CPU estimated CPU time remaining: 600.986054 CPU time at last checkpoint: 0.000000 current CPU time: 179824.500000 fraction done: 0.996672 swap size: 379 MB working set size: 305 MB 2) ----------- name: SR5AGU10_LVPlG_32_8261859_5mers_0001_0001_SAVE_ALL_OUT_927421_311_0 WU name: SR5AGU10_LVPlG_32_8261859_5mers_0001_0001_SAVE_ALL_OUT_927421_311 project URL: https://boinc.bakerlab.org/rosetta/ received: Sat May 9 17:50:35 2020 report deadline: Tue May 12 17:50:35 2020 ready to report: no state: downloaded scheduler state: scheduled active_task_state: EXECUTING app version num: 420 resources: 1 CPU estimated CPU time remaining: 600.848945 CPU time at last checkpoint: 0.000000 current CPU time: 179697.900000 fraction done: 0.996671 swap size: 378 MB working set size: 304 MB |
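
The numbers posted in this thread can be checked with a short script. The following is only a minimal sketch in Python, not Rosetta's or BOINC's actual logic. The function and constant names are hypothetical. The 10-hour grace period, the 8-hour default, and the 36-hour maximum are taken from the moderator's post, and the two samples are the "fraction done" and "current CPU time" values Chris reported for the first task. The time estimate assumes progress stays linear between the two samples, which may not hold if the displayed fraction slows down as it approaches 100%.

```python
# Illustrative only: the constants come from posts in this thread,
# the names are hypothetical and not part of BOINC or Rosetta.
WATCHDOG_GRACE_HOURS = 10        # "ten hours past the runtime preference"
DEFAULT_RUNTIME_PREF_HOURS = 8   # default runtime preference
MAX_RUNTIME_PREF_HOURS = 36      # maximum runtime preference


def watchdog_limit_hours(runtime_pref_hours=DEFAULT_RUNTIME_PREF_HOURS):
    """CPU hours after which the watchdog is expected to end a task."""
    return runtime_pref_hours + WATCHDOG_GRACE_HOURS


def is_overdue(cpu_seconds, runtime_pref_hours=DEFAULT_RUNTIME_PREF_HOURS):
    """True if the task has used more CPU time than the expected limit."""
    return cpu_seconds / 3600 > watchdog_limit_hours(runtime_pref_hours)


def linear_eta_hours(f0, t0, f1, t1):
    """CPU hours left, assuming progress stays linear between two samples.

    f0, f1 are "fraction done" values (0..1); t0, t1 are CPU seconds.
    Returns infinity if no progress was made between the samples.
    """
    rate = (f1 - f0) / (t1 - t0)  # fraction per CPU second
    if rate <= 0:
        return float("inf")
    return (1.0 - f1) / rate / 3600


if __name__ == "__main__":
    first = (0.994708, 112797.0)    # first report for task 1
    second = (0.996672, 179824.5)   # second report for task 1
    for fraction, cpu_seconds in (first, second):
        print(
            f"{cpu_seconds / 3600:.1f} h CPU at {fraction:.4%} done, "
            f"overdue with 8 h preference: {is_overdue(cpu_seconds)}, "
            f"overdue with 36 h preference: "
            f"{is_overdue(cpu_seconds, MAX_RUNTIME_PREF_HOURS)}"
        )
    eta = linear_eta_hours(*first, *second)
    print(f"linear estimate of CPU time left: {eta:.1f} h")
```

Under these assumptions, the first report (about 31.3 CPU hours) is past the 18-hour limit for the default preference but still inside the 46-hour limit for the maximum preference, while the second report (about 50 CPU hours) is past both. The linear estimate comes out at roughly 31.5 more CPU hours, which would not fit before the Tue May 12 report deadline.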
©2024 University of Washington
https://www.bakerlab.org