Message boards : Number crunching : Watchdog not working too well
Author | Message |
---|---|
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,078,372 RAC: 303 |
I have a task that is on a 2 hour run time target. The watchdog should have stopped it at 6 CPU hours. Currently it is at over 21 hours of CPU time:

Application: Rosetta 4.15
Name: 12v1n_al_12mer_design_00240_010210_0001_SAVE_ALL_OUT_914331_72
State: Running
Received: Wed 15 Apr 2020 12:05:07 PM EDT
Report deadline: Sat 18 Apr 2020 12:05:06 PM EDT
Estimated computation size: 80,000 GFLOPs
CPU time: 21:22:39
CPU time since checkpoint: 21:22:39
Elapsed time: 21:39:20
Estimated time remaining: 00:10:07
Fraction done: 99.226%
Virtual memory size: 382.00 MB
Working set size: 304.89 MB
Directory: slots/3
Process ID: 164535
Progress rate: 4.680% per hour
Executable: rosetta_4.15_x86_64-pc-linux-gnu

Also note that it has not checkpointed yet either. Looking at the files in the slots/3 directory does show some current activity (the current time at my location is 13:34 as I type this):

ls -lart | tail
-rw-r--r--. 1 boinc boinc 0 Apr 15 15:50 rosetta_tmp.txt
-rw-r--r--. 1 boinc boinc 0 Apr 15 15:50 minirosetta_database.zip.is_extracted
-rw-rw-r--. 1 charlie charlie 0 Apr 16 06:57 stderrgfx.txt
-rw-rw-r--. 1 charlie charlie 14 Apr 16 06:57 gfx_info
-rw-r--r--. 1 boinc boinc 6175 Apr 16 11:28 init_data.xml
drwxrwx--x. 3 boinc boinc 20480 Apr 16 11:28 .
-rw-r--r--. 1 boinc boinc 9529 Apr 16 13:30 12v1n_al_12mer_design_00240_010210_0001_check.txt
-rw-r--r--. 1 boinc boinc 3589 Apr 16 13:30 rng.state.gz
-rw-rw----. 1 boinc boinc 25001680 Apr 16 13:33 boinc_rosetta_3
-rw-r--r--. 1 boinc boinc 8192 Apr 16 13:33 boinc_mmap_file

A tail of the 12v1n_al_12mer_design_00240_010210_0001_check.txt file shows this:

tail 12v1n_al_12mer_design_00240_010210_0001_check.txt
LAST 497 SUCCESS 0
LAST 498 SUCCESS 0
LAST 499 SUCCESS 0
LAST 500 SUCCESS 0
LAST 501 SUCCESS 0
LAST 502 SUCCESS 0
LAST 503 SUCCESS 0
LAST 504 SUCCESS 0
LAST 505 SUCCESS 0
LAST 506 SUCCESS 0

Here's a link to the task: https://boinc.bakerlab.org/rosetta/result.php?resultid=1150908452

I'm going to let it run for a while just to see what happens.

-Charlie |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"I'm going to let it run for a while just to see what happens."

You are much more curious than I :) I would blast it. But either way, please post an update when it reports back. I am curious too. Perhaps your dog (your profile photo) can help teach the R@h watchdog.

Have you verified the venue of the host as compared to the runtime preference for that venue? Have you been modifying the runtime preferences recently?

Rosetta Moderator: Mod.Sense |
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,078,372 RAC: 303 |
"I'm going to let it run for a while just to see what happens."

I don't use a specific venue. A couple of weeks ago I raised the run time from 1 hour to 8 hours and ran that way for a while. I noticed my RAC started dropping, so several days ago I lowered it to 2 hours to see if by any chance it would make a difference (not that I expected it to). The task was received well after I did that and was preceded by a lot of tasks that ran successfully with the 2 hour CPU time. So I doubt that would have been the reason. Still, even with an 8 hour run time the watchdog would have aborted it after 12 hours.

Unfortunately, the dog in my profile is no longer with us. I'll have to get a picture of my new yellow lab.

I ran R@H for a long time but stopped all distributed computing several years ago. I retired a year ago, and with the recent pandemic I jumped back in to do my part. R@H was always one of my favorites. Right now it's all I'm doing across 3 systems/12 cores.

-Charlie |
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,078,372 RAC: 303 |
After over a day and a half of CPU time I've aborted the task.

-Charlie |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Thank you for posting. These "12v1n" tasks are now under discussion here. Rosetta Moderator: Mod.Sense |
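
For anyone who wants to check their own tasks the same way, the arithmetic used in this thread can be sketched in a few lines of Python. This is only an illustration, not Rosetta or BOINC code. It assumes the watchdog cutoff is the run time target plus a fixed 4 hours, which is inferred solely from the two figures quoted above (2 hours gives 6, 8 hours gives 12). The function names `parse_hms`, `watchdog_limit`, and `diagnose` are hypothetical, and the inputs are the "CPU time" and "CPU time since checkpoint" values copied by hand from the task properties.

```python
from datetime import timedelta

# Assumption: cutoff = run time target + 4 hours. This is inferred only from
# the two figures quoted in the thread (2 h -> 6 h, 8 h -> 12 h).
WATCHDOG_GRACE = timedelta(hours=4)


def parse_hms(text):
    """Parse an 'HH:MM:SS' duration string; the hours field may exceed 24."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def watchdog_limit(target_hours):
    """Return the CPU time at which the watchdog is expected to end the task."""
    return timedelta(hours=target_hours) + WATCHDOG_GRACE


def diagnose(cpu_time, cpu_since_checkpoint, target_hours):
    """Flag a task that has overrun the expected cutoff or never checkpointed."""
    cpu = parse_hms(cpu_time)
    since = parse_hms(cpu_since_checkpoint)
    limit = watchdog_limit(target_hours)
    return {
        "limit_hours": limit.total_seconds() / 3600,
        "overrun": cpu > limit,
        # If all CPU time has accrued since the last checkpoint,
        # no checkpoint has been written since the task started.
        "never_checkpointed": cpu > timedelta(0) and since == cpu,
    }


# Values taken from the task properties in the first post.
report = diagnose("21:22:39", "21:22:39", target_hours=2)
print(report)
```

With the values from the first post, both flags come back true: the task is well past the expected 6 hour cutoff, and all of its CPU time has accrued without a checkpoint.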
©2024 University of Washington
https://www.bakerlab.org