Message boards : Number crunching : Strange work unit.
Author | Message |
---|---|
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
I have a work unit on here, this one... https://boinc.bakerlab.org/rosetta/result.php?resultid=1059492200 ... which has behaved in an unusual fashion. I have the run time set to 12 hours, and understand the task runs several times in the time allowed, starting with a new random number each time. I saw the above unit had run for over two days and was only a little over 8% complete. I suspended it and then released it; it dropped back to the start, is running again now, and has passed the point where it stopped before. Indeed, it is showing 13.757% progress. I infer from that, the task is sensitive to certain random numbers, which is a little odd, indeed, worrying. I have Rosetta running as one of the projects on machines that I do not see every day. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
That one random start may expose a bug that is not seen by a million others is certainly a possibility. But when a task gets reset like you describe, it will actually restart processing on the same random start. So it sounds like a BOINC issue, one that I've not heard brought up for some time, where the BOINC Manager shows the task is running, and it is recording time, but the task does not actually get CPU time dispatched to it. You can look at the task manager of Windows or the properties of the task in the BOINC Manager to see if it is actually showing CPU time accumulating. If the task is not accumulating CPU, even when shown as active by the BOINC Manager... I don't recall the workaround. Was this one of the symptoms of using the BOINC setting to use less than 100% of CPU? If your settings use less than 100% of the CPU of the machine, another approach that doesn't seem to have the problems is to use less than 100% of the number of CPUs instead. In other words, rather than running at 90% CPU on an 8 core machine, set BOINC to use 87% of the CPUs that the machine has (i.e. 7 cores). Rosetta Moderator: Mod.Sense |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
I kept half an eye on it after it restarted, but it appeared to run completely normally, finished and uploaded. I agree that if it restarts with the same random number, then my theory above is incorrect. I am not sufficiently familiar with the code to comment further really. It ran to a normal completion. Something upset it: a cosmic ray, a neutrino interaction, could be anything I suppose. I have not changed anything here, so the project continues to run as it always has. Forget it. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
I've just had another of these, this one: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=959097968. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
And another. I presume that there is a safety kill mechanism which will abort a task if it exceeds some threshold time value. I ask because I have Rosetta in the portfolio of a couple of machines I do not see every day. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"I presume that there is a safety kill mechanism" Yes, we call it the watch dog. It ensures that tasks that run more than 4 hours longer than their preferred runtime are ended, and any completed models of that WU are returned. Rosetta Moderator: Mod.Sense |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
Good, that is what I expected. Thanks. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
It is a little unfortunate! I have a task running now, it is 98.965% complete, VERY slowly increasing 10 minutes odd to complete, but it has run now 16:40:50 so is at the point where it is more than 4 hours over my 12:00:00 run time. A few minutes ago, it showed .963% and after finishing this post, it says .968% so it IS doing something. <edit> Okay, I managed to up the run time to 14:00:00 before it got the chop so I hope it will get there. Shows 98.971% right now. Interestingly, the time remaining is not decreasing, it has been 00:10:27 since I started. </edit> <edit again> Yes! It suddenly jumped to 100% after 16:50:47. </edit> <edit again> The task is: rb_08_27_7614_7823_ab_t000_robetta_cstwt_5.0_FT_IGNORE_THE_REST_08_06_857976_594 Hope it is a good one. I'll leave the time at 14:00:00 in case there are others like this one. </edit> <edit AGAIN> 1093987561 985420727 3117659 18 Sep 2019, 5:10:20 UTC 19 Sep 2019, 6:59:05 UTC Completed and validated 44,963.77 43,178.52 569.88 Rosetta Mini v3.78 windows_x86_64; 1093972969 984034477 3117659 18 Sep 2019, 3:52:35 UTC 19 Sep 2019, 8:51:40 UTC Completed and validated 60,647.23 57,949.45 398.76 Rosetta v4.07 windows_x86_64; 1093968124 985403294 3117659 18 Sep 2019, 2:51:36 UTC 19 Sep 2019, 4:16:12 UTC Completed and validated 44,333.80 43,112.95 625.53 Rosetta Mini v3.78 windows_x86_64; 1093967112 985402459 3161065 18 Sep 2019, 2:36:42 UTC 19 Sep 2019, 3:42:42 UTC Completed and validated 43,181.07 43,133.70 512.24 Rosetta Mini v3.78 windows_intelx86; 1093967259 985402586 3161065 18 Sep 2019, 2:36:42 UTC 19 Sep 2019, 1:43:24 UTC Completed and validated 43,139.44 43,080.78 590.64 Rosetta Mini v3.78 windows_intelx86. Credit column is interesting. Mini looks to be maxy. </edit> Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
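
For anyone wanting to check the watch dog arithmetic from the posts above, here is a rough sketch in Python. It only encodes what Mod.Sense described (a task is ended once it runs more than 4 hours past its preferred runtime). The function names are made up for illustration, and whether the real watch dog measures CPU time or wall-clock time is not stated in the thread, so treat the comparison as an assumption rather than how Rosetta actually implements it.

```python
from datetime import timedelta

# Grace period taken from the moderator's description; assumed to be fixed.
WATCHDOG_GRACE = timedelta(hours=4)


def parse_hms(text):
    """Convert an 'HH:MM:SS' string, as shown in the BOINC Manager, to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def watchdog_limit(preferred_runtime):
    """Hypothetical helper: latest run time before the watch dog would step in."""
    return preferred_runtime + WATCHDOG_GRACE


def is_over_limit(elapsed, preferred_runtime):
    """True if the task has run longer than preferred runtime plus the grace period."""
    return elapsed > watchdog_limit(preferred_runtime)


elapsed = parse_hms("16:40:50")
# With the original 12:00:00 preference the limit is 16:00:00, so the task is over it.
print(is_over_limit(elapsed, parse_hms("12:00:00")))
# After raising the preference to 14:00:00 the limit becomes 18:00:00.
print(is_over_limit(elapsed, parse_hms("14:00:00")))
```

On those numbers, the task at 16:40:50 was already past the 16:00:00 limit for a 12 hour preference, and raising the preference to 14 hours moved the limit to 18:00:00, which is consistent with it being allowed to finish at 16:50:47.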
©2024 University of Washington
https://www.bakerlab.org