Strange work unit.

adrianxw
Joined: 18 Sep 05
Posts: 558
Credit: 5,694,702
RAC: 6,598
Message 90461 - Posted: 2 Mar 2019, 12:02:20 UTC

I have a work unit on here, this one...

http://boinc.bakerlab.org/rosetta/result.php?resultid=1059492200

... which has behaved in an unusual fashion. I have the run time set to 12 hours, and I understand that the task runs several times in the time allowed, starting with a new random number each time. I saw that the above unit had run for over two days and was only a little over 8% complete. I suspended it and then resumed it; it dropped back to the start, is running again now, and has passed the point where it stopped before. Indeed, it is showing 13.757% progress.

I infer from that that the task is sensitive to certain random numbers, which is a little odd, indeed worrying. I have Rosetta running as one of the projects on machines that I do not see every day.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3527
Credit: 0
RAC: 0
Message 90489 - Posted: 6 Mar 2019, 15:59:32 UTC

That one random start may expose a bug that is not seen by a million others is certainly a possibility. But when a task gets reset like you describe, it will actually restart processing on the same random start. So it sounds like a BOINC issue that I've not heard brought up for some time, where the BOINC Manager shows the task is running and is recording time, but the task does not actually get CPU time dispatched to it. You can look at the Windows Task Manager or the properties of the task in the BOINC Manager to see whether CPU time is actually accumulating.
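
As a rough illustration of that check, here is a minimal Python sketch. It assumes you note down the elapsed (wall-clock) time and the CPU time of the task by hand at two or more moments, for example from the task properties. The function names and the 5% threshold are hypothetical choices made for this example, not anything BOINC provides.

```python
def cpu_share(samples):
    """Fraction of wall-clock time spent on CPU between the first and last sample.

    samples: list of (elapsed_seconds, cpu_seconds) pairs read off by hand.
    """
    (wall0, cpu0), (wall1, cpu1) = samples[0], samples[-1]
    elapsed = wall1 - wall0
    if elapsed <= 0:
        raise ValueError("samples must span a positive wall-clock interval")
    return (cpu1 - cpu0) / elapsed


def looks_stalled(samples, min_share=0.05):
    """True if the task got less than min_share of a core over the interval.

    min_share is an arbitrary illustrative threshold.
    """
    return cpu_share(samples) < min_share


# Two readings taken ten minutes apart (made-up numbers).
healthy = [(0, 0), (600, 570)]    # CPU time keeps pace with elapsed time
stalled = [(0, 100), (600, 101)]  # elapsed time advances, CPU time barely moves

print(looks_stalled(healthy))
print(looks_stalled(stalled))
```

A task that shows as running but whose CPU time barely moves between readings matches the symptom described above.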

If the task is not accumulating CPU time, even when shown as active by the BOINC Manager... I don't recall the workaround. Was this one of the symptoms of using the BOINC setting to use less than 100% of CPU time? If your settings use less than 100% of the machine's CPU time, another approach that doesn't seem to have the problems is to use less than 100% of the number of CPUs instead. In other words, rather than running at 90% CPU, on an 8-core machine, set BOINC to use 87% of the CPUs that the machine has (i.e. 7 cores).
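
The arithmetic behind that suggestion can be sketched in a few lines of Python. The function names are hypothetical, and the two rounding modes are assumptions for illustration only: this thread does not say how the BOINC client turns a fractional core count into a whole number, so check the number of running tasks after changing the setting.

```python
import math


def percent_for_cores(target_cores, total_cores):
    """Exact percentage of processors that corresponds to target_cores."""
    if not 0 < target_cores <= total_cores:
        raise ValueError("target_cores must be between 1 and total_cores")
    return 100.0 * target_cores / total_cores


def cores_from_percent(percent, total_cores, rounding="truncate"):
    """Whole cores implied by a percentage under an ASSUMED rounding rule.

    rounding: "truncate" drops the fraction, "nearest" rounds to the closest
    integer. Which rule the client really applies is not stated in this thread.
    """
    raw = total_cores * percent / 100.0
    count = math.floor(raw) if rounding == "truncate" else round(raw)
    return max(1, count)


print(percent_for_cores(7, 8))                # exact figure for 7 of 8 cores
print(cores_from_percent(87, 8, "truncate"))  # if fractions are dropped
print(cores_from_percent(87, 8, "nearest"))   # if rounded to nearest
```

Seven of eight cores is exactly 87.5%, so entering a value slightly above the exact figure sidesteps the question of which way a fractional result is rounded.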
Rosetta Moderator: Mod.Sense
adrianxw
Message 90490 - Posted: 6 Mar 2019, 17:16:42 UTC
Last modified: 6 Mar 2019, 17:18:39 UTC

I kept half an eye on it after it restarted, but it appeared to run completely normally, finished and uploaded. I agree that if it restarts with the same random number, then my theory above is incorrect. I am not sufficiently familiar with the code to comment further, really. It ran to a normal completion. Something upset it: a cosmic ray, a neutrino interaction, could be anything I suppose. I have not changed anything here, so the project continues to run as it always has. Forget it.
adrianxw
Message 90594 - Posted: 30 Mar 2019, 12:39:59 UTC

I've just had another of these, this one:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=959097968

©2019 University of Washington
http://www.bakerlab.org