Message boards : Rosetta@home Science : Run time.
Author | Message |
---|---|
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
In your settings, you can change the run time for work units. I am making an assumption about this, and want to confirm the veracity of my assumption. If a job takes 20 minutes to run, and I have my task run time set to 1 hour, then my assumption is the work unit will run 3 jobs. An alternative possibility is that a work unit starts a single job and works on it for 1 hour then stops, reporting its state at the end of the hour. The returned state can then be issued in another work unit as its starting point. This would act in a similar manner to checkpoints within a job. If the alternative explanation is the case, setting a longer run time would allow the job to progress further within a single work unit, which is probably beneficial to the job overall as the generation, distribution, processing and collection of parts is minimised. Comments welcome. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
Your first assumption is the correct one. The program will complete as many decoys as it can in the time allotted. At the end of each decoy it will check if it has time to run another; if not, it will wrap up. This is when you may see the task completed in less time than you have chosen in your preferences. On the other hand, not all decoys take the same amount of time to run. Some will continue past your run time preference in order to complete the decoy. If it runs four hours over, the watchdog should cut in and end the task. Snags |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
Okay, that is more or less what I thought. I have my run time set for 12 hours and the work units typically run for circa 12 hours. The other idea was a "just in case" type scenario, in case they needed longer run times. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Just to clarify terminology, the 20 minute run you mention is what is referred to as a decoy or a model. It is a complete run of the algorithm used by the task. What you describe about partial completion and creating another work unit to continue the work is possible with BOINC, but not required by R@h. The algorithms run to completion on a time scale that can be completed by each machine. A longer runtime preference results in more completed decoys, rather than a more precise prediction. At the project level, more reported decoys yield a better prediction overall. If the runtime preference of the user is long enough that the average runtime of the first models predicts there is enough time to run another, then a new model with a unique starting point is begun. Note that the runtime preference and time calculations just described refer to actual CPU time, but the BOINC Manager shows both CPU time and run time (wall-clock time). Rosetta Moderator: Mod.Sense |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
Simply out of curiosity, does the work unit download the decoys as it needs them or does the initial send have sufficient information to run the algorithm many times? I envisage a situation where the model takes a set number of start parameters and these are included in the initial download, it runs the initial decoy, and then, if time allows, it, for example, increments parameter 6, and then runs it again. I was a professional software engineer for 40 years, so am likely to understand a technical reply. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, "the initial send have sufficient information to run the algorithm many times" says it well. The second decoy runs over the same protein sequence, with the same algorithm; it simply starts with a new random number, which is used to create a new hypothetical starting position of the protein. So there isn't anything more to download. The next starting point is generated for the subject protein using the random number. The algorithm is then run against this new starting conformation. Rosetta Moderator: Mod.Sense |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
Thanks, understand fully. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
Given what you have said here, if a unit has a "computation error" like this one... https://boinc.bakerlab.org/result.php?resultid=1046340233 ... it has run the job many times. I would expect a "computation error" in an early cycle to crash the work unit. Yet that one ran for the time limit I have set, using the same protein simply with a different random number start point. This implies that the job has run normally for many start points. I have noticed a number of errors in the last few days actually; the one I highlight is just the worst. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
That task says it had an error trying to create the results file. But the output shows it completed 21 structures. So it did complete those, and just had an error at the end trying to create and zip the results. A few possibilities I can think of would be Windows authority issues, anti-virus software, a full hard drive (or storage device that BOINC is using to run), and possibly somehow the task was configured incorrectly (on the R@h side) and it was trying to access the wrong area of storage to create the file. Rosetta Moderator: Mod.Sense |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
I hear what you say, but, having spent some time checking the other projects' totals, I have only one that has a single error, and that is a "cancelled by server", which is not an error, and I don't know why it was flagged as such. So Einstein, Milkyway, Seti, Yoyo, and Acoustics are not having any problems, just here, and just in the last week. Other projects have not sent work for a while but none I looked at had any errors. Nothing has changed with Windows or Avast (anti-virus), and there is over 20 Gig free on the SSD. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
Deleted. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I guess I am a bit confused. I do not see that the task you linked indicates it was cancelled by the server. If a batch of tasks is found to have a problem, now that the newer server code is in place, the Project Team can easily cancel them to spare others the same problems. When that happens, the "cancelled by server" sort of status is assigned. It helps make the best use of crunch time. If other projects have not had to cancel batches of tasks, the tasks from those projects will not see such a status. Rosetta Moderator: Mod.Sense |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
Err, I'm confused now. What I said is... >>> Given what you have said here, if a unit has a "computation error" like this one... (highlight added) https://boinc.bakerlab.org/result.php?resultid=1046340233 <<<< ... I quite understand why tasks can be cancelled by server, and quite agree with the function, however, I did not say that the task was cancelled by server. I did say I was searching my other active projects for "errors", but found none, only here. A cancelled by server work unit is this one... https://boinc.bakerlab.org/workunit.php?wuid=941131930. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
And now there is something seriously wrong going on here. Looking at my "errors" page, most of the older ones, which had values before, are now showing "Timed out - no response" - they did NOT show that before. For example: task 1045594884; work unit 942000780; computer 3117659; sent 6 Dec 2018, 12:54:10 UTC; deadline 14 Dec 2018, 12:54:10 UTC; status "Timed out - no response"; run time 0.00; CPU time 0.00; credit ---; application Rosetta Mini v3.78 windows_intelx86. Secure copies made. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,314 RAC: 9,365 |
"And now there is something seriously wrong going on here. Looking at my "errors" page, most of the older ones, that had values before are showing "Timed out - no response" - they did NOT show that before." I've had a lot of this too. I have a theory why it is... Recently there have been a lot of download errors on all my machines. A recent example of these errors is shown below: 14/12/2018 21:12:05 \| Rosetta@home \| Sending scheduler request: To report completed tasks. My theory is that the server thinks the tasks were received correctly when they weren't, so they're never in the list to process here and they only disappear once they pass the due date. When I check my offline tasks against the number showing in my online task list there is a very large discrepancy. Check yours - I suspect it will be the same. I meant to mention this some weeks ago but never got round to it - sorry. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
I raised the issue with the research team, I got this back... >>> Do you happen to know the name(s) of these jobs? There was a problematic batch that was sent out by a researcher in the lab that had '..' in the name which the BOINC client did not like. These jobs would fail and may be causing the odd behavior. These jobs also had very long names. <<< ... which seems to apply to your record. Looking at the names of my failures, the '..' appears in them. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,314 RAC: 9,365 |
"I raised the issue with the research team, I got this back..." Makes sense. I wonder why these tasks haven't been withdrawn via the server? |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 62 |
If jobs are out in the wild, they do seem to have ways of stopping them; I don't know why they did not do that. I still don't know why my job could not write its output, when all the others can. A couple of goofs in quick succession. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
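
The run-time behaviour Snags and Mod.Sense describe in the thread (finish the current decoy, start another only if the average so far suggests it will fit, and let the watchdog end a task that runs about four hours over) can be sketched as a small simulation. This is only an illustration of the described logic, not Rosetta's actual code: the names `run_task`, `TaskResult` and `WATCHDOG_GRACE_MINUTES` are made up for the example, and all times are CPU minutes.

```python
from dataclasses import dataclass
from typing import Iterable

# From the thread: the watchdog ends a task that runs about four hours over.
WATCHDOG_GRACE_MINUTES = 240.0


@dataclass
class TaskResult:
    decoys_completed: int
    cpu_minutes_used: float
    ended_by_watchdog: bool


def run_task(decoy_cpu_minutes: Iterable[float], preference_minutes: float) -> TaskResult:
    """Simulate one task; each item is the CPU time one decoy would need."""
    used = 0.0
    completed = 0
    watchdog_limit = preference_minutes + WATCHDOG_GRACE_MINUTES
    for cost in decoy_cpu_minutes:
        if completed > 0:
            # Hypothetical rule: start another decoy only if the average
            # of the decoys finished so far still fits in the preference.
            average = used / completed
            if used + average > preference_minutes:
                break
        if used + cost > watchdog_limit:
            # This decoy would still be running when the watchdog fires.
            return TaskResult(completed, watchdog_limit, True)
        used += cost  # a started decoy runs to completion, even past the preference
        completed += 1
    return TaskResult(completed, used, False)
```

With adrianxw's numbers, `run_task([20] * 10, 60)` completes three decoys in 60 minutes. With 25-minute decoys the same call stops after two decoys at 50 minutes, which is the "completed in less time than you have chosen" case, and `run_task([10, 10, 50], 60)` runs to 70 minutes because the last decoy is allowed to finish.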
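
Mod.Sense's point that nothing more needs to be downloaded, because each decoy differs only in the random number used to build its starting position, can be illustrated with a toy example. How Rosetta really turns a random number into a starting conformation is not described in the thread, so everything here is an assumption for illustration: the function names, the use of one angle per residue, and the use of consecutive seeds.

```python
import random
from typing import List


def starting_conformation(sequence: str, seed: int) -> List[float]:
    """Toy stand-in: one pseudo-random angle (degrees) per residue, fixed by the seed."""
    rng = random.Random(seed)
    return [rng.uniform(-180.0, 180.0) for _ in sequence]


def decoy_starts(sequence: str, first_seed: int, count: int) -> List[List[float]]:
    """Build `count` starting points from the same downloaded sequence.

    Consecutive seeds are an assumption; the thread only says each decoy
    starts with "a new random number".
    """
    return [starting_conformation(sequence, first_seed + i) for i in range(count)]
```

The same sequence and the same seed always reproduce the same start, while a different seed gives a different start, so the initial download plus a fresh random number is enough to begin each new decoy.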
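
Two of the causes Mod.Sense suggests for the failure to create the results file, a full drive and a permissions problem, can be checked locally with a few lines of Python. This is a generic sketch rather than anything BOINC or Rosetta provides: `check_output_dir` and the 100 MB default threshold are hypothetical, and the directory to test is whatever folder your BOINC client uses for its data.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def check_output_dir(directory: str, min_free_bytes: int = 100 * 1024 * 1024) -> List[str]:
    """Return a list of problems that would stop a results file being written."""
    path = Path(directory)
    if not path.is_dir():
        return [f"{directory} is not an existing directory"]

    problems: List[str] = []

    # Full-drive check: compare free space with an arbitrary threshold.
    free = shutil.disk_usage(str(path)).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free, wanted at least {min_free_bytes}")

    # Permission check: try to create and write a temporary file there.
    try:
        with tempfile.TemporaryFile(dir=str(path)) as handle:
            handle.write(b"test")
    except OSError as exc:
        problems.append(f"cannot create a file in {directory}: {exc}")

    return problems
```

An empty list means neither problem was detected at the moment of the check; it does not rule out anti-virus interference or a misconfigured task, the other possibilities raised in the thread.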
©2024 University of Washington
https://www.bakerlab.org