Message boards : Number crunching : Problems with Rosetta version 5.93
Author | Message |
---|---|
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
As a follow up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38. (i.e. CPU time / 24) Hi Mike, since you changed your runtime to 24hrs, that's how long the tasks will take, give or take a few minutes depending on how many models your computer can do. Pete. |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 7,494 |
As a follow up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38. (i.e. CPU time / 24) Mike - I think you misunderstand the run-time (or I misunderstand your post!). The run-time is not a time-out - it's the preferred run-time for each task. Each task consists of a number of decoys (models), and Rosetta will run as many as it can within the run-time you set. If you change this from 10hrs to 24hrs, then Rosetta will continue running models for 24hrs before calling the task complete and letting BOINC submit it. If the task has run for over 10hrs and you change the preference back to 10hrs now, Rosetta will finish the task once it finishes the next decoy. Users with slower computers will still fall within the run-time preference - they just fit fewer decoys into each task in that time. HTH Danny |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Mike, my apologies, I generally dig up a link to info. warning you that changing the runtime impacts all of your existing work, and that it is possible to end up scheduled to miss deadlines. I generally recommend changing runtime gradually over time, so BOINC can react to the change. The good news is that if you change the preference back down, the pending work gets adjusted down as well (but it may not reflect that on work that hasn't been started until BOINC completes a couple of tasks under the new preference). A PIII takes longer to complete a single model, but a 24hr preference is still just 24hrs. So, if a P4 takes 5 hours to complete one model of the recent long-running tasks, the PIII might take 10. A PIII would then complete a second model at around 20hrs, and then it would mark the task completed (because beginning a third model would run so far over the 24hr preference). So, it still only takes a day to do a 24hr work unit, but the PIII will only do (for example) 2 of the hard models, and a P4 might do 4 models of the same level of difficulty. Where a PIII really is hurting is when it is asked to do a 1-3hr runtime preference. It must do at least one model, and for tasks that take a PIII longer than the runtime preference, it just keeps chugging, showing the 10min. time to completion, which very gradually decreases over time. Rosetta Moderator: Mod.Sense |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
workunit 134230483 had several sin_cos_range errors. |
eric Send message Joined: 2 Jan 07 Posts: 23 Credit: 815,696 RAC: 0 |
Once again I am having major problems with a new version of Rosetta. On one of my XP boxes the computer is locking up. That computer only has 512 MB of RAM. On one of my Linux boxes I am getting a ton of compute errors. https://boinc.bakerlab.org/rosetta/results.php?hostid=702448 I am stopping Rosetta on that box and if this keeps up I am going to have to move my resources to different projects. That is a shame because I really feel that Rosetta is a great project to support. But on the other hand I can't keep wasting all this electricity on failed work units. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Validate Error yet again Task ID 133449376 Name 1g2z__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1g2z_-crystal_foldanddock__2599_17309_0 Just wasted another 4 hours of CPU time Validate error The task was reported but could not be validated, typically because the output files were lost on the server. <-- lost on the server? oh give me a break |
Dave Mickey Send message Joined: 29 Dec 07 Posts: 33 Credit: 4,136,957 RAC: 0 |
I too have bumped into the "10 minutes to go" thing, and not understood, for a couple of reasons. First time, I shut down BOINC and restarted it, and eventually, that unit started again, and went to 10 minutes for a really long time again. I say it eventually restarted, because in the episode where it went to 10 minutes, it somehow monopolized the CPU, and rang up huge STD and LTD, by staying on Rosetta exclusively. Thus when BOINC restarted, it went to s@h for many hours due to debt. This machine is set to switch every 60 minutes, but something in this scenario managed to override that and give Rosetta something like 12 or 15 hours of uninterrupted CPU (should be 50/50). No hints in the BOINC console output log, and BV has not (that I've seen) reported that any deadline problem is the culprit. What is it about this 10 minute to go anomaly that convinces BOINC that Rosetta deserves large chunks of cpu time? (altho, the big debt accumulation started well before it got to the 10 minute thing....) (just trying to understand) Dave |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Dave, I am not certain of the current state of affairs with BOINC. I know at one time they were talking about adding a function to try to make task switches just after checkpoints, to preserve more work for all projects. And it would make sense as well to try to let a task run another 10min to complete, even if it does not checkpoint, so perhaps BOINC allowed it to run, assuming its estimated time was correct and that it would soon finish. As you say, debt balanced everything out in the end. If anyone knows for certain whether the short estimated time to completion is disturbing the BOINC Manager's decision, please let me know, or post a link. Rosetta Moderator: Mod.Sense |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
ANOTHER validate error, the second in 24 hours - 18 hrs to be precise between errors. Task ID 133556076 Name s099_1_homologymodel_strictosidine_synthase_2472_63483_0 You're killing my average with these errors, and I am not sure if the results are making it into your system with this. Why do your servers keep losing files? Refer to the explanation quoted from the website in my previous post. Does someone want to answer this? Seems like it's time for a bit of system maintenance before yet another crash happens. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Greg, I am not in a position to know for certain, but I suspect that the DNS attack on the servers may have resulted in some odd things occurring. Rosetta Moderator: Mod.Sense |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
That's possible. Everything is OK now: 24hrs with no problems reporting or validating. Greg, I am not in a position to know for certain, but I suspect that the DNS attack on the servers may have resulted in some odd things occurring. |
csbyseti Send message Joined: 24 Dec 05 Posts: 11 Credit: 5,202,425 RAC: 5,894 |
2h4o......... seems to have a problem. Got 3 of them with the same problem. https://boinc.bakerlab.org/rosetta/result.php?resultid=135428621 '<core_client_version>5.3.12.tx36</core_client_version> <stderr_txt> # cpu_run_time_pref: 10800 # random seed: 1755374 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 46787.2 seconds. Greater than 4X preferred time: 10800 seconds ********************************************************************** GZIP SILENT FILE: .xx2h4o.out </stderr_txt>' Shut down by the watchdog because of long run time. Should all of the 2h4o WUs be deleted? |
Dr Who Fan Send message Joined: 28 May 06 Posts: 79 Credit: 273,880 RAC: 243 |
This Task ID 135491299 failed validation. Name 2tif__LOGREG_ABRELAX_PILOT2_FRAG_CORRECTION_SAVE_ALL_OUT-2tif_-_BARCODE__2670_6464_0 Workunit 123308703 Created 23 Jan 2008 14:44:05 UTC Sent 23 Jan 2008 14:45:03 UTC Received 24 Jan 2008 5:57:28 UTC Server state Over Outcome Validate error Client state Done Exit status 0 (0x0) Computer ID 230539 Report deadline 2 Feb 2008 14:45:03 UTC CPU time 4275.497864 stderr out <core_client_version>5.10.30</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 7200 # random seed: 1671937 == </stderr_txt> ]]> Validate state Invalid Claimed credit 5.53358788153339 Granted credit 0 application version 5.93 |
FalconFly Send message Joined: 11 Jan 08 Posts: 23 Credit: 2,163,056 RAC: 0 |
Noted a couple of 2H4O_BOINC_TWIST_RINGS WorkUnits stuck at ~10min remaining as well, all well beyond their target runtime. CPU time counts upwards but no progress is made. Oddball: restarting BOINC on a system beyond runtime causes CPU time to drop from beyond target runtime to some point inside target runtime (e.g. 6h16m to 2h16m with a 6h preference set), and the progress bar moves back accordingly from 99%. The same happens on a couple of systems tested (CPU time dropped from 23h back to a seemingly random point within target runtime). Based on granted credits and decoys tested, the affected 2H4O_BOINC_TWIST_RINGS will stall at some point, but still cause full CPU utilization. The WorkUnit will be ended by the watchdog after hitting 4x expected runtime. ------ All occurred with BOINC V5.10.28 and various Linux systems. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Falcon, what is your Rosetta Preference for target runtime? Please see related info. in this thread. Rosetta Moderator: Mod.Sense |
csbyseti Send message Joined: 24 Dec 05 Posts: 11 Credit: 5,202,425 RAC: 5,894 |
See my post above. It's not a problem with the target runtime; I've got 3 cut off by the watchdog and the fourth is currently running (only a picture in the native window, nothing else). |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Being ended by the watchdog and running beyond the runtime target are two rather different things. Rosetta Moderator: Mod.Sense |
FalconFly Send message Joined: 11 Jan 08 Posts: 23 Credit: 2,163,056 RAC: 0 |
Falcon, what is your Rosetta Preference for target runtime? It was set at 6 hours until this evening, when I reduced it to 4 (4x4h of no progress is at least better than 4x6h of no progress). Typical WorkUnits that finished already: Watchdog Terminated; Watchdog Terminated + Segmentation Violation (still valid though); Watchdog Terminated; Watchdog Terminated ---------- If the WorkUnit just takes that long (and can't finish within 4 or 6 hours on a modern Athlon64 X2), I don't mind the increased runtime. I don't expect that to take 24 hours though (unless the models are really much more complex than expected, which could be the case in theory, for all I know). Looking at claimed vs. granted credit, however, it seems that approx. 50-70% of the runtime is simply lost due to the watchdog not cutting in until 4x the set runtime (not sure what the client actually does in that time). |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I think there is something seriously wrong with the 2h4o_ WUs. They just seem to sit there using CPU, but not writing anything to the output files. They never end until the watchdog says they've used 4x the CPU time preference. |
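
The runtime behaviour discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rosetta's or BOINC's actual code. The function names are hypothetical. The stopping rule (start another model only if it is expected to fit within the preference, but always run at least one) is inferred from Mod.Sense's PIII/P4 example. The progress formula comes from the "CPU time / 24" observation, with an assumed cap at 100%. The 4x watchdog factor is taken from the stderr output quoted by csbyseti.

```python
WATCHDOG_FACTOR = 4  # from the quoted stderr: "Greater than 4X preferred time"


def models_completed(pref_hours, hours_per_model):
    """Estimate how many models fit into the runtime preference.

    Assumption: another model is started only if it is expected to
    finish within the preference, and at least one model always runs.
    """
    return max(1, int(pref_hours // hours_per_model))


def progress_percent(cpu_hours, pref_hours):
    """Progress as CPU time over the runtime preference (cap at 100 is assumed)."""
    return min(100.0, 100.0 * cpu_hours / pref_hours)


def watchdog_should_end(cpu_seconds, pref_seconds):
    """True once CPU time exceeds 4x the runtime preference."""
    return cpu_seconds > WATCHDOG_FACTOR * pref_seconds


if __name__ == "__main__":
    print(models_completed(24, 10))  # PIII example: 10 h per model
    print(models_completed(24, 5))   # P4 example: 5 h per model
    print(models_completed(3, 10))   # short preference on a slow machine
    print(watchdog_should_end(46787.2, 10800))  # values from the quoted stderr
```

Under these assumptions, a 24-hour preference gives 2 models at 10 hours each and 4 models at 5 hours each, matching the example in the thread. The 2h4o task at 46787.2 s of CPU time against a 10800 s preference is past the 43200 s cutoff, which is consistent with the watchdog ending it.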
©2024 University of Washington
https://www.bakerlab.org