Problems with Rosetta version 5.93

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 50813 - Posted: 19 Jan 2008, 0:59:45 UTC - in response to Message 50812.  

As a follow up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38. (i.e. CPU time /24)

Another 7 hours have gone by and the progress % is still based on CPU time/24.

Another consequence of increasing the runtime was that BOINC Manager woke up to the fact that I had 6 Rosetta units that were liable to miss their deadline and consequently commandeered both cores of my 3800+ dual-core machine for Rosetta at the expense of everything else. This brought a second Rosetta into play, an s099 unit, which now seems to be going along the same lines with 7 hours CPU time and 29% progress.

Heaven help anyone with a PIII machine! They will never finish. Even I am wondering how many, if any, of my units will finish before the deadline of 23/1/08. I am not expecting them to finish within the 24 hours.

Does anyone know how long these will take, please?

Regards

Mike


Hi Mike, since you changed your runtime to 24hrs, that's how long the tasks will take, give or take a few minutes depending on how many models your computer can do.

Pete.
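
The percentages in the quoted post can be checked with a short sketch. It assumes, as Mike's observation suggests, that the displayed progress is simply CPU time divided by the target runtime. The function name `progress_percent` is hypothetical, and nothing here is taken from the Rosetta or BOINC source.

```python
def progress_percent(cpu_hours: float, target_runtime_hours: float) -> float:
    """Assumed model: progress shown = CPU time / target runtime, capped at 100%."""
    if target_runtime_hours <= 0:
        raise ValueError("target runtime must be positive")
    return min(100.0, 100.0 * cpu_hours / target_runtime_hours)

# s099 unit from the post: 7 hours of CPU time against a 24-hour target
print(round(progress_percent(7, 24)))  # 29
# 38% of a 24-hour target corresponds to roughly 9.1 hours of CPU time
print(round(0.38 * 24, 1))             # 9.1
```

Under that assumption, the drop to 38% is what a unit with a little over 9 hours of CPU time would show once the target changes from 10 to 24 hours.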




ID: 50813 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 7,494
Message 50815 - Posted: 19 Jan 2008, 1:48:04 UTC - in response to Message 50812.  

As a follow up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38. (i.e. CPU time /24)

Another 7 hours have gone by and the progress % is still based on CPU time/24.

Another consequence of increasing the runtime was that BOINC Manager woke up to the fact that I had 6 Rosetta units that were liable to miss their deadline and consequently commandeered both cores of my 3800+ dual-core machine for Rosetta at the expense of everything else. This brought a second Rosetta into play, an s099 unit, which now seems to be going along the same lines with 7 hours CPU time and 29% progress.

Heaven help anyone with a PIII machine! They will never finish. Even I am wondering how many, if any, of my units will finish before the deadline of 23/1/08. I am not expecting them to finish within the 24 hours.

Does anyone know how long these will take, please?

Regards

Mike

Mike - I think you misunderstand the run-time (or I misunderstand your post!). The runtime is not a time-out - it's the preferred run-time for each task. Each task consists of a number of decoys (models) and Rosetta will run as many as it can within the run-time you set. If you change this from 10hrs to 24 hrs then Rosetta will continue running models for 24hrs before calling the task complete and letting BOINC submit it.

If the task has run for over 10hrs and you change the preference back to 10hrs now Rosetta will finish the task once it finishes the next decoy. Users with slower computers will still fall within the run-time preference - they just fit fewer decoys into each task in that time.

HTH
Danny
ID: 50815 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50817 - Posted: 19 Jan 2008, 3:36:56 UTC
Last modified: 19 Jan 2008, 3:42:21 UTC

Here's a "scoreboard" update. It shows all the errors for all my systems as they pertain to 5.93, and their percentages. Any error is annoying, but from my perspective, there's not a large percentage of them.

ID: 50817 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50818 - Posted: 19 Jan 2008, 5:14:56 UTC

Mike, my apologies, I generally dig up a link to info. warning you that changing the runtime impacts all of your existing work, and that it is possible to end up scheduled to miss deadlines. I generally recommend changing runtime gradually over time, so BOINC can react to the change. The good news is that if you change the preference back down, the pending work gets adjusted down as well (but it may not reflect that on work that hasn't been started until BOINC completes a couple of tasks under the new preference).

A PIII takes longer to complete a single model, but a 24hr preference is still just 24hrs. So, if a P4 takes 5 hours to complete the recent long running tasks, the PIII might take 10. A PIII would then complete a second model at around 20hrs, and then it would mark it completed (because to begin a third model would be so far over the 24hr preference). So, it still only takes a day to do a 24hr work unit, but the PIII will only do (for example) 2 of the hard models, and a P4 might do 4 of the models of the same level of difficulty.

Where a PIII really hurts is when it is asked to do a 1-3hr runtime preference. It must do at least one model, and for tasks where that takes a PIII longer than the runtime preference, it just keeps chugging, showing the 10min. time to completion, which very gradually decreases over time.
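
The P4 versus PIII example above can be sketched as a small calculation. The stopping rule here is an assumption inferred from this post, not the actual Rosetta logic: at least one model always runs, and another model is started only if it is expected to finish within the runtime preference. The function name `models_completed` is hypothetical.

```python
def models_completed(hours_per_model: float, runtime_pref_hours: float) -> int:
    """Count models run under an assumed stopping rule.

    Assumption (not from the Rosetta source): at least one model always runs,
    and another model is started only if it is expected to finish within
    the runtime preference.
    """
    if hours_per_model <= 0 or runtime_pref_hours <= 0:
        raise ValueError("times must be positive")
    completed = 1  # a task must do at least one model
    elapsed = hours_per_model
    while elapsed + hours_per_model <= runtime_pref_hours:
        completed += 1
        elapsed += hours_per_model
    return completed

print(models_completed(5, 24))   # P4 example: 4 models, done at about 20 hours
print(models_completed(10, 24))  # PIII example: 2 models, done at about 20 hours
print(models_completed(10, 3))   # slow host, short preference: still 1 model
```

The last call illustrates the 1-3hr case: the single mandatory model alone already exceeds the preference, so the task runs well past it.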
Rosetta Moderator: Mod.Sense
ID: 50818 · Rating: 0
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 50838 - Posted: 20 Jan 2008, 17:21:03 UTC

workunit 134230483 had several sin_cos_range errors.
ID: 50838 · Rating: 0
eric
Joined: 2 Jan 07
Posts: 23
Credit: 815,696
RAC: 0
Message 50853 - Posted: 21 Jan 2008, 0:14:00 UTC

Once again I am having major problems with a new version of Rosetta. On one of my XP boxes the computer is locking up. That computer only has 512 MB of RAM. On one of my Linux boxes I am getting a ton of compute errors.

https://boinc.bakerlab.org/rosetta/results.php?hostid=702448

I am stopping Rosetta on that box and if this keeps up I am going to have to move my resources to different projects. That is a shame because I really feel that Rosetta is a great project to support. But on the other hand I can't keep wasting all this electricity on failed work units.
ID: 50853 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 50870 - Posted: 21 Jan 2008, 18:50:43 UTC

Validate Error yet again
Task ID 133449376
Name 1g2z__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1g2z_-crystal_foldanddock__2599_17309_0

Just wasted another 4 hours of CPU time

Validate error The task was reported but could not be validated, typically because the output files were lost on the server. <-- lost on the server? oh give me a break
ID: 50870 · Rating: 0
Dave Mickey
Joined: 29 Dec 07
Posts: 33
Credit: 4,136,957
RAC: 0
Message 50882 - Posted: 22 Jan 2008, 1:47:57 UTC

I too have bumped into the "10 minutes to go" thing,
and not understood, for a couple of reasons. First time,
I shut down BOINC and restarted it, and eventually, that
unit started again, and went to 10 minutes for a really long
time again.

I say it eventually restarted, because in the episode where it
went to 10 minutes, it somehow monopolized the CPU, and rang
up huge STD and LTD, by staying on Rosetta exclusively.
Thus when BOINC restarted, it went to s@h for many hours due to debt.
This machine is set to switch every 60 minutes, but something
in this scenario managed to override that and give Rosetta
something like 12 or 15 hours of uninterrupted CPU (should be 50/50).
No hints in the BOINC console output log, and BV has not (that I've
seen) reported that any deadline problem is the culprit.

What is it about this 10 minute to go anomaly that convinces BOINC
that Rosetta deserves large chunks of CPU time? (Although the big debt
accumulation started well before it got to the 10 minute thing....)

(just trying to understand)

Dave
ID: 50882 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50886 - Posted: 22 Jan 2008, 6:25:19 UTC
Last modified: 22 Jan 2008, 6:26:01 UTC

Dave, I am not certain of the current state of affairs with BOINC. I know at one time they were talking about adding a function to try to make task switches just after checkpoints, to preserve more work for all projects. It would also make sense to let a task run another 10min to complete, even if it does not checkpoint, so perhaps BOINC allowed it to run, assuming its estimated time was correct and that it would soon finish.

As you say, debt balanced everything out in the end.

If anyone knows for certain if the short estimated time to completion is disturbing the BOINC Manager's decision, please let me know, or post a link.
Rosetta Moderator: Mod.Sense
ID: 50886 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 50898 - Posted: 22 Jan 2008, 20:53:56 UTC
Last modified: 22 Jan 2008, 20:55:16 UTC

ANOTHER validate error, the second in 24 hours - 18 hrs to be precise between errors.
Task ID 133556076
Name s099_1_homologymodel_strictosidine_synthase_2472_63483_0

You're killing my average with these errors, and I am not sure if the results are making it into your system with this.

why do your servers keep losing files? refer to the explanation quoted from the website in my previous post.

someone want to answer this?

Seems like it's time for a bit of system maintenance before yet another crash happens.
ID: 50898 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50915 - Posted: 23 Jan 2008, 18:40:54 UTC

Greg, I am not in a position to know for certain, but I suspect that the DNS attack on the servers may have resulted in some odd things occurring.
Rosetta Moderator: Mod.Sense
ID: 50915 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 50923 - Posted: 23 Jan 2008, 19:26:09 UTC - in response to Message 50915.  

That's possible. Everything is OK now: 24hrs with no problems reporting or validating.

Greg, I am not in a position to know for certain, but I suspect that the DNS attack on the servers may have resulted in some odd things occurring.


ID: 50923 · Rating: 0
csbyseti
Joined: 24 Dec 05
Posts: 11
Credit: 5,202,425
RAC: 5,894
Message 50936 - Posted: 24 Jan 2008, 8:28:47 UTC

2h4o.........
seems to have a problem. Got 3 of them with the same problem.

https://boinc.bakerlab.org/rosetta/result.php?resultid=135428621

'<core_client_version>5.3.12.tx36</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 1755374
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 46787.2 seconds. Greater than 4X preferred time: 10800 seconds
**********************************************************************
GZIP SILENT FILE: .xx2h4o.out

</stderr_txt>'

Shutdown by watchdog because of long run time.
Should all of the 2h4o WUs be deleted?
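
The cutoff reported in the log above can be checked with a minimal sketch. It only models the "4X preferred time" condition that the log message states; the real watchdog may test other conditions as well (for example, a stuck score), and the names `WATCHDOG_MULTIPLIER` and `watchdog_should_end` are hypothetical.

```python
WATCHDOG_MULTIPLIER = 4  # from the log line: "Greater than 4X preferred time"

def watchdog_should_end(cpu_seconds: float, cpu_run_time_pref: float) -> bool:
    """Return True once CPU time exceeds the assumed 4x runtime-preference limit."""
    return cpu_seconds > WATCHDOG_MULTIPLIER * cpu_run_time_pref

# Values from the stderr output: 46787.2 s of CPU time, preference 10800 s
print(watchdog_should_end(46787.2, 10800))  # True: the limit is 43200 seconds
print(watchdog_should_end(40000.0, 10800))  # False: still under the limit
```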

ID: 50936 · Rating: 0
Dr Who Fan
Joined: 28 May 06
Posts: 79
Credit: 273,880
RAC: 243
Message 50937 - Posted: 24 Jan 2008, 8:44:21 UTC

This Task ID 135491299 failed validation.

Name 2tif__LOGREG_ABRELAX_PILOT2_FRAG_CORRECTION_SAVE_ALL_OUT-2tif_-_BARCODE__2670_6464_0
Workunit 123308703
Created 23 Jan 2008 14:44:05 UTC
Sent 23 Jan 2008 14:45:03 UTC
Received 24 Jan 2008 5:57:28 UTC
Server state Over
Outcome Validate error
Client state Done
Exit status 0 (0x0)
Computer ID 230539
Report deadline 2 Feb 2008 14:45:03 UTC
CPU time 4275.497864
stderr out

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 1671937
==
</stderr_txt>
]]>

Validate state Invalid
Claimed credit 5.53358788153339
Granted credit 0
application version 5.93

ID: 50937 · Rating: 0
FalconFly
Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 50945 - Posted: 24 Jan 2008, 19:15:47 UTC - in response to Message 50937.  
Last modified: 24 Jan 2008, 19:21:14 UTC

Noted a couple of 2H4O_BOINC_TWIST_RINGS WorkUnits stuck at ~10min remaining as well, all well beyond their target runtime. CPU time counts upwards but no progress is made.

Oddball :
Restarting BOINC on a System beyond runtime causes CPU time to drop from beyond target runtime to some point inside target runtime (e.g. 6h16m to 2h16m with a 6h preference set), and the progress bar moves back accordingly from 99%.

The same happens on a couple of Systems tested (CPU time dropped from 23h back to a seemingly random point within target runtime)

Based on granted Credits and Decoys tested, the affected 2H4O_BOINC_TWIST_RINGS will stall at some point, but still cause full CPU utilization. WorkUnit will be ended by Watchdog after hitting 4x expected runtime.

------
All occurred with BOINC V5.10.28 and various Linux Systems.
ID: 50945 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50946 - Posted: 24 Jan 2008, 19:38:03 UTC

Falcon, what is your Rosetta Preference for target runtime?
Please see related info. in this thread.
Rosetta Moderator: Mod.Sense
ID: 50946 · Rating: 0
csbyseti
Joined: 24 Dec 05
Posts: 11
Credit: 5,202,425
RAC: 5,894
Message 50948 - Posted: 24 Jan 2008, 19:42:11 UTC

See my post above. It's not a problem with the target runtime; I've got 3 cut off by the watchdog and the fourth is currently running (only a picture in the native window, nothing else).
ID: 50948 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50954 - Posted: 24 Jan 2008, 20:53:48 UTC

Being ended by the watchdog and running beyond the runtime target are two rather different things.
Rosetta Moderator: Mod.Sense
ID: 50954 · Rating: 0
FalconFly
Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 50955 - Posted: 24 Jan 2008, 21:32:45 UTC - in response to Message 50946.  
Last modified: 24 Jan 2008, 21:37:54 UTC

Falcon, what is your Rosetta Preference for target runtime?
Please see related info. in this thread.


Was set at 6 hours until this evening, when I reduced it to 4 (4x4h no progress is at least better than 4x6h no progress)

Typical WorkUnits that finished already :
Watchdog Terminated
Watchdog Terminated + Segmentation Violation (still valid though)
Watchdog Terminated
Watchdog Terminated

----------
If the WorkUnit just takes that long (and can't finish within 4 or 6 hours on a modern Athlon64 X2), I don't mind the increased runtime. I don't expect that to take 24 hours though (unless the Models are really much more complex than expected, which could be in theory for all I know)

Looking at Claimed vs. Granted Credit however, it seems that approx. 50-70% of the runtime is simply lost due to Watchdog not cutting in until 4x the set runtime (not sure what the Client actually does in that time).
ID: 50955 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 50957 - Posted: 25 Jan 2008, 3:17:55 UTC

I think there is something seriously wrong with the 2h4o_ WUs. They just seem to sit there using CPU, but not writing anything to the output files. They never end until the watchdog says they've used 4x the CPU time preference.
ID: 50957 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org