Report long-running models here

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57549 - Posted: 3 Dec 2008, 15:46:44 UTC

Greg, if Dennis is seeing long-running models, they tend to not take checkpoints. So, any time BOINC is ended and restarted, the task might lose lots of work. And so the next time it starts, the cycle repeats. This is part of why the team is working to eliminate the long-running models.

Anyway, I just wanted to point out that part of why he's getting those symptoms is because of the long-running models, not the other way around.

Dennis, the runtime preference is not the time per model. The per-model times should be as described at the beginning of this thread. More models are then done if the runtime preference allows time for them. So, your approach of aborting anything racking up more than 30 hours is good. In fact, if you see one with only 6 hours, but still on model one, I would abort that too. Or one with two models that's up to 8 or 9 hours, etc.
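As a rough sketch of that rule of thumb in Python: the function name, its parameters, and the 4-hours-per-model cutoff are all assumptions chosen only so that the examples above come out as "abort". None of them are project settings.

```python
def should_abort(cpu_hours, current_model,
                 hard_limit_hours=30.0, per_model_limit_hours=4.0):
    """Return True if a task looks like a long-running model.

    cpu_hours     -- CPU time the task has used so far
    current_model -- model number the task is on (1 = still on its first model)

    Both limits are illustrative assumptions, not values used by the project.
    """
    if current_model < 1:
        raise ValueError("current_model must be 1 or greater")
    # Anything past the hard limit is aborted regardless of progress.
    if cpu_hours > hard_limit_hours:
        return True
    # Otherwise compare the average time per model against the cutoff.
    return cpu_hours / current_model >= per_model_limit_hours


print(should_abort(6, 1))   # 6 hours, still on model one
print(should_abort(8, 2))   # two models, 8 hours
print(should_abort(3, 2))   # two models, 3 hours
```

On a slower machine the per-model cutoff would need to be raised accordingly.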
Rosetta Moderator: Mod.Sense

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57552 - Posted: 3 Dec 2008, 16:11:37 UTC - in response to Message 57549.  

Greg, if Dennis is seeing long-running models, they tend to not take checkpoints. So, any time BOINC is ended and restarted, the task might lose lots of work. And so the next time it starts, the cycle repeats. This is part of why the team is working to eliminate the long-running models.

Anyway, I just wanted to point out that part of why he's getting those symptoms is because of the long-running models, not the other way around.

Dennis, the runtime preference is not the time per model. The per-model times should be as described at the beginning of this thread. More models are then done if the runtime preference allows time for them. So, your approach of aborting anything racking up more than 30 hours is good. In fact, if you see one with only 6 hours, but still on model one, I would abort that too. Or one with two models that's up to 8 or 9 hours, etc.



mod- thanks for the clarification.

ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 57767 - Posted: 10 Dec 2008, 10:44:49 UTC

The t060 (beta 5.98) wus usually last 4.5 to nearly 6 hours per model on my PPC. This model (t060_1_NMRREF_1_t060_1_id_model_07IGNORE_THE_REST_idl_5381_1234_0) lasted almost 10 hours.

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58181 - Posted: 27 Dec 2008, 9:58:20 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=216802506
Name cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0

stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 15270.1 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

and before that:

https://boinc.bakerlab.org/rosetta/result.php?resultid=216802506
Name cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 15270.1 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


credit sucks on both of these
both tasks granted these exact amounts:
Claimed credit 106.166115188458
Granted credit 74.8691857584611

senatoralex85
Joined: 27 Sep 05
Posts: 66
Credit: 169,644
RAC: 0
Message 58186 - Posted: 27 Dec 2008, 19:29:44 UTC

My Preferences are set to 4 hours per workunit. This workunit lasted 12 hours.

Task ID 217245063
Name cc2_1_8_mammoth_mix_fa_cst_hb_t313__IGNORE_THE_REST_1BG2A_7_6180_31_0
Workunit 197983062
Created 27 Dec 2008 3:47:57 UTC
Sent 27 Dec 2008 4:19:00 UTC
Received 27 Dec 2008 19:09:47 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 792930
Report deadline 6 Jan 2009 4:19:00 UTC
CPU time 43297.02
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 43296 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 118.702972598747
Granted credit 59.6309170544315
application version 1.47
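
For anyone who wants to pull the numbers out of a stderr block like the one above, here is a minimal Python sketch. The function name and the returned keys are made up for illustration, and it assumes that when several cpu_run_time_pref lines appear, the last one is the one in effect. The sample text is copied from the stderr above, where 21600 seconds works out to a 6 hour preference.

```python
import re

def summarize_stderr(stderr_text):
    """Extract runtime preference, CPU time and decoy count from Rosetta stderr.

    Returns None if any of the three pieces is missing.
    """
    prefs = re.findall(r"# cpu_run_time_pref: (\d+)", stderr_text)
    done = re.search(r"DONE ::\s+\d+ starting structures ([\d.]+) cpu seconds",
                     stderr_text)
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", stderr_text)
    if not (prefs and done and decoys):
        return None
    cpu_hours = float(done.group(1)) / 3600
    n_decoys = int(decoys.group(1))
    return {
        # Assumption: the last preference line is the one that applied.
        "pref_hours": int(prefs[-1]) / 3600,
        "cpu_hours": cpu_hours,
        "hours_per_decoy": cpu_hours / n_decoys if n_decoys else None,
    }


SAMPLE = """# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 43296 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
"""

print(summarize_stderr(SAMPLE))
```

For this task the single decoy took roughly 12 CPU hours, about twice the preference recorded in the stderr.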

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58189 - Posted: 27 Dec 2008, 23:03:53 UTC

rifleman had a zinc task run for 37 hrs before the watchdog terminated it.
He has a 12 hr run time set.
See this thread for more information.

The one task can be found at https://boinc.bakerlab.org/rosetta/result.php?resultid=216862173

Ian_D
Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 58196 - Posted: 28 Dec 2008, 10:41:45 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=217325144


Nearly 16 hrs in when I spotted it and now it reports, after a manual abort, it has done 0 CPU time ?!?!


Ian_D
Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 58206 - Posted: 28 Dec 2008, 19:56:18 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=217459230

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 26682.9 seconds. Greater than 3X preferred time: 7200 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>
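
The watchdog message above states its own condition: CPU time greater than 3X the preferred time. A small Python sketch for reading that line back out of a stderr dump follows. The function names are hypothetical, and this only mirrors the condition printed in the log. How the watchdog is actually implemented, or how often it checks, is not something the log tells us.

```python
import re

WATCHDOG_LINE = re.compile(
    r"CPU time: ([\d.]+) seconds\. "
    r"Greater than (\d+)X preferred time: (\d+) seconds"
)

def parse_watchdog(stderr_text):
    """Return (cpu_seconds, factor, pref_seconds), or None if no watchdog line."""
    match = WATCHDOG_LINE.search(stderr_text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), int(match.group(3))

def over_limit(cpu_seconds, factor, pref_seconds):
    """The condition as printed in the log: CPU time > factor * preferred time."""
    return cpu_seconds > factor * pref_seconds


SAMPLE = """Rosetta is going too long. Watchdog is ending the run!
CPU time: 26682.9 seconds. Greater than 3X preferred time: 7200 seconds
"""

cpu, factor, pref = parse_watchdog(SAMPLE)
print(cpu / 3600, "hours used;", factor * pref / 3600, "hour limit")
print(over_limit(cpu, factor, pref))
```

With a 7200 second (2 hour) preference the limit is 21600 seconds, so this task ran about 1.4 hours past it before being ended.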


Aegis Maelstrom
Joined: 29 Oct 08
Posts: 61
Credit: 2,137,555
RAC: 0
Message 58210 - Posted: 28 Dec 2008, 22:16:04 UTC
Last modified: 28 Dec 2008, 22:19:09 UTC

Task cc_nonideal_0_6_nocst4_hb_t328__IGNORE_THE_REST_2GVKA_7_5916_29
Workunit 197391713.

Terminated automatically as a long run, reported and counted as a "success".

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 32978.8 seconds. Greater than 3X preferred time: 10800 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>


Minirosetta 1.47, the latest BOINC client.
Recently and temporarily added Vista laptop with Core 2 duo, underclocked/normally clocked (due to power management), certainly not overclocked. Stable system.

The task ran for over 9 hours, still on Model 1, and over 2 million steps had been made.
I've been watching the task - it looked like some kind of infinite loop - mostly classic "big" moves but also the second step - smallMoverMoverBase... - as far as I can remember.

What is more - as Mod.Sense wrote above, no checkpoints were made, so I had to wait with this obviously wrong simulation just to be sure it wouldn't work.

By the way - I've seen some other complaints on cc_nonideal tasks in the MiniRosetta 1.47 bug thread.

Good luck on the bug hunt! :)

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 58218 - Posted: 29 Dec 2008, 4:41:49 UTC

This WU:

https://boinc.bakerlab.org/rosetta/result.php?resultid=217250916

took 20 hours on an Athlon XP 2400+ to crunch one decoy.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 58221 - Posted: 29 Dec 2008, 9:24:54 UTC

I was looking at the running tasks on one of my computers and noticed that this task seemed to have overshot the 6 hour time set. So, I stopped BOINC and restarted it, and it restarted the task at 25% done and 1:36 of compute time instead of the real total of over 6 hours.

A long time ago I de-emphasised Rosetta because the project went from a reliable application to one of the worst-behaving applications in the BOINC universe. I don't mind occasional errors, but I lost about 8 tasks on one machine because of a lock file error ... the machine is stable and works well on all other projects ...

Now I am finding that there are tasks that never seem to want to finish (now I am paying attention) and worse do not properly checkpoint.

And so, once more I am going to downgrade Rosetta ... I have neither the health nor time to babysit what is supposed to be a mature and PRODUCTION project. I suppose we need to change the way we classify projects because this application is not production by any means ...

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 58222 - Posted: 29 Dec 2008, 11:22:19 UTC

Well, the task died with too many exits ... I suppose one is way too many ... the file also suffers from the lock file problem.

Well, the queues are draining ...

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58263 - Posted: 30 Dec 2008, 19:24:48 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=217585069
1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_73016_0
Outcome Success
CPU time 20686.61 - actual run time
<core_client_version>6.4.5</core_client_version>
# cpu_run_time_pref: 14400 <-- my set run time
======================================================
DONE :: 1 starting structures 20686.3 cpu seconds
This process generated 3 decoys from 3 attempts

robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 58266 - Posted: 30 Dec 2008, 21:17:22 UTC - in response to Message 49379.  

I'd like to start a thread for reports of long running models. These appear to be related more to the specific batches of work released than to any given specific application version. So, I've moved the problems with v1.34 posts that seemed more about runtime into this thread.

Here's what I'd like to see:

Firstly, it is hard to talk about total time, when everyone has different CPUs. So, for a frame of reference, we'll try to talk in terms of a fairly modern 3GHz machine. If yours is slower than that, you will have to adjust the times discussed here upwards accordingly.

Each task typically runs through several "models". You can see the model number in the graphic display, or on the web page of the completed task (as the number of "decoys").

So, if you see tasks that are averaging more than 2 hours per model, or specific models that are taking more than 2 hours, please report them as per below.

Another way you might typically spot such tasks is if you have a target runtime of 3hrs or less, and a task takes significantly longer than that to complete (say 1 or more hours past your target).
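
To make the "adjust upwards" step concrete, here is a small Python sketch. It assumes the adjustment is simply proportional to clock speed, which is only a crude approximation since different CPU designs do different amounts of work per clock. The function and constant names are made up for illustration.

```python
REFERENCE_GHZ = 3.0              # the "fairly modern 3GHz machine" above
REFERENCE_HOURS_PER_MODEL = 2.0  # the 2 hours per model reporting guideline

def per_model_threshold_hours(cpu_ghz):
    """Scale the 2 hour guideline up for machines slower than the reference.

    Assumption: runtime scales inversely with clock speed. Faster machines
    keep the unscaled 2 hour figure.
    """
    if cpu_ghz <= 0:
        raise ValueError("cpu_ghz must be positive")
    return REFERENCE_HOURS_PER_MODEL * max(1.0, REFERENCE_GHZ / cpu_ghz)


print(per_model_threshold_hours(3.0))  # reference machine
print(per_model_threshold_hours(1.5))  # half the clock speed
```

Under this assumption a 1.5GHz machine would use a 4 hour per model guideline before reporting.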


A workunit that has already taken 30 hours even though I asked for 14 hour workunits:

1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_78916

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198338075

Model 2 now running. Can't find any information on whether model 1 was also slow.

Vista SP1 32-bit
BOINC 5.10.45

Still running. Currently using 94 MB memory, not counting any that's swapped out.

Graphics windows opened at least once, will not open again now.

robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 58275 - Posted: 31 Dec 2008, 3:12:13 UTC

This workunit (already finished) gave fewer decoys than expected for a 14-hour expected length workunit - only 2.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198155961

I've already given details about my machine when reporting a different problem of this type, with a workunit name similar enough to suggest that it's for the same protein.

The information I can still see doesn't pin this down to any particular model within the workunit.

robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 58276 - Posted: 31 Dec 2008, 3:17:41 UTC - in response to Message 58266.  
Last modified: 31 Dec 2008, 3:22:53 UTC

A workunit that has already taken 30 hours even though I asked for 14 hour workunits:

1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_78916

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198338075

Model 2 now running. Can't find any information on whether model 1 was also slow.

Vista SP1 32-bit
BOINC 5.10.45

Still running. Currently using 94 MB memory, not counting any that's swapped out.

Graphics windows opened at least once, will not open again now.


The graphics window finally opened again, although so slowly that I was already doing something else by then. Since then, the workunit finally finished, after about 31 CPU hours.

robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 58277 - Posted: 31 Dec 2008, 3:52:53 UTC

This makes 3 workunits in a row where I got only 2 decoys in a workunit expected to last 14 CPU hours:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198390275

The names of all three of those workunits began with 1nkuA. Do you need to add these workunits to a list of workunit types expected to take significantly more than 2 hours per decoy?

Ian_D
Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 58280 - Posted: 31 Dec 2008, 7:46:18 UTC
Last modified: 31 Dec 2008, 7:50:44 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198560045

One whole decoy in 12.6 hrs - nice !!

Ubuntu 8.10 on a P4 2.6GHz HT with 512 MB memory with POEM@home running as well.
CPU run time preference set to 8 hrs ("Home" in this case - have now changed to 6 hrs "Work" preference)


LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 58286 - Posted: 31 Dec 2008, 11:12:47 UTC

MiniRosetta 1.47 task 217935249
cc2_1_8_mammoth_mix_cen_cst_hb_t305__IGNORE_THE_REST_2AHSA_1_5874_98_0

BOINC client version 6.4.5 for windows_intelx86
Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55 [x86 Family 15 Model 104 Stepping 1]
OS: Microsoft Windows Vista: Home Premium x86 Editon, Service Pack 1, (06.00.6001.00)

Normal runtime: 3 hours
Currently at: 7 hours 56m
Currently on: Model 1 Step 1335490

LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 58304 - Posted: 31 Dec 2008, 16:03:21 UTC - in response to Message 58286.  

MiniRosetta 1.47 task 217935249
cc2_1_8_mammoth_mix_cen_cst_hb_t305__IGNORE_THE_REST_2AHSA_1_5874_98_0

BOINC client version 6.4.5 for windows_intelx86
Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55 [x86 Family 15 Model 104 Stepping 1]
OS: Microsoft Windows Vista: Home Premium x86 Editon, Service Pack 1, (06.00.6001.00)

Normal runtime: 3 hours
Currently at: 7 hours 56m
Currently on: Model 1 Step 1335490

Typical of my luck, there was a power outage here, the WU restarted from about 1h 45m, but completed about 2h 45m.

Claimed credit 28.0678988927865
Granted credit 63.9157762593521

...so compensated on credit. Just strange it didn't finish earlier on its first run.



©2024 University of Washington
https://www.bakerlab.org