Report long-running models here

LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 61384 - Posted: 26 May 2009, 17:40:45 UTC - in response to Message 61381.  

3) You can upgrade BOINC at any time, even with work in progress. The Rosetta application stays the same, and it is what truly processes the work, so the BOINC upgrade should not pose a problem.

I'd expect that to be the case, but it's never worked for me. Queued WUs don't get picked up by the new BOINC version and a load more come down in their place. I can see the old WUs sitting on this website, but they never get run and end up expiring.

I thought that happened to everyone. Am I wrong? Looks like it :(

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 61391 - Posted: 26 May 2009, 20:55:44 UTC - in response to Message 61384.  

Not necessarily wrong, just not necessarily right ... :)

I am one of the "lucky" ones: I have upgraded and downgraded BOINC versions with abandon, and I don't think I have ever lost work.

There have been cases where different BOINC versions calculate things differently, and that can cause trouble when a new version is installed. For example, the later 6.6.x versions use a very different long-term debt (LTD) model than the older versions. So there can be instances where changing versions causes problems and more work is downloaded.

I have three very nearly identical systems, all connected to WCG, where I was trying to get enough work from the new sub-project to earn my "gold". One of them has 268 tasks. Why it has so many more than the other two, I have nary a clue, but it is gamely trying to work through all those tasks before their deadline. I am still scratching my head over why one downloaded so much work while the other two only got reasonable amounts ... oops, down to only 244 tasks that are likely to miss deadline ... :)

Cesium_133*
Joined: 1 Dec 08
Posts: 28
Credit: 225,332
RAC: 0
Message 61569 - Posted: 4 Jun 2009, 4:07:33 UTC - in response to Message 61358.  

A long-running 1.67 workunit:

5/24/2009 8:56:43 PM rosetta@home Starting epsilon_BOINC_ABRELAX_CONTROL_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--epsilon-_12490_15365_0
5/24/2009 8:56:44 PM rosetta@home Starting task epsilon_BOINC_ABRELAX_CONTROL_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--epsilon-_12490_15365_0 using minirosetta version 167


Here's one more to add to the corpus of aborted WUs:

threading_lb_test1_hb_t373__IGNORE_THE_REST_11850_3473_1

Aborted 3 June 09 23:44:52 EDT (I guess 03:44:52 4 June 09 UTC)

It was 5% done after 5 hours, against an original estimated run time of about 3h 50m. Time to completion was increasing 1:1 with the time spent on it, the WU was making no progress, and no graphics were visible despite the mini-view's assertion to the contrary. The other Rosetta WU my PC was crunching was running fine. BOINC had switched to an AI WU, which completed; it then apparently went back to the one I aborted, with no success.

Next time I might try closing and re-opening BOINC, as I only saw that suggestion after I aborted the task. I do hope someone keeps a record of all WUs aborted or otherwise; perhaps there is a common thread (or more than one) among them...
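
The stall described above (progress frozen while the remaining-time estimate climbs with elapsed time) can be spotted from a few readings of the progress column. Here is a minimal sketch, assuming you note down (elapsed hours, fraction done) pairs by hand. The name `is_stalled`, the one-hour window, and the sample readings are all hypothetical; nothing here is provided by BOINC or Rosetta.

```python
def is_stalled(samples, min_hours=1.0):
    """Return True if the fraction done has not changed for at least min_hours.

    samples: list of (elapsed_hours, fraction_done) tuples in time order.
    Both the data layout and the default window are assumptions for illustration.
    """
    if len(samples) < 2:
        return False
    last_time, last_fraction = samples[-1]
    # Walk backwards from the newest reading.
    for time, fraction in reversed(samples[:-1]):
        if fraction < last_fraction:
            return False  # progress was made after this reading
        if last_time - time >= min_hours:
            return True   # same fraction for the whole window
    return False


# Made-up readings: stuck at 5% between hour 4 and hour 5.
readings = [(3.0, 0.04), (4.0, 0.05), (4.5, 0.05), (5.0, 0.05)]
print(is_stalled(readings))
```

A task that trips a check like this is a candidate for the restart-BOINC suggestion before aborting.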
The lovely lady you see isn't I, but Hayley Westenra, a classical crossover singer from Christchurch, NZ. There is no known voice as hers. Check her out- she's seraphic.

TestPilot
Joined: 23 Sep 05
Posts: 30
Credit: 419,033
RAC: 0
Message 61593 - Posted: 6 Jun 2009, 13:25:44 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=233872277

Rosetta 1.71

Aborted after 27 hours of crunching and counting... And it was assigned to another computer...
TestPilot, AKA Administrator

michaelmastro
Joined: 11 Oct 05
Posts: 51
Credit: 1,530,918
RAC: 0
Message 61596 - Posted: 6 Jun 2009, 15:29:36 UTC

This unit has been running for 13 hours, is 28% complete, with 17 hours remaining:

lb_alnmatrix_threading_alncap__hb_t308__IGNORE_THE_REST_12574_4927_0 using minirosetta version 171

Windows Vista

BOINC 6.6.20

Rosetta Mini 1.71

https://boinc.bakerlab.org/rosetta/results.php?userid=3968

This unit is also running, but in 2.5 hours it has completed over 80%, with 0.5 hours remaining:
lr5_E_rama_map_iter05_rlbd_1ubi_SAVE_ALL_OUT_12503_440_0 using minirosetta version 171



michaelmastro
Joined: 11 Oct 05
Posts: 51
Credit: 1,530,918
RAC: 0
Message 61597 - Posted: 6 Jun 2009, 16:44:41 UTC

BTW - This problem only occurs on my Vista machine. The Mac is having no problems...

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 61601 - Posted: 7 Jun 2009, 2:55:12 UTC

michaelmastro, it looks like you are running a 6.6 BOINC Manager version. Please select the task in question and click Properties. What is shown for the CPU time used by the task? It looks like your prior tasks were running for the 3-hour default runtime. If this task has more than 7 hours of actual CPU time (4 hours over your preference, which is where the watchdog should have ended it), then something isn't right and you should abort it.
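
The rule of thumb above (runtime preference plus 4 hours of CPU time) can be written down as a small check. This is only a sketch of the arithmetic in this post; the function name and the constant are hypothetical and are not taken from the BOINC or Rosetta code.

```python
# Assumption from the post above: the watchdog should end a task once its
# CPU time exceeds the runtime preference by 4 hours.
WATCHDOG_GRACE_HOURS = 4.0


def should_abort(cpu_hours, runtime_pref_hours, grace_hours=WATCHDOG_GRACE_HOURS):
    """Return True when CPU time is past preference + grace."""
    return cpu_hours > runtime_pref_hours + grace_hours


# 3-hour preference, so the limit is 7 hours of CPU time.
print(should_abort(cpu_hours=13.0, runtime_pref_hours=3.0))
print(should_abort(cpu_hours=5.0, runtime_pref_hours=3.0))
```

Note that the input is the CPU time shown under Properties, not the elapsed wall-clock time, which can be much larger on a busy machine.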
Rosetta Moderator: Mod.Sense

William T.M. Theisen
Joined: 11 Sep 06
Posts: 7
Credit: 527,145
RAC: 0
Message 62097 - Posted: 5 Jul 2009, 20:39:01 UTC

lb_dk_ksync_withtrim_hb_t297__IGNORE_THE_REST_12980_1893_0 got stuck at 6.888% and has been running for 29 hours so far; its "time to completion" has gone up from 60 hours to 65 hours. I'm not sure what is going on with it. Should I abort it?

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 62571 - Posted: 28 Jul 2009, 22:11:04 UTC
Last modified: 28 Jul 2009, 22:11:29 UTC

Not happy about this at all: one of the few tasks I've been able to get, and this happens.

Ten hours for one model on my 3 GHz machine. Great credit too; others had problems with it too.

1qlx_NNMAKE_CONSTRAINT_BOINC_ABRELAX_SAVE_ALL_OUT_14240_677_2

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=243170635

CPU time (sec): 36,076.19 · Claimed credit: 106.17 · Granted credit: 3.24

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 62585 - Posted: 29 Jul 2009, 7:45:40 UTC

And another.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=243854311

abinitio_withrelax_homfrag_129_B_1ubi__SAVE_ALL_OUT_13795_832_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 23518.3 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

That's six & a half hours.
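
For anyone comparing several of these reports, the conversion above is easy to script. This is a rough sketch that pulls the two numbers out of the stderr text quoted in this post; the function name `summarize` is hypothetical, and the regular expressions only assume the line layout shown here.

```python
import re


def summarize(stderr_text):
    """Extract the runtime preference and final CPU time from a stderr excerpt."""
    pref = re.search(r"cpu_run_time_pref:\s*(\d+)", stderr_text)
    done = re.search(
        r"DONE\s*::\s*(\d+)\s+starting structures\s+([\d.]+)\s+cpu seconds",
        stderr_text,
    )
    if pref is None or done is None:
        raise ValueError("expected lines not found in stderr text")
    pref_seconds = int(pref.group(1))
    cpu_seconds = float(done.group(2))
    return {
        "pref_hours": pref_seconds / 3600,
        "cpu_hours": cpu_seconds / 3600,
        "ratio": cpu_seconds / pref_seconds,
    }


excerpt = """# cpu_run_time_pref: 14400
DONE :: 1 starting structures 23518.3 cpu seconds
This process generated 1 decoys from 1 attempts
"""
result = summarize(excerpt)
print(f"{result['cpu_hours']:.2f} h used against a {result['pref_hours']:.0f} h preference")
```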

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 62668 - Posted: 31 Jul 2009, 1:43:54 UTC

I have another over 10hrs for 1 model.

mini 1.87.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=245101531

abinitio_withrelax_homfrag_129_B_1vcc__SAVE_ALL_OUT_13795_3017_0

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,306,580
RAC: 100
Message 62730 - Posted: 2 Aug 2009, 21:00:59 UTC
Last modified: 2 Aug 2009, 21:01:40 UTC

Long running task: 269551688
name: lr10_seq_score12_rlbd_1prq_IGNORE_THE_REST_DECOY_13841_3329_0
application version: 1.90
OS: Linux

AdeB

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,306,580
RAC: 100
Message 62932 - Posted: 14 Aug 2009, 17:19:59 UTC
Last modified: 14 Aug 2009, 17:21:24 UTC

Long running task: 272664497
name: lr8_newhb_run02_rlbn_2apb_IGNORE_THE_REST_NATIVE_NOCON_14611_463_1
application version: 1.91
OS: Linux
CPU time: 57738.5s, 14400s + 43200s
Granted credit: 4.01992761072857

AdeB

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 63071 - Posted: 28 Aug 2009, 22:06:59 UTC

Hi.

I aborted this when I looked at the graphics after 3 hrs 28 min. It was sitting on model 1 step 0 and not moving, so it's gone!

lr5_combine_smooth_torsion_it00_A_rlbd_2hkv_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14643_667

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=251816801


AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,306,580
RAC: 100
Message 63248 - Posted: 10 Sep 2009, 17:59:39 UTC

Long running task: 278731357
name: lr8_A_seq_score12_ss1.7_rlbd_2ccv_IGNORE_THE_REST_DECOY_14637_3189_0
application version: 1.97
OS: Linux

AdeB

Sid Celery
Joined: 11 Feb 08
Posts: 1599
Credit: 28,829,389
RAC: 18,556
Message 63380 - Posted: 16 Sep 2009, 23:50:25 UTC

A couple of strange, long-running WUs here. Both successfully completed and credit awarded, but both ran in excess of 8 hours with a 4 hour default runtime:

lr5_score12_gb_run01_rlbd_1unp_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14707_62_0
CPU time 29385.66 [...]

# cpu_run_time_pref: 14400
Hbond tripped: [2009- 9-16 4:58:53:]
BOINC:: CPU time: 29383.6s, 14400s + 14400s[2009- 9-16 13:13:29:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 29383.6 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish


lr5_score12_gb_run01_rlbd_1ig5_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14707_38_0
CPU time 28843.29 [...]

Hbond tripped: [2009- 9-15 19:49:12:]
# cpu_run_time_pref: 14400
Fullatom mode ..
BOINC:: CPU time: 28841.1s, 14400s + 14400s[2009- 9-16 4: 1:53:] :: BOINC
InternalDecoyCount: 0
======================================================
DONE :: 1 starting structures 28841.1 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish
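
To compare these two results side by side, the "BOINC:: CPU time" line can be parsed directly. This sketch assumes the three numbers on that line are the CPU time used, the runtime preference, and an additional allowance before the task is ended; that reading is a guess from the logs in this thread, not a documented format, and `parse_cpu_line` is a hypothetical name.

```python
import re

# Assumed layout: "BOINC:: CPU time: <used>s, <preference>s + <allowance>s[...]"
CPU_LINE = re.compile(r"BOINC:: CPU time:\s*([\d.]+)s,\s*(\d+)s\s*\+\s*(\d+)s")


def parse_cpu_line(line):
    """Return (used, preference, allowance, overrun) in seconds."""
    match = CPU_LINE.search(line)
    if match is None:
        raise ValueError("not a BOINC CPU time line")
    used = float(match.group(1))
    preference = int(match.group(2))
    allowance = int(match.group(3))
    overrun = used - (preference + allowance)
    return used, preference, allowance, overrun


lines = [
    "BOINC:: CPU time: 29383.6s, 14400s + 14400s[2009- 9-16 13:13:29:] :: BOINC",
    "BOINC:: CPU time: 28841.1s, 14400s + 14400s[2009- 9-16  4: 1:53:] :: BOINC",
]
for line in lines:
    used, preference, allowance, overrun = parse_cpu_line(line)
    print(f"used {used / 3600:.2f} h, limit {(preference + allowance) / 3600:.1f} h, "
          f"overrun {overrun:.0f} s")
```

In both cases the CPU time is only a few hundred seconds past the sum of the two limits, which would be consistent with the task being cut off at that point rather than finishing its model normally.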

LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 63416 - Posted: 21 Sep 2009, 11:01:15 UTC - in response to Message 63380.  

Exactly the same for me too.

lr5_score12_gb_run01_rlbd_1ugh_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14707_136_0
CPU time 28841.62 [...]

# cpu_run_time_pref: 14400
Hbond tripped: [2009- 9-21 2:13:17:]
BOINC:: CPU time: 28839.4s, 14400s + 14400s[2009- 9-21 11: 7:12:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 28839.4 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 10,046,961
RAC: 6,099
Message 63428 - Posted: 22 Sep 2009, 17:06:28 UTC

I'm seeing the same thing with lr5_score12_gb* workunits. Workunit 282249916 ran for over seven hours on Mac OS X 10.5, eventually failing as follows:

Fullatom mode ..
Hbond tripped: [2009- 9-21 5:25:58:]
BOINC:: CPU time: 25274.8s, 14400s + 10800s[2009- 9-21 12:55:41:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)


For most of that time it was stuck on initialising (Model 0 Step 0). Here's the output of the Sampler while it was doing that.

Sampling process 21361 for 3 seconds with 1 millisecond of run time between samples
Sampling completed, processing symbols...
Analysis of sampling minirosetta_1.97_i686-apple-darwin (pid 21361) every 1 millisecond
Call graph:
2116 Thread_2507
2116 start
2116 _start
2116 main
2116 protocols::relax::Relax_main(bool)
2116 protocols::jd2::BOINCJobDistributor::go(utility::pointer::owning_ptr<protocols::moves::Mover>)
2116 protocols::jd2::JobDistributor::go(utility::pointer::owning_ptr<protocols::moves::Mover>)
2116 protocols::jd2::JobDistributor::go_main(utility::pointer::owning_ptr<protocols::moves::Mover>)
2116 protocols::relax::SimpleMultiRelax::apply(core::pose::Pose&)
2116 protocols::relax::ClassicRelax::apply(core::pose::Pose&)
2116 protocols::moves::RampingMover::apply(core::pose::Pose&)
2116 protocols::moves::TrialMover::apply(core::pose::Pose&)
2116 protocols::moves::JumpOutMover::apply(core::pose::Pose&)
2116 protocols::moves::MinMover::apply(core::pose::Pose&)
2116 core::optimization::AtomTreeMinimizer::run(core::pose::Pose&, core::kinematics::MoveMap const&, core::scoring::ScoreFunction const&, core::optimization::MinimizerOptions const&) const
2116 core::optimization::Minimizer::run(utility::vector1<double, std::allocator<double> >&)
2116 core::optimization::Minimizer::dfpmin_armijo(utility::vector1<double, std::allocator<double> >&, double&, core::optimization::ConvergenceTest&, bool) const
2116 core::optimization::ArmijoLineMinimization::operator()(utility::vector1<double, std::allocator<double> >&, utility::vector1<double, std::allocator<double> >&)
2116 core::optimization::AtomTreeMultifunc::operator()(utility::vector1<double, std::allocator<double> > const&) const
2116 core::scoring::ScoreFunction::operator()(core::pose::Pose&) const
2116 core::scoring::ScoreFunction::eval_onebody_energies(core::pose::Pose&) const
2116 core::scoring::methods::OmegaTetherEnergy::residue_energy(core::conformation::Residue const&, core::scoring::EMapVector&) const
2113 core::scoring::OmegaTether::eval_omega_score_residue(core::conformation::Residue const&, double&, double&) const
2113 core::scoring::OmegaTether::eval_omega_score_residue(core::conformation::Residue const&, double&, double&) const
3 0xffffffff
3 _sigtramp
3 _sigtramp
2116 Thread_2603
2116 thread_start
2116 _pthread_start
2116 timer_thread(void*)
2116 boinc_sleep(double)
2116 usleep
2116 nanosleep
2116 mach_wait_until
2116 mach_wait_until
2116 Thread_2703
2116 thread_start
2116 _pthread_start
2116 protocols::boinc::watchdog::main_watchdog(void*)
2116 sleep
2116 nanosleep
2116 mach_wait_until
2116 mach_wait_until

Total number in stack (recursive counted multiple, when >=5):

Sort by top of stack, same collapsed (when >= 5):
mach_wait_until 4232
core::scoring::OmegaTether::eval_omega_score_residue(core::conformation::Residue const&, double&, double&) const 2113
Sample analysis of process 21361 written to file /dev/stdout
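
A quick way to read the "Sort by top of stack" summary is to ask what share of the worker thread's samples landed in one function. This is only a sketch: `top_of_stack` is a hypothetical helper, and it assumes each summary line ends with a space followed by the sample count, as in the output above. The 2116 figure is the per-thread sample count from the call graph.

```python
def top_of_stack(lines):
    """Map each function name to its sample count (last token on the line)."""
    counts = {}
    for line in lines:
        name, _, count = line.strip().rpartition(" ")
        if name and count.isdigit():
            counts[name] = int(count)
    return counts


summary = [
    "mach_wait_until 4232",
    "core::scoring::OmegaTether::eval_omega_score_residue("
    "core::conformation::Residue const&, double&, double&) const 2113",
]
counts = top_of_stack(summary)
samples_per_thread = 2116  # from the call graph above

for name, count in counts.items():
    if "eval_omega_score_residue" in name:
        print(f"{count / samples_per_thread:.1%} of main-thread samples")
```

Nearly every sample of the main thread sits in eval_omega_score_residue, which suggests (but does not prove) that the task was looping there, while the other two threads were simply sleeping.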

CraniuMod
Joined: 11 Jan 08
Posts: 3
Credit: 556,988
RAC: 0
Message 63435 - Posted: 23 Sep 2009, 20:18:38 UTC

Will keep an eye out for this from here on out. Has not happened on Rosetta before.
281082606
Name 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0
Workunit 256342237
Created 15 Sep 2009 20:41:03 UTC
Sent 15 Sep 2009 20:53:55 UTC
Received 23 Sep 2009 20:09:18 UTC
Server state Over
Outcome Client error
Client state Aborted by user
Exit status -197 (0xffffff3b)
Computer ID 926185
Report deadline 25 Sep 2009 20:53:55 UTC
CPU time 19541.34
stderr out
<core_client_version>6.6.36</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>
Validate state Invalid
Claimed credit 78.92439165502
Granted credit 0
application version 1.97
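
As an aside on the "Exit status -197 (0xffffff3b)" line: the two values are the same number, shown once as a signed integer and once as its 32-bit two's-complement bit pattern in hex. A small sketch of the conversion in both directions (the helper names are made up for illustration):

```python
def as_unsigned32_hex(status):
    """Show a signed exit status as its 32-bit two's-complement hex pattern."""
    return hex(status & 0xFFFFFFFF)


def to_signed32(hex_text):
    """Interpret a 32-bit hex pattern as a signed integer."""
    value = int(hex_text, 16)
    return value - 2**32 if value >= 2**31 else value


print(as_unsigned32_hex(-197))    # 0xffffff3b
print(to_signed32("0xffffff3b"))  # -197
```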

Greg_BE
Joined: 30 May 06
Posts: 4877
Credit: 4,566,920
RAC: 3,234
Message 63440 - Posted: 24 Sep 2009, 8:34:05 UTC - in response to Message 63435.  

Did you abort this task, or did something else happen? 5.5 hrs is not really a long run.
©2021 University of Washington
https://www.bakerlab.org