minirosetta 2.03

Message boards : Number crunching : minirosetta 2.03

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,841,722
RAC: 1,590
Message 64547 - Posted: 20 Dec 2009, 1:34:51 UTC - in response to Message 64539.  
Last modified: 20 Dec 2009, 1:38:42 UTC

I get an issue on this workunit during CPU benchmarks:
Wed 16 Dec 2009 19:57:17 CET||Running CPU benchmarks
Wed 16 Dec 2009 19:57:17 CET||Suspending computation - running CPU benchmarks
Wed 16 Dec 2009 19:57:28 CET|rosetta@home|Task mix_score13_env_rlbd_2apb__IGNORE_THE_RESTlr10_DECOY_16523_731_0: no shared memory segment
Wed 16 Dec 2009 19:57:28 CET|rosetta@home|Task mix_score13_env_rlbd_2apb__IGNORE_THE_RESTlr10_DECOY_16523_731_0 exited with zero status but no 'finished' file
Wed 16 Dec 2009 19:57:28 CET|rosetta@home|If this happens repeatedly you may need to reset the project.
Wed 16 Dec 2009 19:57:49 CET||Benchmark results:
Wed 16 Dec 2009 19:57:49 CET|| Number of CPUs: 2
Wed 16 Dec 2009 19:57:49 CET|| 2217 floating point MIPS (Whetstone) per CPU
Wed 16 Dec 2009 19:57:49 CET|| 6149 integer MIPS (Dhrystone) per CPU
Wed 16 Dec 2009 19:57:50 CET||Resuming computation



But now the task is still running well on Rosetta mini 2.03, no other message.


To me, this one looks like SOMETHING won't run correctly during the CPU benchmarks, but the application is able to recover afterwards. Since the CPU benchmarks aren't run very often, this problem won't be seen very often either.

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 64554 - Posted: 20 Dec 2009, 17:58:31 UTC

broker_idealclose_kic_in20_hb_t312__IGNORE_THE_REST_16513_879_0, task 305259567, failed on Windows 7 with a Compute Error after about an hour.

Setting up checkpointing ...
Setting up graphics native ...
FNAME: native.pdb
FNAME: ss_core_native.pdb
FNAME: ss_core_native_radical.pdb
FNAME: native_notails.pdb
FNAME: native.pdb
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
CLOSING with IDEALIZATION
CLOSING with IDEALIZATION
CLOSING with IDEALIZATION
CLOSING with IDEALIZATION


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x5B202020 read attempt to address 0x5B202020

Engaging BOINC Windows Runtime Debugger...

Followed by pages of W7 debug info

SekeRob
Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 64562 - Posted: 21 Dec 2009, 11:34:09 UTC - in response to Message 64554.  
Last modified: 21 Dec 2009, 12:24:58 UTC

As a trial, I set a 1-hour run time preference in my home profile (because Rosetta is on a small share) and saved it. I received 2 tasks with about a 1-hour run time, but now this one is running:

21/12/2009 12:23:40 rosetta@home [checkpoint_debug] result broker_idealclose_hb_t293__IGNORE_THE_REST_16362_82629_0 checkpointed

It is at 1:25 CPU time, 1:26 elapsed, and 28 percent. It's checkpointing regularly, so I don't consider this a bad task. Why this long one in between? This is on the Mini 2.03 release.

PS: The first 2 validated; this long one is now Pending Validation at 2.11 hours, twice as long as specified.
Coelum Non Animum Mutant, Qui Trans Mare Currunt

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,841,722
RAC: 1,590
Message 64563 - Posted: 21 Dec 2009, 13:30:44 UTC - in response to Message 64562.  

At least partly because the test for whether to end the task normally occurs only at the end of a decoy, so if the last decoy it started ran significantly longer than expected, you'd be likely to exceed your time preference.

Also, I remember some discussion about setting the minimum allowed time preference longer than one hour, so you might want to check for signs that it's actually now set longer.
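
The end-of-decoy check described above can be sketched as a small simulation. This is only an illustration under assumptions: the function name `run_task`, the decoy durations, and the rule "stop once elapsed time reaches the target" are hypothetical and are not taken from the minirosetta source.

```python
def run_task(target_seconds, decoy_durations):
    """Simulate a task that checks its run time preference only between decoys.

    Hypothetical model: the real application may use a different stopping rule.
    Returns (elapsed_seconds, decoys_completed).
    """
    elapsed = 0.0
    completed = 0
    for duration in decoy_durations:
        elapsed += duration          # a started decoy always runs to completion
        completed += 1
        if elapsed >= target_seconds:
            break                    # the check happens only at a decoy boundary
    return elapsed, completed


# Made-up durations: two short decoys, then one unusually long one.
elapsed, completed = run_task(3600, [1500, 1500, 4600, 1500])
print(f"{completed} decoys in {elapsed / 3600:.2f} hours")
```

With a 1-hour target, the task in this sketch cannot stop at the 50-minute mark because the target has not been reached yet, so it starts a third decoy and must finish it before the next check.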

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64564 - Posted: 21 Dec 2009, 13:49:17 UTC

I don't believe any change in minimum runtime was made, robert.

But the minimum amount of useful work is one model (or "decoy"). If that takes longer than an hour, then that does happen and is normal for this project. The watchdog is still there keeping everyone in line if need be.

With such a short runtime you should expect the % completed to vary widely. BOINC's expectations and Rosetta's estimates will have trouble settling in when one task completed in 45 min. and the next in 2 hours.
Rosetta Moderator: Mod.Sense

SekeRob
Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 64566 - Posted: 21 Dec 2009, 14:43:12 UTC - in response to Message 64564.  
Last modified: 21 Dec 2009, 14:57:49 UTC

The present test standing is:

Task ID | Work unit ID | Sent | Time reported or deadline | Server state | Outcome | Client state | CPU time (sec) | Claimed credit | Granted credit
306433095 | 279395159 | 21 Dec 2009 13:45:43 UTC | 21 Dec 2009 14:53:27 UTC | Over | Success | Done | 3,366.56 | 17.47 | 15.56
306421142 | 279383866 | 21 Dec 2009 12:39:42 UTC | 21 Dec 2009 13:49:54 UTC | Over | Success | Done | 3,503.61 | 18.18 | 15.74
306417911 | 279381434 | 21 Dec 2009 12:22:40 UTC | 21 Dec 2009 13:33:02 UTC | Over | Success | Done | 3,542.00 | 18.38 | 16.13
306408510 | 279371956 | 21 Dec 2009 11:33:58 UTC | 21 Dec 2009 12:43:56 UTC | Over | Success | Done | 3,594.54 | 18.65 | 16.79
306393079 | 279357285 | 21 Dec 2009 10:11:58 UTC | 21 Dec 2009 12:26:53 UTC | Over | Success | Done | 7,661.61 | 39.76 | 40.86
306381783 | 279347432 | 21 Dec 2009 9:11:43 UTC | 21 Dec 2009 11:38:12 UTC | Over | Success | Done | 3,444.83 | 17.88 | 15.05
306365927 | 279333133 | 21 Dec 2009 7:44:11 UTC | 21 Dec 2009 9:07:33 UTC | Over | Success | Done | 3,497.34 | 18.15 | 16.63

Looks like it's pretty well figured out that an hour is an hour most of the time. I do appreciate that there is a non-deterministic element, and if the overruns are only occasional, this is a good project to act as filler when on a shutdown schedule. From 4 o'clock it's power-off.
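
The "an hour is an hour" reading can be checked with a few lines of arithmetic. This is a rough sketch that assumes the first of the three trailing numbers in each row above is CPU time in seconds; the variable names are made up for the example.

```python
from statistics import median

TARGET_SECONDS = 3600  # the 1-hour run time preference mentioned earlier in the thread

# CPU seconds copied from the seven results listed above (assumed column meaning).
cpu_seconds = [3366.56, 3503.61, 3542.00, 3594.54, 7661.61, 3444.83, 3497.34]

within_target = [t for t in cpu_seconds if t <= TARGET_SECONDS]
longest = max(cpu_seconds)

print(f"{len(within_target)} of {len(cpu_seconds)} tasks finished within the target")
print(f"median CPU time: {median(cpu_seconds):.2f} s")
print(f"longest task: {longest / TARGET_SECONDS:.2f} x the target")
```

Under that assumption, only one of the seven tasks ran past the target, and it ran a little more than twice as long.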
Coelum Non Animum Mutant, Qui Trans Mare Currunt

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 64850 - Posted: 7 Jan 2010, 18:32:45 UTC

Workunit 281289676 failed on Windows 7: it appeared to hang (not using processor time) and had to be aborted. It was successfully completed by a wingman on XP.

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 64868 - Posted: 9 Jan 2010, 3:25:16 UTC

Task: 309276026
Workunit: homopt_nat2.t370_.t370_.IGNORE_THE_REST.S_00003_0000009_04.pdb_00003.pdb.JOB_16836_1
stderr out:
...
ERROR: [ERROR] Error opening RBSeg file 'S_00011_0000013_0_0_00060.pdb_00029.pdb_00011.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


AdeB

coturnix
Joined: 8 Oct 09
Posts: 4
Credit: 760,915
RAC: 0
Message 64879 - Posted: 9 Jan 2010, 19:21:58 UTC

Task: 309472876
Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00002_0000023_04.pdb_00002.pdb.JOB_16835_15

ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 64883 - Posted: 9 Jan 2010, 22:33:23 UTC

Same error as AdeB reported

ERROR: [ERROR] Error opening RBSeg file 'native_0001_2.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

Task 309256017

Mac OS X 10.6

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 64886 - Posted: 10 Jan 2010, 3:05:39 UTC

Had these two errors within seconds of each other.

homopt_cstmc_1.t308_.t308_.IGNORE_THE_REST.S_00002_0000618_00037.pdb.JOB_16846_1

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282054734

ERROR: [ERROR] Unable to open constraints file: /work/tex/projects/cm/benchmark/cross_filt/t308_/t308_.aln_list_mike_chosen_bestaln.alns.combined.csts
ERROR:: Exit from: src/core/scoring/constraints/ConstraintIO.cc line: 332
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

=======================================================================================

homopt_nat2.t312_.t312_.IGNORE_THE_REST.S_00022_0000017_04.pdb_00022.pdb.JOB_16828_5

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282085871

ERROR: [ERROR] Error opening RBSeg file 'S_00002_0000020_0_0_030.pdb_00002.pdb_00002.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Mad_Max
Joined: 31 Dec 09
Posts: 207
Credit: 23,310,920
RAC: 11,198
Message 64892 - Posted: 10 Jan 2010, 14:16:16 UTC
Last modified: 10 Jan 2010, 14:23:18 UTC

It seems that some of Rosetta's WUs do not checkpoint at all, or do it incorrectly.
For example, with WUs named "ha_notyr_...", "CPU time at last checkpoint" stays at "---" (none) even after several hours of computing. If I restart or shut down the computer (or only the BOINC client) while such a WU is running, all results are lost, and after the restart the computation starts from the very beginning.
Here are examples of such tasks:
https://boinc.bakerlab.org/rosetta/result.php?resultid=308985993
https://boinc.bakerlab.org/rosetta/result.php?resultid=309233711

And this is how they look in BOINC Manager: (screenshot not preserved)

Some other tasks report checkpoints but apparently do not actually write them (or, more likely, write them but do not use them after a restart). It looks like this: "CPU time at last checkpoint" looks correct (only a few minutes less than the total CPU time), BUT before exiting BOINC I look at "Show graphics", and it shows that, say, 38 models have already been calculated. After the restart, the model count starts again from 0, 1, 2 and so on. Then fewer models are reported to the server than had been computed before the restart (for example only 20), apparently only those computed after the last restart.

It is a significant problem for me, because I NEED to restart this computer from time to time, and it is always shut down when I sleep (because of noise).
Besides, something similar most likely also happens on automatic switching between projects if BOINC runs several projects simultaneously (for example R@H and E@H).

For now I have worked around it like this: I set a small "CPU target time" (2 hours at present instead of the 10 I used earlier) to reduce possible losses of useful calculations to a minimum.
But I think this is only a partial solution, and far from the best one...
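
The behaviour expected here, where the model count survives a restart, can be sketched as follows. This is a minimal illustration only: it is not how minirosetta or BOINC implement checkpointing, and the file name and the functions `save_checkpoint`, `load_checkpoint` and `run_models` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"models_done": 0}


def run_models(path, models_to_run):
    """Resume from the checkpoint and record progress after every model."""
    state = load_checkpoint(path)
    for _ in range(models_to_run):
        state["models_done"] += 1   # stand-in for computing one model
        save_checkpoint(path, state)
    return state["models_done"]
```

In this sketch, running 38 models, "restarting" (calling `run_models` again with the same path) and running 20 more gives a total of 58 rather than 20, which is the result one would hope to see reported.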

P.S.
I am running minirosetta 2.03, BOINC 6.10.18 on Windows XP SP3.

Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,809,927
RAC: 736
Message 64898 - Posted: 10 Jan 2010, 16:04:51 UTC

9gbnnotyr_3gbn_2bk8_9Jan2010_16860_5_0
Completed successfully... 2713 models! It ran for the entire 10 hour target run time without stopping despite having a 60 minute switch interval, plenty of work on board from other projects and a STD and resource share which leads most 10 hour rosetta units to break at least once, usually twice before finishing up. I don't have checkpoint flags enabled and I'm running BOINC 6.2.18 so I have no information on checkpoints. I posted a similar report (more than 100 models, no switching) about a different type of WU over on Ralph.

Snags

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 64899 - Posted: 10 Jan 2010, 17:17:47 UTC

These homopt_nat2.t* models are causing real problems.

On Mac I get a bunch erroring out immediately, e.g.

Task 309606751
Task 309641678

gave this same error:
ERROR: [ERROR] Error opening RBSeg file 'S_00002_0000022_0_0_00009.pdb_00001.pdb_00002.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

while Task 309628562 (https://boinc.bakerlab.org/rosetta/result.php?resultid=309628562)

failed like this

Options::initialize() End reached
ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

On Windows 7 I still get a bunch that have to be aborted as they're hanging but not taking up CPU time

Task 308815202 (https://boinc.bakerlab.org/rosetta/result.php?resultid=308815202)
Task 308815003 (https://boinc.bakerlab.org/rosetta/result.php?resultid=308815003)
Task 308586455 (https://boinc.bakerlab.org/rosetta/result.php?resultid=308586455)



Sid Celery
Joined: 11 Feb 08
Posts: 1981
Credit: 38,424,259
RAC: 13,236
Message 64902 - Posted: 10 Jan 2010, 19:24:33 UTC
Last modified: 10 Jan 2010, 19:41:50 UTC

A disappointing validate error on my W7 laptop:
ha_notyr_3gbn_2hpj_6Jan2010_16806_6_1

And a Compute Error on my Vista desktop:
ha_notyr_3gbn_1.gz_6Jan2010_16806_2_1
ERROR: data
ERROR:: Exit from: ..\..\src\protocols\ProteinInterfaceDesign\read_patchdock.cc line: 70
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


And another 2 for the laptop:
homopt4.t367_.t367_.IGNORE_THE_REST.S_00006_0000013_0_0_00086.pdb_00004.pdb_00006.pdb.JOB_16819_17_1
homopt4.t328_.t328_.IGNORE_THE_REST.S_00001_0000007_0_0_0_0020.pdb_00001.pdb_00001.pdb.JOB_16816_23_1
Both showing:
ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 64903 - Posted: 10 Jan 2010, 20:32:27 UTC - in response to Message 64902.  

Thanks, Sid! The error you report in ha_notyr_3gbn_1.gz_6Jan2010_16806_2_1 is due to some scripting bug, which I've now fixed. Thanks for the bug report! Sarel.

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 64904 - Posted: 10 Jan 2010, 21:05:56 UTC
Last modified: 10 Jan 2010, 21:19:30 UTC

Two more overnight: one errored, the other was odd, and I'm not impressed.
---------------------------------------------------------------------
homopt4.t290_.t290_.IGNORE_THE_REST.S_00002_000.pdb.JOB_16809_13_0.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282226265

ERROR: [ERROR] Error opening RBSeg file 'S_00001_0000038_04.pdb_00001.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

=====================================================================
This ran for over 8 hrs non-stop and didn't let other tasks run; the last model
seems to have taken four hours. My run time is set at 4 hrs.

ha_notyr_3gbn_2oeb_8Jan2010_16808_18_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282041421

BOINC:: CPU time: 29392.7s, 14400s + 14400s[2010- 1-10 18:36:45:] :: BOINC
InternalDecoyCount: 87
======================================================
DONE :: 2 starting structures 29392.7 cpu seconds
This process generated 87 decoys from 87 attempts
======================================================
called boinc_finish

Over__Success__Done__29,394.75__69.44__9.95
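
The numbers in that log can be put side by side with a little arithmetic. One possible reading of the "14400s + 14400s" in the CPU time line is the 4-hour run time setting plus a further allowance of the same size; that interpretation is an assumption, since the log does not say what the second term means.

```python
cpu_time = 29392.7   # seconds, from the "CPU time" log line
target = 14400       # seconds, the 4-hour run time setting
extra = 14400        # seconds, second term in the log line (meaning assumed)
decoys = 87          # from "InternalDecoyCount"

limit = target + extra
print(f"limit: {limit} s, overshoot: {cpu_time - limit:.1f} s")
print(f"average per decoy: {cpu_time / decoys:.1f} s")
```

The average hides the point of the complaint: if most of the 87 decoys were quick, a single very long final model would account for most of the overrun.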

Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 64905 - Posted: 10 Jan 2010, 21:10:33 UTC - in response to Message 64892.  

Hello,

Thanks for your comments! In this sort of WU trajectories are typically not very long, but successful trajectories are extremely rare (credit is allocated per time spent computing, so you get credit for the work regardless of success). Accordingly, don't worry about turning off the computer in the middle of one of these trajectories. The most you're going to lose is a couple of minutes of computation.

Thanks, Sarel.

frederick corse
Joined: 7 Oct 05
Posts: 10
Credit: 1,545,999
RAC: 0
Message 64906 - Posted: 10 Jan 2010, 22:07:30 UTC

Hello


I ran 9gbnnotyr_3gbn_3bfm_9jan2010_16860_ 11. It ran for 14399.07 secs and had 844 decoys; the most I ever saw before was 100.

regards

Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 64908 - Posted: 11 Jan 2010, 2:21:26 UTC - in response to Message 64906.  

That's right. Again, this is a different sort of simulation (see https://boinc.bakerlab.org/forum_thread.php?id=4477&nowrap=true#64838 for details).

In these runs many of the trajectories are cut short early on because they are unlikely to yield useful results. Credit for runs is allocated for computational time. We also need to know how many times simulations were started on your computers, and those are reported as decoys. The amount of information sent back to our servers per triaged trajectory is very small, though, to limit bandwidth use.
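
To give a feel for how short these trajectories are on average, here is a rough calculation from two reports in this thread. It assumes the second task used its full 10-hour target; the labels and the helper `seconds_per_decoy` are just for illustration.

```python
# (label, run time in seconds, decoys reported); figures quoted from posts in this thread
reports = [
    ("9gbnnotyr_3gbn_3bfm", 14399.07, 844),
    ("9gbnnotyr_3gbn_2bk8", 10 * 3600, 2713),  # assumes the full 10-hour target was used
]


def seconds_per_decoy(run_seconds, decoys):
    """Average wall of compute time spent per started trajectory."""
    return run_seconds / decoys


for label, run_seconds, decoys in reports:
    average = seconds_per_decoy(run_seconds, decoys)
    print(f"{label}: {average:.1f} s per started trajectory")
```

Averages well under a minute are consistent with most trajectories being stopped early, though the actual distribution of trajectory lengths is not available from these reports.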


©2024 University of Washington
https://www.bakerlab.org