Posts by AdeB

1) Message boards : Number crunching : minirosetta 2.17 (Message 68462)
Posted 8 Nov 2010 by AdeB
Post:
In several posts Chris wrote:
A few more examples of the Rossmann2x3_abinitio tasks having problems, running until the watchdog nails them, and spitting out gobs of "OVERFLOW ERROR: Error writing" messages.

Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_001_22515_226_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_008_22515_182_0 - Darwin 6.10.58
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_008_22515_1096_0 - Linux 6.10.56

...

Come on guys - I find it hard to believe that I am the only one seeing these Rossmann2X3 tasks chew up their systems. Some complete, some fail, all are running long and are using nearly 2 gig per task. And all spit out the ominous "OVERFLOW ERROR: Error writing" repeatedly.

Here are two which finished - generating just 1 decoy for eight hours of run time:

Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_001_22515_1024_1 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_007_22515_256_0 - Linux 6.10.56

...

You could be right about it being a problem unique to Linux and OSX (Darwin) - in both cases they very well may be built using the same compiler (GCC?) and it is possible they have stumbled on an awkward spot.

I have no way of knowing - in preparation for the purification ceremonies required to reach a higher state of karma and grace, I no longer own or run a Windows system :)

...

AdeB wondered:
Even one of Chris' Linux machines has no problems with them. Could it be machine specific?

The machine you pointed to has had the issue - although the task did not end in error, it did eat up all of the memory in sight, run until the watchdog killed it, and spit out repeated "OVERFLOW ERROR: Error writing" messages.

Just because the task runs to completion does not mean it's not a problem task. Extreme memory usage + runtime can be issues when one of these tasks pretty much shuts down the other 3 (or 5) cores on a system.

And it is not AMD specific - it happens on my Xeon based Mac Pro too.

But I do appreciate you taking the time to look at it and offer suggestions, I really do.

A couple of sample tasks from the system AdeB pointed to:

Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_004_22515_1706_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_004_22515_1706_0 - Linux 6.10.56

I checked a few days ago and I really didn't see any of this, so I assumed it was OS specific or machine specific, as suggested, but I just glanced at a long-running watchdog-truncated job and found I had the same experience on my W7 x64 laptop.

I've modified Chris's earlier links to show the job names, OS & BOINC version just in case it reveals a more specific pattern of tasks. My task was slightly different in that it does seem to have checkpointed several times before the watchdog cut in at 8+4 hours.

Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_005_22515_1974_0 - Windows 7 64-bit 6.10.58

So the pattern is more specifically "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_" if that helps.


Of course I only did a quick scan, and missed the problematic tasks on Chris' machine.
Sid's approach clearly shows a pattern, nice catch.
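
For anyone who wants to repeat Sid's check, here is a minimal sketch of it: filter a list of task names on the prefix and count by OS. The task names and platforms are copied from the posts above; the last entry is a task from a different batch that I added only as a control, and the function name is my own.

```python
from collections import Counter

PREFIX = "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_"

# (task name, OS) pairs copied from the posts quoted above.
TASKS = [
    (PREFIX + "001_22515_226_0", "Linux"),
    (PREFIX + "008_22515_182_0", "Darwin"),
    (PREFIX + "008_22515_1096_0", "Linux"),
    (PREFIX + "001_22515_1024_1", "Linux"),
    (PREFIX + "007_22515_256_0", "Linux"),
    (PREFIX + "004_22515_1706_0", "Linux"),
    (PREFIX + "005_22515_1974_0", "Windows 7 64-bit"),
    # Control: a task from another batch, should not be counted.
    ("lr8_A_seq_score12_ss1.7_rlbd_2ccv_IGNORE_THE_REST_DECOY_14637_3189_0", "Linux"),
]


def count_by_os(tasks, prefix=PREFIX):
    """Count tasks whose name starts with `prefix`, grouped by OS."""
    return Counter(os_name for name, os_name in tasks if name.startswith(prefix))


counts = count_by_os(TASKS)
for os_name, n in sorted(counts.items()):
    print(f"{os_name}: {n}")
```

With only seven matching tasks this proves nothing statistically, but it does show the prefix turning up on all three platforms.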

AdeB
2) Message boards : Number crunching : minirosetta 2.17 (Message 68442)
Posted 7 Nov 2010 by AdeB
Post:

You could be right about it being a problem unique to Linux and OSX (Darwin)...


Chris, you obviously run many more WU's than I do, but I haven't had any errors at all running them on my OS X machine. There is a Ross2X3 running as I type this. And I only have 2 gig of total RAM installed.

Perhaps only certain WU's are problematic? The larger molecules, I guess.


And I haven't had any errors on my Linux machine. Even one of Chris' Linux machines has no problems with them. Could it be machine specific?

AdeB
3) Message boards : Number crunching : Report long-running models here (Message 68297)
Posted 31 Oct 2010 by AdeB
Post:
Chris, I have noticed that the PCS_ tasks run very slow in Linux. On my 2.2 GHz Linux box they were taking 10 hours to make two models. On my 2.1 GHz Win 7 box they always seem to make at least 4 models in 6 hours.

A few days ago I was getting a ton of them, so I put my Linux machines on WCG for a while, but you can look at the results for my Win 7 box and pick out the PCS tasks just by looking at the granted credit.

Edit: In fact, I have looked at a lot of other Win 7 boxes out there and all of the PCS tasks on Win 7 seem to be getting much higher granted credit than what was claimed. So maybe it is some kind of dysfunction?
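
To put the two figures quoted above on the same scale, here is a small sketch that converts them to hours per model. The numbers are the ones from the quote; the helper name is my own, and since the Win 7 box made "at least" 4 models, its figure is an upper bound.

```python
def hours_per_model(hours, models):
    """Average wall-clock hours spent per completed model."""
    return hours / models


linux_hpm = hours_per_model(10, 2)  # 2.2 GHz Linux box: 10 hours, 2 models
win7_hpm = hours_per_model(6, 4)    # 2.1 GHz Win 7 box: 6 hours, at least 4 models

print(f"Linux: {linux_hpm} h/model")
print(f"Win 7: {win7_hpm} h/model (upper bound)")
print(f"Linux is slower by a factor of at least {linux_hpm / win7_hpm:.1f}")
```

The clock speeds are nearly the same, so clock speed alone does not account for a gap of that size.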


Here are some more "low credit" PCS_ tasks:
resultid=375503932
resultid=374981802

OS = linux
CPU = AMD Phenom II X4

AdeB
4) Message boards : Number crunching : minirosetta 2.14 (Message 67173)
Posted 12 Aug 2010 by AdeB
Post:
T0624_refinement_1_5_topology_broker_SAVE_ALL_OUT.IGNORE_THE_REST_2_21730_1889_0
ERROR: ERROR: ArrayPool array size cannot be changed unless the ArrayPool is empty
ERROR:: Exit from: src/core/graph/ArrayPool.hh line: 296
BOINC:: Error reading and gzipping output datafile: default.out


AdeB
5) Message boards : Number crunching : minirosetta 2.05 (Message 65547)
Posted 13 Mar 2010 by AdeB
Post:
In workunit gunn_fragments_SAVE_ALL_OUT_-1wtyA__18642_1106 both tasks (324092645 and 323994500) ended with the same error:
ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\src\core\scoring\rms_util.cc line: 397
BOINC:: Error reading and gzipping output datafile: default.out

AdeB
6) Message boards : Number crunching : minirosetta 2.05 (Message 65481)
Posted 7 Mar 2010 by AdeB
Post:
My first Protein_interface (validation related?) error as far as I know - MacOS 10.5:

tyrsim_3gbn_2esa_Protein_interface_design_01Feb2010_17949_9_2


Outcome Success
Client state Done
Exit status 0 (0x0)

CPU time 21540.8

<core_client_version>6.10.36</core_client_version>
<![CDATA[
<stderr_txt>

[...]

# cpu_run_time_pref: 21600
======================================================
DONE :: 327 starting structures 21540.3 cpu seconds
This process generated 327 decoys from 327 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Workunit error - check skipped


One of two wingmen validated successfully after his deadline, but with far fewer decoys completed.


There is nothing wrong on your end. This is a very old (and rare) bug in the BOINC server software. Take a look here.
Wait a second, the trac item claims that the bug is fixed. Maybe it is time for Rosetta to update the server code.

AdeB
7) Message boards : Number crunching : REACHED DAILY QUOTA OF 23 RESULTS ????!!!! (Message 65470)
Posted 6 Mar 2010 by AdeB
Post:
I've got 20 2.6 GHz AMD 64 FX machines running, plus my home 3.2 GHz quad Phenom II running 4 BOINC projects. At the moment SETI and ABC have no work for some reason, and both Einstein and Rosetta are telling me I've reached the quota at 23??? Doesn't make sense to have the machines running with nothing working??? How does one change this setting, or do I have to join another project...
Most machines still have work, but this one has run out, and is getting nothing from any of the projects.
Rosetta has plenty of work available, but I don't know if this limit is based on machine, user, or what.


I looked at the results of some of your machines, here is one of them. It seems that almost all of the tasks on your machines error out with no computation at all.
All the tasks have this error message:
<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
Can't write init file: -108
</message>
]]>

This is why the quota is lowered.
As this happens on all of your machines, it suggests that there might be a problem with the way BOINC is installed on them. It is strange, though, that there are some successful results.

AdeB
8) Message boards : Number crunching : minirosetta 2.05 (Message 65311)
Posted 14 Feb 2010 by AdeB
Post:
The same error as P.P.L. and Admin.

ERROR: start_res != middle_res
ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Task 317684657

AdeB
9) Message boards : Number crunching : minirosetta 2.05 (Message 65023)
Posted 17 Jan 2010 by AdeB
Post:
Task: 311103842
Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00003_0000018_07.pdb_00003.pdb.JOB_16835_29

ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

AdeB
10) Message boards : Number crunching : minirosetta 2.03 (Message 64868)
Posted 9 Jan 2010 by AdeB
Post:
Task: 309276026
Workunit: homopt_nat2.t370_.t370_.IGNORE_THE_REST.S_00003_0000009_04.pdb_00003.pdb.JOB_16836_1
stderr out:
...
ERROR: [ERROR] Error opening RBSeg file 'S_00011_0000013_0_0_00060.pdb_00029.pdb_00011.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


AdeB
11) Message boards : Number crunching : minirosetta 2.02 (Message 64469)
Posted 13 Dec 2009 by AdeB
Post:
Task: 304011788
Workunit: broker_idealclose_kic_bin_hb_t313__IGNORE_THE_REST_16495_32
stderr out:
...
BOINC:: Worker startup. 
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 43200
CLOSING with IDEALIZATION
Hbond tripped: [2009-12-13  1: 0:32:]

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 358
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


AdeB
12) Message boards : Number crunching : Minirosetta 2.00 (Message 64406)
Posted 8 Dec 2009 by AdeB
Post:
Validate errors in workunits with the name: mix_score13_hb_rlbd_1ttz__IGNORE_THE_RESTlr13_DECOY_16324_*

- 1. ----------------------------------------------------------
Task: 303144429
Workunit: mix_score13_hb_rlbd_1ttz__IGNORE_THE_RESTlr13_DECOY_16324_936_0
CPU time: 85.64598
stderr out:
...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Fullatom mode ..
# cpu_run_time_pref: 43200
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

- 2. ----------------------------------------------------------
Task: 302775198
Workunit: mix_score13_hb_rlbd_1ttz__IGNORE_THE_RESTlr13_DECOY_16324_508_1
CPU time: 75.6415
stderr out:
...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Fullatom mode ..
# cpu_run_time_pref: 43200
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
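
Both tasks show a "CPU time" under 90 seconds next to a stderr summary that says 1201 cpu seconds. Below is a minimal sketch for pulling the summary numbers out of a stderr dump so the two can be compared side by side. The regular expressions are my own guess at the layout, based only on the lines shown in these posts, and the function name is hypothetical.

```python
import re

# Patterns modelled on the summary lines shown in the stderr output above.
DONE_RE = re.compile(
    r"DONE\s*::\s*(\d+)\s+starting structures\s+([\d.]+)\s+cpu seconds"
)
GEN_RE = re.compile(
    r"This process generated\s+(\d+)\s+decoys from\s+(\d+)\s+attempts"
)


def parse_summary(stderr_text):
    """Return the summary numbers from a stderr dump, or None if absent."""
    done = DONE_RE.search(stderr_text)
    gen = GEN_RE.search(stderr_text)
    if done is None or gen is None:
        return None
    return {
        "starting_structures": int(done.group(1)),
        "cpu_seconds": float(done.group(2)),
        "decoys": int(gen.group(1)),
        "attempts": int(gen.group(2)),
    }


SAMPLE = """\
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
"""

summary = parse_summary(SAMPLE)
reported_cpu = 85.64598  # "CPU time" listed for task 303144429
print(summary)
print("stderr minus reported CPU time:", summary["cpu_seconds"] - reported_cpu)
```

The same patterns also accept the wider column spacing seen in some other stderr dumps, because they match any run of whitespace.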


AdeB
13) Message boards : Number crunching : lr8_combine_smooth_torsion_it00 - All Errors? (Message 64238)
Posted 26 Nov 2009 by AdeB
Post:
Yes, it should be safe now.
The new 'redo' jobs should be good. :p


Most of the 'redo' jobs ended in SIGSEGV: segmentation violation on my computer.

tasks:
299914416
299948625
299957772
300000282

AdeB
14) Message boards : Number crunching : Minirosetta 2.00 (Message 64173)
Posted 24 Nov 2009 by AdeB
Post:
Downloaded 9 wu's and they all errored out. The wingman on the wu's errored out also.

update: downloaded 4 more wu's on another computer...errors


Same here:
ERROR: Value of inactive option accessed: -score:dun08_dir

example: 298867941

AdeB
15) Message boards : Number crunching : Report long-running models here (Message 63248)
Posted 10 Sep 2009 by AdeB
Post:
Long running task: 278731357
name: lr8_A_seq_score12_ss1.7_rlbd_2ccv_IGNORE_THE_REST_DECOY_14637_3189_0
application version: 1.97
OS: Linux

AdeB
16) Message boards : Number crunching : Minirosetta 1.97 (Message 63011)
Posted 22 Aug 2009 by AdeB
Post:
ERROR: Option matching -in:detect_disulfides not found in command line top-level context
in task 275059313

AdeB
17) Message boards : Number crunching : Report long-running models here (Message 62932)
Posted 14 Aug 2009 by AdeB
Post:
Long running task: 272664497
name: lr8_newhb_run02_rlbn_2apb_IGNORE_THE_REST_NATIVE_NOCON_14611_463_1
application version: 1.91
OS: Linux
CPU time: 57738.5s, 14400s + 43200s
Granted credit: 4.01992761072857

AdeB
18) Message boards : Number crunching : Report long-running models here (Message 62730)
Posted 2 Aug 2009 by AdeB
Post:
Long running task: 269551688
name: lr10_seq_score12_rlbd_1prq_IGNORE_THE_REST_DECOY_13841_3329_0
application version: 1.90
OS: Linux

AdeB
19) Message boards : Number crunching : Problems with Minirosetta 1.80 (Message 62013)
Posted 29 Jun 2009 by AdeB
Post:
Task 262080735 ended after 16 hours, which happens to be my cpu_run_time_pref + 4 hours.
And then there was a <file_xfer_error>.
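
The 16 hours can be checked against the "14400s + 43200s" in the log line of this post. A quick sketch of the arithmetic, assuming 43200 s is the cpu_run_time_pref and 14400 s is the extra 4 hours (the variable names are mine, not BOINC's):

```python
CPU_RUN_TIME_PREF_S = 43200  # assumed: the run time preference, 12 hours
EXTRA_ALLOWANCE_S = 14400    # assumed: the additional 4 hours

limit_s = CPU_RUN_TIME_PREF_S + EXTRA_ALLOWANCE_S
reported_cpu_s = 57669.2     # "CPU time" from the BOINC log line in this post

print(limit_s / 3600)                       # 16.0
print(round(reported_cpu_s - limit_s, 1))   # seconds past the limit
```

So the task was stopped a little over a minute after the combined limit was reached.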

BOINC:: CPU time: 57669.2s, 14400s + 43200s[2009- 6-29 14:48:55:] :: BOINC 
Output exists: default.out.gz
InternalDecoyCount: 0 (GZ)
======================================================
DONE ::     1 starting structures  57670.2 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
  <file_name>real_core_1.5_low200_beta_low200_start_hb_t286__IGNORE_THE_REST_13040_508_0_0</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
20) Message boards : Number crunching : Report long-running models here (Message 59804)
Posted 25 Feb 2009 by AdeB
Post:
And another one: loopbuild_reference_allmodels_hb_t328__IGNORE_THE_REST_1NRIA_8_7691_12_0


©2024 University of Washington
https://www.bakerlab.org