minirosetta 2.14

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 66499 - Posted: 6 Jun 2010, 20:57:58 UTC

I'm still getting the occasional failure with the ProteinInterfaceDesign task and its "patchdock" file after only a few seconds of processing - the example task, whose output is posted below, was created today (June 6th)

Task 312328443

ERROR: Cannot open patchdock file: 1fAc_2vg9.patchdock
ERROR:: Exit from: src/protocols/ProteinInterfaceDesign/read_patchdock.cc line: 101

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 66508 - Posted: 7 Jun 2010, 11:37:39 UTC

Long running job just finished - 28894 seconds of CPU, one decoy finished. Killed by watchdog. It continued to take checkpoints throughout the run. SegFault on completion. I have several other jobs across a few systems which appear to be heading down the same path.

All seem to have similar task names: rs_stg0_lrlx_t"xyz"__casp8_SAVE_ALL_OUT

Output follows:

Task ID 344004739
Name rs_stg0_lrlx_t447__casp8_SAVE_ALL_OUT_20806_3438_0
Workunit 314064997
Created 6 Jun 2010 19:33:49 UTC
Sent 6 Jun 2010 20:11:19 UTC
Received 7 Jun 2010 11:25:52 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 1290176
Report deadline 16 Jun 2010 20:11:19 UTC
CPU time 28896.33
stderr out

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
[2010- 6- 6 22:21:50:] :: BOINC:: Initializing ... ok.
[2010- 6- 6 22:21:50:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rs_stg0_lrlx_t447__casp8.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
BOINC:: CPU time: 28894.3s, 14400s + 14400s[2010- 6- 7 6:24:23:] :: BOINC
InternalDecoyCount: 0
======================================================
DONE :: 1 starting structures 28894.3 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish
SIGSEGV: segmentation violation
Stack trace (25 frames):
[0x992e4a3]
[0x9958378]
[0xf77eb400]
[0x8c1ac97]
[0x8e26032]
[0x8e2646e]
[0x93d1812]
[0x93d3094]
[0x93d511e]
[0x93d1195]
[0x80dac5e]
[0x80d8f91]
[0x810386e]
[0x858db3f]
[0x815324a]
[0x81755cf]
[0x80ace21]
[0x85379f7]
[0x812b7aa]
[0x812c94d]
[0x878038b]
[0x82ff325]
[0x804989b]
[0x99b42dc]
[0x8048121]

Exiting...

</stderr_txt>
]]>

Validate state Valid
Claimed credit 179.245358151844
Granted credit 95.9047142739811
application version 2.14
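
For anyone comparing logs like the one above, here is a minimal Python sketch (not part of the original post) that pulls the run-time preference, CPU seconds and decoy count out of a stderr excerpt. The function name `summarize` and the regular expressions are illustrative assumptions based only on the lines shown in this log. Reading "14400s + 14400s" as the preference plus an equal grace period before the watchdog steps in is also an interpretation of that log line, not confirmed project behaviour.

```python
import re

# Excerpt copied from the stderr output above.
STDERR_EXCERPT = """\
# cpu_run_time_pref: 14400
BOINC:: CPU time: 28894.3s, 14400s + 14400s[2010- 6- 7 6:24:23:] :: BOINC
InternalDecoyCount: 0
DONE :: 1 starting structures 28894.3 cpu seconds
This process generated 1 decoys from 1 attempts
"""


def summarize(stderr_text):
    """Extract run-time preference, CPU seconds and decoy count (hypothetical helper)."""
    pref = re.search(r"cpu_run_time_pref:\s*(\d+)", stderr_text)
    done = re.search(
        r"DONE ::\s*(\d+) starting structures\s+([\d.]+) cpu seconds", stderr_text
    )
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", stderr_text)
    return {
        "pref_s": int(pref.group(1)),
        "cpu_s": float(done.group(2)),
        "decoys": int(decoys.group(1)),
    }


summary = summarize(STDERR_EXCERPT)
# How far the task ran past the user's preferred run time.
overrun_s = summary["cpu_s"] - summary["pref_s"]
print(summary, "overrun:", round(overrun_s, 1), "s")
```

For this task the overrun is a little more than 14400 seconds, which is at least consistent with the watchdog stepping in once the run time reached roughly double the preference.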

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 66510 - Posted: 7 Jun 2010, 12:06:42 UTC

Here is the output from a second task - of the same "family" as the one reported in my previous post - differences: this one ended on its own after running an hour over the preferred time, not killed by watchdog, and no SegFault (could the SegFault have been caused by watchdog killing the task?)

20572 CPU seconds - only 2 decoys.

(both tasks were declared as "success" and both generated reasonable credit)



Task ID 344034237
Name rs_stg0_lrlx_t436__casp8_SAVE_ALL_OUT_20802_3787_0
Workunit 314090933
Created 6 Jun 2010 23:01:21 UTC
Sent 6 Jun 2010 23:13:10 UTC
Received 7 Jun 2010 11:47:34 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 1290176
Report deadline 16 Jun 2010 23:13:10 UTC
CPU time 20572.51
stderr out

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
[2010- 6- 7 1: 1:57:] :: BOINC:: Initializing ... ok.
[2010- 6- 7 1: 1:57:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rs_stg0_lrlx_t436__casp8.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 20572.2 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Valid
Claimed credit 127.612292738641
Granted credit 97.0638812927757
application version 2.14

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 66563 - Posted: 13 Jun 2010, 7:03:25 UTC

3 tasks died recently with errors

these two just say : Maximum elapsed time exceeded
no cpu time shown and no debug output

T0561_whole_SAVE_ALL_OUT_IGNORE_THE_REST_8-17_21314_677_0
T0561_whole_SAVE_ALL_OUT_IGNORE_THE_REST_3-6_21314_594_0

this one: int2_centerfirst2b_1fAc_2qwt_ProteinInterfaceDesign_23May2010_21231_230_0 is the patchdock error.

VO
Joined: 4 Nov 05
Posts: 7
Credit: 3,250,754
RAC: 0
Message 66578 - Posted: 15 Jun 2010, 15:54:28 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=313445716

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66582 - Posted: 16 Jun 2010, 2:47:03 UTC - in response to Message 66578.  
Last modified: 16 Jun 2010, 2:47:38 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=313445716


(Notes for Project Team)
Validation errors, with no apparent cause on:
rb_06_02_188_708_t000__t0571_IGNORE_THE_REST_04_05_21338

Resends all failed as well.
Rosetta Moderator: Mod.Sense

cnick6
Joined: 30 May 06
Posts: 29
Credit: 12,597,623
RAC: 0
Message 66583 - Posted: 16 Jun 2010, 6:01:57 UTC

I have one work unit that is crashing the minirosetta214 executable in Windows and Linux:

Windows TASKID (with debug info): 345248849
Linux TASKID: 345967972

Workunit:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=315213770

WU Name:

rb_06_10_202_765_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21404_3249_1


adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,637,805
RAC: 799
Message 66592 - Posted: 17 Jun 2010, 8:16:04 UTC
Last modified: 17 Jun 2010, 8:16:36 UTC

MiniRosetta 2.14 memory use seems extremely high. I noticed another process in the "Waiting for memory" state, something I don't believe I have seen before. Upon investigation, MiniRosetta was using 800+k.

Is this intentional, or is something not being free'd?
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66597 - Posted: 17 Jun 2010, 15:11:13 UTC

adrianxw, some tasks use protocols that do require more memory. These are only sent to machines that have more than the minimum memory required. I see both of your machines are reporting 4 CPUs and 2GB of memory. That's only 512MB per CPU, but I believe the check for high-memory tasks is not sensitive to the number of CPUs. So if you happened to get several high-memory tasks at the same time, that would explain the waiting for memory message.
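
As a rough illustration of that arithmetic, here is a small Python sketch. It is a deliberately simplified model: it just sums per-task memory against total RAM and ignores the operating system, other programs and any BOINC memory preferences. The helper names and the task sizes in the lists are hypothetical; only the 2GB, 4 CPU and roughly 800MB figures come from this thread.

```python
def memory_per_cpu_mb(total_mb, n_cpus):
    """Average memory available to each CPU if RAM were split evenly."""
    return total_mb / n_cpus


def fits_in_memory(task_mb_list, total_mb):
    """True if the listed tasks could all be resident at once (simplified model)."""
    return sum(task_mb_list) <= total_mb


total_mb = 2 * 1024  # 2GB machine, as reported in the post
n_cpus = 4

print(memory_per_cpu_mb(total_mb, n_cpus))             # 512.0
print(fits_in_memory([800, 800, 800], total_mb))       # False: 2400 > 2048
print(fits_in_memory([800, 300, 300, 300], total_mb))  # True: 1700 <= 2048
```

Under this model a single 800MB task fits comfortably, but three arriving together do not, which is when another task would have to wait.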

You mentioned seeing Mini using more than 800... I assume you meant MB :) Was that just one task or were several running at the same time with that usage? Task names would be helpful.
Rosetta Moderator: Mod.Sense

adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,637,805
RAC: 799
Message 66599 - Posted: 18 Jun 2010, 7:52:27 UTC

The job is finished and gone now, so I don't know which it was. This one is running right now, and has ~500M, (yes, M, that dates me a bit huh?). There are no processes waiting on here at the moment. Rosetta has quite a high work share value on both my machines so it crunches them fairly quickly; I wouldn't like to guess which wu it was that was causing the event yesterday. As I recall, it was the only Rosetta wu on the machine at that time, and it was Climate Prediction that was in the "Waiting for memory" state, not Rosetta.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 66600 - Posted: 18 Jun 2010, 13:09:14 UTC
Last modified: 18 Jun 2010, 13:09:46 UTC

I have noticed the same "waiting for memory" messages several times on my system in recent weeks and I have got one right now. For me they only pop up with Rosetta CASP9 WUs with huge protein structures.

Looking at your task history, adrianxw, you were probably processing rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_1071_0.

The one eating up my memory today is rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_19046_0 from the same batch as yours.

There is nothing to worry about with these as they free up the memory again as soon as they are completed.

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 66605 - Posted: 19 Jun 2010, 1:41:40 UTC

This errored after 22 sec.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=316550658

Sat 19 Jun 2010 11:20:51 EST|rosetta@home|Output file rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_15952_0_0 for task rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_15952_0 absent

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>


P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 66607 - Posted: 19 Jun 2010, 7:39:20 UTC

Another error. This ran for 1hr 59min; I have a four-hour run time set, with two-hour switching between projects.

It ran the first two hours O.K.; when it restarted, it failed.

eed_4_eed_1fm4_ProteinInterfaceDesign_7Jun2010_21383_177_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=316587259

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
Stack trace (11 frames):
[0x992e4a3]
[0x9958378]
[0xffffe500]
[0x84bd3da]
[0x882dfff]
[0x812b7aa]
[0x812c94d]
[0x878038b]
[0x8049a2a]
[0x99b42dc]
[0x8048121]

Exiting...

</stderr_txt>


adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,637,805
RAC: 799
Message 66638 - Posted: 22 Jun 2010, 15:24:46 UTC

Same again...

rb_06_21_217_781_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21462_3794

... two other projects stopped "Waiting for memory" 882M in use.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

cnick6
Joined: 30 May 06
Posts: 29
Credit: 12,597,623
RAC: 0
Message 66685 - Posted: 24 Jun 2010, 20:28:41 UTC

Can one of the mods please look into the low-credit issues lately?

See this thread:

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5366

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66693 - Posted: 25 Jun 2010, 3:01:55 UTC

Moderators do not have access to any credit information beyond what you see on the task and WU. The Project Team maintains all of the BOINC databases, etc., but they are pretty busy with CASP at the moment.

I can only assure you that credit is granted based on models completed, and that hooks were placed in the code to report CPU time on a per model basis so that specific protocols or proteins that have a high variability in CPU time between models can be reviewed in more detail.

Generally when credit is that dramatically low, it is the result of a long running model. In other words, if models typically take 10 minutes of CPU time, and your machine runs for an hour and has completed 6 models, and then the 7th takes 3 hours (or more and perhaps is eventually ended by the watchdog) then the credit granted is going to be on par with 70 minutes of processing rather than the 4 hours that was actually spent. This is why there is a thread for reporting long-running models.
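
The arithmetic in that example can be written out as a short Python sketch. This only reproduces the numbers from the paragraph above; the actual credit formula is not given in this thread, so the function `credited_minutes` and the idea of multiplying completed models by a typical per-model time are assumptions made purely for illustration.

```python
TYPICAL_MINUTES_PER_MODEL = 10  # "models typically take 10 minutes"


def credited_minutes(models_completed, typical=TYPICAL_MINUTES_PER_MODEL):
    """Processing time the credit would be 'on par with' (illustrative only)."""
    return models_completed * typical


# Six ordinary models followed by one that runs for three hours.
model_minutes = [10] * 6 + [180]

actual = sum(model_minutes)                       # CPU minutes actually spent
credited = credited_minutes(len(model_minutes))   # minutes the credit reflects

print(actual, credited)                           # 240 70
print(f"credited share of time spent: {credited / actual:.0%}")
```

One slow model out of seven is enough to pull the credited share well below a third of the time actually spent.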

Over time, as revisions are made and new protocols become accepted for future use, changes are found which reduce the number of such outlying long-running models. But if a new protocol is not found to produce better results than prior methods, it will not be run in the future anyway, and so tracking down the 1% outliers ends up consuming resources that could be invested into developing another new protocol.
Rosetta Moderator: Mod.Sense

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 66699 - Posted: 26 Jun 2010, 19:35:41 UTC

Three recent failures on W7

Task 347583183 ab_06_19_d000_top_broker_server_models_21455_46857_0
Task 347583182 ab_06_19_d000_top_broker_server_models_21455_46856_0
Task 347583171 ab_06_19_d000_top_broker_server_models_21455_46845_0

all failed as follows

Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
ERROR: Option file open failed for: ab_06_19_d000_top_broker_server_models.flags

</stderr_txt>
]]>

billy ewell 1931
Joined: 30 Mar 07
Posts: 14
Credit: 6,497,991
RAC: 2,714
Message 66708 - Posted: 29 Jun 2010, 22:39:10 UTC

Task ID 349192260: This is a ProteinDesignInterface unit that consumed 7.5 hours of CPU time on an Intel quad 2.66 with 4 gigs of memory. There were 98 starting structures, 98 attempts and 98 decoys resulting. It really irritates me to see the scoring results when a claimed credit amount of 130.12 was reduced to a granted amount of 36.82. This seems to be quite a COMMON result when processing the PDI work units. I am NOT a points chaser but a dedicated supporter of research science and its potential impact for mankind and the world. BUT I still wonder if perhaps 10% or more of my fairly high-quality computing power is going to waste. Three of my computers (an i7 930 and two 9400 2.66 quads) were purchased and run 24/7 solely in support of projects like Rosetta and other BOINC research initiatives.
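
To put the figures from this task side by side, here is a tiny Python sketch. It uses only the numbers quoted above (7.5 CPU hours, 98 decoys, 130.12 claimed, 36.82 granted); the variable names are illustrative, and nothing here says how the project actually computes granted credit.

```python
cpu_hours = 7.5
decoys = 98
claimed = 130.12
granted = 36.82

# Average CPU time spent on each decoy in this task.
cpu_seconds_per_decoy = cpu_hours * 3600 / decoys

# Fraction of the claimed credit that was actually granted.
granted_fraction = granted / claimed

print(f"{cpu_seconds_per_decoy:.1f} CPU seconds per decoy")
print(f"granted / claimed = {granted_fraction:.1%}")
```

The granted amount works out to a little under 30% of the claim, which is the gap being asked about.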

Am I terribly wrong here or do I have a legitimate concern as a dedicated and loyal supporter of Rosetta and the current CASP?

My account is 160868

I appreciate so very much the dedicated professional designers of this project and the loyal crunchers who particularly make it possible.

Bill: Austin, Texas USA


mhhall
Joined: 28 Mar 06
Posts: 7
Credit: 10,188,899
RAC: 19
Message 66709 - Posted: 29 Jun 2010, 23:51:34 UTC

Hi folks,
My system is currently executing WU 317305089.

BOINC is showing following properties that would seem to
indicate process is stuck and not checkpointing properly.

CPU Time at last checkpoint: 13:18:13
CPU Time : 15:20:20

Fraction done: 98.925%

Would hate to kill a job so close to completion,
but I've got to wonder if this is really going to
complete.
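
For reference, the gap between the two times quoted above can be worked out with a short Python sketch. The helper names are hypothetical, and the times are assumed to be in H:MM:SS form as displayed in the task properties.

```python
def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total):
    """Convert a number of seconds back to 'H:MM:SS'."""
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


last_checkpoint = hms_to_seconds("13:18:13")
cpu_time = hms_to_seconds("15:20:20")

# CPU time accumulated since the last checkpoint was written.
gap = cpu_time - last_checkpoint
print(seconds_to_hms(gap))  # 2:02:07
```

So just over two hours of CPU time have passed without a new checkpoint, which is the work that would be lost if the task were restarted now.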

Jochen
Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66713 - Posted: 30 Jun 2010, 11:33:23 UTC - in response to Message 66709.  

Would hate to kill a job so close to completion,
but I've got to wonder if this is really going to
complete.

Does this task still create CPU load? If yes, leave it running; if not, try restarting the BOINC Manager (make sure the client processes are stopped as well). If it still doesn't create CPU load after restarting the manager, you should abort it.
cu

Joe





©2024 University of Washington
https://www.bakerlab.org