Rosetta Beta 6.00

Message boards : Number crunching : Rosetta Beta 6.00

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1876
Credit: 8,325,818
RAC: 10,411
Message 108497 - Posted: 23 Aug 2023, 8:40:59 UTC

A lot of 6.03 errors...

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
10:27:24 (25460): called boinc_finish(1)

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,864,979
RAC: 23,217
Message 108498 - Posted: 23 Aug 2023, 13:23:10 UTC - in response to Message 108497.  

A lot of 6.03 errors...

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
10:27:24 (25460): called boinc_finish(1)

Yup. The queue, which was over 300k tasks when I saw it last night, seems to have been removed already, so the problem has been noticed.
Most tasks crash out within 20 seconds, but I've had a few run for several hours before crashing out.
Also, the ones that do run don't seem to checkpoint, but I've got two (out of 40-50) that do and I'm hoping might even complete successfully.

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,864,979
RAC: 23,217
Message 108499 - Posted: 23 Aug 2023, 14:20:48 UTC - in response to Message 108498.  

A lot of 6.03 errors...

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
10:27:24 (25460): called boinc_finish(1)

Yup. The queue, which was over 300k tasks when I saw it last night, seems to have been removed already, so the problem has been noticed.
Most tasks crash out within 20 seconds, but I've had a few run for several hours before crashing out.
Also, the ones that do run don't seem to checkpoint, but I've got two (out of 40-50) that do and I'm hoping might even complete successfully.

And they did complete successfully, so not a complete waste of time (just mostly a waste of time)

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,858,040
RAC: 1,931
Message 108500 - Posted: 23 Aug 2023, 20:47:09 UTC
Last modified: 23 Aug 2023, 20:47:58 UTC

I'm getting this error on the work units that did not fail within 29 seconds but instead ran for many, many hours (my preset is 6 hours).

Task 1533688974
Name 7hal_NME_af2_hal_07_283_SAVE_ALL_OUT_2961446_35_0
Workunit 1365145536
Created 23 Aug 2023, 0:58:13 UTC
Sent 23 Aug 2023, 1:35:20 UTC
Report deadline 26 Aug 2023, 1:35:20 UTC
Received 23 Aug 2023, 20:22:44 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 0 (0x00000000)
Computer ID 1503952
Run time 12 hours 20 min 35 sec
CPU time 12 hours 7 min 21 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 7.87 GFLOPS
Application version Rosetta Beta v6.00
x86_64-pc-linux-gnu
Peak working set size 363.05 MB
Peak swap size 430.13 MB
Peak disk usage 24.06 MB
Stderr output

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.00_x86_64-pc-linux-gnu @7hal_NME_af2_hal_07_283.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_0f7f01a1b07/database
======================================================
DONE :: 1 starting structures 43641.1 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
06:20:36 (24284): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>7hal_NME_af2_hal_07_283_SAVE_ALL_OUT_2961446_35_0_r1774002546_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>

I have many others like this.

I have also had 2 that ran normally for 6-odd hours and completed without error. But only 2; the ones left on the computer have been running for over 10 hours already and are only at 60% or so.

This is on Linux.

Conan

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,864,979
RAC: 23,217
Message 108501 - Posted: 24 Aug 2023, 2:52:20 UTC - in response to Message 108500.  

I'm getting this error on the work units that did not fail within 29 seconds but instead ran for many, many hours (my preset is 6 hours).

Task 1533688974
[...]
Outcome Computation error
Client state Compute error
[...]
Run time 12 hours 20 min 35 sec
CPU time 12 hours 7 min 21 sec
[...]
<file_name>7hal_NME_af2_hal_07_283_SAVE_ALL_OUT_2961446_35_0_r1774002546_0</file_name>
<error_code>-161 (not found)</error_code>

I have many others like this.

I have also had 2 that ran normally for 6-odd hours and completed without error. But only 2; the ones left on the computer have been running for over 10 hours already and are only at 60% or so.

This is on Linux.

Conan

Yeah, this is exactly the pattern I saw on my Windows box too.
Majority: Error within 20secs
Minority: Fails to checkpoint for many hours, 1st model completes, filename error on upload
Exception: Completes ok
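
For anyone who wants to sort their own task list into the same three buckets, here is a minimal Python sketch. The function name, the bucket labels, and the 60-second cut-off are assumptions made for illustration (the thread only reports quick failures at around 20 to 29 seconds); the "upload failure" marker is taken from the stderr output pasted above.

```python
# Hypothetical cut-off: the thread reports quick failures at ~20-29 seconds,
# so 60 seconds is an assumed, generous threshold.
QUICK_FAIL_SECONDS = 60


def classify_task(outcome: str, run_seconds: float, stderr: str) -> str:
    """Sort a finished task into one of the patterns described in this thread."""
    if outcome == "Success":
        return "completed ok"
    if run_seconds <= QUICK_FAIL_SECONDS:
        return "quick failure"
    # The long-running failures in this thread all end with an upload failure
    # reported in the <message> section of the stderr output.
    if "upload failure" in stderr:
        return "long run, upload failure"
    return "other error"
```

Fed with the outcome, run time in seconds, and stderr text of each task, it gives a rough count of how much time went into each bucket.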

Some rb tasks have just come down to be going on with, but very few

mrchips
Joined: 11 Nov 09
Posts: 9
Credit: 12,051,078
RAC: 15,834
Message 108502 - Posted: 24 Aug 2023, 18:40:38 UTC

ALL mine have failed

Outcome Computation error
Client state Compute error
Exit status 0 (0x00000000)
Computer ID 6260865
Run time 10 hours 1 min
CPU time 9 hours 55 min 13 sec
Validate state Invalid

10 hours wasted. I will try to abort these when I see them....

Jean-David Beyer
Joined: 2 Nov 05
Posts: 178
Credit: 5,718,110
RAC: 3,438
Message 108503 - Posted: 24 Aug 2023, 19:59:44 UTC - in response to Message 108502.  

ALL mine have failed


ALL seven of mine failed too.
This one ran a long time. The others failed pretty fast:

Task 1533723354
Name 	7hal_NME_af2_hal_07_73_SAVE_ALL_OUT_2961577_97_0
Workunit 	1365167789
Created 	23 Aug 2023, 3:30:56 UTC
Sent 	23 Aug 2023, 3:59:17 UTC
Report deadline 	26 Aug 2023, 3:59:17 UTC
Received 	24 Aug 2023, 15:31:30 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	0 (0x00000000)
Computer ID 	5910575
Run time 	3 hours 47 min 17 sec
CPU time 	3 hours 43 min 42 sec
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	6.02 GFLOPS
Application version 	Rosetta Beta v6.00
x86_64-pc-linux-gnu
Peak working set size 	353.82 MB
Peak swap size 	427.60 MB
Peak disk usage 	24.05 MB
Stderr output

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.00_x86_64-pc-linux-gnu @7hal_NME_af2_hal_07_73.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_0f7f01a1b07/database
======================================================
DONE ::     1 starting structures  13422.1 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
11:08:36 (3335130): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>7hal_NME_af2_hal_07_73_SAVE_ALL_OUT_2961577_97_0_r1168317456_0</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,864,979
RAC: 23,217
Message 108504 - Posted: 24 Aug 2023, 23:21:52 UTC - in response to Message 108503.  

ALL mine have failed

ALL seven of mine failed too.
This one ran a long time. The others failed pretty fast:

If tasks fail quickly, fine.
If tasks don't fail quickly, check the properties of the task first.
- If it hasn't checkpointed, certainly delete it - invariably no good will come of it.
- If it <has> checkpointed (a rarity) let it run. These odd few do seem to succeed based on my limited sample.

If it turns out this advice is wrong, please do come back and correct me.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1876
Credit: 8,325,818
RAC: 10,411
Message 108505 - Posted: 25 Aug 2023, 5:32:44 UTC - in response to Message 108503.  

<message>
upload failure: <file_xfer_error>
<file_name>7hal_NME_af2_hal_07_73_SAVE_ALL_OUT_2961577_97_0_r1168317456_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>


All my failed WUs have a slightly different error code:
<message>
upload failure: <file_xfer_error>
<file_name>7hal_nme_af2_hal_07_73_SAVE_ALL_OUT_2961656_162_0_r667317858_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
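
To compare error codes across many failed tasks, the codes can be pulled out of the pasted stderr text. This is only a sketch: the function name `upload_errors` is hypothetical, and it assumes the `<error_code>` lines always look like the two shown in this thread (a number followed by a description in parentheses).

```python
import re
from typing import List, Tuple

# Matches e.g. "<error_code>-161 (not found)</error_code>".
# The description is matched lazily up to the closing tag so that nested
# parentheses, as in "-240 (stat() failed)", are kept intact.
ERROR_CODE_RE = re.compile(r"<error_code>\s*(-?\d+)\s*\((.*?)\)\s*</error_code>")


def upload_errors(stderr: str) -> List[Tuple[int, str]]:
    """Return (code, description) pairs for every <error_code> element found."""
    return [(int(code), text) for code, text in ERROR_CODE_RE.findall(stderr)]
```

Running it over the stderr of each failed task makes it easy to see whether a host is getting -161, -240, or a mix of both.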

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1876
Credit: 8,325,818
RAC: 10,411
Message 108506 - Posted: 25 Aug 2023, 5:36:00 UTC - in response to Message 108504.  

If tasks fail quickly, fine.
If tasks don't fail quickly, check the properties of the task first.
- If it hasn't checkpointed, certainly delete it - invariably no good will come of it.
- If it <has> checkpointed (a rarity) let it run. These odd few do seem to succeed based on my limited sample.

If it turns out this advice is wrong, please do come back and correct me.


After over 9 hours of running, all errored except 3 WUs that finished OK.

P.S. I don't see the checkpoint argument in the properties

.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 108512 - Posted: 25 Aug 2023, 18:27:40 UTC

I have not been paying much attention to Rosetta lately, so I didn't notice till just now that a broken `beta` has wasted 20 hours stuck in a loop.

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,864,979
RAC: 23,217
Message 108514 - Posted: 25 Aug 2023, 23:03:48 UTC - in response to Message 108506.  

If tasks fail quickly, fine.
If tasks don't fail quickly, check the properties of the task first.
- If it hasn't checkpointed, certainly delete it - invariably no good will come of it.
- If it <has> checkpointed (a rarity) let it run. These odd few do seem to succeed based on my limited sample.

If it turns out this advice is wrong, please do come back and correct me.

After over 9hs of running, all errors except 3 wus ok.

P.S. I don't see the checkpoint argument in the properties

3 out of however many is better than I'm getting tbh.
Only rarely do I see two good tasks at a time.

Regarding checkpointing, select one task and click on Properties

CPU time 00:40:29
CPU time since checkpoint 00:05:13

If they show the same amount of time after 15 minutes or so then it's not checkpointing at all, so abort it straight away.
If they're different, like above, you'll be lucky, in my experience, and it will report correctly and give proper credit too.
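
The check described above can be written down as a small Python sketch. The function names are hypothetical, the input is assumed to be the `HH:MM:SS` strings shown in the task Properties, and the 15-minute grace period comes from the rule of thumb in this post rather than from anything official.

```python
from typing import Optional


def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string, as shown in task Properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def has_checkpointed(cpu_time: str, since_checkpoint: str,
                     grace_seconds: int = 15 * 60) -> Optional[bool]:
    """True if a checkpoint was written, False if not, None if too early to tell."""
    total = to_seconds(cpu_time)
    since = to_seconds(since_checkpoint)
    if since < total:
        # Time since the last checkpoint is shorter than total CPU time,
        # so at least one checkpoint has been written.
        return True
    if total < grace_seconds:
        # Assumed grace period: the task may simply not have reached
        # its first checkpoint yet.
        return None
    return False
```

With the values above, `has_checkpointed("00:40:29", "00:05:13")` reports a checkpointed task, while two identical times after the grace period would mark the task as a candidate for aborting.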

I know it's not great advice, but it's all I have to offer anyone

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,864,979
RAC: 23,217
Message 108515 - Posted: 25 Aug 2023, 23:08:40 UTC - in response to Message 108512.  

I have not been paying much attention to Rosetta lately, so I didn't notice till just now that a broken `beta` has wasted 20 hours stuck in a loop.

I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now.
Tasks, not the app.
I've just grabbed a few Rosetta 4.20 "rb" tasks and all are running well fwiw

mmonnin
Joined: 2 Jun 16
Posts: 54
Credit: 20,058,207
RAC: 684
Message 108518 - Posted: 25 Aug 2023, 23:52:55 UTC

I wish I had checked task credit before wasting 12 hours of run time per task on 32 completed tasks. I have now aborted many others that had not checkpointed.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1876
Credit: 8,325,818
RAC: 10,411
Message 108523 - Posted: 26 Aug 2023, 8:16:08 UTC - in response to Message 108515.  

I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now.
Tasks, not the app.


But this family wasn't run on Ralph for testing before going to production.
So, the usual waste of time and resources.

.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 108526 - Posted: 26 Aug 2023, 12:52:29 UTC - in response to Message 108515.  

I have not been paying much attention to Rosetta lately, so I didn't notice till just now that a broken `beta` has wasted 20 hours stuck in a loop.

I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now.
Tasks, not the app.
I've just grabbed a few Rosetta 4.20 "rb" tasks and all are running well fwiw

These Hal7000 tasks have got a mind of their own . . . 7hal_nme_af2_hal_07
Hmm . . . wasn't there some computer that said "I'm sorry Dave, but I can't allow you to process that"?

.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 108528 - Posted: 26 Aug 2023, 13:59:17 UTC

Oops, turns out that one was HAL 9000.
I just looked it up on the wiki.
I must have been thinking of HAL7600, the one that works with Win7,
and I missed the edit hour.

Jean-David Beyer
Joined: 2 Nov 05
Posts: 178
Credit: 5,718,110
RAC: 3,438
Message 108529 - Posted: 26 Aug 2023, 15:01:54 UTC - in response to Message 108515.  

I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now.
Tasks, not the app.


You may be right. Most of my beta ones are running a long time before failing. Here is one. Of those I received lately, all have run quite a long time and they are all 7hal.

Task 1534000991
Name 	7hal_nme_af2_hal_07_313_SAVE_ALL_OUT_2961707_989_1
Workunit 	1365344374
Created 	25 Aug 2023, 1:52:11 UTC
Sent 	25 Aug 2023, 1:52:15 UTC
Report deadline 	28 Aug 2023, 1:52:15 UTC
Received 	26 Aug 2023, 12:18:43 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	0 (0x00000000)
Computer ID 	5910575
Run time 	3 hours 20 min 58 sec
CPU time 	3 hours 19 min 31 sec
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	6.02 GFLOPS
Application version 	Rosetta Beta v6.00
x86_64-pc-linux-gnu
Peak working set size 	352.36 MB
Peak swap size 	426.14 MB
Peak disk usage 	24.05 MB
Stderr output

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.00_x86_64-pc-linux-gnu @7hal_nme_af2_hal_07_313.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_0f7f01a1b07/database
======================================================
DONE ::     1 starting structures  11971.5 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
07:40:01 (3501495): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>7hal_nme_af2_hal_07_313_SAVE_ALL_OUT_2961707_989_1_r394580450_0</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>


The 4.20 ones seem to run just fine.

Task 1533981934
Name 	rb_08_24_544889_539739_ab_t000__h002_robetta_IGNORE_THE_REST_04_12_2961726_20_0
Workunit 	1365333581
Created 	24 Aug 2023, 22:04:43 UTC
Sent 	24 Aug 2023, 22:34:04 UTC
Report deadline 	27 Aug 2023, 22:34:04 UTC
Received 	26 Aug 2023, 13:07:40 UTC
Server state 	Over
Outcome 	Success
Client state 	Done
Exit status 	0 (0x00000000)
Computer ID 	5910575
Run time 	7 hours 52 min 1 sec
CPU time 	7 hours 46 min 47 sec
Validate state 	Valid
Credit 	423.06
Device peak FLOPS 	6.02 GFLOPS
Application version 	Rosetta v4.20
x86_64-pc-linux-gnu
Peak working set size 	988.96 MB
Peak swap size 	1,130.77 MB
Peak disk usage 	31.53 MB
Stderr output

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @rb_08_24_544889_539739_ab_t000__h002_robetta_FLAGS -in::file::fasta t000__h002.fasta -in:file:boinc_wu_zip rb_08_24_544889_539739_ab_t000__h002_robetta.zip -frag3 rb_08_24_544889_539739_ab_t000__h002_robetta.200.3mers.index.gz -fragA rb_08_24_544889_539739_ab_t000__h002_robetta.200.12mers.index.gz -fragB rb_08_24_544889_539739_ab_t000__h002_robetta.200.4mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1370977
Using database: database_357d5d93529_n_methyl/minirosetta_database
======================================================
DONE ::     1 starting structures  28007.6 cpu seconds
This process generated     24 decoys from      24 attempts
======================================================
BOINC :: WS_max 1.01981e+09
09:07:16 (3491732): called boinc_finish(0)

</stderr_txt>
]]>

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,864,979
RAC: 23,217
Message 108539 - Posted: 28 Aug 2023, 17:34:18 UTC - in response to Message 108529.  

I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now.
Tasks, not the app.

You may be right. Most of my beta ones are running a long time before failing. Here is one. Of those I received lately, all have run quite a long time and they are all 7hal

The 4.20 ones seem to run just fine.

The pattern looks consistent as I look back over my task history.
The proof will be if we get a different batch of Beta 6.03 tasks and they run ok.
We wait in hope.

Jeff
Joined: 24 Jan 15
Posts: 4
Credit: 1,185,493
RAC: 989
Message 108540 - Posted: 28 Aug 2023, 21:12:22 UTC

I have been a participant in Rosetta@home for 8 years, and only rarely do my allocated tasks fail due to computation errors. Yet lately, all but one of about 20 of my tasks on the Beta 6.03 app with the 7hal prefix in the task name have ended with a 'computation error' message. Sometimes this happens within a few moments of starting, but much more frequently it happens after the task has run for many times longer than the original estimated 'remaining time'.

I want to process as many rosetta tasks as I can, but a lot of my computation time is wasted by this problem. I expect this is also a problem for rosetta@home, because allocated tasks are not successfully processed by users who also experience this problem.

Does anyone know what accounts for this? Does anyone know how I can deal with this problem?

©2024 University of Washington
https://www.bakerlab.org