Rosetta 4.0+

Admin
Project administrator
Joined: 1 Jul 05
Posts: 5172
Credit: 0
RAC: 0
Message 90506 - Posted: 11 Mar 2019, 19:08:51 UTC

I've talked to the researcher who submitted these jobs and I've also updated the validator to hopefully address this issue. Let us know if this continues.
ID: 90506
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1113
Credit: 4,699,572
RAC: 5,586
Message 90510 - Posted: 14 Mar 2019, 17:11:53 UTC

After 6h...
1062813667

<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_x86_64.exe @rb_03_12_1699_1880_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -psipred_ss2 t000_.spider3_ss2 -kill_hairpins t000_.nobuformat.spider3_ss2 -jumps:pairing_file t000_.fasta.bbcontacts.jumps -abinitio::use_filters false -skip_convergence_check -jumps:overlap_chainbreak -seq_sep_stages 1 1 1 -ramp_chainbreaks -sep_switch_accelerate 0.8 -jumps:random_sheets 1 2 1 2 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_03_12_1699_1880_ab_t000__robetta.zip -frag3 rb_03_12_1699_1880_ab_t000__robetta.200.3mers.index.gz -fragA rb_03_12_1699_1880_ab_t000__robetta.200.7mers.index.gz -fragB rb_03_12_1699_1880_ab_t000__robetta.200.8mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1726629
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 22083s, 14400s + 7200s[2019- 3-14 15:20:59:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 22083 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
15:20:59 (11736): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>rb_03_12_1699_1880_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_08_07_821799_65_0_r858640546_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>

ID: 90510
Juha
Joined: 28 Mar 16
Posts: 13
Credit: 705,034
RAC: 0
Message 90524 - Posted: 15 Mar 2019, 18:39:07 UTC - in response to Message 90496.  

@David E Kim

I'll look into this.


Could you also take a look at Linux 4.08 x86_64 version?

I don't think I'm exaggerating much if I say it's crashing two thirds of the tasks on my machine. I have an older Linux machine, and the previous 4.07 x86_64 version ran just fine on it; failed tasks were rare. What's curious is that if a task that failed with 4.08 gets a second try on a Windows machine, it always succeeds. That suggests a bug in the app rather than in the tasks.
ID: 90524
Jim1348
Joined: 19 Jan 06
Posts: 450
Credit: 20,996,646
RAC: 58,823
Message 90528 - Posted: 17 Mar 2019, 15:41:36 UTC - in response to Message 90524.  

I have an older Linux machine, and the previous 4.07 x86_64 version ran just fine on it; failed tasks were rare. What's curious is that if a task that failed with 4.08 gets a second try on a Windows machine, it always succeeds. That suggests a bug in the app rather than in the tasks.

Try a later Linux kernel. I had problems with the earlier ones on my Ryzens also, but now they work fine.
ID: 90528
Link
Joined: 4 May 07
Posts: 295
Credit: 359,460
RAC: 0
Message 90530 - Posted: 18 Mar 2019, 9:23:59 UTC

rb_03_17_1833_1986__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_821983_78_0

Peak working set size: 1,044.39 MB
Peak swap size: 1,403.04 MB

Error while computing, stderr output:
<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe -run:protocol jd2_scripting @flags_rb_03_17_1833_1986__t000__0_C2_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_03_17_1833_1986__t000__0_C2_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1235712
Starting watchdog...
Watchdog active.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00000000 

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 7.9.0


Dump Timestamp    : 03/18/19 08:48:30
Install Directory : 
Data Directory    : C:\ProgramData\BOINC
Project Symstore  : https://boinc.bakerlab.org/rosetta/symstore
LoadLibraryA( C:\ProgramData\BOINC\dbghelp.dll ): GetLastError = 126
Loaded Library    : dbghelp.dll
LoadLibraryA( C:\ProgramData\BOINC\symsrv.dll ): GetLastError = 126
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\srcsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\version.dll ): GetLastError = 126
Loaded Library    : version.dll
SymInitialize(): GetLastError = 8
*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 73154, Write: 0, Other 12643

- I/O Transfers Counters -
Read: 0, Write: 198584, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 247488, QuotaPeakPagedPoolUsage: 247616
QuotaNonPagedPoolUsage: 33104, QuotaPeakNonPagedPoolUsage: 33104

- Virtual Memory Usage -
VirtualSize: 2120523776, PeakVirtualSize: 2125643776

- Pagefile Usage -
PagefileUsage: 1222176768, PeakPagefileUsage: 1477279744

- Working Set Size -
WorkingSetSize: 582373376, PeakWorkingSetSize: 1095122944, PageFaultCount: 14848221

*** Dump of thread ID 5800 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 726718720.000000, User Time: 132472659968.000000, Wait Time: 21575628.000000

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00000000 


*** Dump of thread ID 9824 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 2031250.000000, User Time: 625000.000000, Wait Time: 21575628.000000


*** Dump of thread ID 428 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 312500.000000, User Time: 0.000000, Wait Time: 21575524.000000



*** Debug Message Dump ****


*** Foreground Window Data ***
    Window Name      : 
    Window Class     : 
    Window Process ID: 0
    Window Thread ID : 0

Exiting...

</stderr_txt>
]]>

.
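
A note on reading the process statistics in the dump above: the counters are in bytes, and the application is the 32-bit build (rosetta_4.07_windows_intelx86.exe). A 32-bit Windows process normally has 2 GiB of user address space unless it is built large-address-aware, so a virtual size this close to that limit might explain the access violation at address 0x00000000 (for example, a failed allocation that returned a null pointer). That is an inference from the numbers, not something confirmed in this thread. A minimal Python sketch of the arithmetic, with the helper name `to_mib` made up for illustration:

```python
# Counters copied from the debugger dump above (all values in bytes).
counters = {
    "PeakWorkingSetSize": 1095122944,
    "PeakPagefileUsage": 1477279744,
    "PeakVirtualSize": 2125643776,
}

# Assumption: the default 2 GiB user address space of a 32-bit Windows process.
USER_ADDRESS_SPACE = 2 * 1024**3


def to_mib(n_bytes: int) -> float:
    """Convert bytes to MiB (hypothetical helper, not part of BOINC)."""
    return n_bytes / 1024**2


for name, value in counters.items():
    share = value / USER_ADDRESS_SPACE
    print(f"{name}: {to_mib(value):,.2f} MiB ({share:.1%} of 2 GiB)")
```

The first figure comes out at 1,044.39 MiB, which agrees with the "Peak working set size" reported at the top of the post, and the peak virtual size is within about 1% of the 2 GiB limit.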
ID: 90530
Admin
Project administrator
Joined: 1 Jul 05
Posts: 5172
Credit: 0
RAC: 0
Message 90571 - Posted: 24 Mar 2019, 0:55:08 UTC

This was quite a large protein.
ID: 90571
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1113
Credit: 4,699,572
RAC: 5,586
Message 90589 - Posted: 27 Mar 2019, 13:33:49 UTC

Some WUs fail with this error (e.g. 1064625564):

-529697949 (0xE06D7363) Unknown error code
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7662C5AF

Engaging BOINC Windows Runtime Debugger...
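
The two numbers in the first line of this report are the same value: BOINC shows the exit code as a signed 32-bit integer, while the debugger prints it as unsigned hex. 0xE06D7363 is the code the Microsoft C++ runtime uses for a thrown C++ exception, which fits the "Out Of Memory (C++ Exception)" reason. A small Python sketch of the conversion (the function name is hypothetical):

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08X}"


print(exit_code_to_hex(-529697949))   # the C++ exception code from this post
print(exit_code_to_hex(-1073741819))  # the access violation reported earlier in the thread
```

This prints 0xE06D7363 and 0xC0000005 respectively.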

ID: 90589
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1113
Credit: 4,699,572
RAC: 5,586
Message 90596 - Posted: 30 Mar 2019, 18:43:57 UTC

Other WUs fail after 80-90 minutes:
(0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe -run:protocol jd2_scripting @flags_rb_03_27_2191_2381__t000__3_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_03_27_2191_2381__t000__3_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2997016
Starting watchdog...
Watchdog active.

ID: 90596
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1113
Credit: 4,699,572
RAC: 5,586
Message 90610 - Posted: 4 Apr 2019, 7:45:42 UTC

Again, memory errors on some WUs:

(unknown error) - exit code -529697949 (0xe06d7363)</message>

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76F2C632


Could an admin or developer please debug this app?
ID: 90610
Admin
Project administrator
Joined: 1 Jul 05
Posts: 5172
Credit: 0
RAC: 0
Message 90612 - Posted: 4 Apr 2019, 17:31:42 UTC

These memory errors are due to large proteins that are being submitted to our structure prediction server, Robetta. I increased the rsc_memory_bound for these jobs depending on the sequence length but it looks like I should increase the bound further. Sorry for any inconvenience.
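
For readers unfamiliar with the setting: rsc_memory_bound is the per-workunit memory estimate, in bytes, that BOINC uses when deciding which hosts can take a job. Scaling it with sequence length could look roughly like the sketch below. The base value, per-residue increment, and cap are placeholders chosen for illustration; the actual formula used for these Robetta jobs is not given in this thread.

```python
def memory_bound_bytes(sequence_length: int,
                       base_mib: int = 500,
                       per_residue_mib: float = 2.0,
                       cap_mib: int = 4000) -> int:
    """Hypothetical rsc_memory_bound estimate that grows with sequence length.

    All default values are illustrative placeholders, not the project's settings.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    # Linear growth with length, clamped to an upper limit.
    mib = min(base_mib + per_residue_mib * sequence_length, cap_mib)
    return int(mib * 1024**2)


for length in (100, 500, 1500):
    print(length, memory_bound_bytes(length))
```

A cap matters for the 32-bit Windows build in particular, since raising the bound beyond what that process can address would not help those hosts.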
ID: 90612
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1113
Credit: 4,699,572
RAC: 5,586
Message 90616 - Posted: 5 Apr 2019, 7:28:35 UTC - in response to Message 90612.  

Sorry for any inconvenience.

No problem, thanks for the answer.

P.S.
Do you plan to release a new version of the app, with updated protocols and functions?
ID: 90616
sgaboinc
Joined: 2 Apr 14
Posts: 276
Credit: 202,244
RAC: 222
Message 90624 - Posted: 5 Apr 2019, 20:11:05 UTC
Last modified: 5 Apr 2019, 20:13:35 UTC

this task failed with
https://boinc.bakerlab.org/rosetta/result.php?resultid=1066550803
std::cerr: Exception was thrown: 

File: src/core/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: nan

This was Rosetta v4.07 on 64-bit Linux. I'm not sure whether it is related to R@h or to the model itself.
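
The "nan" in the message means the chi angle was already not-a-number by the time it reached the range check. Any comparison with NaN is false, so a NaN can never pass a test like -180 <= chi <= 180, whatever went wrong upstream. A minimal Python illustration of that behaviour (this is not Rosetta's code, and `check_chi` is a made-up name):

```python
import math


def check_chi(chi: float) -> float:
    """Raise if chi is outside [-180, 180]; NaN always fails this test."""
    if not (-180.0 <= chi <= 180.0):
        raise ValueError(f"chi angle must be between -180 and 180: {chi}")
    return chi


print(check_chi(-65.0))          # a valid angle passes through unchanged
try:
    check_chi(float("nan"))
except ValueError as err:
    print(err)                   # the message ends in "nan", as in the report
print(math.isnan(float("nan")))  # an explicit NaN test, useful for earlier detection
```

So the range check is only where the problem surfaced; the NaN itself would have been produced earlier in the calculation.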
ID: 90624
Admin
Project administrator
Joined: 1 Jul 05
Posts: 5172
Credit: 0
RAC: 0
Message 90625 - Posted: 5 Apr 2019, 22:47:35 UTC - in response to Message 90616.  

There are no immediate plans for a new version release. But researchers are working on new methods that will eventually get put into production on R@h. I'm not sure about the timeline though.
ID: 90625
Terrible T
Joined: 29 Dec 16
Posts: 4
Credit: 1,333,030
RAC: 267
Message 90629 - Posted: 7 Apr 2019, 7:43:35 UTC - in response to Message 90612.  

Apparently these big proteins still need more memory?

I have now had several tasks from this series fail after 20,000 seconds of computing.
FYI:

WU 960589108

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -529697949 (0xe06d7363)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe @rb_04_03_2468_2618_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -psipred_ss2 t000_.spider3_ss2 -kill_hairpins t000_.nobuformat.spider3_ss2 -abinitio::use_filters true -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_03_2468_2618_ab_t000__robetta.zip -frag3 rb_04_03_2468_2618_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_03_2468_2618_ab_t000__robetta.200.17mers.index.gz -fragB rb_04_03_2468_2618_ab_t000__robetta.200.9mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2189964
Starting watchdog...
Watchdog active.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x74C345A2

Engaging BOINC Windows Runtime Debugger...
ID: 90629
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1113
Credit: 4,699,572
RAC: 5,586
Message 90630 - Posted: 7 Apr 2019, 9:26:57 UTC

1066755948

after 6hrs of calculation
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 21607.1 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
10:23:44 (11200): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>rb_04_05_2574_2712__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_827890_682_0_r1680908808_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>

ID: 90630
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1113
Credit: 4,699,572
RAC: 5,586
Message 90631 - Posted: 7 Apr 2019, 9:28:26 UTC - in response to Message 90625.  

But researchers are working on new methods that will eventually get put into production on R@h. I'm not sure about the timeline though.


Come on guys, we are ready for a lot of new science... :-))
ID: 90631
Trotador
Joined: 30 May 09
Posts: 107
Credit: 123,558,692
RAC: 77,946
Message 90638 - Posted: 8 Apr 2019, 17:07:43 UTC - in response to Message 90463.  

These errors are back in many units. The default processing time (8 hours) gets extended, and the units then fail after 12 hours of processing, which is frustrating. Are the results still used, or is the work wasted?

All units starting with rb_04_06_2593_2728_ab_t000 seem to be affected on Linux.


<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @rb_04_06_2593_2728_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -psipred_ss2 t000_.spider3_ss2 -kill_hairpins t000_.nobuformat.spider3_ss2 -jumps:pairing_file t000_.fasta.bbcontacts.jumps -abinitio::use_filters false -skip_convergence_check -jumps:overlap_chainbreak -seq_sep_stages 1 1 1 -ramp_chainbreaks -sep_switch_accelerate 0.8 -jumps:random_sheets 7 2 1 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_06_2593_2728_ab_t000__robetta.zip -frag3 rb_04_06_2593_2728_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_06_2593_2728_ab_t000__robetta.200.6mers.index.gz -fragB rb_04_06_2593_2728_ab_t000__robetta.200.4mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1914528
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43619.4s, 14400s + 28800s[2019- 4- 8 15:26:19:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43619.4 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
15:26:19 (15379): called boinc_finish(0)

</stderr_txt>
]]>
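
The 12 hours mentioned above line up with the watchdog line in this log: 14400 s + 28800 s = 43200 s, i.e. the 8-hour target (-cpu_run_time 28800) plus what looks like a 4-hour allowance. Reading the line that way is an assumption based on the numbers, not on documented behaviour. A short Python sketch that parses the line under that assumed layout:

```python
import re

LINE = "BOINC:: CPU time: 43619.4s, 14400s + 28800s[2019- 4- 8 15:26:19:] :: BOINC"

# Assumed layout: "<cpu time>s, <allowance>s + <target run time>s".
PATTERN = re.compile(r"CPU time: ([\d.]+)s, (\d+)s \+ (\d+)s")


def parse_watchdog_line(line: str) -> tuple:
    """Return (cpu_seconds, allowance_seconds, target_seconds) from the log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not match the assumed format")
    cpu, allowance, target = match.groups()
    return float(cpu), int(allowance), int(target)


cpu, allowance, target = parse_watchdog_line(LINE)
limit = allowance + target
print(f"limit: {limit} s = {limit / 3600:.1f} h")
print(f"over limit by {cpu - limit:.1f} s")
```

Under this reading the task was stopped about seven minutes after passing the 12-hour limit. The same pattern fits the earlier Windows report in this thread ("22083s, 14400s + 7200s"), where the limit would be 6 hours.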
ID: 90638
Link
Joined: 4 May 07
Posts: 295
Credit: 359,460
RAC: 0
Message 90639 - Posted: 8 Apr 2019, 18:09:26 UTC

rb_03_27_2191_2381__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_827088_870

Both results ended with "incorrect function".

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe -run:protocol jd2_scripting @flags_rb_03_27_2191_2381__t000__4_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_03_27_2191_2381__t000__4_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2995742
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>

.
ID: 90639
James W
Joined: 25 Nov 12
Posts: 101
Credit: 996,246
RAC: 4,487
Message 90659 - Posted: 12 Apr 2019, 7:03:24 UTC

Application version: Rosetta v4.07 windows_intelx86
Device: 1759960, Task: 1066832102, and WU 960999090.
Name: rb_04_04_2540_2669__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_827813_1998_0
Status: Error while computing
Exit status: -529697949 (0xE06D7363) Unknown error code
<core_client_version>7.14.2</core_client_version>
<![CDATA[<message>(unknown error) - exit code -529697949 (0xe06d7363)</message><stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe -run:protocol jd2_scripting @flags_rb_04_04_2540_2669__t000__0_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_04_04_2540_2669__t000__0_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2013800

Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76BFC5AF

Engaging BOINC Windows Runtime Debugger...
ID: 90659
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1113
Credit: 4,699,572
RAC: 5,586
Message 90682 - Posted: 17 Apr 2019, 9:42:27 UTC - in response to Message 90659.  

Again, a lot of "C++ out of memory" errors:

1068437282
1068437281
1068437279
1068437273
1068437317
etc

Please fix it.
ID: 90682

©2020 University of Washington
https://www.bakerlab.org