Rosetta 4.1+ and 4.2+

Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99226 - Posted: 3 Oct 2020, 9:18:25 UTC - in response to Message 99223.  

Same with the only DNAN task I have had so far: 1270191608

The examples here have all failed within seconds of starting, so no great loss…
ID: 99226 · Rating: 0
Kompakki

Joined: 14 Jul 14
Posts: 3
Credit: 17,188,801
RAC: 11,949
Message 99227 - Posted: 3 Oct 2020, 10:38:23 UTC

I want to inform the R@H software developers, and I also need some help with failed tasks. Over the last few days, one of my hosts has failed about 500 tasks. For example, the tasks

bmpr2_att3_9_SAVE_ALL_OUT_IGNORE_THE_REST_2fq8je7s_1013915_3_0 (https://boinc.bakerlab.org/rosetta/result.php?resultid=1271705979)
cd28_1yjd_graft_v1_SAVE_ALL_OUT_IGNORE_THE_REST_0qk3jo8r_1013410_2_0 (https://boinc.bakerlab.org/rosetta/result.php?resultid=1271706673)

both failed with error: Computation error.

One of the stderr messages looks like:

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--covid_at3_design_boinc_v1.xml @bmpr2_att3_flags -in:file:silent bmpr2_att3_9_SAVE_ALL_OUT_IGNORE_THE_REST_2fq8je7s.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip bmpr2_att3_9_SAVE_ALL_OUT_IGNORE_THE_REST_2fq8je7s.zip @bmpr2_att3_9_SAVE_ALL_OUT_IGNORE_THE_REST_2fq8je7s.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1429430
Using database: database_357d5d93529_n_methyl/minirosetta_database

</stderr_txt>
]]>

Host details: AMD Phenom(tm) II X6 1090T, Linux Ubuntu, Ubuntu 18.04.5 LTS [5.4.0-48-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1.2)].

Other Linux hosts have successfully completed tasks with names starting with bmpr2... or cd28...

What's wrong with that one computer? Why does it fail tasks?
ID: 99227 · Rating: 0
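For anyone reading the command line in that stderr output: several flags carry plain integer values. A rough sketch that pulls them out of an excerpt of the quoted command and converts two of them to hours, assuming (this is an assumption, not something the log states) that `-cpu_run_time` and `-boinc::cpu_run_timeout` are given in seconds. The helper name `numeric_flags` is made up for illustration.

```python
import shlex

# Excerpt of the command line quoted in the stderr output above.
command = (
    "-nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 "
    "-checkpoint_interval 120 -boinc::cpu_run_timeout 36000 -jran 1429430"
)

def numeric_flags(cmd):
    """Collect '-flag value' pairs whose value is a non-negative integer."""
    tokens = shlex.split(cmd)
    flags = {}
    # Walk adjacent token pairs; keep those shaped like (-name, digits).
    for name, value in zip(tokens, tokens[1:]):
        if name.startswith("-") and value.isdigit():
            flags[name.lstrip("-")] = int(value)
    return flags

flags = numeric_flags(command)
# Assumption: both values are seconds, so divide by 3600 for hours.
print(flags["cpu_run_time"] / 3600)
print(flags["boinc::cpu_run_timeout"] / 3600)
```

Under that assumption the task was set up for 8 hours of run time with a 10-hour timeout, so a failure within seconds is nowhere near either limit.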
Falconet

Joined: 9 Mar 09
Posts: 350
Credit: 1,017,068
RAC: 223
Message 99228 - Posted: 3 Oct 2020, 11:41:27 UTC - in response to Message 99227.  
Last modified: 3 Oct 2020, 11:42:32 UTC

"process got signal 11"


Could be your RAM.
Maybe clean any dust, reseat the RAM sticks and run memtest.
ID: 99228 · Rating: 0
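For reference, signal 11 on Linux is SIGSEGV, a segmentation fault: the process touched memory it was not allowed to. That tells you how the task died, not why, so both faulty hardware and a software bug remain possible. A quick way to check the name on your own machine, using only the Python standard library:

```python
import signal

# Look up the symbolic name for signal number 11 on this platform.
sig = signal.Signals(11)
print(sig.name)

# Human-readable description from the C library (Python 3.8+).
print(signal.strsignal(11))
```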
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,767,498
RAC: 6,467
Message 99230 - Posted: 3 Oct 2020, 21:07:51 UTC - in response to Message 99222.  

You are using a REALLY old version of BOINC. Is that by design, or have you just not updated?


BOINC is just the manager (it's the project's applications that actually process work), and if something isn't broken, then don't fix it.
The only useful thing in the last few versions is that the latest one appears to support a project going from HTTP to HTTPS without the user having to do it.
I may eventually upgrade when I've had enough of the Notices about using an old URL.


That works.
ID: 99230 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1491
Credit: 14,677,518
RAC: 14,569
Message 99232 - Posted: 3 Oct 2020, 21:48:00 UTC - in response to Message 99227.  

What's wrong with that one computer? Why does it fail tasks?
With your computer hidden, it's difficult to even guess.
Overclocked too much? Overvolted too much? Not enough RAM? Faulty RAM module? Faulty power supply? Overheating? All are possible causes.
Grant
Darwin NT
ID: 99232 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1491
Credit: 14,677,518
RAC: 14,569
Message 99233 - Posted: 3 Oct 2020, 21:50:05 UTC - in response to Message 99220.  
Last modified: 3 Oct 2020, 21:50:29 UTC

Just had a new Task crash out. I've already completed several others of the same type without issue, so I'm waiting to see if my Wingman has the same problem with this Work Unit as well.

DNANX53C_DnaN_53C_refine_26_stripped_relax_-1_-1_2_17175399_4mers_0003_SAVE_ALL_OUT_1014011_144_0
Nice to know it wasn't me; the WU crashed out in seconds for my Wingman as well.
Grant
Darwin NT
ID: 99233 · Rating: 0
Bill F
Joined: 29 Jan 08
Posts: 39
Credit: 1,238,659
RAC: 2,100
Message 99235 - Posted: 4 Oct 2020, 2:57:05 UTC - in response to Message 99222.  

You are using a REALLY old version of BOINC. Is that by design, or have you just not updated?
BOINC is just the manager (it's the project's applications that actually process work), and if something isn't broken, then don't fix it.
The only useful thing in the last few versions is that the latest one appears to support a project going from HTTP to HTTPS without the user having to do it.
I may eventually upgrade when I've had enough of the Notices about using an old URL.


You may want to read some of the release notes from 7.6.22 forward to current again. The BOINC manager has also upgraded the libraries that the applications use and the GPU tables for newer GPUs, and has improved Task scheduling and Task time estimates. It may not be broken, but it might be improvable.

Bill F
ID: 99235 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1491
Credit: 14,677,518
RAC: 14,569
Message 99237 - Posted: 4 Oct 2020, 3:51:22 UTC - in response to Message 99235.  

You may want to read some of the release notes from 7.6.22 forward to current again. The BOINC manager has also upgraded the libraries that the applications use and the GPU tables for newer GPUs, and has improved Task scheduling and Task time estimates. It may not be broken, but it might be improvable.
I'm a one-project cruncher, so better Scheduling, GPU support and time estimates aren't an issue for me here at Rosetta.
But getting rid of the "Old URL" message every time the Scheduler is contacted may yet be reason enough.
Grant
Darwin NT
ID: 99237 · Rating: 0
James W

Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 99239 - Posted: 4 Oct 2020, 7:36:21 UTC - in response to Message 99223.  

Name: DNANX53C_DnaN_53C_refine_26_stripped_relax_-1_-1_3_45942938_4mers_0002_SAVE_ALL_OUT_1014073_330_0
Application: Rosetta v4.20 windows_x86_64
Device: 3710630
Task: 1270233986. WU: 1138185669
Status: Error while computing
Exit status: -1073741819 (0xC0000005) STATUS_ACCESS_VIOLATION
Stderr output:
(unknown error) - exit code -1073741819 (0xc0000005)
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0000000000000017

Engaging BOINC Windows Runtime Debugger...
Also received an identical error on host 1759960 for task 1270067381. Will see how the wingman fares.

Errors: Too many errors (may have bug). Too many total results.
Well, the wingman errored out in approximately the same time with the same errors. It appears this type of task requires something many hosts are missing. I do note that at least 1 or 2 of these tasks DID validate on my system(s). Not too many of them received, thank goodness.
ID: 99239 · Rating: 0
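The negative exit status and the hex code in that log are the same 32-bit value read two ways: -1073741819 is the signed reading, 0xC0000005 the unsigned one. A minimal sketch of the conversion (the helper name `to_ntstatus` is made up for illustration):

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which maps
    # negative values onto their two's-complement unsigned form.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"

print(to_ntstatus(-1073741819))  # 0xC0000005
```

This only translates the number; it says nothing about what caused the access violation.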
JLDun
Joined: 31 May 08
Posts: 5
Credit: 43,647
RAC: 2
Message 99332 - Posted: 15 Oct 2020, 12:38:26 UTC

One of my (few) errors.

drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e_1016145_3_0


command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_arm-android-linux-gnu -run:protocol jd2_scripting -parser:protocol c2_design.xml @flags_drhicks1 -in:file:silent drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3387230
Using database: database_357d5d93529_n_methyl/minirosetta_database

ERROR: no edge found that contains seqpos!
ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 2344
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish(1)

ID: 99332 · Rating: 0
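Unlike the signal 11 and access violation cases, this one is a deliberate exit: the application reports the message plus the source file and line where it gave up. If you want to tally errors like this across many tasks, a rough sketch of pulling those fields out of the stderr text follows. The regular expressions are based only on the four lines quoted above and may not fit other Rosetta error formats; `summarize` is a hypothetical helper name.

```python
import re

# The error lines quoted in the post above.
stderr_txt = """\
ERROR: no edge found that contains seqpos!
ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 2344
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish(1)
"""

# "ERROR: <message>" (single colon) and "ERROR:: Exit from: <file> line: <n>".
MSG_RE = re.compile(r"^ERROR: (?P<msg>.+)$", re.M)
EXIT_RE = re.compile(r"^ERROR:: Exit from: (?P<file>\S+) line: (?P<line>\d+)$", re.M)

def summarize(text):
    """Return the first error message and exit location found, if any."""
    msg = MSG_RE.search(text)
    loc = EXIT_RE.search(text)
    return {
        "message": msg.group("msg") if msg else None,
        "file": loc.group("file") if loc else None,
        "line": int(loc.group("line")) if loc else None,
    }

print(summarize(stderr_txt))
```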
James W

Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 99400 - Posted: 26 Oct 2020, 4:03:17 UTC

Name: CLPPPJS8_255_ClpP1P2_stub_justPHE_0001_SAVE_ALL_OUT_1018533_362_1
Application: Rosetta v4.20 windows_x86_64
Device: 1759960
Task: 1284630939. WU: 1150838856
Status: Error while computing
Exit status: -1073741819 (0xC0000005) STATUS_ACCESS_VIOLATION
Errors: Too many errors (may have bug). Too many total results.
Stderr output:
(unknown error) - exit code -1073741819 (0xc0000005)
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0000000143141D48

Engaging BOINC Windows Runtime Debugger...

I was wingman on this WU, a type that is new to me, and we both got the same error. Run time was about 15 sec., so no big loss. I don't see any more like this in my current queue.
ID: 99400 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1491
Credit: 14,677,518
RAC: 14,569
Message 99402 - Posted: 26 Oct 2020, 8:31:58 UTC - in response to Message 99332.  

One of my (few) errors.

drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e_1016145_3_0


command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_arm-android-linux-gnu -run:protocol jd2_scripting -parser:protocol c2_design.xml @flags_drhicks1 -in:file:silent drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3387230
Using database: database_357d5d93529_n_methyl/minirosetta_database

ERROR: no edge found that contains seqpos!
ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 2344
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish(1)
I had a similar error on one of those Tasks as well.
Grant
Darwin NT
ID: 99402 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1491
Credit: 14,677,518
RAC: 14,569
Message 99403 - Posted: 26 Oct 2020, 8:34:12 UTC - in response to Message 99400.  

Name: CLPPPJS8_255_ClpP1P2_stub_justPHE_0001_SAVE_ALL_OUT_1018533_362_1
Application: Rosetta v4.20 windows_x86_64
Device: 1759960
Task: 1284630939. WU: 1150838856
Status: Error while computing
Exit status: -1073741819 (0xC0000005) STATUS_ACCESS_VIOLATION
Errors: Too many errors (may have bug). Too many total results.
Stderr output:
(unknown error) - exit code -1073741819 (0xc0000005)
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0000000143141D48

Engaging BOINC Windows Runtime Debugger...

I was wingman on this WU, a type that is new to me, and we both got the same error. Run time was about 15 sec., so no big loss. I don't see any more like this in my current queue.
Same here.
Grant
Darwin NT
ID: 99403 · Rating: 0
äxl
Joined: 30 Dec 08
Posts: 9
Credit: 497,080
RAC: 0
Message 99727 - Posted: 21 Nov 2020, 6:18:10 UTC

So many "Error while computing". Should I detach this computer?
https://boinc.bakerlab.org/rosetta/results.php?userid=294942

Running a script to decrease CPU temperature
ID: 99727 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1491
Credit: 14,677,518
RAC: 14,569
Message 99730 - Posted: 21 Nov 2020, 6:59:36 UTC - in response to Message 99727.  

So many "Error while computing". Should I detach this computer?
https://boinc.bakerlab.org/rosetta/results.php?userid=294942
Yep.
You need to figure out what's wrong with it, then re-attach to the project.

Signal 11 errors indicate a memory problem, but they can also be due to overheating (CPU, PSU, motherboard), faulty hardware (RAM, PSU, motherboard), too much overclocking (memory, CPU), etc, etc, etc...
Grant
Darwin NT
ID: 99730 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1866
Credit: 8,186,159
RAC: 7,029
Message 99965 - Posted: 9 Dec 2020, 9:13:08 UTC

A lot of errors on "miniprotein_relax" WUs

1305402903
1305403039
etc

command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax7_SAVE_ALL_OUT_IGNORE_THE_REST_8cs2zu5j.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax7_SAVE_ALL_OUT_IGNORE_THE_REST_8cs2zu5j.zip @miniprotein_relax7_SAVE_ALL_OUT_IGNORE_THE_REST_8cs2zu5j.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3798651
Using database: database_357d5d93529_n_methylminirosetta_database


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF65B038316 read attempt to address 0xFFFFFFFF

Engaging BOINC Windows Runtime Debugger...

ID: 99965 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1491
Credit: 14,677,518
RAC: 14,569
Message 99966 - Posted: 9 Dec 2020, 9:33:34 UTC - in response to Message 99965.  

A lot of errors on "miniprotein_relax" WUs
Just had a look, and all of mine so far have resulted in computation errors in 50 min or less. No Valid results yet.

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF614BE8316 read attempt to address 0xFFFFFFFF

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x000000000000010A

Grant
Darwin NT
ID: 99966 · Rating: 0
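One thing worth noticing in the two "read attempt" records from different hosts (0x00007FF65B038316 and 0x00007FF614BE8316): the faulting addresses differ, but their low 16 bits match. Assuming the executable is loaded at a 64 KB aligned base that varies between runs, as is typical with address randomization on Windows, that would be consistent with both hosts crashing at the same instruction. It is a hint, not proof. A small sketch of the comparison, with `low16` as a made-up helper name:

```python
import re

# The two "read attempt" records quoted in the posts above.
records = [
    "Reason: Access Violation (0xc0000005) at address 0x00007FF65B038316 read attempt to address 0xFFFFFFFF",
    "Reason: Access Violation (0xc0000005) at address 0x00007FF614BE8316 read attempt to address 0xFFFFFFFF",
]

# Matches the faulting instruction address, not the "to address" target.
AV_RE = re.compile(r"at address 0x([0-9A-Fa-f]+)")

def low16(record):
    """Return the low 16 bits of the faulting instruction address."""
    match = AV_RE.search(record)
    return int(match.group(1), 16) & 0xFFFF

offsets = {hex(low16(r)) for r in records}
print(offsets)  # {'0x8316'}
```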
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99967 - Posted: 9 Dec 2020, 12:54:49 UTC - in response to Message 99965.  

Same here; several have failed with an access violation after a little over an hour.
I’ve got some more that have been running for 5 hours so far; let’s see whether they manage to complete…
ID: 99967 · Rating: 0
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99969 - Posted: 9 Dec 2020, 16:42:02 UTC - in response to Message 99967.  

let’s see whether they manage to complete…
They did. (Example.) The failed ones might just have been certain input values exposing a bug in an algorithm.
ID: 99969 · Rating: 0
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99975 - Posted: 9 Dec 2020, 23:47:38 UTC - in response to Message 99965.  

And of course all the failed ones get resent…
I’ve just received a couple of dozen. Debating whether to abort them all.
ID: 99975 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org