Computation errors

David703

Joined: 17 Jul 17
Posts: 5
Credit: 38,608
RAC: 0
Message 90595 - Posted: 30 Mar 2019, 18:06:19 UTC

Hi, since I've come back to this project I've been seeing some strange errors in some of my WUs, especially in the ones that study big proteins. Here are a few examples:
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065314770
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065314768
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065460662

How can I keep these errors from happening?
rjs5

Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 90599 - Posted: 31 Mar 2019, 16:29:03 UTC - in response to Message 90595.  

Hi, since I've come back to this project I've been seeing some strange errors in some of my WUs, especially in the ones that study big proteins. Here are a few examples:
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065314770
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065314768
-https://boinc.bakerlab.org/rosetta/result.php?resultid=1065460662

How can I keep these errors from happening?


Rosetta developers were quite sloppy in their allocation and use of memory.

Task 1065460662 ran out of memory.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1065460662

The other two error out with "Funzione non corretta" or "incorrect function"

When one WU runs out of memory, other WUs may get strange error messages from function calls, as developers don't always check the return values of all system calls.

The WUs you are running are 64-bit and sometimes take large amounts of memory ... frequently over a GB each.

8 GB should be enough to run 4 Rosetta 64-bit WUs, so I would examine how memory is being used and change the workload:
-Buy more memory if practical.
-Lower the number of Rosetta WUs running simultaneously with app_config.xml or BOINC -> OPTIONS -> COMPUTING PREFERENCES -> USAGE LIMITS
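
A minimal sketch of the app_config.xml route, assuming roughly 1.5 GB per task and 2 GB held back for the operating system. Both figures are illustrative assumptions, not measurements from this thread, and the helper names are hypothetical. It estimates how many tasks fit in RAM and builds a file body that uses BOINC's project_max_concurrent option.

```python
def max_tasks_for_memory(total_gb, per_task_gb, reserve_gb):
    """Number of tasks that fit in RAM after holding back a reserve (at least 1)."""
    usable_gb = max(total_gb - reserve_gb, 0.0)
    return max(int(usable_gb // per_task_gb), 1)


def build_app_config(max_concurrent):
    """Return the text of an app_config.xml that caps concurrent tasks for the project."""
    return (
        "<app_config>\n"
        f"  <project_max_concurrent>{max_concurrent}</project_max_concurrent>\n"
        "</app_config>\n"
    )


# Assumed figures: 8 GB installed, ~1.5 GB per task, 2 GB kept for the OS.
limit = max_tasks_for_memory(total_gb=8.0, per_task_gb=1.5, reserve_gb=2.0)
print(build_app_config(limit))
```

The printed text would be saved as app_config.xml in the Rosetta@home project folder inside the BOINC data directory, after which the client has to re-read its config files. Adjust the per-task figure to what your own tasks actually use.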
David703

Joined: 17 Jul 17
Posts: 5
Credit: 38,608
RAC: 0
Message 90600 - Posted: 31 Mar 2019, 19:25:31 UTC - in response to Message 90599.  

Ok, thank you!
shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,462,427
RAC: 6,572
Message 90934 - Posted: 24 Jul 2019, 9:15:55 UTC
Last modified: 24 Jul 2019, 9:16:45 UTC

Seems unlikely they've ever addressed this problem, eh? I see them pretty often. Especially annoying when they have run up 8 hours of effort before crashing, presumably with no points earned. And no, at this point I don't care enough to do the searching to try to figure out if the points were granted. I don't even care enough to read the rest of the thread beyond the Subject: and glancing at a couple of the posts.

Latest example:

Application: Rosetta Mini 3.78
Name: start_close_HHH_rd4_0056.min_rise1.83_whole_pass_aagb.bp_20190406150644_0001_0001_0001_0003_0001_0001_fragments_fold_SAVE_ALL_OUT_833066_1053
State: Computation error
Received: 2019-07-22 08:13:16
Report deadline: 2019-07-30 08:13:11
Estimated computation size: 80,000 GFLOPs
CPU time: 07:49:11
Elapsed time: 07:59:03
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 937
Credit: 3,588,605
RAC: 1,216
Message 90937 - Posted: 24 Jul 2019, 13:58:09 UTC - in response to Message 90934.  

Latest example:

Application: Rosetta Mini 3.78
Name: start_close_HHH_rd4_0056.min_rise1.83_whole_pass_aagb.bp_20190406150644_0001_0001_0001_0003_0001_0001_fragments_fold_SAVE_ALL_OUT_833066_1053
State: Computation error


Rosetta Mini 3.78 was released in October 2017.
Since then, there have been a lot of errors and problems.
No debugging, no new version. Nothing.
mmonnin

Joined: 2 Jun 16
Posts: 41
Credit: 6,068,873
RAC: 6,952
Message 90943 - Posted: 26 Jul 2019, 0:25:06 UTC

I'd rather have the Rosetta Mini tasks than the Rosetta version that runs for 5 hours and then errors out when the set run time is 1 hour.
blyons123

Joined: 8 Apr 14
Posts: 4
Credit: 118,348
RAC: 1
Message 91113 - Posted: 12 Sep 2019, 14:42:48 UTC

Happened again after resetting the project!?

9/12/2019 10:20:17 PM | Rosetta@home | Task bc96_EHEE_hb1_2413_fold_SAVE_ALL_OUT_857815_1040_0 exited with zero status but no 'finished' file
9/12/2019 10:20:17 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.
9/12/2019 10:22:08 PM | Rosetta@home | Task Longxing_ems_ferrM_2260.11745_fold_SAVE_ALL_OUT_863531_13_0 exited with zero status but no 'finished' file
9/12/2019 10:22:08 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.
9/12/2019 10:23:49 PM | Rosetta@home | Task Longxing_ems_4hM_2152.10077_fold_SAVE_ALL_OUT_861867_13_0 exited with zero status but no 'finished' file
9/12/2019 10:23:49 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.
9/12/2019 10:24:33 PM | Rosetta@home | Task bc96_4h_hb1_1620_fold_SAVE_ALL_OUT_857813_1040_0 exited with zero status but no 'finished' file
9/12/2019 10:24:33 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.
9/12/2019 10:25:54 PM | Rosetta@home | Task bc96_EHEE_hb1_2413_fold_SAVE_ALL_OUT_857815_1040_0 exited with zero status but no 'finished' file
9/12/2019 10:25:54 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.
9/12/2019 10:30:08 PM | Rosetta@home | Task bc96_EHEE_hb1_2413_fold_SAVE_ALL_OUT_857815_1040_0 exited with zero status but no 'finished' file
9/12/2019 10:30:08 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.
9/12/2019 10:34:11 PM | Rosetta@home | Task bc96_EHEE_hb1_2413_fold_SAVE_ALL_OUT_857815_1040_0 exited with zero status but no 'finished' file
9/12/2019 10:34:11 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.
9/12/2019 10:36:06 PM | Rosetta@home | work fetch suspended by user
9/12/2019 10:36:15 PM | Rosetta@home | Task Longxing_ems_4hM_2152.10077_fold_SAVE_ALL_OUT_861867_13_0 exited with zero status but no 'finished' file
9/12/2019 10:36:15 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.
9/12/2019 10:36:32 PM | Rosetta@home | task bc96_4h_hb1_1620_fold_SAVE_ALL_OUT_857813_1040_0 suspended by user
9/12/2019 10:36:35 PM | Rosetta@home | Starting task foldit_2007855_0007_fold_and_dock_SAVE_ALL_OUT_849408_1557_0
9/12/2019 10:36:35 PM | Rosetta@home | task Longxing_ems_4hM_2152.10077_fold_SAVE_ALL_OUT_861867_13_0 suspended by user
9/12/2019 10:36:35 PM | Rosetta@home | task Longxing_ems_4hM_2152.10077_fold_SAVE_ALL_OUT_861867_13_0 resumed by user
9/12/2019 10:36:37 PM | Rosetta@home | task Longxing_ems_4hM_2152.10077_fold_SAVE_ALL_OUT_861867_13_0 suspended by user
9/12/2019 10:36:39 PM | Rosetta@home | Starting task Longxing_ems_ferrM_3025.11863_fold_SAVE_ALL_OUT_863659_13_0
9/12/2019 10:36:40 PM | Rosetta@home | task Longxing_ems_ferrM_2260.11745_fold_SAVE_ALL_OUT_863531_13_0 suspended by user
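
To see which tasks keep restarting, a log like the one above can be tallied per task. This is a minimal sketch that assumes the `timestamp | project | message` line shape shown here; the function name is hypothetical and the sample is a three-line excerpt of the log.

```python
import re
from collections import Counter

# Matches the client message and captures the task name.
NO_FINISHED = re.compile(r"Task (\S+) exited with zero status but no 'finished' file")


def count_no_finished(log_text):
    """Count 'no finished file' restarts per task name in an event-log excerpt."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NO_FINISHED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = """\
9/12/2019 10:20:17 PM | Rosetta@home | Task bc96_EHEE_hb1_2413_fold_SAVE_ALL_OUT_857815_1040_0 exited with zero status but no 'finished' file
9/12/2019 10:22:08 PM | Rosetta@home | Task Longxing_ems_ferrM_2260.11745_fold_SAVE_ALL_OUT_863531_13_0 exited with zero status but no 'finished' file
9/12/2019 10:25:54 PM | Rosetta@home | Task bc96_EHEE_hb1_2413_fold_SAVE_ALL_OUT_857815_1040_0 exited with zero status but no 'finished' file
"""
for task, n in count_no_finished(sample).most_common():
    print(n, task)
```

In the full log above, the bc96_EHEE_hb1_2413 task hits this message four times within about 14 minutes, which fits the "happens repeatedly" case the client warns about.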
blyons123

Joined: 8 Apr 14
Posts: 4
Credit: 118,348
RAC: 1
Message 91153 - Posted: 23 Sep 2019, 10:51:08 UTC

Every Mini task gives me this error:
9/23/2019 6:22:43 PM | Rosetta@home | Task Longxing_ems_ferrM_5178.12181_fold_SAVE_ALL_OUT_863970_24_0 exited with zero status but no 'finished' file
adrianxw
Joined: 18 Sep 05
Posts: 564
Credit: 6,255,708
RAC: 4,884
Message 91157 - Posted: 24 Sep 2019, 16:00:33 UTC

I've also had two work units crash out today, one with this...

Exit status 1 (0x00000001) Unknown error code

... the other with this...

Exit status -529697949 (0xE06D7363) Unknown error code

No new tasks set for now.
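
The decimal and hexadecimal forms of the second status are the same 32-bit value, one read as signed and one as unsigned. A short sketch of the conversion (the helper name is hypothetical):

```python
def to_unsigned32(status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


code = to_unsigned32(-529697949)
print(f"0x{code:08X}")  # 0xE06D7363

# The low three bytes of this particular code are ASCII text.
print(code.to_bytes(4, "big")[1:].decode("ascii"))  # msc
```

0xE06D7363 is the exception code the Microsoft Visual C++ runtime uses when a C++ exception is thrown, so this status suggests (without proving) that the application ended on an unhandled C++ exception rather than on a BOINC-specific error.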
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
svincent

Joined: 30 Dec 05
Posts: 208
Credit: 7,343,511
RAC: 2,814
Message 91158 - Posted: 24 Sep 2019, 16:39:04 UTC

rb_09_19_8636_8623_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_03_05_867741_37 failed with invalid chi angle on Windows

File: C:\cygwin64\home\boinc\Rosetta\main\source\src\core/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)

</stderr_txt>

Task 1094602971 http://boinc.bakerlab.org/rosetta/result.php?resultid=109460297
Workunit 985919585 http://boinc.bakerlab.org/rosetta/workunit.php?wuid=985919585
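
For context on the message: "-nan(ind)" is how the Microsoft C runtime prints an indefinite NaN, and a NaN fails every range comparison. The sketch below is a hypothetical check in the spirit of the error text, not Rosetta's actual code.

```python
import math


def check_chi(angle):
    """Hypothetical range check modelled on the error message above."""
    if not (-180.0 <= angle <= 180.0):
        raise ValueError(f"chi angle must be between -180 and 180: {angle}")
    return angle


print(check_chi(62.5))  # an ordinary angle passes
try:
    check_chi(math.nan)  # NaN compares False against both bounds, so it fails
except ValueError as err:
    print(err)
```

The range check is therefore most likely only where the problem surfaced; the NaN itself would have been produced by an earlier calculation.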
wolfman1360

Joined: 18 Feb 17
Posts: 15
Credit: 672,117
RAC: 1,501
Message 91159 - Posted: 24 Sep 2019, 20:43:54 UTC

Some tasks errored out, 10 in total. That's a lot of computing time gone. I'm not sure which file is in use or whether it was Rosetta or BOINC. This machine has been up solid. Suspending for now.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1095086805
https://boinc.bakerlab.org/rosetta/result.php?resultid=1095124487
https://boinc.bakerlab.org/rosetta/result.php?resultid=1095064719
adrianxw
Joined: 18 Sep 05
Posts: 564
Credit: 6,255,708
RAC: 4,884
Message 91330 - Posted: 3 Nov 2019, 11:32:47 UTC

I have had four units crash out in recent days. One with "Aborted by Server", so I discount that one. The other three with "Out of Memory". I think this is because I was sent "Rosetta v4.07 windows_intelx86" to run the job, and not "Rosetta v4.07 windows_intelx86_64". Of the wingmen on the failing jobs, the others have crashed with the same error, except one, who completed the unit but was running Rosetta v4.07 windows_intelx86_64. Obviously, a 64-bit system can access a much greater memory range than a 32-bit one. The question that arises, though, is why was I sent x86 and not x86_64? My system runs 64-bit Windows and has more memory installed and available to BOINC than the chap who completed the job without error.
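
The address-space gap between the two builds can be put in numbers. This is a small arithmetic sketch, not a measurement of what Rosetta actually uses:

```python
GIB = 2 ** 30

address_space_32 = 2 ** 32  # bytes addressable with a 32-bit pointer
print(address_space_32 // GIB)  # 4

address_space_64 = 2 ** 64  # theoretical limit for a 64-bit pointer
print(address_space_64 // GIB)  # 17179869184
```

A 32-bit process is capped by its address space no matter how much RAM is installed, and on Windows the user-mode share is 2 GiB by default (up to 4 GiB on 64-bit Windows for executables built as large-address-aware). A task that needs more than that can report "Out of Memory" while the machine still has plenty of RAM free.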
adrianxw
Joined: 18 Sep 05
Posts: 564
Credit: 6,255,708
RAC: 4,884
Message 91443 - Posted: 7 Dec 2019, 14:40:10 UTC

I've got another couple of weird ones now. One is 0.586% done but has 16:08:53 elapsed and 114d 02:32:50 remaining, increasing quite rapidly; the other is 0.259% done after 06:15:56 elapsed, with 12:42:49 remaining and the last digit flipping 48 - 49 - 48 - 49.
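
As a rough cross-check of those figures, here is what a plain linear extrapolation from elapsed time and fraction done would give. This is an assumed model for illustration; the BOINC client's actual estimator may work differently, and the function names are hypothetical.

```python
def to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def linear_remaining(elapsed_s, fraction_done):
    """Remaining seconds if progress continued at the average rate so far."""
    return elapsed_s * (1.0 - fraction_done) / fraction_done


first = linear_remaining(to_seconds("16:08:53"), 0.586 / 100)
second = linear_remaining(to_seconds("06:15:56"), 0.259 / 100)
print(f"{first / 86400:.1f} days")
print(f"{second / 86400:.1f} days")
```

Under this model the first task comes out at about 114 days, close to the 114d 02:32:50 reported, which would be consistent with the progress fraction barely moving while elapsed time grows. The second comes out at about 100 days, which does not match the 12:42:49 shown for that task, so that figure presumably comes from some other estimate.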




©2019 University of Washington
http://www.bakerlab.org