Message boards : Number crunching : Minirosetta v1.45 bug thread
Author | Message |
---|---|
xsc2 Send message Joined: 9 Jul 08 Posts: 4 Credit: 62,354 RAC: 0 |
Exit code: -1073741819 (0xc0000005) https://boinc.bakerlab.org/rosetta/result.php?resultid=212596936 |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,813,645 RAC: 3,531 |
Task ID 212423733 Name cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_ccr19_olange_5384_13138_0 Workunit 193652172 Validate state Invalid Claimed credit 14.8874783714289 Granted credit 0 application version 1.45 |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Task ID 212423733 --- here is the link to his task: https://boinc.bakerlab.org/rosetta/result.php?resultid=212423733 another (0xc0000005) error |
guhungry Send message Joined: 1 Dec 08 Posts: 1 Credit: 620,505 RAC: 0 |
I have a lot of them, and all the errors I looked at returned exit code -1073741819 (0xc0000005) from task cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs. --------------------------------------------- 212713228 212673521 212467256 212463757 212356024 212292194 212243396 212205055 212172410 212319134 212285788 212244308 |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
Another (0xc0000005) error: cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_nsp1_olange_5389_30836_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=212733641 Is there a problem with the cs_vanilla workunits? I notice that this is one of the first two 1.45 workunits I've seen running on my dual-core machine at the same time - is there some problem with that and my memory size (2 GB total)? |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2141 Credit: 41,518,559 RAC: 10,612 |
Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative). 9 out of the next 11 were successful too, making 24 good out of 32, which is the best performance I've had for a very long time. Combined with a continuing 100% record on Beta 5.98s (much more coming through recently) I'm officially happier and less frustrated. My 5th best day ever! Not perfect yet, but just reporting some better news instead of constant misery. You must be working on the right lines. Keep it up! |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Another crash on Mac OSX 10.4.11. Task 212684901: Workunit 193888088 Same area of code as before (update_domain_map) <core_client_version>6.2.18</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> # cpu_run_time_pref: 14400 SIGBUS: bus error Crashed executable name: minirosetta_1.45_i686-apple-darwin built using BOINC library version 6.5.0 Machine type Intel 80486 (32-bit executable) System version: Macintosh OS 10.4.11 build 8S2167 Sun Dec 7 06:56:14 2008 Thread 0 Crashed: 0 ...etta_1.45_i686-apple-darwin 0x00226273 __ZNK4core10kinematics8AtomTree17update_domain_mapERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_ + 41 1 ...etta_1.45_i686-apple-darwin 0x0001596f __ZNK4core12conformation12Conformation17update_domain_mapERN9ObjexxFCL8FArray1DIiEE + 403 2 ...etta_1.45_i686-apple-darwin 0x000830f3 __ZN4core4pose4Pose13scoring_beginEN7utility7pointer10owning_ptrINS_7scoring17ScoreFunctionInfoEEE + 1329 3 ...etta_1.45_i686-apple-darwin 0x000fa4e2 __ZNK4core7scoring13ScoreFunctionclERNS_4pose4PoseE + 4686 4 ...etta_1.45_i686-apple-darwin 0x001938b1 __ZNK9protocols8abinitio18AbrelaxApplication13process_decoyERN4core4pose4PoseERKNS2_7scoring13ScoreFunctionESsRNS2_2io6silent12SilentStructE + 35 5 ...etta_1.45_i686-apple-darwin 0x001afe27 __ZN9protocols8abinitio18AbrelaxApplication4foldEv + 9651 6 ...etta_1.45_i686-apple-darwin 0x001b5381 __ZN9protocols8abinitio18AbrelaxApplication3runEv + 881 7 ...etta_1.45_i686-apple-darwin 0x00009a87 _main + 3941 8 ...etta_1.45_i686-apple-darwin 0x0000292e __start + 216 9 ...etta_1.45_i686-apple-darwin 0x00002855 start + 41 etc. |
steve Send message Joined: 27 Nov 08 Posts: 7 Credit: 1,085 RAC: 0 |
David, I just received an error during this file analysis: Time of Download: 12/7/2008 12:30:34 PM|rosetta@home|Starting task fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0 using minirosetta version 145 Time of Error: 12/7/2008 3:31:37 PM|rosetta@home|Started upload of fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0_0 Error Message: "Could not write to a specified memory location". The message asked me if I wanted to debug. I pressed cancel. Time of Upload: The file was uploaded to your server as: 12/7/2008 3:31:43 PM|rosetta@home|Finished upload of fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0_0 Steve Please post bugs and issues regarding minirosetta version 1.45. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. This one broke after 2hrs, 44min. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=194032196 Mon 08 Dec 2008 15:03:25 EST|rosetta@home|Output file cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_flua_olange_5385_36439_0_0 for task cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_flua_olange_5385_36439_0 absent pete. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Here's some more bad cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs WUs: https://boinc.bakerlab.org/rosetta/result.php?resultid=212592040 https://boinc.bakerlab.org/rosetta/result.php?resultid=212475523 https://boinc.bakerlab.org/rosetta/result.php?resultid=212454329 https://boinc.bakerlab.org/rosetta/result.php?resultid=212415902 https://boinc.bakerlab.org/rosetta/result.php?resultid=212349479 https://boinc.bakerlab.org/rosetta/result.php?resultid=212298709 https://boinc.bakerlab.org/rosetta/result.php?resultid=212268849 https://boinc.bakerlab.org/rosetta/result.php?resultid=212260558 |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
Here's some more bad cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs WUs: Makes me suspect that at least one of the following is true: 1. cs_vanilla workunits are a high fraction of the workunits now going out. 2. The cs_vanilla workunits are using a new feature of 1.45 that hasn't been adequately tested for its ability to finish properly. |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
ERROR: Illegal value for integer option -run:jran specified: in workunit 1g73A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_5476_258_1 AdeB |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2141 Credit: 41,518,559 RAC: 10,612 |
Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative). And now 16 more successes out of 18 making 40 out of 50. Most errors came early, so I'm now confident enough to up my run-time from 2 to 3 hours again. Good work on this "Can't acquire lockfile" problem. I'm just going to tidy up the lockfiles, reboot and see if the good results continue. Efforts much appreciated here. Let's see if it can be nailed in the next update. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2141 Credit: 41,518,559 RAC: 10,612 |
I'm just going to tidy up the lockfiles, reboot and see if the good results continue. On this, when trying to stop the BOINC service I wasn't allowed to until I'd ended the boinc.exe client process under User Name boinc_master (Vista64 OS quad-core AMD Phenom). In the Task Manager I 'showed processes for all users' to do this and saw that 2 rosetta_beta_5.98_windows_x86_64.exe*32 processes were still running (correct) but also about 20 minirosetta_1.4x_windows_x86_64.exe*32 processes were running. About half of those were for MiniRosetta 1.40 and the other half for v1.45. All under the User Name boinc_project. There should just have been 2 for v1.45. My last Mini 1.40 WU was completed late last Friday, so these 10-ish 1.40 processes have persisted for 3 days (no re-boots in that time). I manually ended all these processes. Going to the C:\ProgramData\BOINC\slots folder, there were 23 folders (numbered from 0 to 22), the first 19 of which contained a 0-byte boinc_lockfile file and a stderr.txt and a stdout.txt file. The other 4 folders contained the files I'd expect for running processes. I deleted all the boinc_lockfile files and re-booted. On start-up, the first 19 folders had been removed, leaving the 4 active ones. I'm no programmer and may be talking out of my hat, but could these old processes still running have something to do with being unable to acquire boinc_lockfile? When there's a Compute Error is there some fault in the way the process closes down and releases the files it holds open? Or could it be to do with the user (me) aborting a WU manually when I see it's stalled and producing error messages? I'm guessing wildly, obviously, but hopefully this means something more sensible to you clever chaps. You seem to be close to some solutions for a problem that's persisted over several versions, so maybe this is the final clue you need? Hope it helps. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Some more workunits failing on Mac OS X 10.4.11 Task 212933060 : Workunit 194105301 Task 212892168 : Workunit 194071153 (names **_ZNMP_RELAX_**) both failing at startup ERROR: Illegal value for integer option -run:jran specified: Also Task 212828576 : Workunit 194016740 (cs_vanilla_* again) failing halfway through in Update_domain_map Thread 0 Crashed: 0 ...etta_1.45_i686-apple-darwin 0x00226273 __ZNK4core10kinematics8AtomTree17update_domain_mapERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_ + 41 1 ...etta_1.45_i686-apple-darwin 0x0001596f __ZNK4core12conformation12Conformation17update_domain_mapERN9ObjexxFCL8FArray1DIiEE + 403 etc. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi, me again. This one crashed overnight after 5hrs, 30min. not good. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=194070527 Mon 08 Dec 2008 20:55:01 EST|rosetta@home|Output file cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_nsp1_olange_5389_38205_0_0 for task cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_nsp1_olange_5389_38205_0 absent <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> # cpu_run_time_pref: 21600 SIGSEGV: segmentation violation Stack trace (12 frames): [0x8b883ab] [0x8bb211c] [0xffffe500] [0x85d2f9a] [0x85b766e] [0x83efc90] [0x811a22a] [0x812a216] [0x812be61] [0x804b884] [0x8c0dc1c] [0x8048111] Exiting... </stderr_txt> pete. |
Path7 Send message Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
Running Windows XP-home I found 2 Wu's with this error: ERROR: Illegal value for integer option -run:jran specified: 1wjdA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1wjdA-_5475_616_0 1dsvA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1dsvA-_5475_81_0 The first WU also had the same error on the second run. Have a nice day, Path7. |
stewjack Send message Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0 |
This WU was crunched under v.1.45 and exited with an error. ======== Task ID 212609043 Name: loopbuild_reference_hombench_loopbuild_t327__IGNORE_THE_REST_1XMAA_2_5453_3_0 Workunit 193817482 https://boinc.bakerlab.org/rosetta/result.php?resultid=212609043 ==== stderr out ====== <core_client_version>6.2.19</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x007FA87A read attempt to address 0x000002E8 Jack |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
Hi everyone, due to a limitation on command line length set by BOINC, the jobs with name "*_ZN_ABRELAX_*" have their command lines automatically truncated when sent out on Rosetta@Home. That is why you see them stop right away with an error like "ERROR: Illegal value for integer option -run:jran specified: ". In fact, the same workunits returned with a very good success rate on the testing server. We are investigating why the alpha testing server did not catch such errors in the first place. This is another unfortunate incident which will be a new lesson for us. Sorry for any inconvenience this has caused you. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
Here's some more bad cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs WUs: I notice that for most of those workunits, your wingman returned an error on the same workunit, so I'd suspect problems built into either the cs_vanilla workunits or a new feature of minirosetta that few other workunits have used before. In those cases where your wingman completed the workunit without a problem, it was on an Intel Xeon CPU with a lot more memory. This leads me to believe that cs_vanilla workunits need a lot more memory than your computer has, and suggests that your rather old version of BOINC (5.2.13) may not be sending the information needed to choose only workunits that will work with your memory size. The version of BOINC used where those workunits were successful was 6.2.19. My computer is in between - using BOINC 5.10.45 with a total RAM memory of 2 GB, and the only cs_vanilla workunit I've had so far ran longer than most of yours, but still failed. |
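
For anyone matching up the numbers quoted in the reports above: each exit code is shown in more than one form, for example -1073741819 (0xc0000005) on Windows and 193 (0xc1, -63) on Mac and Linux. These are the same bit patterns read as signed or unsigned values. The following is only a small sketch of that arithmetic in Python; the helper names `to_unsigned32` and `to_signed8` are made up for illustration and are not part of BOINC or minirosetta.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def to_signed8(code: int) -> int:
    """Reinterpret the low byte of an exit status as a signed 8-bit value."""
    low = code & 0xFF
    return low - 0x100 if low >= 0x80 else low


# Windows reports: the signed decimal and the hex form are the same value.
print(hex(to_unsigned32(-1073741819)))  # 0xc0000005

# Mac/Linux reports: "code 193 (0xc1, -63)" is one byte shown three ways.
print(hex(193), to_signed8(193))  # 0xc1 -63
```

The stderr output quoted in this thread labels 0xc0000005 as an access violation, so the Windows reports that list only the decimal or only the hex form are describing the same kind of crash.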