Problems with version 5.96

Author	Message
nouqraz Joined: 8 Apr 08 Posts: 6 Credit: 496,443 RAC: 4	Message 53579 - Posted: 7 Jun 2008, 14:00:39 UTC
Woke up this morning to an error message on my screen: "rosetta_beta_5.96_windows_intelx86.exe caused an error and needs to close", blah blah...
This task: https://boinc.bakerlab.org/rosetta/result.php?resultid=169429650
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
 - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2005994
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00BF05EB read attempt to address 0x0E3AF000
Engaging BOINC Windows Runtime Debugger...
....
Full debugging info is in the link to the task.
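
The exit code in the post above appears in two forms, a negative decimal and a hex value. They are the same 32-bit number read as signed and as unsigned. A minimal sketch of the conversion follows; the helper names `to_unsigned32` and `to_signed32` are made up for illustration and are not part of BOINC or Rosetta.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as its unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    # Values with the top bit set are negative in two's complement.
    return value - 0x100000000 if value & 0x80000000 else value


if __name__ == "__main__":
    # Exit code reported in the post above.
    print(hex(to_unsigned32(-1073741819)))
    print(to_signed32(0xC0000005))
```

The same mask applies to the other exit codes reported in this thread, such as the -197 (0xffffff3b) exit status.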

(_KoDAk_) Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0	Message 53583 - Posted: 7 Jun 2008, 19:40:43 UTC
https://boinc.bakerlab.org/rosetta/result.php?resultid=169050697
https://boinc.bakerlab.org/rosetta/result.php?resultid=168985785

David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0	Message 53587 - Posted: 7 Jun 2008, 23:32:04 UTC Last modified: 7 Jun 2008, 23:34:03 UTC
And here we have a 24-hour validate error: resultid=169434889
The stderr_txt has basically nothing to say as to why this might be invalid...
Rosie, Rosie, she's our gal, If she can't do it, no one shall!

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 53592 - Posted: 8 Jun 2008, 19:05:56 UTC
Two validate errors with no information. Full run of 4 hrs.
FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_14377_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=168398596
FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_14380_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=168398613

Azurrio Joined: 20 Feb 06 Posts: 8 Credit: 237,979 RAC: 0	Message 53616 - Posted: 10 Jun 2008, 9:24:34 UTC - in response to Message 53592.
Validate error on this one, which ran for 24 hours:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=153656004

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 53637 - Posted: 11 Jun 2008, 23:22:15 UTC
The WU t407__CASP8_JUMPAB_SAMPLE1_res1to79_SAVE_ALL_OUT_BARCODE_hom001__3653_173 crunched for the normal length of time, but was marked "invalid". From the stderr:
<message><file_xfer_error>
<file_name>t407__CASP8_JUMPAB_SAMPLE1_res1to79_SAVE_ALL_OUT_BARCODE_hom001__3653_173_1_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>
The other person to crunch this WU got the same error.
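
Anyone collecting several reports like the one above may want to pull the file name and error code out of the pasted stderr. The following is only a sketch: it uses a regular expression rather than an XML parser because the pasted fragment is not a complete document (the `<message>` tag is never closed), and the function name `extract_xfer_errors` is hypothetical.

```python
import re

# Matches the file name and numeric error code inside a <file_xfer_error> block.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(.*?)</file_name>\s*"
    r"<error_code>(-?\d+)</error_code>",
    re.S,
)


def extract_xfer_errors(stderr_text: str) -> list[tuple[str, int]]:
    """Return (file_name, error_code) pairs found in a stderr fragment."""
    return [(name, int(code)) for name, code in XFER_ERROR.findall(stderr_text)]


# Fragment copied from the post above.
SAMPLE = """<message><file_xfer_error>
<file_name>t407__CASP8_JUMPAB_SAMPLE1_res1to79_SAVE_ALL_OUT_BARCODE_hom001__3653_173_1_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>"""

if __name__ == "__main__":
    for file_name, code in extract_xfer_errors(SAMPLE):
        print(file_name, code)
```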

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 53671 - Posted: 13 Jun 2008, 16:40:43 UTC
The WU FRA_t411_CASP8_1UAR_2_IGNORE_THE_RESTt411_2_aa1uar.man_align1.template_0010_3670_4598_1 crunched for the normal length of time, produced 1277 decoys, and the stderr looks normal, but it was marked "invalid".

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 53685 - Posted: 14 Jun 2008, 11:17:28 UTC
Here's another FRA_t411_CASP8_1UAR_2_IGNORE_THE_RESTt411_2_aa1uar.man_align1.template_0010_3670_ WU that had a validate error for both crunchers.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 53689 - Posted: 14 Jun 2008, 16:38:10 UTC
This one, FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_17465_1, crashed and burned with a validate error on my system and on the one before me.

Resnick_MEDIC_Lab Joined: 5 Jun 06 Posts: 2751 Credit: 4,276,053 RAC: 0	Message 53691 - Posted: 14 Jun 2008, 20:15:52 UTC
FRA_t421_CASP8_ROB_1_IGNORE_THE_RESTt421_1_t411_.finalmodel05.noNCterm_3667_452_0
CPU time: 11039.21
stderr out:
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2726968
==
</stderr_txt>
]]>
Validate state: Invalid
Claimed credit: 46.989849356256
Granted credit: 0
Application version: 5.96

Bill Hepburn Joined: 18 Sep 05 Posts: 14 Credit: 14,975,271 RAC: 0	Message 53713 - Posted: 16 Jun 2008, 15:18:36 UTC
This one ran to about 28%, then sat there "running" but consuming no CPU cycles for about 12 hours or so, until I noticed and aborted it. That hasn't happened in ages.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156585874

ConflictingEmotions Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0	Message 53714 - Posted: 16 Jun 2008, 15:22:28 UTC
I have been getting a number of errors on 64-bit SMP Linux (Fedora 8):
* glibc detected * free(): invalid next size (normal): 0x0959d100 *
* glibc detected * free(): invalid next size (normal): 0x0959d100 *
* glibc detected * free(): invalid next size (normal): 0x0959d100 *
* glibc detected * double free or corruption (!prev): 0x0959d100 *
* glibc detected * double free or corruption (!prev): 0x0959d100 *
* glibc detected * free(): invalid next size (normal): 0x0959d100 *
* glibc detected * free(): invalid next size (normal): 0x0959d100 *
* glibc detected * free(): invalid next size (normal): 0x0959d100 *
These appear to freeze BOINC, because they continue after restarting BOINC.
The tasks have the prefix t405_CASP8_JUMPAB, e.g. t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_356_0
The stdout.txt files contain the following messages many times:
res 13 and var 1 at position 1 is not a proper Nterm variant
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 7.83973 1 1 8.21224 0.746619
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 4.71364 2 1 4.14663 0.817297
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 9.48273 3 1 -10.3568 -1.28817

mikus Joined: 7 Nov 05 Posts: 58 Credit: 700,115 RAC: 0	Message 53717 - Posted: 16 Jun 2008, 17:29:26 UTC
Looked at my multi-core AMD machine (64-bit Linux client) and saw that four cores were idle. In each case, the WU had "completed" (that is, had used up as much as it should of the time allocated), but had FAILED to tell the BOINC client that the task was done. As a result, the BOINC client kept that task dispatched on a CPU core -- but since the task had finished, the corresponding core was idle. [Boincmgr showed 100% complete for these tasks, and showed that these "running" tasks were NOT accumulating any more CPU time.]
Unfortunately, the only way I had to get my CPUs running again was to abort these tasks (and that in turn caused their results to be thrown away -- my system had crunched these tasks uselessly).
I've encountered this failure before with Rosetta, but that was on a different hardware system, then using the 32-bit Linux client. My conclusion is that there is a problem (at least on multi-core hardware running Linux) with how the Rosetta executable notifies the BOINC client that an application's workunit has finished.
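
The symptom described above (a task shown as 100% complete and "running" that no longer accumulates CPU time) can be checked mechanically by comparing two snapshots of the task list taken some minutes apart. The sketch below assumes the snapshots have already been collected by some other means; the `TaskSample` type, its fields, and the `find_stalled` function are hypothetical and do not correspond to any BOINC API.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One observation of a task (hypothetical structure, not a BOINC type)."""
    name: str
    fraction_done: float  # 0.0 to 1.0, as a fraction rather than a percent
    cpu_time: float       # accumulated CPU seconds


def find_stalled(before, after, min_cpu_delta=1.0):
    """Return names of tasks that are at 100% yet gained no CPU time."""
    earlier = {sample.name: sample for sample in before}
    stalled = []
    for sample in after:
        previous = earlier.get(sample.name)
        if previous is None:
            continue  # task was not present in the first snapshot
        gained = sample.cpu_time - previous.cpu_time
        if sample.fraction_done >= 1.0 and gained < min_cpu_delta:
            stalled.append(sample.name)
    return stalled


if __name__ == "__main__":
    first = [TaskSample("wu_a", 1.0, 10800.0), TaskSample("wu_b", 0.4, 4000.0)]
    second = [TaskSample("wu_a", 1.0, 10800.0), TaskSample("wu_b", 0.5, 4600.0)]
    print(find_stalled(first, second))
```

The `min_cpu_delta` threshold is an arbitrary choice here; it only needs to be small relative to the interval between the two snapshots.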

netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0	Message 53718 - Posted: 16 Jun 2008, 18:17:15 UTC - in response to Message 53717.
Myself as well. I am seeing a lot of these. It does not seem to matter what the processor is, but, seems to be happening on the CASP8 tasks. Any reason that CASP8 is being run on a BETA version??? I will be aborting about a dozen units here in a few minutes.
*Looking for a team ??? Join BoincSynergy!!*

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 53721 - Posted: 16 Jun 2008, 18:27:44 UTC
https://boinc.bakerlab.org/rosetta/result.php?resultid=170401850
Just a validate error, with 1313 decoys generated(?)

ConflictingEmotions Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0	Message 53723 - Posted: 16 Jun 2008, 18:38:48 UTC - in response to Message 53714.
After multiple restarts of the boinc client, I have terminated these t405_CASP8_JUMPAB tasks. There clearly are some major bugs in rosetta and boinc involved here!

netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0	Message 53724 - Posted: 16 Jun 2008, 19:26:06 UTC - in response to Message 53718. Last modified: 16 Jun 2008, 19:27:13 UTC
Here are a few of the aborts... All the machines with trouble are NetBurst or Dual Core systems... none of my uni-processor machines are showing these symptoms. The NetBurst MP systems have 4 or 8GB ram. The Dual-Cores are all 2GB ram.. so memory is probably not an issue. These all seem to be t404 or t405 CASP8 units....
https://boinc.bakerlab.org/rosetta/result.php?resultid=171381753
https://boinc.bakerlab.org/rosetta/result.php?resultid=171425636
https://boinc.bakerlab.org/rosetta/result.php?resultid=171446390
https://boinc.bakerlab.org/rosetta/result.php?resultid=171523734
https://boinc.bakerlab.org/rosetta/result.php?resultid=171536681
https://boinc.bakerlab.org/rosetta/result.php?resultid=171139128

David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0	Message 53725 - Posted: 16 Jun 2008, 19:38:03 UTC Last modified: 16 Jun 2008, 19:40:31 UTC
Compute error here at about 20 minutes CPU time: resultid=171712340
Large debugger output at the link. I was the wingman; both computers had similar issues.

Allan Hojgaard Joined: 4 May 08 Posts: 9 Credit: 591,749 RAC: 0	Message 53727 - Posted: 16 Jun 2008, 20:59:00 UTC
Here are my problems with the 5.96 beta version:
WU 156587340: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_38843_0
WU 156567311: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_28688_0
WU 156416544: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_10358_0
Those WUs were all completed (100%), but BOINC still listed them as running and the processes were not consuming any CPU. I had to abort them in order for BOINC to pick the next WUs in line. All 3 had exit status -197 (0xffffff3b). Zero credit received.
I am using BOINC 5.10.45 on Ubuntu 8.04 with an Intel Core2 Duo T7300 @ 2GHz on a Mobile Intel® 965 Express Chipset with 2GB of RAM.