Message boards : Number crunching : Problems with version 5.96
Author | Message |
---|---|
nouqraz Joined: 8 Apr 08 Posts: 6 Credit: 394,541 RAC: 2,696 |
Woke up this morning to an error message on my screen - rosetta_beta_5.96_windows_intelx86.exe caused an error and needs to close blah blah... This task: https://boinc.bakerlab.org/rosetta/result.php?resultid=169429650 <core_client_version>5.10.45</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 10800 # random seed: 2005994 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00BF05EB read attempt to address 0x0E3AF000 Engaging BOINC Windows Runtime Debugger... .... full debugging info in link to task. |
(_KoDAk_) Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=169050697 https://boinc.bakerlab.org/rosetta/result.php?resultid=168985785 |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
And here we have a 24 hour validate error: resultid=169434889 The stderr_txt has basically nothing to say as to why this might be invalid... Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Two validate errors with no information, after a full run of 4 hrs. FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_14377_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=168398596 FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_14380_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=168398613 |
Azurrio Joined: 20 Feb 06 Posts: 8 Credit: 237,979 RAC: 0 |
Validate error on this WU, which ran for 24 hours. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=153656004 |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
The WU t407__CASP8_JUMPAB_SAMPLE1_res1to79_SAVE_ALL_OUT_BARCODE_hom001__3653_173 crunched for the normal length of time, but was marked "invalid". From the stderr: <message><file_xfer_error> <file_name>t407__CASP8_JUMPAB_SAMPLE1_res1to79_SAVE_ALL_OUT_BARCODE_hom001__3653_173_1_0</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error> The other person to crunch this WU got the same error. |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
The WU FRA_t411_CASP8_1UAR_2_IGNORE_THE_RESTt411_2_aa1uar.man_align1.template_0010_3670_4598_1 crunched for the normal length of time and produced 1277 decoys, and the stderr looks normal, but it was marked "invalid". |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Here's another FRA_t411_CASP8_1UAR_2_IGNORE_THE_RESTt411_2_aa1uar.man_align1.template_0010_3670_ WU that had a validate error for both crunchers. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This WU, FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_17465_1, crashed and burned with a validate error on my system and on the one before me. |
The_Bad_Penguin Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
FRA_t421_CASP8_ROB_1_IGNORE_THE_RESTt421_1_t411_.finalmodel05.noNCterm_3667_452_0 CPU time 11039.21 stderr out <core_client_version>5.10.13</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 10800 # random seed: 2726968 == </stderr_txt> ]]> Validate state Invalid Claimed credit 46.989849356256 Granted credit 0 application version 5.96 |
Bill Hepburn Joined: 18 Sep 05 Posts: 14 Credit: 14,953,680 RAC: 3,146 |
This one ran to about 28%, then sat there "running" but consuming no CPU cycles for about 12 hours or so, until I noticed it and aborted it. That hasn't happened in ages. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156585874 |
ConflictingEmotions Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0 |
I have been getting a number of errors on 64-bit SMP Linux (Fedora 8): *** glibc detected *** free(): invalid next size (normal): 0x0959d100 *** *** glibc detected *** free(): invalid next size (normal): 0x0959d100 *** *** glibc detected *** free(): invalid next size (normal): 0x0959d100 *** *** glibc detected *** double free or corruption (!prev): 0x0959d100 *** *** glibc detected *** double free or corruption (!prev): 0x0959d100 *** *** glibc detected *** free(): invalid next size (normal): 0x0959d100 *** *** glibc detected *** free(): invalid next size (normal): 0x0959d100 *** *** glibc detected *** free(): invalid next size (normal): 0x0959d100 *** These errors appear to freeze BOINC, and they continue after BOINC is restarted. The tasks have the prefix t405_CASP8_JUMPAB, e.g. t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_356_0 The stdout.txt files contain the following message many times: res 13 and var 1 at position 1 is not a proper Nterm variant STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 7.83973 1 1 8.21224 0.746619 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 4.71364 2 1 4.14663 0.817297 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 9.48273 3 1 -10.3568 -1.28817 |
mikus Joined: 7 Nov 05 Posts: 58 Credit: 700,115 RAC: 0 |
Looked at my multi-core AMD machine (64-bit Linux client) and saw that four cores were idle. In each case, the WU had "completed" (that is, had used up as much as it should of the time allocated), but had FAILED to tell the BOINC client that the task was done. As a result, the BOINC client kept that task dispatched on a CPU core -- but since the task had finished, the corresponding core was idle. [Boincmgr showed 100% complete for these tasks, and showed that these "running" tasks were NOT accumulating any more CPU time.] Unfortunately, the only way I had to get my CPUs running again was to abort these tasks (and that in turn caused their results to be thrown away -- my system had crunched these tasks uselessly). I've encountered this failure before with Rosetta - but that was on a different hardware system, then using the 32-bit Linux client. My conclusion is that there is a problem (at least on multi-core hardware running Linux) with how the Rosetta executable notifies the BOINC client that an application's workunit has finished. |
netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
[Quoting mikus:] Looked at my multi-core AMD machine (64-bit Linux client) and saw that four cores were idle. In each case, the WU had "completed" (that is, had used up as much as it should of the time allocated), but had FAILED to tell the BOINC client that the task was done. As a result, the BOINC client kept that task dispatched on a CPU core -- but since the task had finished, the corresponding core was idle. [Boincmgr showed 100% complete for these tasks, and showed that these "running" tasks were NOT accumulating any more CPU time.] Unfortunately, the only way I had to get my CPUs running again was to abort these tasks (and that in turn caused their results to be thrown away -- my system had crunched these tasks uselessly). [End of quote.] Same here; I am seeing a lot of these. It does not seem to matter what the processor is, but it does seem to be happening on the CASP8 tasks. Is there any reason that CASP8 is being run on a BETA version? I will be aborting about a dozen units here in a few minutes. Looking for a team ??? Join BoincSynergy!! |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=170401850 Just a validate error, with 1313 decoys generated(?) |
ConflictingEmotions Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0 |
After multiple restarts of the boinc client, I have terminated these t405_CASP8_JUMPAB tasks. There clearly are some major bugs in rosetta and boinc involved here! I have been getting a number of errors on 64-bit SMP Linux (Fedora 8): |
netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
[Quoting mikus:] Looked at my multi-core AMD machine (64-bit Linux client) and saw that four cores were idle. In each case, the WU had "completed" (that is, had used up as much as it should of the time allocated), but had FAILED to tell the BOINC client that the task was done. As a result, the BOINC client kept that task dispatched on a CPU core -- but since the task had finished, the corresponding core was idle. [Boincmgr showed 100% complete for these tasks, and showed that these "running" tasks were NOT accumulating any more CPU time.] Unfortunately, the only way I had to get my CPUs running again was to abort these tasks (and that in turn caused their results to be thrown away -- my system had crunched these tasks uselessly). [End of quote.] Here are a few of the aborts. All the machines with trouble are NetBurst or dual-core systems; none of my uniprocessor machines are showing these symptoms. The NetBurst MP systems have 4 or 8 GB of RAM and the dual-cores all have 2 GB, so memory is probably not an issue. These all seem to be t404 or t405 CASP8 units. https://boinc.bakerlab.org/rosetta/result.php?resultid=171381753 https://boinc.bakerlab.org/rosetta/result.php?resultid=171425636 https://boinc.bakerlab.org/rosetta/result.php?resultid=171446390 https://boinc.bakerlab.org/rosetta/result.php?resultid=171523734 https://boinc.bakerlab.org/rosetta/result.php?resultid=171536681 https://boinc.bakerlab.org/rosetta/result.php?resultid=171139128 Looking for a team ??? Join BoincSynergy!! |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
Compute error here at about 20 minutes CPU time: resultid=171712340 Large debugger output at the link. I was the wingman; both computers had similar issues. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Allan Hojgaard Joined: 4 May 08 Posts: 9 Credit: 591,749 RAC: 0 |
Here are my problems with the 5.96 beta version: WU 156587340: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_38843_0 WU 156567311: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_28688_0 WU 156416544: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_10358_0 Those WUs were all completed (100%), but BOINC still listed them as running and the processes were not consuming any CPU %. I had to abort them in order for BOINC to pick the next WUs in line. All 3 had Exit status -197 (0xffffff3b). Zero credit received. I am using BOINC 5.10.45 on Ubuntu 8.04 with an Intel Core2 Duo T7300 @ 2GHz on a Mobile Intel® 965 Express Chipset with 2GB of RAM. |
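
Several reports in this thread give an exit code in two forms, signed decimal and hex: -1073741819 (0xc0000005) for the Windows access violation, and -197 (0xffffff3b) for the aborted tasks. A minimal sketch of how the two forms relate, assuming the codes are 32-bit two's-complement values. The function names `to_hex32` and `to_signed32` are made up for illustration and are not part of BOINC or Rosetta.

```python
def to_hex32(exit_code: int) -> str:
    """Return the 32-bit two's-complement hex form of a signed exit code."""
    # Masking with 0xFFFFFFFF maps negative values into the unsigned 32-bit range.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def to_signed32(value: int) -> int:
    """Inverse: interpret an unsigned 32-bit value as a signed integer."""
    # If the top bit is set, the value represents a negative number.
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


if __name__ == "__main__":
    # The two exit codes quoted in the posts above.
    for code in (-1073741819, -197):
        print(code, "->", to_hex32(code))
```

This can help when searching the task pages, since a log may show only one of the two forms for the same error.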
©2024 University of Washington
https://www.bakerlab.org