Message boards : Number crunching : Problems with Rosetta version 5.80
Author | Message |
---|---|
Purple Rabbit Send message Joined: 24 Sep 05 Posts: 28 Credit: 4,316,947 RAC: 967 |
I think this is the same problem Paul is describing. I've had two WUs error out with "error 1" after an hour or two of processing. These are from two different Linux computers. I had a third one complete successfully on yet another Linux computer. These are the only "v001" WUs I've finished. Looking at Paul's computer, he is also seeing some success and some failure with "v001". All my computers are running SUSE 10. The first bad one I shrugged off as gremlins. The second happened less than an hour later so it looked like a trend :-) I've got 5 more of these puppies waiting...sigh Bad: v001_1_NMRREF_1_v001_1_id_model_13IGNORE_THE_REST_idl_2125_1698 Bad #1 v001_1_NMRREF_1_v001_1_id_model_10IGNORE_THE_REST_idl_2125_1167 Bad #2 Good: v001_1_NMRREF_1_v001_1_id_model_20IGNORE_THE_REST_idl_2125_424 Good #1 |
Greenshit Send message Joined: 30 Jan 07 Posts: 3 Credit: 55,173 RAC: 0 |
Compute error here: https://boinc.bakerlab.org/rosetta/result.php?resultid=108782014 - exit code -1073741819 (0xc0000005) |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
|
Purple Rabbit Send message Joined: 24 Sep 05 Posts: 28 Credit: 4,316,947 RAC: 967 |
...I've got 5 more of these puppies waiting...sigh... An update: All five of the puppies finished successfully. One of my bad WU was successfully completed by another. The other bad one died a second time and was put to rest. With the scattered reports of occasional failures I'm guessing this is probably an initial conditions (random seed) problem and/or 5.80 not being able to handle the output for particular starting conditions. Things aren't totally broken for "v001", but something ain't quite right :-) Rick Can the forum moderator fix the formatting for this thread? It's way off my screen to the right. I've spent some time adding hard returns to make my posts more readable. |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
Add this one to the list. Watchdog is not activated. v001_1_NMRREF_1_v001_1_id_model_03IGNORE_THE_REST_idl_2125_1909 |
Dotsch Send message Joined: 12 Feb 06 Posts: 111 Credit: 241,803 RAC: 0 |
I had a 5.8 WU which reset to 0% after a stop and start of the BOINC client. |
M.L. Send message Joined: 21 Nov 06 Posts: 182 Credit: 180,462 RAC: 0 |
Result ID 108706598 Name v001_1_NMRREF_1_v001_1_id_model_18IGNORE_THE_REST_idl_2125_859_0 Workunit 98755213 Created 28 Sep 2007 6:18:07 UTC Sent 28 Sep 2007 6:19:01 UTC Received 1 Oct 2007 9:37:48 UTC Server state Over Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) Computer ID 510574 Report deadline 8 Oct 2007 6:19:01 UTC CPU time 2098.765625 stderr out <core_client_version>5.10.20</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 21600 # random seed: 1801142 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00853013 read attempt to address 0xE9C9ED2C Engaging BOINC Windows Runtime Debugger... |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I had a 5.8 WU which reset to 0% after a stop and start of the BOINC client. This would be normal if the task had not completed the first model and had not reached a checkpoint. You will always lose something when you exit BOINC. For most tasks and most machine configurations, you will lose less than about 15 minutes. But for some types of tasks, it can be more on the order of an hour. Rosetta Moderator: Mod.Sense |
Dotsch Send message Joined: 12 Feb 06 Posts: 111 Credit: 241,803 RAC: 0 |
I had a 5.8 WU which reset to 0% after a stop and start of the BOINC client. The task was about 60 to 70% complete (after about 2 hours of computing time), so I don't think that is normal. |
M.L. Send message Joined: 21 Nov 06 Posts: 182 Credit: 180,462 RAC: 0 |
And now another! Result ID 108699749 Name v001_1_NMRREF_1_v001_1_id_model_18IGNORE_THE_REST_idl_2125_625_0 Workunit 98748895 Created 28 Sep 2007 5:36:50 UTC Sent 28 Sep 2007 5:37:48 UTC Received 1 Oct 2007 14:18:38 UTC Server state Over Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) Computer ID 510574 Report deadline 8 Oct 2007 5:37:48 UTC CPU time 13589.703125 stderr out <core_client_version>5.10.20</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 21600 # random seed: 1801376 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00715B96 read attempt to address 0x5698A8EE Engaging BOINC Windows Runtime Debugger... |
Christoph Jansen Send message Joined: 6 Jun 06 Posts: 248 Credit: 267,153 RAC: 0 |
This one: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=98865167 keeps stalling: the CPU time stops counting up indefinitely, so the watchdog does not shut it down. I had the same thing with another one last night; it must have been stuck for some hours. I didn't note the number, though. |
Alexander Send message Joined: 29 May 07 Posts: 1 Credit: 119,573 RAC: 0 |
5.80 repeatedly crashed. My computer ID = 519167. The following is the crash information: Faulting application rosetta_beta_5.80_windows_intelx86.exe, version 0.0.0.0, faulting module ntdll.dll, version 5.1.2600.2180, fault address 0x00013396. <?xml version="1.0" encoding="UTF-16"?> <DATABASE> <EXE NAME="rosetta_beta_5.80_windows_intelx86.exe" FILTER="GRABMI_FILTER_PRIVACY"> <MATCHING_FILE NAME="rosetta_5.69_windows_intelx86.exe" SIZE="2570240" CHECKSUM="0x57279008" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="08/20/2007 21:12:18" UPTO_LINK_DATE="08/20/2007 21:12:18" /> <MATCHING_FILE NAME="rosetta_beta_5.80_windows_intelx86.exe" SIZE="2575872" CHECKSUM="0xA6936F6C" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="09/12/2007 04:42:46" UPTO_LINK_DATE="09/12/2007 04:42:46" /> </EXE> <EXE NAME="ntdll.dll" FILTER="GRABMI_FILTER_THISFILEONLY"> <MATCHING_FILE NAME="ntdll.dll" SIZE="708096" CHECKSUM="0x9D20568" BIN_FILE_VERSION="5.1.2600.2180" BIN_PRODUCT_VERSION="5.1.2600.2180" PRODUCT_VERSION="5.1.2600.2180" FILE_DESCRIPTION="NT Layer DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="5.1.2600.2180 (xpsp_sp2_rtm.040803-2158)" ORIGINAL_FILENAME="ntdll.dll" INTERNAL_NAME="ntdll.dll" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." 
VERFILEDATEHI="0x0" VERFILEDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0xAF2F7" LINKER_VERSION="0x50001" UPTO_BIN_FILE_VERSION="5.1.2600.2180" UPTO_BIN_PRODUCT_VERSION="5.1.2600.2180" LINK_DATE="08/04/2004 07:56:36" UPTO_LINK_DATE="08/04/2004 07:56:36" VER_LANGUAGE="English (United States) [0x409]" /> </EXE> <EXE NAME="kernel32.dll" FILTER="GRABMI_FILTER_THISFILEONLY"> <MATCHING_FILE NAME="kernel32.dll" SIZE="984576" CHECKSUM="0xF0B331F6" BIN_FILE_VERSION="5.1.2600.3119" BIN_PRODUCT_VERSION="5.1.2600.3119" PRODUCT_VERSION="5.1.2600.3119" FILE_DESCRIPTION="Windows NT BASE API Client DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="5.1.2600.3119 (xpsp_sp2_gdr.070416-1301)" ORIGINAL_FILENAME="kernel32" INTERNAL_NAME="kernel32" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERFILEDATEHI="0x0" VERFILEDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0xF9293" LINKER_VERSION="0x50001" UPTO_BIN_FILE_VERSION="5.1.2600.3119" UPTO_BIN_PRODUCT_VERSION="5.1.2600.3119" LINK_DATE="04/16/2007 15:52:53" UPTO_LINK_DATE="04/16/2007 15:52:53" VER_LANGUAGE="English (United States) [0x409]" /> </EXE> </DATABASE> |
Mac-Nic Send message Joined: 6 Jul 06 Posts: 7 Credit: 50,523 RAC: 0 |
There seems to be a problem with this unit. |
Christoph Jansen Send message Joined: 6 Jun 06 Posts: 248 Credit: 267,153 RAC: 0 |
Now finally BOINC hung when I tried to shut down my computer (I didn't notice that for about an hour). This caused all the remaining WUs still present on the drive to error out, whatever the cause for that may be. BOINC probably finished a WU and tried to execute the next one but didn't get any system resources to do so, as the shutdown process was meant to go on. So one after another they all marched straight into oblivion. Poor wretches... |
Snags Send message Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
This WU 98810686 got stuck after 2:19:44. Quitting and restarting BOINC got things going again. First Einstein and WCG WUs ran for about 3 1/2 hours before Rosetta started again from the beginning. It ran three times (from 21:46 to 1:52, 7:57 to 11:26, 14:57 to 21:36) before completing 3 decoys in 36113.17 CPU seconds. My runtime is set for 10 hours. Hope this helps. |
Snags Send message Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
This WU 98810686 got stuck after 2:19:44. Quitting and restarting BOINC got things going again. First Einstein and WCG WUs ran for about 3 1/2 hours before Rosetta started again from the beginning. It ran three times (from 21:46 to 1:52, 7:57 to 11:26, 14:57 to 21:36) before completing 3 decoys in 36113.17 CPU seconds. My runtime is set for 10 hours. Hope this helps. I meant to add that this is the second 5.80 WU that got stuck but completed successfully after restarting BOINC. I apologize for not noting the number. I dashed off to work as soon as I restarted and forgot about it until this one stuck. |
TomaszPawel Send message Joined: 28 Apr 07 Posts: 54 Credit: 2,791,145 RAC: 0 |
Errors: See this: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=98806026 and https://boinc.bakerlab.org/rosetta/workunit.php?wuid=98840340 What do you think? |
Tuc Send message Joined: 30 Sep 07 Posts: 4 Credit: 1,006 RAC: 0 |
You're gonna LOVE me for this.... :) BACKGROUND: I'm running the rosetta_beta_5.80_i686-pc-linux-gnu under Linux emulation on FreeBSD.. (So already I'm sure I've really confused things. ;) ) When I first attached to the project, as it was downloading files, all of a sudden my boinc_client just "disappeared". No core, nothing. Not sure why. Weird... Never did that before. PROBLEM 1: It started to run "HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_2121_45549_0". After a while, the process started to use 0 CPU, so I checked... the stderr.txt had: No heartbeat from core client for 31 sec - exiting pure virtual method called terminate called without an active exception SIGABRT: abort called *** glibc detected *** corrupted double-linked list: 0x08f61f98 *** SIGABRT: abort called No idea why I'd miss a heartbeat. I killed the processes. PROBLEM 2: It restarted that WU and then later on: No heartbeat from core client for 31 sec - exiting SIGSEGV: segmentation violation SIGABRT: abort called SIGABRT: abort called (And about 1304 more of these) It was still taking a lot of CPU, but BOINC Manager didn't show any updates to anything, so I restarted it again.... I'm still trying to get through my first WU under this emulation.. Thanks, Tuc |
JChojnacki Send message Joined: 17 Sep 05 Posts: 71 Credit: 10,747,694 RAC: 1,797 |
Hey, Got an error today with this work unit: HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_2121_43593_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=109858641 Thanks, ~Joel |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hi all: We're trying to track down several sources of error. I'm not sure if anyone's posted about this, but a small number of workunits with the batch number 2056: mcr1__BOINC_ABRELAX-mcr1_-mfr__2056_ appear to be flawed. I've cancelled the job; you should also feel free to abort these jobs if you see them. There aren't that many. I just fixed the problem and sent out a similar job with ID 2059. We're looking into a few more issues too. I've just contacted the people in charge of the other jobs... thanks *very* much for posting! |
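A note for anyone comparing the Windows failures reported in this thread: the exit status -1073741819 and the hex value 0xc0000005 are the same 32-bit number, printed once as a signed integer and once as unsigned hex. On Windows that status is the access violation named in the stderr output quoted above. The following is a minimal Python sketch of the conversion; the function names are made up for illustration and are not part of BOINC or Rosetta.

```python
# Hypothetical helpers (not BOINC or Rosetta code) for converting between
# the signed exit status shown on a result page and its unsigned hex form.

def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned hex."""
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which maps a negative
    # value to its two's-complement unsigned equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def to_signed32(status: int) -> int:
    """Reinterpret an unsigned 32-bit status as a signed integer."""
    # If the top bit is set, the signed value is the unsigned value minus 2**32.
    return status - (1 << 32) if status & 0x80000000 else status


if __name__ == "__main__":
    print(to_ntstatus(-1073741819))
    print(to_signed32(0xC0000005))
```

This only shows that the two numbers in the reports agree with each other; it says nothing about why 5.80 hit the access violation in the first place.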
©2024 University of Washington
https://www.bakerlab.org