Problems with Rosetta version 5.80

Profile Purple Rabbit
Joined: 24 Sep 05
Posts: 28
Credit: 4,316,947
RAC: 967
Message 47153 - Posted: 28 Sep 2007, 17:44:48 UTC
Last modified: 28 Sep 2007, 18:27:48 UTC

I think this is the same problem Paul is describing.

I've had two WUs error out with "error 1" after an hour or two of processing. These are from two
different Linux computers. I had a third one complete successfully on yet another Linux computer. These are the only "v001" WUs
I've finished. Looking at Paul's computer, he is also seeing some successes and some failures with "v001".

All my computers are running Suse 10. The first bad one I shrugged off as gremlins. The second happened less than an hour later
so it looked like a trend :-) I've got 5 more of these puppies waiting...sigh

Bad:

v001_1_NMRREF_1_v001_1_id_model_13IGNORE_THE_REST_idl_2125_1698 Bad #1

v001_1_NMRREF_1_v001_1_id_model_10IGNORE_THE_REST_idl_2125_1167 Bad #2

Good:

v001_1_NMRREF_1_v001_1_id_model_20IGNORE_THE_REST_idl_2125_424 Good #1
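
For anyone collecting these reports, the names can be split into fields so that good and bad results are easier to compare. This is only a sketch: the field meanings (model number, batch, index) are an assumption read off the three names above, not something the project has documented here, and `parse_name` is a hypothetical helper.

```python
import re

# Assumed layout: ..._model_<NN>IGNORE_THE_REST_idl_<batch>_<index>
NAME_RE = re.compile(r"_model_(\d+)IGNORE_THE_REST_idl_(\d+)_(\d+)")

# The three results reported in this post.
reports = {
    "v001_1_NMRREF_1_v001_1_id_model_13IGNORE_THE_REST_idl_2125_1698": "bad",
    "v001_1_NMRREF_1_v001_1_id_model_10IGNORE_THE_REST_idl_2125_1167": "bad",
    "v001_1_NMRREF_1_v001_1_id_model_20IGNORE_THE_REST_idl_2125_424": "good",
}


def parse_name(name):
    """Return the assumed fields of a v001 workunit name, or None if it does not match."""
    match = NAME_RE.search(name)
    if match is None:
        return None
    model, batch, index = (int(group) for group in match.groups())
    return {"model": model, "batch": batch, "index": index}


for name, outcome in reports.items():
    print(outcome, parse_name(name))
```

All three share batch 2125 and differ in model number and index, which fits the idea that the failures depend on the particular starting conditions rather than on the batch as a whole.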
ID: 47153 · Rating: 0
Greenshit
Joined: 30 Jan 07
Posts: 3
Credit: 55,173
RAC: 0
Message 47164 - Posted: 28 Sep 2007, 19:18:50 UTC

Compute error here:
https://boinc.bakerlab.org/rosetta/result.php?resultid=108782014

- exit code -1073741819 (0xc0000005)
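
The two numbers in that line are the same value written two ways: the exit code is reported as a signed 32-bit integer, and the hex form is its unsigned bit pattern. 0xC0000005 is the Windows status code for an access violation. A minimal sketch of the conversion (`to_ntstatus` is a hypothetical helper name):

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status value."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(to_ntstatus(-1073741819))  # 0xc0000005
```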
ID: 47164 · Rating: 0
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 47184 - Posted: 29 Sep 2007, 2:47:09 UTC
Last modified: 29 Sep 2007, 3:41:27 UTC

ID: 47184 · Rating: 0
Profile Purple Rabbit
Joined: 24 Sep 05
Posts: 28
Credit: 4,316,947
RAC: 967
Message 47230 - Posted: 29 Sep 2007, 22:45:43 UTC - in response to Message 47153.  
Last modified: 29 Sep 2007, 23:09:42 UTC

...I've got 5 more of these puppies waiting...sigh...

An update:

All five of the puppies finished successfully. One of my bad WUs was successfully completed by another computer. The other bad one died
a second time and was put to rest.

With the scattered reports of occasional failures I'm guessing this is probably an initial conditions (random seed) problem
and/or 5.80 not being able to handle the output for particular starting conditions. Things aren't totally broken for "v001",
but something ain't quite right :-)

Rick

Can the forum moderator fix the formatting for this thread? It's way off my screen to the right. I've spent some time
adding hard returns to make my posts more readable.
ID: 47230 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 47233 - Posted: 30 Sep 2007, 8:01:35 UTC

Add this one to the list. Watchdog is not activated.

v001_1_NMRREF_1_v001_1_id_model_03IGNORE_THE_REST_idl_2125_1909
ID: 47233 · Rating: 0
Dotsch
Joined: 12 Feb 06
Posts: 111
Credit: 241,803
RAC: 0
Message 47235 - Posted: 30 Sep 2007, 9:39:56 UTC

I had a 5.8 WU which reset to 0% after I stopped and restarted the BOINC client.
ID: 47235 · Rating: 0
M.L.
Joined: 21 Nov 06
Posts: 182
Credit: 180,462
RAC: 0
Message 47263 - Posted: 1 Oct 2007, 9:53:38 UTC

Result ID 108706598
Name v001_1_NMRREF_1_v001_1_id_model_18IGNORE_THE_REST_idl_2125_859_0
Workunit 98755213
Created 28 Sep 2007 6:18:07 UTC
Sent 28 Sep 2007 6:19:01 UTC
Received 1 Oct 2007 9:37:48 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 510574
Report deadline 8 Oct 2007 6:19:01 UTC
CPU time 2098.765625
stderr out <core_client_version>5.10.20</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1801142


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00853013 read attempt to address 0xE9C9ED2C

Engaging BOINC Windows Runtime Debugger...




ID: 47263 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 47269 - Posted: 1 Oct 2007, 12:45:51 UTC - in response to Message 47235.  

I had a 5.8 WU which reset to 0% after I stopped and restarted the BOINC client.


This would be normal if the task had not completed the first model, and not reached a checkpoint.

You will always lose something when you exit BOINC. For most tasks, and most machine configurations, you will lose less than about 15 minutes. But for some types of tasks, it can be more on the order of an hour.
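
As a rough illustration of why the loss varies: on restart a task resumes from its last checkpoint, so the work redone is the time between that checkpoint and the moment BOINC exited. The function and the numbers below are hypothetical, not taken from Rosetta's code.

```python
def work_lost(stop_time_s, checkpoint_times_s):
    """CPU seconds that must be redone: time since the last checkpoint before the stop."""
    earlier = [t for t in checkpoint_times_s if t <= stop_time_s]
    last_checkpoint = max(earlier, default=0)  # no checkpoint means restarting from zero
    return stop_time_s - last_checkpoint


# Hypothetical task stopped after 2 hours of CPU time.
print(work_lost(7200, [900, 1800, 2700]))  # 4500: last checkpoint was at 45 minutes
print(work_lost(7200, []))                 # 7200: no checkpoint yet, back to 0%
```

A task that has not written any checkpoint when the client stops is the case that shows up as a reset to 0%.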
Rosetta Moderator: Mod.Sense
ID: 47269 · Rating: 0
Dotsch
Joined: 12 Feb 06
Posts: 111
Credit: 241,803
RAC: 0
Message 47274 - Posted: 1 Oct 2007, 13:38:17 UTC - in response to Message 47269.  

The task was about 60 to 70% complete (at about 2 hours of computing time), so I don't think this is normal.
ID: 47274 · Rating: 0
M.L.
Joined: 21 Nov 06
Posts: 182
Credit: 180,462
RAC: 0
Message 47276 - Posted: 1 Oct 2007, 14:20:34 UTC

And now another!

Result ID 108699749
Name v001_1_NMRREF_1_v001_1_id_model_18IGNORE_THE_REST_idl_2125_625_0
Workunit 98748895
Created 28 Sep 2007 5:36:50 UTC
Sent 28 Sep 2007 5:37:48 UTC
Received 1 Oct 2007 14:18:38 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 510574
Report deadline 8 Oct 2007 5:37:48 UTC
CPU time 13589.703125
stderr out <core_client_version>5.10.20</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1801376


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00715B96 read attempt to address 0x5698A8EE

Engaging BOINC Windows Runtime Debugger...
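
To compare this crash with the one in Message 47263, the "Reason:" lines can be pulled apart with a regular expression. This is a sketch that assumes the line format seen in these two reports; the pattern name and group names are hypothetical.

```python
import re

# Pattern written against the two "Reason:" lines posted in this thread.
REASON_RE = re.compile(
    r"Reason: (?P<kind>.+?) \((?P<code>0x[0-9a-fA-F]+)\) "
    r"at address (?P<pc>0x[0-9a-fA-F]+) "
    r"(?P<op>read|write) attempt to address (?P<target>0x[0-9a-fA-F]+)"
)

lines = [
    "Reason: Access Violation (0xc0000005) at address 0x00853013 read attempt to address 0xE9C9ED2C",
    "Reason: Access Violation (0xc0000005) at address 0x00715B96 read attempt to address 0x5698A8EE",
]

parsed = [REASON_RE.search(line).groupdict() for line in lines]
for report in parsed:
    print(report["code"], report["op"], "fault at", report["pc"], "touching", report["target"])
```

Both are read access violations with the same status code, but the faulting instruction addresses differ, so the two crashes were not raised at the same place in the executable.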




ID: 47276 · Rating: 0
Profile Christoph Jansen
Joined: 6 Jun 06
Posts: 248
Credit: 267,153
RAC: 0
Message 47295 - Posted: 1 Oct 2007, 21:06:06 UTC
Last modified: 1 Oct 2007, 21:06:35 UTC

This one:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=98865167

keeps stalling: its CPU time stops counting up and stays stuck indefinitely, so the watchdog never shuts it down.

I had the same thing happen with another one last night; it must have been stuck for some hours. I didn't note the number, though.
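
To spell out the reasoning: if a watchdog only compares accumulated CPU time against a limit, a task whose CPU time has stopped advancing can never reach that limit. The sketch below contrasts that with a check on wall-clock time since the last progress. How the real Rosetta watchdog decides is not stated in this thread, so both functions, the limit, and the sample data are assumptions for illustration only.

```python
def cpu_limit_watchdog_fires(cpu_seconds, cpu_limit_s):
    """Hypothetical watchdog that only looks at accumulated CPU time."""
    return cpu_seconds > cpu_limit_s


def stall_watchdog_fires(samples, max_stall_wall_s):
    """Hypothetical watchdog that fires when CPU time has not advanced for too long.

    samples: (wall_seconds, cpu_seconds) observations, oldest first.
    """
    last_progress_wall, last_cpu = samples[0]
    for wall, cpu in samples[1:]:
        if cpu > last_cpu:
            last_cpu = cpu
            last_progress_wall = wall
        elif wall - last_progress_wall > max_stall_wall_s:
            return True
    return False


# Made-up observations: CPU time freezes at 7000 s while the wall clock keeps running.
samples = [(0, 0), (3600, 3500), (7200, 7000), (10800, 7000), (14400, 7000), (18000, 7000)]

print(cpu_limit_watchdog_fires(samples[-1][1], cpu_limit_s=30000))  # False
print(stall_watchdog_fires(samples, max_stall_wall_s=3600))         # True
```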
ID: 47295 · Rating: 0
Alexander
Joined: 29 May 07
Posts: 1
Credit: 119,573
RAC: 0
Message 47313 - Posted: 2 Oct 2007, 10:07:54 UTC

5.80 repeatedly crashed. My computer ID = 519167.

The following is the crash information:

Faulting application rosetta_beta_5.80_windows_intelx86.exe, version 0.0.0.0, faulting module ntdll.dll, version 5.1.2600.2180, fault address 0x00013396.

<?xml version="1.0" encoding="UTF-16"?>
<DATABASE>
<EXE NAME="rosetta_beta_5.80_windows_intelx86.exe" FILTER="GRABMI_FILTER_PRIVACY">
<MATCHING_FILE NAME="rosetta_5.69_windows_intelx86.exe" SIZE="2570240" CHECKSUM="0x57279008" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="08/20/2007 21:12:18" UPTO_LINK_DATE="08/20/2007 21:12:18" />
<MATCHING_FILE NAME="rosetta_beta_5.80_windows_intelx86.exe" SIZE="2575872" CHECKSUM="0xA6936F6C" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="09/12/2007 04:42:46" UPTO_LINK_DATE="09/12/2007 04:42:46" />
</EXE>
<EXE NAME="ntdll.dll" FILTER="GRABMI_FILTER_THISFILEONLY">
<MATCHING_FILE NAME="ntdll.dll" SIZE="708096" CHECKSUM="0x9D20568" BIN_FILE_VERSION="5.1.2600.2180" BIN_PRODUCT_VERSION="5.1.2600.2180" PRODUCT_VERSION="5.1.2600.2180" FILE_DESCRIPTION="NT Layer DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="5.1.2600.2180 (xpsp_sp2_rtm.040803-2158)" ORIGINAL_FILENAME="ntdll.dll" INTERNAL_NAME="ntdll.dll" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERFILEDATEHI="0x0" VERFILEDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0xAF2F7" LINKER_VERSION="0x50001" UPTO_BIN_FILE_VERSION="5.1.2600.2180" UPTO_BIN_PRODUCT_VERSION="5.1.2600.2180" LINK_DATE="08/04/2004 07:56:36" UPTO_LINK_DATE="08/04/2004 07:56:36" VER_LANGUAGE="English (United States) [0x409]" />
</EXE>
<EXE NAME="kernel32.dll" FILTER="GRABMI_FILTER_THISFILEONLY">
<MATCHING_FILE NAME="kernel32.dll" SIZE="984576" CHECKSUM="0xF0B331F6" BIN_FILE_VERSION="5.1.2600.3119" BIN_PRODUCT_VERSION="5.1.2600.3119" PRODUCT_VERSION="5.1.2600.3119" FILE_DESCRIPTION="Windows NT BASE API Client DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="5.1.2600.3119 (xpsp_sp2_gdr.070416-1301)" ORIGINAL_FILENAME="kernel32" INTERNAL_NAME="kernel32" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERFILEDATEHI="0x0" VERFILEDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0xF9293" LINKER_VERSION="0x50001" UPTO_BIN_FILE_VERSION="5.1.2600.3119" UPTO_BIN_PRODUCT_VERSION="5.1.2600.3119" LINK_DATE="04/16/2007 15:52:53" UPTO_LINK_DATE="04/16/2007 15:52:53" VER_LANGUAGE="English (United States) [0x409]" />
</EXE>
</DATABASE>
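
The useful part of that report is which executables it lists and when they were linked. A small sketch that extracts this with Python's standard XML parser follows; the sample is a trimmed copy of the report above (only the NAME, SIZE and LINK_DATE attributes are kept), and the month/day/year date format is inferred from the value "08/20/2007".

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed copy of the crash report above; other attributes are omitted for brevity.
sample = """<DATABASE>
<EXE NAME="rosetta_beta_5.80_windows_intelx86.exe" FILTER="GRABMI_FILTER_PRIVACY">
<MATCHING_FILE NAME="rosetta_5.69_windows_intelx86.exe" SIZE="2570240" LINK_DATE="08/20/2007 21:12:18" />
<MATCHING_FILE NAME="rosetta_beta_5.80_windows_intelx86.exe" SIZE="2575872" LINK_DATE="09/12/2007 04:42:46" />
</EXE>
</DATABASE>"""

root = ET.fromstring(sample)
files = []
for entry in root.iter("MATCHING_FILE"):
    linked = datetime.strptime(entry.get("LINK_DATE"), "%m/%d/%Y %H:%M:%S")
    files.append((entry.get("NAME"), int(entry.get("SIZE")), linked))

for name, size, linked in files:
    print(f"{name}: {size} bytes, linked {linked:%Y-%m-%d}")
```

It shows both the 5.69 and the 5.80 beta executables in the same directory, with the 5.80 build linked on 12 Sep 2007.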

ID: 47313 · Rating: 0
Profile Mac-Nic
Joined: 6 Jul 06
Posts: 7
Credit: 50,523
RAC: 0
Message 47314 - Posted: 2 Oct 2007, 10:17:41 UTC

There seems to be a problem with this unit.
ID: 47314 · Rating: 0
Profile Christoph Jansen
Joined: 6 Jun 06
Posts: 248
Credit: 267,153
RAC: 0
Message 47316 - Posted: 2 Oct 2007, 10:38:07 UTC
Last modified: 2 Oct 2007, 10:38:50 UTC

Now finally BOINC hung when I tried to shut down my computer (I didn't notice that for about an hour). This
caused all of the remaining WUs on the drive to error out, whatever the cause for that may be.

BOINC probably finished a WU and tried to start the next one, but couldn't get any system resources to do so because the shutdown
process was under way. So one after another they all marched straight into oblivion. Poor wretches...
ID: 47316 · Rating: 0
Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 47324 - Posted: 2 Oct 2007, 13:49:39 UTC

This WU 98810686 got stuck after 2:19:44. Quitting and restarting BOINC got things going again. First, Einstein and WCG WUs ran for about 3 1/2 hours before Rosetta started again from the beginning. It ran three times (from 21:46 to 1:52, 7:57 to 11:26, and 14:57 to 21:36) before completing 3 decoys in 36113.17 CPU seconds. My runtime is set for 10 hours. Hope this helps.
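
The figures above can be checked with a little arithmetic. The sketch assumes the three intervals are local clock times and that the first one crosses midnight; `span` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def span(start, end):
    """Wall-clock duration between two HH:MM times, allowing for a midnight rollover."""
    fmt = "%H:%M"
    begin = datetime.strptime(start, fmt)
    finish = datetime.strptime(end, fmt)
    if finish < begin:
        finish += timedelta(days=1)
    return finish - begin


runs = [("21:46", "1:52"), ("7:57", "11:26"), ("14:57", "21:36")]
wall = sum((span(start, end) for start, end in runs), timedelta())

cpu_seconds = 36113.17
print(wall)                          # 14:14:00
print(round(cpu_seconds / 3600, 2))  # 10.03
```

So the reported CPU time is almost exactly the 10-hour runtime preference, accumulated over roughly 14 hours of wall-clock time across the three runs.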
ID: 47324 · Rating: 0
Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 47326 - Posted: 2 Oct 2007, 15:06:39 UTC - in response to Message 47324.  

I meant to add that this is the second 5.80 WU that got stuck but completed successfully after restarting BOINC. I apologize for not noting the number. I dashed off to work as soon as I restarted and forgot about it until this one stuck.
ID: 47326 · Rating: 0
TomaszPawel
Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 47328 - Posted: 2 Oct 2007, 15:52:34 UTC

Errors:

See this:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=98806026

and

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=98840340

What do you think?
ID: 47328 · Rating: 0
Tuc
Joined: 30 Sep 07
Posts: 4
Credit: 1,006
RAC: 0
Message 47344 - Posted: 3 Oct 2007, 2:20:04 UTC

You're gonna LOVE me for this.... :)

BACKGROUND: I'm running rosetta_beta_5.80_i686-pc-linux-gnu under Linux emulation on FreeBSD. (So already I'm sure I've really confused things. ;) )

When I first attached to the project, as it was downloading files, all of a sudden my boinc_client just "disappeared". No core, nothing. Not sure why. Weird... It never did that before.

PROBLEM 1 : It started to run "HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_2121_45549_0".

After a while, the process started to use 0 CPU, so I checked... the stderr.txt had :

No heartbeat from core client for 31 sec - exiting
pure virtual method called
terminate called without an active exception
SIGABRT: abort called
*** glibc detected *** corrupted double-linked list: 0x08f61f98 ***
SIGABRT: abort called


No idea why I'd miss a heartbeat.
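
For context, the first line of that stderr is what a heartbeat timeout looks like: the science application expects a periodic signal from the core client and gives up when none arrives within some limit. The class below is only a sketch of that idea with a hypothetical 30-second limit; it is not BOINC's actual implementation.

```python
import time


class HeartbeatMonitor:
    """Tracks the time since the last heartbeat; the 30 s default is an assumption."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat arrived from the client."""
        self.last_beat = self.clock()

    def seconds_silent(self):
        return self.clock() - self.last_beat

    def expired(self):
        return self.seconds_silent() > self.timeout_s


# Simulated clock so the example runs instantly.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=30.0, clock=lambda: now[0])
now[0] = 10.0
monitor.beat()   # last heartbeat seen at t = 10 s
now[0] = 41.0    # nothing since then
if monitor.expired():
    print(f"No heartbeat from core client for {monitor.seconds_silent():.0f} sec - exiting")
```

Under this model a missed heartbeat can mean the client died, as in the earlier "disappeared" incident, or that either process was stalled long enough for the signal to arrive late.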

I killed the processes.

PROBLEM 2: It restarted that WU and then later on :
No heartbeat from core client for 31 sec - exiting
SIGSEGV: segmentation violation
SIGABRT: abort called
SIGABRT: abort called
(And about 1304 more of these)

It was still taking a lot of CPU, but BOINC Manager didn't show any updates to anything, so I restarted it again....


I'm still trying to get through my first WU under this emulation..

Thanks, Tuc
ID: 47344 · Rating: 0
Profile JChojnacki
Joined: 17 Sep 05
Posts: 71
Credit: 10,747,694
RAC: 1,797
Message 47408 - Posted: 4 Oct 2007, 22:03:31 UTC

Hey,

Got an error today with this work unit:
HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_2121_43593_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=109858641

Thanks,

~Joel

ID: 47408 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 47424 - Posted: 5 Oct 2007, 20:08:27 UTC
Last modified: 5 Oct 2007, 20:11:59 UTC

Hi all: We're trying to track down several sources of error. I'm not sure if anyone's posted about this, but a small number of workunits with the batch number 2056:

mcr1__BOINC_ABRELAX-mcr1_-mfr__2056_

appear to be flawed. I've cancelled the job; you should also feel free to abort these jobs if you see them. There aren't that many. I just fixed the problem and sent out a similar job with ID 2059.
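
For anyone with a long task list, a quick way to spot the affected jobs is to match on the name prefix given above. This is just a sketch: the first two task names in `queue` are made-up examples with hypothetical suffixes, and `should_abort` is a hypothetical helper.

```python
FLAWED_PREFIX = "mcr1__BOINC_ABRELAX-mcr1_-mfr__2056_"


def should_abort(task_name):
    """True if the task belongs to the cancelled job named in this post."""
    return task_name.startswith(FLAWED_PREFIX)


queue = [
    "mcr1__BOINC_ABRELAX-mcr1_-mfr__2056_123_0",  # hypothetical suffix
    "mcr1__BOINC_ABRELAX-mcr1_-mfr__2059_123_0",  # hypothetical suffix, replacement job
    "HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_2121_43593_0",
]

for name in queue:
    print("abort" if should_abort(name) else "keep ", name)
```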

We're looking into a few more issues too.. I've just contacted the people in charge of the other jobs... thanks *very* much for posting!
ID: 47424 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org