Report problems with Rosetta version 5.32

Message boards : Number crunching : Report problems with Rosetta version 5.32

Keith Akins
Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 29637 - Posted: 19 Oct 2006, 14:40:08 UTC

Hm. I got a "Read Access Violation" for:

1hz6A_BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_27001_0.

Seems to be happening once or twice a day.
ID: 29637
MM Sihombing
Joined: 22 May 06
Posts: 15
Credit: 1,424,082
RAC: 0
Message 29670 - Posted: 20 Oct 2006, 1:21:54 UTC

10/17/2006 5:20:50 AM|rosetta@home|Unrecoverable error for result 1n0u__BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_8622_0 ( - exit code 1073807364 (0x40010004))

ID: 29670
Sam Miorelli
Joined: 16 Feb 06
Posts: 7
Credit: 1,303,044
RAC: 0
Message 29719 - Posted: 20 Oct 2006, 20:03:06 UTC

I've just started running Rosetta on an Athlon 64 X2 4200+ (not overclocked), and while the other projects seem to be going OK on it, I've already had a Rosetta WU crash. I get the Windows process dump reporting message when this happens, so I believe it is occurring while the screensaver is running. I had a similar problem on a P4 3 GHz Prescott machine over the summer that eventually resulted in me no longer running Rosetta on it. The exit code from BOINC is below. Does anyone know what caused this error?

10/20/2006 2:14:33 PM|rosetta@home|Unrecoverable error for result 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_BOND_ANGLES_SAVE_ALL_OUT__1273_41187_0 ( - exit code 1073807364 (0x40010004))
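
For anyone trying to read these numbers: BOINC prints the raw Windows exit code, sometimes as a signed 32-bit integer (the -1073741819 seen elsewhere in this thread) and sometimes unsigned (the 1073807364 above). The sketch below is only an illustration of how to normalise them to hex. The helper name `describe_exit_code` is hypothetical, and the two symbolic names are my understanding of the Windows SDK status codes, not something taken from Rosetta or BOINC.

```python
# Hypothetical helper: interpret a BOINC-reported Windows exit code.
# The table covers only the two codes quoted in this thread; the symbolic
# names are an assumption based on the Windows SDK headers.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0x40010004: "DBG_TERMINATE_PROCESS",
}


def describe_exit_code(code: int) -> str:
    """Return the unsigned 32-bit hex form of an exit code, plus a name if known."""
    unsigned = code & 0xFFFFFFFF  # fold signed 32-bit values into 0..2**32-1
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"


print(describe_exit_code(1073807364))
print(describe_exit_code(-1073741819))
```

Masking with `0xFFFFFFFF` is what turns the negative number into the same hex value that BOINC shows in parentheses.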
ID: 29719
Keith Akins
Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 29790 - Posted: 21 Oct 2006, 22:20:54 UTC

DOC_2PTC_pose_u_pert_bbmin_from_short_relax_1290_187_0

<core_client_version>5.4.9</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1931404
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -7.93815 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .ee2PTC.out

</stderr_txt>

1dtj__BOINC_NEWRELAXFLAGS_WOBBLECCD_ABRELAX_SAVE_ALL_OUT__1285_6677_0

<core_client_version>5.4.9</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 2428124
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -1.32255 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .xx1dtj.out

</stderr_txt>
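
When collecting several of these reports, it can help to pull the stuck score and the duration out of the stderr text automatically. This is a minimal sketch that assumes the message format is exactly as quoted above; the function name `parse_watchdog` is made up for illustration.

```python
import re

# Pattern for the watchdog line as it appears in the stderr excerpts above.
STUCK_RE = re.compile(r"Stuck at score (-?\d+(?:\.\d+)?) for (\d+) seconds")


def parse_watchdog(stderr_text: str):
    """Return (score, seconds) if the watchdog message is present, else None."""
    match = STUCK_RE.search(stderr_text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


sample = """# cpu_run_time_pref: 28800
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -7.93815 for 3600 seconds
"""
print(parse_watchdog(sample))
```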
ID: 29790
Buffalo Bill
Joined: 25 Mar 06
Posts: 71
Credit: 1,630,458
RAC: 0
Message 29796 - Posted: 21 Oct 2006, 23:37:48 UTC
Last modified: 21 Oct 2006, 23:38:26 UTC

This one seemed fine at first but has an error and no credit granted.

42988975
<core_client_version>5.4.9</core_client_version>
<stderr_txt>
# random seed: 2052132
# cpu_run_time_pref: 14400
WARNING! error deleting file .aa1t4o.out
======================================================
DONE :: 1 starting structures built 17 (nstruct) times
This process generated 17 decoys from 17 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>


Validate state: Workunit error - check skipped
Claimed credit: 31.7338913214193
Granted credit: 0
Application version: 5.32
ID: 29796
hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,153,876
RAC: 692
Message 29798 - Posted: 22 Oct 2006, 3:21:27 UTC

I don't know if this is BOINC or Rosetta but Rosetta is the only project I'm working on, and it's been crashing like a NASCAR driver for the last few days. Just today I've had 3 crashes. Here's my system:

10/21/2006 8:37:27 AM||Starting BOINC client version 5.4.11 for windows_intelx86
10/21/2006 8:37:27 AM||libcurl/7.15.3 OpenSSL/0.9.8a zlib/1.2.3
10/21/2006 8:37:27 AM||Data directory: C:\Program Files\BOINC
10/21/2006 8:37:27 AM||Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.20GHz
10/21/2006 8:37:27 AM||Memory: 1022.09 MB physical, 2.40 GB virtual
10/21/2006 8:37:27 AM||Disk: 145.27 GB total, 122.85 GB free
10/21/2006 8:37:27 AM|rosetta@home|URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 272841; location: home; project prefs: default
10/21/2006 8:37:27 AM||No general preferences found - using BOINC defaults
10/21/2006 8:37:27 AM||Local control only allowed
10/21/2006 8:37:27 AM||Listening on port 31416

So far I've had 3 crashes just today, here are the BOINC log errors and Windows event log entries:

10/21/2006 11:21:30 AM|rosetta@home|Unrecoverable error for result BENCH_ABRELAX_SAVE_ALL_OUT_4ubpA_BARCODE_R72_filters_1292_701_0 ( - exit code -1073741819 (0xc0000005))

(No Windows error)
=============================================
10/21/2006 1:19:58 PM|rosetta@home|Unrecoverable error for result 1b72__LARS_ABRELAX_PAIR5_BARCODE__1294_672_0 ( - exit code -1073741819 (0xc0000005))

Event Type: Error
Event Source: Application Error
Event Category: None
Event ID: 1001
Date: 10/21/2006
Time: 1:19:57 PM
User: N/A
Computer: KAREN_8400
Description:
Fault bucket 334968245.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 42 75 63 6b 65 74 3a 20 Bucket:
0008: 33 33 34 39 36 38 32 34 33496824
0010: 35 0d 0a 5..

=======================================================

10/21/2006 4:32:35 PM|rosetta@home|Unrecoverable error for result 1r69__BOINC_NEWRELAXFLAGS_DOUBLEFARLXCYCLES_ABRELAX_SAVE_ALL_OUT__1287_6053_0 ( - exit code -1073741819 (0xc0000005))

Event Type: Error
Event Source: Application Error
Event Category: None
Event ID: 1001
Date: 10/21/2006
Time: 4:32:35 PM
User: N/A
Computer: KAREN_8400
Description:
Fault bucket 335025642.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 42 75 63 6b 65 74 3a 20 Bucket:
0008: 33 33 35 30 32 35 36 34 33502564
0010: 32 0d 0a 2..
==========================================================
I also got 2 Windows event log errors for which I have no log entry in BOINC:
============ 1 ============
Event Type: Error
Event Source: Application Error
Event Category: None
Event ID: 1000
Date: 10/21/2006
Time: 12:39:05 PM
User: N/A
Computer: KAREN_8400
Description:
Faulting application rosetta_5.32_windows_intelx86.exe, version 0.0.0.0, faulting module rosetta_5.32_windows_intelx86.exe, version 0.0.0.0, fault address 0x0036d5d2.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 41 70 70 6c 69 63 61 74 Applicat
0008: 69 6f 6e 20 46 61 69 6c ion Fail
0010: 75 72 65 20 20 72 6f 73 ure ros
0018: 65 74 74 61 5f 35 2e 33 etta_5.3
0020: 32 5f 77 69 6e 64 6f 77 2_window
0028: 73 5f 69 6e 74 65 6c 78 s_intelx
0030: 38 36 2e 65 78 65 20 30 86.exe 0
0038: 2e 30 2e 30 2e 30 20 69 .0.0.0 i
0040: 6e 20 72 6f 73 65 74 74 n rosett
0048: 61 5f 35 2e 33 32 5f 77 a_5.32_w
0050: 69 6e 64 6f 77 73 5f 69 indows_i
0058: 6e 74 65 6c 78 38 36 2e ntelx86.
0060: 65 78 65 20 30 2e 30 2e exe 0.0.
0068: 30 2e 30 20 61 74 20 6f 0.0 at o
0070: 66 66 73 65 74 20 30 30 ffset 00
0078: 33 36 64 35 64 32 0d 0a 36d5d2..

=================== 2 ==========================
Event Type: Error
Event Source: Application Error
Event Category: None
Event ID: 1000
Date: 10/21/2006
Time: 3:26:13 PM
User: N/A
Computer: KAREN_8400
Description:
Faulting application rosetta_5.32_windows_intelx86.exe, version 0.0.0.0, faulting module rosetta_5.32_windows_intelx86.exe, version 0.0.0.0, fault address 0x0036cf47.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 41 70 70 6c 69 63 61 74 Applicat
0008: 69 6f 6e 20 46 61 69 6c ion Fail
0010: 75 72 65 20 20 72 6f 73 ure ros
0018: 65 74 74 61 5f 35 2e 33 etta_5.3
0020: 32 5f 77 69 6e 64 6f 77 2_window
0028: 73 5f 69 6e 74 65 6c 78 s_intelx
0030: 38 36 2e 65 78 65 20 30 86.exe 0
0038: 2e 30 2e 30 2e 30 20 69 .0.0.0 i
0040: 6e 20 72 6f 73 65 74 74 n rosett
0048: 61 5f 35 2e 33 32 5f 77 a_5.32_w
0050: 69 6e 64 6f 77 73 5f 69 indows_i
0058: 6e 74 65 6c 78 38 36 2e ntelx86.
0060: 65 78 65 20 30 2e 30 2e exe 0.0.
0068: 30 2e 30 20 61 74 20 6f 0.0 at o
0070: 66 66 73 65 74 20 30 30 ffset 00
0078: 33 36 63 66 34 37 0d 0a 36cf47..
================================================
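
The "Data:" sections in the event log entries above are just hex dumps of a short text message. As a rough sketch (assuming the layout shown: an offset, up to eight hex bytes per row, then an ASCII column), they can be decoded like this. The function name `decode_event_data` is hypothetical.

```python
import re

HEX_BYTE = re.compile(r"^[0-9a-fA-F]{2}$")


def decode_event_data(dump: str, bytes_per_row: int = 8) -> str:
    """Decode the hex column of an Event Viewer 'Data:' dump into text."""
    raw = bytearray()
    for line in dump.splitlines():
        if ":" not in line:
            continue  # skip anything that is not an "offset: bytes" row
        _, _, rest = line.partition(":")
        count = 0
        for token in rest.split():
            # Stop after a full row of bytes, or when the ASCII column starts.
            if count == bytes_per_row or not HEX_BYTE.match(token):
                break
            raw.append(int(token, 16))
            count += 1
    return raw.decode("ascii", errors="replace")


dump = """0000: 42 75 63 6b 65 74 3a 20 Bucket:
0008: 33 33 34 39 36 38 32 34 33496824
0010: 35 0d 0a 5.."""
print(repr(decode_event_data(dump)))
```

For the first dump in this post the decoded text is the same fault bucket number that appears in the Description field, so the Data section adds no extra information.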
I hope this information will help someone debug this. Much of the other error information I've seen in my (incomplete) glance through the threads seems to be flavors of Unix.


--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

ID: 29798
Keith Akins
Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 29877 - Posted: 23 Oct 2006, 15:35:31 UTC

FRA_t380_NEWFLAGS_hom001_4_t380_4_2fhqA_IGNORE_THE_REST_162_1296_154_0

<core_client_version>5.4.9</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1675502
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 4.843 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .aat380.out

</stderr_txt>
ID: 29877
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 29914 - Posted: 24 Oct 2006, 0:51:53 UTC

ID: 29914
Keith Akins
Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 29919 - Posted: 24 Oct 2006, 3:30:45 UTC

Hm. One memory read access violation, two watchdog shutdowns and one without any watchdog or debug info.

Maybe we'll get some answers this week.

At least you got credit for your crashed WUs. I've got crashed WUs going back three to four days without any granted credit.

Maybe the server doesn't like me.

ID: 29919
Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 29953 - Posted: 24 Oct 2006, 18:22:22 UTC

running 5.32

Workunit: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=38429684

It has frozen at 46 minutes of runtime. I suspect this occurred 27 hours ago, as my system idle time is up to 27 hours now.

I have shut BOINC down and restarted. The same work unit is now running again from 0.



ID: 29953
Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 29958 - Posted: 24 Oct 2006, 18:42:06 UTC - in response to Message 29953.  
Last modified: 24 Oct 2006, 18:44:33 UTC

Result errored out after about 12 minutes. The dump is available on the result page.
One interesting thing to note is that when it "hung", it dumped with an error of LoadLibraryA(srcsrv.dll): GetLastError = 126, followed by an access violation. Then it hung.

So, there are 2 dumps in the file.

I would have thought they would be a bit more aggressive on Ralph about taking care of the problems posted in this thread. I don't like these kinds of problems on my production machines, but I do have a machine that runs Ralph to help ferret out these problems.




ID: 29958
Keith Akins
Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 29967 - Posted: 24 Oct 2006, 19:24:28 UTC

Result ID: 43550168

FRA_t369_NEWFLAGS_hom001_4_t369_4_1rxqA_IGNORE_THE_REST_131_1302_9_0

<core_client_version>5.4.9</core_client_version>
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1587362


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0076A524 read attempt to address 0x00000011

ID: 29967
Keith Akins
Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 29976 - Posted: 25 Oct 2006, 0:33:22 UTC

Oh, I've had to reboot five times in the last three days.
ID: 29976
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 29977 - Posted: 25 Oct 2006, 0:35:03 UTC - in response to Message 29958.  

We're of course concerned about some of the reports below of machines that constantly error out. We did check all these workunits on ralph -- the error rates are pretty low there. Even weirder, the error rates here on rosetta@home are pretty low too!

It's possible that the next update (to 5.34) will help; please keep posting if these problems keep occurring. Otherwise, my best advice is to help us out by running on ralph -- and to avoid futzing around too much with the "show graphics" window. There are obviously issues with turning graphics on and off, and the BOINC developers are thinking of ways to fix them.

ID: 29977
Keith Akins
Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 29980 - Posted: 25 Oct 2006, 0:47:39 UTC

OK, I'm biting on the RALPH thing. If I attach, would you prefer me to run XP or Linux?

ID: 29980
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 29981 - Posted: 25 Oct 2006, 2:35:41 UTC

One of the systems having continuous errors lately ended up being a problem with bad ram. Perhaps others having a high failure rate could test out their system with memtest86+ from http://www.memtest.org/. I remember some of the errors that popped up with Distributed Folding actually telling us to test our systems with memtest86 to verify that our memory was okay.




ID: 29981
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 728
Message 29984 - Posted: 25 Oct 2006, 4:13:29 UTC

>> Have been getting lock ups for a couple of weeks now. Originally caused by Ralph then moved across to Rosetta. Locks up on the screen saver and unable to get back to main programme.
Getting this error:-
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 0.682356 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .dd1BVK.out

The stuck score varies,
https://boinc.bakerlab.org/rosetta/result.php?resultid=42889001
https://boinc.bakerlab.org/rosetta/result.php?resultid=42888956
https://boinc.bakerlab.org/rosetta/result.php?resultid=42230633
https://boinc.bakerlab.org/rosetta/result.php?resultid=42230590
https://boinc.bakerlab.org/rosetta/result.php?resultid=42230539
https://boinc.bakerlab.org/rosetta/result.php?resultid=42230545

Have also had compute errors of "exit code -1073741819" and "Unhandled Exception Detected: reason Access Violation (0xc0000005)":
at address 0x0076CF20 read attempt to address 0x00000011 on Result id 42230652
at address 0x0076CE15 read attempt to address 0x000000A on result id 42230637
at address 0x0076D4FD read attempt to address 0x00000017 on result id 42230634
at address 0x0076D514 read attempt to address 0x00000011 on result id 42230565
at address 0x0076D4FD read attempt to address 0x00000017 on result id 42230423
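
A quick way to look for a pattern in these is to tally the addresses. The sketch below parses the five lines exactly as posted and counts how often each read target occurs. It is only an illustration; the observation that every target is a very small address (which is often what a null or near-null pointer dereference looks like) is a general rule of thumb, not a diagnosis of Rosetta itself.

```python
import re
from collections import Counter

# Matches the access-violation lines in the form quoted in this post.
LINE_RE = re.compile(
    r"at address (0x[0-9A-Fa-f]+) read attempt to address (0x[0-9A-Fa-f]+)"
    r" on result id (\d+)",
    re.IGNORECASE,
)

report = """\
at address 0x0076CF20 read attempt to address 0x00000011 on Result id 42230652
at address 0x0076CE15 read attempt to address 0x000000A on result id 42230637
at address 0x0076D4FD read attempt to address 0x00000017 on result id 42230634
at address 0x0076D514 read attempt to address 0x00000011 on result id 42230565
at address 0x0076D4FD read attempt to address 0x00000017 on result id 42230423
"""

# Each entry is (faulting instruction address, address being read, result id).
faults = [(int(pc, 16), int(target, 16), int(rid))
          for pc, target, rid in LINE_RE.findall(report)]

by_target = Counter(target for _, target, _ in faults)
near_null = all(target < 0x1000 for _, target, _ in faults)

for target, n in sorted(by_target.items()):
    print(f"read of 0x{target:08x}: {n} result(s)")
print("all targets below 0x1000:", near_null)
```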

One other curious result that I received was classed as successful but only did 1 decoy (my settings are for 6 hours) and returned 3.10 Cobblestones, this seems a bit low in any book. The result is
https://boinc.bakerlab.org/rosetta/result.php?resultid=42230653.

The lockups often require a reboot to get things going again.
ID: 29984
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 30007 - Posted: 25 Oct 2006, 22:07:43 UTC

Watchdog is set to trigger at 4x pref run time. On a 667MHz box this means that the FRA_... WUs error out when run with a 1hr pref, as two example tasks showed.

Maybe it would be more useful to have the watchdog trigger at a min of (say) 8hrs, or 4x pref whichever is greater?

If a normally running decoy of this series really does need 5 or 6 hours to complete on some boxes, it is not appropriate to have watchdog killing it at somewhere between 4 and 5 hours when it would probably have run OK.
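
To make the arithmetic concrete, here is a small sketch comparing the rule as described in this post (4x the preferred run time) with the suggested alternative (the greater of 8 hours and 4x pref). Both functions are hypothetical illustrations of the suggestion, not the actual Rosetta watchdog code, and the 8 hour floor is just the example value from the text.

```python
def current_watchdog_limit(pref_seconds: int) -> int:
    """Limit as described in the post: 4x the preferred run time."""
    return 4 * pref_seconds


def proposed_watchdog_limit(pref_seconds: int,
                            floor_seconds: int = 8 * 3600) -> int:
    """Suggested limit: 4x pref, but never below a fixed floor (8 hours here)."""
    return max(floor_seconds, 4 * pref_seconds)


# Columns: preferred run time, current limit, proposed limit (all in hours).
for pref_hours in (1, 2, 3, 8):
    pref = pref_hours * 3600
    print(pref_hours,
          current_watchdog_limit(pref) // 3600,
          proposed_watchdog_limit(pref) // 3600)
```

With a 1hr pref the limit would move from 4 hours to 8 hours, which would cover a decoy that needs 5 or 6 hours on a slow box; for prefs of 2hrs and above the two rules give the same limit.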

User workaround

From the user end, the workaround is not to use a pref lower than 2hrs on a box with a clock speed less than (say) 1GHz, and not to use a pref less than 3hrs on a box with a clock speed of 667MHz or less.

Anyone illicitly using a box of less than the Rosetta recommended min of 500MHz should use an even longer preferred run time.

River~~
ID: 30007
TLAF
Joined: 17 Oct 06
Posts: 2
Credit: 2,535,507
RAC: 0
Message 30040 - Posted: 26 Oct 2006, 5:15:47 UTC

With regards to this result:

https://boinc.bakerlab.org/rosetta/result.php?resultid=43474514

I had the graphics window open for about 30 seconds before the WU failed. Now that may be purely coincidence but with no problems on this CPU before (and having never opened the graphics window before) I find that rather unlikely.

Hope that helps.

N.B. This is a repost of https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2473
ID: 30040
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 728
Message 30050 - Posted: 26 Oct 2006, 11:58:11 UTC

>>> The freezing of the Rosetta and Ralph work units is definitely a screensaver problem. I have Rosetta running on 7 machines, and all bar 2 do not have graphics (2 are Linux and 3 are XP installed as services; the other 2 are XP installed with default user settings).
>>> The 2 machines that use Graphics are the only 2 machines to have any problems, whether it be the WU becoming stuck, returning some 'access violation' or 'exit code' with no cause.

These are a few more that stuck then errored out:-
https://boinc.bakerlab.org/rosetta/result.php?resultid=43600036
https://boinc.bakerlab.org/rosetta/result.php?resultid=42883408
https://boinc.bakerlab.org/rosetta/result.php?resultid=42889015
https://boinc.bakerlab.org/rosetta/result.php?resultid=42888958

These 2 had Access Violations:-
at address 0x0076D4FD read attempt to address 0x00000012 on result id 43369182
at address 0x0076D507 read attempt to address 0x00000011 on result id 42889050

And these 2 came up as invalid with no real error, just "exit code 1073807364 (0x40010004)":
https://boinc.bakerlab.org/rosetta/result.php?resultid=42888965
https://boinc.bakerlab.org/rosetta/result.php?resultid=42888964

Hope this can help, as it is limiting my output: when the screen stops doing anything, you find that the CPUs are not doing anything either. On one machine, when this happened on 19/10, the computer then did nothing until I came back from a break on 24/10, so that was 5 days of lost production.
ID: 30050


