Rosetta@home

minirosetta v1.19 bug thread

James Thompson

Joined: Oct 13 05
Posts: 46
ID: 4392
Credit: 186,109
RAC: 0
Message 52876 - Posted 6 May 2008 0:37:02 UTC

We have an updated version of minirosetta v1.19 which should fix some of the stability issues with v1.15. Post minirosetta v1.19 bugs here.
____________

David Emigh

Joined: Mar 13 06
Posts: 158
ID: 65176
Credit: 417,178
RAC: 0
Message 52900 - Posted 7 May 2008 17:55:14 UTC

Here is an access violation error after 68,000+ seconds of CPU time:

Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024

There is a large and detailed debugger message.
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

glaesum

Joined: Oct 16 06
Posts: 21
ID: 120376
Credit: 106,074
RAC: 0
Message 52910 - Posted 8 May 2008 12:54:00 UTC

things must be going pretty well as the thread is so quiet...

good news too with win98 OS - the 1.19 app is running, completing and validating although an error message is still getting thrown up. no idea if this matters or not.

on all three wus completed so far this is the message:

Task ID 161439715
Name score13_hb_envtest62_A_1ctf__3171_14411_0
Workunit 147493846
Received 8 May 2008 11:10:33 UTC
Outcome Success

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 13875.8 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

work unit ID nos are:
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147390671
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147405464
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147493846

radu

Joined: May 7 08
Posts: 4
ID: 257126
Credit: 66,301
RAC: 0
Message 52911 - Posted 8 May 2008 13:22:26 UTC
Last modified: 8 May 2008 13:24:08 UTC

I get a crash when I detach from the project.
I'm not sure if this is a minirosetta bug.
Log messages seem to show that minirosetta was running when the crash occurred.

I'm running Gentoo linux 2.6.24-r7.
boinc-5.10.45

Logs:


08-May-2008 16:07:47 [rosetta@home] Starting task fa_max_dis_9-2vik_-test_2008-5-6_3222_134_0 using minirosetta version 119
08-May-2008 16:09:29 [rosetta@home] Resetting project
08-May-2008 16:09:30 [rosetta@home] Detaching from project
SIGSEGV: segmentation violation
Stack trace (9 frames):
/usr/bin/boinc_client[0x46cbf9]
/lib/libpthread.so.0[0x2aba6d950ed0]
/usr/bin/boinc_client[0x40afec]
/usr/bin/boinc_client[0x43060e]
/usr/bin/boinc_client[0x4310bc]
/usr/bin/boinc_client[0x422319]
/usr/bin/boinc_client[0x4516a4]
/lib/libc.so.6(__libc_start_main+0xf4)[0x2aba6ddfdb74]
/usr/bin/boinc_client(__gxx_personality_v0+0x1b1)[0x4048f9]

Exiting...

Pepo

Joined: Sep 28 05
Posts: 115
ID: 1676
Credit: 101,358
RAC: 0
Message 52913 - Posted 8 May 2008 13:56:23 UTC - in response to Message ID 52911.

I get a crash when I detach from the project.
I'm not sure if this is a minirosetta bug.
Log messages seem to show that minirosetta was running when the crash occurred.

It is quite possible (and logical IMO) that the client forcibly terminates all related processes upon detach. Otherwise it could not clean up client_state.xml, slots/ and projects/.

Peter

radu

Joined: May 7 08
Posts: 4
ID: 257126
Credit: 66,301
RAC: 0
Message 52914 - Posted 8 May 2008 15:32:48 UTC - in response to Message ID 52913.
Last modified: 8 May 2008 15:37:21 UTC

I get a crash when I detach from the project.
I'm not sure if this is a minirosetta bug.
Log messages seem to show that minirosetta was running when the crash occurred.

It is quite possible (and logical IMO) that the client forcibly terminates all related processes upon detach. Otherwise it could not clean up client_state.xml, slots/ and projects/.

Peter

I'm new to BOINC so I don't know how the detach operation is handled.

I don't use the gui manager and boinc_client appears to be the only BOINC related process running:

$ ps -e | grep boinc
6279 ? 00:00:05 boinc_client

Anyway killing related processes should not generate segmentation faults, so it's clearly an error in boinc_client.
I don't know if it has anything to do with minirosetta though.

Pepo

Joined: Sep 28 05
Posts: 115
ID: 1676
Credit: 101,358
RAC: 0
Message 52915 - Posted 8 May 2008 15:42:05 UTC - in response to Message ID 52914.

I get a crash when I detach from the project.
I'm not sure if this is a minirosetta bug.
Log messages seem to show that minirosetta was running when the crash occurred.

It is quite possible (and logical IMO) that the client forcibly terminates all related processes upon detach. Otherwise it could not clean up client_state.xml, slots/ and projects/.

I'm new to BOINC so I don't know how the detach operation is handled.

Anyway killing related processes should not generate segmentation faults, so it's clearly an error in boinc_client.

I'm sorry, you are right. I was thinking of Rosetta crashing and overlooked that it was actually the client that crashed. Of course it should not. (And actually the application should also exit cleanly if asked to by the client.)

I don't know if it has anything to do with minirosetta though.

It should not. Which client, 5.10.45?

Peter

radu

Joined: May 7 08
Posts: 4
ID: 257126
Credit: 66,301
RAC: 0
Message 52916 - Posted 8 May 2008 15:45:30 UTC - in response to Message ID 52915.

It should not. Which client, 5.10.45?

yes, 5.10.45

Rob

Joined: Oct 16 06
Posts: 3
ID: 120303
Credit: 121,375
RAC: 0
Message 52917 - Posted 8 May 2008 18:55:53 UTC

Someone forgot to post the Minirosetta 1.19 details on the version thread.

Alexander Klauer

Joined: Mar 10 08
Posts: 3
ID: 246483
Credit: 110,308
RAC: 0
Message 52933 - Posted 9 May 2008 8:35:22 UTC

Hi, I switched off my computer yesterday, in the middle (maybe 60%) of a task. When I switched it back on today, I got

Fri 09 May 2008 09:51:30 AM CEST|rosetta@home|URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 762923; location: (none); project prefs: default
Fri 09 May 2008 09:51:31 AM CEST|rosetta@home|Restarting task fa_max_dis_9-1ptq_-test_2008-5-6_3222_268_0 using minirosetta version 119
Fri 09 May 2008 09:52:00 AM CEST|rosetta@home|Computation for task fa_max_dis_9-1ptq_-test_2008-5-6_3222_268_0 finished
Fri 09 May 2008 09:52:01 AM CEST|rosetta@home|Starting lambda_repressor_folding_3191_8370_0
Fri 09 May 2008 09:52:01 AM CEST|rosetta@home|Starting task lambda_repressor_folding_3191_8370_0 using rosetta_beta version 596
Fri 09 May 2008 09:52:03 AM CEST|rosetta@home|Started upload of fa_max_dis_9-1ptq_-test_2008-5-6_3222_268_0_0
Fri 09 May 2008 09:52:14 AM CEST|rosetta@home|Finished upload of fa_max_dis_9-1ptq_-test_2008-5-6_3222_268_0_0

so the task finished virtually immediately after restart.

When I switched my computer on yesterday morning, I also had some task crunching at 0%. Back then I believed an old task had been restarted from the beginning due to some fluke, but now it seems more likely that the same thing as today has happened. To me, it seems too much of a coincidence of a task interrupted in the middle being finished immediately after resume, twice in a row.

Betting Slip

Joined: Sep 26 05
Posts: 71
ID: 1160
Credit: 5,702,246
RAC: 0
Message 52937 - Posted 9 May 2008 11:54:47 UTC - in response to Message ID 52910.

Really

All access violations


http://boinc.bakerlab.org/rosetta/result.php?resultid=161740698

http://boinc.bakerlab.org/rosetta/result.php?resultid=160201341

http://boinc.bakerlab.org/rosetta/result.php?resultid=159794241

http://boinc.bakerlab.org/rosetta/result.php?resultid=160129454

http://boinc.bakerlab.org/rosetta/result.php?resultid=160185394

http://boinc.bakerlab.org/rosetta/result.php?resultid=161332559

http://boinc.bakerlab.org/rosetta/result.php?resultid=159408171
____________

Rom Walton (BOINC)
Forum moderator
Project administrator
Project developer

Joined: Sep 17 05
Posts: 18
ID: 84
Credit: 40,071
RAC: 0
Message 52961 - Posted 10 May 2008 4:10:29 UTC - in response to Message ID 52937.
Last modified: 10 May 2008 4:10:59 UTC


All access violations

http://boinc.bakerlab.org/rosetta/result.php?resultid=161740698
http://boinc.bakerlab.org/rosetta/result.php?resultid=160201341
http://boinc.bakerlab.org/rosetta/result.php?resultid=159794241
http://boinc.bakerlab.org/rosetta/result.php?resultid=160129454
http://boinc.bakerlab.org/rosetta/result.php?resultid=160185394
http://boinc.bakerlab.org/rosetta/result.php?resultid=161332559
http://boinc.bakerlab.org/rosetta/result.php?resultid=159408171


All those crashes are a result of an out of memory error.
____________
----- Rom
My Blog

Ian_D

Joined: Sep 21 05
Posts: 55
ID: 757
Credit: 4,216,173
RAC: 0
Message 52967 - Posted 10 May 2008 6:48:21 UTC
Last modified: 10 May 2008 6:48:45 UTC

My latest weirdness

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Maximum memory exceeded
</message>
]]>

resultid=161607307
____________


Quidgydog

Joined: Sep 28 06
Posts: 3
ID: 115109
Credit: 499,462
RAC: 0
Message 52969 - Posted 10 May 2008 8:22:42 UTC
Last modified: 10 May 2008 8:24:56 UTC

Having exactly the same issue as I was having with the v1.15 WU. WU just sits there, CPU time not running, no progress.

Log file......


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C82A714 read attempt to address 0x00D767E5

Engaging BOINC Windows Runtime Debugger...


I'm detaching this computer until this is resolved.
____________

Betting Slip

Joined: Sep 26 05
Posts: 71
ID: 1160
Credit: 5,702,246
RAC: 0
Message 52970 - Posted 10 May 2008 9:49:08 UTC - in response to Message ID 52961.


All access violations

http://boinc.bakerlab.org/rosetta/result.php?resultid=161740698
http://boinc.bakerlab.org/rosetta/result.php?resultid=160201341
http://boinc.bakerlab.org/rosetta/result.php?resultid=159794241
http://boinc.bakerlab.org/rosetta/result.php?resultid=160129454
http://boinc.bakerlab.org/rosetta/result.php?resultid=160185394
http://boinc.bakerlab.org/rosetta/result.php?resultid=161332559
http://boinc.bakerlab.org/rosetta/result.php?resultid=159408171


All those crashes are a result of an out of memory error.



With 4Gb of memory what do I do to put it right?
____________

Pepo

Joined: Sep 28 05
Posts: 115
ID: 1676
Credit: 101,358
RAC: 0
Message 52972 - Posted 10 May 2008 10:09:31 UTC - in response to Message ID 52970.
Last modified: 10 May 2008 10:10:08 UTC

All those crashes are a result of an out of memory error.

With 4Gb of memory what do I do to put it right?

Some day you could run out of memory even with 64 GB of RAM... (Do you know the saying about 64 KB of RAM?)

How much pagefile do you have available there? Any other memory load? Like other projects' applications, preempted and waiting in memory? Take an occasional look at Task Manager's Performance tab: what are the Commit Charge values like? If the Total (or Peak) ever reaches the Limit, that's it. You're running at least 7 projects on the host, each Rosetta task can require up to 600-900 MB, CPDN at least some 200-300 MB, other projects need something as well, and it is a quad...

Peter

Betting Slip

Joined: Sep 26 05
Posts: 71
ID: 1160
Credit: 5,702,246
RAC: 0
Message 52973 - Posted 10 May 2008 10:30:52 UTC - in response to Message ID 52972.

All those crashes are a result of an out of memory error.

With 4Gb of memory what do I do to put it right?

Some day you could run out of memory even with 64 GB of RAM... (Do you know the saying about 64 KB of RAM?)

How much pagefile do you have available there? Any other memory load? Like other projects' applications, preempted and waiting in memory? Take an occasional look at Task Manager's Performance tab: what are the Commit Charge values like? If the Total (or Peak) ever reaches the Limit, that's it. You're running at least 7 projects on the host, each Rosetta task can require up to 600-900 MB, CPDN at least some 200-300 MB, other projects need something as well, and it is a quad...

Peter



Yes, I understand, but my commit charge is a fraction of my available charge, 10% at the moment. I have increased my page file to 6GB, with a total memory of 4GB, on Win XP Pro 64.

It just strikes me that the very knowledgeable Rom is arrogant enough to point to the cause without indicating any sort of a solution.
____________

alpha

Joined: Nov 4 06
Posts: 27
ID: 127202
Credit: 953,255
RAC: 781
Message 52974 - Posted 10 May 2008 13:58:09 UTC

This work unit finished earlier than expected, but with no errors:

http://boinc.bakerlab.org/rosetta/result.php?resultid=161362748

Claimed 130.48, granted 32.86. :(
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 52977 - Posted 10 May 2008 15:04:09 UTC

Fat Loss, I'm guessing that the error is an indication that the task grew to exceed the maximum memory it was configured for, and so was terminated by BOINC. And so, regardless of your machine's physical configuration or % memory used to BOINC etc. etc. it still would have failed. So that would tend to indicate a logic problem in Mini, or perhaps a task that should be created with a higher memory maximum allowed.

We'll have to wait to see what DK finds.
____________
Rosetta Moderator: Mod.Sense

Rom Walton (BOINC)
Forum moderator
Project administrator
Project developer

Joined: Sep 17 05
Posts: 18
ID: 84
Credit: 40,071
RAC: 0
Message 52979 - Posted 10 May 2008 15:34:13 UTC - in response to Message ID 52973.


It just strikes me that the very knowledgeable Rom is arrogant enough to point to the cause without indicating any sort of a solution.


In this particular case there isn't anything that any of us can do, I've passed the info on to the MiniRosetta devs. Basically MiniRosetta is a 32-bit process, and generally 32-bit processes are limited to 2GB of user-mode memory. MiniRosetta hit that limit and so when it asked for more the OS said NO, leading to the crash.

The sign that this sort of problem has occurred is:
LoadLibraryA( dbghelp.dll ): GetLastError = 8

and
- Virtual Memory Usage -
VirtualSize: 2127511552, PeakVirtualSize: 2127511552


Sorry for not explaining the situation sooner, I was heading for bed and I started thinking about how I was going to help the devs debug this problem in the wild if they are unable to reproduce this issue in the lab.

At present there isn't anything in the BOINC application framework that'll help them debug this in the wild.
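
For anyone checking the numbers above, here is a minimal sketch of the arithmetic. It assumes the usual default of a 2 GiB user-mode address space for a 32-bit Windows process; `GetLastError = 8` is the Windows code ERROR_NOT_ENOUGH_MEMORY. The variable names are illustrative only.

```python
# Default user-mode address space for a 32-bit Windows process (assumed: 2 GiB).
USER_MODE_LIMIT = 2 * 1024**3

# PeakVirtualSize as quoted in the debugger output above.
peak_virtual_size = 2127511552

# How far the process was from the ceiling when it crashed.
headroom = USER_MODE_LIMIT - peak_virtual_size

print(f"limit:    {USER_MODE_LIMIT:,} bytes")
print(f"peak:     {peak_virtual_size:,} bytes "
      f"({peak_virtual_size / USER_MODE_LIMIT:.1%} of limit)")
print(f"headroom: {headroom / 1024**2:.1f} MiB")
```

The reported peak is within roughly 19 MiB of the 2 GiB ceiling, so a further allocation of any real size would be refused.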



____________
----- Rom
My Blog

David Emigh

Joined: Mar 13 06
Posts: 158
ID: 65176
Credit: 417,178
RAC: 0
Message 52982 - Posted 11 May 2008 2:05:18 UTC

I've got a "Compute Error" on 76,000+ seconds of CPU time.

Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024

There is a large and detailed debugger report.
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

Betting Slip

Joined: Sep 26 05
Posts: 71
ID: 1160
Credit: 5,702,246
RAC: 0
Message 52985 - Posted 11 May 2008 11:27:02 UTC - in response to Message ID 52979.


It just strikes me that the very knowledgeable Rom is arrogant enough to point to the cause without indicating any sort of a solution.


In this particular case there isn't anything that any of us can do, I've passed the info on to the MiniRosetta devs. Basically MiniRosetta is a 32-bit process, and generally 32-bit processes are limited to 2GB of user-mode memory. MiniRosetta hit that limit and so when it asked for more the OS said NO, leading to the crash.

The sign that this sort of problem has occurred is:
LoadLibraryA( dbghelp.dll ): GetLastError = 8

and
- Virtual Memory Usage -
VirtualSize: 2127511552, PeakVirtualSize: 2127511552


Sorry for not explaining the situation sooner, I was heading for bed and I started thinking about how I was going to help the devs debug this problem in the wild if they are unable to reproduce this issue in the lab.

At present there isn't anything in the BOINC application framework that'll help them debug this in the wild.





Thanks Rom, and sorry for being a bit short with you. I sometimes wonder where all this irritability comes from.

I sometimes long for a slower pace of life LOL
____________

M.L.

Joined: Nov 21 06
Posts: 182
ID: 130574
Credit: 180,462
RAC: 0
Message 52988 - Posted 11 May 2008 15:22:43 UTC
Last modified: 11 May 2008 15:24:52 UTC

Task ID 162368970
Name SSPAIR_MIN_ABINITIO_1fna_3115_6915_2
Workunit 145958649
Created 10 May 2008 22:29:48 UTC
Sent 10 May 2008 22:30:26 UTC
Received 11 May 2008 15:18:04 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 735230
Report deadline 20 May 2008 22:30:26 UTC
CPU time 0
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR: Option matching -fudge not found in command line top-level context

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 0
Granted credit 0
application version 1.19

Fudge is gooood, except in this case.

BobCat13

Joined: Jun 18 06
Posts: 3
ID: 95755
Credit: 126,838
RAC: 2
Message 52989 - Posted 11 May 2008 16:04:36 UTC - in response to Message ID 52979.

Rom Walton wrote:
In this particular case there isn't anything that any of us can do, I've passed the info on to the MiniRosetta devs. Basically MiniRosetta is a 32-bit process, and generally 32-bit processes are limited to 2GB of user-mode memory. MiniRosetta hit that limit and so when it asked for more the OS said NO, leading to the crash.


I just errored out with this same problem:

http://boinc.bakerlab.org/rosetta/result.php?resultid=162306731

Watching Process Explorer, the MiniRosetta application was constantly grabbing more memory. Both physical and virtual were increasing throughout the task's run. My preference was set to 24 hours, but it only made it to ~15.5 hours before reaching the 2GB limit. I then changed preferences to 2 hours and a task finished properly. It appears I will have to set preferences at 12 hours on this machine to avoid the 2GB limit.

For people running a Core2 at 3GHz or higher, you may want to try setting preferences at 8 hours or less to see if that helps.
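
A rough way to pick a run-time preference from one observed failure is sketched below. It assumes memory grows linearly from about zero over the run, which the Process Explorer observation suggests but does not prove; the function name and constants are illustrative, and the growth rate is specific to this one machine.

```python
LIMIT_GB = 2.0          # 32-bit user-mode limit discussed in this thread
HOURS_TO_LIMIT = 15.5   # observed: the 24-hour task failed after ~15.5 hours

# Assumption: linear growth starting near zero.
GROWTH_GB_PER_HOUR = LIMIT_GB / HOURS_TO_LIMIT


def estimated_peak_gb(run_hours: float) -> float:
    """Estimated memory use after run_hours under the linear assumption."""
    return GROWTH_GB_PER_HOUR * run_hours


for pref_hours in (2, 8, 12, 24):
    peak = estimated_peak_gb(pref_hours)
    status = "ok" if peak < LIMIT_GB else "exceeds limit"
    print(f"{pref_hours:>2} h -> {peak:.2f} GB ({status})")
```

Under that assumption a 12-hour preference leaves some margin on this host, while a faster CPU would reach the limit sooner and need a shorter setting.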
____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 52991 - Posted 11 May 2008 17:30:44 UTC

I just had a couple of mtlr_test2_S.00000001.*_3238_1 WUs error out.

http://boinc.bakerlab.org/rosetta/result.php?resultid=161862548
http://boinc.bakerlab.org/rosetta/result.php?resultid=161862513

In both cases the WU ran the normal length of time (16 hr), then printed a bunch of:

can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
...

lines to stderr. The WUs ended up being marked "invalid".

These WUs were on separate machines, both running Linux.

David Emigh

Joined: Mar 13 06
Posts: 158
ID: 65176
Credit: 417,178
RAC: 0
Message 52993 - Posted 11 May 2008 19:20:44 UTC

A second "Compute Error", this one on 85,000+ seconds of CPU time:

Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024

There is a large and detailed debugger message.

This error, and the one I reported earlier in this thread, have the same signature as the errors I was getting with mini 1.15, errors which crippled two stable and reliable crunchers until I discovered a workaround.

The only difference now is that the mini 1.19 workunits take about twice as long to crash, resulting in twice as much wasted CPU time...
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

radu

Joined: May 7 08
Posts: 4
ID: 257126
Credit: 66,301
RAC: 0
Message 52999 - Posted 12 May 2008 4:07:10 UTC
Last modified: 12 May 2008 4:08:43 UTC

More segfaults, on Linux running the 5.10.45 client.

Apparently there was a problem with the network connection and the client kept trying to reconnect.

All tasks whose results were about to be sent were marked with "compute error", for example: http://boinc.bakerlab.org/rosetta/result.php?resultid=162637592

I hope this helps.

Output of dmesg:


tg3: eth0: Link is down.
minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
minirosetta_1.1[1258]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
Clocksource tsc unstable (delta = -116217092 ns)
minirosetta_1.1[1348]: segfault at ff7fbff8 rip 881dcb0 rsp ff7fbed8 error 6
rosetta_beta_5.[1353]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1363]: segfault at ff7fbff8 rip 881dcb0 rsp ff7fbed8 error 6
rosetta_beta_5.[1367]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
minirosetta_1.1[1375]: segfault at ff1fbff8 rip 881dcb0 rsp ff1fbed8 error 6
rosetta_beta_5.[1379]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
minirosetta_1.1[1390]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1395]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1411]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1407]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1422]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
rosetta_beta_5.[1426]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1434]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1440]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1449]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1454]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
rosetta_beta_5.[1470]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
rosetta_beta_5.[1463]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1486]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1481]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
minirosetta_1.1[1498]: segfault at ff1fbff8 rip 881dcb0 rsp ff1fbed8 error 6
rosetta_beta_5.[1503]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1509]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
rosetta_beta_5.[1514]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
minirosetta_1.1[1520]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1526]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1537]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
rosetta_beta_5.[1543]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
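
A long dmesg dump like this is easier to read once tallied. The sketch below groups segfault lines by process name and instruction pointer (`rip`); the regular expression is written for the line format shown above and is an assumption about other kernels' output. In the log above, every minirosetta entry shares rip 881dcb0 and every rosetta_beta entry shares rip 8e8fe90.

```python
import re
from collections import Counter

# Matches lines such as:
#   minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
SEGFAULT = re.compile(
    r"^(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)$"
)


def tally_segfaults(lines):
    """Count segfaults per (process name, instruction pointer)."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT.match(line.strip())
        if match:
            counts[(match["proc"], match["rip"])] += 1
    return counts


# A few lines copied from the dmesg output above.
sample = [
    "tg3: eth0: Link is down.",
    "minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6",
    "minirosetta_1.1[1258]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6",
    "Clocksource tsc unstable (delta = -116217092 ns)",
    "rosetta_beta_5.[1353]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6",
]

for (proc, rip), count in tally_segfaults(sample).items():
    print(f"{proc} rip={rip}: {count}")
```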

David Emigh

Joined: Mar 13 06
Posts: 158
ID: 65176
Credit: 417,178
RAC: 0
Message 53001 - Posted 12 May 2008 4:59:17 UTC

A third "Compute Error", this one on 73,000+ seconds of CPU time.

The system cannot find the path specified. (0x3) - exit code 3 (0x3)
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

Jipsu

Joined: Jan 27 08
Posts: 10
ID: 238496
Credit: 454,555
RAC: 0
Message 53004 - Posted 12 May 2008 8:30:32 UTC

Had three WUs fail because they exceeded the 2GB memory limit.

WU1
WU2
WU3

For some reason this seems to be a problem with the Windows version of minirosetta. On my Linux server the memory usage peak seems to be around 150MB, and both computers have the same runtime preferences.

Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 0
Message 53005 - Posted 12 May 2008 10:55:36 UTC
Last modified: 12 May 2008 10:56:29 UTC

Hello all,
Running Ubuntu 7.10 x86 this Task ID: 162048556 has Outcome = Success, but a double message:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 13761.5 cpu seconds
This process generated 4 decoys from 4 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 16847.8 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

From Boinc I got this message:
ma 12 mei 2008 01:33:52 CEST|rosetta@home|Task 1bkrA_BOINC_ABRELAX_IGNORE_THE_REST-S25-10-S3-11--1bkrA-_3181_3_1
exited with zero status but no 'finished' file
ma 12 mei 2008 01:33:52 CEST|rosetta@home|If this happens repeatedly you may need to reset the project.

Its total runtime was 16848.04 seconds.

This WU errored before, running on Windows XP as Invalid.

Have a nice day,
Path7.

BitSpit

Joined: Nov 5 05
Posts: 33
ID: 9581
Credit: 4,147,344
RAC: 0
Message 53006 - Posted 12 May 2008 11:32:20 UTC - in response to Message ID 52999.

More segfaults,on linux running 5.10.45 client.

Apparently there was a problem with the network connection and the client kept trying to reconnect.


That's usually caused by a known, unfixed BOINC flaw, not Rosetta. When BOINC is resolving a domain name, it blocks all other communication, including with running tasks. If that continues past 30 seconds, things start failing/crashing. The only known workaround is changing the DNS timeout. That's done in resolv.conf (usually located at /etc/resolv.conf) by adding the line

options timeout:2

That makes each attempt 2 seconds, with the default of 2 retries per DNS server. You can play with the options some based on your number of DNS servers, but try not to go over 25 seconds.
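
To see how the settings add up, here is a rough upper-bound estimate. It assumes every attempt on every server waits the full timeout, which is a simplification (the real resolver's retry schedule may differ); the function name and the 25-second budget constant are illustrative.

```python
def worst_case_dns_wait(timeout_s: int, attempts: int, nameservers: int) -> int:
    """Upper bound in seconds, assuming each attempt waits the full timeout."""
    return timeout_s * attempts * nameservers


BUDGET_S = 25  # stay under this, per the advice above

# timeout:2 with 2 attempts per server, for 1 to 3 configured DNS servers.
for servers in (1, 2, 3):
    wait = worst_case_dns_wait(timeout_s=2, attempts=2, nameservers=servers)
    print(f"{servers} server(s): {wait} s, within budget: {wait <= BUDGET_S}")
```

With three servers this comes to 12 seconds. By the same estimate, a 5-second timeout with three servers would reach 30 seconds, which is over the budget.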

glaesum

Joined: Oct 16 06
Posts: 21
ID: 120376
Credit: 106,074
RAC: 0
Message 53007 - Posted 12 May 2008 12:04:12 UTC - in response to Message ID 52991.
Last modified: 12 May 2008 12:07:33 UTC

I got a similar err msg as AMD_is_logical except the task succeeded and validated even on top of the error reporting with every work unit done on win98 os. note, the 'psipred' line only occurred three times = no. of decoys, hmmmm??
http://boinc.bakerlab.org/rosetta/result.php?resultid=162095619

Received 11 May 2008 2:20:03 UTC
<core_client_version>5.10.30</core_client_version>
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
WARNING: Override of option -out:nstruct sets a different value
can not open psipred_ss2 file tt
# cpu_run_time_pref: 14400
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
======================================================
DONE :: 1 starting structures 10977 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>

(this work unit had been through boinc v.6.2 with another user where it failed to validate)
]]>

I just had a couple of mtlr_test2_S.00000001.*_3238_1 WUs error out.

http://boinc.bakerlab.org/rosetta/result.php?resultid=161862548
http://boinc.bakerlab.org/rosetta/result.php?resultid=161862513

In both cases the WU ran the normal length of time (16 hr), then printed a bunch of:

can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
...

lines to stderr. The WUs ended up being marked "invalid".

These WUs were on separate machines, both running Linux.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 53010 - Posted 12 May 2008 13:30:03 UTC

Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?
____________
Rosetta Moderator: Mod.Sense

Alan Roberts

Joined: Jun 7 06
Posts: 61
ID: 93009
Credit: 6,031,648
RAC: 2,884
Message 53015 - Posted 12 May 2008 18:24:19 UTC

I've got a Mini 1.19 work unit with a duration of 13:27:06 (machine is set for 14hr target) that has consumed 33:17:05, with a progress of 0.000%.

Is this a stuck job that should be aborted, or should I let it grind on in hopes of producing something? Thanks.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 53016 - Posted 12 May 2008 18:52:43 UTC

Is this a stuck job that should be aborted, or should I let it grind on in hopes of producing something?


I'd suggest you suspend it and resume it again and if progress % doesn't change within 5min of going back to a "running" status, I'd abort it.
____________
Rosetta Moderator: Mod.Sense

MacDitch

Joined: Aug 1 06
Posts: 10
ID: 102759
Credit: 206,444
RAC: 0
Message 53017 - Posted 12 May 2008 19:10:02 UTC

This computer errors on every Rosetta Mini work unit it gets - immediately!
WU 1, WU 2, WU 3 & WU 4

I've literally just done this WU 5 and the messages in the manager were:

12/05/2008 18:09:02|rosetta@home|Starting fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0
12/05/2008 18:09:02|rosetta@home|Starting task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 using minirosetta version 119
12/05/2008 18:09:04|rosetta@home|Computation for task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 finished
12/05/2008 18:09:04|rosetta@home|Output file fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0_0 for task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 absent


Note: The computer happily crunches on ~15 projects, has had no changes in weeks and does Rosetta Beta without problems... :?

Any ideas out there?
____________

BitSpit
Avatar

Joined: Nov 5 05
Posts: 33
ID: 9581
Credit: 4,147,344
RAC: 0
Message 53020 - Posted 12 May 2008 20:51:19 UTC - in response to Message ID 53010.

Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?


Yes, that usually is the cause of it. I don't know if there's an official bug report on it. I do know it's a question that shows up in the BOINC forums every few months where the explanation is given and they claim it would be too much effort to fix.

Ingleside

Joined: Sep 25 05
Posts: 105
ID: 986
Credit: 186,681
RAC: 0
Message 53021 - Posted 12 May 2008 21:11:41 UTC - in response to Message ID 53010.

Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?

Not sure if there's a trac ticket for this, but waiting for a DNS lookup is definitely one of the reasons for "no heartbeat". If I remember correctly, I've seen this problem on win2k: one time when the internet connection went down it was spitting out "no heartbeat" all the time, while another machine running win2003 just continued crunching even though it couldn't do the DNS lookup...

Not sure, but during very heavy disk usage it's likely also possible to get a "no heartbeat".

And, at least in my experience, each and every time any of the DVD players makes a nasty noise before spitting out "read-error", I'm getting a "no heartbeat" in BOINC...


____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

Alan Roberts

Joined: Jun 7 06
Posts: 61
ID: 93009
Credit: 6,031,648
RAC: 2,884
Message 53023 - Posted 12 May 2008 22:15:11 UTC

Mod.Sense, thanks for the suggestion. It doesn't seem to have fixed anything for this machine, but it does produce an interesting result.

According to BOINC Manager, this task is now suspended, and I see no CPU time accumulation within BOINC Manager. According to Windows Task Manager, minirosetta_1.1 is still grinding along, consuming CPU. When I resume the task the CPU time display within BOINC Manager catches up with what Windows Task Manager reports.

I'm off to see if there is a later version of BOINC, but this work unit is looking like an abort.
____________

senatoralex85

Joined: Sep 27 05
Posts: 66
ID: 1329
Credit: 169,644
RAC: 0
Message 53025 - Posted 12 May 2008 22:57:04 UTC - in response to Message ID 53021.
Last modified: 12 May 2008 22:58:12 UTC

Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?

Not sure if there's a trac for this, but waiting for a DNS lookup is definitely one of the reasons for "no heartbeat". If I'm not misremembering, I've seen this problem on win2k: one time when the internet connection went down it was spitting out "no heartbeat" all the time, while another machine running win2003 just continued crunching even though it couldn't do the DNS lookup...

Not sure, but during very heavy disk usage it's likely also possible to get a "no heartbeat".

And, at least in my experience, each and every time any of the DVD players makes a nasty noise before spitting out "read-error", I'm getting a "no heartbeat" in BOINC...



David Baker has gotten this error on his own laptop.

See here

http://boinc.bakerlab.org/rosetta/result.php?resultid=161624205

stderr out <core_client_version>5.4.9</core_client_version>
<stderr_txt>
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
FILE_LOCK::unlock(): close failed.: Bad file descriptor
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 131.476 cpu seconds
This process generated 0 decoys from 0 attempts

**Edit** Added Error Log results!
____________

Ingleside

Joined: Sep 25 05
Posts: 105
ID: 986
Credit: 186,681
RAC: 0
Message 53028 - Posted 13 May 2008 0:07:27 UTC - in response to Message ID 53020.

Yes, that usually is the cause of it. I don't know if there's an official bug report on it. I do know it's a question that shows up in the BOINC forums every few months where the explanation is given and they claim it would be too much effort to fix.

Adding a little more, there are at least 2 open Trac tickets about this, #113 and #336.


____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

Rom Walton (BOINC)
Forum moderator
Project administrator
Project developer

Joined: Sep 17 05
Posts: 18
ID: 84
Credit: 40,071
RAC: 0
Message 53029 - Posted 13 May 2008 4:33:35 UTC

I'll throw in a bit more about the no heartbeat message.

At least once per release cycle we try to resolve this issue; so far the attempts to resolve it have led to crashes within the core client.

DNS resolution is done through libcurl, and using either libcurl's native async-dns solution or the c-ares library hasn't resolved the issue. We haven't found a way to reproduce this issue in a lab environment, and so we haven't been able to give the libcurl guys enough information to get it fixed.

So until we can get more info to the libcurl guys who can then fix it, the no heartbeat message is better than a crash.
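
To illustrate why a stalled DNS lookup can surface as a "no heartbeat" message, here is a minimal sketch. It is not BOINC's actual code. It assumes a single-threaded client loop that sends one heartbeat per pass, and an application that gives up once the gap between heartbeats exceeds 30 seconds (the figure from the stderr messages quoted earlier in this thread). The function name and the loop model are hypothetical.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; from "No heartbeat from core client for 30 sec"


def first_missed_heartbeat(pass_durations, timeout=HEARTBEAT_TIMEOUT):
    """Return the index of the first loop pass that starves the application.

    Hypothetical model: the client sends a heartbeat at the start of every
    pass, so the gap between two heartbeats equals the duration of the pass
    in between. A pass that blocks (for example on a DNS lookup) for longer
    than `timeout` makes the application give up.
    """
    for index, duration in enumerate(pass_durations):
        if duration > timeout:
            return index
    return None


# Three quick passes, then one pass stuck for 45 s in a blocking lookup.
print(first_missed_heartbeat([1.0, 1.0, 1.0, 45.0, 1.0]))  # 3
# Same workload with the lookup finishing in 5 s: no heartbeat is missed.
print(first_missed_heartbeat([1.0, 1.0, 1.0, 5.0, 1.0]))   # None
```

In this model the only cure is to keep any single pass short, which is why moving the lookup to an asynchronous resolver is the obvious thing to try.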

____________
----- Rom
My Blog

Betting Slip

Joined: Sep 26 05
Posts: 71
ID: 1160
Credit: 5,702,246
RAC: 0
Message 53032 - Posted 13 May 2008 11:37:03 UTC

Finally got one to finish http://boinc.bakerlab.org/rosetta/workunit.php?wuid=148643026

It consumed 1,063MB of memory and similar VM. This was on a 12hr run.

Bet if I had rebooted it would have failed.

I watched the last 5% in task manager. The time to completion stopped at 9 mins 59 secs and the WU finished at 96.6%.

I then got a lot less credit than requested LOL

Hope the result was worth it. :)
____________

[KWSN]John Galt 007 Profile
Avatar

Joined: Aug 4 06
Posts: 6
ID: 103245
Credit: 1,012,507
RAC: 0
Message 53035 - Posted 13 May 2008 14:55:53 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=145523230

Client errors on 2 machines, one of which is mine. 0.00 seconds, so no time lost.
____________

RiverboatSam

Joined: Dec 9 05
Posts: 1
ID: 33514
Credit: 59,080
RAC: 0
Message 53037 - Posted 13 May 2008 16:58:46 UTC

Suddenly this morning, Rosetta is using all my CPU resources. I am having to kill it in order to do any work. I need to figure out how to leave the project - I cannot have this happening.
____________

glaesum

Joined: Oct 16 06
Posts: 21
ID: 120376
Credit: 106,074
RAC: 0
Message 53038 - Posted 13 May 2008 17:03:35 UTC

error #161 (whatever that is)

finally a wu failed, that's on top of the usual non-fatal 120 error:
resultid=162869266

<core_client_version>5.10.30</core_client_version>
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
# cpu_run_time_pref: 14400
:
BOINC :: Watchdog shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>rb_05_12_11631_20348_T0397_IGNORE_THE_REST_10_16_3247_49_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
]]>

David Emigh Profile

Joined: Mar 13 06
Posts: 158
ID: 65176
Credit: 417,178
RAC: 0
Message 53040 - Posted 13 May 2008 18:51:06 UTC
Last modified: 13 May 2008 18:55:46 UTC

Error number 4, at 77,900+ CPU seconds.

Reason: Access Violation (0xc0000005) at address 0x005C1E7C write attempt to address 0x00000024

Large and detailed debugger report available at the link, if anyone is reading those things at this point.

The host that received the above error is 1/4 on mini 1.19 tasks that have a runtime preference in excess of 12 hours, but is 8/8 on mini 1.19 tasks with a runtime preference of 12 hours or less.
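
For anyone collecting these reports, the "Reason:" line can be picked apart mechanically. The following is a rough sketch that only assumes the line format quoted in this thread; the function name, the returned field names, and the 0x1000 "near null" threshold are illustrative assumptions, not part of BOINC or Rosetta. A write to a very small address such as 0x00000024 is typical of code accessing a field through a null pointer.

```python
import re

# Matches the "Reason:" line format as quoted in this thread.
REASON_RE = re.compile(
    r"Access Violation \((0x[0-9a-fA-F]+)\) at address (0x[0-9a-fA-F]+) "
    r"(read|write) attempt to address (0x[0-9a-fA-F]+)"
)


def parse_access_violation(line):
    """Extract the fields of an access-violation line, or None if no match."""
    match = REASON_RE.search(line)
    if match is None:
        return None
    code, instruction, access, target = match.groups()
    target_value = int(target, 16)
    return {
        "code": code.lower(),
        "instruction": int(instruction, 16),  # address of the faulting instruction
        "access": access,                     # "read" or "write"
        "target": target_value,               # address the code tried to touch
        # Assumption: anything inside the first 4 KiB counts as "near null".
        "near_null": target_value < 0x1000,
    }


line = ("Reason: Access Violation (0xc0000005) at address 0x005C1E7C "
        "write attempt to address 0x00000024")
info = parse_access_violation(line)
print(info["access"], hex(info["target"]), info["near_null"])  # write 0x24 True
```

Grouping reports by the faulting instruction address would show whether several crashes come from the same spot in the executable.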
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 53041 - Posted 13 May 2008 19:11:28 UTC - in response to Message ID 53037.

Suddenly this morning, Rosetta is using all my CPU resources. I am having to kill it in order to do any work. I need to figure out how to leave the project - I cannot have this happening.


Nothing has changed on the end of Rosetta suddenly this morning. It is designed to run at a low priority, so anything else your computer is working on is ahead in line for the CPU. You can configure BOINC to use a fraction of the CPU, or to only run at specific times of day. You can just go to the advanced view, then use the advanced pulldown menu, and select preferences to set these up for that specific machine.

So, using all of your CPU is normal when you aren't doing anything else. And if it is causing any noticeable impact on your work, it is actually more likely an issue of how much memory is available than of the CPU being used.
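
As a rough illustration of what a "run only at specific times of day" rule amounts to, here is a small sketch. It is not BOINC's implementation: the function name is hypothetical, and the handling of equal start and end hours (treated as "no restriction") is an assumption. The one subtle case is a window that wraps past midnight, such as 18:00 to 08:00.

```python
def in_run_window(hour, start_hour, end_hour):
    """Return True if `hour` (0-23) falls inside the allowed window.

    The window is half-open: start_hour is included, end_hour is not.
    """
    if start_hour == end_hour:
        # Assumption: identical bounds mean computing is always allowed.
        return True
    if start_hour < end_hour:
        # Ordinary daytime window, e.g. 9 to 17.
        return start_hour <= hour < end_hour
    # Window wraps past midnight, e.g. 18 to 8.
    return hour >= start_hour or hour < end_hour


# Off-hours window from 18:00 to 08:00.
print(in_run_window(23, 18, 8))  # True
print(in_run_window(12, 18, 8))  # False
```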
____________
Rosetta Moderator: Mod.Sense

Alan Roberts

Joined: Jun 7 06
Posts: 61
ID: 93009
Credit: 6,031,648
RAC: 2,884
Message 53043 - Posted 13 May 2008 20:20:04 UTC

I have just finished an "observing" session on a Windows 2K server where multiple Mini 1.19 tasks were not honoring suspend behavior. I'm allowed to run Rosetta jobs on this machine during off hours. When I examined the tasks within BOINC Manager they reported as suspended, and were not accumulating CPU time. Checking in Windows Task Manager showed Rosetta Mini merrily consuming CPU. When I toggled Activity with BOINC Manager from Run based on preferences to Run always I would see the CPU time within BOINC Manager "catch up" to that shown in Windows Task Manager.

I aborted the first Mini job, and the second started and demonstrated the same behavior. Shut down the BOINC service (which did kill everything), and restarted. Problem continued. Shut down BOINC again, uninstalled and reinstalled BOINC (5.10.45) and restarted. Problem continued. Aborted the second Mini job, observed the problem with the third one, and aborted that job as well.

Now I've got Beta 5.96 tasks downloaded, and these are obeying suspend/resume flawlessly.

Has anyone else seen this, and more importantly, if so, is there a fix? I have a collection of machines where I'm allowed to run Rosetta only during off hours ... I'll have to pull them out of action if I can't count on reliable time of day suspends. Alternatively, is there anything I can do to tell any machine exhibiting this behavior to avoid Mini jobs, since Beta 5.96 is behaving correctly?
____________

caesar1987 Profile

Joined: Nov 28 06
Posts: 13
ID: 131900
Credit: 22,268
RAC: 0
Message 53048 - Posted 14 May 2008 0:45:13 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=162352303
____________

Venturini Dario[VENETO] Profile

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 53049 - Posted 14 May 2008 8:50:12 UTC

Validate error on an 84k+ seconds task (I'd say... rather annoying)

http://boinc.bakerlab.org/rosetta/result.php?resultid=162388905

(_KoDAk_) Profile

Joined: Jul 18 06
Posts: 109
ID: 100677
Credit: 1,859,263
RAC: 0
Message 53054 - Posted 14 May 2008 17:00:30 UTC

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024
http://boinc.bakerlab.org/rosetta/result.php?resultid=162428256

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76A342EB
\\\ + 1480Mb use of ram
http://boinc.bakerlab.org/rosetta/result.php?resultid=162386305

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76A342EB

http://boinc.bakerlab.org/rosetta/result.php?resultid=162246424
____________

(_KoDAk_) Profile

Joined: Jul 18 06
Posts: 109
ID: 100677
Credit: 1,859,263
RAC: 0
Message 53055 - Posted 14 May 2008 17:03:01 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=161724764
http://boinc.bakerlab.org/rosetta/result.php?resultid=161544482
http://boinc.bakerlab.org/rosetta/result.php?resultid=161438499
http://boinc.bakerlab.org/rosetta/result.php?resultid=161028445
____________

popandbob

Joined: Oct 30 05
Posts: 4
ID: 7724
Credit: 221,645
RAC: 0
Message 53059 - Posted 14 May 2008 19:31:42 UTC

3 errors...

Error 1

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 36000
# cpu_run_time_pref: 36000

ERROR: Conformation: fold_tree nres should match size
ERROR:: Exit from: ..\..\src\core\conformation\Conformation.cc line: 192
called boinc_finish

</stderr_txt>

error2

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

ERROR: unrecognized atom_type_name HOH
ERROR:: Exit from: c:\cygwin\home\boinc\boinc_build\minirosetta_1.19\mini\src\core/chemical/AtomTypeSet.hh line: 79
called boinc_finish

</stderr_txt>
]]>

error 3

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 36000


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3030 write attempt to address 0x00000004

Engaging BOINC Windows Runtime Debugger...
____________

alpha Profile

Joined: Nov 4 06
Posts: 27
ID: 127202
Credit: 953,255
RAC: 781
Message 53061 - Posted 14 May 2008 19:43:11 UTC

Access violation (exit code -1073741819 (0xc0000005)) after nearly 22,000 seconds:

http://boinc.bakerlab.org/rosetta/result.php?resultid=162546878
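
The two numbers in that exit code are the same 32-bit value printed two ways: -1073741819 is the signed reading and 0xc0000005 the unsigned hexadecimal one. A quick sketch of the conversion (the helper name is made up for illustration):

```python
def exit_code_to_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073741819))  # 0xc0000005
```

The mask works because 2**32 - 1073741819 = 3221225477, which is 0xC0000005.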
____________

Jipsu

Joined: Jan 27 08
Posts: 10
ID: 238496
Credit: 454,555
RAC: 0
Message 53072 - Posted 15 May 2008 12:47:45 UTC
Last modified: 15 May 2008 12:48:23 UTC

I think the out of memory error is corrected already in minirosetta v1.2 which is going thru testing at ralph at the moment.

24h minirosetta v1.2 tasks are taking around 150M of memory and the out of memory error in minirosetta v1.19 seems to exist only in the Windows version of the application.

Just throwing my thoughts around, but I think it's pointless to post out of memory errors since the problem is already fixed in v1.2.

David Emigh Profile

Joined: Mar 13 06
Posts: 158
ID: 65176
Credit: 417,178
RAC: 0
Message 53079 - Posted 15 May 2008 18:38:43 UTC - in response to Message ID 53072.

{...}
Just throwing my thoughts around, but I think it's pointless to post out of memory errors since the problem is already fixed in v1.2.


"Pointless" only for those who: 1) Participate in RALPH@home, 2) Have long runtime preferences, 3) Run Windows operating systems, and 4) Agree with the conclusion that the problem is solved.
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

Venturini Dario[VENETO] Profile

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 53100 - Posted 17 May 2008 10:02:17 UTC - in response to Message ID 53049.

Validate error on an 84k+ seconds task (I'd say... rather annoying)

http://boinc.bakerlab.org/rosetta/result.php?resultid=162388905



And here's another one:

http://boinc.bakerlab.org/rosetta/result.php?resultid=162547290


Both happened after a segfault error some hours before completion.

Dr Who Fan

Joined: May 28 06
Posts: 35
ID: 85050
Credit: 63,554
RAC: 43
Message 53112 - Posted 17 May 2008 22:27:58 UTC
Last modified: 17 May 2008 22:30:04 UTC

Anyone else seen this yet?

I have a single instance of minirosetta v1.19 using both "cores" of my Pentium 4 with Hyper-Threading.
It is not following the BOINC rules to use only 1 core/app/cpu.

It is currently running:
Task ID 164060225
Task Name h003__BOINC_ABRELAX_IGNORE_THE_REST-S25-5-S3-3--h003_-_3321_121_0
____________

The_Bad_Penguin

Joined: Jun 5 06
Posts: 2747
ID: 89694
Credit: 1,859,902
RAC: 0
Message 53116 - Posted 18 May 2008 1:19:31 UTC

For those with inquiring minds:

rb_05_17_11407_20379_tim23_IGNORE_THE_REST_06_10_3329_49

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 10440.8 cpu seconds
This process generated 8 decoys from 8 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 28.6254621435188
Granted credit 0
application version 1.19



and


rb_05_17_11462_20386_CRF-BP_IGNORE_THE_REST_06_17_3330_30

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 10685.3 cpu seconds
This process generated 6 decoys from 6 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 29.2953911287149
Granted credit 0
application version 1.19

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 53118 - Posted 18 May 2008 3:48:05 UTC
Last modified: 18 May 2008 3:48:26 UTC

huge debug dump on this task: rb_05_16_11639_20372_T0405_IGNORE_THE_REST_08_11_3323_227_0
http://boinc.bakerlab.org/rosetta/result.php?resultid=164159635

it completed most of its computing before hitting a big error: -1073741819 (0xc0000005)

CPU time 8536.469
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 28800


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005321B4 read attempt to address 0x3FC662A7

Engaging BOINC Windows Runtime Debugger...
____________

The_Bad_Penguin

Joined: Jun 5 06
Posts: 2747
ID: 89694
Credit: 1,859,902
RAC: 0
Message 53119 - Posted 18 May 2008 5:19:03 UTC

rb_05_16_11639_20371_T0405_IGNORE_THE_REST_05_11_3322_57_0



<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 9119.05 cpu seconds
This process generated 4 decoys from 4 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>rb_05_16_11639_20371_T0405_IGNORE_THE_REST_05_11_3322_57_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>


Validate state Invalid
Claimed credit 23.6733820084193
Granted credit 0
application version 1.19

Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 0
Message 53123 - Posted 18 May 2008 9:33:37 UTC

Running Ubuntu 7.10 x86 this task:
1opd__BOINC_ABRELAX_SAVE_ALL_OUT_IGNORE_THE_REST-S25-11-S3-4--1opd_-_3252_12
ended with a validate error for me after 11,727.76 seconds and ended successfully on the second run after 7,642.63 seconds running on Windows XP Professional Edition.
I've switched my computer off while this WU was running (no issues with that before).

Have a nice day,
Path7.

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 53177 - Posted 19 May 2008 20:09:09 UTC

Another long and scary debug dump here.
I think it has to do with my trying to install a USB card reader, which caused the system to go nuts.

It's here if you want to read the post-mortem:

rb_05_16_11639_20372_T0405_IGNORE_THE_REST_06_11_3323_425_0
____________

Scott McInness

Joined: Mar 15 08
Posts: 1
ID: 247250
Credit: 393,032
RAC: 0
Message 53197 - Posted 20 May 2008 14:59:01 UTC
Last modified: 20 May 2008 15:00:35 UTC

I've just updated BOINC on a PC that I haven't used for BOINC for about 12 months (wow, there's an x64 version now!) and every work unit initiated with mini 1.19 x86_64 crashes after less than a second. It also seems to run as a 32-bit process...

165005856 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
165012983 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
165017550 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
165018958 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
165019984 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175

There is a Rosetta Beta 5.96 x86_64 task running atm (which is also running as a 32-bit process) just on 13% without problem, and SETI tasks (32-bit only) seem to work too.

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 53204 - Posted 20 May 2008 22:11:57 UTC

This WU was marked "invalid" despite having a completely normal looking stderr.

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 521,019
RAC: 10
Message 53205 - Posted 20 May 2008 22:33:50 UTC

Task ID 164939423
Name rb_05_19_11641_20436_T0407_IGNORE_THE_REST_04_16_3332_224_0 had a Compute error

CPU time 4351.235
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600
======================================================
DONE :: 1 starting structures 4351.19 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>rb_05_19_11641_20436_T0407_IGNORE_THE_REST_04_16_3332_224_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>
Validate state Invalid
Claimed credit 17.070482774889
Granted credit 0
application version 1.19

Is this a ghost WU?
Cheers
Speedy
____________
Have a crunching good day!!

Pepo

Joined: Sep 28 05
Posts: 115
ID: 1676
Credit: 101,358
RAC: 0
Message 53213 - Posted 21 May 2008 0:43:27 UTC - in response to Message ID 53204.

This WU was marked "invalid" despite having a completely normal looking stderr.

Bad luck, I suppose it was because of the wrong WU settings:
minimum quorum: 1
initial replication: 1
max # of error/total/success tasks: 1, 2, 1
errors: Too many error results Cancelled


IMHO you should not have got the task resent after your wingman failed - a task born to be cancelled?

Devs?

Peter

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 521,019
RAC: 10
Message 53219 - Posted 21 May 2008 8:00:30 UTC

This is not a bug. I was wondering, are there any plans to display which model the work unit is up to? Thanks for your hard work on this application on behalf of all of the crunchers.
Cheers
Speedy

____________
Have a crunching good day!!

nouqraz Profile

Joined: Apr 8 08
Posts: 6
ID: 251899
Credit: 156,000
RAC: 0
Message 53225 - Posted 21 May 2008 12:05:54 UTC
Last modified: 21 May 2008 12:13:26 UTC

One of my systems seems to be having issues running minirosetta v1.19 WUs. It is a 4 processor Intel Xeon CPU X3210 (two dual core chips) running Server 2003 R2.

It seems to be crunching through Rosetta Beta 5.96 WUs no problem, but when it goes to start a mini 1.19 WU, it switches the task to "running" but no CPU time is ever used and the task stays at 0%. If I suspend all of the mini 1.19 WUs that are queued up, the system immediately begins crunching on any Rosetta Beta 5.96 WUs without any problem. I have left the system sitting in the "running" @ 0% state on mini units for hours and it hasn't gotten anywhere; my only option seems to be to suspend or abort the work units.

I have two other machines - one an Intel P4, the other a Core 2 Quad 9300, both running XP - that seem to have no problems running mini or beta WUs.

Is it possible to get the client to not receive mini WUs? Or is there some known reason behind these stalled work units that there is a workaround for?

Thanks,
Adam

Jeremy

Joined: May 15 08
Posts: 13
ID: 259031
Credit: 2,636
RAC: 0
Message 53248 - Posted 21 May 2008 20:34:42 UTC - in response to Message ID 53225.

I have had nothing but Compute errors with the mini version of rosetta. See this page
http://boinc.bakerlab.org/rosetta/results.php?userid=259031

I'd rather only have the normal ones, for 2 reasons. One, it keeps giving errors, so the CPU time isn't put to use. Two, it doesn't have proper graphics, but I've read that that is not a priority.

I'd like to help debugging this application by sending whatever information you need.

Here is my host sheet.
http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=812509

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 53294 - Posted 23 May 2008 9:37:05 UTC

5croA_BOINC_ABRELAX_SAVE_ALL_OUT_IGNORE_THE_REST-S25-7-S3-6--5croA-_3325_1_0 crashed and burned in a compute error.

Long error dump yet again.

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600


Unhandled Exception Detected...

____________

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 53351 - Posted 26 May 2008 13:44:56 UTC
Last modified: 26 May 2008 13:47:09 UTC

another one:
h001__BOINC_ABRELAX_IGNORE_THE_REST-S25-11-S3-5--h001_-_3324_45140_0
Client error
Client state Done
Exit status -1073741819 (0xc0000005)
Computer ID 293392
Report deadline 30 May 2008 19:09:52 UTC
CPU time 19774.5
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3030 write attempt to address 0x00000004

Engaging BOINC Windows Runtime Debugger...


It did grant me credit, amazingly enough.
____________

Pepo

Joined: Sep 28 05
Posts: 115
ID: 1676
Credit: 101,358
RAC: 0
Message 53726 - Posted 16 Jun 2008 19:38:54 UTC - in response to Message ID 52916.

It should not. Which client, 5.10.45?

yes, 5.10.45

The "crash on project detach" bug should be fixed in next 6.2 release (changeset [15407]).

Peter


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC