minirosetta v1.19 bug thread

David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 52982 - Posted: 11 May 2008, 2:05:18 UTC

I've got a "Compute Error" on 76,000+ seconds of CPU time.

Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024

There is a large and detailed debugger report.
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
Betting Slip

Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 52985 - Posted: 11 May 2008, 11:27:02 UTC - in response to Message 52979.  


Betting Slip wrote:
It just strikes me that the very knowledgeable Rom is arrogant enough to point to the cause without indicating any sort of a solution.

Rom Walton wrote:
In this particular case there isn't anything that any of us can do; I've passed the info on to the MiniRosetta devs. Basically, MiniRosetta is a 32-bit process, and 32-bit processes are generally limited to 2GB of user-mode memory. MiniRosetta hit that limit, so when it asked for more the OS said NO, leading to the crash.

The sign that this sort of problem has occurred is:
LoadLibraryA( dbghelp.dll ): GetLastError = 8

and
- Virtual Memory Usage -
VirtualSize: 2127511552, PeakVirtualSize: 2127511552


Sorry for not explaining the situation sooner; I was heading for bed and started thinking about how I was going to help the devs debug this problem in the wild if they are unable to reproduce it in the lab.

At present there isn't anything in the BOINC application framework that'll help them debug this in the wild.
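
For anyone who wants to check their own stderr output for the two signs above, here is a rough Python sketch. It is only an illustration: the function name `looks_like_address_space_exhaustion` and the 95% threshold are assumptions made for this example, not anything provided by BOINC or MiniRosetta.

```python
import re

# 2 GiB: the usual user-mode address space of a 32-bit Windows process.
USER_MODE_LIMIT = 2 * 1024**3

def looks_like_address_space_exhaustion(stderr_text, threshold=0.95):
    """Return True if the text shows both signs described above.

    threshold is an assumed cut-off: the fraction of the 2 GiB limit that
    PeakVirtualSize must reach to count as "near the limit".
    """
    # Sign 1: dbghelp.dll could not be loaded, with Windows error code 8
    # (ERROR_NOT_ENOUGH_MEMORY).
    dll_failed = re.search(
        r"LoadLibraryA\(\s*dbghelp\.dll\s*\):\s*GetLastError\s*=\s*8\b",
        stderr_text,
    ) is not None

    # Sign 2: the peak virtual size is close to the 2 GiB limit.
    match = re.search(r"PeakVirtualSize:\s*(\d+)", stderr_text)
    near_limit = (
        match is not None
        and int(match.group(1)) >= threshold * USER_MODE_LIMIT
    )
    return dll_failed and near_limit

# Sample taken from the lines quoted in this post.
sample = """LoadLibraryA( dbghelp.dll ): GetLastError = 8
- Virtual Memory Usage -
VirtualSize: 2127511552, PeakVirtualSize: 2127511552
"""
print(looks_like_address_space_exhaustion(sample))
```

The reported PeakVirtualSize of 2127511552 bytes is about 99% of 2 GiB, which is why the sample is flagged.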





Thanks Rom, and sorry for being a bit short with you. I sometimes wonder where all this irritability comes from.

I sometimes long for a slower pace of life LOL
M.L.

Joined: 21 Nov 06
Posts: 182
Credit: 180,462
RAC: 0
Message 52988 - Posted: 11 May 2008, 15:22:43 UTC
Last modified: 11 May 2008, 15:24:52 UTC

Task ID 162368970
Name SSPAIR_MIN_ABINITIO_1fna_3115_6915_2
Workunit 145958649
Created 10 May 2008 22:29:48 UTC
Sent 10 May 2008 22:30:26 UTC
Received 11 May 2008 15:18:04 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 735230
Report deadline 20 May 2008 22:30:26 UTC
CPU time 0
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR: Option matching -fudge not found in command line top-level context

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 0
Granted credit 0
application version 1.19

Fudge is gooood, except in this case.
BobCat13

Joined: 18 Jun 06
Posts: 4
Credit: 130,387
RAC: 0
Message 52989 - Posted: 11 May 2008, 16:04:36 UTC - in response to Message 52979.  

Rom Walton wrote:
In this particular case there isn't anything that any of us can do; I've passed the info on to the MiniRosetta devs. Basically, MiniRosetta is a 32-bit process, and 32-bit processes are generally limited to 2GB of user-mode memory. MiniRosetta hit that limit, so when it asked for more the OS said NO, leading to the crash.


I just errored out with this same problem:

https://boinc.bakerlab.org/rosetta/result.php?resultid=162306731

Watching Process Explorer, the MiniRosetta application was constantly grabbing more memory. Both physical and virtual were increasing throughout the task's run. My preference was set to 24 hours, but it only made it to ~15.5 hours before reaching the 2GB limit. I then changed preferences to 2 hours and a task finished properly. It appears I will have to set preferences at 12 hours on this machine to avoid the 2GB limit.

For people running a Core2 at 3GHz or higher, you may want to try setting preferences at 8 hours or less to see if that helps.
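
For a rough feel for the numbers above, here is a back-of-the-envelope sketch in Python. It assumes memory grows linearly with run time on a given machine, which nothing in this thread confirms; `projected_peak_mib` is a made-up helper name, and the 15.5-hour figure is simply the one reported in this post.

```python
LIMIT_MIB = 2048.0  # 2 GiB user-mode limit, expressed in MiB

def projected_peak_mib(run_hours, hours_to_limit, baseline_mib=0.0):
    """Project memory use after run_hours.

    ASSUMPTION: memory grows linearly from baseline_mib and reaches the
    2 GiB limit after hours_to_limit hours, as observed on one machine.
    """
    rate_mib_per_hour = (LIMIT_MIB - baseline_mib) / hours_to_limit
    return baseline_mib + rate_mib_per_hour * run_hours

# The task in this post hit the limit after roughly 15.5 hours.
for hours in (2, 8, 12, 15.5):
    peak = projected_peak_mib(hours, 15.5)
    print(f"{hours:>5} h -> about {peak:.0f} MiB")
```

Under that linear assumption a 12-hour run would peak at roughly 1.5 GiB, which leaves some headroom. A faster CPU presumably does more work per hour and so grows faster, which would explain suggesting a shorter run time for it.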
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 52991 - Posted: 11 May 2008, 17:30:44 UTC

I just had a couple of mtlr_test2_S.00000001.*_3238_1 WUs error out.

https://boinc.bakerlab.org/rosetta/result.php?resultid=161862548
https://boinc.bakerlab.org/rosetta/result.php?resultid=161862513

In both cases the WU ran the normal length of time (16 hr), then printed a bunch of:

can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
...

lines to stderr. The WUs ended up being marked "invalid".

These WUs were on separate machines, both running Linux.
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 52993 - Posted: 11 May 2008, 19:20:44 UTC

A second "Compute Error", this one on 85,000+ seconds of CPU time:

Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024

There is a large and detailed debugger message.

This error, and the one I reported earlier in this thread, have the same signature as the errors I was getting with mini 1.15, errors which crippled two stable and reliable crunchers until I discovered a workaround.

The only difference now is that the mini 1.19 workunits take about twice as long to crash, resulting in twice as much wasted CPU time...
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
radu

Joined: 7 May 08
Posts: 4
Credit: 66,301
RAC: 0
Message 52999 - Posted: 12 May 2008, 4:07:10 UTC
Last modified: 12 May 2008, 4:08:43 UTC

More segfaults, on Linux running the 5.10.45 client.

Apparently there was a problem with the network connection and the client kept trying to reconnect.

All tasks whose results were about to be sent were marked with "compute error", for example: https://boinc.bakerlab.org/rosetta/result.php?resultid=162637592

I hope this helps.

Output of dmesg:
tg3: eth0: Link is down.
minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
minirosetta_1.1[1258]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
Clocksource tsc unstable (delta = -116217092 ns)
minirosetta_1.1[1348]: segfault at ff7fbff8 rip 881dcb0 rsp ff7fbed8 error 6
rosetta_beta_5.[1353]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1363]: segfault at ff7fbff8 rip 881dcb0 rsp ff7fbed8 error 6
rosetta_beta_5.[1367]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
minirosetta_1.1[1375]: segfault at ff1fbff8 rip 881dcb0 rsp ff1fbed8 error 6
rosetta_beta_5.[1379]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
minirosetta_1.1[1390]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1395]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1411]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1407]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1422]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
rosetta_beta_5.[1426]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1434]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1440]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1449]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1454]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
rosetta_beta_5.[1470]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
rosetta_beta_5.[1463]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1486]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1481]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
minirosetta_1.1[1498]: segfault at ff1fbff8 rip 881dcb0 rsp ff1fbed8 error 6
rosetta_beta_5.[1503]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1509]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
rosetta_beta_5.[1514]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
minirosetta_1.1[1520]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1526]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1537]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
rosetta_beta_5.[1543]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
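
A long dmesg listing like this is easier to read once it is grouped. The Python sketch below counts segfaults per process name and instruction pointer; the regular expression is written only against the line format shown in this post, and `count_segfaults` is a name invented for the example.

```python
import re
from collections import Counter

# Matches lines such as:
#   minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
SEGFAULT_RE = re.compile(
    r"^(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)$"
)

def count_segfaults(dmesg_text):
    """Count segfault lines, keyed by (process name, instruction pointer)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = SEGFAULT_RE.match(line.strip())
        if m:  # non-segfault lines (link down, clocksource) are skipped
            counts[(m.group("proc"), m.group("rip"))] += 1
    return counts

# A few lines copied from the output above.
sample = """tg3: eth0: Link is down.
minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
minirosetta_1.1[1258]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
Clocksource tsc unstable (delta = -116217092 ns)
rosetta_beta_5.[1353]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
"""
for (proc, rip), n in count_segfaults(sample).items():
    print(proc, "rip", rip, "x", n)
```

In the listing above each binary always faults at the same instruction pointer (881dcb0 for minirosetta, 8e8fe90 for rosetta_beta), so the grouped output collapses to two rows.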
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 53001 - Posted: 12 May 2008, 4:59:17 UTC

A third "Compute Error", this one on 73,000+ seconds of CPU time.

The system cannot find the path specified. (0x3) - exit code 3 (0x3)
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
Jipsu

Joined: 27 Jan 08
Posts: 10
Credit: 454,555
RAC: 0
Message 53004 - Posted: 12 May 2008, 8:30:32 UTC

Had three WUs fail because they exceeded the 2GB memory limit.

WU1
WU2
WU3

For some reason this seems to be a problem with the Windows version of minirosetta. On my Linux server the memory usage peak seems to be around 150MB, and both computers have the same runtime preferences.
Path7

Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 53005 - Posted: 12 May 2008, 10:55:36 UTC
Last modified: 12 May 2008, 10:56:29 UTC

Hello all,
Running Ubuntu 7.10 x86, Task ID 162048556 has Outcome = Success, but a doubled message in stderr:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 13761.5 cpu seconds
This process generated 4 decoys from 4 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 16847.8 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

From BOINC I got this message:
ma 12 mei 2008 01:33:52 CEST|rosetta@home|Task 1bkrA_BOINC_ABRELAX_IGNORE_THE_REST-S25-10-S3-11--1bkrA-_3181_3_1 exited with zero status but no 'finished' file
ma 12 mei 2008 01:33:52 CEST|rosetta@home|If this happens repeatedly you may need to reset the project.

Its total runtime was 16848.04 seconds.

This WU had errored before, running on Windows XP, where it was marked Invalid.
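
To make the doubled message easier to see, here is a small Python sketch that pulls each "DONE" summary out of a stderr dump. It is written only against the text format shown in this post; `summarize_runs` is a hypothetical helper name, and reading two summaries as "the application finished twice" is an interpretation, not something the log states.

```python
import re

# One summary block looks like:
#   DONE :: 1 starting structures 13761.5 cpu seconds
#   This process generated 4 decoys from 4 attempts
DONE_RE = re.compile(
    r"DONE :: (\d+) starting structures\s+([\d.]+) cpu seconds\s+"
    r"This process generated (\d+) decoys from (\d+) attempts"
)

def summarize_runs(stderr_text):
    """Return one dict per DONE block found in the stderr text."""
    return [
        {"cpu_seconds": float(cpu), "decoys": int(decoys), "attempts": int(attempts)}
        for _starts, cpu, decoys, attempts in DONE_RE.findall(stderr_text)
    ]

# Trimmed copy of the stderr shown above.
sample = """# cpu_run_time_pref: 14400
DONE :: 1 starting structures 13761.5 cpu seconds
This process generated 4 decoys from 4 attempts
called boinc_finish
# cpu_run_time_pref: 14400
DONE :: 1 starting structures 16847.8 cpu seconds
This process generated 1 decoys from 1 attempts
called boinc_finish
"""
runs = summarize_runs(sample)
print(len(runs), "summary block(s)")
for run in runs:
    print(run)
```

More than one block in a single result would be consistent with BOINC restarting the task after the "exited with zero status but no 'finished' file" message.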

Have a nice day,
Path7.
BitSpit
Joined: 5 Nov 05
Posts: 33
Credit: 4,147,344
RAC: 0
Message 53006 - Posted: 12 May 2008, 11:32:20 UTC - in response to Message 52999.  

radu wrote:
More segfaults, on Linux running the 5.10.45 client.

Apparently there was a problem with the network connection and the client kept trying to reconnect.


That's usually caused by a known, unfixed BOINC flaw, not Rosetta. When BOINC is resolving a domain name, it blocks all other communication, including with running tasks. If that continues past 30 seconds, things start failing/crashing. The only known workaround is changing the DNS timeout. That's done in resolv.conf (usually located at /etc/resolv.conf) by adding the line

options timeout:2

That makes each attempt 2 seconds, with the default of 2 retries per DNS server. You can play with the options some based on your number of DNS servers, but try not to go over 25 seconds.
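
The arithmetic behind the 25-second guideline can be sketched in Python. This is only an approximation: it assumes every DNS server is tried the full number of times and that each try waits out the whole timeout, which real resolvers may not do exactly. `worst_case_dns_wait` and `within_budget` are names made up for this example.

```python
# Line suggested above for /etc/resolv.conf:
#   options timeout:2

def worst_case_dns_wait(timeout_s, attempts, nameservers):
    """Approximate upper bound, in seconds, on one failed lookup.

    ASSUMPTION: each nameserver is tried `attempts` times and every try
    waits the full `timeout_s` before giving up.
    """
    return timeout_s * attempts * nameservers

def within_budget(timeout_s, attempts, nameservers, budget_s=25):
    """True if the worst case stays at or under the suggested 25 seconds."""
    return worst_case_dns_wait(timeout_s, attempts, nameservers) <= budget_s

# timeout:2 with the default of 2 attempts, for 1 to 3 DNS servers.
for servers in (1, 2, 3):
    total = worst_case_dns_wait(2, 2, servers)
    print(servers, "server(s):", total, "s, ok:", within_budget(2, 2, servers))
```

By the same estimate, a 5-second timeout (the usual resolver default) with 2 attempts and 3 servers comes to 30 seconds, which is right at the point where things start failing.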
glaesum

Joined: 16 Oct 06
Posts: 21
Credit: 467,675
RAC: 0
Message 53007 - Posted: 12 May 2008, 12:04:12 UTC - in response to Message 52991.  
Last modified: 12 May 2008, 12:07:33 UTC

I got a similar error message to AMD_is_logical's, except that the task succeeded and validated, even on top of the error reporting that comes with every work unit done on the Win98 OS. Note that the 'psipred' line occurred only three times, which equals the number of decoys, hmmmm??
https://boinc.bakerlab.org/rosetta/result.php?resultid=162095619

Received 11 May 2008 2:20:03 UTC
<core_client_version>5.10.30</core_client_version>
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
WARNING: Override of option -out:nstruct sets a different value
can not open psipred_ss2 file tt
# cpu_run_time_pref: 14400
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
======================================================
DONE :: 1 starting structures 10977 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>

(this work unit had been through boinc v.6.2 with another user where it failed to validate)
]]>

AMD_is_logical wrote:
I just had a couple of mtlr_test2_S.00000001.*_3238_1 WUs error out.

https://boinc.bakerlab.org/rosetta/result.php?resultid=161862548
https://boinc.bakerlab.org/rosetta/result.php?resultid=161862513

In both cases the WU ran the normal length of time (16 hr), then printed a bunch of:

can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
...

lines to stderr. The WUs ended up being marked "invalid".

These WUs were on separate machines, both running Linux.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 53010 - Posted: 12 May 2008, 13:30:03 UTC

Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?
Rosetta Moderator: Mod.Sense
Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 53015 - Posted: 12 May 2008, 18:24:19 UTC

I've got a Mini 1.19 work unit with a duration of 13:27:06 (machine is set for 14hr target) that has consumed 33:17:05, with a progress of 0.000%.

Is this a stuck job that should be aborted, or should I let it grind on in hopes of producing something? Thanks.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 53016 - Posted: 12 May 2008, 18:52:43 UTC

Alan Roberts wrote:
Is this a stuck job that should be aborted, or should I let it grind on in hopes of producing something?


I'd suggest you suspend it and resume it again. If the progress % doesn't change within 5 minutes of going back to a "running" status, I'd abort it.
Rosetta Moderator: Mod.Sense
MacDitch

Joined: 1 Aug 06
Posts: 10
Credit: 206,444
RAC: 0
Message 53017 - Posted: 12 May 2008, 19:10:02 UTC

This computer errors on every Rosetta Mini work unit it gets - immediately!
WU 1, WU 2, WU 3 & WU 4

I've literally just done this WU 5 and the messages in the manager were:
12/05/2008 18:09:02|rosetta@home|Starting fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0
12/05/2008 18:09:02|rosetta@home|Starting task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 using minirosetta version 119
12/05/2008 18:09:04|rosetta@home|Computation for task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 finished
12/05/2008 18:09:04|rosetta@home|Output file fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0_0 for task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 absent
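
If you have a lot of these in the message log, a short script can pick out the tasks that "finish" almost as soon as they start. The Python sketch below is written only against the log lines shown in this post: the day/month/year timestamp format and the 10-second cut-off are assumptions, and `quick_failures` is a hypothetical helper name.

```python
import re
from datetime import datetime

def quick_failures(log_text, max_seconds=10):
    """Map task name -> seconds between first 'Starting' and 'finished'.

    Only tasks that finished within max_seconds (an assumed cut-off) are
    returned. Timestamps are assumed to be day/month/year.
    """
    started = {}
    flagged = {}
    for line in log_text.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue
        stamp, _project, message = parts
        try:
            when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        except ValueError:
            continue  # skip lines with a different timestamp format
        m = re.match(r"Starting (?:task )?(\S+)", message)
        if m:
            started.setdefault(m.group(1), when)  # keep the earliest start
            continue
        m = re.match(r"Computation for task (\S+) finished", message)
        if m and m.group(1) in started:
            elapsed = (when - started[m.group(1)]).total_seconds()
            if elapsed <= max_seconds:
                flagged[m.group(1)] = elapsed
    return flagged

# Log lines copied from the post above.
sample = """12/05/2008 18:09:02|rosetta@home|Starting fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0
12/05/2008 18:09:02|rosetta@home|Starting task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 using minirosetta version 119
12/05/2008 18:09:04|rosetta@home|Computation for task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 finished
"""
print(quick_failures(sample))
```

Here the task goes from "Starting" to "finished" in 2 seconds, which matches the immediate errors described above.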


Note: The computer happily crunches on ~15 projects, has had no changes in weeks and does Rosetta Beta without problems... :?

Any ideas out there?
BitSpit
Joined: 5 Nov 05
Posts: 33
Credit: 4,147,344
RAC: 0
Message 53020 - Posted: 12 May 2008, 20:51:19 UTC - in response to Message 53010.  

Mod.Sense wrote:
Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?


Yes, that usually is the cause of it. I don't know if there's an official bug report on it. I do know it's a question that shows up in the BOINC forums every few months where the explanation is given and they claim it would be too much effort to fix.
Ingleside

Joined: 25 Sep 05
Posts: 107
Credit: 1,007,873
RAC: 2,567
Message 53021 - Posted: 12 May 2008, 21:11:41 UTC - in response to Message 53010.  

Mod.Sense wrote:
Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?

Not sure if there's a trac item for this, but waiting for a DNS lookup is definitely one of the reasons for "no heartbeat". If I remember correctly, I've seen this problem on win2k: one time when the internet connection went down, it was spitting out "no heartbeat" all the time, while another machine running win2003 just continued crunching even though it couldn't do DNS lookups...

Not sure, but during very heavy disk usage it's likely also possible to get a "no heartbeat".

And, at least in my experience, each and every time any of the DVD players makes a nasty noise before spitting out a "read-error", I get a "no heartbeat" in BOINC...


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 53023 - Posted: 12 May 2008, 22:15:11 UTC

Mod.Sense, thanks for the suggestion. It doesn't seem to have fixed anything for this machine, but it does produce an interesting result.

According to BOINC Manager, this task is now suspended, and I see no CPU time accumulation within BOINC Manager. According to Windows Task Manager, minirosetta_1.1 is still grinding along, consuming CPU. When I resume the task the CPU time display within BOINC Manager catches up with what Windows Task Manager reports.

I'm off to see if there is a later version of BOINC, but this work unit is looking like an abort.
senatoralex85

Joined: 27 Sep 05
Posts: 66
Credit: 169,644
RAC: 0
Message 53025 - Posted: 12 May 2008, 22:57:04 UTC - in response to Message 53021.  
Last modified: 12 May 2008, 22:58:12 UTC

Mod.Sense wrote:
Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?

Ingleside wrote:
Not sure if there's a trac item for this, but waiting for a DNS lookup is definitely one of the reasons for "no heartbeat". If I remember correctly, I've seen this problem on win2k: one time when the internet connection went down, it was spitting out "no heartbeat" all the time, while another machine running win2003 just continued crunching even though it couldn't do DNS lookups...

Not sure, but during very heavy disk usage it's likely also possible to get a "no heartbeat".

And, at least in my experience, each and every time any of the DVD players makes a nasty noise before spitting out a "read-error", I get a "no heartbeat" in BOINC...



David Baker has gotten this error on his own laptop.

See here

https://boinc.bakerlab.org/rosetta/result.php?resultid=161624205

stderr out <core_client_version>5.4.9</core_client_version>
<stderr_txt>
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
FILE_LOCK::unlock(): close failed.: Bad file descriptor
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 131.476 cpu seconds
This process generated 0 decoys from 0 attempts

**Edit** Added Error Log results!



©2022 University of Washington
https://www.bakerlab.org