Posts by Rom Walton (BOINC)

1) Message boards : Number crunching : minirosetta v1.19 bug thread (Message 53029)
Posted 13 May 2008 by Rom Walton (BOINC)
Post:
I'll throw in a bit more about the no heartbeat message.

At least once per release cycle we try to resolve this issue; so far the attempts to resolve it have led to crashes within the core client.

DNS resolution is done through libcurl, and using either libcurl's native async-dns solution or the c-ares library hasn't resolved the issue. We haven't found a way to reproduce this issue in a lab environment, and so we haven't been able to give the libcurl guys enough information to get it fixed.

So until we can get more info to the libcurl guys who can then fix it, the no heartbeat message is better than a crash.
2) Message boards : Number crunching : minirosetta v1.19 bug thread (Message 52979)
Posted 10 May 2008 by Rom Walton (BOINC)
Post:

It just strikes me that the very knowledgeable Rom is arrogant enough to point to the cause without indicating any sort of a solution.


In this particular case there isn't anything that any of us can do; I've passed the info on to the MiniRosetta devs. Basically, MiniRosetta is a 32-bit process, and 32-bit processes are generally limited to 2GB of user-mode memory. MiniRosetta hit that limit, so when it asked for more the OS said NO, leading to the crash.

The sign that this sort of problem has occurred is:
LoadLibraryA( dbghelp.dll ): GetLastError = 8

and
- Virtual Memory Usage -
VirtualSize: 2127511552, PeakVirtualSize: 2127511552
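
As a rough illustration of why those two lines point at memory exhaustion: Windows error code 8 is ERROR_NOT_ENOUGH_MEMORY, and the reported VirtualSize sits just under the 2GB user-mode ceiling. The sketch below is not part of the BOINC diagnostics; the helper names and the 64 MiB threshold are assumptions made up for the example.

```python
# Default user-mode address space for a 32-bit Windows process (2 GiB).
USER_MODE_LIMIT = 2 * 1024**3

# Windows system error code 8 is ERROR_NOT_ENOUGH_MEMORY.
ERROR_NOT_ENOUGH_MEMORY = 8


def headroom_mib(virtual_size: int, limit: int = USER_MODE_LIMIT) -> float:
    """Return how many MiB of address space remain below the limit."""
    return (limit - virtual_size) / 1024**2


def looks_out_of_memory(last_error: int, virtual_size: int,
                        threshold_mib: float = 64.0) -> bool:
    """Hypothetical heuristic: error 8 plus very little headroom left.

    The threshold is an arbitrary value chosen for illustration.
    """
    return (last_error == ERROR_NOT_ENOUGH_MEMORY
            and headroom_mib(virtual_size) < threshold_mib)


if __name__ == "__main__":
    size = 2127511552  # VirtualSize from the crash report above
    print(f"headroom: {headroom_mib(size):.2f} MiB")
    print("out of memory?", looks_out_of_memory(8, size))
```

With the numbers from the report, only about 19 MiB of address space was left, so even loading dbghelp.dll for the crash dump failed.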


Sorry for not explaining the situation sooner; I was heading for bed and started thinking about how I was going to help the devs debug this problem in the wild if they are unable to reproduce it in the lab.

At present there isn't anything in the BOINC application framework that'll help them debug this in the wild.


3) Message boards : Number crunching : minirosetta v1.19 bug thread (Message 52961)
Posted 10 May 2008 by Rom Walton (BOINC)
Post:

All access violations

http://boinc.bakerlab.org/rosetta/result.php?resultid=161740698
http://boinc.bakerlab.org/rosetta/result.php?resultid=160201341
http://boinc.bakerlab.org/rosetta/result.php?resultid=159794241
http://boinc.bakerlab.org/rosetta/result.php?resultid=160129454
http://boinc.bakerlab.org/rosetta/result.php?resultid=160185394
http://boinc.bakerlab.org/rosetta/result.php?resultid=161332559
http://boinc.bakerlab.org/rosetta/result.php?resultid=159408171


All those crashes are a result of an out of memory error.
4) Message boards : Number crunching : BOINC Q&A (Message 26908)
Posted 16 Sep 2006 by Rom Walton (BOINC)
Post:
Click on the comments link at the bottom of the article.
5) Message boards : Number crunching : BOINC Q&A (Message 26778)
Posted 14 Sep 2006 by Rom Walton (BOINC)
Post:
In an effort to improve communication between the BOINC project and the community about the future of the BOINC project I'll be holding a weekly Q&A on my blog.

I'm fielding this as an experiment right now, as a way to find out the kinds of things the community is interested in knowing.

If there is a lot of interest in this sort of thing maybe the guys who publish bunc will be willing to pick it up as part of their newsletter.

What do you all think?
6) Message boards : Number crunching : Simple boinc installer (Message 24233)
Posted 22 Aug 2006 by Rom Walton (BOINC)
Post:
Sorry, that was me.

:)

Forgot to switch back to my account before posting.
7) Message boards : Number crunching : Rosetta@Home Presentation (Message 20286)
Posted 16 Jul 2006 by Rom Walton (BOINC)
Post:
I don't know if this has already been posted or not but I happened to watch this on TV and thought you all would be interested.

http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=449

It is a presentation David Baker gave to the computer science department at the University of Washington.

Enjoy.
8) Message boards : Number crunching : Report Problems with Rosetta Version 5.24 (Message 19254)
Posted 24 Jun 2006 by Rom Walton (BOINC)
Post:

Ideally, of course, the BOINC server should be smarter and, noticing that your PC can't handle the ultra-big WU, send you one of the regular, smaller jobs. But this feature is unfortunately not yet available in the BOINC server code.


Actually it is. There is only enough space in the feeder queue for 1,000 workunits. When the scheduler connects up to the feeder queue to get work it cycles through all 1,000 slots looking for available work. When all 1,000 queue slots are filled up with large jobs that is what the server returns.

Splitting the queue up equally is supported with different applications. If this is really a big problem we could set things up in such a way that the project believes it has more than one application and 50% of the queue is saved for each application.
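
To make the queue behaviour concrete, here is a toy model. It is not the actual BOINC feeder or scheduler code: the job dictionaries, the function names, and the "fall back to the first slot" rule are all assumptions made for illustration. Only the 1,000-slot queue size and the 50/50 split come from the description above.

```python
from typing import Optional

QUEUE_SLOTS = 1000  # feeder queue size mentioned in the post


def fill_queue(pending: dict, shares: dict) -> list:
    """Fill the queue, reserving a fraction of the slots for each application."""
    queue = []
    for app, share in shares.items():
        slots = int(QUEUE_SLOTS * share)
        queue.extend(pending[app][:slots])
    return queue


def pick_job(queue: list, host_mem_mb: int) -> Optional[dict]:
    """Scan every slot; prefer a job that fits the host.

    Assumed fallback: if nothing fits, hand out whatever is in the queue.
    """
    for job in queue:
        if job["mem_mb"] <= host_mem_mb:
            return job
    return queue[0] if queue else None


if __name__ == "__main__":
    pending = {
        "large": [{"app": "large", "mem_mb": 1024} for _ in range(2000)],
        "regular": [{"app": "regular", "mem_mb": 256} for _ in range(2000)],
    }
    only_large = fill_queue(pending, {"large": 1.0})
    split = fill_queue(pending, {"large": 0.5, "regular": 0.5})
    print("queue full of large jobs ->", pick_job(only_large, 512)["app"])
    print("queue split 50/50        ->", pick_job(split, 512)["app"])
```

In this model a 512MB host can only be handed a large job when the whole queue is large jobs; once half the slots are reserved for a second "application", the same scan finds a regular job.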
9) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 12797)
Posted 29 Mar 2006 by Rom Walton (BOINC)
Post:
Report all Work Unit errors on this thread that are NOT -

    "1%" Hang
    "Max Time Exceeded"
    or other "stuck" or "hung" workunits




Hi all,

Have seen the message about downloading the PDB file (I dl'd version 4.83 to "match" the v4.83 application I have) and having had issues before, thought that maybe, if I had problems this time around, then at least some decent reports will go back.

And I've just had the following errors:

29/03/2006 13:36:59|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5153_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:37:42|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5070_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:38:25|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5196_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:39:08|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5188_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:39:50|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5215_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:40:31|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5154_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:41:12|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5114_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:41:53|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5117_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:42:32|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5175_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:43:12|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5184_0 ( - exit code -529697949 (0xe06d7363))


This is using 3GHz P4 (with HT), 512Mb memory, Win XP (Srv Pck 2) + BOINC v5.3.28

Hope this helps the "cause" to resolve the bugs.



Will go back to crunching on RALPH instead...!


regards,

Tim


That error code usually means the machine ran out of memory during the execution of the workunit. Since you only have 512MB of RAM and one instance of Rosetta can use up to 250MB of RAM, I would recommend turning off HT.
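
For anyone wondering how the two numbers in those log lines relate: the negative exit code is the same 32-bit value as the hex one, just printed as a signed integer. 0xE06D7363 is the exception code the Microsoft C++ runtime uses for an unhandled C++ exception (0xE0 followed by ASCII "msc"), which is consistent with a failed allocation. The sketch below is only an illustration; the function names are made up, and the two-instances-with-HT figure is an assumption based on the numbers in this post.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as unsigned."""
    return code & 0xFFFFFFFF


def describe_exit_code(code: int) -> str:
    """Format the code as hex and show the ASCII tag in its low three bytes."""
    value = to_unsigned32(code)
    tag = value.to_bytes(4, "big")[1:].decode("ascii", errors="replace")
    return f"0x{value:08x} (tag {tag!r})"


def rosetta_memory_mb(instances: int, per_instance_mb: int = 250) -> int:
    """Worst-case memory for N Rosetta instances, using the 250MB figure above."""
    return instances * per_instance_mb


if __name__ == "__main__":
    print(describe_exit_code(-529697949))
    # Assumption: with HT on, BOINC runs two instances on this machine.
    print("HT on :", rosetta_memory_mb(2), "MB of 512 MB")
    print("HT off:", rosetta_memory_mb(1), "MB of 512 MB")
```

Two instances at up to 250MB each would take up to 500MB of the 512MB installed, leaving almost nothing for Windows itself, which is why running a single instance is the safer setup here.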
10) Message boards : Number crunching : Help us solve the 1% bug! (Message 12747)
Posted 28 Mar 2006 by Rom Walton (BOINC)
Post:
Contact me offline and I'll let you know where to send it.

----- Rom
11) Message boards : Number crunching : Help us solve the 1% bug! (Message 12741)
Posted 28 Mar 2006 by Rom Walton (BOINC)
Post:
What is the size of your BOINC directory?

How many days' worth of workunits do you have? Which projects are attached?

Would you be willing to make a copy of the directory and in the copy abort all of the other workunits except the one that is stalling and zip everything up and send it to me?
12) Message boards : Number crunching : Help us solve the 1% bug! (Message 12478)
Posted 22 Mar 2006 by Rom Walton (BOINC)
Post:
A new version of Rosetta has been posted in the RALPH@Home project.

Release Notes

For those who are so inclined, please help us track down the issue by running RALPH@Home and if/when you find a workunit with the '1% bug' feel free to abort it and call it out in this thread.

Thanks in advance for any help you can provide.

----- Rom
13) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 12304)
Posted 20 Mar 2006 by Rom Walton (BOINC)
Post:
Ummmmm, that is the bug, or rather the manifestation of the bug that you are seeing.

The bug I fixed is the stay in memory or crash bug.

----- Rom


Rom,

The problem is many people running Ralph are also running Rosetta. The latest Ralph app has not yet been deployed in Rosetta. So when people remove the application to run Ralph, this impacts their Rosetta work adversely.

What they really need is guidance on how to run Ralph under these conditions.


I wish I had a good answer for ya, all I can say is this issue will become a thing of the past over the next day or two.

14) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 12277)
Posted 19 Mar 2006 by Rom Walton (BOINC)
Post:
Ummmmm, that is the bug, or rather the manifestation of the bug that you are seeing.

The bug I fixed is the stay in memory or crash bug.

----- Rom
15) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 12245)
Posted 19 Mar 2006 by Rom Walton (BOINC)
Post:
David Baker asked me to post a more detailed write-up on what we have been able to track down thus far.

I have posted the additional information to my blog since it can handle tables.

Think of it as a bird's-eye view of the project.

----- Rom
16) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12230)
Posted 19 Mar 2006 by Rom Walton (BOINC)
Post:
17) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12228)
Posted 19 Mar 2006 by Rom Walton (BOINC)
Post:
18) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12225)
Posted 18 Mar 2006 by Rom Walton (BOINC)
Post:

So the top 3 errors on Rosetta aren't the top 3 errors on Ralph. Great news. How far down on the Rosetta list are the top 3 errors that Ralph is having?


The ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED no longer appears on the list at all, and the 0xC0000005 only accounts for 6 of the 49 errors reported in the last 24 hours.

If the data from Ralph is any indication of how the application is going to behave on the public project, it should result in a 60%-70% reduction in the error rate for the public project.

----- Rom





