Major problems... multiple machines various errors + 100% complete lock down

Message boards : Number crunching : Major problems... multiple machines various errors + 100% complete lock down


Dougga

Send message
Joined: 27 Nov 06
Posts: 28
Credit: 5,248,050
RAC: 0
Message 53949 - Posted: 24 Jun 2008, 5:47:09 UTC

It seems that Rosetta is undergoing some growing pains. I live in Seattle on the block with one of the programmers. I need to buy him a few beers to really hear what's going on. If you surf my machines you'll see problems all over the place. The biggest annoyance is associated with locking up the client.

It seems that if a work unit is approaching expiration, it indicates that it is running on High Priority. It seems to me that this is a flag for trouble. If a unit is running high priority, it will lock up the client when it reaches 100%. I have one Intel Core 2 Quad and one Core 2 Duo, and both are showing this behavior. My overall productivity has taken a beating due to these irregularities.

In addition to this, I'm seeing lots of segmentation faults and miscellaneous programming errors. I'm thinking this is not machine-based but somehow related to the code in the application.
ID: 53949
Profile Greg_BE

Send message
Joined: 30 May 06
Posts: 5664
Credit: 5,706,358
RAC: 1,749
Message 53952 - Posted: 24 Jun 2008, 7:40:34 UTC

Odd that you lock up in high priority.
I am running a ton of stuff in high priority because I got a ton of work scheduled for the same day with various hours until expiration, and I have never had any lock-up issues. I have a Core 2 Duo as well and am not suffering the problem you're describing, even though I am pushing the CPU with an overclock.

Anyone else reading this have his problem?
ID: 53952
Profile dcdc

Send message
Joined: 3 Nov 05
Posts: 1829
Credit: 115,751,398
RAC: 58,222
Message 53953 - Posted: 24 Jun 2008, 7:52:43 UTC

Just want to make the distinction that 'high priority' in BOINC doesn't mean the thread is high priority from an operating-system point of view - the Rosetta thread will always run as a low-priority thread.
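A minimal sketch of that distinction (names and numbers are illustrative, not BOINC's actual code): "high priority" is a *scheduling* decision the BOINC client makes when a task risks missing its deadline, while the science application's process stays at low OS priority throughout.

```python
# Hypothetical sketch: BOINC marks a task "high priority" when the CPU
# time it still needs exceeds the wall-clock time realistically available
# before the deadline. The OS-level thread priority is unaffected.

def needs_high_priority(remaining_cpu_hours, hours_until_deadline,
                        on_fraction=1.0):
    """Return True if the task risks missing its deadline given the
    fraction of time the machine is actually available to crunch."""
    usable_hours = hours_until_deadline * on_fraction
    return remaining_cpu_hours > usable_hours

# A task needing 10 CPU-hours with 30 hours to deadline:
print(needs_high_priority(10, 30, on_fraction=0.25))  # True: deadline risk
print(needs_high_priority(10, 30, on_fraction=1.0))   # False: plenty of time
```

This is why many tasks can run "high priority" at once without any OS impact: the flag only reorders BOINC's own queue.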
ID: 53953
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Send message
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53955 - Posted: 24 Jun 2008, 8:21:13 UTC

Doug, can you give some specifics like what work units are getting stuck? We had a bad batch of work units sent out last week. The task names started with "t405_." I just posted a news item about it and am working on a fix.

It was a pretty bad bug that caused the client to sometimes stall and sit idle. We didn't catch it on Ralph because the stalled jobs did not get reported back so we had no information about their status. The successful jobs did get reported back of course, so it appeared okay on Ralph.
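A toy illustration of the reporting blind spot described above (hypothetical, not the actual Ralph/BOINC server code): stalled clients sit idle and never phone home, so server-side statistics are computed only over the tasks that did report, and the batch looks healthy.

```python
# Stalled tasks never report back, so the server only ever sees successes.
tasks = ["ok", "ok", "stalled", "ok", "stalled"]

reported = [t for t in tasks if t != "stalled"]  # stalls send no report

success_rate_seen = reported.count("ok") / len(reported)  # looks perfect
true_success_rate = tasks.count("ok") / len(tasks)        # actual rate

print(success_rate_seen, true_success_rate)  # 1.0 0.6
```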

ID: 53955
Profile [B^S] thierry@home

Send message
Joined: 17 Sep 05
Posts: 182
Credit: 281,902
RAC: 0
Message 53964 - Posted: 24 Jun 2008, 18:44:24 UTC
Last modified: 24 Jun 2008, 18:45:04 UTC

Here's a "bad" WU with a Q9300:
t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_10212_0


<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Fonction incorrecte. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 2584433
ERROR:: Exit from: .refold.cc line: 338

</stderr_txt>
]]>
ID: 53964
Profile Greg_BE

Send message
Joined: 30 May 06
Posts: 5664
Credit: 5,706,358
RAC: 1,749
Message 53966 - Posted: 24 Jun 2008, 19:57:26 UTC

I'm curious about this one as well; it's on my to-do list in a few days.
Mine is 4104_7660_0; it's the only one.
ID: 53966
Vatsan

Send message
Joined: 19 Nov 05
Posts: 2
Credit: 6
RAC: 0
Message 53969 - Posted: 24 Jun 2008, 21:27:04 UTC - in response to Message 53966.  

I'm curious about this one as well; it's on my to-do list in a few days.
Mine is 4104_7660_0; it's the only one.


This was my workunit. I am tracking down the problem. Here is my analysis based on preliminary investigation:
There are two stages in refinement. The first stage is aggressive loop modeling in the regions that are unaligned with the template, and the second stage is full-atom relax. In full-atom relax, the full chain structure (no broken loops) is backbone-perturbed, side-chain repacked, and minimized over a number of cycles.
However, it is possible that not all loops could be closed in the first stage. In such a case, Rosetta will not do full-atom relax. If the loop is not fully closed at the end of the first stage, Rosetta should write out the broken-loop structure and exit. I suspect this is not happening cleanly, and that might be the problem.
There is a mechanism in Rosetta to stochastically extend the length of the defined loop region to try to close the loop. As a result, if it is a hard-to-close loop, extending the loop could close it.
For this WU, not all jobs failed: those that extended the loop enough went on to the second stage and completed successfully, while those that did not extend the loop adequately failed.
The bottom line is that if the first stage (loop modeling) fails, the job should exit without an error, and that is not happening now. I am looking into it.
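The intended control flow described above can be sketched like this (all names and probabilities are hypothetical stand-ins, not Rosetta code): attempt loop closure, stochastically extend the loop window on retries, proceed to full-atom relax only on a fully closed chain, and otherwise write out the broken-loop model and exit cleanly rather than with an error.

```python
import random

def try_close(loop_len, rng):
    # Stand-in for loop modeling: a wider loop window closes
    # hard-to-close loops more often.
    return rng.random() < min(1.0, 0.2 + 0.1 * loop_len)

def refine(loop_len, max_tries=5, seed=0):
    rng = random.Random(seed)
    for _ in range(max_tries):
        if try_close(loop_len, rng):
            return "full-atom relax"           # stage 2: relax closed chain
        loop_len += rng.choice([0, 1, 2])      # stochastic loop extension
    # Loop never closed: write the broken-loop structure and exit
    # cleanly -- the bug was that this path produced an error instead.
    return "write broken loop, exit cleanly"

print(refine(1, seed=42))
```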
Sorry for all the trouble.
ID: 53969




©2024 University of Washington
https://www.bakerlab.org