Errors galore!! Multiple machines

Message boards : Number crunching : Errors galore!! Multiple machines

Dougga

Joined: 27 Nov 06
Posts: 28
Credit: 5,248,050
RAC: 0
Message 53538 - Posted: 5 Jun 2008, 0:46:51 UTC

I'm getting errors from all sorts of machines.

I get "too many restarts" errors on a machine with the "keep in memory" option selected. This is an AMD64 machine.

Here's an error I'm seeing on another machine, an Intel Core 2 Quad:

stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3358807
ERROR:: Exit from: fullatom_energy.cc line: 1958

</stderr_txt>
]]>

It seems things are suddenly unstable. People are suggesting my machines are showing bad memory, but I don't really buy this.

Are others seeing issues with Rosetta?
rochester new york

Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 53539 - Posted: 5 Jun 2008, 2:22:25 UTC - in response to Message 53538.  

I know in the last 3 or 4 weeks I've had a lot more than usual.

Greg_BE

Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 764
Message 53541 - Posted: 5 Jun 2008, 8:43:34 UTC - in response to Message 53538.  

Could you post a few links to the tasks that are producing errors?
They need to know which Rosetta application you are using and which specific task or set of tasks is failing. Based on the one line that says "ERROR:: Exit from: fullatom_energy.cc line: 1958", I would guess that there is a problem in the program itself, not with your machine. I had a whole rash of tasks that failed with disk space errors, but the next batch was just fine.
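
A minimal sketch, assuming the stderr text of each failed task has been copied into a string, of pulling out the source file and line from every "ERROR:: Exit from:" line so failures can be grouped by where the program gave up. The function names are hypothetical; only the line format is taken from the output quoted in this thread.

```python
import re
from collections import Counter

# Matches lines such as: ERROR:: Exit from: fullatom_energy.cc line: 1958
EXIT_RE = re.compile(r"ERROR:: Exit from: (\S+) line: (\d+)")

def exit_locations(stderr_text):
    """Return a list of (source_file, line_number) pairs found in one stderr dump."""
    return [(m.group(1), int(m.group(2))) for m in EXIT_RE.finditer(stderr_text)]

def tally_exit_locations(stderr_texts):
    """Count how often each exit location appears across several stderr dumps."""
    counts = Counter()
    for text in stderr_texts:
        counts.update(exit_locations(text))
    return counts

# Sample copied from the stderr output in the first post.
SAMPLE = """Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3358807
ERROR:: Exit from: fullatom_energy.cc line: 1958
"""

if __name__ == "__main__":
    for location, count in tally_exit_locations([SAMPLE]).items():
        print(location, count)
```

If many different tasks pile up on the same file and line, that hints at the application or the work units; a wide scatter of unrelated locations is harder to read either way.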

Ingleside

Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 53542 - Posted: 5 Jun 2008, 10:21:08 UTC - in response to Message 53538.  

Dougga wrote: "It seems things are suddenly unstable. People are suggesting my machines are showing bad memory but I don't really buy this. Are others seeing issues with Rosetta?"

Well, you didn't say if you've done any of the suggestions made in your last thread...

It doesn't need to be bad memory, it can be bad cpu, or something else...

On the AMD, the problem is possibly the OS or its drivers, or possibly access rights.

The quad... a very quick look shows a couple of WUs crashing within 1 minute; these are likely bad WUs. But there are also around 25 other crashes...

A very quick look through the top-computer list, checking 3 Linux systems from the top 60, showed some 1-minute crashes, but among the longer-running WUs there were only 4 crashes across the 3 computers...
I have no idea about the "Validation" errors, and I haven't counted them; possibly this is a Rosetta-server-based problem...

So maybe you're just unlucky, but with 20x the error rate of other Linux computers, I would still guess it's a computer problem...
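
As a rough check of that ratio, here is a small sketch using only the counts mentioned above (about 25 crashes on the quad, 4 crashes across 3 other Linux computers). It assumes both counts cover the same time window and a comparable number of tasks per computer, which was not verified, and the function name is made up for illustration.

```python
def crashes_per_computer(crashes, computers):
    """Average number of crashes per computer."""
    return crashes / computers

# Counts from the quick look described above (approximate).
quad_rate = crashes_per_computer(25, 1)    # the quad on its own
others_rate = crashes_per_computer(4, 3)   # 3 Linux systems from the top 60

# Assumption: same time window and similar task counts on every computer.
ratio = quad_rate / others_rate

if __name__ == "__main__":
    print(f"quad: {quad_rate:.2f}, others: {others_rate:.2f}, ratio: {ratio:.2f}")
```

That works out to a bit under 19 times the crashes per computer, which is in line with the rough "20x" figure.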

BTW, it doesn't need to be anything hardware-related. The Linux distribution you're using, or the libraries installed, can be the reason for the errors, while the other Linux computers use other distributions/libraries and don't get the errors...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
Greg_BE

Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 764
Message 53545 - Posted: 5 Jun 2008, 13:21:55 UTC
Last modified: 5 Jun 2008, 13:27:00 UTC

Take a look at this error message I found in one of his tasks:
https://boinc.bakerlab.org/rosetta/result.php?resultid=168818917

8741.23
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3343687
ABORT: bad to aa_rotno_to_packedrotno
aa,rot1/2/3/4: ILE 8 0 2 0 0
chi no 1 nchi 2 aav 1 is_chi_proton_rotamer(aa,aav,i) 0
ERROR:: Exit from: rotamer_functions.cc line: 1465

He has one or two others like this as well.
Later he gets a validation error after successfully completing the task.

Of course, some of these are CASP8 tasks, so that could be a cause.
They are running on Rosetta 5.96.

The quad machine had 5 errors in 24 hours, of which 4 were program errors and 1 was a validate error. One of the dual cores has validate errors, which is a Rosetta@home (RAH) issue, not his computer.

Another random sample of work shows a Mini task that crashed immediately on 2 systems.

I would call it a string of bad luck, not a hardware issue.
Ingleside

Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 53566 - Posted: 6 Jun 2008, 8:43:56 UTC - in response to Message 53545.  
Last modified: 6 Jun 2008, 8:45:17 UTC

The "Validate errors" are a Rosetta problem, and the WUs crashing after a couple of seconds on 2 different computers are obviously buggy.

The problem, in my opinion (apart from the crappy keyboard, not my computer),
is all the WUs his computer errors out on while someone else manages to finish them correctly...
Examples:
154098541 gives "ERROR:: Exit from: fullatom_energy.cc line: 2030"
153933379, same error
1538933379 with "ERROR:: Exit from: refold.cc line: 338"
153835521 with "ERROR: NANs occured in hbonding!
ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763"
153204401 with a long string of "sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range"

And the list goes on, with both the Mini and Beta applications.

Now, only a couple are paired with another Linux computer, so it is possible these are 2 buggy Linux applications. For the few where he is paired with a Linux computer, the other computer's shorter run time or slower speed could mean his crash happened further into the WU than the other computer ever got, so that is not a good indication either way.

Still, his 4-core having a much higher computer error rate than 3 of the 8-core Linux computers in the top 60 looks suspicious to me, so taking a little closer look at his computer shouldn't be a big problem.

After all, at least a couple of the checks, like "Overclocked or not" or "Oops, the CPU is running at 100 Celsius", are easily checked (and answered)...

Running Gromacs, Prime95 and memory tests, on the other hand, is much more time-consuming...


BTW, one method to test whether it's a bad Rosetta application or not: download a ton of work, disable the network, exit BOINC, back up BOINC, and restart BOINC.
If one or more of the WUs gives an error, re-run the same WU from the backup.
If the backup copy crashes at the same spot (for example, the 1st crashed after 2h and the backup after 2h1m), it's most likely a bad WU or application.

If, on the other hand, the backup copy finishes without crashing, or one copy crashed after 1 hour while the other crashed after 2.5 hours, it looks more like a hardware problem than a WU/application problem...

If there aren't any errors, this method will only lose the minute or so taken to make the backup copy. And even if there are errors, re-running a couple of WUs (optimal is to check 4 errors at once) will only take a couple of hours, not the 24h+ that using another program would.

BTW, in case stopping and restarting from a checkpoint has any influence, let the WUs run from start to finish...
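
The decision rule above can be written down as a small sketch. This is only an illustration: the function name, the return strings, and the 5-minute tolerance for "the same spot" are assumptions, picked so that 2h vs. 2h1m counts as the same spot and 1h vs. 2.5h does not.

```python
def classify_rerun(first_crash_s, rerun_crash_s, tolerance_s=300):
    """Compare the original crash with a re-run of the same WU from the backup.

    first_crash_s: run time in seconds at which the original copy crashed.
    rerun_crash_s: run time in seconds at which the backup copy crashed,
                   or None if the backup copy finished without crashing.
    tolerance_s:   how close two crash times must be to count as "the same
                   spot" (assumed value, not from the post).
    """
    if rerun_crash_s is None:
        # Same WU, same application, but this time it finished.
        return "hardware"
    if abs(first_crash_s - rerun_crash_s) <= tolerance_s:
        # Both copies died at about the same point in the work unit.
        return "wu_or_application"
    # Both copies crashed, but at very different points.
    return "hardware"

if __name__ == "__main__":
    print(classify_rerun(2 * 3600, 2 * 3600 + 60))   # 2h vs. 2h1m
    print(classify_rerun(1 * 3600, int(2.5 * 3600))) # 1h vs. 2.5h
    print(classify_rerun(2 * 3600, None))            # backup copy finished
```

A single comparison is weak evidence, which is why checking several errored WUs at once is the better test.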
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."




©2024 University of Washington
https://www.bakerlab.org