Lots of jobs in error

Profile Cureseekers~Kristof

Joined: 5 Nov 05
Posts: 80
Credit: 689,603
RAC: 0
Message 32583 - Posted: 13 Dec 2006, 13:29:30 UTC

Hello,

One of our members of DPC has got some jobs in error.
(6 jobs in error, out of 15)
See https://boinc.bakerlab.org/rosetta/results.php?hostid=373642

This is an example of a job in error:
See https://boinc.bakerlab.org/rosetta/result.php?resultid=51874373

Can someone explain how this error can happen?


Member of Dutch Power Cows
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 32586 - Posted: 13 Dec 2006, 14:11:05 UTC

The error code shown in your example is 0xC0000005, the Windows code for "Access Violation", which essentially means that the process tried to access memory that it wasn't supposed to access.
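Result pages and logs sometimes print the same exit code as a signed decimal number instead of hex. The following is a minimal sketch, in Python, of how to convert between the two forms. The function names and the one-entry lookup table are hypothetical and are not part of BOINC or Rosetta; only the mapping of 0xC0000005 to an access violation comes from the explanation above.

```python
# Windows status codes are unsigned 32-bit values, but an exit code may be
# logged as a signed 32-bit integer. Masking recovers the unsigned form.

# Hypothetical, deliberately tiny lookup table for illustration only.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_unsigned32(code: int) -> int:
    """Reinterpret a (possibly negative) exit code as unsigned 32-bit."""
    return code & 0xFFFFFFFF


def describe_exit_code(code: int) -> str:
    """Return the hex form of the code plus its name, if we know it."""
    unsigned = to_unsigned32(code)
    name = KNOWN_STATUS.get(unsigned, "unknown status")
    return f"0x{unsigned:08X} ({name})"


print(describe_exit_code(-1073741819))  # 0xC0000005 (STATUS_ACCESS_VIOLATION)
```

So a log line showing -1073741819 and one showing 0xC0000005 refer to the same failure.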

From the call-stack, it seems like you're in an NVidia graphics driver that has gone into some sort of recursive call - but that could just be me misunderstanding the crash-dump... Or that the crash dump isn't very clever with certain types of stack-patterns.

--
Mats
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 32590 - Posted: 13 Dec 2006, 15:39:31 UTC

The -107 error code, and the fact that you have lots of them sounds to me like the screensaver problems. I see another WU that was ended by the watchdog. This is another sign to me of screensaver problems.

The suggestion is to set Windows screensaver to none. There is a new version under test presently on Ralph which should be available here on Rosetta in just a few days, which seems to have resolved the screensaver problems. So, hang in there!
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Don

Joined: 28 Oct 06
Posts: 2
Credit: 294,270
RAC: 0
Message 33042 - Posted: 21 Dec 2006, 14:36:53 UTC

I run R@H on four machines. I get lots of errors on two of them that lock up the computer. Sometimes they are cleared by Ctrl-Alt-Del and halting the Rosetta task; other times it requires a reboot. Only the two machines with hyperthreading get errors; the others (an Athlon 64 and a PIII) have never had an error. All four machines also run other clients (SETI and Einstein) with no problems.

I have tried keeping the job in memory and will see if that works; next I will try disabling the screensaver. But neither of these explains why the other CPUs are not affected. Maybe it is the size of the job.
FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 33043 - Posted: 21 Dec 2006, 14:47:51 UTC - in response to Message 33042.  

It is believed to be a synchronisation error, and it seems to happen mostly on a computer running more than one BOINC project at the same time (as Hyper-Threading or multicore processors do).


Hence the PIII and the A64 generally would not see this.
Team mauisun.org
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 33044 - Posted: 21 Dec 2006, 14:54:32 UTC - in response to Message 33043.  

Synchronisation issues are far more likely to be a problem on systems that have multiple execution units running different threads at the same time (so SMP and HT/multicore systems). Those can get things into an "unsynched" state much more easily by accessing the data in parallel: one thread is half-way through some update while another one reads the "half-baked", inconsistent data.

It's of course possible to get this to happen on a single-processor system as well, but the likelihood of actually hitting the failure point is lower.
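To make the "half-baked" read concrete, here is a minimal sketch in Python. It is not Rosetta code: the `Pair` class and all function names are hypothetical. A generator stands in for the scheduler, so the unlucky interleaving is reproduced on every run instead of only now and then. The second half shows the usual fix, a lock held around both the update and the read.

```python
import threading


class Pair:
    """Two fields that are supposed to always hold the same value."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.lock = threading.Lock()


def unsafe_update_steps(pair, value):
    """Unsynchronised writer; the yield marks where it could be preempted."""
    pair.a = value
    yield  # another thread may run here, between the two writes
    pair.b = value


def unsafe_read(pair):
    return (pair.a, pair.b)


# Simulate the bad interleaving: the writer is half-way, then the reader runs.
pair = Pair()
writer = unsafe_update_steps(pair, 1)
next(writer)                     # writer has updated a, but not yet b
torn = unsafe_read(pair)         # reader sees the inconsistent state
next(writer, None)               # writer finishes its update
consistent = unsafe_read(pair)
print(torn, consistent)          # (1, 0) (1, 1)


def locked_update(pair, value):
    with pair.lock:              # both writes happen as one unit
        pair.a = value
        pair.b = value


def locked_read(pair):
    with pair.lock:              # never runs in the middle of an update
        return (pair.a, pair.b)


def count_inconsistent_reads(rounds=1000):
    """Run a real writer thread against a reader, both using the lock."""
    shared = Pair()
    bad = 0

    def writer_thread():
        for i in range(rounds):
            locked_update(shared, i)

    t = threading.Thread(target=writer_thread)
    t.start()
    for _ in range(rounds):
        a, b = locked_read(shared)
        if a != b:
            bad += 1
    t.join()
    return bad


print(count_inconsistent_reads())  # 0
```

On a single processor the window between the two writes is only hit if the writer happens to be preempted exactly there; with two execution units the reader can land in that window at any time, which is why the failure shows up more often on HT and multicore machines.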

--
Mats
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 33048 - Posted: 21 Dec 2006, 16:03:27 UTC

The problems are ALSO actually more likely on a computer that you are not actively using. If you were using it to do something, one processor thread would often be working on what YOU are doing rather than what Rosetta is doing, so the two Rosetta threads would be less likely to run at the same time or to be preempted at a key point in time.
