Large amount of failed WUs.

Message boards : Number crunching : Large amount of failed WUs.

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 67092 - Posted: 4 Aug 2010, 5:45:16 UTC
Last modified: 4 Aug 2010, 5:52:30 UTC

https://boinc.bakerlab.org/rosetta/results.php?hostid=1188956

As you can see, my PC gave many, many errors on plenty of WUs, then suddenly stopped and went back to working properly. I updated its graphics card drivers and BOINC version to "fix" the error. (There were graphics errors, black boxes in the titles, etc., which gave me a clue as to what the problem could be.)

The weird part is that while it gave Rosetta errors, it only gave 1 or 2 Collatz Conjecture errors (GPU only)... EVEN though the problem was fixed AFTER a GRAPHICS driver update and BOINC update...

Can anyone understand the info that came back with the WUs and pinpoint the problem? All failed WUs failed right at the start, suggesting a software rather than a hardware problem (heat, etc.).

Thanks.
Hammeh
Joined: 11 Nov 08
Posts: 63
Credit: 211,283
RAC: 0
Message 67093 - Posted: 4 Aug 2010, 7:25:27 UTC

Setting up folding (abrelax) ...


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0094700B write attempt to address 0xFFFFFF7A

Engaging BOINC Windows Runtime Debugger...



This is the information from the task page. I do not know what caused this error, but it seems like Rosetta can't access/write the files it needs.
Have you tried resetting the project?

PS. It looks like WUs are still failing on that machine; the last computation error was reported today.
Jochen
Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 67094 - Posted: 4 Aug 2010, 8:20:58 UTC - in response to Message 67093.  

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0094700B write attempt to address 0xFFFFFF7A


This is actually an access violation: the application tried to access memory outside the range it owns.
This could be a driver issue, but usually it's a faulty pointer in a process. This could be faulty WUs (which I actually doubt, since I have had only three compute errors in the last 400 WUs). It could just as well be a hardware problem (CPU or memory failing).
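
As a rough illustration of why an address like this tends to suggest a bad pointer: read as a signed 32-bit value, the write target 0xFFFFFF7A is a small negative number, which is the kind of address you get when code applies a negative offset to a null (zero) pointer. The sketch below only does that arithmetic. It assumes a 32-bit process (the 8-digit addresses in the exception record suggest this, but it is an assumption), and the helper name `as_signed_32` is made up for this example. It does not tell you whether the bad pointer came from a software bug or from flaky hardware.

```python
def as_signed_32(address: int) -> int:
    """Reinterpret an unsigned 32-bit address as a signed 32-bit integer."""
    address &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the top bit is set, the value is negative in two's complement.
    return address - (1 << 32) if address & 0x80000000 else address


# Write target reported in the exception record quoted above.
fault_address = 0xFFFFFF7A
offset = as_signed_32(fault_address)
print(f"{fault_address:#010x} read as signed 32-bit: {offset}")
```

For the address in the exception record this gives -134, i.e. 134 bytes below address zero, which no normal user-mode process is allowed to write to.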

Hard to tell, even harder to give any advice. If it were my computer, I would run some stress tests (Prime95 for CPU and memory, FurMark for the graphics card). Maybe this gives a hint...

Good luck!

Joe
Speedy
Joined: 25 Sep 05
Posts: 163
Credit: 800,690
RAC: 20
Message 67117 - Posted: 6 Aug 2010, 8:04:42 UTC

Tasks 357381146, 357381134 & 357381125 all start with lrm_jorj_combined_tlrm_jorj_combined_torsion, and all end with a compute error. I'm thinking lrm_jorj_combined_tlrm_jorj_combined_torsion is a bad, bad batch of tasks.
Have a crunching good day!!
Jochen
Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 67118 - Posted: 6 Aug 2010, 8:37:41 UTC - in response to Message 67117.  

Tasks 357381146, 357381134 & 357381125 all start with lrm_jorj_combined_tlrm_jorj_combined_torsion, and all end with a compute error. I'm thinking lrm_jorj_combined_tlrm_jorj_combined_torsion is a bad, bad batch of tasks.


Yes I had a couple of those as well last night.
Error:
ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

I also had this task with a compute error:
cs-only-2-sen15_8-6_20161_233_1
Error:
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

But Chilean got access violations, which I would rather consider hardware-related.
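
To make that distinction easier to see across a long results list, the error text from each task could be sorted into rough buckets. The snippet below is only a sketch: the patterns are copied from the error messages quoted in this thread, while the bucket names and the `classify` helper are invented for the example and are not anything BOINC or Rosetta provides.

```python
import re
from collections import Counter

# Patterns come from the error text quoted in this thread; the labels on the
# right are informal names chosen for this sketch.
PATTERNS = [
    (re.compile(r"Access Violation \(0xc0000005\)", re.IGNORECASE), "access violation"),
    (re.compile(r"ERROR: Unable to open weights"), "missing input file"),
    (re.compile(r"ERROR: rsd_type_list\.size\(\)"), "application check failed"),
]


def classify(stderr_text: str) -> str:
    """Return the first matching bucket for a task's error output."""
    for pattern, label in PATTERNS:
        if pattern.search(stderr_text):
            return label
    return "unknown"


# Sample snippets taken from the posts above.
samples = [
    "Reason: Access Violation (0xc0000005) at address 0x0094700B "
    "write attempt to address 0xFFFFFF7A",
    "ERROR: Unable to open weights.",
    "ERROR: rsd_type_list.size()",
]

counts = Counter(classify(text) for text in samples)
for label, count in counts.items():
    print(f"{label}: {count}")
```

If most failures on one host land in the access violation bucket while other hosts only see the missing-file kind, that would fit the idea that the batch problem and the host problem are two separate things.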

cu Joe
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 67121 - Posted: 6 Aug 2010, 18:05:45 UTC

Seems to be running fine overall, except for the 2 errors I've gotten so far since the big incident. The PC is slightly overclocked (3.2 -> 3.3). I'll put it back to default.
Jochen
Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 67124 - Posted: 6 Aug 2010, 19:23:12 UTC - in response to Message 67121.  

Seems to be running fine overall

Did you do any stress testing with Prime95? I use it for testing the system stability of my OCed computer.

But if the errors go away with standard clocks, you'll know as well. ;)

cu Joe






©2024 University of Washington
https://www.bakerlab.org