Lots of workunit failures...

Message boards : Number crunching : Lots of workunit failures...

Profile Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 53127 - Posted: 18 May 2008, 13:00:46 UTC
Last modified: 18 May 2008, 13:08:37 UTC

Been attached for all of about 24 hours and already 3 failed workunits. Frustrating.

https://boinc.bakerlab.org/rosetta/result.php?resultid=164381455

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400

</stderr_txt>
]]>
Profile Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 53128 - Posted: 18 May 2008, 13:03:22 UTC - in response to Message 53127.  
Last modified: 18 May 2008, 13:08:08 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=163455869

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3030 write attempt to address 0x00000004

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.3.0


Dump Timestamp : 05/16/08 15:01:41
LoadLibraryA( dbghelp.dll ): GetLastError = 8
*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 28146, Write: 0, Other 6765

- I/O Transfers Counters -
Read: 0, Write: 38294, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 44156, QuotaPeakPagedPoolUsage: 44156
QuotaNonPagedPoolUsage: 5688, QuotaPeakNonPagedPoolUsage: 5688

- Virtual Memory Usage -
VirtualSize: 2144301056, PeakVirtualSize: 2144301056

- Pagefile Usage -
PagefileUsage: 1021972480, PeakPagefileUsage: 1029021696

- Working Set Size -
WorkingSetSize: 1023954944, PeakWorkingSetSize: 1031032832, PageFaultCount: 23105986

*** Dump of thread ID 1876 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 1477656192.000000, User Time: 294476873728.000000, Wait Time: 16713152.000000

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3030 write attempt to address 0x00000004


*** Dump of thread ID 3036 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 16713148.000000


*** Dump of thread ID 900 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 16713100.000000



*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0

Exiting...

</stderr_txt>
]]>
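
A side note on the exit code in the log above: the negative decimal value and the hex value are the same 32-bit number, read once as signed and once as unsigned. A minimal Python sketch of that conversion follows; the helper names `to_unsigned32` and `to_signed32` are made up for illustration and are not part of BOINC or Rosetta.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    # Values with the top bit set are negative in two's complement.
    return value - 0x100000000 if value >= 0x80000000 else value


# The exit code reported in the <message> block above.
print(hex(to_unsigned32(-1073741819)))
print(to_signed32(0xC0000005))
```

Both directions round-trip, so either form of the code can be used when searching for the same failure.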
Profile Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 53129 - Posted: 18 May 2008, 13:04:29 UTC - in response to Message 53128.  
Last modified: 18 May 2008, 13:10:07 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=163455882

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400

</stderr_txt>
]]>

The frustrating thing about these path errors is that they open a C++ error window, and the workunit keeps using CPU until you hit OK. One of them failed at about 4 am and didn't stop until I checked the machine 10 minutes ago. I hit OK, the workunit failed, and then another one started. Grrr... If this continues, I'm going to have to go back to my other projects.
Profile Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 53167 - Posted: 19 May 2008, 12:39:53 UTC - in response to Message 53129.  

I think I found a solution. I dropped my runtime from 24 hours down to 4 hours and didn't get a single failure last night. If things remain stable after a few days, I will increase it to 6 hours.
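
For reference, the `cpu_run_time_pref: 86400` line in the stderr output earlier in the thread appears to be the target runtime in seconds, since 86400 seconds is exactly the 24-hour setting. That reading is an assumption based on the numbers lining up. A small Python sketch of the arithmetic, with `hours_to_seconds` as a hypothetical helper name:

```python
SECONDS_PER_HOUR = 60 * 60


def hours_to_seconds(hours: int) -> int:
    """Convert a runtime preference in hours to seconds."""
    return hours * SECONDS_PER_HOUR


# The three runtime settings mentioned in this thread.
for hours in (24, 4, 6):
    print(f"{hours} h = {hours_to_seconds(hours)} s")
```

Under that assumption, the 4-hour setting would show up in the log as 14400 and the 6-hour setting as 21600.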
Profile David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 53169 - Posted: 19 May 2008, 14:13:33 UTC
Last modified: 19 May 2008, 14:13:53 UTC

I have also discovered the workaround of decreasing the runtime.
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
