Miscellaneous Work Unit Errors

Message boards : Number crunching : Miscellaneous Work Unit Errors

Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 10953 - Posted: 19 Feb 2006, 18:42:01 UTC
Last modified: 19 Feb 2006, 18:44:42 UTC

Report all Work Unit errors on this thread that are NOT:

    "1% Hang"
    "Max Time Exceeded"
    or other "stuck" or "hung" work units


Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 10953 · Rating: 0
Andrew

Joined: 17 Feb 06
Posts: 3
Credit: 349,161
RAC: 0
Message 10958 - Posted: 19 Feb 2006, 19:02:11 UTC

Error running WU

19/02/2006 6:17:09|rosetta@home|Unrecoverable error for result HBLR_1.0_1b72_314_924_0 ( - exit code -1073741819 (0xc0000005))

Checked the Results ID for that WU and got the following data;

<core_client_version>5.2.13</core_client_version>
<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1087963

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C911E58 read attempt to address 0x3FF3718A

1: 02/19/06 18:17:08
1: SymGetLineFromAddr(): GetLastError = 126



</stderr_txt>


I think the error occurred when the WU was moved out of memory, when I activated the PC. I have since changed my preferences to leave the WU in memory when preempted.
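
For readers decoding these numbers: the negative exit code and the hex value in parentheses are the same 32-bit value, printed once as a signed integer and once as unsigned hex. A minimal Python sketch of that conversion follows. The function names are made up for illustration, and the lookup table is an assumption limited to two well-known Windows status codes.

```python
# Two well-known Windows NTSTATUS values; this table is illustrative, not exhaustive.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def describe_exit_code(code: int) -> str:
    """Return the signed code, its hex form, and a name if the table knows one."""
    name = NTSTATUS_NAMES.get(code & 0xFFFFFFFF, "unknown")
    return f"{code} ({exit_code_to_hex(code)}): {name}"


if __name__ == "__main__":
    for code in (-1073741819, -1073741811, -164):
        print(describe_exit_code(code))
```

Codes that are not in the table, such as -164, are reported as "unknown" rather than guessed at.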


ID: 10958 · Rating: 0
DoubleTop

Joined: 20 Sep 05
Posts: 10
Credit: 1,120,456
RAC: 0
Message 10959 - Posted: 19 Feb 2006, 19:18:11 UTC

Error running BOINC on systems that have previously run with no problems. These are running LTSP (diskless Linux), and I've not had the problem for a while. I'm still testing whether this is due to having attached the BBC project, but the first log shows that this was happening before I attached the new project.

2006-02-19 19:10:05 [rosetta@home] Resuming computation for result NO_SIM_ANNEAL_1dcj_228_1611_2 using rosetta version 480
SIGSEGV: segmentation violation
Stack trace (6 frames):
./boinc[0x80845b2]
/lib/libpthread.so.0[0x40163a85]
/lib/libc.so.6[0x400428e8]
./boinc[0x805c9ef]
./boinc[0x80784d9]
[0x31313537]

Exiting...

I've now seen this on three machines, and not all were running the same simulation. I hope someone else can help further; I'll just report it and move on.

DT.
ID: 10959 · Rating: 0
DoubleTop

Joined: 20 Sep 05
Posts: 10
Credit: 1,120,456
RAC: 0
Message 10962 - Posted: 19 Feb 2006, 20:21:02 UTC

Please ignore the above post. I've isolated the problem to the BBC project: I attached a test node with just that project, and some library problems on my diskless setup prevent it from running successfully.

DT.
ID: 10962 · Rating: 0
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 777
Message 10984 - Posted: 20 Feb 2006, 3:36:56 UTC

I've had a 4.82 WU crash today:

2/19/2006 7:36:41 PM|rosetta@home|Resuming result HBLR_1.0_1di2_314_135_1 using rosetta version 482
2/19/2006 8:01:05 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_314_135_1 ( - exit code -1073741811 (0xc000000d))
2/19/2006 8:01:07 PM||request_reschedule_cpus: process exited
2/19/2006 8:01:07 PM|rosetta@home|Computation for result HBLR_1.0_1di2_314_135_1 finished


This WU:

https://boinc.bakerlab.org/rosetta/result.php?resultid=11796212

Nothing unusual was going on, "Leave in Memory" is set to YES. (It wasn't being swapped anyway.)

ID: 10984 · Rating: 0
Robert Everly

Joined: 8 Oct 05
Posts: 27
Credit: 665,094
RAC: 0
Message 10985 - Posted: 20 Feb 2006, 4:11:38 UTC
Last modified: 20 Feb 2006, 4:13:09 UTC

Not sure if this WU is cursed or not. Three errors. This is the first WU that I've had die in a long time. The only recent change was to NOT have the WU remain in memory. Guess that's not quite fixed yet. I'm putting the "remain in memory" setting back to yes.

If at all possible, I'd like to re-run this WU on the same machine to see if it happens with the changed setting.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9512202

<core_client_version>5.2.12</core_client_version>
<message> - exit code -164 (0xffffff5c)
</message>
<stderr_txt>
# random seed: 1086714
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0047E9E3 read attempt to address 0x1285D784


</stderr_txt>


ID: 10985 · Rating: 0
Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 10999 - Posted: 20 Feb 2006, 14:52:38 UTC

This WU:

https://boinc.bakerlab.org/rosetta/result.php?resultid=11719411

It ran full time, with nothing special to see. I had it suspended a few times to run some Pirates WUs, but otherwise nothing unusual happened or was seen. I even had the graphics open at 97.50% to watch it, and all looked normal.

And I haven't had any Ralph WUs yet, in case they would interfere.


[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]

ID: 10999 · Rating: 0
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 777
Message 11006 - Posted: 20 Feb 2006, 15:58:57 UTC

Got another 4.82 crash. This one brought up a Microsoft Dialog "Please report this error..."

Looks like a carbon copy of the previous one. Same machine. Same settings.

https://boinc.bakerlab.org/rosetta/result.php?resultid=11805479

Here's the goings-on around the time of the error:
2/20/2006 10:02:25 AM|rosetta@home|Resuming result HBLR_1.0_2reb_314_890_1 using rosetta version 482
2/20/2006 10:02:25 AM|SETI@home|Pausing result 05ap00aa.5327.11904.572166.1.187_1 (left in memory)
2/20/2006 10:08:18 AM|Pirates@Home|Sending scheduler request to http://pirates.spy-hill.net/cgi-bin/scheduler
2/20/2006 10:08:18 AM|Pirates@Home|Reason: To fetch work
2/20/2006 10:08:18 AM|Pirates@Home|Requesting 17280 seconds of new work
2/20/2006 10:08:23 AM|Pirates@Home|Scheduler request to http://pirates.spy-hill.net/cgi-bin/scheduler succeeded
2/20/2006 10:08:23 AM|Pirates@Home|Message from server: No work sent
2/20/2006 10:08:23 AM|Pirates@Home|Message from server: (there was work for other platforms)
2/20/2006 10:08:23 AM|Pirates@Home|No work from project
2/20/2006 10:33:57 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_314_890_1 ( - exit code -1073741811 (0xc000000d))
2/20/2006 10:34:00 AM||request_reschedule_cpus: process exited
2/20/2006 10:34:00 AM|rosetta@home|Computation for result HBLR_1.0_2reb_314_890_1 finished
2/20/2006 10:34:00 AM|Einstein@Home|Resuming result r1_0992.0__526_S4R2a_2 using albert version 437
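
The client messages above follow a `timestamp|project|message` layout, so pulling out the failed results can be scripted. A rough Python sketch, assuming that layout holds for every line; the function names and the regular expression are illustrative and not part of BOINC.

```python
import re

# Matches the "Unrecoverable error" message text seen in the log above.
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+) \((.*)\)\s*$")


def parse_line(line):
    """Split one log line into (timestamp, project, message), or None if malformed."""
    parts = line.split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = parts
    return timestamp.strip(), project.strip(), message.strip()


def unrecoverable_errors(lines):
    """Return (timestamp, project, result_name, reason) for each failed result."""
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, project, message = parsed
        match = ERROR_RE.search(message)
        if match:
            found.append((timestamp, project, match.group(1), match.group(2).strip()))
    return found


if __name__ == "__main__":
    sample = [
        "2/20/2006 10:02:25 AM|rosetta@home|Resuming result HBLR_1.0_2reb_314_890_1 using rosetta version 482",
        "2/20/2006 10:33:57 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_314_890_1 ( - exit code -1073741811 (0xc000000d))",
        "2/20/2006 10:34:00 AM||request_reschedule_cpus: process exited",
    ]
    for row in unrecoverable_errors(sample):
        print(row)
```

Lines with an empty project field, such as the `request_reschedule_cpus` message, still parse; they simply do not match the error pattern.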




ID: 11006 · Rating: 0
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 777
Message 11059 - Posted: 21 Feb 2006, 1:43:44 UTC
Last modified: 21 Feb 2006, 1:49:22 UTC

Yet another 4.82 crash. Same as the others.

https://boinc.bakerlab.org/rosetta/result.php?resultid=11823719

I'm setting Rosetta to No New Work on that machine. It didn't have any problems with 4.81.

Any tests I could do here? Seems 4.82 fails pretty reliably (100%) on this machine.

Currently also running CPDN (Sulphur), Einstein, Seti, Seti Beta, and an occasional Pirates.
ID: 11059 · Rating: 0
truckpuller

Joined: 5 Nov 05
Posts: 40
Credit: 229,134
RAC: 0
Message 11069 - Posted: 21 Feb 2006, 5:17:16 UTC

How about computation errors: do we report them here as well? If so, I have had my share, and now this one: HBLR_1.0_1r69_314_911_0
Visit us at Christianboards.org
ID: 11069 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 11071 - Posted: 21 Feb 2006, 5:26:46 UTC - in response to Message 11059.  




Can you attach this host to the Ralph project if you haven't already?
ID: 11071 · Rating: 0
XS_Vietnam_Soldiers

Joined: 11 Jan 06
Posts: 240
Credit: 2,880,653
RAC: 0
Message 11081 - Posted: 21 Feb 2006, 7:30:47 UTC - in response to Message 11071.  
Last modified: 21 Feb 2006, 7:43:20 UTC

Sirs: You've got a huge problem here. This 4.82 version is raising hell with my machines: three dual-Xeon setups and a Dothan on a P4 Asus MB. I've just watched two work units back to back on this Dothan, which has computational power equal to an AMD 64, run almost 8 hours and then crap out.
What I want is a simple answer: How do I go back to ver 4.81?
I lost over 20 WUs between the 4 machines in the last 48 hours. This isn't a case where the WU runs 30-40 minutes and errors out. That I can live with, but running full term and then getting nothing is not acceptable.
Thank you for your time. I look forward to hearing from you.
Movieman from XS
ddhunt@adelphia.net
ID: 11081 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 11087 - Posted: 21 Feb 2006, 8:25:47 UTC

XS_Vietnam_Soldiers, you have so many computers, I can't seem to find the ones you are talking about. If they continue to give you problems, I would suspend them and attach them to the Ralph test project. If you change your target cpu time preference to 2 hours, you may not lose as much cpu time for those jobs that randomly fail now and then. The computers I had a chance to look at from your long list of hosts were completing results okay with the new app. Sorry for the troubles you are facing.
ID: 11087 · Rating: 0
XS_Vietnam_Soldiers

Joined: 11 Jan 06
Posts: 240
Credit: 2,880,653
RAC: 0
Message 11090 - Posted: 21 Feb 2006, 9:13:04 UTC - in response to Message 11087.  
Last modified: 21 Feb 2006, 9:59:45 UTC



Thank you for your time. That list also presents a huge problem when trying to find what has failed and why. It is frustrating from my perspective to have a machine spend 5-8 hours crunching a WU and then, because of an error, get absolutely no credit for that time spent. Personally I feel that if a WU gets to 88% completion and it fails, the account should get 88% of the credit that the WU would get were it to complete. After all, it is generally (from what I've seen) not the fault of the PC when the WU fails. I.e., I've done my part but received no credit due to a failure in the WU's design.
The other point is you failed to answer my question: How do I go back to ver 4.81 and solve my immediate problem while you work out the bugs in ver 4.82?
Thank you.
EDIT: I just went back through the results on these computers and there are more than 30 (computational error) going back maybe 20 "pages". I don't have the time to add up all the PC time involved at the moment but will be glad to if it would help you understand my frustration over this matter.
These are all high-end machines. The ones I personally own are dual Xeons on Supermicro MBs with top-quality RAM and high-end PSUs, all on large UPSs. All are on XP Pro SP1. My point is that they are stable and I don't believe the issue is with my equipment.
ID: 11090 · Rating: 1
XS_Vietnam_Soldiers

Joined: 11 Jan 06
Posts: 240
Credit: 2,880,653
RAC: 0
Message 11094 - Posted: 21 Feb 2006, 10:33:45 UTC - in response to Message 11090.  
Last modified: 21 Feb 2006, 10:41:20 UTC


Addendum:
I just took this from the log on my dothan machine:
2/20/2006 8:32:47 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_14828_2 (Maximum CPU time exceeded)
2/20/2006 4:20:57 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1dcj_251_15090_2 (Maximum CPU time exceeded)
2/21/2006 12:09:07 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1di2_251_20632_1 (Maximum CPU time exceeded)
Since you may not be familiar with the Dothan CPU: this is the Intel Pentium M 770 (2130 MHz) laptop CPU run with an Asus adapter on an Asus P4P800 SE MB.
Fantastic computational power, and yet running only one work unit at a time it times out? Strange, wouldn't you agree?
The net result of those 3 WUs timing out is that this machine received no credit for an entire 24 hours of work. THAT greatly upsets me! This machine was averaging 600-650 points a day with ver 4.81. The day it changed to ver 4.82 it received a grand total of zero!
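
As a quick check on the three log lines above, the gaps between consecutive errors can be computed directly. A small Python sketch, assuming the timestamps are local time in month/day/year 12-hour format as shown; the helper name is made up for illustration. These are wall-clock gaps between error messages, not measured CPU time.

```python
from datetime import datetime

# Timestamp layout assumed from the log lines: month/day/year, 12-hour clock.
LOG_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def gaps_in_seconds(stamps):
    """Return the number of seconds between each consecutive pair of timestamps."""
    times = [datetime.strptime(s, LOG_TIME_FORMAT) for s in stamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    error_times = [
        "2/20/2006 8:32:47 AM",
        "2/20/2006 4:20:57 PM",
        "2/21/2006 12:09:07 AM",
    ]
    for seconds in gaps_in_seconds(error_times):
        print(seconds, "seconds =", round(seconds / 3600, 2), "hours")
```

Both gaps come out to 28090 seconds, about 7 hours 48 minutes between one timeout and the next.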

ID: 11094 · Rating: 0
Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 11095 - Posted: 21 Feb 2006, 10:41:28 UTC

Hi guys, I'm baaaaack... I've had 8 failures (not counting 6 CPU time outs) in the past two days with most of them being exception errors... all 14 errors were with version 4.82. If needed I can post the errors.

Thanks, Owlie
Join the Teddies@WCG
ID: 11095 · Rating: 0
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 777
Message 11101 - Posted: 21 Feb 2006, 12:15:26 UTC - in response to Message 11071.  
Last modified: 21 Feb 2006, 12:18:30 UTC



Will do.
[edit]
OK, it's this one:
http://ralph.bakerlab.org/show_host_detail.php?hostid=953
[/edit]

ID: 11101 · Rating: 0
KSMarksPsych
Joined: 15 Oct 05
Posts: 199
Credit: 22,337
RAC: 0
Message 11111 - Posted: 21 Feb 2006, 13:33:39 UTC - in response to Message 11090.  



I'm pretty sure it isn't possible to go back to a previous app version. There (if I recall correctly) are some changes to the science app as well as the function for the user to specify run times.

Kathryn

Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.
ID: 11111 · Rating: 0
XS_Vietnam_Soldiers

Joined: 11 Jan 06
Posts: 240
Credit: 2,880,653
RAC: 0
Message 11120 - Posted: 21 Feb 2006, 16:08:07 UTC - in response to Message 11111.  


Thank you for your reply, but since I'm still running ver 4.81 on one machine at this moment, you may be wrong. I'm waiting to hear from the admins on this so I can decide which way to proceed.


ID: 11120 · Rating: 0
XS_Vietnam_Soldiers

Joined: 11 Jan 06
Posts: 240
Credit: 2,880,653
RAC: 0
Message 11129 - Posted: 21 Feb 2006, 17:43:31 UTC

This is the machine ID involved:
https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=160238
ID: 11129 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org