Miscellaneous Work Unit Errors

Author	Message
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 10953 - Posted: 19 Feb 2006, 18:42:01 UTC Last modified: 19 Feb 2006, 18:44:42 UTC Report all Work Unit errors on this thread that are NOT "1% Hang", "Max Time Exceeded", or other "stuck" or "hung" work units. Moderator9 ROSETTA@home FAQ Moderator Contact ID: 10953 · Rating: 0 · rate: / Reply Quote

Andrew Send message Joined: 17 Feb 06 Posts: 3 Credit: 349,161 RAC: 0	Message 10958 - Posted: 19 Feb 2006, 19:02:11 UTC Error running WU:
19/02/2006 6:17:09|rosetta@home|Unrecoverable error for result HBLR_1.0_1b72_314_924_0 ( - exit code -1073741819 (0xc0000005))
I checked the result ID for that WU and got the following data:
<core_client_version>5.2.13</core_client_version>
<message>
 - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1087963
*UNHANDLED EXCEPTION**
Reason: Access Violation (0xc0000005) at address 0x7C911E58 read attempt to address 0x3FF3718A
1: 02/19/06 18:17:08
1: SymGetLineFromAddr(): GetLastError = 126
</stderr_txt>
I think the error occurred when the WU was moved out of memory, when I activated the PC. I have since changed my preferences to leave the WU in memory when preempted. ID: 10958 · Rating: 0 · rate: / Reply Quote

DoubleTop Send message Joined: 20 Sep 05 Posts: 10 Credit: 1,120,456 RAC: 0	Message 10959 - Posted: 19 Feb 2006, 19:18:11 UTC Error running BOINC on systems that have previously run with no problems. These are running LTSP (diskless Linux) and I've not had the problem for a while. I'm still testing whether this is due to having attached the BBC project, but the first log shows that this was happening before I attached the new project.
2006-02-19 19:10:05 [rosetta@home] Resuming computation for result NO_SIM_ANNEAL_1dcj_228_1611_2 using rosetta version 480
SIGSEGV: segmentation violation
Stack trace (6 frames):
./boinc[0x80845b2]
/lib/libpthread.so.0[0x40163a85]
/lib/libc.so.6[0x400428e8]
./boinc[0x805c9ef]
./boinc[0x80784d9]
[0x31313537]
Exiting...
I've now seen this on three machines, and not all using the same simulation. I hope someone else can help further; me, I'll just report it and move on. DT. ID: 10959 · Rating: 0 · rate: / Reply Quote

DoubleTop Send message Joined: 20 Sep 05 Posts: 10 Credit: 1,120,456 RAC: 0	Message 10962 - Posted: 19 Feb 2006, 20:21:02 UTC Please ignore the above post - I've isolated the problem to the BBC project. Attached a test node with just that project and there are some library problems on my diskless setup to run that successfully. DT. ID: 10962 · Rating: 0 · rate: / Reply Quote

genes Send message Joined: 8 Oct 05 Posts: 60 Credit: 701,880 RAC: 968	Message 10984 - Posted: 20 Feb 2006, 3:36:56 UTC I've had a 4.82 WU crash today:
2/19/2006 7:36:41 PM|rosetta@home|Resuming result HBLR_1.0_1di2_314_135_1 using rosetta version 482
2/19/2006 8:01:05 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_314_135_1 ( - exit code -1073741811 (0xc000000d))
2/19/2006 8:01:07 PM||request_reschedule_cpus: process exited
2/19/2006 8:01:07 PM|rosetta@home|Computation for result HBLR_1.0_1di2_314_135_1 finished
This WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=11796212
Nothing unusual was going on, "Leave in Memory" is set to YES. (It wasn't being swapped anyway.) ID: 10984 · Rating: 0 · rate: / Reply Quote

Robert Everly Send message Joined: 8 Oct 05 Posts: 27 Credit: 665,094 RAC: 0	Message 10985 - Posted: 20 Feb 2006, 4:11:38 UTC Last modified: 20 Feb 2006, 4:13:09 UTC Not sure if this WU is cursed or not. Three errors. This is the first WU that I've had die in a long time. The only recent change was to NOT have the WU remain in memory. Guess that's not quite fixed yet. Putting the "remain in memory" setting back to yes. If at all possible, I'd like to re-run this WU on the same machine to see if it happens with the changed setting. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9512202
<core_client_version>5.2.12</core_client_version>
<message>
 - exit code -164 (0xffffff5c)
</message>
<stderr_txt>
# random seed: 1086714
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
*UNHANDLED EXCEPTION**
Reason: Access Violation (0xc0000005) at address 0x0047E9E3 read attempt to address 0x1285D784
</stderr_txt>
ID: 10985 · Rating: 0 · rate: / Reply Quote
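
A note for anyone comparing the exit codes quoted in the posts so far: the large negative numbers and the hex values in parentheses are the same 32-bit value, printed once as a signed integer and once as unsigned hex. The sketch below shows that conversion in Python. The two names in the lookup table are the standard Windows NTSTATUS names for those values; the helper names `to_unsigned32` and `describe_exit_code` are made up for this illustration, and the table is deliberately incomplete (nothing here looks up what -164 means).

```python
# Standard Windows NTSTATUS names for two of the codes quoted in this thread.
# This table is illustrative and deliberately incomplete.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code: int) -> str:
    """Format an exit code as signed decimal, hex, and a name if known."""
    value = to_unsigned32(exit_code)
    name = NTSTATUS_NAMES.get(value, "unknown")
    return f"{exit_code} = 0x{value:08x} ({name})"


if __name__ == "__main__":
    # The three exit codes that appear in the posts above.
    for code in (-1073741819, -1073741811, -164):
        print(describe_exit_code(code))
```

Run as a script, this prints one line per code, pairing each decimal value with the hex form shown in the logs.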

Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0	Message 10999 - Posted: 20 Feb 2006, 14:52:38 UTC This WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=11719411 It ran full time, nothing special to see. I had it suspended a few times to run some Pirates WUs, but otherwise nothing unusual happened or was seen. I even had the graphic open at 97.50% to see it, and all looked normal. And I haven't had any Ralph WUs yet, in case they would interfere. "I'm trying to maintain a shred of dignity in this world." - Me ID: 10999 · Rating: 0 · rate: / Reply Quote

genes Send message Joined: 8 Oct 05 Posts: 60 Credit: 701,880 RAC: 968	Message 11006 - Posted: 20 Feb 2006, 15:58:57 UTC Got another 4.82 crash. This one brought up a Microsoft dialog, "Please report this error..." Looks like a carbon copy of the previous one. Same machine. Same settings. https://boinc.bakerlab.org/rosetta/result.php?resultid=11805479
Here's the goings-on around the time of the error:
2/20/2006 10:02:25 AM|rosetta@home|Resuming result HBLR_1.0_2reb_314_890_1 using rosetta version 482
2/20/2006 10:02:25 AM|SETI@home|Pausing result 05ap00aa.5327.11904.572166.1.187_1 (left in memory)
2/20/2006 10:08:18 AM|Pirates@Home|Sending scheduler request to http://pirates.spy-hill.net/cgi-bin/scheduler
2/20/2006 10:08:18 AM|Pirates@Home|Reason: To fetch work
2/20/2006 10:08:18 AM|Pirates@Home|Requesting 17280 seconds of new work
2/20/2006 10:08:23 AM|Pirates@Home|Scheduler request to http://pirates.spy-hill.net/cgi-bin/scheduler succeeded
2/20/2006 10:08:23 AM|Pirates@Home|Message from server: No work sent
2/20/2006 10:08:23 AM|Pirates@Home|Message from server: (there was work for other platforms)
2/20/2006 10:08:23 AM|Pirates@Home|No work from project
2/20/2006 10:33:57 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_314_890_1 ( - exit code -1073741811 (0xc000000d))
2/20/2006 10:34:00 AM||request_reschedule_cpus: process exited
2/20/2006 10:34:00 AM|rosetta@home|Computation for result HBLR_1.0_2reb_314_890_1 finished
2/20/2006 10:34:00 AM|Einstein@Home|Resuming result r1_0992.0__526_S4R2a_2 using albert version 437
ID: 11006 · Rating: 0 · rate: / Reply Quote
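
When a log excerpt like the one above gets long, it can help to pull out just the failures. The following is a rough sketch, assuming the message-log layout seen in this thread: a timestamp, a project name, and the message text, separated by pipe characters (some copies escape the separator with a backslash, so the sketch accepts both forms). The regular expressions and the function names `parse_line` and `unrecoverable_errors` are assumptions made for this example, not part of BOINC.

```python
import re

# Timestamp, project, and message text separated by "|", as in the
# excerpts posted in this thread. The AM/PM suffix is optional.
LINE_RE = re.compile(
    r"(?P<stamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2}(?: [AP]M)?)"
    r"\|(?P<project>[^|]*)\|(?P<text>.*)"
)

# "Unrecoverable error for result <name> (<reason>)"
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) \((?P<reason>.*)\)\s*$"
)


def parse_line(line):
    """Split one log line into (timestamp, project, text), or return None."""
    normalized = line.strip().replace("\\|", "|")  # accept escaped separators
    match = LINE_RE.match(normalized)
    if match is None:
        return None
    return match.group("stamp"), match.group("project"), match.group("text")


def unrecoverable_errors(lines):
    """Return (timestamp, project, result name, reason) for each failure line."""
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, project, text = parsed
        match = ERROR_RE.search(text)
        if match:
            found.append(
                (stamp, project, match.group("result"), match.group("reason").strip())
            )
    return found
```

Feeding it the lines posted above would return a single entry for HBLR_1.0_2reb_314_890_1, with the exit-code text as the reason; the scheduler and pause/resume lines are ignored.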

genes Send message Joined: 8 Oct 05 Posts: 60 Credit: 701,880 RAC: 968	Message 11059 - Posted: 21 Feb 2006, 1:43:44 UTC Last modified: 21 Feb 2006, 1:49:22 UTC Yet another 4.82 crash. Same as the others. https://boinc.bakerlab.org/rosetta/result.php?resultid=11823719 I'm setting Rosetta to No New Work on that machine. It didn't have any problems with 4.81. Any tests I could do here? Seems 4.82 fails pretty reliably (100%) on this machine. Currently also running CPDN (Sulphur), Einstein, Seti, Seti Beta, and an occasional Pirates. ID: 11059 · Rating: 0 · rate: / Reply Quote

truckpuller Send message Joined: 5 Nov 05 Posts: 40 Credit: 229,134 RAC: 0	Message 11069 - Posted: 21 Feb 2006, 5:17:16 UTC How about computation errors: do we report them here also? If so, I have had my share, and now this one: HBLR_1.0_1r69_314_911_0 Visit us at Christianboards.org ID: 11069 · Rating: 0 · rate: / Reply Quote

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 11071 - Posted: 21 Feb 2006, 5:26:46 UTC - in response to Message 11059. Can you attach this host to the Ralph project if you haven't already? ID: 11071 · Rating: 0 · rate: / Reply Quote

XS_Vietnam_Soldiers Send message Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0	Message 11081 - Posted: 21 Feb 2006, 7:30:47 UTC - in response to Message 11071. Last modified: 21 Feb 2006, 7:43:20 UTC Sirs: You've got a huge problem here. This 4.82 version is raising hell with my machines: three dual-Xeon setups and a Dothan on a P4 Asus MB. I've just watched 2 work units back to back on this Dothan, which has computational power equal to an AMD 64, go almost 8 hours and then crap out. What I want is a simple answer: how do I go back to ver 4.81? I lost over 20 WUs between the 4 machines in the last 48 hours. This isn't a case where the WU runs 30-40 mins and errors out. That I can live with, but this running full term and then nothing is not acceptable. Thank you for your time. I look forward to hearing from you. Movieman from XS ddhunt@adelphia.net ID: 11081 · Rating: 0 · rate: / Reply Quote

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 11087 - Posted: 21 Feb 2006, 8:25:47 UTC XS_Vietnam_Soldiers, you have so many computers, I can't seem to find the ones you are talking about. If they continue to give you problems, I would suspend them and attach them to the Ralph test project. If you change your target cpu time preference to 2 hours, you may not lose as much cpu time for those jobs that randomly fail now and then. The computers I had a chance to look at from your long list of hosts were completing results okay with the new app. Sorry for the troubles you are facing. ID: 11087 · Rating: 0 · rate: / Reply Quote

XS_Vietnam_Soldiers Send message Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0	Message 11090 - Posted: 21 Feb 2006, 9:13:04 UTC - in response to Message 11087. Last modified: 21 Feb 2006, 9:59:45 UTC Thank you for your time. That list also presents a huge problem when trying to find what has failed and why. It is frustrating from my perspective to have a machine spend 5-8 hours crunching a WU and then, because of an error, get absolutely no credit for that time spent. Personally I feel that if a WU gets to 88% completion and it fails, the account should get 88% of the credit that the WU would get were it to complete. After all, it is generally (from what I've seen) not the fault of the PC when the WU fails, i.e. I've done my part but received no credit due to a failure in the WU's design. The other point is you failed to answer my question: how do I go back to ver 4.81 and solve my immediate problem while you work out the bugs in ver 4.82? Thank you.
EDIT: I just went back through the results on these computers and there are more than 30 (computational error) going back maybe 20 "pages". I don't have the time to add up all the PC time involved at the moment but will be glad to if it would help you understand my frustration over this matter. These are all high-end machines. The ones I personally own are dual Xeons on Supermicro MBs with top-quality RAM, high-end PS, and all on large UPSes. All are on XP Pro SP1. My point is that they are stable and I don't believe the issue is with my equipment. ID: 11090 · Rating: 1 · rate: / Reply Quote

XS_Vietnam_Soldiers Send message Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0	Message 11094 - Posted: 21 Feb 2006, 10:33:45 UTC - in response to Message 11090. Last modified: 21 Feb 2006, 10:41:20 UTC Addendum: I just took this from the log on my Dothan machine:
2/20/2006 8:32:47 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_14828_2 (Maximum CPU time exceeded)
2/20/2006 4:20:57 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1dcj_251_15090_2 (Maximum CPU time exceeded)
2/21/2006 12:09:07 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1di2_251_20632_1 (Maximum CPU time exceeded)
Since you may not be familiar with the Dothan CPU: this is the Intel Pentium M 770 (2130 MHz) laptop CPU run with an Asus adapter on an Asus P4P800 SE MB. Fantastic computational power, and yet running only one work unit at a time it times out? Strange, wouldn't you agree? The net result of those 3 WUs timing out is that this machine received no credit for an entire 24 hours' work. THAT greatly upsets me! This machine was averaging 600-650 points a day with ver 4.81. The day it changed to ver 4.82 it received a grand total of zero! ID: 11094 · Rating: 0 · rate: / Reply Quote
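
One thing that stands out in the three "Maximum CPU time exceeded" lines above is how evenly spaced they are. The sketch below just does the arithmetic on the posted timestamps; it is wall-clock time between log entries, not measured CPU time, and it does not by itself say why the limit was hit. The format string assumes the US-style 12-hour timestamps shown in this post and an English locale for AM/PM; the function names are made up for the example.

```python
from datetime import datetime

# US-style 12-hour timestamps, as in the log lines quoted above.
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def gaps_in_seconds(stamps):
    """Return the wall-clock gap in seconds between consecutive timestamps."""
    times = [datetime.strptime(stamp, STAMP_FORMAT) for stamp in stamps]
    return [
        int((later - earlier).total_seconds())
        for earlier, later in zip(times, times[1:])
    ]


def as_hms(seconds):
    """Format a number of seconds as H:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    stamps = [
        "2/20/2006 8:32:47 AM",
        "2/20/2006 4:20:57 PM",
        "2/21/2006 12:09:07 AM",
    ]
    for gap in gaps_in_seconds(stamps):
        print(gap, as_hms(gap))
```

Both gaps come out to 28090 seconds (7:48:10), so each of the three work units appears to have run for about the same length of time before being stopped.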

Nite Owl Send message Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0	Message 11095 - Posted: 21 Feb 2006, 10:41:28 UTC Hi guys, I'm baaaaack... I've had 8 failures (not counting 6 CPU time outs) in the past two days with most of them being exception errors... all 14 errors were with version 4.82. If needed I can post the errors. Thanks, Owlie Join the Teddies@WCG ID: 11095 · Rating: 0 · rate: / Reply Quote

genes Send message Joined: 8 Oct 05 Posts: 60 Credit: 701,880 RAC: 968	Message 11101 - Posted: 21 Feb 2006, 12:15:26 UTC - in response to Message 11071. Last modified: 21 Feb 2006, 12:18:30 UTC Will do. [edit] OK, it's this one: http://ralph.bakerlab.org/show_host_detail.php?hostid=953 [/edit] ID: 11101 · Rating: 0 · rate: / Reply Quote

KSMarksPsych Send message Joined: 15 Oct 05 Posts: 199 Credit: 22,337 RAC: 0	Message 11111 - Posted: 21 Feb 2006, 13:33:39 UTC - in response to Message 11090. I'm pretty sure it isn't possible to go back to a previous app version. There (if I recall correctly) are some changes to the science app as well as the function for the user to specify run times. Kathryn Kathryn :o) The BOINC FAQ Service The Unofficial BOINC Wiki The Trac System More BOINC information than you can shake a stick of RAM at. ID: 11111 · Rating: 0 · rate: / Reply Quote

XS_Vietnam_Soldiers Send message Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0	Message 11120 - Posted: 21 Feb 2006, 16:08:07 UTC - in response to Message 11111. Thank you for your reply, but since I'm still running ver 4.81 on one machine at this moment, you may be wrong. I'm waiting to hear from the admins on this so I can decide which way to proceed. ID: 11120 · Rating: 0 · rate: / Reply Quote

XS_Vietnam_Soldiers Send message Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0	Message 11129 - Posted: 21 Feb 2006, 17:43:31 UTC This is the machine ID involved: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=160238 ID: 11129 · Rating: 0 · rate: / Reply Quote