Message boards : Number crunching : Miscellaneous Work Unit Errors
Author | Message |
---|---|
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
|
Andrew Joined: 17 Feb 06 Posts: 3 Credit: 349,161 RAC: 0 |
Error running WU:
19/02/2006 6:17:09|rosetta@home|Unrecoverable error for result HBLR_1.0_1b72_314_924_0 ( - exit code -1073741819 (0xc0000005))
I checked the result ID for that WU and got the following data:
<core_client_version>5.2.13</core_client_version>
<message> - exit code -1073741819 (0xc0000005) </message>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1087963
***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C911E58 read attempt to address 0x3FF3718A
1: 02/19/06 18:17:08
1: SymGetLineFromAddr(): GetLastError = 126
</stderr_txt>
I think the error occurred when the WU was moved out of memory, when I activated the PC. I have since changed my preferences to leave WUs in memory when preempted. |
DoubleTop Joined: 20 Sep 05 Posts: 10 Credit: 1,120,456 RAC: 0 |
Error running BOINC on systems that have previously run with no problems. These are running LTSP (diskless Linux), and I hadn't had the problem for a while. I'm still testing whether this is due to having attached the BBC project, but the first log shows that this was happening before I attached the new project.
2006-02-19 19:10:05 [rosetta@home] Resuming computation for result NO_SIM_ANNEAL_1dcj_228_1611_2 using rosetta version 480
SIGSEGV: segmentation violation
Stack trace (6 frames):
./boinc[0x80845b2]
/lib/libpthread.so.0[0x40163a85]
/lib/libc.so.6[0x400428e8]
./boinc[0x805c9ef]
./boinc[0x80784d9]
[0x31313537]
Exiting...
I've now seen this on three machines, and not all were running the same simulation. I hope someone else can help further; as for me, I'll just report it and move on. DT. |
DoubleTop Joined: 20 Sep 05 Posts: 10 Credit: 1,120,456 RAC: 0 |
Please ignore the above post. I've isolated the problem to the BBC project: I attached a test node with just that project, and my diskless setup has some library problems that prevent it from running successfully. DT. |
genes Joined: 8 Oct 05 Posts: 60 Credit: 701,880 RAC: 968 |
I've had a 4.82 WU crash today:
This WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=11796212 Nothing unusual was going on; "Leave in Memory" is set to YES. (It wasn't being swapped anyway.) |
Robert Everly Joined: 8 Oct 05 Posts: 27 Credit: 665,094 RAC: 0 |
Not sure if this WU is cursed or not. Three errors. This is the first WU I've had die in a long time. The only recent change was to NOT have the WU remain in memory. I guess that's not quite fixed yet, so I'm setting "remain in memory" back to yes. If at all possible, I'd like to re-run this WU on the same machine to see whether it fails with the changed setting.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9512202
<core_client_version>5.2.12</core_client_version>
<message> - exit code -164 (0xffffff5c) </message>
<stderr_txt>
# random seed: 1086714
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0047E9E3 read attempt to address 0x1285D784
</stderr_txt> |
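The exit codes in these reports are printed twice, as a signed decimal and as hex in parentheses. Both are the same 32-bit value: the hex form is simply the unsigned (two's-complement) reading of the signed number. 0xc0000005 is the Windows access-violation status code, which matches the "Access Violation (0xc0000005)" line in the stderr output. A minimal Python sketch of the conversion follows; the function names are illustrative only and not part of BOINC or Rosetta.

```python
def exit_code_to_hex(code: int) -> str:
    """Show a signed 32-bit exit code as the unsigned hex value BOINC prints."""
    # Masking with 0xFFFFFFFF yields the two's-complement bit pattern.
    return f"0x{code & 0xFFFFFFFF:08x}"


def hex_to_exit_code(hex_str: str) -> int:
    """Convert a hex status like '0xc0000005' back to the signed exit code."""
    value = int(hex_str, 16)
    # If the top bit is set, the value is negative when read as signed 32-bit.
    return value - (1 << 32) if value & 0x80000000 else value


# The two codes reported in this thread.
for code in (-1073741819, -164):
    print(code, exit_code_to_hex(code))
```

Running it prints the two pairs seen above: -1073741819 with 0xc0000005, and -164 with 0xffffff5c.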
Fuzzy Hollynoodles Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
This WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=11719411 It ran the full time, with nothing special to see. I suspended it a few times to run some Pirates WUs, but otherwise nothing unusual happened or was seen. I even had the graphics open at 97.50% to watch it, and all looked normal. And I haven't had any Ralph WUs yet, in case they would interfere. "I'm trying to maintain a shred of dignity in this world." - Me |
genes Joined: 8 Oct 05 Posts: 60 Credit: 701,880 RAC: 968 |
Got another 4.82 crash. This one brought up a Microsoft dialog, "Please report this error..." It looks like a carbon copy of the previous one. Same machine, same settings. https://boinc.bakerlab.org/rosetta/result.php?resultid=11805479 Here's what was going on around the time of the error:
2/20/2006 10:02:25 AM|rosetta@home|Resuming result HBLR_1.0_2reb_314_890_1 using rosetta version 482 |
genes Joined: 8 Oct 05 Posts: 60 Credit: 701,880 RAC: 968 |
Yet another 4.82 crash. Same as the others. https://boinc.bakerlab.org/rosetta/result.php?resultid=11823719 I'm setting Rosetta to No New Work on that machine. It didn't have any problems with 4.81. Any tests I could do here? Seems 4.82 fails pretty reliably (100%) on this machine. Currently also running CPDN (Sulphur), Einstein, Seti, Seti Beta, and an occasional Pirates. |
truckpuller Joined: 5 Nov 05 Posts: 40 Credit: 229,134 RAC: 0 |
What about computation errors? Should we report them here too? If so, I've had my share, and now this one: HBLR_1.0_1r69_314_911_0 Visit us at Christianboards.org |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Quoting genes: "Yet another 4.82 crash. Same as the others." Can you attach this host to the Ralph project if you haven't already? |
XS_Vietnam_Soldiers Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0 |
Sirs: You've got a huge problem here. This 4.82 version is raising hell with my machines: three dual-Xeon setups and a Dothan on a P4 Asus MB. I've just watched 2 work units back to back on this Dothan, which has computational power equal to an AMD64, go almost 8 hours and then crap out. What I want is a simple answer: How do I go back to ver 4.81? I lost over 20 WUs across the 4 machines in the last 48 hours. This isn't a case where the WU runs 30-40 mins and errors out. That I can live with, but running full term and then getting nothing is not acceptable. Thank you for your time. I look forward to hearing from you. Movieman from XS ddhunt@adelphia.net |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
XS_Vietnam_Soldiers, you have so many computers that I can't seem to find the ones you are talking about. If they continue to give you problems, I would suspend them and attach them to the Ralph test project. If you change your target CPU time preference to 2 hours, you may not lose as much CPU time on those jobs that randomly fail now and then. The computers I had a chance to look at from your long list of hosts were completing results okay with the new app. Sorry for the troubles you are facing. |
XS_Vietnam_Soldiers Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0 |
Thank you for your time. That list also presents a huge problem when trying to find what has failed and why. From my perspective it is frustrating to have a machine spend 5-8 hours crunching a WU and then, because of an error, get absolutely no credit for that time. Personally, I feel that if a WU gets to 88% completion and then fails, the account should get 88% of the credit the WU would have earned had it completed. After all, it is generally (from what I've seen) not the fault of the PC when the WU fails; i.e., I've done my part but received no credit due to a failure in the WU's design. The other point is that you failed to answer my question: How do I go back to ver 4.81 and solve my immediate problem while you work out the bugs in ver 4.82? Thank you. EDIT: I just went back through the results on these computers, and there are more than 30 computational errors going back maybe 20 "pages". I don't have the time to add up all the PC time involved at the moment, but I will be glad to if it would help you understand my frustration over this matter. These are all high-end machines. The ones I personally own are dual Xeons on Supermicro MBs with top-quality RAM and high-end power supplies, all on large UPSs. All are on XP Pro SP1. My point is that they are stable, and I don't believe the issue is with my equipment. |
XS_Vietnam_Soldiers Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0 |
Addendum: I just took this from the log on my Dothan machine:
2/20/2006 8:32:47 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_14828_2 (Maximum CPU time exceeded)
2/20/2006 4:20:57 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1dcj_251_15090_2 (Maximum CPU time exceeded)
2/21/2006 12:09:07 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1di2_251_20632_1 (Maximum CPU time exceeded)
Since you may not be familiar with the Dothan CPU: this is the Intel Pentium M 770 (2130 MHz) laptop CPU, run with an Asus adapter on an Asus P4P800 SE MB. Fantastic computational power, and yet, running only one work unit at a time, it times out? Strange, wouldn't you agree? The net result of those 3 WUs timing out is that this machine received no credit for an entire 24 hours of work. THAT greatly upsets me! This machine was averaging 600-650 points a day with ver 4.81. The day it changed to ver 4.82, it received a grand total of zero! |
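The spacing of the three "Maximum CPU time exceeded" entries above can be checked directly from their timestamps. The short Python sketch below computes the wall-clock gaps between consecutive failures, assuming (as the post describes) that the WUs ran back to back on one CPU. Note that wall-clock gaps are not the same as CPU time, and the first WU's start time is not in the quoted log.

```python
from datetime import datetime

# Timestamps copied from the three log lines quoted above.
FMT = "%m/%d/%Y %I:%M:%S %p"
stamps = [
    "2/20/2006 8:32:47 AM",
    "2/20/2006 4:20:57 PM",
    "2/21/2006 12:09:07 AM",
]
times = [datetime.strptime(s, FMT) for s in stamps]

# Wall-clock time between consecutive failures (assumes back-to-back WUs).
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
for gap in gaps:
    print(gap)
print("span of the log excerpt:", times[-1] - times[0])
```

Both gaps come out to 7:48:10, consistent with the description of each WU running almost 8 hours before failing; three such runs add up to roughly the "entire 24 hours" mentioned.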
Nite Owl Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
Hi guys, I'm baaaaack... I've had 8 failures (not counting 6 CPU time-outs) in the past two days, most of them exception errors... All 14 errors were with version 4.82. If needed, I can post the errors. Thanks, Owlie Join the Teddies@WCG |
genes Joined: 8 Oct 05 Posts: 60 Credit: 701,880 RAC: 968 |
Will do. Edit: OK, it's this one: http://ralph.bakerlab.org/show_host_detail.php?hostid=953 |
KSMarksPsych Joined: 15 Oct 05 Posts: 199 Credit: 22,337 RAC: 0 |
I'm pretty sure it isn't possible to go back to a previous app version. If I recall correctly, there are some changes to the science app as well as to the function that lets users specify run times. Kathryn :o) The BOINC FAQ Service The Unofficial BOINC Wiki The Trac System More BOINC information than you can shake a stick of RAM at. |
XS_Vietnam_Soldiers Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0 |
Thank you for your reply, but since I'm still running ver 4.81 on one machine at this moment, you may be wrong. I'm waiting to hear from the admins on this so I can decide which way to proceed. |
XS_Vietnam_Soldiers Joined: 11 Jan 06 Posts: 240 Credit: 2,880,653 RAC: 0 |
This is the machine ID involved: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=160238 |
©2024 University of Washington
https://www.bakerlab.org