Message boards : Number crunching : Miscellaneous Work Unit Errors
Author | Message |
---|---|
tng* Joined: 28 Oct 05 Posts: 14 Credit: 5,389,798 RAC: 0 |
For people having many work Unit Errors!! Having received half a dozen errors on 4.82 on two machines that I don't believe have had that many errors in three months, I did this. Within hours, another error: 12092296 (edited to show the correct result -- oops) Not the same as the earlier ones, but there still seem to be problems with a two-hour setting. The machines having problems are Dell Dimension 9100s, Pentium D 820, 1 gig, XP SP2 with all critical updates, BOINC 5.2.13 (updated from 5.2.2, but that didn't seem to fix anything). My Athlon 4000+ laptop has had no problems -- maybe something with multi-CPU systems. |
Osku87 Joined: 1 Nov 05 Posts: 17 Credit: 280,268 RAC: 0 |
"My athlon 4000+ laptop has had no problems -- maybe something with multi-CPU systems." It's not something with multi-CPU systems. I have a Sempron 2400+ and have the same problems. Reducing the time setting to 2 hours didn't solve the problem. For example: 9642560 |
doc :) Joined: 4 Oct 05 Posts: 47 Credit: 1,106,102 RAC: 0 |
27/02/2006 20:09:06|rosetta@home|Unrecoverable error for result ABINITe6_hom011_1e6iA_320_4_0 ( - exit code -1073741811 (0xc000000d)) i get that error with that WU type (ABINIT**_hom***...) and 4.82 when i have the graphics open. it crashes for each WU i tried that with, at about the time the first model should be finished (i am not opening the graphics in a WU that is past its first checkpoint right now, to avoid wasting more cycles than necessary). if i leave it running by itself without graphics in the background it finishes without problems. some examples: result, result, result this WU from the HBLR type of WUs failed with the same error while i had graphics open at a later point. |
Osku87 Joined: 1 Nov 05 Posts: 17 Credit: 280,268 RAC: 0 |
Reducing the time setting did really solve the problem. Not with the first unit but the second and the following. Now it works really fine. |
Marie Lucie Joined: 9 Dec 05 Posts: 5 Credit: 40,616 RAC: 0 |
Hello, For me the problems continue ... 27/02/2006 21:08:33|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_2acy__311_501_1 ( - exit code -164 (0xffffff5c)) 27/02/2006 21:08:33||request_reschedule_cpus: process exited 27/02/2006 21:08:33|rosetta@home|Computation for result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_2acy__311_501_1 finished 27/02/2006 23:43:29|rosetta@home|Unrecoverable error for result ABINITai_hom022_1aiu__320_64_1 ( - exit code -164 (0xffffff5c)) 27/02/2006 23:43:29||request_reschedule_cpus: process exited 27/02/2006 23:43:29|rosetta@home|Computation for result ABINITai_hom022_1aiu__320_64_1 finished 28/02/2006 07:02:13|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_323_143_0 ( - exit code -1073741819 (0xc0000005)) 28/02/2006 07:02:13||request_reschedule_cpus: process exited 28/02/2006 07:02:13|rosetta@home|Computation for result HBLR_1.0_1di2_323_143_0 finished |
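The exit codes in these logs are printed twice: once as a signed decimal and once as hex. A minimal sketch of how the two forms relate, assuming the signed value is simply the 32-bit two's-complement reading of the hex value. The two `0xc000...` values are standard Windows NTSTATUS codes; the lookup table and the function names are illustrative only, and the meaning of -164 is not given in the thread, so it is left as unknown here.

```python
# Standard Windows NTSTATUS values that appear in this thread.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as its unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    """Format an exit code the way the BOINC log shows it, plus a name if known."""
    value = to_unsigned32(exit_code)
    name = KNOWN_NTSTATUS.get(value, "unknown, possibly application-defined")
    return f"{exit_code} = 0x{value:08x} ({name})"


if __name__ == "__main__":
    # The three distinct codes quoted in the posts above.
    for code in (-1073741819, -1073741811, -164):
        print(describe(code))
```

Read this way, the HBLR failure above (0xc0000005) is an access violation, while the ABINIT failures reported with graphics open (0xc000000d) are an invalid-parameter status.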
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Reducing the time setting did really solve the problem. Not with the first unit but the second and the following. Now it works really fine. great. we hope to locate the sources of the errors this week. in the meantime, you can control the fraction of wu that have errors since the probability of error seems constant over the duration of the run. so roughly speaking, if 50% of your wu fail with the 8 hour run time, only 12.5% should fail with a 2 hour run time, and an even smaller fraction with a 1 hour run time. of course, this is only a very temporary fix since there are many reasons why longer run times are preferable. |
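The estimate above can be worked through numerically. This is a sketch under one assumption taken from the post, that the probability of error per unit of run time is constant; the function name and the exponential-survival form are not from the thread. Under that assumption a 50% failure rate at 8 hours gives roughly 16% at 2 hours and roughly 8% at 1 hour, so the 12.5% figure (simple linear scaling) is a reasonable rough approximation.

```python
def failure_fraction(p_ref: float, t_ref_hours: float, t_hours: float) -> float:
    """Fraction of work units expected to fail within t_hours.

    Assumes a constant error rate over the run, so the chance of surviving
    t hours is the chance of surviving t_ref hours raised to t / t_ref.
    """
    survival_ref = 1.0 - p_ref
    return 1.0 - survival_ref ** (t_hours / t_ref_hours)


if __name__ == "__main__":
    # 50% failures at the default 8 hour run time, as in the example above.
    for hours in (8, 2, 1):
        print(f"{hours} h run time: {failure_fraction(0.5, 8, hours):.1%} expected failures")
```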
Koen Joined: 29 Sep 05 Posts: 8 Credit: 8,542,574 RAC: 0 |
Don't know if it has anything to do with it, but I noticed that the RAM footprint of the Rosetta app sometimes rises to as high as 130 MB when processing HBLR_1.0 workunits. Do I remember correctly that this also caused problems a couple of months ago? Looking at the errors my fellow crunchers experienced, I noticed that a lot of those errors occur on the above-mentioned HBLR_1.0 workunits. So I thought this was worth mentioning. If not, please accept my apologies for wasting your time. K. |
mags Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
Reducing the time does not fix it for me. The whole boinc/rosetta freezes and it does not restart until I manually fix the problem. Now if I didn't have a life/family/job this just might be okay.................. And don't even ask if I have "leave in memory" ticked; this does not solve every boinc/rosetta problem. I'm really peeved at the lack of upfront info about what appears to be a major problem for so many. As I've said in the past, talk to us, and in English, not 'techno gibberish'. PS: I have 1 gig of RAM. join Fadbeens |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Reducing the time does not fix it for me, the whole boinc/rosetta freezes and it does not restart until I manually fix the problem, now if I didn't have a life/family/job this just might be okay.................. I'm sorry about these frustrating problems! David Kim had a great idea for solving almost all of these problems that he is testing on RALPH; if it works you will soon see it here at rosetta@home. if a rosetta or boinc error occurs, rather than killing the whole process, a special termination routine will be called which will send back to us all structures computed to that point. this is good for us, and for you since credit will be awarded for the process up to this point. so instead of seeing errors, you will see an occasional work unit with a shorter run time. |
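The idea described above, returning whatever has been computed when an error occurs instead of losing the whole work unit, can be illustrated with a small sketch. Everything here is hypothetical: the function names, the result dictionary, and the use of a Python exception are stand-ins for illustration only. The actual Rosetta application and the routine being tested on RALPH are not shown in this thread, and a native crash such as an access violation would need a platform-level handler rather than a `try`/`except`.

```python
def run_work_unit(compute_model, n_models):
    """Compute up to n_models structures, keeping partial results on error.

    compute_model is a hypothetical callable that builds one structure.
    If it raises, the structures finished so far are returned instead of
    being discarded, so the run looks like a shorter work unit.
    """
    structures = []
    try:
        for index in range(n_models):
            structures.append(compute_model(index))
    except Exception as err:  # stand-in for the special termination routine
        return {"structures": structures, "completed": False, "error": repr(err)}
    return {"structures": structures, "completed": True, "error": None}


def flaky_model(index):
    """Toy model that fails on its fourth call (index 3)."""
    if index == 3:
        raise RuntimeError("simulated failure")
    return f"structure_{index}"


if __name__ == "__main__":
    result = run_work_unit(flaky_model, 10)
    print(len(result["structures"]), "structures returned;", result["error"])
```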
bruce boytler Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
I inadvertently left my graphics running and it managed to crash the workunit with a computational error. I figured it was just coincidence and thought nothing of it, until I accidentally did it again a day or two later and the same thing happened. I am now afraid to turn on graphics at all. These WUs crashed with 6 to 7 hours of CPU time clocked on them. BTW, out of the last 77 results using the default 8 hours I have experienced 5 crashes on an AMD64 X2 3800 with a gig of RAM. Cheers.......I like the science FAQs |
SallyH Joined: 4 Nov 05 Posts: 6 Credit: 4,799,395 RAC: 0 |
I turned off the BOINC screensaver and have not had an error since.....had a few hung errors but not the other errors..... |
mags Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
Thanks guys, I had all but given up. I have turned off the boinc screensaver, hopefully that will do the trick. join Fadbeens |
Nite Owl Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I don't think turning off the Screensaver will help these problems. I've always had it turned off and had 5 failures today, out of 25 returned results... The thing that will help is what David Baker stated below... Join the Teddies@WCG |
ecafkid Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0 |
These are 2 errors for today so far 3/1/2006 4:23:18 AM|rosetta@home|Unrecoverable error for result ABINITac_hom020_1acf__320_49_1 ( - exit code -1073741811 (0xc000000d)) 3/1/2006 3:36:37 PM|rosetta@home|Unrecoverable error for result ABINITen_hom023_1enh__322_48_0 ( - exit code -1073741811 (0xc000000d)) |
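Log lines like these can be tallied to see which work unit types and exit codes dominate. A minimal sketch, assuming the message format is exactly as quoted in this thread; the regular expression, the function names, and the choice to group by the text before the first underscore are illustrative assumptions, not part of BOINC.

```python
import re
from collections import Counter

# Matches the "Unrecoverable error" lines as they appear in the posts here,
# with or without a text description before "- exit code".
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\(.*?- exit code (?P<code>-?\d+) \(0x[0-9a-fA-F]+\)\)"
)


def parse_errors(log_text):
    """Return (result_name, exit_code) pairs found in BOINC message text."""
    return [(m.group("result"), int(m.group("code")))
            for m in ERROR_RE.finditer(log_text)]


def tally(errors):
    """Count errors by work unit prefix (text before the first '_') and by exit code."""
    by_type = Counter(name.split("_")[0] for name, _ in errors)
    by_code = Counter(code for _, code in errors)
    return by_type, by_code


if __name__ == "__main__":
    sample = (
        "3/1/2006 4:23:18 AM|rosetta@home|Unrecoverable error for result "
        "ABINITac_hom020_1acf__320_49_1 ( - exit code -1073741811 (0xc000000d))\n"
        "3/1/2006 3:36:37 PM|rosetta@home|Unrecoverable error for result "
        "ABINITen_hom023_1enh__322_48_0 ( - exit code -1073741811 (0xc000000d))\n"
    )
    by_type, by_code = tally(parse_errors(sample))
    print(by_type)
    print(by_code)
```

Grouping by a longer prefix, or by the whole name minus the trailing counters, may be more useful depending on how the work unit names are structured.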
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1)) appeared.. so it's obvious my system isn't immune from the UEs.. (running 24 hours each just makes it take longer to find them..) |
Nite Owl Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I have a few: Computer ID 142540 HOMSdt_homDB009_1dtj__340_142_1 <core_client_version>5.2.13</core_client_version> <message>Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> Computer ID 142263 HOMSdt_homDB027_1dtj__340_175_0 <core_client_version>5.2.13</core_client_version> <message>Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> Computer ID 56911 HOMSdt_homDB009_1dtj__340_12_0 <core_client_version>5.2.13</core_client_version> <message>Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> Computer ID 57040 HBLR_1.0_1r69_323_710_2 <core_client_version>5.2.13</core_client_version> <message>Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # random seed: 3816991 # cpu_run_time_pref: 86400 </stderr_txt> I'm sure there's more but it will take awhile to check the rest of my puters... Join the Teddies@WCG |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
If this is helpful.. -------- stderr.txt # random seed: 3468381 # cpu_run_time_pref: 86400 --- 2/19/2006 6:24:18 PM||Starting BOINC client version 5.2.13 for windows_intelx86 2/19/2006 6:24:18 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3 2/19/2006 6:24:18 PM||Executing as a daemon 2/19/2006 6:24:18 PM||Data directory: C:\Program Files\BOINC 2/19/2006 6:24:18 PM||BOINC is running as a service and as a non-system user. 2/19/2006 6:24:18 PM||No application graphics will be available. 2/19/2006 6:24:18 PM||Processor: 1 AuthenticAMD AMD Athlon(tm) 64 Processor 3000+ 2/19/2006 6:24:18 PM||Memory: 1023.48 MB physical, 1.65 GB virtual 2/19/2006 6:24:18 PM||Disk: 29.29 GB total, 3.84 GB free 2/19/2006 6:24:18 PM|rosetta@home|Computer ID: 121218; location: home; project prefs: default 2/19/2006 6:24:18 PM||General prefs: from rosetta@home (last modified 2005-12-29 13:52:58) 2/19/2006 6:24:18 PM||General prefs: no separate prefs for home; using your defaults 2/19/2006 6:24:19 PM||Remote control not allowed; using loopback address ------ 3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1)) 3/4/2006 6:26:59 PM||request_reschedule_cpus: process exited 3/4/2006 6:26:59 PM|rosetta@home|Computation for result HOMSdt_homDB004_1dtj__340_50_0 finished ------ 25.98 seconds.. it sure failed quickly. Mine is a 754 pin Athlon 64; running WinXP Pro SP2. (supposedly, fully updated.. minus the microsoft anti spyware package.) Panda Titanium antivirus. |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I am also getting "Computation errors" resulting in exit code 1073741811(0xc000000d). The last unit failed 1 hour 13 minutes and some odd seconds in, so I do not believe that the 2 hour setting would have helped this unit. Is this a similar problem (bug) to what Einstein@home recently suffered from? They had a graphics bug that had to do with certain G L cards. This is running on a Gateway 840, running at 3 GHz, with 1 gig of RAM, running Windows XP Professional. I would like to get an answer to this before too many units fail. Please contact me at "Dbidanset at aol dot com" with some info on this. {EDIT NOTE: I changed the email address to protect you from "sniffers". Original message deleted on request of user for the same purpose and corrected a spelling error} I will send your message to the project team at RALPH. You may want to attach your machine to the RALPH project to help trouble shoot this problem. If you do not hear from the project team in a day or so let me know, Moderator9 ROSETTA@home FAQ Moderator Contact |
Hoelder1in Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
The following three WUs: HOMSdt_homDB002_1dtj__340_124_0 HOMSdt_homDB002_1dtj__352_271_0 HOMSdt_homDB004_1dtj__352_669_0 exited with error status 1 after about 30 seconds of run time on my Linux computer as well as on several other computers. Since three out of three units of this particular type have failed on this computer which usually has almost no errors I believe this is a WU specific error which may need investigating. |
Carlos_Pfitzner Joined: 22 Dec 05 Posts: 71 Credit: 138,867 RAC: 0 |
Exit status 1 (0x1) https://boinc.bakerlab.org/rosetta/result.php?resultid=12859932 Rosetta_4.82, Windows 2000 Server SP4, 512 MB RAM. *This error occurred while I was asleep ... rosetta was the only running program |
©2024 University of Washington
https://www.bakerlab.org