Message boards : Number crunching : Miscellaneous Work Unit Errors
Author | Message |
---|---|
Osku87 Joined: 1 Nov 05 Posts: 17 Credit: 280,268 RAC: 0 |
Reducing the time setting really did solve the problem. Not with the first unit, but with the second and the following ones. Now it works really well. |
Marie Lucie Joined: 9 Dec 05 Posts: 5 Credit: 40,616 RAC: 0 |
Hello, for me the problems continue ...
27/02/2006 21:08:33|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_2acy__311_501_1 ( - exit code -164 (0xffffff5c))
27/02/2006 21:08:33||request_reschedule_cpus: process exited
27/02/2006 21:08:33|rosetta@home|Computation for result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_2acy__311_501_1 finished
27/02/2006 23:43:29|rosetta@home|Unrecoverable error for result ABINITai_hom022_1aiu__320_64_1 ( - exit code -164 (0xffffff5c))
27/02/2006 23:43:29||request_reschedule_cpus: process exited
27/02/2006 23:43:29|rosetta@home|Computation for result ABINITai_hom022_1aiu__320_64_1 finished
28/02/2006 07:02:13|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_323_143_0 ( - exit code -1073741819 (0xc0000005))
28/02/2006 07:02:13||request_reschedule_cpus: process exited
28/02/2006 07:02:13|rosetta@home|Computation for result HBLR_1.0_1di2_323_143_0 finished |
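For anyone collecting these for the project team, here is a minimal Python sketch that pulls the result name and exit code out of "Unrecoverable error" lines like the ones above. The `timestamp|project|message` layout is inferred only from the log lines quoted in this thread, and `parse_error` is a made-up helper name, not part of BOINC. The hex value is simply the signed 32-bit exit code reinterpreted as unsigned, which is why -1073741819 shows up as 0xc0000005 (on Windows that value is the access violation status code).

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official BOINC log specification).
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\(.*?exit code (?P<code>-?\d+) \(0x(?P<hex>[0-9a-fA-F]+)\)\)"
)


def parse_error(line):
    """Return (result_name, signed_code, unsigned_hex), or None if no match."""
    parts = line.split("|", 2)  # timestamp | project | message
    if len(parts) != 3:
        return None
    match = ERROR_RE.search(parts[2])
    if match is None:
        return None
    code = int(match.group("code"))
    # Reinterpret the signed 32-bit exit code as unsigned to get the hex form.
    return match.group("result"), code, format(code & 0xFFFFFFFF, "#010x")


if __name__ == "__main__":
    sample = (
        "28/02/2006 07:02:13|rosetta@home|Unrecoverable error for result "
        "HBLR_1.0_1di2_323_143_0 ( - exit code -1073741819 (0xc0000005))"
    )
    print(parse_error(sample))
```

Lines that are not error reports (for example the `request_reschedule_cpus` lines) return `None`, so the helper can be run over a whole message log.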
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
"Reducing the time setting really did solve the problem. Not with the first unit, but with the second and the following ones. Now it works really well."
Great. We hope to locate the sources of the errors this week. In the meantime, you can control the fraction of WUs that have errors, since the probability of error seems constant over the duration of the run. Roughly speaking, if 50% of your WUs fail with the 8-hour run time, only 12.5% should fail with a 2-hour run time, and an even smaller fraction with a 1-hour run time. Of course, this is only a very temporary fix, since there are many reasons why longer run times are preferable. |
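The 12.5% figure above is the "roughly speaking" linear scaling (50% × 2/8). As a small sketch, assuming the error rate per hour really is constant and independent of the work unit, the fraction failing within t hours would be 1 − exp(−rate × t), which gives slightly higher numbers for short runs. Both function names below are made up for illustration.

```python
import math


def failure_fraction(known_fraction, known_hours, new_hours):
    """Fraction failing within new_hours, assuming a constant per-hour error rate."""
    # Constant hazard model: P(fail by t) = 1 - exp(-rate * t)
    rate = -math.log(1.0 - known_fraction) / known_hours
    return 1.0 - math.exp(-rate * new_hours)


def linear_estimate(known_fraction, known_hours, new_hours):
    """The rough proportional estimate used in the post above."""
    return known_fraction * new_hours / known_hours


if __name__ == "__main__":
    for hours in (8, 2, 1):
        exact = failure_fraction(0.5, 8, hours)
        rough = linear_estimate(0.5, 8, hours)
        print(f"{hours} h: constant-rate model {exact:.1%}, linear estimate {rough:.1%}")
```

Under that assumption a 50% failure rate at 8 hours corresponds to about 16% at 2 hours and about 8% at 1 hour, so the conclusion is the same: shorter run times lose fewer work units to errors.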
Koen Joined: 29 Sep 05 Posts: 8 Credit: 8,542,574 RAC: 0 |
I don't know if it has anything to do with it, but I noticed that the RAM footprint of the Rosetta app sometimes rises as high as 130 MB when processing HBLR_1.0 workunits. Do I remember correctly that this also caused problems a couple of months ago? Looking at the errors my fellow crunchers experienced, I noticed that a lot of them occur on the above-mentioned HBLR_1.0 workunits, so I thought this was worth mentioning. If not, please accept my apologies for wasting your time. K. |
Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
Reducing the time does not fix it for me. The whole of BOINC/Rosetta freezes and does not restart until I manually fix the problem. Now, if I didn't have a life/family/job, this just might be okay... And don't even ask if I have "leave in memory" ticked; this does not solve every BOINC/Rosetta problem. I'm really peeved at the lack of upfront info about what appears to be a major problem for so many. As I've said in the past, talk to us, and in English, not 'techno gibberish'. P.S. I have 1 GB of RAM. join Fadbeens |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
"Reducing the time does not fix it for me. The whole of BOINC/Rosetta freezes and does not restart until I manually fix the problem. Now, if I didn't have a life/family/job, this just might be okay..."
I'm sorry about these frustrating problems! David Kim had a great idea for solving almost all of these problems, which he is testing on RALPH; if it works, you will soon see it here at rosetta@home. If a Rosetta or BOINC error occurs, rather than killing the whole process, a special termination routine will be called that sends back to us all structures computed up to that point. This is good for us, and for you, since credit will be awarded for the work done up to that point. So instead of seeing errors, you will see an occasional work unit with a shorter run time. |
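To illustrate the general idea only (this is not the actual Rosetta or BOINC code, which is not shown in this thread), here is a toy Python sketch of "keep what was finished when an error occurs". `run_work_unit` and `compute_structure` are hypothetical names.

```python
def run_work_unit(compute_structure, n_structures):
    """Hypothetical sketch: return the structures finished before any error.

    compute_structure is a stand-in callable taking an index; it is not a
    real Rosetta function.
    """
    completed = []
    error = None
    for index in range(n_structures):
        try:
            completed.append(compute_structure(index))
        except Exception as exc:  # stand-in for a Rosetta or BOINC error
            error = exc
            break  # stop early, but keep everything computed so far
    return completed, error


if __name__ == "__main__":
    def flaky(index):
        # Toy workload that fails on the fourth structure.
        if index == 3:
            raise RuntimeError("simulated crash")
        return index * index

    structures, err = run_work_unit(flaky, 10)
    print(len(structures), "structures kept; error:", err)
```

In this toy version the work unit ends early with three structures instead of being lost entirely, which matches the "occasional work unit with a shorter run time" described above.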
Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
I inadvertently left my graphics running and it managed to crash the workunit with a computational error. I figured it was just coincidence and thought nothing of it, until I accidentally did it again a day or two later and the same thing happened. I am now afraid to turn on graphics at all. These WUs crashed with 6 to 7 hours of CPU time clocked on them. BTW, out of the last 77 results using the default 8 hours, I have experienced 5 crashes on an AMD64X2 3800 with a gig of RAM. Cheers... I like the science FAQs. |
Joined: 4 Nov 05 Posts: 6 Credit: 4,799,395 RAC: 0 |
I turned off the BOINC screensaver and have not had an error since.....had a few hung errors but not the other errors..... |
Joined: 22 Nov 05 Posts: 33 Credit: 108,630 RAC: 0 |
Thanks guys, I had all but given up. I have turned off the BOINC screensaver; hopefully that will do the trick. join Fadbeens |
Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I don't think turning off the screensaver will help with these problems. I've always had it turned off and had 5 failures today, out of 25 returned results... The thing that will help is what David Baker stated below... Join the Teddies@WCG |
Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0 |
These are the 2 errors for today so far:
3/1/2006 4:23:18 AM|rosetta@home|Unrecoverable error for result ABINITac_hom020_1acf__320_49_1 ( - exit code -1073741811 (0xc000000d))
3/1/2006 3:36:37 PM|rosetta@home|Unrecoverable error for result ABINITen_hom023_1enh__322_48_0 ( - exit code -1073741811 (0xc000000d)) |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1))
appeared... so it's obvious my system isn't immune to the UEs (running 24 hours each just makes it take longer to find them). |
Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I have a few:
Computer ID 142540
HOMSdt_homDB009_1dtj__340_142_1
<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1) </message>
<stderr_txt>
Computer ID 142263
HOMSdt_homDB027_1dtj__340_175_0
<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1) </message>
<stderr_txt>
Computer ID 56911
HOMSdt_homDB009_1dtj__340_12_0
<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1) </message>
<stderr_txt>
Computer ID 57040
HBLR_1.0_1r69_323_710_2
<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1) </message>
<stderr_txt>
# random seed: 3816991
# cpu_run_time_pref: 86400
</stderr_txt>
I'm sure there are more, but it will take a while to check the rest of my puters... Join the Teddies@WCG |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
If this is helpful...
-------- stderr.txt
# random seed: 3468381
# cpu_run_time_pref: 86400
---
2/19/2006 6:24:18 PM||Starting BOINC client version 5.2.13 for windows_intelx86
2/19/2006 6:24:18 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
2/19/2006 6:24:18 PM||Executing as a daemon
2/19/2006 6:24:18 PM||Data directory: C:\Program Files\BOINC
2/19/2006 6:24:18 PM||BOINC is running as a service and as a non-system user.
2/19/2006 6:24:18 PM||No application graphics will be available.
2/19/2006 6:24:18 PM||Processor: 1 AuthenticAMD AMD Athlon(tm) 64 Processor 3000+
2/19/2006 6:24:18 PM||Memory: 1023.48 MB physical, 1.65 GB virtual
2/19/2006 6:24:18 PM||Disk: 29.29 GB total, 3.84 GB free
2/19/2006 6:24:18 PM|rosetta@home|Computer ID: 121218; location: home; project prefs: default
2/19/2006 6:24:18 PM||General prefs: from rosetta@home (last modified 2005-12-29 13:52:58)
2/19/2006 6:24:18 PM||General prefs: no separate prefs for home; using your defaults
2/19/2006 6:24:19 PM||Remote control not allowed; using loopback address
------
3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1))
3/4/2006 6:26:59 PM||request_reschedule_cpus: process exited
3/4/2006 6:26:59 PM|rosetta@home|Computation for result HOMSdt_homDB004_1dtj__340_50_0 finished
------
25.98 seconds... it sure failed quickly. Mine is a 754-pin Athlon 64 running WinXP Pro SP2 (supposedly fully updated, minus the Microsoft anti-spyware package), with Panda Titanium antivirus. |
Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
The following three WUs:
HOMSdt_homDB002_1dtj__340_124_0
HOMSdt_homDB002_1dtj__352_271_0
HOMSdt_homDB004_1dtj__352_669_0
exited with error status 1 after about 30 seconds of run time on my Linux computer as well as on several other computers. Since three out of three units of this particular type have failed on this computer, which usually has almost no errors, I believe this is a WU-specific error which may need investigating. |
Joined: 22 Dec 05 Posts: 71 Credit: 138,867 RAC: 0 |
Exit status 1 (0x1) https://boinc.bakerlab.org/rosetta/result.php?resultid=12859932 Rosetta_4.82, Windows 2000 Server SP4, 512 MB RAM. *This error occurred while I was asleep; Rosetta was the only program running. |
loren Joined: 10 Oct 05 Posts: 3 Credit: 2,449,762 RAC: 0 |
I have also received a computational error on each of the last three mornings. Is there any information I can collect that will help fix the problem? Loren |
Joined: 22 Dec 05 Posts: 71 Credit: 138,867 RAC: 0 |
Exit status 1 (0x1) https://boinc.bakerlab.org/rosetta/result.php?resultid=12918099 Rosetta_4.82 Windows XP |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
"The following three WUs:"
We are looking into this. Thanks. |
divyab Joined: 20 Oct 05 Posts: 6 Credit: 0 RAC: 0 |
We have found the problem and are resubmitting the jobs with a fix. There are still a few workunits out there with the prefix HOMSdt_homDB0??_1dtj that you can expect to fail very quickly. This should not happen with the next batch. |
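If you want to check whether a result in your queue belongs to the affected batch, a quick Python sketch follows. It assumes each `?` in the prefix above stands for exactly one character (the usual shell-style wildcard reading), and that anything may follow the prefix; `expected_to_fail` is a made-up helper name.

```python
from fnmatch import fnmatchcase

# Prefix quoted in the post above; the trailing "*" is added here so the rest
# of the result name can be anything (an assumption about how to read it).
PATTERN = "HOMSdt_homDB0??_1dtj*"


def expected_to_fail(result_name):
    """True if the result name matches the affected HOMSdt prefix."""
    return fnmatchcase(result_name, PATTERN)


if __name__ == "__main__":
    for name in (
        "HOMSdt_homDB004_1dtj__340_50_0",
        "HOMSdt_homDB027_1dtj__340_175_0",
        "HBLR_1.0_1r69_323_710_2",
    ):
        print(name, expected_to_fail(name))
```

`fnmatchcase` is used instead of `fnmatch` so the comparison stays case-sensitive on every platform.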
©2025 University of Washington
https://www.bakerlab.org