Miscellaneous Work Unit Errors

Message boards : Number crunching : Miscellaneous Work Unit Errors

tng*
Joined: 28 Oct 05
Posts: 14
Credit: 5,389,798
RAC: 0
Message 11395 - Posted: 25 Feb 2006, 19:39:22 UTC - in response to Message 11323.  
Last modified: 25 Feb 2006, 19:43:16 UTC

For people having many work Unit Errors!!

I have received an e-mail from Dr. Baker with information for any of you who are having a lot of Work Unit errors.

"Could you help us to recommend to people having problems with lots of WU to set the target run time to a smaller value like 2 hours. We think there aren't any new bugs, just with longer run times it is more likely for a WU to have problems."

So if you are having a lot of errors please reset your Time setting to 2 hours and see if that helps.


Having received half a dozen errors on 4.82 on two machines that I don't believe have had that many errors in three months, I did this. Within hours,
another error:

12092296

(edited to show the correct result -- oops)

Not the same as the earlier ones, but there still seem to be problems with a
two-hour setting.

The machines having problems are Dell Dimension 9100s, Pentium D 820, 1 gig, XP SP2 with all critical updates, Boinc 5.2.13 (updated from 5.2.2, but that didn't seem to fix anything).

My athlon 4000+ laptop has had no problems -- maybe something with multi-CPU systems.
ID: 11395

Osku87
Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 11429 - Posted: 26 Feb 2006, 19:18:32 UTC - in response to Message 11395.  
Last modified: 26 Feb 2006, 19:20:22 UTC

My athlon 4000+ laptop has had no problems -- maybe something with multi-CPU systems.

It's not something with multi-CPU systems. I have Sempron 2400+ and have the same problems. Reducing the time setting to 2 hours didn't solve the problem. For example: 9642560
ID: 11429

doc :)
Joined: 4 Oct 05
Posts: 47
Credit: 1,106,102
RAC: 0
Message 11475 - Posted: 27 Feb 2006, 20:26:26 UTC

27/02/2006 20:09:06|rosetta@home|Unrecoverable error for result ABINITe6_hom011_1e6iA_320_4_0 ( - exit code -1073741811 (0xc000000d))

I get that error with that WU type (ABINIT**_hom***...) and 4.82 when I have the graphics open. It crashes for each WU I tried that with, at about the time the first model should be finished (I am not opening the graphics in a WU that is past its first checkpoint right now, to avoid wasting more cycles than necessary). If I leave it running by itself without graphics in the background, it finishes without problems.
some examples: result, result, result

this WU from the HBLR type of WUs failed with the same error while i had graphics open at a later point.
ID: 11475

Osku87
Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 11476 - Posted: 27 Feb 2006, 20:29:40 UTC - in response to Message 11429.  

Reducing the time setting did really solve the problem. Not with the first unit but the second and the following. Now it works really fine.
ID: 11476

Marie Lucie
Joined: 9 Dec 05
Posts: 5
Credit: 40,616
RAC: 0
Message 11484 - Posted: 28 Feb 2006, 6:08:33 UTC

Hello,
For me the problems continue ...

27/02/2006 21:08:33|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_2acy__311_501_1 ( - exit code -164 (0xffffff5c))
27/02/2006 21:08:33||request_reschedule_cpus: process exited
27/02/2006 21:08:33|rosetta@home|Computation for result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_2acy__311_501_1 finished
27/02/2006 23:43:29|rosetta@home|Unrecoverable error for result ABINITai_hom022_1aiu__320_64_1 ( - exit code -164 (0xffffff5c))
27/02/2006 23:43:29||request_reschedule_cpus: process exited
27/02/2006 23:43:29|rosetta@home|Computation for result ABINITai_hom022_1aiu__320_64_1 finished
28/02/2006 07:02:13|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_323_143_0 ( - exit code -1073741819 (0xc0000005))
28/02/2006 07:02:13||request_reschedule_cpus: process exited
28/02/2006 07:02:13|rosetta@home|Computation for result HBLR_1.0_1di2_323_143_0 finished
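The decimal exit codes in logs like these are signed 32-bit values, and the hex in parentheses is the same number read as unsigned. A minimal Python sketch of that conversion is below. The function name and the small lookup table are illustrative and not part of BOINC; the two names in the table are the standard Windows NTSTATUS names for those values, and any other code (such as -164) is simply reported as unknown here.

```python
# Illustrative helper, not part of BOINC or Rosetta.
# Only two well-known Windows NTSTATUS values are listed.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def decode_exit_code(code):
    """Return (hex string, name) for a signed 32-bit exit code."""
    # Masking with 0xFFFFFFFF reinterprets the signed value as unsigned.
    unsigned = code & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x}", name


if __name__ == "__main__":
    for code in (-1073741811, -1073741819, -164):
        print(code, *decode_exit_code(code))
```

Read this way, the 0xc0000005 failure is a memory access violation, while 0xc000000d means an invalid parameter was passed somewhere; the sketch makes no claim about what -164 means.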


ID: 11484

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11487 - Posted: 28 Feb 2006, 7:17:00 UTC - in response to Message 11476.  

Reducing the time setting did really solve the problem. Not with the first unit but the second and the following. Now it works really fine.


Great.
We hope to locate the sources of the errors this week.
In the meantime, you can control the fraction of WUs that have errors, since the probability of error seems constant over the duration of the run. So roughly speaking, if 50% of your WUs fail with the 8 hour run time, only 12.5% should fail with a 2 hour run time, and an even smaller fraction with a 1 hour run time.
Of course, this is only a very temporary fix since there are many reasons why longer run times are preferable.
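The rough arithmetic above can be checked with a few lines of Python. This is only a sketch under the stated assumption that the chance of an error per unit of run time is constant; the function names are made up for illustration. The linear estimate reproduces the 12.5% figure, while treating the constant rate as an exponential process gives a somewhat higher number for the same inputs.

```python
import math


def linear_estimate(p_long, t_long_hours, t_short_hours):
    """Scale the failure fraction in proportion to run time (rough rule)."""
    return p_long * t_short_hours / t_long_hours


def constant_rate_estimate(p_long, t_long_hours, t_short_hours):
    """Failure fraction for a shorter run, assuming a constant error rate.

    Solves 1 - exp(-rate * t_long) = p_long for the rate, then applies
    the same rate to the shorter run time.
    """
    rate = -math.log(1.0 - p_long) / t_long_hours
    return 1.0 - math.exp(-rate * t_short_hours)


if __name__ == "__main__":
    for hours in (2, 1):
        print(hours,
              round(linear_estimate(0.5, 8, hours), 3),
              round(constant_rate_estimate(0.5, 8, hours), 3))
```

Either way the conclusion is the same: a shorter target run time means a smaller fraction of work units hit an error.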
ID: 11487

Koen
Joined: 29 Sep 05
Posts: 8
Credit: 8,542,574
RAC: 0
Message 11490 - Posted: 28 Feb 2006, 11:56:19 UTC

Don't know if it has anything to do with it, but I noticed that the RAM footprint of the rosetta app sometimes rises to as high as 130MB when processing HBLR_1.0 workunits. Do I remember correctly that this also caused problems a couple of months ago?
Looking at the errors my fellow crunchers experienced, I noticed that a lot of those errors occur on the above-mentioned HBLR_1.0 workunits. So I thought this was worth mentioning. If not, please accept my apologies for wasting your time.

K.
ID: 11490

mags
Joined: 22 Nov 05
Posts: 33
Credit: 108,630
RAC: 0
Message 11498 - Posted: 28 Feb 2006, 18:57:30 UTC
Last modified: 28 Feb 2006, 18:59:42 UTC

Reducing the time does not fix it for me, the whole boinc/rosetta freezes and it does not restart until I manually fix the problem, now if I didn't have a life/family/job this just might be okay..................

And don't even ask if I have leave in memory ticked, this does not solve every boinc/rosetta problem.

I'm really peeved at the lack of upfront info about what appears to be a major problem for so many.

As I've said in the past, talk to us, and in English, not 'techno gibberish'.

ps I have 1gig of ram
join Fadbeens

ID: 11498

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11509 - Posted: 1 Mar 2006, 5:33:56 UTC - in response to Message 11498.  

Reducing the time does not fix it for me, the whole boinc/rosetta freezes and it does not restart until I manually fix the problem, now if I didn't have a life/family/job this just might be okay..................

And don't even ask if I have leave in memory ticked, this does not solve every boinc/rosetta problem.

I'm really peeved at the lack of upfront info about what appears to be a major problem for so many.

As I've said in the past, talk to us, and in English, not 'techno gibberish'.

ps I have 1gig of ram


I'm sorry about these frustrating problems!

David Kim had a great idea for solving almost all of these problems that he is testing on RALPH; if it works you will soon see it here at rosetta@home. If a Rosetta or BOINC error occurs, rather than killing the whole process, a special termination routine will be called which will send back to us all structures computed to that point. This is good for us, and for you, since credit will be awarded for the process up to this point. So instead of seeing errors, you will see an occasional work unit with a shorter run time.
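The idea can be illustrated with a short Python sketch. This is not the Rosetta or BOINC code, which is not shown in this thread; `run_work_unit` and `compute_model` are hypothetical names. It only shows the general pattern of keeping whatever finished before an error instead of discarding the whole run.

```python
def run_work_unit(compute_model, n_models):
    """Compute up to n_models structures, keeping partial results on error.

    compute_model is a hypothetical callable taking a model index.
    Returns (completed_structures, error), where error is None if every
    model finished.
    """
    completed = []
    error = None
    for index in range(n_models):
        try:
            completed.append(compute_model(index))
        except Exception as exc:
            # Stop early, but keep everything computed so far so it can
            # still be reported instead of being lost with the process.
            error = exc
            break
    return completed, error


if __name__ == "__main__":
    def flaky_model(index):
        # Stand-in for real work: fails on the fourth model.
        if index == 3:
            raise RuntimeError("simulated failure")
        return f"structure_{index}"

    structures, err = run_work_unit(flaky_model, 10)
    print(len(structures), "structures kept; error:", err)
```

In this sketch a failing run looks like a shorter run that returned fewer structures, which matches the behaviour described above.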
ID: 11509

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 11510 - Posted: 1 Mar 2006, 6:05:03 UTC

I inadvertently left my graphics running and it managed to crash the workunit with a computational error. I figured it was just coincidence and thought nothing of it, until I accidentally did it again a day or two later and the same thing happened.
I am now afraid to turn on graphics at all. These WUs crashed with 6 to 7 hours of CPU time clocked on them.

BTW, out of the last 77 results using the default 8 hours, I have experienced 5 crashes on an AMD64 X2 3800 with a gig of RAM.

Cheers.......I like the science FAQs
ID: 11510

SallyH
Joined: 4 Nov 05
Posts: 6
Credit: 4,799,395
RAC: 0
Message 11524 - Posted: 1 Mar 2006, 14:46:15 UTC

I turned off the BOINC screensaver and have not had an error since.....had a few hung errors but not the other errors.....
ID: 11524

mags
Joined: 22 Nov 05
Posts: 33
Credit: 108,630
RAC: 0
Message 11528 - Posted: 1 Mar 2006, 18:15:51 UTC

Thanks guys, I had all but given up.

I have turned off the boinc screensaver, hopefully that will do the trick.
join Fadbeens

ID: 11528

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 11533 - Posted: 1 Mar 2006, 21:27:19 UTC
Last modified: 1 Mar 2006, 21:27:39 UTC

I don't think turning off the Screensaver will help these problems. I've always had it turned off and had 5 failures today, out of 25 returned results... The thing that will help is what David Baker stated below...
Join the Teddies@WCG
ID: 11533

ecafkid
Joined: 5 Oct 05
Posts: 40
Credit: 15,177,319
RAC: 0
Message 11534 - Posted: 1 Mar 2006, 22:03:48 UTC

These are 2 errors for today so far

3/1/2006 4:23:18 AM|rosetta@home|Unrecoverable error for result ABINITac_hom020_1acf__320_49_1 ( - exit code -1073741811 (0xc000000d))
3/1/2006 3:36:37 PM|rosetta@home|Unrecoverable error for result ABINITen_hom023_1enh__322_48_0 ( - exit code -1073741811 (0xc000000d))
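For anyone collecting these from a long message log, a small Python sketch like the one below can pull out the result name and exit code. It assumes the pipe-separated `timestamp|project|message` layout seen in the lines posted in this thread; the function name and regular expression are illustrative and are not a BOINC tool.

```python
import re

# Matches the "Unrecoverable error" messages quoted in this thread, with
# or without descriptive text before "exit code".
ERROR_RE = re.compile(
    r"Unrecoverable error for result (\S+) "
    r"\(.*exit code (-?\d+) \((0x[0-9a-fA-F]+)\)\)"
)


def parse_error_line(line):
    """Return a dict describing an error line, or None for other lines."""
    parts = line.strip().split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = parts
    match = ERROR_RE.search(message)
    if match is None:
        return None
    return {
        "timestamp": timestamp,
        "project": project,
        "result": match.group(1),
        "exit_code": int(match.group(2)),
        "hex": match.group(3),
    }


if __name__ == "__main__":
    sample = ("3/1/2006 4:23:18 AM|rosetta@home|Unrecoverable error for result "
              "ABINITac_hom020_1acf__320_49_1 ( - exit code -1073741811 (0xc000000d))")
    print(parse_error_line(sample))
```

Lines that are not error reports, such as the `request_reschedule_cpus` messages, return None and can be skipped.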



ID: 11534

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 11732 - Posted: 6 Mar 2006, 21:09:36 UTC

3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1))

appeared.. so it's obvious my system isn't immune from the UEs.. (running 24 hours each just makes it take longer to find them..)

ID: 11732

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 11734 - Posted: 6 Mar 2006, 21:48:07 UTC

I have a few:

Computer ID 142540

HOMSdt_homDB009_1dtj__340_142_1
<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>


Computer ID 142263

HOMSdt_homDB027_1dtj__340_175_0

<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>


Computer ID 56911

HOMSdt_homDB009_1dtj__340_12_0

<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>


Computer ID 57040

HBLR_1.0_1r69_323_710_2

<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 3816991
# cpu_run_time_pref: 86400
</stderr_txt>


I'm sure there's more but it will take awhile to check the rest of my puters...





Join the Teddies@WCG
ID: 11734

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 11742 - Posted: 7 Mar 2006, 1:02:22 UTC

If this is helpful..

--------
stderr.txt
# random seed: 3468381
# cpu_run_time_pref: 86400
---
2/19/2006 6:24:18 PM||Starting BOINC client version 5.2.13 for windows_intelx86
2/19/2006 6:24:18 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
2/19/2006 6:24:18 PM||Executing as a daemon
2/19/2006 6:24:18 PM||Data directory: C:\Program Files\BOINC
2/19/2006 6:24:18 PM||BOINC is running as a service and as a non-system user.
2/19/2006 6:24:18 PM||No application graphics will be available.
2/19/2006 6:24:18 PM||Processor: 1 AuthenticAMD AMD Athlon(tm) 64 Processor 3000+
2/19/2006 6:24:18 PM||Memory: 1023.48 MB physical, 1.65 GB virtual
2/19/2006 6:24:18 PM||Disk: 29.29 GB total, 3.84 GB free
2/19/2006 6:24:18 PM|rosetta@home|Computer ID: 121218; location: home; project prefs: default
2/19/2006 6:24:18 PM||General prefs: from rosetta@home (last modified 2005-12-29 13:52:58)
2/19/2006 6:24:18 PM||General prefs: no separate prefs for home; using your defaults
2/19/2006 6:24:19 PM||Remote control not allowed; using loopback address

------
3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1))
3/4/2006 6:26:59 PM||request_reschedule_cpus: process exited
3/4/2006 6:26:59 PM|rosetta@home|Computation for result HOMSdt_homDB004_1dtj__340_50_0 finished
------

25.98 seconds.. it sure failed quickly.

Mine is a 754 pin Athlon 64; running WinXP Pro SP2. (supposedly, fully
updated.. minus the microsoft anti spyware package.) Panda Titanium antivirus.
ID: 11742

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11745 - Posted: 7 Mar 2006, 4:08:14 UTC - in response to Message 11737.  
Last modified: 10 Mar 2006, 4:42:37 UTC

I am also getting "Computation errors" resulting in exit code 1073741811(0xc000000d). The last unit failed 1 hour 13 minutes and some odd seconds in, so I do not believe that the 2 hour setting would have helped this unit. Is this a similar problem (bug) to what Einstein@home recently suffered from? They had a graphics bug that had to do with certain G L cards. This is running on a Gateway 840, running at 3 GHz, with 1 gig of RAM, running Windows XP Professional. I would like to get an answer to this before too many units fail. Please contact me at "Dbidanset at aol dot com" with some info on this.



{EDIT NOTE: I changed the email address to protect you from "sniffers". Original message deleted on request of user for the same purpose and corrected a spelling error}

I will send your message to the project team at RALPH. You may want to attach your machine to the RALPH project to help troubleshoot this problem. If you do not hear from the project team in a day or so, let me know.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 11745

Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 11774 - Posted: 8 Mar 2006, 6:32:49 UTC
Last modified: 8 Mar 2006, 6:33:53 UTC

The following three WUs:

HOMSdt_homDB002_1dtj__340_124_0
HOMSdt_homDB002_1dtj__352_271_0
HOMSdt_homDB004_1dtj__352_669_0

exited with error status 1 after about 30 seconds of run time on my Linux computer as well as on several other computers. Since three out of three units of this particular type have failed on this computer, which usually has almost no errors, I believe this is a WU-specific error that may need investigating.
ID: 11774

Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 11781 - Posted: 8 Mar 2006, 13:10:27 UTC

Exit status 1 (0x1)

https://boinc.bakerlab.org/rosetta/result.php?resultid=12859932

Rosetta_4.82 windows 2000 server sp4 512mb RAM

*This error occurred while I was asleep ... rosetta was the only running program
ID: 11781