Miscellaneous Work Unit Errors

Message boards : Number crunching : Miscellaneous Work Unit Errors

Osku87
Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 11476 - Posted: 27 Feb 2006, 20:29:40 UTC - in response to Message 11429.  

Reducing the time setting really did solve the problem, not for the first unit but for the second and those after it. Now it works fine.
ID: 11476 · Rating: 0
Marie Lucie
Joined: 9 Dec 05
Posts: 5
Credit: 40,616
RAC: 0
Message 11484 - Posted: 28 Feb 2006, 6:08:33 UTC

Hello,
For me the problems continue ...

27/02/2006 21:08:33|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_2acy__311_501_1 ( - exit code -164 (0xffffff5c))
27/02/2006 21:08:33||request_reschedule_cpus: process exited
27/02/2006 21:08:33|rosetta@home|Computation for result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_2acy__311_501_1 finished
27/02/2006 23:43:29|rosetta@home|Unrecoverable error for result ABINITai_hom022_1aiu__320_64_1 ( - exit code -164 (0xffffff5c))
27/02/2006 23:43:29||request_reschedule_cpus: process exited
27/02/2006 23:43:29|rosetta@home|Computation for result ABINITai_hom022_1aiu__320_64_1 finished
28/02/2006 07:02:13|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_323_143_0 ( - exit code -1073741819 (0xc0000005))
28/02/2006 07:02:13||request_reschedule_cpus: process exited
28/02/2006 07:02:13|rosetta@home|Computation for result HBLR_1.0_1di2_323_143_0 finished
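The log prints each exit code twice: as a signed decimal and as an unsigned 32-bit hex value. A minimal sketch of that conversion follows, assuming the client simply reinterprets the signed value as 32-bit two's complement. The small lookup table covers only the two Windows NTSTATUS values that appear in this thread; the helper name `describe_exit_code` is made up for illustration, and -164 is left unnamed because the thread does not say what it means.

```python
# Windows NTSTATUS values that show up in this thread (not a complete list).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def describe_exit_code(code: int) -> str:
    """Format a signed exit code like the log does and add a name if known."""
    # Reinterpret the signed value as an unsigned 32-bit integer.
    unsigned = code & 0xFFFFFFFF
    name = NTSTATUS_NAMES.get(unsigned, "not in this small table")
    return f"{code} (0x{unsigned:08x}): {name}"


for code in (-164, -1073741819, -1073741811):
    print(describe_exit_code(code))
```

The hex values printed match the ones in the log lines above, which is a quick way to confirm that two differently formatted reports refer to the same error.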


ID: 11484 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11487 - Posted: 28 Feb 2006, 7:17:00 UTC - in response to Message 11476.  

Reducing the time setting really did solve the problem, not for the first unit but for the second and those after it. Now it works fine.


Great.
We hope to locate the sources of the errors this week.
In the meantime, you can control the fraction of WUs that fail, since the probability of an error seems constant over the duration of the run. So, roughly speaking, if 50% of your WUs fail with the 8-hour run time, only about 12.5% should fail with a 2-hour run time, and an even smaller fraction with a 1-hour run time.
Of course, this is only a very temporary fix, since there are many reasons why longer run times are preferable.
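As a rough check on that rule of thumb: scaling the failure fraction linearly with run time gives 50% × 2/8 = 12.5%. If one instead assumes that errors strike at a constant rate per hour, independently of how long the unit has already run (an assumption, not something measured in this thread), the fraction surviving shrinks geometrically and the numbers come out slightly higher, about 15.9% at 2 hours and 8.3% at 1 hour. The function names below are illustrative only.

```python
def failure_fraction(fail_at_ref: float, ref_hours: float, hours: float) -> float:
    """Expected failure fraction at `hours`, given the fraction failing at
    `ref_hours`, assuming a constant failure rate per hour of run time."""
    survive_ref = 1.0 - fail_at_ref
    # Survival over t hours is survival over ref_hours raised to t / ref_hours.
    return 1.0 - survive_ref ** (hours / ref_hours)


def linear_estimate(fail_at_ref: float, ref_hours: float, hours: float) -> float:
    """The simpler rule of thumb: scale the failure fraction with run time."""
    return fail_at_ref * hours / ref_hours


for hours in (8, 2, 1):
    exact = failure_fraction(0.5, 8, hours)
    rough = linear_estimate(0.5, 8, hours)
    print(f"{hours} h: constant-rate model {exact:.1%}, linear estimate {rough:.1%}")
```

Either way the conclusion is the same: shorter run times lose fewer work units to errors, and the two estimates converge as the failure fraction gets small.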
ID: 11487 · Rating: 0
Koen
Joined: 29 Sep 05
Posts: 8
Credit: 8,542,574
RAC: 0
Message 11490 - Posted: 28 Feb 2006, 11:56:19 UTC

I don't know if it has anything to do with it, but I noticed that the RAM footprint of the Rosetta app sometimes rises as high as 130 MB when processing HBLR_1.0 work units. Do I remember correctly that this also caused problems a couple of months ago?
Looking at the errors my fellow crunchers experienced, I noticed that a lot of them occur on the above-mentioned HBLR_1.0 work units, so I thought this was worth mentioning. If not, please accept my apologies for wasting your time.

K.
ID: 11490 · Rating: 0
mags
Joined: 22 Nov 05
Posts: 33
Credit: 108,630
RAC: 0
Message 11498 - Posted: 28 Feb 2006, 18:57:30 UTC
Last modified: 28 Feb 2006, 18:59:42 UTC

Reducing the time does not fix it for me. The whole BOINC/Rosetta setup freezes and does not restart until I manually fix the problem. If I didn't have a life/family/job, this just might be okay...

And don't even ask whether I have "leave in memory" ticked; this does not solve every BOINC/Rosetta problem.

I'm really peeved at the lack of upfront info about what appears to be a major problem for so many.

As I've said in the past, talk to us, and in English, not 'techno gibberish'.

P.S. I have 1 GB of RAM.
join Fadbeens

ID: 11498 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11509 - Posted: 1 Mar 2006, 5:33:56 UTC - in response to Message 11498.  

Reducing the time does not fix it for me. The whole BOINC/Rosetta setup freezes and does not restart until I manually fix the problem. If I didn't have a life/family/job, this just might be okay...

And don't even ask whether I have "leave in memory" ticked; this does not solve every BOINC/Rosetta problem.

I'm really peeved at the lack of upfront info about what appears to be a major problem for so many.

As I've said in the past, talk to us, and in English, not 'techno gibberish'.

P.S. I have 1 GB of RAM.


I'm sorry about these frustrating problems!

David Kim had a great idea for solving almost all of these problems, which he is testing on RALPH; if it works, you will soon see it here at Rosetta@home. If a Rosetta or BOINC error occurs, rather than killing the whole process, a special termination routine will be called that sends back to us all structures computed up to that point. This is good for us, and for you, since credit will be awarded for the work done up to that point. So instead of seeing errors, you will see an occasional work unit with a shorter run time.
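To illustrate the idea only: the sketch below is not Rosetta or BOINC code, and every name in it (`run_work_unit`, `flaky_compute`) is hypothetical. It shows the general pattern described above, where an error partway through a run ends the run early and hands back whatever was finished instead of discarding everything.

```python
def run_work_unit(compute_structure, n_structures):
    """Compute up to n_structures; on an error, stop and keep partial results.

    Returns (results, error), where error is None if the run completed.
    """
    results = []
    error = None
    for index in range(n_structures):
        try:
            results.append(compute_structure(index))
        except Exception as exc:
            # Stand-in for the "special termination routine": remember the
            # error and fall through to return what has been computed so far.
            error = exc
            break
    return results, error


def flaky_compute(index):
    """Toy stand-in for one structure calculation; fails on the fourth call."""
    if index == 3:
        raise RuntimeError("simulated crash")
    return index * index


results, error = run_work_unit(flaky_compute, 10)
print(f"returned {len(results)} structures, error: {error}")
```

In this toy run the caller still receives the three completed results, which corresponds to a work unit that reports a shorter run time rather than a failure.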
ID: 11509 · Rating: 0
bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 11510 - Posted: 1 Mar 2006, 6:05:03 UTC

I inadvertently left my graphics running and it crashed the work unit with a computational error. I figured it was just a coincidence and thought nothing of it, until I accidentally did it again a day or two later and the same thing happened.
I am now afraid to turn on graphics at all. These WUs crashed with 6 to 7 hours of CPU time clocked on them.

BTW, out of the last 77 results using the default 8 hours, I have experienced 5 crashes on an AMD64 X2 3800 with a gig of RAM.

Cheers... I like the science FAQs.
ID: 11510 · Rating: 0
SallyH
Joined: 4 Nov 05
Posts: 6
Credit: 4,799,395
RAC: 0
Message 11524 - Posted: 1 Mar 2006, 14:46:15 UTC

I turned off the BOINC screensaver and have not had an error since. I have had a few hung errors, but not the other errors.
ID: 11524 · Rating: 0
mags
Joined: 22 Nov 05
Posts: 33
Credit: 108,630
RAC: 0
Message 11528 - Posted: 1 Mar 2006, 18:15:51 UTC

Thanks guys, I had all but given up.

I have turned off the BOINC screensaver; hopefully that will do the trick.
join Fadbeens

ID: 11528 · Rating: 0
Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 11533 - Posted: 1 Mar 2006, 21:27:19 UTC
Last modified: 1 Mar 2006, 21:27:39 UTC

I don't think turning off the Screensaver will help these problems. I've always had it turned off and had 5 failures today, out of 25 returned results... The thing that will help is what David Baker stated below...
Join the Teddies@WCG
ID: 11533 · Rating: 0
ecafkid
Joined: 5 Oct 05
Posts: 40
Credit: 15,177,319
RAC: 0
Message 11534 - Posted: 1 Mar 2006, 22:03:48 UTC

These are 2 errors for today so far

3/1/2006 4:23:18 AM|rosetta@home|Unrecoverable error for result ABINITac_hom020_1acf__320_49_1 ( - exit code -1073741811 (0xc000000d))
3/1/2006 3:36:37 PM|rosetta@home|Unrecoverable error for result ABINITen_hom023_1enh__322_48_0 ( - exit code -1073741811 (0xc000000d))



ID: 11534 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 11732 - Posted: 6 Mar 2006, 21:09:36 UTC

3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1))

appeared.. so it's obvious my system isn't immune from the UEs.. (running 24 hours each just makes it take longer to find them..)

ID: 11732 · Rating: 0
Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 11734 - Posted: 6 Mar 2006, 21:48:07 UTC

I have a few:

Computer ID 142540

HOMSdt_homDB009_1dtj__340_142_1
<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>


Computer ID 142263

HOMSdt_homDB027_1dtj__340_175_0

<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>


Computer ID 56911

HOMSdt_homDB009_1dtj__340_12_0

<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>


Computer ID 57040

HBLR_1.0_1r69_323_710_2

<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 3816991
# cpu_run_time_pref: 86400
</stderr_txt>


I'm sure there's more but it will take awhile to check the rest of my puters...





Join the Teddies@WCG
ID: 11734 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 11742 - Posted: 7 Mar 2006, 1:02:22 UTC

If this is helpful..

--------
stderr.txt
# random seed: 3468381
# cpu_run_time_pref: 86400
---
2/19/2006 6:24:18 PM||Starting BOINC client version 5.2.13 for windows_intelx86
2/19/2006 6:24:18 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
2/19/2006 6:24:18 PM||Executing as a daemon
2/19/2006 6:24:18 PM||Data directory: C:\Program Files\BOINC
2/19/2006 6:24:18 PM||BOINC is running as a service and as a non-system user.
2/19/2006 6:24:18 PM||No application graphics will be available.
2/19/2006 6:24:18 PM||Processor: 1 AuthenticAMD AMD Athlon(tm) 64 Processor 3000+
2/19/2006 6:24:18 PM||Memory: 1023.48 MB physical, 1.65 GB virtual
2/19/2006 6:24:18 PM||Disk: 29.29 GB total, 3.84 GB free
2/19/2006 6:24:18 PM|rosetta@home|Computer ID: 121218; location: home; project prefs: default
2/19/2006 6:24:18 PM||General prefs: from rosetta@home (last modified 2005-12-29 13:52:58)
2/19/2006 6:24:18 PM||General prefs: no separate prefs for home; using your defaults
2/19/2006 6:24:19 PM||Remote control not allowed; using loopback address

------
3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1))
3/4/2006 6:26:59 PM||request_reschedule_cpus: process exited
3/4/2006 6:26:59 PM|rosetta@home|Computation for result HOMSdt_homDB004_1dtj__340_50_0 finished
------
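For anyone collecting these reports from their own logs, here is a minimal sketch of pulling the result name and exit code out of the "Unrecoverable error" lines. It assumes only the `timestamp|project|message` layout visible in the excerpts in this thread; the regular expression and the helper name `parse_error_line` are illustrative, not part of BOINC.

```python
import re

# Matches e.g. "... for result NAME (Incorrect function. (0x1) - exit code 1 (0x1))"
# and the variant with an empty reason: "... for result NAME ( - exit code -164 (0xffffff5c))".
ERROR_RE = re.compile(
    r"Unrecoverable error for result (\S+) "
    r"\((.*?) - exit code (-?\d+) \((0x[0-9a-fA-F]+)\)\)"
)


def parse_error_line(line):
    """Return a dict for an 'Unrecoverable error' log line, otherwise None."""
    parts = line.strip().split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = parts
    match = ERROR_RE.search(message)
    if match is None:
        return None
    result_name, reason, exit_code, exit_hex = match.groups()
    return {
        "timestamp": timestamp,
        "project": project,
        "result": result_name,
        "reason": reason.strip(),
        "exit_code": int(exit_code),
        "exit_hex": exit_hex,
    }


sample = (
    "3/4/2006 6:26:59 PM|rosetta@home|Unrecoverable error for result "
    "HOMSdt_homDB004_1dtj__340_50_0 (Incorrect function. (0x1) - exit code 1 (0x1))"
)
print(parse_error_line(sample))
```

Lines that are not error reports, such as the startup messages above, simply return `None`, so the function can be mapped over a whole log file.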

25.98 seconds.. it sure failed quickly.

Mine is a 754-pin Athlon 64 running WinXP Pro SP2 (supposedly fully updated, minus the Microsoft anti-spyware package), with Panda Titanium antivirus.
ID: 11742 · Rating: 0
Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 11774 - Posted: 8 Mar 2006, 6:32:49 UTC
Last modified: 8 Mar 2006, 6:33:53 UTC

The following three WUs:

HOMSdt_homDB002_1dtj__340_124_0
HOMSdt_homDB002_1dtj__352_271_0
HOMSdt_homDB004_1dtj__352_669_0

exited with error status 1 after about 30 seconds of run time on my Linux computer, as well as on several other computers. Since three out of three units of this particular type have failed on this computer, which usually has almost no errors, I believe this is a WU-specific error that may need investigating.
ID: 11774 · Rating: 0
Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 11781 - Posted: 8 Mar 2006, 13:10:27 UTC

Exit status 1 (0x1)

https://boinc.bakerlab.org/rosetta/result.php?resultid=12859932

Rosetta_4.82, Windows 2000 Server SP4, 512 MB RAM

*This error occurred while I was asleep... Rosetta was the only program running.
ID: 11781 · Rating: 0
loren
Joined: 10 Oct 05
Posts: 3
Credit: 2,449,762
RAC: 0
Message 11782 - Posted: 8 Mar 2006, 14:55:57 UTC - in response to Message 11781.  

I have also received a computational error on each of the last three mornings. Is there any information I can collect that will help fix the problem?
Loren
ID: 11782 · Rating: 0
Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 11795 - Posted: 8 Mar 2006, 21:16:05 UTC

Exit status 1 (0x1)
https://boinc.bakerlab.org/rosetta/result.php?resultid=12918099
Rosetta_4.82 Windows XP
ID: 11795 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11802 - Posted: 9 Mar 2006, 2:06:14 UTC - in response to Message 11774.  

The following three WUs:

HOMSdt_homDB002_1dtj__340_124_0
HOMSdt_homDB002_1dtj__352_271_0
HOMSdt_homDB004_1dtj__352_669_0

exited with error status 1 after about 30 seconds of run time on my Linux computer, as well as on several other computers. Since three out of three units of this particular type have failed on this computer, which usually has almost no errors, I believe this is a WU-specific error that may need investigating.


We are looking into this. Thanks.

ID: 11802 · Rating: 0
divyab
Joined: 20 Oct 05
Posts: 6
Credit: 0
RAC: 0
Message 11804 - Posted: 9 Mar 2006, 3:00:40 UTC

We have found the problem, and are resubmitting the jobs with a fix. There are still a few workunits with the following prefix out there that you can expect to fail very quickly:

HOMSdt_homDB0??_1dtj

This should not happen with the next batch.
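To check whether a result in your own queue belongs to the affected batch, the prefix above can be treated as a shell-style wildcard, where each `?` stands for one character. The sketch below uses Python's `fnmatch` module; the trailing `*` is an addition made here so that the pattern matches full result names, and `is_affected` is an illustrative name.

```python
from fnmatch import fnmatchcase

# Prefix from the post above, plus a trailing "*" (added here) so the
# pattern also covers the rest of the result name.
AFFECTED_PATTERN = "HOMSdt_homDB0??_1dtj*"


def is_affected(result_name: str) -> bool:
    """True if the result name falls under the affected HOMSdt batch prefix."""
    return fnmatchcase(result_name, AFFECTED_PATTERN)


# Result names taken from earlier posts in this thread.
for name in (
    "HOMSdt_homDB004_1dtj__340_50_0",
    "HOMSdt_homDB027_1dtj__340_175_0",
    "HBLR_1.0_1r69_323_710_2",
):
    print(name, "->", "affected" if is_affected(name) else "not affected")
```

`fnmatchcase` is used rather than `fnmatch` so that matching stays case-sensitive on every platform.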

ID: 11804 · Rating: 0



©2025 University of Washington
https://www.bakerlab.org