Aborted WUs due to

Message boards : Number crunching : Aborted WUs due to

petrusbroder
Joined: 23 Sep 05
Posts: 9
Credit: 2,111,764
RAC: 0
Message 7496 - Posted: 24 Dec 2005, 7:45:31 UTC
Last modified: 24 Dec 2005, 7:49:14 UTC

Since December 21 I have had 42 WUs aborted (within 18 - 24 seconds after starting crunching) on 4 different computers with the same error message:
"access violation 0xc0000005 at address 0x0065638 read attempt to address 0x...." (with a different address each time; this applies to the PCs) and the error "segmentation violation... " (on the PowerMac).

Some of these aborted WUs occurred as early as December 20, some December 21, but most occurred December 23 and December 24.

4 different computers were hit: one using an Intel P4 M @ 2.00 GHz, one Athlon XP 2400+, one Athlon XP 1900+ and one PowerMac G5 with dual 2.0 GHz processors. Each computer has at least 768 MByte RAM and at most 2 GByte RAM. The PCs run Win XP SP2 with all current updates. The Mac runs OS X 10.4 with all updates current.

The setting "Leave applications in memory while preempted?" is set to "Yes". There is sufficient hard disk space left (> 20 GBytes on each computer). All other settings are left at their defaults except "Connect to network about every", which is set to 4 days.

The PCs run the application version 4.81, the Mac 4.79.

The WUs lost due to abortion were:
1hz6a_xxxxxx_207_xxx... (13 WUs on PCs, 3 on Mac)
1ogw_xxxxxx_207_xxx... (17 WUs on PCs, 5 WUs on Mac)
1n0u_xxxxx_204_xxx... (2 WUs on PCs and 2 on Mac)
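
As a quick cross-check of the counts above against the 42 aborted WUs mentioned at the top, here is a small tally. It is only a sketch: the dictionary keys are shortened batch prefixes chosen for illustration, and the numbers are simply copied from the list.

```python
# Aborted WU counts per batch, copied from the list above.
# The keys are shortened batch prefixes used only for this illustration.
aborted = {
    "1hz6a": {"PC": 13, "Mac": 3},
    "1ogw": {"PC": 17, "Mac": 5},
    "1n0u": {"PC": 2, "Mac": 2},
}


def totals(counts):
    """Return (per-platform totals, grand total) for the nested counts."""
    per_platform = {}
    for platforms in counts.values():
        for platform, n in platforms.items():
            per_platform[platform] = per_platform.get(platform, 0) + n
    return per_platform, sum(per_platform.values())


per_platform, grand_total = totals(aborted)
print(per_platform)  # {'PC': 32, 'Mac': 10}
print(grand_total)   # 42
```

So 32 of the aborted WUs were on the PCs and 10 on the Mac, which adds up to the 42 reported.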

Between these aborted WUs other WUs of different kinds were completed without problems both on the PCs and on the PowerMac.

No other errors occurred on these comps.

Other projects using BOINC (seti@home, predictor@home) run well without any errors occurring at all.

This is not a big problem, since the time lost is at most 17 - 18 minutes of crunching time (for all comps together). But I think that you - the officers of the project - should know about the problem. Since these errors occurred I have rebooted all the affected computers, and I will report back if the error occurs again (which I suspect it will, since 4 very different comps were hit with the same error: the only common denominator is BOINC, Rosetta@Home and just those WUs).

Are there other crunchers hit by these errors?

BTW: To all of you "Merry Christmas and a lot of joy!"

ID: 7496 · Rating: 0
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7497 - Posted: 24 Dec 2005, 7:52:09 UTC

Lots in other threads....
ID: 7497 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7499 - Posted: 24 Dec 2005, 7:53:58 UTC - in response to Message 7496.  

Since December 21 I have had 42 WUs aborted (within 18 - 24 seconds after starting crunching)


See Please abort WUs with
And Computation Error
And technical news
*** Join BOINC@Australia today ***
ID: 7499 · Rating: 0
petrusbroder
Joined: 23 Sep 05
Posts: 9
Credit: 2,111,764
RAC: 0
Message 7505 - Posted: 24 Dec 2005, 9:25:43 UTC - in response to Message 7499.  

Since December 21 I have had 42 WUs aborted (within 18 - 24 seconds after starting crunching)


See Please abort WUs with
And Computation Error
And technical news


Thank you for the info. I had looked through most of them before posting.

The problem mentioned in "Technical News" is the by now famous 205-problem. The second issue is "0xc0000005 UNHANDLED EXCEPTION". I did not know that the error code "0xc0000005 UNHANDLED EXCEPTION" is the same thing as "0xc0000005 access violation", because I have seen some other expressions connected to the "0xc0000005" error code. How is a non-programmer to know that those two expressions denote the same problem?

The "Please abort WUs with..." thread is mostly about the 205-problem and somewhat about other problems, which are not too well described. Then there is a lot of other stuff which makes it very hard for me to understand what is going on. There are other numbers mentioned, but nowhere did I see a reasonably similar description of my problem - especially not with that much detail, cross-platform, and with the possibility for a knowledgeable developer to see what is going on.

The same is valid for the "Computation error" thread. 207 errors are mentioned, but I did not realize that the other crunchers were seeing the same problems. Retarded me. Just the fact that a certain number in an aborting WU is mentioned does not mean that it is the same error being described. It may very well be so, but that is not proven, or even necessarily the case.

IMHO there is not "Lots in other threads" if I cannot recognize the problem in other threads. There has not been - to me - a clear message about what to do about this except to suspend work on Rosetta. That is not a solution: what about the other (good) WUs?

Your comments indicate that I am just too stupid to read and to recognize what is going on. The reading part is wrong. The recognizing part may be right. However, I thought that those who know less should be gently and friendly taught - mostly in these forums and by those who know more. You have not succeeded in that. Thank you also for making a non-programmer non-knowledgeable user feel less than comfy with this forum. I will not bother you again and will thus not waste more bandwidth for myself and all the others. I am sorry for taking up valuable bandwidth. Just let this thread go idle.

In spite of all said above: Merry Christmas to you all.
ID: 7505 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7511 - Posted: 24 Dec 2005, 11:38:09 UTC - in response to Message 7505.  

Thank you for the info. I had looked through most of them before posting.


You're welcome.

It was not obvious from what you wrote that you had actually looked at those threads (you made no reference to them). I tried to help by guiding you there as the symptoms are the same. At no time did I say anything about your level of knowledge.

Yes, many crunchers have been hit by those same problems and things are gradually improving. After the initial rush where something like 90% of work units failed, I have only seen a few in the last 12 hours. I expect to see a few more of them before they eventually disappear.

Merry Christmas!
*** Join BOINC@Australia today ***
ID: 7511 · Rating: 0
Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 7530 - Posted: 24 Dec 2005, 16:13:04 UTC - in response to Message 7511.  

Thank you for the info. I had looked through most of them before posting.


You're welcome.

It was not obvious from what you wrote that you had actually looked at those threads (you made no reference to them). I tried to help by guiding you there as the symptoms are the same. At no time did I say anything about your level of knowledge.

Yes, many crunchers have been hit by those same problems and things are gradually improving. After the initial rush where something like 90% of work units failed, I have only seen a few in the last 12 hours. I expect to see a few more of them before they eventually disappear.

Merry Christmas!

You might see more than a few of them, since they keep recycling them, and some of the ones we in the Rebel Alliance have looked at have been crunched as many as 11 times now.
ID: 7530 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,531,968
RAC: 2,682
Message 7539 - Posted: 24 Dec 2005, 19:07:51 UTC

One comment on the "cross-platform" issue; the same logical problem, an input value that is "bad", will result in different errors on a Mac than it will on a PC. This is simply because they use different underlying math libraries, with different sets of error codes and error messages. Generally the error will be "translated" by the application into something BOINC-defined, but some parts of the underlying message will be "passed through" for debugging purposes.
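
To illustrate the point, here is a rough sketch of how platform-specific crash messages could be folded into one common category. This is not how BOINC or Rosetta actually does it: the `classify` function and the `MEMORY_FAULT` label are made up for this example. The only facts relied on are that 0xC0000005 is the Windows status code for an access violation, and that Unix-like systems such as Mac OS X report the same kind of invalid memory access as a segmentation violation (SIGSEGV).

```python
import re

# Hypothetical category label for this sketch; not a BOINC-defined name.
MEMORY_FAULT = "invalid memory access"

# 0xC0000005 is the Windows status code for an access violation.
WINDOWS_ACCESS_VIOLATION = 0xC0000005


def classify(message: str) -> str:
    """Map a platform-specific crash message to a common category."""
    text = message.lower()

    # Windows: compare hex codes numerically, so the wording around the
    # code ("access violation", "UNHANDLED EXCEPTION") does not matter.
    for token in re.findall(r"0x[0-9a-f]+", text):
        if int(token, 16) == WINDOWS_ACCESS_VIOLATION:
            return MEMORY_FAULT

    # Unix-like systems describe the same kind of fault in words.
    unix_markers = ("segmentation violation", "segmentation fault", "sigsegv")
    if any(marker in text for marker in unix_markers):
        return MEMORY_FAULT

    return "unclassified"


reports = [
    "access violation 0xc0000005 at address 0x0065638",
    "0xc0000005 UNHANDLED EXCEPTION",
    "segmentation violation",
]
for report in reports:
    print(f"{report!r} -> {classify(report)}")
```

All three messages land in the same category here. The code 0xc0000005 identifies the fault itself, while "unhandled exception" only says that nothing in the program caught it.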

The reason for pointing you to other threads was that you seemed just at a glance to not be aware of the problems that were going on. You said "Are there other crunchers hit by these errors?" - you intended the emphasis to be on "THESE" errors, but it was read as saying "ANY SIMILAR" errors. I don't see anything in the responses to imply that you are stupid, just an attempt to quickly say "look over there, it's already dealt with", rather than taking the time to analyze your specific issue. We are all way too familiar with the fact that when _some_ people have a problem, rather than reading _any_ information that is already here, they just start a new thread. Because of that, unless something in your posting "stands out" as being different from anything already said by others, the automatic response is "yeah, yeah..."

I honestly cannot say that your errors are exactly the same as those covered elsewhere. It is _possible_ that you have a totally different problem, that you've gotten some number of WUs in that 42 with some other underlying error. However, given the fact that thousands of WUs were released in this time period with the _same_ underlying error, it is pointless to spend the time looking to even verify that yours are the same or not. All we can say is that if the errors you describe continue _after_ all the "short WUs" are gone, THEN it will be reasonable to do more research. The "tech news" is very unspecific on what WU names are affected by this. I don't think anyone knows, actually; it was a program change in the software that _generates_ the WUs, and it had been running for some time before it was caught. The "DEFAULT_xxxx_205" error is a different one, that is the only WU name that is specifically known to have a problem, and that is because the problem is in the structure of the WUs themselves (1000 steps instead of 10) and not in the input value assignment.

ID: 7539 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org