Computation Error

Message boards : Number crunching : Computation Error

Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,542,037
RAC: 3,110
Message 7494 - Posted: 24 Dec 2005, 7:32:22 UTC - in response to Message 7491.  

I'm using 4.8 and am losing at least 50% of WUs also from "computation error". It doesn't happen on other BOINC projects, and I'm sure my computer is OK.


1) Please realize that this was two bad batches of WUs that came out just at the end of the day as they were leaving for the holidays. Yes, it was a bad time for it to happen, but "bad batches" happen to every project.

2) SETI had two of these "bad batches" in the last couple of weeks, very similar to the two Rosetta has; one that ran extremely long before failing, wasting MANY hours of CPU time, and one that failed immediately because it was "0 length". SETI staff was still there and were able to kill most of these after only three or four people had crunched them, but those who spent 10 or 11 hours on one got zero credit. Rosetta staff hasn't been there to kill these, so more people are getting them as they get reissued, but then Rosetta has said that they will figure out a way to make sure everyone gets credit for at least the "long" ones.

3) All of the "DEFAULT_xxxxx_205" long-running WUs should have already cleared by now. What remains are the last traces of the "short WUs" (which could have any name). So with every hour that passes, the percentage of computation errors you get decreases.

It's not a good situation, and nobody is happy about it, but the Rosetta staff is putting safeguards in to try to prevent it from happening again, and is doing everything they can to make sure that nobody loses any more credit than they have to.

STE\/E

Joined: 17 Sep 05
Posts: 125
Credit: 4,036,695
RAC: 35,105
Message 7662 - Posted: 26 Dec 2005, 14:22:16 UTC

1) Please realize that this was two bad batches of WUs that came out just at the end of the day as they were leaving for the holidays.
=======

I don't believe all the Computation Errors are due to the bad WU's that the Project Released. I think some of it has to do with the way the BOINC Manager Adjusts the Time to Completion.

I just had 2 WU's get the Computation Error & I feel it had to do with the Completion Time being too low. If you get a run of shorter WU's, say in the 1 - 2 hour range, the Manager will adjust the Time down to that amount of time.

Then say all of a sudden you get a few 5-7 hour WU's; there's a good chance that you'll get the Computation Error once you get about 4 hours into the run because of a Time Overrun for the WU.

I had this happen to me a lot a while back at the PrimeGrid Project & I had to actually manually edit the Benchmark scores downward in the .xml files so the Manager would show more Time to Completion before the Errors stopped.

Rytis has extended the maximum execution length for the WU's several times to try & help out also, maybe they need to do that with the Rosetta WU's too.
DeHackedDragon

Joined: 24 Dec 05
Posts: 1
Credit: 112
RAC: 0
Message 7668 - Posted: 26 Dec 2005, 18:34:49 UTC - in response to Message 7494.  

I'm using 4.8 and am losing at least 50% of WUs also from "computation error". It doesn't happen on other BOINC projects, and I'm sure my computer is OK.


1) Please realize that this was two bad batches of WUs that came out just at the end of the day as they were leaving for the holidays. Yes, it was a bad time for it to happen, but "bad batches" happen to every project.

2) SETI had two of these "bad batches" in the last couple of weeks, very similar to the two Rosetta has; one that ran extremely long before failing, wasting MANY hours of CPU time, and one that failed immediately because it was "0 length". SETI staff was still there and were able to kill most of these after only three or four people had crunched them, but those who spent 10 or 11 hours on one got zero credit. Rosetta staff hasn't been there to kill these, so more people are getting them as they get reissued, but then Rosetta has said that they will figure out a way to make sure everyone gets credit for at least the "long" ones.

3) All of the "DEFAULT_xxxxx_205" long-running WUs should have already cleared by now. What remains are the last traces of the "short WUs" (which could have any name). So with every hour that passes, the percentage of computation errors you get decreases.

It's not a good situation, and nobody is happy about it, but the Rosetta staff is putting safeguards in to try to prevent it from happening again, and is doing everything they can to make sure that nobody loses any more credit than they have to.



Yeah, I have that problem too, and I'm really unhappy about it. I lost almost 50 credits to that problem of that bad batch of 205 something... :(
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,542,037
RAC: 3,110
Message 7676 - Posted: 26 Dec 2005, 20:29:25 UTC - in response to Message 7662.  

I don't believe all the Computation Errors are due to the bad WU's that the Project Released. I think some of it has to do with the way the BOINC Manager Adjusts the Time to Completion.


PoorBoy, is the "maximum CPU time" affected by the BOINC Manager settings? I understood that it was just a "cpu seconds" value passed in the WU itself. This would make slower machines more likely to hit it, but would be easier to code. If the DCF etc., _is_ taken into account, I think that would normally be a "good thing", but could be an issue here, at least on _when_ they blow up.

Have you had any CPU-time-exceeded errors on anything other than the DEFAULT_xxxxx_205's? Those monsters are going to hit _any_ reasonable limit, adjusted or not. I've only _seen_ that error on those. All the other computation errors that _I've_ seen have been on the "short WUs", the ones where the random seed is miscalculated/misread.

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7685 - Posted: 26 Dec 2005, 23:11:21 UTC - in response to Message 7676.  


PoorBoy, is the "maximum CPU time" affected by the BOINC Manager settings?...


One of my slow boxes (733MHz) reached 26 hrs on a DEFAULT_xxxx_205 before I noticed it, whereas I think I have seen other people's get cut off at around 11 hours, if I remember right. If so, then the max run time must depend in some way on the benchmarks and/or historic run lengths.

River~~
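
River's observation is consistent with a limit that scales with the host benchmark. A minimal sketch, assuming the client aborts a task once its CPU time exceeds a work bound divided by the benchmarked speed (BOINC workunits carry an `rsc_fpops_bound` field for this purpose, as far as I know; all numbers below are invented for illustration):

```python
def max_cpu_seconds(fpops_bound: float, benchmark_flops: float) -> float:
    """CPU-time limit for one task: work bound divided by benchmarked speed."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark must be positive")
    return fpops_bound / benchmark_flops


# Hypothetical numbers, chosen only to illustrate the scaling.
FPOPS_BOUND = 4.0e13               # assumed: same bound shipped to every host
fast_host = 1.0e9                  # benchmarked floating-point ops per second
slow_host = fast_host * 11 / 26    # a host that benchmarks about 42% as fast

fast_limit_h = max_cpu_seconds(FPOPS_BOUND, fast_host) / 3600
slow_limit_h = max_cpu_seconds(FPOPS_BOUND, slow_host) / 3600
print(f"fast host limit: {fast_limit_h:.1f} h")
print(f"slow host limit: {slow_limit_h:.1f} h")
```

Under this assumption a lower benchmark score means a longer limit, which would also fit the PrimeGrid workaround of editing the benchmark scores downward mentioned earlier in the thread.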
Profile stilespj

Joined: 17 Dec 05
Posts: 1
Credit: 749,870
RAC: 0
Message 7686 - Posted: 26 Dec 2005, 23:28:08 UTC

Out of the last 5 work units, all five had unrecoverable errors. In fact, since I joined Rosetta@home via BOINC, this has been typical. I see no point in wasting my computer power on this. I do not get these errors on other projects, such as SETI@home.

Maybe when Rosetta@home (via boinc) is ready for prime time, I'll be back!

Bye.

Paul
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,542,037
RAC: 3,110
Message 7690 - Posted: 26 Dec 2005, 23:37:13 UTC - in response to Message 7668.  

Yeah, I have that problem too, and I'm really unhappy about it. I lost almost 50 credits to that problem of that bad batch of 205 something... :(


Please note that the message you quoted (and others) have explained that the credits "lost" to the DEFAULT_xxxxx_205 problems will be replaced/granted after the staff returns from the holidays and the bad WUs have flushed through the system.

Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,542,037
RAC: 3,110
Message 7691 - Posted: 26 Dec 2005, 23:39:28 UTC
Last modified: 26 Dec 2005, 23:42:32 UTC

I have moved several "off-topic" messages out of this thread, over to thread 750, "Moderated Messages moved here".

Please limit comments in this thread to "computation error" issues. Thanks!

EDIT:: Scribe, I deleted your one-word response, because I could not tell which posting it was directed at, and when moved to the other thread, it made no sense at all. If you object, I'll restore it.

STE\/E

Joined: 17 Sep 05
Posts: 125
Credit: 4,036,695
RAC: 35,105
Message 7695 - Posted: 26 Dec 2005, 23:54:18 UTC
Last modified: 26 Dec 2005, 23:55:18 UTC

I don't believe all the Computation Errors are due to the bad WU's that the Project Released. I think some of it has to do with the way the BOINC Manager Adjusts the Time to Completion.


PoorBoy, is the "maximum CPU time" affected by the BOINC Manager settings? I understood that it was just a "cpu seconds" value passed in the WU itself. This would make slower machines more likely to hit it, but would be easier to code. If the DCF etc., _is_ taken into account, I think that would normally be a "good thing", but could be an issue here, at least on _when_ they blow up.

Have you had any CPU-time-exceeded errors on anything other than the DEFAULT_xxxxx_205's? Those monsters are going to hit _any_ reasonable limit, adjusted or not. I've only _seen_ that error on those. All the other computation errors that _I've_ seen have been on the "short WUs", the ones where the random seed is miscalculated/misread.


hehe ... good thing I scrolled down the page a little further because I was about to go on a Rant. I thought my Post had been completely deleted because I couldn't find it. I wondered why it would have been because I thought I gave a rational response to some of the Computation Errors.

As far as I know from watching the BOINC Manager, your settings or preferences have nothing to do with the maximum CPU time. When a Benchmark is run, it sets the Time to Completion at that time to what your Benchmark says the Computer is capable of completing the WU's in. As you run the WU's & finish them, the Manager will slowly adjust the time upwards or downwards depending on the actual amount of time you're taking to complete them.

If you get a bunch of WU's in a row that only take 2-3 hours, the Manager will eventually adjust the Time to Completion to that amount. Then you all of a sudden get a few 6-8 hour WU's and I think the Manager will Error out the WU's once they reach 4 or 5 hours of run time. That's just my feeling on the matter, but like I said it's what was happening to me over @ the PrimeGrid Site until I jacked the Time to Completion amount back up over what it was actually taking me to run the WU's.

I haven't run any of the DEFAULT_xxxxxx_205's yet because I stopped running the Project for about a week. The 2 WU's I mentioned in my Post were older WU's I still had left from when I stopped running. They both Erred out at the same time around the 3 1/2 hour mark showing 50% done. The Time to Completion for the WU's that were left to run was @ 2 1/2 hours, so right away I suspected an overrun as being the cause of the Errors & that's why I made my Post.

I Manually Re-Benchmarked the Computer and the Time to Completion jumped back up to around 5 1/2 hours & so far I haven't had another Error on that Computer again. I'm keeping an eye on it because the Time to Completion has slowly dropped back down to under 4 hours again. If it goes much lower I'm going to Manually Re-Benchmark it again to kick the Time back up again ...
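
The effect described in this post can be sketched as a running estimate that drifts toward recent run times. This is only an illustration: the smoothing rule, the `alpha` weight, and the 1.5x overrun threshold are assumptions made for the example, not BOINC's actual algorithm, and whether the client really aborts on an estimate overrun is exactly the open question in this thread.

```python
def update_estimate(estimate_h: float, actual_h: float, alpha: float = 0.5) -> float:
    """Move the estimate part of the way toward the latest actual run time."""
    return (1 - alpha) * estimate_h + alpha * actual_h


def would_overrun(estimate_h: float, actual_h: float, factor: float = 1.5) -> bool:
    """Hypothetical rule: flag a task that runs longer than factor * estimate."""
    return actual_h > factor * estimate_h


estimate = 5.5                        # hours, right after a fresh benchmark
for actual in [2.0, 2.5, 2.0, 3.0]:   # a run of short WUs (made-up values)
    estimate = update_estimate(estimate, actual)

print(f"estimate after short WUs: {estimate:.2f} h")
print("7 h WU flagged with drifted estimate:", would_overrun(estimate, 7.0))
print("7 h WU flagged with fresh estimate:", would_overrun(5.5, 7.0))
```

With these made-up numbers, a 7 hour WU is flagged only after the estimate has drifted down, which mirrors why re-benchmarking appeared to stop the errors.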
Profile ecafkid

Joined: 5 Oct 05
Posts: 40
Credit: 15,177,319
RAC: 0
Message 7696 - Posted: 26 Dec 2005, 23:54:23 UTC

I have been getting a lot of computation errors lately. Just look at my results. It seems not only to be with the 205 batch.

Ecaf.


Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,542,037
RAC: 3,110
Message 7698 - Posted: 27 Dec 2005, 0:05:24 UTC - in response to Message 7695.  

If you get a bunch of WU's in a row that only take 2-3 hours, the Manager will eventually adjust the Time to Completion to that amount. Then you all of a sudden get a few 6-8 hour WU's and I think the Manager will Error out the WU's once they reach 4 or 5 hours of run time. That's just my feeling on the matter, but like I said it's what was happening to me over @ the PrimeGrid Site until I jacked the Time to Completion amount back up over what it was actually taking me to run the WU's.


This is something that someone with a lot more knowledge of the code than I have will have to answer, but if the DCF etc. DO affect the "max time", then the project definitely needs to keep that in mind, and possibly raise it quite a bit. That would make the "extreme" cases like these "DEFAULT_xxxxx_205"s _worse_, but if the other alternative is causing _good_ WUs to error out...

Can you copy your posting or write up something on the topic and create a new thread? I'm afraid it may be "missed" by the staff if it's buried in here, and it seems to be a separate issue that they need to be aware of.

Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,542,037
RAC: 3,110
Message 7699 - Posted: 27 Dec 2005, 0:08:23 UTC - in response to Message 7696.  

I have been getting a lot of computation errors lately. Just look at my results. It seems not only to be with the 205 batch.


"Computation errors" are _NOT_ the issue with the "DEFAULT_xxxxx_205" WUs. Those "run forever" and eventually give a "maximum cpu time exceeded" error. The computation errors are from the bad handling of the random seed, and seem to be across multiple batches, almost randomly.

At this point, there's nothing to do about them other than let them fail - we can't identify them to abort them, or anything else. The project staff has already turned off the creation of _more_ of these, but doing anything else with the existing ones will have to wait for them to return after the holidays.
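
The two failure modes can be told apart roughly as follows. This is a sketch only: the name pattern is inferred from the WU names quoted in this thread, and the 60-second cutoff is an assumption, not a project rule.

```python
import re

# Assumed naming pattern for the long-running batch, e.g. "Default_1hz6_205_87_1".
LONG_BATCH = re.compile(r"^default_[^_]+_205_", re.IGNORECASE)


def classify_failure(wu_name: str, cpu_seconds: float) -> str:
    """Guess which of the two known problems a failed result belongs to."""
    if LONG_BATCH.match(wu_name):
        return "long-running 205 batch (expect 'maximum cpu time exceeded')"
    if cpu_seconds < 60:  # assumed cutoff for a near-immediate failure
        return "short WU (expect 'computation error' from the bad random seed)"
    return "unknown, needs a closer look"


# The CPU time of 12 seconds below is a made-up value for illustration.
print(classify_failure("Default_1hz6_205_87_1", 38513))
print(classify_failure("1ogw_topology_sample_207_1688_10", 12))
```

A name check like this only helps with the 205 batch; the short failures appear across batches, so the runtime is the only hint available from the client side.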

Profile ecafkid

Joined: 5 Oct 05
Posts: 40
Credit: 15,177,319
RAC: 0
Message 7700 - Posted: 27 Dec 2005, 0:12:04 UTC

Thanks Bill for your explanation. I'll just keep crunching for now.
Profile Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7721 - Posted: 27 Dec 2005, 8:38:44 UTC

Just one more point: if you look closely at most of the computation error work units, they die within a few seconds to a minute or so. As Bill said, they seem to be unstable in one way or another.

In LHC@Home we see this a lot where the work unit stops, though usually not with a computation error. :)

I am not sure that granting me 0.06 CS per failed work unit is going to substantially change my standing ... but I won't complain ... :)
truckpuller

Joined: 5 Nov 05
Posts: 40
Credit: 229,134
RAC: 0
Message 7941 - Posted: 30 Dec 2005, 0:17:36 UTC

Well, I just had 5 more jobs (out of 12) give me computation errors, and none of them was a Default 205. I did just get a Default 205 download (12/28, yesterday) and it ran ok; I was under the impression that the Default 205's were all gone. The jobs that failed with computation errors were topology ones (I think).
Visit us at Christianboards.org
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,542,037
RAC: 3,110
Message 7943 - Posted: 30 Dec 2005, 0:37:49 UTC - in response to Message 7941.  

I did just get a Default 205 download (12/28, yesterday) and it ran ok; I was under the impression that the Default 205's were all gone.


Again, "computation error" is on the _non_ "DEFAULT_xxxx_205"s... I looked through your results trying to find the 205 in question and couldn't locate it - can you give a link or a WU number? There are still some being "recycled", that had been delayed in large queues; I'm trying to see how many more hosts are likely to be bit by these.

Also, if it ran okay - how long did it take? Those are supposed to be 100x as large as the "normal" ones...

truckpuller

Joined: 5 Nov 05
Posts: 40
Credit: 229,134
RAC: 0
Message 7952 - Posted: 30 Dec 2005, 3:48:09 UTC - in response to Message 7943.  

I did just get a Default 205 download (12/28, yesterday) and it ran ok; I was under the impression that the Default 205's were all gone.


Again, "computation error" is on the _non_ "DEFAULT_xxxx_205"s... I looked through your results trying to find the 205 in question and couldn't locate it - can you give a link or a WU number? There are still some being "recycled", that had been delayed in large queues; I'm trying to see how many more hosts are likely to be bit by these.

Also, if it ran okay - how long did it take? Those are supposed to be 100x as large as the "normal" ones...


It was Default_1hz6_205_87_1. I just uploaded it, but I did not see it being uploaded in the Transfers tab. The CPU time shown was 10:41:53. I am mentioning this just to inform you that there are still some of these floating around. Thanks again.
Visit us at Christianboards.org
truckpuller

Joined: 5 Nov 05
Posts: 40
Credit: 229,134
RAC: 0
Message 7953 - Posted: 30 Dec 2005, 3:59:57 UTC
Last modified: 30 Dec 2005, 4:02:33 UTC

Just went to upload another computer and 3 out of the 4 jobs had computation errors also and the jobs are as follows.
1ogw_topology_sample_207_1688_10
1ogw_topology_sample_207_12547_8
1ogw_topology_sample_207_9521_4

As I mentioned before, I had several of these topology jobs fail on another machine; I am uploading the above jobs now. When I went to upload the jobs, they did not show up in the Transfers section as being uploaded.
Visit us at Christianboards.org
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,542,037
RAC: 3,110
Message 7954 - Posted: 30 Dec 2005, 4:14:23 UTC - in response to Message 7952.  

can you give a link or a WU number?


The Default_1hz6_205_87_1


The name doesn't help; it's not shown on the "list of results" page. And I don't even know which of your computers this is... However, the "10 hours+" was enough, I just looked at the latest results from each of your computers until I saw one that was 38,000+ seconds, then dug down to find the name. It did _NOT_ complete successfully; it shows client error, as expected.
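
For reference, the CPU time of 10:41:53 reported earlier works out to 38,513 seconds, which is why scanning for a result over 38,000 seconds finds it. A small helper for the conversion (the function name is made up for this example):

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' CPU-time string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


print(hms_to_seconds("10:41:53"))  # 38513
```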

The good news is that the WU errors status is "cancelled", meaning it will not be sent to a third person. (This one was in the first person's cache for eight days before they aborted it, as requested.)

So the "205"s are coming along well. I also found your "short" computation-error results from today on another of your computers. The one I looked at in detail had been processed by five people now, but it _also_ shows to be "cancelled". So... someone of the project staff apparently has managed to figure out how to identify these, and has cancelled the resending of them; all that remains is for those who have already downloaded these and have them in their cache to return the final failure.

Progress! :-)

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7972 - Posted: 30 Dec 2005, 8:39:38 UTC

Bill, I have just checked a couple of the 'short' failures visible in my results, one with 10 sendings, the other with 4; both have now been set to 'cancelled', so you are probably correct about the staff finding a way...



©2024 University of Washington
https://www.bakerlab.org