again on on computation error

Message boards : Number crunching : again on on computation error

Alessandro Freda
Joined: 17 Dec 05
Posts: 2
Credit: 410,881
RAC: 0
Message 7880 - Posted: 29 Dec 2005, 12:27:48 UTC

One question: is it true, as was written somewhere (I don't remember where), that changing the "Leave applications in memory while preempted" preference under "processor usage" from No to Yes can help to solve the problem?

So far, after making this change on my account, I cannot say whether there is an improvement.

In case it is useful (I hope the developers can investigate all the errored results), these are the names of the WUs that failed for me in the last day:

1hz6A_topology_sample_207_9720_3
1ogw__topology_sample_207_489_3
1hz6A_topology_sample_207_7251_4
DEFAULT_1di2_220_3000_0

All failed with exit code -1073741819 (0xc0000005).
The first three used about 0 CPU time; the last one instead wasted 2 or 3 hours of CPU time.

Regards,
Alessandro
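
A note on the exit code quoted above: the two forms are the same 32-bit value, read once as a signed integer and once as unsigned hex. On Windows, 0xC0000005 is the status code for an access violation, meaning the application touched memory it was not allowed to use. A minimal Python sketch of the conversion follows; the function names are illustrative only and are not part of BOINC or Rosetta.

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def hex_to_signed(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


print(exit_code_to_hex(-1073741819))  # 0xc0000005
print(hex_to_signed(0xC0000005))      # -1073741819
```

The code says what kind of crash happened, not why, so it cannot by itself distinguish a bad work unit from a preemption problem.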



River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7882 - Posted: 29 Dec 2005, 13:23:47 UTC - in response to Message 7880.  

[quote]One question: is it true, as was written somewhere (I don't remember where), that changing the "Leave applications in memory while preempted" preference under "processor usage" from No to Yes can help to solve the problem?

So far, after making this change on my account, I cannot say whether there is an improvement.

In case it is useful (I hope the developers can investigate all the errored results), these are the names of the WUs that failed for me in the last day:

1hz6A_topology_sample_207_9720_3
1ogw__topology_sample_207_489_3
1hz6A_topology_sample_207_7251_4
DEFAULT_1di2_220_3000_0

All failed with exit code -1073741819 (0xc0000005).
The first three used about 0 CPU time; the last one instead wasted 2 or 3 hours of CPU time.

Regards,
Alessandro[/quote]




It solves some problems, but it does not solve them all.

I am getting the same exit code as you, and I use the "yes" setting.

Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7883 - Posted: 29 Dec 2005, 13:32:46 UTC - in response to Message 7882.  

[quote]One question: is it true, as was written somewhere (I don't remember where), that changing the "Leave applications in memory while preempted" preference under "processor usage" from No to Yes can help to solve the problem?[/quote]


The first three are clearly from the bad batch mentioned in other threads. Nothing you can do about those. The last one is not from that batch but without more details it's hard to tell what caused the error.

Changing the setting to "yes" would probably have prevented the error if it happened during benchmarks or switching to another application.
*** Join BOINC@Australia today ***

Mike Smith
Joined: 27 Dec 05
Posts: 2
Credit: 3,913
RAC: 0
Message 7886 - Posted: 29 Dec 2005, 13:46:52 UTC

Not sure if I am on the same wavelength here, but not a single unit that I've done has actually given me a result; I'm always getting client errors. This happens only on Rosetta; I'm doing other projects just fine. Any ideas?

STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 3,283,799
RAC: 1,769
Message 7887 - Posted: 29 Dec 2005, 13:47:25 UTC

I had 1 WU yesterday that I decided to suspend while I ran some WUs from another project. As soon as I suspended the WU, it gave me a computation error.

It seems the Rosetta WUs are very touchy and error out at the least little thing. IMO, I've found that running the Rosetta project by itself is the best way to go. And all you can do is hope and pray that you don't have to suspend the project or exit the BOINC Manager while running it.

If you have to do either of those things, it's a toss-up whether the WU will crash or not. Because of the long time between checkpoints, it's best to suspend or exit the Manager only when you actually see the WU advance to the next percentage point; if you don't, you may lose an hour or more of crunching time anyway.

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7890 - Posted: 29 Dec 2005, 14:05:03 UTC - in response to Message 7886.  

[quote]Not sure if I am on the same wavelength here, but not a single unit that I've done has actually given me a result; I'm always getting client errors. This happens only on Rosetta; I'm doing other projects just fine. Any ideas?[/quote]


Mike,

I have a total of 28 boxes on BOINC (not all owned by me by the way), of which 25 were crunching Rosetta over the holidays. Four of those boxes got almost no credit at all including one that got none at all. Most of the rest of my boxes lost some credit at some stage, but it is very variable.

Is this the luck of the draw, the bad wu going out in clusters?

Is it the case that some bad wu 'poison' the box for others?

Those are two good guesses but none of us really know.

In your position I would first make sure that I had the 'keep in memory' setting at 'yes'.

Assuming that is so, I would try resetting the project the next time a wu fails. Before you do that, set nomorework and then update the project so the failed wu gets reported. The reset forces all the project-dependent files to be downloaded again.

Then if it happens yet again, I'd be inclined to detach and re-attach, either re-attaching right away or in mid Jan if you are fed up with getting nowhere. This also downloads all the project-dependent files, and in addition gives you a new set of host records on the Rosetta databases. (Later on you can merge the old host into the new one.)

Resetting and re-attaching will only help if there is some history dependent effect in all of this - but it won't harm even if it turns out to be a waste of the extra downloads.
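
The sequence described above (stop new work, report the failed result, then reset, and only later detach) can be sketched as a list of commands for BOINC's command-line tool. This is a rough sketch under assumptions: it uses the `boinccmd --project URL operation` form from current BOINC clients, which may not match the client available when this thread was written; `PROJECT_URL` is a placeholder, and `build_recovery_commands` is a hypothetical helper. The script only prints the commands and does not run them.

```python
from typing import List

# Placeholder: substitute the project URL shown in your own BOINC Manager.
PROJECT_URL = "https://example.org/rosetta/"


def build_recovery_commands(url: str, detach: bool = False) -> List[List[str]]:
    """Return boinccmd invocations in the order suggested in the post.

    detach=False: nomorework -> update -> reset
    detach=True:  nomorework -> update -> detach (re-attach is done separately)
    """
    final_step = "detach" if detach else "reset"
    return [
        ["boinccmd", "--project", url, operation]
        for operation in ("nomorework", "update", final_step)
    ]


# Print the commands so they can be reviewed and run by hand. The update is
# a request to contact the server, so let it finish reporting the failed
# result before issuing the last command.
for command in build_recovery_commands(PROJECT_URL):
    print(" ".join(command))
```

If the client still fetches no work after a reset, `allowmorework` is the matching operation to undo `nomorework`. Re-attaching after a detach needs the account key as well as the URL, so it is left out of the sketch.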

In my view, even if it is a user settings issue, it is not a good time to troubleshoot settings when some of the wu are still dodgy.

Thanks for your patience - I would not have stuck around if my worst box had been my only one. I *would* have come back mid Jan, because I really like the attitude of the project staff.

River~~

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7891 - Posted: 29 Dec 2005, 14:20:18 UTC - in response to Message 7887.  

[quote]I had 1 WU yesterday that I decided to suspend while I ran some WUs from another project. As soon as I suspended the WU, it gave me a computation error.[/quote]


hmmmmm...

I use suspend from time to time. I wonder how many of my problems have been caused by that.

The problem with stopping BOINC is known; it is linked to the keep in memory issue (obviously BOINC does not keep tasks in memory if BOINC itself stops!)

Likewise, when keep in memory = no, you'd expect suspend to trigger an error as the task is taken out of memory.

If you have keep in memory = yes and still see problems with suspend, then I think it is one more thing worth reporting.

Your general point is just right - Rosetta is a fragile set of software at present. It will get better as Jack & DavidK find the bugs. We have certainly given the bug-hunters a target-rich environment this holiday!

River~~

Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7893 - Posted: 29 Dec 2005, 14:36:26 UTC

Further to what I wrote below...

I run both SETI and Rosetta on most of my computers (currently on a 50/50 basis). I have "Leave applications in memory while preempted" set to "yes" and (apart from the recent bad batch) rarely have a problem.

I can suspend individual work units. I can suspend Rosetta while it's running a work unit. I can switch to SETI. Not a problem. But if I set the above to "no", I will get many crashes, guaranteed.
*** Join BOINC@Australia today ***
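
For anyone who wants to check the setting on the machine itself rather than on the website: in current BOINC clients the preference discussed in this thread is stored under the tag `leave_apps_in_memory`, and a local `global_prefs_override.xml` in the BOINC data directory can override the web value. The sketch below assumes that modern 0/1 format (the 2005-era client may have stored it differently); the sample XML and the helper name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Illustrative sample only; a real override file usually holds more settings.
SAMPLE_PREFS = """
<global_preferences>
    <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
"""


def leaves_apps_in_memory(prefs_xml: str) -> Optional[bool]:
    """Return True/False if the preference is set in this XML, else None."""
    root = ET.fromstring(prefs_xml.strip())
    node = root.find("leave_apps_in_memory")
    if node is None or node.text is None or not node.text.strip():
        return None  # not set here, so another source decides the value
    return float(node.text) != 0


print(leaves_apps_in_memory(SAMPLE_PREFS))  # True
```

A result of `None` only means this particular file does not set the preference; the value from the website preferences would then apply.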

Mike Smith
Joined: 27 Dec 05
Posts: 2
Credit: 3,913
RAC: 0
Message 7896 - Posted: 29 Dec 2005, 15:07:52 UTC

Have set to YES. I also am running SETI and Predictor at 33% each and have no problems at all with these other two over the past several years.

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7901 - Posted: 29 Dec 2005, 16:05:57 UTC - in response to Message 7896.  

[quote]Have set to YES. I also am running SETI and Predictor at 33% each and have no problems at all with these other two over the past several years.[/quote]


In that case you have probably just had an unlucky selection of wu. It's worth a reset of Rosetta, I'd say.

R~~

Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,488,140
RAC: 10
Message 7933 - Posted: 29 Dec 2005, 23:19:13 UTC - in response to Message 7896.  

[quote]Have set to YES. I also am running SETI and Predictor at 33% each and have no problems at all with these other two over the past several years.[/quote]


No project is "perfect" - but the three that have the MOST complaints seem to be SETI, Predictor, and SZTAKI. If you've been able to function well with SETI and Predictor, I would almost guarantee that your current Rosetta problems are very temporary.

Dirk Broer
Joined: 16 Nov 05
Posts: 22
Credit: 2,947,217
RAC: 4,673
Message 7977 - Posted: 30 Dec 2005, 9:22:33 UTC - in response to Message 7933.  
Last modified: 30 Dec 2005, 9:23:27 UTC

[quote][quote]Have set to YES. I also am running SETI and Predictor at 33% each and have no problems at all with these other two over the past several years.[/quote]

No project is "perfect" - but the three that have the MOST complaints seem to be SETI, Predictor, and SZTAKI. If you've been able to function well with SETI and Predictor, I would almost guarantee that your current Rosetta problems are very temporary.[/quote]


My last 24 Rosetta WUs gave only ONE normally completed result; the rest were all computation errors. I never had that with SETI or Predictor. Their problems seem more network/hardware related, whilst Rosetta seems to have corrupt data to begin with.

Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,488,140
RAC: 10
Message 7979 - Posted: 30 Dec 2005, 9:32:37 UTC - in response to Message 7977.  
Last modified: 30 Dec 2005, 9:35:13 UTC

[quote]My last 24 Rosetta WUs gave only ONE normally completed result; the rest were all computation errors. I never had that with SETI or Predictor. Their problems seem more network/hardware related, whilst Rosetta seems to have corrupt data to begin with.[/quote]


Just looked at your results - the errors LOOK like the "application left in memory when preempted = no" errors. If that setting isn't "yes", then Rosetta _will_ fail, frequently, as covered in this and many other threads... it's a known bug, that they're chasing. (Much less fatal than Predictor's "fortran error", or "cypa" WUs that ran for longer than the extremely short deadline, or...) There was also a bad batch of WUs released, that at this point are all cleared out, unless you still have one in your cache from before today.

Dirk Broer
Joined: 16 Nov 05
Posts: 22
Credit: 2,947,217
RAC: 4,673
Message 7994 - Posted: 30 Dec 2005, 16:35:32 UTC - in response to Message 7979.  
Last modified: 30 Dec 2005, 16:42:14 UTC

[quote][quote]My last 24 Rosetta WUs gave only ONE normally completed result; the rest were all computation errors. I never had that with SETI or Predictor. Their problems seem more network/hardware related, whilst Rosetta seems to have corrupt data to begin with.[/quote]

Just looked at your results - the errors LOOK like the "application left in memory when preempted = no" errors. If that setting isn't "yes", then Rosetta _will_ fail, frequently, as covered in this and many other threads... it's a known bug, that they're chasing. (Much less fatal than Predictor's "fortran error", or "cypa" WUs that ran for longer than the extremely short deadline, or...) There was also a bad batch of WUs released, that at this point are all cleared out, unless you still have one in your cache from before today.[/quote]


But the setting is 'YES'! (Correction: that is to say, it looks that way when viewed in the Windows Task Manager: Rosetta is active no matter which other BOINC project is running. I have just altered my settings in the preferences to see whether this will have some benefit.) I have no known 'bad' WUs, so I'll keep my fingers crossed.

Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 7997 - Posted: 30 Dec 2005, 18:04:09 UTC - in response to Message 7979.  

[quote]If that setting isn't "yes", then Rosetta _will_ fail, frequently, as covered in this and many other threads...[/quote]


This isn't entirely accurate... If Rosetta is the only project a computer is attached to, or if the other projects are suspended, then Rosetta will get 100% of the computer's resources and this setting does not matter.

Rosetta WUs are indeed having problems right now. As more and more "bad" work units are getting recycled, the ratio of bad to good tilts very much in favor of the bad. We will just have to work through them. The project leaders have said they will sort this out after the holiday.

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7998 - Posted: 30 Dec 2005, 18:19:02 UTC - in response to Message 7997.  

[quote]Rosetta WUs are indeed having problems right now. As more and more "bad" work units are getting recycled, the ratio of bad to good tilts very much in favor of the bad. We will just have to work through them. The project leaders have said they will sort this out after the holiday.[/quote]
Not quite true now: as said in another thread, all the bad ones appear to have been set to 'cancelled', so once sent back they should not go out again.
©2024 University of Washington
https://www.bakerlab.org