Please abort WUs with

Profile Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7000 - Posted: 21 Dec 2005, 13:18:15 UTC - in response to Message 6991.  
Last modified: 21 Dec 2005, 13:23:36 UTC

The "short" failures shouldn't add up to more than a minute or two on average for everyone


No problem with that other than the wasted bandwidth of downloading them.

But it looks like the admins haven't changed the settings for "max # of error results" - WUs that have already crashed on more than one system in rapid succession are still being sent out (e.g. WU 3821321). Waste of bandwidth for the project too.

Suggest the admins change that setting as soon as they get to work in the morning.

Unless, of course, the fixes/workarounds that Jack referred to mean those WUs can still be run to completion by the next person to download them.

EDIT: I don't see any evidence of that - I get about 3 WUs at a time and most error out. I get lucky with about 1 in every 10 WUs.
*** Join BOINC@Australia today ***
ID: 7000 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,494,611
RAC: 615
Message 7001 - Posted: 21 Dec 2005, 13:23:43 UTC - in response to Message 7000.  

But it looks like the admins haven't changed the settings for "max # of error results" ...
Suggest the admins change that setting as soon as they get to work in the morning.


Definitely needs to be done, just on general principles, but I think it's too late for those that are already "out the door". That setting is stored in the WU itself... I don't think it's possible to modify them after the fact. It _should_ be possible to stop them from being reissued though, on the server side.
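
A rough sketch of the kind of server-side check being discussed, assuming each WU carries its own error limit and the server counts errored results before sending out another copy. The names `WorkUnit`, `max_error_results` and `should_reissue`, the comparison rule, and the limit of 3 are all illustrative assumptions, not BOINC's actual code or Rosetta's actual setting.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    """Hypothetical stand-in for a work unit record on the server."""
    wu_id: int
    max_error_results: int  # limit stored with the WU when it was created
    error_count: int = 0    # errored results reported so far


def should_reissue(wu: WorkUnit) -> bool:
    """Return True while the WU is still within its error limit (assumed rule)."""
    return wu.error_count <= wu.max_error_results


# The limit value here is made up for illustration.
wu = WorkUnit(wu_id=3821321, max_error_results=3)
for _ in range(5):
    wu.error_count += 1
    print(wu.error_count, should_reissue(wu))
```

With a low limit, a WU that crashes on several hosts in quick succession stops being handed out after a few failures, which is the bandwidth saving being asked for.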

ID: 7001 · Rating: 0
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7006 - Posted: 21 Dec 2005, 13:43:08 UTC

Mine now seem to be two thirds crashing, one third running ok.
ID: 7006 · Rating: 0
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7021 - Posted: 21 Dec 2005, 15:08:29 UTC

Over 90% now failing.....gonna suspend :-(
ID: 7021 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,494,611
RAC: 615
Message 7024 - Posted: 21 Dec 2005, 15:40:18 UTC
Last modified: 21 Dec 2005, 15:43:29 UTC

If you are on dial-up or pay for bandwidth per download, then at this point you should definitely suspend Rosetta. It's still early morning out there; hopefully someone will be coming in soon with a better answer. If processing these doesn't bother you or cost you anything, then letting them continue will at least help get them out of the system a little quicker. At most you'll get 100/day, as that's the maximum quota. If that quota drops too far due to the errors, it may affect your ability to process Rosetta as much as you normally do, until it climbs again.

ID: 7024 · Rating: 0
Profile Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7025 - Posted: 21 Dec 2005, 15:40:35 UTC - in response to Message 7021.  
Last modified: 21 Dec 2005, 15:42:33 UTC

Over 90% now failing.....gonna suspend :-(


Ditto. I'm not going to donate more of my bandwidth to Rosetta until these bad work units are history. At least 90% are crashing.

*** Join BOINC@Australia today ***
ID: 7025 · Rating: 0
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7027 - Posted: 21 Dec 2005, 15:46:40 UTC - in response to Message 7024.  

If you are on dial-up or pay for bandwidth per download, then at this point you should definitely suspend Rosetta. It's still early morning out there; hopefully someone will be coming in soon with a better answer. If processing these doesn't bother you or cost you anything, then letting them continue will at least help get them out of the system a little quicker. At most you'll get 100/day, as that's the maximum quota. If that quota drops too far due to the errors, it may affect your ability to process Rosetta as much as you normally do, until it climbs again.


I must have got quite close to the 100 per day before suspending, so I am glad I did, if reaching 100 would have meant not getting any good ones either....
ID: 7027 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,494,611
RAC: 615
Message 7029 - Posted: 21 Dec 2005, 16:02:26 UTC - in response to Message 7027.  

I must have got quite close to the 100 per day before suspending, so I am glad I did, if reaching 100 would have meant not getting any good ones either....


Your quotas are still 94 and 88. The rule is "one error, quota reduced by one", but then it's "one good result, quota increased by 2, then 4, then 8..." So it takes 100 errors in a row to actually drag you to the bottom, and only a handful of good results to get you back up. Because of the "error... error... good..." pattern of these, unless you can do more than 88/day, you're not in any trouble yet. :-) Rosetta's quotas are extremely generous. Einstein's limit, for example, is 8.
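
A small simulation of the quota rule as described in this post, offered as a sketch rather than the scheduler's real logic. It assumes a floor of 1 and a ceiling of 100 per CPU, and it assumes an error resets the growing "good result" step back to 2; the function name `apply_results` and those details are illustrative guesses.

```python
MAX_QUOTA = 100  # Rosetta's per-CPU daily maximum, per the thread
MIN_QUOTA = 1


def apply_results(results, quota=MAX_QUOTA):
    """Apply a sequence of 'error'/'good' outcomes and return the final quota.

    Assumed rule: each error lowers the quota by one; consecutive good
    results raise it by 2, 4, 8, ... and an error resets that step to 2.
    """
    step = 2
    for outcome in results:
        if outcome == "error":
            quota = max(MIN_QUOTA, quota - 1)
            step = 2
        else:
            quota = min(MAX_QUOTA, quota + step)
            step *= 2
    return quota


print(apply_results(["error"] * 6))                     # six errors in a row
print(apply_results(["error", "error", "good"] * 10))   # mixed pattern
print(apply_results(["good"] * 6, quota=MIN_QUOTA))     # recovery from the floor
```

Under these assumptions, six straight errors leave a quota of 94, the "error, error, good" pattern keeps the quota at or near 100, and six straight good results bring a quota of 1 back up to 100.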

ID: 7029 · Rating: 1
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7031 - Posted: 21 Dec 2005, 16:12:06 UTC

Cheers for the info Bill.....I think I will still stick with the suspend though and let Climate and WCG have a boost.
ID: 7031 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,494,611
RAC: 615
Message 7033 - Posted: 21 Dec 2005, 16:18:21 UTC - in response to Message 7031.  

Cheers for the info Bill.....I think I will still stick with the suspend though and let Climate and WCG have a boost.


No problem here; everyone should do whatever they're most comfortable with, I just don't want everyone reading the thread to say "I GOTTA SUSPEND" from over-reacting. If my PC was _functional_ (mutter, mutter...) I'd just let it run, but then I'm on an unlimited-download broadband cable connection, so it wouldn't hurt anything. My Mac Mini is still crunching Rosetta, but at 12-20 hours/WU, 50:50 with Einstein, it's only on the first to be downloaded since all this started, and it seems to be okay. I think it's safe to assume that _that_ machine will never have a quota problem...

ID: 7033 · Rating: 0
PCZ

Joined: 16 Sep 05
Posts: 26
Credit: 2,024,330
RAC: 0
Message 7036 - Posted: 21 Dec 2005, 16:30:01 UTC
Last modified: 21 Dec 2005, 16:40:23 UTC

Well, I'm seeing "reached daily quota of 49 results" on quite a few of my boxes.
The daily limit soon falls from 100.
51 bad WUs and you're out of there.

Boincview is a sea of red :(


ID: 7036 · Rating: 0
Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,255,707
RAC: 62,366
Message 7040 - Posted: 21 Dec 2005, 16:44:11 UTC

Hm, I never looked at how big Rosetta's WUs are to download ...

16 of my 17 boxes are on an internet connection where I have to pay for each GB, but it's not too much. So, if the WUs are not too big, I'll keep them downloading ...

Any hints for me?


Supporting BOINC, a great concept !
ID: 7040 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,494,611
RAC: 615
Message 7043 - Posted: 21 Dec 2005, 17:18:56 UTC - in response to Message 7036.  

Well, I'm seeing "reached daily quota of 49 results" on quite a few of my boxes.
The daily limit soon falls from 100.
51 bad WUs and you're out of there.


Well, this says that you have somewhere around 49 results either on your system, or completed within the last 24 hours... That's not _too_ bad, and remember that each "success" brings it up faster than the errors brought it down! I assume you're still crunching Rosetta, and not out of work?

I originally thought the quota of 100 here was way too high for the time the results take. Even my "fast" PC can't do more than about 12-15/day. Assuming some of you have boxes twice that fast, I figured 50 should be plenty for anyone, even allowing for a bunch of download or computing errors. Never anticipated this large a 'bad batch'! (Oh, the quota is "per CPU", not "per user" or "per host", so multi-core hosts get more, etc.)

The danger in having a _high_ quota can be seen at SETI. There are any number of broken boxes out there that are getting 1 WU/day, erroring, then waiting. But the first day, they killed off 100 WUs, that had to be re-issued, quorums delayed, etc.

The danger in having a _low_ quota is that a few errors can cause a machine to lose crunching time once it bottoms out and before the first 'good' result is validated. That's not an issue at Rosetta, because it is validated almost instantly. Even if you reach a quota of 1, you'll lose only the time between the first "good" one to complete and upload, and when it is reported. (Except of course if you don't have regular net access.)

ID: 7043 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,494,611
RAC: 615
Message 7044 - Posted: 21 Dec 2005, 17:27:45 UTC - in response to Message 7040.  

Hm, I never looked at how big Rosetta's WUs are to download ...


Don't have my PC, so this is Mac, but it looks like around 1.8MB. There's an awful lot of "extra" stuff in there, not just a single file, so this may not be an "every time" measurement, but that's what has today's date on it, and I only have 1 WU at a time.
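
For anyone paying per GB, here is a back-of-the-envelope upper bound based on that rough 1.8MB figure and the 100/day per-CPU quota mentioned earlier in the thread. It is only a sketch: it assumes every quota slot is used and every WU is about the same size, and the 16-CPU case assumes 16 single-CPU boxes. The function name `worst_case_mb_per_day` is made up for illustration.

```python
WU_SIZE_MB = 1.8         # rough per-WU download size measured above (Mac)
MAX_WUS_PER_DAY = 100    # daily quota per CPU mentioned in this thread


def worst_case_mb_per_day(cpus: int) -> float:
    """Upper bound on daily download volume if every quota slot is used."""
    return cpus * MAX_WUS_PER_DAY * WU_SIZE_MB


for cpus in (1, 16):
    mb = worst_case_mb_per_day(cpus)
    print(f"{cpus} CPU(s): {mb:.0f} MB/day (~{mb / 1000:.2f} GB)")
```

That works out to about 180 MB/day for one CPU and about 2.88 GB/day for 16, and real usage should be lower because the quota shrinks as errors pile up.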

Hm. Apparently my Mac WU finished and I got another, and it errored out after 30 seconds... so this is definitely cross-platform.

The project just quit sending out work, so apparently they're there and working on it!

ID: 7044 · Rating: 0
Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,255,707
RAC: 62,366
Message 7045 - Posted: 21 Dec 2005, 17:28:20 UTC
Last modified: 21 Dec 2005, 17:30:10 UTC

At the moment, this problem seems to start a deadly cycle ...

Boxes ask for work more often:


  • more traffic on your side
  • scheduler not fast enough



I have several boxes that haven't received work in the last 60 minutes, even though they haven't reached the daily quota.

And I see boxes that have delayed or interrupted downloads.

Please, keep an eye on the servers ...




Supporting BOINC, a great concept !
ID: 7045 · Rating: 0
Nic

Joined: 7 Oct 05
Posts: 3
Credit: 18,542
RAC: 0
Message 7046 - Posted: 21 Dec 2005, 17:48:48 UTC

Just to let everyone know, I've gotten errors on 205's, 207's, and now 203's.
ID: 7046 · Rating: 0
PCZ

Joined: 16 Sep 05
Posts: 26
Credit: 2,024,330
RAC: 0
Message 7054 - Posted: 21 Dec 2005, 18:01:59 UTC

Those boxes are out of work.
As I said, 51 duff WUs in a row and your daily limit is then lower than the number you have already downloaded.

Successful WUs may well increase your daily limit, but it never rises above 100 per CPU.
So you get 51 crappy WUs in a row and you ain't crunching anymore.

It took very little time to download those duff WUs.
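
The arithmetic behind that 51/49 figure can be sketched as follows. This assumes the quota drops by one per error and that a host is refused work once it has downloaded more results than its current quota; the function name `lockout_point` and the exact comparison are assumptions for illustration, not the scheduler's actual code.

```python
def lockout_point(max_quota: int = 100):
    """Count consecutive errors until downloads exceed the shrinking quota.

    Assumes the quota drops by one per error and that a host is refused
    work once it has downloaded more results than its current quota.
    """
    quota = max_quota
    downloaded = 0
    while downloaded <= quota:
        downloaded += 1   # fetch one more WU...
        quota -= 1        # ...which then errors out
    return downloaded, quota


print(lockout_point())
```

Starting from 100, the loop stops at 51 downloads against a quota of 49, which matches the "reached daily quota of 49 results" message.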

ID: 7054 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,663,494
RAC: 723
Message 7055 - Posted: 21 Dec 2005, 18:05:21 UTC
Last modified: 21 Dec 2005, 18:19:32 UTC

Are these default_205's worked out of the system now? I am leaving for the holidays tomorrow morning and will be really annoyed if, when I come back, I find I have been grinding away on a no-hope unit simply because I was not home to abort it.

All of my WUs at the moment seem to be aborting quickly. This is not really bothering me, and if it helps get the duffers out of the system, great.

It is the other issue that bothers me.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 7055 · Rating: 0
Profile Jack Schonbrun

Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 7056 - Posted: 21 Dec 2005, 18:07:45 UTC
Last modified: 21 Dec 2005, 18:09:00 UTC

I understand people wanting to suspend Rosetta during this time. Completely reasonable.

We really don't want this to happen again, and so will obviously make it top priority to implement safeguards and a more rigorous testbed. The max error results allowed clearly isn't helping matters either.

I'm sure you don't want to hear me apologize again, so instead I'll tell you what happened. In previous WUs, we took our random number seed from your system clock. This isn't quite optimal, because the resolution we were using was one second, and it's very possible for two WUs to start in the same second on different computers. We were getting a number of duplicate jobs. So, in an effort to reduce CPU cycles wasted on identical jobs, we decided to send out seeds with each WU. These seeds come from a list.
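
To illustrate the duplicate-job issue, here is a minimal sketch that uses Python's `random` module as a stand-in generator; it is not Rosetta's RNG or its seeding code. The start times, seed values and the helper `run_job` are all hypothetical.

```python
import random


def run_job(seed: int, n: int = 3):
    """Return the first n draws of a job seeded with the given integer."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Two hosts starting within the same second get the same clock-derived seed.
start_a = 1135188481.20   # hypothetical start times, in seconds
start_b = 1135188481.95
same_second = run_job(int(start_a)) == run_job(int(start_b))

# Seeds handed out from a list are distinct by construction.
seed_list = [1001, 1002]  # made-up seed values
from_list = run_job(seed_list[0]) == run_job(seed_list[1])

print(same_second, from_list)
```

The two clock-seeded jobs produce identical streams, so they would repeat the same work, while the two list-seeded jobs diverge.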

It turns out that there was a long-standing bug in our seed-reading code that mangles a significant fraction of seeds. This sometimes causes the random number stream to be corrupted and to give values out of range. We had not detected it before because we had not previously been systematically sampling the whole range of seeds.

The "work-around," until we can get a new executable out that fixes the bug, is to go back to using seeds corresponding to values that do not get mangled. I hope we can prevent this thread from becoming another forum on pseudo-random number generation.

Again, the problem was with our seed reader, not the generator itself. But when David Baker comes back from vacation, I will use this as an opportunity to advocate for implementing a more robust RNG in Rosetta, most likely the Mersenne Twister that has been suggested in these forums and in my discussions with other scientists.
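
As a rough idea of what a seed-sweeping safeguard could look like, the sketch below seeds a Mersenne Twister (Python's `random` module uses one) with each seed in a range and checks that every draw stays in [0, 1). It is an illustration only, not Rosetta's test code; the helper names `draws_in_range` and `sweep` and the seed range are hypothetical.

```python
import random


def draws_in_range(seed: int, n: int = 200) -> bool:
    """Check that n draws from a generator seeded with `seed` lie in [0, 1)."""
    rng = random.Random(seed)  # CPython's random module uses the Mersenne Twister
    return all(0.0 <= rng.random() < 1.0 for _ in range(n))


def sweep(seeds):
    """Return the seeds whose random streams produced an out-of-range value."""
    return [s for s in seeds if not draws_in_range(s)]


# Hypothetical seed range; a real test would cover the seeds actually issued.
bad = sweep(range(0, 2000))
print(len(bad))
```

A sweep like this, run over the full list of seeds before release, is the sort of systematic sampling that would flag a seed whose stream goes out of range.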

Thanks again for your patience,
Jack
ID: 7056 · Rating: 0
Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,255,707
RAC: 62,366
Message 7062 - Posted: 21 Dec 2005, 18:21:31 UTC

Thanks for the open information, Jack :-)

Is it known whether all of the "bad" WUs are now out of the "download sequence" and we are back to business as usual, or are these faulty WUs still in there?



Supporting BOINC, a great concept !
ID: 7062 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org