Message boards : Number crunching : Please abort WUs with
Author | Message |
---|---|
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
"The 'short' failures shouldn't add up to more than a minute or two on average for everyone" No problem with that, other than the wasted bandwidth of downloading them. But it looks like the admins haven't changed the setting for "max # of error results": WUs that have already crashed on more than one system in rapid succession are still being sent out (e.g. WU 3821321). That is a waste of bandwidth for the project too. I suggest the admins change that setting as soon as they get to work in the morning, unless of course the fixes/workarounds that Jack referred to mean those WUs can still be run to completion by the next person to download them. EDIT: I don't see any evidence of that. I get about 3 WUs at a time and most error out; I get lucky with about 1 in every 10. *** Join BOINC@Australia today *** |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"But it looks like the admins haven't changed the settings for 'max # of error results' ..." That definitely needs to be done, just on general principles, but I think it's too late for the WUs that are already "out the door". That setting is stored in the WU itself, and I don't think it's possible to modify it after the fact. It _should_ be possible to stop those WUs from being reissued on the server side, though. |
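The point about the limit living in the WU itself can be illustrated with a minimal sketch. This is not the BOINC server's actual code: `WorkUnit`, `should_reissue`, and `record_error` are hypothetical names, and the only assumption is that each work unit carries its own error limit, fixed at creation time.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    """Hypothetical stand-in for a work unit record; not BOINC's real schema."""
    name: str
    max_error_results: int  # limit fixed when the work unit is created
    error_count: int = 0


def should_reissue(wu: WorkUnit) -> bool:
    """Reissue only while the error count is still below the unit's own limit."""
    return wu.error_count < wu.max_error_results


def record_error(wu: WorkUnit) -> bool:
    """Count one failed result and report whether another copy would go out."""
    wu.error_count += 1
    return should_reissue(wu)
```

Under this assumption, lowering the project-wide default only affects newly created work units; a unit created with a high limit keeps being reissued until its own limit is reached or the server stops it some other way.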
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
Mine now seem to be two thirds crashing, one third running ok. |
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
Over 90% now failing.....gonna suspend :-( |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
If you are on dial-up or pay for bandwidth per download, then at this point you should definitely suspend Rosetta. It's still early morning out there; hopefully someone will be coming in soon with a better answer. If processing these doesn't bother you or cost you anything, then letting them continue will at least help get them out of the system a little quicker. At most you'll get 100/day, as that's the maximum quota. If that quota drops too far because of the errors, it may limit how much Rosetta work you can process until it climbs again. |
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
"Over 90% now failing.....gonna suspend :-(" Ditto. I'm not going to donate more of my bandwidth to Rosetta until these bad work units are history. At least 90% are crashing. *** Join BOINC@Australia today *** |
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
"If you are on dial-up or pay for bandwidth per download, then at this point you should definitely suspend Rosetta. It's still early morning out there, hopefully someone will be coming in soon with a better answer. If processing these doesn't bother you or cost you anything, then letting them continue will at least help get them out of the system a little quicker. At most you'll get 100/day as that's the maximum quota. If that quota drops too far due to the errors, it may affect your ability to process Rosetta as much as you normally do, until it climbs again." I must have got quite close to the 100 per day before suspending, so I am glad I did if reaching 100 would have meant not getting any good ones either.... |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"I must have got quite close to the 100 per day before suspending so I am glad I did if that meant that reaching 100 I would not get any good ones as well...." Your quotas are still 94 and 88. The rule is "one error, quota reduced by one", but then it's "one good result, quota increased by 2, then 4, then 8..." So it takes 100 errors in a row to actually drag you to the bottom, and only a handful of good results to get you back up. Because of the "error... error... good..." pattern of these, unless you can do more than 88/day, you're not in any trouble yet. :-) Rosetta's quotas are extremely generous. Einstein's limit, for example, is 8. |
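The quota rule as described in this post can be sketched as follows. This is a minimal model of the rule as stated here, not the scheduler's real code: the floor of 1, the reset of the doubling streak after an error, and the names `update_quota` and `simulate` are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

MAX_QUOTA = 100  # per-CPU daily maximum mentioned in the thread
MIN_QUOTA = 1    # assumed floor


def update_quota(quota: int, streak: int, ok: bool) -> Tuple[int, int]:
    """Return (new_quota, new_streak) after one reported result.

    `streak` counts consecutive good results; resetting it to zero on an
    error is an assumption, the post does not say what happens there.
    """
    if not ok:
        return max(MIN_QUOTA, quota - 1), 0
    streak += 1
    # First good result adds 2, the next adds 4, then 8, capped at the maximum.
    return min(MAX_QUOTA, quota + 2 ** streak), streak


def simulate(results: Iterable[bool], quota: int = MAX_QUOTA) -> int:
    """Apply a sequence of results (True = good, False = error) to a quota."""
    streak = 0
    for ok in results:
        quota, streak = update_quota(quota, streak, ok)
    return quota
```

For example, `simulate([False] * 12)` gives 88, one of the quotas quoted above, and under these assumptions three good results in a row bring it back to the cap of 100.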
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
Cheers for the info Bill.....I think I will still stick with the suspend though and let Climate and WCG have a boost. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"Cheers for the info Bill.....I think I will still stick with the suspend though and let Climate and WCG have a boost." No problem here; everyone should do whatever they're most comfortable with. I just don't want everyone reading the thread to over-react and say "I GOTTA SUSPEND". If my PC was _functional_ (mutter, mutter...) I'd just let it run, but then I'm on an unlimited-download broadband cable connection, so it wouldn't hurt anything. My Mac Mini is still crunching Rosetta, but at 12-20 hours/WU, shared 50:50 with Einstein, it's only on the first WU downloaded since all this started, and that one seems to be okay. I think it's safe to assume that _that_ machine will never have a quota problem... |
PCZ Joined: 16 Sep 05 Posts: 26 Credit: 2,024,330 RAC: 0 |
Well, I'm seeing "reached daily quota of 49 results" on quite a few of my boxes. The daily limit soon falls from 100: 51 bad WUs and you're out of there. BoincView is a sea of red :( |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
|
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"Well I'm seeing 'reached daily quota of 49 results' on quite a few of my boxes." Well, this says that you have somewhere around 49 results either on your system or completed within the last 24 hours... That's not _too_ bad, and remember that each "success" brings the quota up faster than the errors brought it down! I assume you're still crunching Rosetta, and not out of work? I originally thought the quota of 100 here was way too high for the time the results take. Even my "fast" PC can't do more than about 12-15/day. Assuming some of you have boxes twice that fast, I figured 50 should be plenty for anyone, even allowing for a bunch of download or computing errors. I never anticipated this large a "bad batch"! (Oh, the quota is "per CPU", not "per user" or "per host", so multi-core hosts get more, etc.) The danger in having a _high_ quota can be seen at SETI. There are any number of broken boxes out there that are getting 1 WU/day, erroring, then waiting. But on the first day, they killed off 100 WUs, which then had to be re-issued, with quorums delayed, etc. The danger in having a _low_ quota is that a few errors can cause a machine to lose crunching time once it bottoms out and before the first "good" result is validated. That's not an issue at Rosetta, because results are validated almost instantly. Even if you reach a quota of 1, you'll lose only the time between the first "good" one completing and uploading, and when it is reported. (Except of course if you don't have regular net access.) |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"HM, never took a look how big are the WUs of Rosetta to download ..." I don't have my PC, so this is from the Mac, but it looks like around 1.8 MB. There's an awful lot of "extra" stuff in there, not just a single file, so this may not be an "every time" measurement, but that's what has today's date on it, and I only have 1 WU at a time. Hm. Apparently my Mac WU finished and I got another, and it errored out after 30 seconds... so this is definitely cross-platform. The project just quit sending out work, so apparently the admins are in and working on it! |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
At the moment, this problem seems to start a deadly cycle ... Boxes more often ask for work:
Supporting BOINC, a great concept ! |
Nic Joined: 7 Oct 05 Posts: 3 Credit: 18,542 RAC: 0 |
Just to let everyone know, I've gotten errors on 205s, 207s, and now 203s. |
PCZ Joined: 16 Sep 05 Posts: 26 Credit: 2,024,330 RAC: 0 |
Those boxes are out of work. As I said, 51 duff WUs in a row and your daily limit is then lower than the number you have already downloaded. Successful WUs may well increase your daily limit, but it never rises above 100 per CPU. So you get 51 crappy WUs in a row and you ain't crunching anymore. It took very little time to download those duff WUs. |
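The arithmetic behind "51 duff WUs and you're out" can be checked with a short sketch. It rests on assumptions not confirmed anywhere in the thread: every downloaded result fails immediately, each failure lowers the quota by one, and the server refuses work once the number of results sent today exceeds the current quota. The function name `errors_until_lockout` is illustrative.

```python
from typing import Tuple


def errors_until_lockout(start_quota: int = 100) -> Tuple[int, int]:
    """Count consecutive failed results until the host is refused more work.

    Returns (results downloaded, quota at that point) under the assumed rules:
    one download and one error per step, refusal once downloads exceed quota.
    """
    downloaded = 0
    quota = start_quota
    while downloaded <= quota:
        downloaded += 1  # one more result fetched today
        quota -= 1       # ...and it errors out, costing one unit of quota
    return downloaded, quota
```

With a starting quota of 100 this returns `(51, 49)`, which lines up with the "reached daily quota of 49 results" message reported earlier; a different refusal rule would shift the crossover by one.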
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 42 |
Are these default_205s worked out of the system now? I am leaving for the holidays tomorrow morning and will be really annoyed if, when I come back, I find I have been grinding away on a no-hope unit simply because I was not home to abort it. All of my WUs at the moment seem to be aborting quickly. This is not really bothering me, and if it helps get the duffers out of the system, great. It is the other issue that bothers me. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
I understand people wanting to suspend Rosetta during this time. Completely reasonable. We really don't want this to happen again, and so will obviously make it top priority to implement safeguards and a more rigorous testbed. The max error results allowed clearly isn't helping matters either. I'm sure you don't want to hear me apologize again, so instead I'll tell you what happened. In previous WUs, we have been taking our random number seed from your system clock. This isn't quite optimal, because it's very possible for two WUs to start at the same second on different computers, and one second is the resolution we were using. We were getting a number of duplicate jobs. So in an effort to reduce the CPU cycles wasted on running identical jobs, we decided to send out seeds with each WU. These seeds come from a list. It turns out that there was a long-standing bug in our seed reading code that mangles a significant fraction of seeds. This sometimes causes the random number stream to be corrupted and to give values out of range. We had not detected it before because we had not previously been systematically sampling the whole range of seeds. The "work-around", until we can get a new executable out that fixes the bug, is to go back to using seeds corresponding to values that do not get mangled. I hope we can prevent this thread from becoming another forum on pseudo-random number generation. Again, the problem was with our seed reader, not the generator itself. But when David Baker comes back from vacation, I will use this as an opportunity to advocate for the implementation of a more robust RNG in Rosetta, likely the Mersenne Twister that has been suggested in these forums and in my discussions with other scientists. Thanks again for your patience, Jack |
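The duplicate-job problem described above can be reproduced in a few lines. This is only an illustration in Python, not Rosetta's code: `first_draws` and `clock_seed` are hypothetical helpers, the timestamps and seed values are made up, and Python's `random.Random` happens to be a Mersenne Twister generator.

```python
import random
from typing import List


def first_draws(seed: int, n: int = 3) -> List[float]:
    """Return the first n draws of a generator seeded with `seed`."""
    rng = random.Random(seed)  # CPython's random module uses the Mersenne Twister
    return [rng.random() for _ in range(n)]


def clock_seed(start_time: float) -> int:
    """Seed from the clock at one-second resolution, as described in the post."""
    return int(start_time)


# Two hosts starting within the same second get the same seed and the same stream.
host_a = first_draws(clock_seed(1135000000.20))
host_b = first_draws(clock_seed(1135000000.85))

# Server-assigned seeds: one distinct value per work unit (values are illustrative).
seed_list = [101, 102]
wu_a, wu_b = (first_draws(s) for s in seed_list)
```

Here `host_a` and `host_b` are identical, so the two jobs would repeat the same work, while `wu_a` and `wu_b` differ. Handing out seeds from a list only helps if the client reads them back correctly, which is where the bug described in the post was.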
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
|
©2024 University of Washington
https://www.bakerlab.org