Message boards : Number crunching : Please abort WUs with
Author | Message |
---|---|
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
"Are these default_205's worked out of the system now? I am leaving for the holidays tomorrow morning and will be really annoyed if, when I come back, I find I have been grinding away on a no-hope unit simply because I was not home to abort it." My understanding is that the 205's are gone. If you don't have one in your cache, I think you are okay. I do not know how long it will take to flush our queue of the other ones, the ones that crash after 30 seconds. |
PCZ Joined: 16 Sep 05 Posts: 26 Credit: 2,024,330 RAC: 0 |
Hopefully most of the bad WUs are gone. I have detached and reattached the nodes which had exceeded their daily limit. All nodes back up and running :) PS: What's MORE_FRAGS, a WU for gamers? |
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
Just downloaded 3 WUs and tested them using suspend & release. All 3 seem to be OK and have run a couple of minutes each without failing. Problem could be over? |
Twinkletoes Joined: 2 Nov 05 Posts: 54 Credit: 92,865 RAC: 0 |
"Just to let everyone know, I've gotten errors on 205's, 207's, and now 203's." And 209's too for me. |
bruce boytler Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
"2am Seattle time, and I've found the source of the problem for the quick crashing jobs. It's amazing how distributed computing puts one's code to the test." Hi Jack..... Something you may not be aware of, being in the thick of the fight right now, is that some of the 30-second-crash WUs that I have been receiving have been on as many as 11 other COMPUTERS! 1hz6A_topology_sample_207_10637 I would like to suggest, either to your org or by having your org suggest it to the BOINC devs, that when a workunit has failed, say, 5 times, we call a spade a spade and have the server flag and quarantine it. Luckily the WU gives up after a few seconds. But if this were to go on for, say, 15 minutes or so, it would be quite a waste. Just another level of safeguard to go along with more critical QC. Cheers........ |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
... Quibble: To get your quota down from 100 to 1 you waste around 100 results, true. But it does not happen in just one day. The first day you get 50 WUs, they get errored out, reducing your quota to 50, so you don't get any more. (You might get 51, depending on whether the quota chopper gets in before the 51st result is issued.) Second day you get 25, then 13, 6, 3, 2, 1, 1, 1. (Or again, maybe one more each time, depending on timing issues.) The only way you'd get 100 WUs in one day is if you had a big enough cache to get them all issued before the first dud had registered and started reducing the quota. This does at least spread the impact of each rogue box over a few days. Like so much on BOINC, it is not ideal, but it is a very workable compromise between the opposite evils of letting bad boxes gobble work units and letting bad work units gobble boxes. |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
[sorry, double post] |
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
"I would like to suggest, either to your org or by having your org suggest it to the BOINC devs, that when a workunit has failed, say, 5 times, we call a spade a spade and have the server flag and quarantine it." Yes, there was a discussion of this earlier in this thread. BOINC does have this feature, but we set it to a default value of 10, instead of the 5 you suggest. Or the 3 might be even better. Unfortunately, it seems that we cannot change these Work Units after the fact. In the future these numbers will be adjusted. |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
Why are the "DEFAULT_....._" WUs still being sent out? I have to remove 50 to 100 of these every day. Why have the staff not removed these "DEFAULT_....._" WUs from their servers? If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"Why are the 'DEFAULT_....._' WUs still being sent out?" Laurenu2, WHY are you "removing" DEFAULT WUs other than the 205s? The DEFAULT_205's are all gone. Any other series should be processed, not aborted. That has been made VERY clear, repeatedly. The project _has_ removed all DEFAULT_205's that they can. If there are a handful still out there - abort them, but ONE series of ONE type of WU was to be aborted, not every series of this type of WU. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
"Again, the problem was with our seed reader, not the generator itself. But, when David Baker comes back from vacation, I will use this as an opportunity to advocate for the implementation of a more robust RNG in Rosetta. Likely the Mersenne Twister that has been suggested in these forums, and in my discussions with other scientists." *I* never suggested a different one was necessary. I said the one you are using should be tested ... :) Not to restart the debate, as you say this is a different error. However, it does prove my point that the system's use of the stream of values is of concern, which is the point of testing said system ... |
Hoelder1in Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
"Why are the 'DEFAULT_....._' WUs still being sent out?" Did you read the news on the main page (and follow the link to the 'Technical News' page)? The "DEFAULT_xxxxx_205_" (the _205_ is important) were the only ones that were meant to be killed. The other WU errors which we are all having should take care of themselves, in that these WUs error out after about 30 sec, so you don't have to do anything about them. I am currently crunching several DEFAULT... units, some of them already completed. They seem to be perfectly OK. So DON'T delete those. If you have problems other than the ones mentioned, then this would be something different from what is discussed here, perhaps related to your local setup? Oh, and in response to what you said in a different thread, I don't think it should ever be necessary to reboot if you have a 'hanging' WU. Killing the corresponding Rosetta executable should be enough, after which the WU should continue from its last checkpoint. Happened to me once or twice some time ago, but unrelated to the recent troubles. |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
The new defaults are taking way too long too. I have just aborted over 800 default WUs and have shut down over 50 client nodes until this problem is fixed. Some of the new defaults are taking 3x longer than they are listed as: 15+ hrs. I do not have the time to read this website to see if a problem is or is not fixed. If this project cannot produce good WUs, I and others just will not run it. |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
Just look at my input drop: 2005-12-22: 4,378 / 534,956; 2005-12-21: 10,266 / 530,578; 2005-12-20: 14,405 / 520,312; 2005-12-19: 15,294 / 505,907; 2005-12-18: 16,298 / 490,612; 2005-12-17: 16,300 / 474,314; 2005-12-16: 16,129 / 458,014; 2005-12-15: 16,451 / 441,885; 2005-12-14: 19,177 / 425,434; 2005-12-13: 18,232 / 406,257. And this is with all my nodes running flat out. And you try to tell me there is nothing wrong? If this problem is not fixed soon, this project will see people losing interest in it and moving on. |
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
"The new defaults are taking way too long" Rosetta work units vary in length, and I can't see how longer work units are a problem unless you like to process lots of tiny work units. You should still get the equivalent amount of credit for valid results, since it's based on the amount of CPU time spent. Whether that's 24 hours for 2 work units or 24 hours for 50 work units makes no difference. Of course, if you abort work units after they have run for a while, you get no credit for them. *** Join BOINC@Australia today *** |
rbpeake Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
"Just look at my input drop" Workunits on other projects may be long as well (of course, Climate Prediction's are the longest!). If you are looking for projects with short workunits, you cannot necessarily count on Rosetta@home, which, due to the nature of its research, has workunits of varying length. It's the nature of the beast, the name of the game. :) Regards, Bob P. |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
"The new defaults are taking way too long" The loss is due not to me aborting crunched WUs but rather to bad WUs sent out: 12/22/2005 10:06:57 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_6639_9 ( - exit code -1073741819 (0xc0000005)) 12/22/2005 6:14:29 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_91_3 ( - exit code -1073741819 (0xc0000005)) 12/22/2005 6:14:10 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_12733_7 ( - exit code -1073741819 (0xc0000005)) 12/22/2005 3:23:00 AM|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_4678_7 ( - exit code -1073741819 (0xc0000005)) 12/22/2005 2:18:54 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_9476_4 ( - exit code -1073741819 (0xc0000005)) I would guess I have about 1000 of the failed jobs over the last few days. |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
"Just look at my input drop" That is a given. I have done WUs that have run 145 days; look at my record at FAD. My beef is that the WU's clock just stops, and if rebooted you lose all or most of the time that was put into that job. |
Dimmerjas Joined: 20 Dec 05 Posts: 3 Credit: 109 RAC: 0 |
Just got WU "1ogw__topology_sample_207_14327_6" with a report deadline 3 days ago (12/19). I can't turn back time, or stop the earth from spinning. So should I abort that WU, or would I get credit anyway? But no need to worry. The WU crashed. "22-12-2005 18:13:26|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_14327_6 ( - exit code -1073741819 (0xc0000005))" The same outcome as for 6 other members with the same WU. I have only been with Rosetta for a short period. I have completed 10 WUs, and 5 of them have crashed. So now my worry is more whether I should continue with Rosetta at all. It seems pointless to bear the expense of power for my computer, and of an Internet connection, if the effort is 50-50. Dimmerjas |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"Just look at my input drop" and in another message: "I have just aborted over 800 default WUs" You don't see anything that might possibly be RELATED in those two statements? If you would quit aborting WUs, perhaps you would get credit for them! "I would guess I have about 1000 of the failed jobs over the last few days" I would believe it. You have been aborting all the GOOD ones, leaving nothing for your computers to spend their time on but the BAD ones. At 30 seconds apiece, you can go through a bunch of those in a hurry. "My beef is that the WU's clock just stops, and if rebooted you lose all or most of the time that was put into that job" We've already been through this - and you STILL have not given any explanation for why you would CONCEIVABLY reboot. You haven't answered any of the questions you have been asked. Do you WANT help? Or do you just want to whine? "I do not have the time to read this website to see if a problem is or is not fixed. If this project cannot produce good WUs, I and others just will not run it." But you have time to reboot 50 systems and abort 800 WUs for no reason... The staff has said - YOU WILL GET CREDIT for the "DEFAULT_xxxxx_205" WUs they have sent out. Period. Good or bad. Of course, you will ONLY get credit for the ones you have LET RUN. If there is a "DEFAULT_xxxxx_205" in there, you will get credit for however much time you spent on it before aborting it. All these _other_ WUs you have been aborting FOR NO REASON, you will NOT get credit for. Why is this so hard for you to understand? If you have a problem - describe the problem and ask for help. We will help you. But we CANNOT help you if all you say is "I had to reboot", and won't answer "Why?", and continue to do things you have been told you should not do! |
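
Several posts above quote the client log text "exit code -1073741819 (0xc0000005)". The two numbers are the same 32-bit value, printed once as a signed integer and once in hexadecimal. The sketch below only illustrates that conversion; the function names and the one-entry lookup table are hypothetical and are not part of BOINC or Rosetta. On Windows, 0xC0000005 is the NTSTATUS value for an access violation, which would be consistent with the application crashing shortly after start, although the thread does not describe the cause in those terms.

```python
# Illustrative sketch: relate the signed exit code in the log lines to its
# hexadecimal form. All names here are hypothetical, chosen for this example.

def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def format_exit_code(code: int) -> str:
    """Return the exit code as an 8-digit lowercase hex string."""
    return f"0x{to_unsigned32(code):08x}"


# Assumption: a tiny lookup table holding only the one value seen in the
# thread. A real tool would need a much fuller list of NTSTATUS codes.
KNOWN_CODES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def describe(code: int) -> str:
    """Combine the hex form with a symbolic name when one is known."""
    name = KNOWN_CODES.get(to_unsigned32(code), "unknown")
    return f"{format_exit_code(code)} ({name})"


if __name__ == "__main__":
    print(describe(-1073741819))
```

Masking with 0xFFFFFFFF is enough here because Python integers are unbounded, so the mask yields the two's-complement bit pattern that a 32-bit process exit code would carry.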
©2024 University of Washington
https://www.bakerlab.org