Please abort WUs with

Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 7064 - Posted: 21 Dec 2005, 18:24:39 UTC - in response to Message 7055.  
Last modified: 21 Dec 2005, 18:26:26 UTC

Are these default_205's worked out of the system now? I am leaving for the holidays tomorrow morning and will be really annoyed if when I come back, I find I have been grinding away on a no hope unit simply because I was not home to abort it.


My understanding is that the 205's are gone. If you don't have one in your cache, I think you are okay.

I do not know how long it will take to flush our queue of the other ones, the ones that crash after 30 seconds.

PCZ
Joined: 16 Sep 05
Posts: 26
Credit: 2,024,330
RAC: 0
Message 7082 - Posted: 21 Dec 2005, 20:05:49 UTC

Hopefully most of the bad WU's are gone.
I have detached and reattached the nodes which had exceeded their daily limit.

All nodes back up and running :)

PS
What's MORE_FRAGS, a WU for gamers?



Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 7085 - Posted: 21 Dec 2005, 20:12:37 UTC

Just downloaded 3 WU and tested them using suspend & release....all 3 seem to be ok and have run a couple of minutes each without failing....problem could be over????

Twinkletoes
Joined: 2 Nov 05
Posts: 54
Credit: 92,865
RAC: 0
Message 7101 - Posted: 21 Dec 2005, 22:33:55 UTC - in response to Message 7046.  

Just to let everyone know, I've gotten errors on 205's, 207's, and now 203's.


And 209's too for me.


bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 7107 - Posted: 21 Dec 2005, 23:32:56 UTC - in response to Message 6985.  

2am Seattle time, and I've found the source of the problem for the quick-crashing jobs. It's amazing how distributed computing puts one's code to the test.

David Kim's work-around should make things okay until we fix the code.

Unfortunately, I think the bad work units will have to error out to be removed from the queue. Again, we appreciate your patience.




Hi Jack.....

Something you may not be aware of, being in the thick of the fight right now, is that some of the 30-second-crash WUs that I have been receiving have been on as many as 11 other COMPUTERS!

1hz6A_topology_sample_207_10637

I would like to suggest, either to your org or through your org to the BOINC devs, that when a workunit has failed, say, 5 times, the server should call a spade a spade and flag and quarantine it.

Luckily the WU gives up after a few seconds. But if this were to go on for, say, 15 minutes or so, it would be quite a waste. Just another level of safeguard to go along with more critical QC.

Cheers........



River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7108 - Posted: 21 Dec 2005, 23:35:39 UTC - in response to Message 7043.  

...
I originally thought the quota of 100 here was way too high for the time the results take. ... The danger in having a _high_ quota can be seen at SETI. There are any number of broken boxes out there that are getting 1 WU/day, erroring, then waiting. But the first day, they killed off 100 WUs, that had to be re-issued, quorums delayed, etc.


Quibble:

To get your quota down from 100 to 1 you waste around 100 results, true. But it does not happen in just one day.

The first day you get 50 WUs; they error out, reducing your quota to 50, so you don't get any more. (You might get 51, depending on whether the quota chopper gets in before the 51st result is issued.)

On the second day you get 25, then 13, 6, 3, 2, 1, 1, 1. (Or again, maybe one more each time, depending on timing issues.)

The only way you'd get 100 WUs in one day is if you had a big enough cache to get them all issued before the first dud had registered and started reducing the quota.

This does at least spread the impact of each rogue box over a few days.

Like so much on BOINC it is not ideal but it is a very workable compromise between the opposite evils of letting bad boxes gobble work units and letting bad work units gobble boxes.
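
For anyone who wants to check that sequence, here is a minimal Python sketch. It assumes each errored result knocks the host's daily quota down by one, that every result errors out before the next one is issued, and that the quota never drops below 1. Those rules are inferred from the numbers in this post, not taken from the BOINC server code, and the function name is made up for illustration.

```python
def simulate_quota_decay(initial_quota=100, days=9, min_quota=1):
    """Return the number of results issued per day to a host whose every result errors.

    Assumed rules (inferred from the post, not from BOINC source):
      * a host may receive results only while today's count is below its quota
      * each errored result lowers the quota by one, never below min_quota
      * each result errors out before the next one is issued
    """
    quota = initial_quota
    issued_per_day = []
    for _ in range(days):
        issued_today = 0
        while issued_today < quota:
            issued_today += 1                    # one more result handed out
            quota = max(min_quota, quota - 1)    # it errors, so the quota shrinks
        issued_per_day.append(issued_today)
    return issued_per_day


if __name__ == "__main__":
    print(simulate_quota_decay())
```

Under those assumptions the first seven days add up to exactly 100 wasted results, which fits the "around 100" figure above.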

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7109 - Posted: 21 Dec 2005, 23:38:45 UTC - in response to Message 7108.  
Last modified: 21 Dec 2005, 23:39:47 UTC

[sorry, double post]

Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 7111 - Posted: 21 Dec 2005, 23:41:58 UTC - in response to Message 7107.  

I would like to suggest, either to your org or through your org to the BOINC devs, that when a workunit has failed, say, 5 times, the server should call a spade a spade and flag and quarantine it.


Yes, there was a discussion of this earlier in this thread. BOINC does have this feature, but we set it to a default value of 10, instead of the 5 you suggest. Or 3 might be even better. Unfortunately, it seems that we cannot change these work units after the fact. In the future these numbers will be adjusted.
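
To make the effect of that limit concrete, here is a rough Python sketch of the idea. It is not the actual BOINC scheduler: the class and field names (`WorkUnit`, `max_errors`) are hypothetical, and it assumes copies of a work unit go out one at a time, each erroring before the next is sent.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    """Toy model of a work unit with a per-unit error limit (hypothetical names)."""
    name: str
    max_errors: int = 10      # the thread says the project used a limit of 10
    error_count: int = 0
    quarantined: bool = False

    def report_error(self) -> None:
        """Record one failed result and quarantine the unit once the limit is hit."""
        self.error_count += 1
        if self.error_count >= self.max_errors:
            self.quarantined = True

    def can_issue(self) -> bool:
        """A quarantined unit is never sent out again."""
        return not self.quarantined


def hosts_hit_by_bad_workunit(max_errors: int) -> int:
    """Count how many hosts receive a unit that always crashes, for a given limit."""
    wu = WorkUnit(name="example_bad_wu", max_errors=max_errors)
    hosts = 0
    while wu.can_issue():
        hosts += 1            # send one copy to the next host
        wu.report_error()     # that copy crashes and is reported back
    return hosts


if __name__ == "__main__":
    for limit in (10, 5, 3):
        print(limit, hosts_hit_by_bad_workunit(limit))
```

In this toy model a bad work unit reaches 10 hosts with a limit of 10, 5 hosts with a limit of 5, and 3 with a limit of 3. In practice copies may already be in flight when the limit is hit, so the real count could be somewhat higher.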

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 7129 - Posted: 22 Dec 2005, 4:04:37 UTC

Why are the "DEFAULT_....._" WUs still being sent out?
I have to remove 50 to 100 of these every day.

Why has the staff not removed these "DEFAULT_....._" WUs from their servers?

If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,629,814
RAC: 2,860
Message 7133 - Posted: 22 Dec 2005, 6:08:12 UTC - in response to Message 7129.  

Why are the "DEFAULT_....._" WUs still being sent out?
I have to remove 50 to 100 of these every day.

Why has the staff not removed these "DEFAULT_....._" WUs from their servers?


Laurenu2, WHY are you "removing" DEFAULT WUs other than the 205s? The DEFAULT_205's are all gone. Any other series should be processed, not aborted. That has been made VERY clear, repeatedly. The project _has_ removed all DEFAULT_205's that they can. If there are a handful still out there - abort them, but ONE series of ONE type of WU was to be aborted, not every series of this type of WU.


Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7134 - Posted: 22 Dec 2005, 6:12:42 UTC - in response to Message 7056.  
Last modified: 22 Dec 2005, 6:13:09 UTC

Again, the problem was with our seed reader, not the generator itself. But, when David Baker comes back from vacation, I will use this as an opportunity to advocate for the implementation of a more robust RNG in rosetta. Likely the Mersenne Twister that has been suggested in these forums, and in my discussions with other scientists.

*I* never suggested a different one was necessary. I said the one you are using should be tested ... :)

Not to restart the debate, as you say this is a different error. However, it does prove my point that the system's use of the stream of values is of concern, which is the point of testing said system ...

Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 7135 - Posted: 22 Dec 2005, 6:17:03 UTC - in response to Message 7129.  

Why are the "DEFAULT_....._" WUs still being sent out?
I have to remove 50 to 100 of these every day.

Why has the staff not removed these "DEFAULT_....._" WUs from their servers?

Did you read the news on the main page (and follow the link to the 'Technical News' page)? The "DEFAULT_xxxxx_205_" WUs (the _205_ is important) were the only ones that were meant to be killed. The other WU errors which we are all having should take care of themselves, in that these WUs error out after about 30 sec, so you don't have to do anything about them. I am currently crunching several DEFAULT... units, some of them already completed. They seem to be perfectly OK. So DON'T delete those. If you have problems other than the ones mentioned, then this would be something different from what is discussed here, perhaps related to your local setup?

Oh, and in response to what you said in a different thread, I don't think it should ever be necessary to reboot if you have a 'hanging' WU. Killing the corresponding Rosetta executable should be enough, after which the WU should continue from its last checkpoint. Happened to me once or twice some time ago, but unrelated to the recent troubles.

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 7185 - Posted: 22 Dec 2005, 16:02:56 UTC
Last modified: 22 Dec 2005, 16:05:29 UTC

The new defaults are taking way too long, too. I have just aborted over 800 default WUs and have shut down over 50 client nodes until this problem is fixed.

Some of the new defaults are taking 3x longer than they are listed as: 15+ hrs.

I do not have the time to read this website to see if a problem is or is not fixed. If this project cannot produce good WUs, I and others just will not run it.

If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 7187 - Posted: 22 Dec 2005, 16:18:57 UTC

Just look at my input drop:
2005-12-22 4,378 534,956
2005-12-21 10,266 530,578
2005-12-20 14,405 520,312
2005-12-19 15,294 505,907
2005-12-18 16,298 490,612
2005-12-17 16,300 474,314
2005-12-16 16,129 458,014
2005-12-15 16,451 441,885
2005-12-14 19,177 425,434
2005-12-13 18,232 406,257

And this is with all my nodes running flat out.
And you try to tell me there is nothing wrong.
If this problem is not fixed soon, this project
will see people losing interest in it and moving on.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 7191 - Posted: 22 Dec 2005, 16:49:47 UTC - in response to Message 7185.  
Last modified: 22 Dec 2005, 16:53:30 UTC

The new defaults are taking way too long


Rosetta work units vary in length and I can't see how longer work units are a problem other than if you like to process lots of tiny work units.

You should still get the equivalent amount of credit for valid results since it's based on the amount of CPU time spent. Whether that's 24 hours for 2 work units or 24 hours for 50 work units makes no difference. Of course, if you abort work units after they have run for a while, you get no credit for them.
*** Join BOINC@Australia today ***

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 7197 - Posted: 22 Dec 2005, 17:13:33 UTC - in response to Message 7187.  

Just look at my input drop:
2005-12-22 4,378 534,956
2005-12-21 10,266 530,578
2005-12-20 14,405 520,312
2005-12-19 15,294 505,907
2005-12-18 16,298 490,612
2005-12-17 16,300 474,314
2005-12-16 16,129 458,014
2005-12-15 16,451 441,885
2005-12-14 19,177 425,434
2005-12-13 18,232 406,257

And this is with all my nodes running flat out.
And you try to tell me there is nothing wrong.
If this problem is not fixed soon, this project
will see people losing interest in it and moving on.

Workunits on other projects may be long as well (of course, Climate Prediction's are the longest!).

If you are looking for projects with short workunits, you cannot necessarily count on Rosetta@home, which due to the nature of their research has workunits of varying length. It's the nature of the beast, the name of the game. :)

Regards,
Bob P.

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 7199 - Posted: 22 Dec 2005, 17:50:27 UTC - in response to Message 7191.  

The new defaults are taking way too long


Rosetta work units vary in length and I can't see how longer work units are a problem other than if you like to process lots of tiny work units.

You should still get the equivalent amount of credit for valid results since it's based on the amount of CPU time spent. Whether that's 24 hours for 2 work units or 24 hours for 50 work units makes no difference. Of course, if you abort work units after they have run for a while, you get no credit for them.


The loss is not due to me aborting crunched WUs but rather to bad WUs being sent out:
12/22/2005 10:06:57 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_6639_9 ( - exit code -1073741819 (0xc0000005))

12/22/2005 6:14:29 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_91_3 ( - exit code -1073741819 (0xc0000005))

12/22/2005 6:14:10 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_12733_7 ( - exit code -1073741819 (0xc0000005))

12/22/2005 3:23:00 AM|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_4678_7 ( - exit code -1073741819 (0xc0000005))

12/22/2005 2:18:54 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_9476_4 ( - exit code -1073741819 (0xc0000005))

I would guess I have about 1000 of the failed jobs over the last few days
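
A quick way to tally errors like the ones above is to parse the message log. The Python sketch below is only an illustration: the regular expression is written to match the five lines pasted in this post, and the function names are made up. It also converts the signed exit code to its unsigned 32-bit form, which is the hex value shown in parentheses; on Windows, 0xC0000005 is the access violation status code.

```python
import re
from collections import Counter

# The five log lines quoted in this post.
LOG_LINES = [
    "12/22/2005 10:06:57 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_6639_9 ( - exit code -1073741819 (0xc0000005))",
    "12/22/2005 6:14:29 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_91_3 ( - exit code -1073741819 (0xc0000005))",
    "12/22/2005 6:14:10 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_12733_7 ( - exit code -1073741819 (0xc0000005))",
    "12/22/2005 3:23:00 AM|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_4678_7 ( - exit code -1073741819 (0xc0000005))",
    "12/22/2005 2:18:54 AM|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_9476_4 ( - exit code -1073741819 (0xc0000005))",
]

# Matches the result name and the signed decimal exit code in one log line.
PATTERN = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+) \(0x[0-9a-fA-F]+\)\)"
)


def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as unsigned."""
    return code & 0xFFFFFFFF


def summarize(lines):
    """Return (failures per name prefix, set of exit codes as hex strings)."""
    per_prefix = Counter()
    codes = set()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not an "Unrecoverable error" line
        prefix = match.group("result").split("_topology_")[0]
        per_prefix[prefix] += 1
        codes.add(hex(to_unsigned32(int(match.group("code")))))
    return per_prefix, codes


if __name__ == "__main__":
    counts, exit_codes = summarize(LOG_LINES)
    print(dict(counts), exit_codes)
```

For these five lines the tally is four failures for 1hz6A and one for 1ogw_, all with the same exit code.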




If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 7200 - Posted: 22 Dec 2005, 17:57:07 UTC - in response to Message 7197.  

Just look at my input drop:
2005-12-22 4,378 534,956
2005-12-21 10,266 530,578
2005-12-20 14,405 520,312
2005-12-19 15,294 505,907
2005-12-18 16,298 490,612
2005-12-17 16,300 474,314
2005-12-16 16,129 458,014
2005-12-15 16,451 441,885
2005-12-14 19,177 425,434
2005-12-13 18,232 406,257

And this is with all my nodes running flat out.
And you try to tell me there is nothing wrong.
If this problem is not fixed soon, this project
will see people losing interest in it and moving on.

Workunits on other projects may be long as well (of course, Climate Prediction's are the longest!).

If you are looking for projects with short workunits, you cannot necessarily count on Rosetta@home, which due to the nature of their research has workunits of varying length. It's the nature of the beast, the name of the game. :)

That is a given. I have done WUs that have run 145 days; look at my record at FAD.
My beef is that the WU's clock just stops, and if rebooted you lose all or most of the time that was put into that job.

If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Dimmerjas
Joined: 20 Dec 05
Posts: 3
Credit: 109
RAC: 0
Message 7202 - Posted: 22 Dec 2005, 17:57:46 UTC

Just got WU "1ogw__topology_sample_207_14327_6" with a report deadline 3 days ago (12/19). I can't turn back time or stop the earth from spinning. So should I abort that WU, or would I get credit anyway?
But no need to worry. The WU crashed.

"22-12-2005 18:13:26|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_14327_6 ( - exit code -1073741819 (0xc0000005))"
The same outcome as 6 other members had for the same WU.

I have only been with Rosetta for a short period. I have completed 10 WUs,
and 5 of them have crashed.
So now my worry is more whether I should continue with Rosetta at all.
It seems pointless to pay for the power for my computer and the Internet connection if the outcome is 50-50.

Dimmerjas

Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,629,814
RAC: 2,860
Message 7207 - Posted: 22 Dec 2005, 18:33:10 UTC - in response to Message 7187.  

Just look at my input drop


and in another message:

I have just aborted over 800 default WUs


You don't see anything that might possibly be RELATED in those two statements? If you would quit aborting WUs, perhaps you would get credit for them!

I would guess I have about 1000 of the failed jobs over the last few days


I would believe it. You have been aborting all the GOOD ones, leaving nothing for your computers to spend their time on but the BAD ones. At 30 seconds apiece, you can go through a bunch of those in a hurry.

My beef is that the WU's clock just stops, and if rebooted you lose all or most of the time that was put into that job


We've already been through this - and you STILL have not given any explanation for why you would CONCEIVABLY reboot. You haven't answered any of the questions you have been asked. Do you WANT help? Or do you just want to whine?

I do not have the time to read this website to see if a problem is or is not fixed. If this project cannot produce good WUs, I and others just will not run it.


But you have time to reboot 50 systems and abort 800 WUs for no reason...

The staff has said - YOU WILL GET CREDIT for the "DEFAULT_xxxxx_205" WUs they have sent out. Period. Good or bad. Of course, you will ONLY get credit for the ones you have LET RUN. If there is a "DEFAULT_xxxxx_205" in there, you will get credit for however much time you spent on it before aborting it. All these _other_ WUs you have been aborting FOR NO REASON, you will NOT get credit for.

Why is this so hard for you to understand? If you have a problem - describe the problem and ask for help. We will help you. But we CANNOT help you if all you say is "I had to reboot", and won't answer "Why?", and continue to do things you have been told you should not do!




©2024 University of Washington
https://www.bakerlab.org