Please abort WUs with

Profile Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 6910 - Posted: 20 Dec 2005, 19:16:15 UTC

Unfortunately, we seem to have had some problems with our latest batch of Work Units.

The biggest one is that we inadvertently instructed each WU to make 1000 structures instead of 10. This is clearly not possible before the deadlines for these WUs. So, to make sure you don't lose any credit and we don't lose any results, please ABORT any WUs whose names start with "DEFAULT_....._205_....". You will also notice that the percentage resolution is higher on these WUs. The percentage is based on the fraction of target structures made, so 1000 structures means that you can have a 0.1% resolution.
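As a rough illustration of the resolution remark above: if progress is reported as the fraction of target structures completed, each finished structure moves the percentage by 100 divided by the target count. The function name below is hypothetical and is not taken from the Rosetta code.

```python
def percent_resolution(target_structures: int) -> float:
    """Smallest progress step, in percent, when progress is the
    fraction of target structures completed (an assumption based
    on the description in this post)."""
    if target_structures <= 0:
        raise ValueError("target_structures must be positive")
    return 100.0 / target_structures


# Intended setting (10 structures) versus the mistaken one (1000).
for n in (10, 1000):
    print(f"{n} structures -> {percent_resolution(n)}% per structure")
```

With 10 structures each step is 10%, while with 1000 structures each step is 0.1%, which is why the affected WUs show a much finer percentage.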

There also seems to be a problem with some other WUs exiting quickly. It is likely due to another mistake on our command line that we can fix quickly. We are looking into it.

The message is that it's always dangerous to release a bunch of new stuff just before the holidays... :)

We appreciate your patience as we work through these issues. The newly queued WUs should work better.
ID: 6910 · Rating: 0
Profile JChojnacki
Joined: 17 Sep 05
Posts: 71
Credit: 10,633,777
RAC: 5,085
Message 6916 - Posted: 20 Dec 2005, 19:33:17 UTC - in response to Message 6910.  


We appreciate your patience as we work through these issues. The newly queued WUs should work better.


Appreciate the update!

No problem being patient through the issues, as long as we remain informed. And since you guys there do such a great job at communicating, well, as I said before, no problem. :-)

Thanks.



ID: 6916 · Rating: 0
Grutte Pier [Wa Oars]~GP500
Joined: 30 Nov 05
Posts: 14
Credit: 432,089
RAC: 0
Message 6926 - Posted: 20 Dec 2005, 20:45:22 UTC
Last modified: 20 Dec 2005, 20:56:11 UTC

Funny; we can't tell here up front whether the WUs are good.

I would find it logical that I get credit for the work I did.
https://boinc.bakerlab.org/rosetta/result.php?resultid=4421107

Some hours of work went into that one, and it was a fault on your side,
so some credit for the work we did would be appreciated.

Even if you can't use the result we produced,
173.56 is more than half of my day's production:
almost 40,000 sec.

PS: good luck :)
ID: 6926 · Rating: 0
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 6932 - Posted: 20 Dec 2005, 21:25:19 UTC

We'll look into giving those who ran batch 205 credit. It is apparent that a local test system should be in place instead of sending test batches to the production server, particularly after this mistake was made for the very first batch submitted through our automated work generator. We'll be working hard to prevent future problems like this one.
ID: 6932 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 7
Message 6934 - Posted: 20 Dec 2005, 21:32:12 UTC - in response to Message 6926.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=4421107

Some hours of work went into that one, and it was a fault on your side,
so some credit for the work we did would be appreciated.


The _good_ news on these is that they apparently are failing with a 'CPU time exceeded' error at around 11 hours, and not just running for days...

ID: 6934 · Rating: 0
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 6940 - Posted: 20 Dec 2005, 21:57:25 UTC - in response to Message 6932.  
Last modified: 20 Dec 2005, 22:00:31 UTC

We'll look into giving those who ran batch 205 credit.


David,

this is in part a repeat of what I said in another thread, but it's worth repeating if it saves you re-inventing the wheel.

When Einstein made a mistake like this, they managed to give everyone credit for the aborted WUs - if you ask them nicely they may still have the script handy. (If they are not sure when this was, it was when they issued WUs whose names differed only by upper-vs-lower case, which confused the Windows machines.)

On the other hand, the script might be so simple that it's easier to re-write it - I don't know enough to say; but the E@h script proves it is possible to do, and the user response proved it was a worthwhile PR move.

People initially got 0 credit, but a few days later, after the script was run, everyone ended up getting what their client had claimed for the result.

And the message for participants is to abort the WU and let it report; the people who still lost out on the Einstein blunder were those who'd reset their project.

River~~
ID: 6940 · Rating: 0
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 6941 - Posted: 20 Dec 2005, 22:16:10 UTC

It should be pretty easy to write a script to do this. I will be gone for the holidays for a week starting tomorrow, but when I get back I will give people credit for this batch. That should be enough time for most to have errored out, for those who did not get a chance to abort them.
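A minimal sketch of what such a script might do, assuming each result record carries a name, an outcome, a claimed credit and a granted credit. The `Result` class, the outcome strings and the example WU names are hypothetical stand-ins, not the real BOINC database schema; only the 173.56 figure comes from this thread.

```python
import re
from dataclasses import dataclass

# Assumed naming scheme, based on the "DEFAULT_....._205_...." description.
BATCH_205 = re.compile(r"^DEFAULT_[^_]+_205_")


@dataclass
class Result:
    name: str
    outcome: str            # hypothetical values, e.g. "success", "aborted"
    claimed_credit: float
    granted_credit: float = 0.0


def grant_claimed_credit(results):
    """Grant the claimed credit to failed batch-205 results that got none."""
    credited = []
    for r in results:
        failed = r.outcome != "success"
        if BATCH_205.match(r.name) and failed and r.granted_credit == 0.0:
            r.granted_credit = r.claimed_credit
            credited.append(r)
    return credited


results = [
    Result("DEFAULT_12345_205_1_0", "aborted", 173.56),
    Result("DEFAULT_12345_205_2_0", "cpu_time_exceeded", 180.0),
    Result("DEFAULT_12345_206_1_0", "aborted", 5.0),  # other batch: untouched
]
for r in grant_claimed_credit(results):
    print(r.name, r.granted_credit)
```

The check on `granted_credit == 0.0` keeps the sketch safe to run twice, so nobody would be credited a second time.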
ID: 6941 · Rating: 0
The Pirate
Joined: 22 Sep 05
Posts: 20
Credit: 7,090,933
RAC: 0
Message 6965 - Posted: 21 Dec 2005, 0:53:13 UTC

Things like this happen. To me, it doesn't matter if I get credit for the aborted WUs or not. As long as the sun comes up tomorrow I won't sweat it. I have a couple of 'puters crunching the Seti@Home test project and no credit is granted for the work, good or bad.

ID: 6965 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 7
Message 6972 - Posted: 21 Dec 2005, 5:59:16 UTC

I agree that I don't care if credit is granted or not, but I do ask that the staff be careful here. Saying "we will try" is fine, or "it should be pretty easy" - but until you _know_ that you can grant the credit, don't give out a blanket "you will get credit" statement.

SETI said "you will get credit" for WUs that were late due to their latest outage. MOST did, but a certain set of those that were due the first or second day of it did NOT, and there are a lot of people upset. SZTAKI had a 0-credit problem three times; the first two, they granted credit. The third, they _said_ they would grant credit, and every time they were reminded "you still haven't", they said "we will, we just haven't gotten to it". Then they deleted all the results before they "got to it", and went silent on it. No comment at all since. Don't care about the credit, 50-60 points out of 5000+, but don't lie to me! Bye, SZTAKI!

While "we gave you credit" would be great, it's far better to say "we'll try" and then "we're sorry, we tried but couldn't give you credit" than it is to say "we will give you credit" and then "oops we can't".

ID: 6972 · Rating: 0
hans jørn enevoldsen
Joined: 18 Dec 05
Posts: 3
Credit: 404
RAC: 0
Message 6978 - Posted: 21 Dec 2005, 9:09:05 UTC

Please abort WUs with "DEFAULT_xxxxx_207_...
ID: 6978 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 7
Message 6979 - Posted: 21 Dec 2005, 9:16:31 UTC - in response to Message 6978.  

Please abort WUs with "DEFAULT_xxxxx_207_...


Why? I can find no evidence that there even IS a "default_xxxxx_207" yet. Please provide some justification or explanation.

ID: 6979 · Rating: 0
hans jørn enevoldsen
Joined: 18 Dec 05
Posts: 3
Credit: 404
RAC: 0
Message 6980 - Posted: 21 Dec 2005, 9:22:19 UTC

They are all stopping after 10 minutes
ID: 6980 · Rating: 0
Profile Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 6981 - Posted: 21 Dec 2005, 9:28:39 UTC - in response to Message 6980.  
Last modified: 21 Dec 2005, 9:43:37 UTC

They are all stopping after 10 minutes


Like Bill, I have not seen any WUs with names that start with "DEFAULT_xxxxx_207_" (replacing xxxxx with a number). However, I have had a relatively high number of work units in the 204 and 207 batch crash after a few minutes. Some have been OK.

I will persist with Rosetta for the day, but if it gets out of hand I will suspend Rosetta on most of my computers (keeping an eye on the remaining one). The (minimal) time taken and lack of credit is not a major concern, but why waste bandwidth downloading work units that are going to crash?

EDIT 1:
I just had three in a row crash, minutes after posting this message. All in the 207 batch:

1ogw__topology_sample_207_10103_1
1hz6A_topology_sample_207_7644_1
1ogw__topology_sample_207_14401_0

All with error 0xC0000005, in a matter of 10-30 seconds

EDIT 2:
4 more crashed, on two computers, since writing the above (a few minutes ago).
Two of them were batch 208, two of them batch 207

EDIT 3:
Minutes later, another 3.

It is getting out of hand - Rosetta has been set to "no new work" on all my computers pending a fix.
*** Join BOINC@Australia today ***
ID: 6981 · Rating: 0
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 6982 - Posted: 21 Dec 2005, 9:43:10 UTC - in response to Message 6979.  

Please abort WUs with "DEFAULT_xxxxx_207_...


Why? I can find no evidence that there even IS a "default_xxxxx_207" yet. Please provide some justification or explanation.

See This thread towards the end.
ID: 6982 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 7
Message 6983 - Posted: 21 Dec 2005, 9:44:27 UTC - in response to Message 6982.  

See This thread towards the end.


Like where David Kim says

Batch 206 is okay, ONLY ABORT 205.


?

ID: 6983 · Rating: 0
pb
Joined: 30 Nov 05
Posts: 6
Credit: 65,632
RAC: 0
Message 6984 - Posted: 21 Dec 2005, 10:10:49 UTC

Hello!

I've aborted this one:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3760695

Will I get promised credit for it, as it is said in News?

Thanks.
ID: 6984 · Rating: 0
Profile Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 6985 - Posted: 21 Dec 2005, 10:35:59 UTC

2am Seattle time, and I've found the source of the problem for the quick crashing jobs. It's amazing how distributed computing puts one's code to the test.

David Kim's work-around should make things okay until we fix the code.

Unfortunately, I think the bad work units will have to error out to be removed from the queue. Again, we appreciate your patience.


ID: 6985 · Rating: 1
Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 6986 - Posted: 21 Dec 2005, 11:10:14 UTC - in response to Message 6985.  

...
David Kim's work-around should make things okay until we fix the code.

Unfortunately, I think the bad work units will have to error out to be removed from the queue. Again, we appreciate your patience.


Hate to be this way, but let them error out on U of W Housing and Food Services computers, not mine. Rosetta is currently suspended and EaH is merrily computing double quota, perhaps until after Christmas. I'm a retired computer programmer myself, and the concept of "work around" doesn't appeal to me.
ID: 6986 · Rating: 0
Grutte Pier [Wa Oars]~GP500
Joined: 30 Nov 05
Posts: 14
Credit: 432,089
RAC: 0
Message 6987 - Posted: 21 Dec 2005, 11:16:16 UTC
Last modified: 21 Dec 2005, 11:17:14 UTC

I see 207 jobs crash too.
I can't tell which ones do and don't crash.

Some 207s do compute right.

New stuff before the Christmas holiday isn't a good idea.
Maybe revert to the previous calculation methods.
And research before experimenting.

This isn't meant to be annoying or cynical; I can't see what is being done.
Good luck. If we can help, or if you need feedback, ask us.
These are the bumps in the road for a new project.
ID: 6987 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 7
Message 6991 - Posted: 21 Dec 2005, 12:00:34 UTC

From the Technical News:

Batch 206 and greater are okay, and should not be aborted.


There appear to be two separate problems - the "DEFAULT_xxxxx_205" WUs which run for about 11 hours and then fail (and should be aborted) and an application(?) problem that is causing _various_ WUs to fail very quickly. There is no need to abort any WUs other than the specified ones; you can't tell by the name if a WU other than those is going to fail quickly, and a quick failure is not a big problem to either the project or the participants.
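For anyone who would rather check names mechanically than by eye, here is a small sketch of the rule as I read it: abort only names that start with "DEFAULT_", then one field, then "_205_". The exact pattern is an assumption drawn from the "DEFAULT_....._205_...." description, the `should_abort` helper is hypothetical, and the DEFAULT names in the example are made up; the 207 name is one quoted earlier in this thread.

```python
import re

# Assumed pattern: "DEFAULT_", one underscore-free field, then "_205_".
# Matching is case-sensitive; adjust if your WU names differ.
BAD_BATCH = re.compile(r"^DEFAULT_[^_]+_205_")


def should_abort(wu_name: str) -> bool:
    """True only for the long-running batch-205 WUs."""
    return BAD_BATCH.match(wu_name) is not None


names = [
    "DEFAULT_12345_205_7_0",              # made-up example of a bad WU
    "DEFAULT_12345_206_7_0",              # made-up example, batch 206 is okay
    "1ogw__topology_sample_207_10103_1",  # name reported in this thread
]
for name in names:
    print(name, "-> abort" if should_abort(name) else "-> keep")
```

Note that this deliberately says nothing about the quick-failing WUs, since their names give no indication of whether they will crash.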

The "short" failures shouldn't add up to more than a minute or two on average for everyone, and there is little point in granting 0.x credits - those who have spent up to 11 hours on one of the "DEFAULT 205"s, the project has said _will_ get credit for the time spent, after they have been "flushed", and after the holidays.

If credit isn't being lost, there really is no reason _not_ to be running Rosetta right now. If you see one of the long ones after it starts and abort it, you'll get credit for however much time _was_ spent. If you don't see it and it runs until it gets the CPU time limit error, you'll get credit for it. If you see and abort one before it starts, you've lost nothing other than a few seconds downloading it. If a few "short" ones hit your computer and error out, well, so what? Meanwhile, most people are getting mostly "good" WUs, so the work continues. Suspending Rosetta just slows the process. Any "bad" WU that you don't get will just be done by someone else, but maybe a day later.

Anyone who knows my postings, on this and other boards, will know that I don't cut the projects very much slack. When they screw up, I tell them about it. If you read the "How to have the best BOINC project" thread, my points 6, 7, and 8 will tell you that I'm not terribly happy about this present situation. But this is a "young" project, and not all of the safeguards are in place yet. They're posting on these boards actively, and trying to do the right things.

ID: 6991 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org