Message boards : Number crunching : Orphan/Ghost WUs
Author | Message |
---|---|
dag Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
I have three WUs that were sent out in early March that have not been caught in the weekly cleanup. You can view them here. Two errored out, one was successful, but all three are still sitting in my results list from March 2, 11, and 21. WU1: errored. WU2: errored. WU3: says it can't find the work unit even though it's in my results listing. dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,081,660 RAC: 345 |
I have three WUs that were sent out in early March that have not been caught in the weekly cleanup. You can view them here. Two errored out, one was successful, but all three are still sitting in my results list from March 2, 11, and 21. I've got one like that last one, too. The result is in my list of results and is here: https://boinc.bakerlab.org/rosetta/result.php?resultid=12257513 If you click on the WU id number from that page or my results page, it complains it can't find the workunit. The result was returned on March 3. I wonder if the RAH servers burped when they were removing old results and workunits and left a few things undone. Charlie -Charlie |
Carlos_Pfitzner Joined: 22 Dec 05 Posts: 71 Credit: 138,867 RAC: 0 |
YES - this problem of Ghost / Phantom WU(s) is a BOINC problem, and it costs projects / science many teraflops. I have seen this occurring with regular frequency in just about all the projects I crunch for. To avoid ending up with lots and lots of pending credits, the projects are forced to use initial replication 3, when only 2 would suffice. A REAL WASTE OF COMPUTING POWER! Either the BOINC server software running here is too old, or this problem was never fixed by the BOINC developers. However, it seems that Einstein@home is using a custom scheduler hack which resends the results already assigned to the host if the host asks for more work and does not list them in its "already have" list. They call it "lost results". You could ask them for the patch - I'd say it's a rather useful feature. I wonder why this hack is not the default behavior of BOINC and why, to this day, this remains "unfixed" in the BOINC server-side software :-( Click signature for global team stats |
Keck_Komputers Joined: 17 Sep 05 Posts: 211 Credit: 4,246,150 RAC: 0 |
This problem has gotten better in the 5.x.x clients and servers, but it has not been eliminated. The resend of lost tasks is a standard function of BOINC; however, it must be enabled by the project. Most projects have not done this, since the tasks will be resent to different hosts automatically if needed, and it does cost some server overhead to resend them to the same host. BOINC WIKI BOINCing since 2002/12/8 |
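The "lost results" check described in the two posts above can be sketched roughly as follows. This is only an illustration of the comparison, not BOINC's scheduler code: the function name `find_lost_results` and the result names are made up, and the real scheduler works against the project database rather than plain lists.

```python
def find_lost_results(assigned_to_host, reported_by_host):
    """Return results the server assigned to a host that the host did not list.

    assigned_to_host: result names the server believes are on the host.
    reported_by_host: result names the host says it already has.
    Both inputs are hypothetical stand-ins for what a scheduler would look up.
    """
    have = set(reported_by_host)
    # Anything assigned but missing from the host's list is a resend candidate.
    return [name for name in assigned_to_host if name not in have]


# Hypothetical example: the server thinks three results are on the host,
# but the host only reports two of them when it asks for more work.
assigned = ["wu_a_0", "wu_b_1", "wu_c_0"]
reported = ["wu_a_0", "wu_c_0"]
lost = find_lost_results(assigned, reported)
print(lost)
```

Doing this lookup on every work request is presumably the server overhead mentioned above, which is why a project might prefer to let the deadline expire and reissue the task to another host.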
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,081,660 RAC: 345 |
Just wanted to make sure something is clear here. Do not confuse what dgeiser reported and I confirmed with the traditional ghost WU problem. I believe what we reported is different. In the traditional ghost WU, the Boinc server thinks it has sent a WU to a particular computer. The WU will show up in the list of results for that computer. However, the computer's owner will report that the WU is not on the computer and never was. The WU will eventually pass its report deadline and be resent to another computer. What dgeiser reported and I yelled, "me, too!", is different. In our cases the WUs were sent to our computers, we crunched them and sent them back. All that was fine. However, some time later, in the normal process of the server cleaning out old results and workunits, the workunit was removed from the database but the corresponding result was not. Charlie -Charlie |
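The second case Charlie describes, a result row left behind after its workunit was purged, is the kind of thing a simple database query could find. The sketch below assumes a heavily simplified two-table schema in an in-memory SQLite database; the table and column names mirror the idea of a `result` row pointing at a `workunit` row, but the schema, the sample rows, and the function `find_orphan_results` are assumptions for illustration, not the project's actual database or cleanup tooling.

```python
import sqlite3

# In-memory database with a simplified, assumed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workunit (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE result (id INTEGER PRIMARY KEY, workunitid INTEGER, name TEXT);
-- Workunit 1 still exists; workunit 2 was purged but its result was not.
INSERT INTO workunit VALUES (1, 'wu_kept');
INSERT INTO result VALUES (10, 1, 'wu_kept_0');
INSERT INTO result VALUES (11, 2, 'wu_purged_0');
""")


def find_orphan_results(conn):
    """Return (id, name) for results whose workunit row no longer exists."""
    rows = conn.execute("""
        SELECT r.id, r.name
        FROM result AS r
        LEFT JOIN workunit AS w ON w.id = r.workunitid
        WHERE w.id IS NULL
        ORDER BY r.id
    """)
    return rows.fetchall()


orphans = find_orphan_results(conn)
print(orphans)
```

A result found this way would show up in a user's results list while its workunit link reports that the workunit cannot be found, which matches the symptom reported in this thread.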
Fuzzy Hollynoodles Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
... Not the case with mine. This was one I dumped because I was restoring my system from a backup. The backup was a mirror of my system at the time I took it, so when I restored from it, this WU, which was in my cache, was overwritten by the ones I had when I took the backup. So it is gone and will never be returned. A little weird that it hasn't been resent to others though... ?! "I'm trying to maintain a shred of dignity in this world." - Me |
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,081,660 RAC: 345 |
Hi, Fuzzy, That is strange. I looked at the first two workunits that dgeiser reported and they appear to have a problem similar to yours. All three are way past their deadline but are still listed as "In Progress". Shouldn't that change once the deadline is past? Why haven't they been sent out again? However, the third WU reported by dgeiser and the one I reported have a different problem. Those have been crunched and returned and we've apparently gotten the credit. They just were not completely cleaned up from the database. Another thing I noticed about all these workunits is that they are from this past March. I wonder if something was wrong with the database back then that could have manifested itself in these problems? With CASP7 going on, I'm sure the folks at RAH HQ don't want to start tearing into possible database problems that aren't hurting them at the moment. Maybe later, though. Charlie PS. Looks like I'm going to have to get a picture of my cat and start using it here! -Charlie |
surrealchereal Joined: 6 Nov 05 Posts: 23 Credit: 243,559 RAC: 0 |
I don't often bother to look at my messages. I did yesterday, and there were red messages saying the WU should be aborted because it's overdue and I probably won't get credit for it. How long do you have to process one, or were those ghosts? Or could this have happened because I tried powering down my hard drive for a while? Come BOINC with me! USALUG !! |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
...WU should be aborted because it's overdue and I probably won't get credit for it. Ghosts are where the project's WU page shows WUs that you don't have on your PC for some reason. Apparently the server can hit a state where it thinks it has sent a WU to you, but your PC doesn't think it has received it. The fact that you see the WU in your list means it is not a ghost. If you powered off your hard drive, Rosetta may not have been able to do work during that time, and so may have fallen behind on the deadlines. If your WUs have passed their deadlines, yes, you should abort them and some new ones will be downloaded. Strictly speaking, the results of those WUs are still scientifically interesting, but you won't get credit for them, and they may be past the CASP deadline as well, so it is better to crunch the current CASP targets and get results reported in time to help find the best model for CASP. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
dag Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
Nonetheless, my original post said, "Here is a problem", and all I get are a bunch of pundits arguing about what kind of problem it is instead of a project member saying, "I've just taken care of it." dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
©2024 University of Washington
https://www.bakerlab.org