Orphan/Ghost WUs

Message boards : Number crunching : Orphan/Ghost WUs

Profile dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 19331 - Posted: 26 Jun 2006, 22:03:57 UTC
Last modified: 26 Jun 2006, 22:09:44 UTC

I have three WUs that were sent out in early March that have not been caught in the weekly cleanup. You can view them here. Two errored out, one was successful, but all three are still sitting in my results list from March 2, 11, and 21.

WU1 errored
WU2 errored
WU3 says it can't find the work unit even though it's in my results listing.
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

Profile Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 345
Message 19336 - Posted: 27 Jun 2006, 0:15:37 UTC - in response to Message 19331.  
Last modified: 27 Jun 2006, 0:16:16 UTC

I have three WUs that were sent out in early March that have not been caught in the weekly cleanup. You can view them here. Two errored out, one was successful, but all three are still sitting in my results list from March 2, 11, and 21.

WU1 errored
WU2 errored
WU3 says it can't find the work unit even though it's in my results listing.


I've got one like that last one, too. The result is in my list of results and is here:
https://boinc.bakerlab.org/rosetta/result.php?resultid=12257513

If you click on the WU id number from that page or my results page, it complains it can't find the workunit. The result was returned on March 3.

I wonder if the RAH servers burped when they were removing old results and workunits and left a few things undone.

Charlie


-Charlie

Profile Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 19344 - Posted: 27 Jun 2006, 6:14:31 UTC

YES - this problem of ghost/phantom WU(s) is a BOINC problem,
and it costs projects (and science) many teraflops.


I have seen this occurring with regular frequency in nearly all the projects I crunch for.


To avoid ending up with lots and lots of pending credits, the projects
are forced to use an initial replication of 3 when only 2 would suffice.

A REAL WASTE OF COMPUTING POWER !


Either the BOINC server software running here is too old,
or this problem was never fixed by the BOINC developers.

However, it seems that Einstein@home is using a custom scheduler hack that resends results already assigned to a host if the host asks for more work and does not list them in its "already have" list. They call these "lost results". You could ask them for the patch - I'd say it's a rather useful feature.

I wonder why this hack is not the default behavior of BOINC,
and why this remains unfixed in the BOINC server-side software to this day -:(

Profile Keck_Komputers
Joined: 17 Sep 05
Posts: 211
Credit: 4,246,150
RAC: 0
Message 19348 - Posted: 27 Jun 2006, 7:07:11 UTC

This problem has gotten better in the 5.x.x clients and servers but has not been eliminated.

The resend of lost tasks is a standard function of BOINC; however, it must be enabled by the project. Most projects have not done this, since the tasks will be resent to different hosts automatically if needed, and it does cost some server overhead to resend them to the same host.
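
For anyone curious, the resend idea boils down to a set difference. Here is a minimal Python sketch, not BOINC's actual scheduler code: the function name `find_lost_results` and the result names are made up for illustration, and it assumes the server keeps a list of results it believes are in progress on a host, while the host reports the results it actually holds with each work request.

```python
def find_lost_results(assigned_on_server, reported_by_host):
    """Return results the server thinks the host has but the host did not list.

    Both arguments are lists of result names (hypothetical identifiers).
    Order of the server-side list is preserved in the output.
    """
    have = set(reported_by_host)
    return [name for name in assigned_on_server if name not in have]


# Hypothetical example: the server believes three results are on the host,
# but the host only lists one of them in its work request.
server_side = ["wu_a_0", "wu_b_1", "wu_c_0"]
host_side = ["wu_a_0"]

lost = find_lost_results(server_side, host_side)
print(lost)
```

Anything returned here would be a candidate for resending to the same host. A project that leaves the feature off simply lets the task be reissued to a different host when needed.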
BOINC WIKI

BOINCing since 2002/12/8

Profile Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 345
Message 19353 - Posted: 27 Jun 2006, 14:06:18 UTC

Just wanted to make sure something is clear here. Do not confuse what dgeiser reported and I confirmed with the traditional ghost WU problem. I believe what we reported is different.

In the traditional ghost WU case, the BOINC server thinks it has sent a WU to a particular computer. The WU will show up in the list of results for that computer. However, the computer's owner will report that the WU is not on the computer and never was. The WU will eventually pass its report deadline and be resent to another computer.

What dgeiser reported and I yelled, "me, too!", is different. In our cases the WUs were sent to our computers, we crunched them and sent them back. All that was fine. However, some time later, in the normal process of the server cleaning out old results and workunits, the workunit was removed from the database but the corresponding result was not.
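
The kind of leftover described above (a result row whose workunit row is gone) is easy to picture as a query. This is only a sketch using an in-memory SQLite database; the table and column names (`workunit`, `result`, `workunitid`) and the helper `find_orphaned_results` are assumptions for the example and have not been checked against the project's real schema.

```python
import sqlite3


def find_orphaned_results(conn):
    """Return ids of result rows that point at a workunit that no longer exists."""
    rows = conn.execute(
        "SELECT r.id FROM result AS r "
        "LEFT JOIN workunit AS w ON w.id = r.workunitid "
        "WHERE w.id IS NULL "
        "ORDER BY r.id"
    ).fetchall()
    return [row[0] for row in rows]


# Toy data: workunit 3 has been purged, but its result (id 12) was left behind.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE workunit (id INTEGER PRIMARY KEY);
    CREATE TABLE result (id INTEGER PRIMARY KEY, workunitid INTEGER);
    INSERT INTO workunit (id) VALUES (1), (2);
    INSERT INTO result (id, workunitid) VALUES (10, 1), (11, 2), (12, 3);
    """
)

orphans = find_orphaned_results(conn)
print(orphans)
```

A result found this way would still show up in a user's results list while the link to its workunit fails, which matches the symptom reported here.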

Charlie

-Charlie

Profile Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 19354 - Posted: 27 Jun 2006, 14:16:36 UTC - in response to Message 19353.  

...

What dgeiser reported and I yelled, "me, too!", is different. In our cases the WUs were sent to our computers, we crunched them and sent them back. All that was fine. However, some time later, in the normal process of the server cleaning out old results and workunits, the workunit was removed from the database but the corresponding result was not.

Charlie


Not the case with mine.

This was one I dumped because I was restoring my system from a backup. The backup was a mirror of my system at the time I took it, so when I restored from it, this WU, which was in my cache, was overwritten by the ones I had when the backup was taken. So it is gone and will never be returned.

A little weird that it hasn't been resent to others though... ?!


"I'm trying to maintain a shred of dignity in this world." - Me


Profile Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 345
Message 19357 - Posted: 27 Jun 2006, 14:46:24 UTC - in response to Message 19354.  



Not the case with mine.

This was one I dumped because I was restoring my system from a backup. The backup was a mirror of my system at the time I took it, so when I restored from it, this WU, which was in my cache, was overwritten by the ones I had when the backup was taken. So it is gone and will never be returned.

A little weird that it hasn't been resent to others though... ?!



Hi, Fuzzy,

That is strange. I looked at the first two workunits that dgeiser reported and they appear to have a problem similar to yours. All three are way past their deadline but are still listed as "In Progress". Should that change once the deadline is past? Why haven't they been sent out again?

However, the third WU reported by dgeiser and the one I reported have a different problem. Those have been crunched and returned and we've apparently gotten the credit. They just were not completely cleaned up from the database.

Another thing I noticed about all these workunits is that they are from this past March. I wonder if something was wrong with the database back then that could have manifested itself in these problems?

With CASP7 going on, I'm sure the folks at RAH HQ don't want to start tearing into possible database problems that aren't hurting them at the moment. Maybe later, though.

Charlie

PS. Looks like I'm going to have to get a picture of my cat and start using it here!
-Charlie

Profile surrealchereal
Joined: 6 Nov 05
Posts: 23
Credit: 243,559
RAC: 0
Message 19697 - Posted: 2 Jul 2006, 16:21:52 UTC - in response to Message 19331.  

I don't often bother to look at my messages.
I did yesterday and there were red messages saying the WU should be aborted because it's overdue and I probably won't get credit for it.

How long do you have to process one, or were those ghosts?
Or could this have happened because I tried powering down my hard drive for a while?
Come BOINC with me!

USALUG !!

Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19798 - Posted: 5 Jul 2006, 15:00:20 UTC - in response to Message 19697.  

...WU should be aborted because it's overdue and I probably won't get credit for it.

How long do you have to process one, or were those ghosts?
Or could this have happened because I tried powering down my hard drive for a while?

Ghosts are where the project's WU page shows WUs that you don't have on your PC for some reason. Apparently the server can hit a state where it thinks it has sent it to you, but your PC doesn't think it's received it.

The fact that you see the WU in your list means it is not a ghost. If you powered off your hard drive, Rosetta may not have been able to do work during that time, and so may have fallen behind on the deadlines.

If your WUs have passed their deadlines, yes, you should abort them and some new ones will be downloaded. Strictly speaking, the results of those WUs are still scientifically interesting, but you won't get credit for them, and they may be past the CASP deadline as well, so it is better to crunch the current CASP targets and get results reported in time to help find the best model for CASP.
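
The distinction drawn above between a ghost and an overdue task can be written down as a small decision function. This is a rough Python sketch under my own assumptions: the function `classify_task`, its arguments, and the example dates are hypothetical and are not taken from BOINC or from anyone's actual task list.

```python
from datetime import datetime, timezone


def classify_task(listed_on_server, present_on_host, deadline, now):
    """Label a task using the definitions given in this thread.

    ghost:   the server lists it for this host, but the host does not have it
    overdue: the host has it, but the report deadline has already passed
    """
    if listed_on_server and not present_on_host:
        return "ghost"
    if present_on_host and now > deadline:
        return "overdue"
    if present_on_host:
        return "in progress"
    return "unknown"


# Hypothetical dates, chosen only to exercise each branch.
deadline = datetime(2006, 7, 1, tzinfo=timezone.utc)
now = datetime(2006, 7, 5, tzinfo=timezone.utc)

print(classify_task(True, False, deadline, now))
print(classify_task(True, True, deadline, now))
```

A task that shows up in the local task list can only land in the "overdue" or "in progress" branches, which is why seeing it on your own machine rules out a ghost.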
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

Profile dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 19801 - Posted: 5 Jul 2006, 15:36:58 UTC

Nonetheless, my original post said, "Here is a problem", and all I get is a bunch of pundits arguing about what kind of problem it is instead of a project member saying, "I've just taken care of it."
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.



©2024 University of Washington
https://www.bakerlab.org