code release and redundancy

Message boards : Number crunching : code release and redundancy



Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,120
RAC: 21
Message 5348 - Posted: 7 Dec 2005, 8:29:04 UTC - in response to Message 5345.  

Unfortunately, it doesn't sound like flop counting would solve cheating in our app without redundancy.


I don't think by itself it would "solve" cheating - but what it _would_ solve, all by itself, is "cheating via inflated benchmarks". Rather than having someone "inadvertently" cheating, by running an optimized Client that they installed because of SETI, that optimized client would simply have no effect here. If someone _deliberately_ wanted to cheat, and was willing to go to some effort, they still could. But with _no_ other effort on your part, flops-counting could eliminate what I think is 95% of the problem. The casual "hm, Rosetta will give me anything I ask for, so I'll run this super-optimized Client" people would be out of luck.

Work Units that differ only in their random number seed can have significantly different running times because of the nature of our algorithm. So we wouldn't be in the situation where "all the WUs in this block give 27.35 credits, +/- 0.5". The only way to be able to compare flop counts would be to have 100% identical jobs, so we would be back to redundancy.


Darn. That was the part I wasn't sure about, how much variation there was just due to the random seed.

Also, it sounds like flop-counting would be just as prone to code manipulation, and so wouldn't make people any more comfortable about releasing the code.


True... unless you "do like Einstein", and allow _limited_ release of the code to optimizers, excluding some necessary portion, then roll the optimizations back in to the official release. This is the part where SETI falls down. If YAOSCW-whatever is so good, then why the heck isn't it the standard??? But yes, it doesn't resolve the issues around full open-source.

To summarize my understanding: Flop counting provides more accurate feedback on how much work a host did. But because our app performs variable amounts of work even for very similar work units, we might not be able to catch cheaters simply by looking for outliers in small quora. This would mean that we are back to implementing true redundancy. If we do end up using redundancy, it sounds like it might be worth implementing flop counting with it.


Yes. My feeling is that as a "quick and dirty" thing, flops-counting, even by itself, is far better than nothing.

Pseudo-Redundancy

That said, I still think pseudo-redundancy has a lot going for it. For all of the folks worrying about accuracy, the quorum sizes would be in the thousands. The issues about being paired with a Mac user, and the like, would disappear. You wouldn't be getting "average" credit any more than with any other quorum system. A credit amount would be assigned based on the median, but the rate at which you gained credit would depend on the precise speed of your boxes relative to this average.


The only part of this that's bothersome is the "thousands" part. Not sure that would be practical, and where do you start validating and awarding credit? I could see a quorum size of 100, but even that might be unwieldy. The largest I've seen is more like six! Unless the deadlines were quite short, you'd have trouble with results not being returned... I can just see a quorum of 1000 issued, with 900 required to validate, but 200 miss deadline, you reissue them, 101 don't come back...

As another benefit, because this would require us to code up methods to examine the distributions of cpu performance overall for each Work Unit, it would be easier to deny credit to the outliers.

My main concern, other than the development time, is convincing the community that this is a satisfactory system. I am partly basing that on what I see in this thread, where it seems to have gained very little traction. I would be very interested to hear more critique of this method, because it seems like the best solution to me.


It is probably the _best_ solution - I'm just having trouble getting my mind around all the potential complications of it, and how they could be avoided, and all the options there would be. (Too many years of system design... I tend to look for the pitfalls, the boundary conditions...)

I'll be blunt, and risk offending people, but this _IS_ my profession... while BOINC's a hobby, programming, systems analysis, and design is what pays the bills. If I were the I.T. manager assigned to make this call, I think I'd put someone on flops-counting immediately (low cost, high gain, useful regardless of later decisions), and put someone else on drawing up a plan for pseudo-redundancy. When the plan was done, I'd give it to everybody involved and tell them to tear it apart and find the problems. That might take a couple of iterations. Then and only then would I be able to estimate what it would require to actually implement it. I've never been a proponent of requiring everything to be planned and spec'd to the last detail before starting coding, but I always figured if I couldn't grasp the whole gestalt of the project, at _least_ enough design work should be done up front so that I, and the lead people working on it, all knew where we were going. I'm concerned that the Rosetta people "know" Rosetta, and the BOINC people "know" BOINC, but there isn't anyone with enough knowledge of _both_, such that they could safely estimate, plan, and direct this without the detailed design work up front. Once I was sure of what it would take, I could make the call of whether I had the resources to do it, or if instead, the only option was "regular" redundancy. By that point, hopefully I'd also know how well flops-counting alone was doing on eliminating any perceived or real cheating.

ID: 5348 · Rating: 1
Profile Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 5354 - Posted: 7 Dec 2005, 12:46:04 UTC - in response to Message 5345.  

My main concern, other than the development time, is convincing the community that this is a satisfactory system. I am partly basing that on what I see in this thread, where it seems to have gained very little traction. I would be very interested to hear more critique of this method, because it seems like the best solution to me.


For my part, I do not, still, really understand how you are planning to make this work. Perhaps it is the name.

In one of the studies I did of credit granted for work done, there was a reasonable correlation in CS per second among the projects using the redundancy/averaging system (BOINC standard). Which means there is not a huge difference between running for one project or another.

CPDN, with the "bonus" for completing models and flat-rate granting of credit per trickle, in fact gives a higher "return" per second.

You are proposing a third system with new implications. As with Bill, I ask questions about boundary conditions and implications (perhaps you have noticed? :)). What these are is still a little in the air.

As best I can tell, you would not in fact be collecting a quorum, but would instead be doing a running average of sorts and then granting credit based on those numbers. In effect, if I have this right, a flat-rate application of CS per second, but based on a large population of claims.

With outlier trimming of 0s and obvious nonsense at the top, this *SHOULD* ensure that the average grant is closer to a reality that can be supported.
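The trim-the-tails idea described above can be sketched in a few lines. This is a toy illustration; the function name and the 3x-median cutoff are invented for the example, not taken from any real BOINC code:

```python
def trimmed_credit(claims):
    """Average the claimed credits for a WU type after discarding
    zero claims and obviously inflated ones (here: > 3x the median)."""
    valid = sorted(c for c in claims if c > 0)    # trim the 0s
    if not valid:
        return 0.0
    median = valid[len(valid) // 2]
    kept = [c for c in valid if c <= 3 * median]  # trim nonsense at the top
    return sum(kept) / len(kept)

# One zero, six honest claims, one inflated-benchmark claim:
batch = [0.0, 24.8, 25.1, 25.6, 26.0, 26.3, 27.1, 250.0]
print(round(trimmed_credit(batch), 2))  # -> 25.82; the 250.0 no longer drags the average up
```

In practice the cutoff would have to be tuned so that honest fast or slow hosts are not trimmed along with the cheaters.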

If this is a correct summarization of the intent, I think I would be inclined to support it as a third mechanism for granting credit. This would be especially valuable for non-redundant projects in the future if we can work it out here.

However, because of the variances caused by the benchmarking system, the concept of "papering over" the flaws with averaging is not very supportable. In early beta we did not expect the variances to be this large. Most other concepts foundered on other aspects. With that said, FLOP counting should also be considered. But the urgency goes down. Now it can be added as part of general changes in the application instead of as a major effort that has to be completed immediately.

In the final system, the more accurate FLOP counts drive the claims, and the variance is smoothed with your pseudo-redundancy (though you need a better name), and this allows you to avoid wasting contributions ONLY to improve credit scores.

So, all this in mind, we need a new thread ... :)
ID: 5354 · Rating: 0
Honza

Joined: 18 Sep 05
Posts: 48
Credit: 173,517
RAC: 0
Message 5356 - Posted: 7 Dec 2005, 13:39:45 UTC - in response to Message 5354.  

CPDN, with the "bonus" for completing models and flat rate granting of credit per trickle in fact gives a higher "return" per second.
Yes, fixed credit per trickle/model, that's right - but what "bonus" are you talking about?
CPDN is not the only project that gives fixed credit per WU type. I think PrimeGrid is following the same pattern.

ID: 5356 · Rating: 0
Profile Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 5361 - Posted: 7 Dec 2005, 15:06:53 UTC

If I recall correctly, you can finish a model and the total credit is 5,700 and change. When they run the background jobs this is raised to 6,805.26 ... urp! Ah, that is the credit for the 72nd trickle ...

And PrimeGrid is using only 1 result, but variable credit.
ID: 5361 · Rating: 0
Scott Brown

Joined: 19 Sep 05
Posts: 19
Credit: 8,739
RAC: 0
Message 5364 - Posted: 7 Dec 2005, 15:32:56 UTC - in response to Message 5345.  

Pseudo-Redundancy

That said, I still think pseudo-redundancy has a lot going for it. For all of the folks worrying about accuracy, the quorum sizes would be in the thousands. The issues about being paired with a Mac user, and the like, would disappear. You wouldn't be getting "average" credit any more than with any other quorum system. A credit amount would be assigned based on the median, but the rate at which you gained credit would depend on the precise speed of your boxes relative to this average.


I am very concerned that the statistical issues involved in this discussion are not being fully understood. First, I must agree with Bill and Paul that lack of clarity regarding how, and more importantly when, the distribution of results would be assessed for assigning credit is fundamentally problematic. There are no clear criteria being proposed for this, and the potentially excessive time involved to obtain credit is likely to turn substantial numbers of users away.

More importantly, the assumptions underlying this approach (particularly that unusually high and low credit claims are temporally spread via a random process) are clearly untrue. Across projects, we have observed that faster machines return units more quickly and (if not optimized) claim less credit. The opposite holds for slower machines. Optimized BOINC clients complicate this picture, but since we have no reason to suspect that faster or slower machines are more likely to use optimization, then the effect should not be significant. Thus, the granted credit from these distributions will necessarily more heavily reflect the results returned by the fastest machines (and therefore, are likely to be downwardly biased).

I would also point out that it has been incorrectly claimed below that this process would reduce "noise" in the results. Noise (or, statistically, error) is the variation around a measure of central tendency (e.g., mean, median, mode, a regression parameter, etc.). Collecting thousands of results does nothing to reduce the noise in a distribution; it simply increases the accuracy with which characteristics of that distribution are estimated. In other words, it does not matter whether one uses 4, 10, or 100 units for redundancy; the actual spread of the true distribution always remains the same (i.e., this procedure does nothing to reduce the vast spread in claims based on benchmarking, which is the fundamental problem--see Ingleside's posts below). Given the fundamental biases created by faster machines likely dominating the credit-determining results, and the issues with optimized BOINC clients, the effect of a "pseudo-redundancy"-only approach is at best to mute these differences (and Paul is also right here--this needs a better name).
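The distinction above, between the spread of a distribution and the accuracy of its estimated parameters, can be shown with a toy simulation (the lognormal claim model and all the numbers are purely illustrative):

```python
import random
import statistics

random.seed(42)

def sample_claims(n):
    # Toy model: every host did the same 25-credit job, but benchmark
    # noise multiplies each claim by a lognormal factor.
    return [25.0 * random.lognormvariate(0, 0.3) for _ in range(n)]

for n in (4, 100, 2000):
    spread = statistics.stdev(sample_claims(n))
    # How stable is the estimated median across repeated samples of size n?
    medians = [statistics.median(sample_claims(n)) for _ in range(200)]
    est_err = statistics.stdev(medians)
    print(f"n={n:5d}  spread of claims ~ {spread:5.2f}  "
          f"sampling error of the median ~ {est_err:5.2f}")
```

The spread stays roughly constant at every sample size; only the sampling error of the estimated median shrinks as n grows. That is exactly the point: big samples pin down the distribution's parameters without narrowing the distribution itself.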

I believe that you are too quick to dismiss FLOPs counting and to view the two options as mutually exclusive. In many ways, the complementary application of the two alternatives solves the inherent problems in both. FLOPs counting seems to reduce the fundamental problem of a large variance in claimed credit, while "pseudo-redundancy" provides more accurate estimates of distributional parameters that allow for the detection of, and credit denial for, claims that are likely the result of overt code manipulation.
ID: 5364 · Rating: 0
Scott Brown

Joined: 19 Sep 05
Posts: 19
Credit: 8,739
RAC: 0
Message 5366 - Posted: 7 Dec 2005, 15:35:17 UTC - in response to Message 5361.  

If I recall correctly, you can finish a model and the total credit is 5,700 and change. When they run the background jobs this is raised to 6,805.26 ... urp! Ah, that is the credit for the 72nd trickle ...

And PrimeGrid is using only 1 result, but variable credit.


Paul,

Would I be correct in assuming that PrimeGrid has fewer issues with variability in credit claims because it is more reliant on integer rather than floating-point calculations?

ID: 5366 · Rating: 0
Ingleside

Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 5367 - Posted: 7 Dec 2005, 15:49:32 UTC - in response to Message 5345.  

Flop Counting

Unfortunately, it doesn't sound like flop counting would solve cheating in our app without redundancy. Work Units that differ only in their random number seed can have significantly different running times because of the nature of our algorithm. So we wouldn't be in the situation where "all the WUs in this block give 27.35 credits, +/- 0.5". The only way to be able to compare flop counts would be to have 100% identical jobs, so we would be back to redundancy.

...

Pseudo-Redundancy

You are right that it wouldn't take unlimited development time. But it would take significantly more time than turning on redundancy. And if we wanted to include validation (which we get for free with redundancy) that will require yet more coding. Plus, we need to make sure that our implementation is solid enough to satisfy the boinc community. Or at least as solid as what we get from boinc with redundancy.

That said, I still think pseudo-redundancy has a lot going for it. For all of the folks worrying about accuracy, the quorum sizes would be in the thousands. The issues about being paired with a Mac user, and the like, would disappear. You wouldn't be getting "average" credit any more than with any other quorum system. A credit amount would be assigned based on the median, but the rate at which you gained credit would depend on the precise speed of your boxes relative to this average.


If I haven't completely misunderstood the workings of the "pseudo-redundancy" system, this basically boils down to "1 WU = 1 credit". There will be multiple WU types, and therefore multiple granted credits, but you're still at "1 WU = 1 credit".

But, hang on, you just mentioned that Work Units that differ only in their random number seed can have significantly different running times because of the nature of our algorithm. Meaning, you're basically re-inventing the system SETI@Home "classic" was using, which was discarded when BOINC was implemented because of the large variations in crunch times caused by different AngleRanges...

ID: 5367 · Rating: 0
Rytis

Joined: 17 Sep 05
Posts: 9
Credit: 183,185
RAC: 0
Message 5368 - Posted: 7 Dec 2005, 15:50:16 UTC - in response to Message 5366.  

If I recall correctly, you can finish a model and the total credit is 5,700 and change. When they run the background jobs this is raised to 6,805.26 ... urp! Ah, that is the credit for the 72nd trickle ...

And PrimeGrid is using only 1 result, but variable credit.


Paul,

Would I be correct in assuming that PrimeGrid has fewer issues with variability in credit claims because it is more reliant on integer rather than floating-point calculations?


I have set a max-credit limit for PrimeGrid (this is not a feature of the standard BOINC scheduling system), as we basically know the biggest "legal" value that could be given. The variability of the credit in PG is because of the difference that different CPUs make - we are more an integer- than a floating-point-based project. The workunit length is almost the same every time.
PrimeGrid
Administrator
ID: 5368 · Rating: 0
Profile Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 5378 - Posted: 7 Dec 2005, 17:12:00 UTC

Scott,

What Rytis said. :)

He the man ...

This was one of the points I had to argue for a little bit when the new FLOPS counting was proposed because there was no separate way to count IOPS. This should be corrected now (though I have not checked).

Oddly enough, my question about f(FLOPS + IOPS) => CS has gone unanswered to this point. I mean, a Cobble-Computer is measured on both FLOPS and IOPS, yet, we seem to be heading to counting one or the other but not both. Again, this was not clear before in the documentation from UCB. And I have not looked at the source code to see what they are doing there ...
ID: 5378 · Rating: 0
Profile Jack Schonbrun

Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 5524 - Posted: 8 Dec 2005, 4:58:43 UTC

Thanks to everyone for commenting on pseudo-redundancy (which I happen to like as a name). After going through and trying to respond to everyone, I'm starting to think that flop counting combined with code authentication might be the fairest system. Does anyone know of examples where other projects have done authentication?

Some responses follow.

It is probably the _best_ solution - I'm just having trouble getting my mind around all the potential complications of it, and how they could be avoided, and all the options there would be.


Okay, the original goal of this thread was to see whether there is a way to release the code without compromising the integrity of the credit system. The simplest solution, both in implementation and robustness, is redundancy. This is why David Kim and David Baker have put it forward as the most likely next step. As the number of computers running Rosetta grows, my feeling is that the incentive to eliminate redundancy also grows. In raw numbers, the number of cpu cycles being underutilized will become more and more significant. So even if it will take a while to implement pseudo-redundancy, or some other yet to be designed system, it's worth thinking about what a good design would look like.

I don't think by itself it would "solve" cheating - but what it _would_ solve, all by itself, is "cheating via inflated benchmarks".


Once more, for my benefit: "Cheating via inflated benchmarks" is something our current system is susceptible to. Redundancy, even without flop counting, would improve this by weeding out spurious credit claims. Flop counting would also solve this problem, by preventing spurious credit claims in the first place. But it would be open to problems if we release the full code, because the flop counts could be manipulated. And we would be unable to validate these flop counts without redundancy.

So if we could find a way to release the code in a limited way, such that we can ensure that credit is only granted by code that has been authenticated, flop counting would be sufficient?

I'm concerned that the Rosetta people "know" Rosetta, and the BOINC people "know" BOINC, but there isn't anyone with enough knowledge of _both_, such that they could safely estimate, plan, and direct this without the detailed design work up front.


I agree with this, and it's one reason I'm trying to participate in this discussion. :)

First, I must agree with Bill and Paul that lack of clarity regarding how, and more importantly when, the distribution of results would be assessed for assigning credit is fundamentally problematic. There are no clear criteria being proposed for this, and the potentially excessive time involved to obtain credit is likely to turn substantial numbers of users away.


This reveals another design constraint that I hadn't considered: the rapidity with which credit is granted. You make a very good point that faster machines will return work units more quickly, and skew the early estimates of the distribution. My naive view had been that we could wait until a large fraction of the units had returned. But it sounds like that would not be very popular. Though I am curious what the time for credit granting is like for systems even with small quorums. Presumably you don't get instant credit?

it does not matter whether one uses 4, 10, or 100 units for redundancy, the actual spread of the true distribution always remains the same (i.e., this procedure does nothing to reduce the vast spread in claims based on benchmarking which is the fundamental problem


I apologize for my sloppy use of the concept of "noise." But you do agree that larger sample sizes produce better "estimates of the distributional parameters"?

I believe that you are too quick to dismiss FLOPs counting and to view the two options as mutually exclusive.


I do not want to give the impression that I am dismissing FLOPs counting. I am trying to evaluate its role in providing a robust credit system that allows us to release source code and yet avoid redundancy.

If I haven't completely misunderstood the workings of the "pseudo-redundancy" system, this basically boils down to "1 WU = 1 credit". There will be multiple WU types, and therefore multiple granted credits, but you're still at "1 WU = 1 credit".


I think you are right. But the 1 credit would be accurately based on the average properties of that kind of Work Unit. You have to tell me whether this is fair or not, I honestly do not know. Half the time you would get more credit than you deserve, and half the time you would get less. The more you ran, the more it would even out. I can see not liking this, and wanting to be rewarded for exactly how much work your computer performed.

* * *


ID: 5524 · Rating: 0
Profile Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 5533 - Posted: 8 Dec 2005, 6:31:43 UTC - in response to Message 5524.  

This is why David Kim and David Baker have put it forward as the most likely next step.


As long as that leaves the project with enough crunching power to get the work done, I don't see a problem with it as such, except....

This reveals another design constraint that I hadn't considered: the rapidity with which credit is granted.


Yes, the instant credit (whether inflated, deflated or "correct") is a big attraction, at least for me. My computers do the work today, and I get credit for it today. I don't want to wait a week, a month, or even longer, for someone else to do the work, before I get credited.

With Rosetta having a one-month deadline, this could mean many of us who run fast machines will have hundreds, if not thousands, of work units pending at any one time, and some of these may take months to clear up.

If we do go to redundancy, I would like to see the deadlines shortened to one or two weeks. All these pending work units also take up space on your servers...
*** Join BOINC@Australia today ***
ID: 5533 · Rating: 0
Profile Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 5536 - Posted: 8 Dec 2005, 6:57:26 UTC - in response to Message 5533.  
Last modified: 8 Dec 2005, 7:03:45 UTC

This reveals another design constraint that I hadn't considered: the rapidity with which credit is granted.

Yes, the instant credit (whether inflated, deflated or "correct") is a big attraction, at least for me.

So what about granting the credit instantly: take the median of the claimed credit currently in the database (for that particular WU type). The first returned result would be granted its claimed credit (current situation); for the following ones the granted credit would quickly approach a constant value.
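The instant-median scheme above can be sketched as follows. The class and its interface are invented for illustration; a real implementation would read the claims for the WU type from the project database:

```python
import bisect

class RunningMedianCredit:
    """Grant each returned result the median of all credit claims
    seen so far for its WU type; the first result gets its own claim."""

    def __init__(self):
        self.claims = []  # kept sorted so the median is a cheap lookup

    def grant(self, claimed):
        bisect.insort(self.claims, claimed)
        n = len(self.claims)
        if n % 2:
            return self.claims[n // 2]
        return (self.claims[n // 2 - 1] + self.claims[n // 2]) / 2

credit = RunningMedianCredit()
for claim in [31.0, 24.5, 26.2, 25.8, 90.0, 25.9]:
    print(f"claimed {claim:5.1f} -> granted {credit.grant(claim):5.2f}")
```

Even with the inflated 90.0 claim in the stream, the granted value quickly settles near the honest claims, which is the behavior described above.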
ID: 5536 · Rating: 0
Profile dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 5537 - Posted: 8 Dec 2005, 7:08:35 UTC - in response to Message 5524.  

These are mostly some random thoughts that address some (but not all) of the issues raised by Jack.

Okay, the original goal of this thread was to see whether there is a way to release the code without compromising the integrity of the credit system.


If full source were released, I think that it would tend to reduce the effectiveness of any credit system. In the simplest terms, in the absence of redundancy of any sort, the doors are wide open. Redundancy can do a lot to help, but I don't know that it can completely close Pandora's box. However, it could reduce the problem to the point where it's at least acceptable.

Once more, for my benefit: "Cheating via inflated benchmarks" is something our current system is susceptible to. Redundancy, even without flop counting, would improve this by weeding out spurious credit claims. Flop counting would also solve this problem, by preventing spurious credit claims in the first place. But it would be open to problems if we release the full code, because the flop counts could be manipulated. And we would be unable to validate these flop counts without redundancy.


Bang on the money, IMHO.

So if we could find a way to release the code in a limited way, such that we can ensure that credit is only granted by code that has been authenticated, flop counting would be sufficient?


If the world were black and white, I'd say this is correct. A more accurate statement might be that limited source release most likely will do the least to damage the credit system.

If I may go off at a tangent here, I have a rather ambivalent attitude to limited release. First of all, I look at this with my 3D device driver writer's hat on. In Rosetta, I'd take just about any odds you have the same optimization problem that you do in a 3D driver, or in a 3D game (I've worked on both). In these types of setups, the code that you want to optimize is usually a minimal fraction of the total code base, what I call the "inner loop" that does the actual number crunching. Releasing just that in a suitable harness would allow us to optimize the code, at a reduced risk of credit cheating.

On the other hand, the crypto expert in me takes the attitude that "security through obscurity is not security". By that, I mean that if we can find a solution that is robust enough to withstand a full source release, and still hand out true credit numbers, we arguably have the best solution from a credit fairness perspective.

First, I must agree with Bill and Paul that lack of clarity regarding how, and more importantly when, the distribution of results would be assessed for assigning credit is fundamentally problematic. There are no clear criteria being proposed for this, and the potentially excessive time involved to obtain credit is likely to turn substantial numbers of users away.


Maybe, maybe not. Stephen T has pointed out that the big groups are stats driven. We have to assume that they want to play on a level playing field, and so we would hope they understand that delayed credit award is a necessity to get a fair credit handout. Therefore if there is a delay, they should take it in stride. In the long term, your RAC will smooth out, and the numbers will come up.

This reveals another design constraint that I hadn't considered: the rapidity with which credit is granted. You make a very good point that faster machines will return work units more quickly, and skew the early estimates of the distribution.


While true, there's another value that skews the distribution: the cache size of the machine. I run a large cache (7 days) but I could easily reduce this by simply altering how often I connect in my preferences. In a perfect world, this would help smooth out the skew you've noted above. In practice, it probably won't. About all I can suggest here is to shorten the work deadline, and use that as a metric for how long to delay before awarding credit.

it does not matter whether one uses 4, 10, or 100 units for redundancy, the actual spread of the true distribution always remains the same (i.e., this procedure does nothing to reduce the vast spread in claims based on benchmarking which is the fundamental problem


I apologize for my sloppy use of the concept of "noise." But you do agree that larger sample sizes produce better "estimates of the distributional parameters"?


Going back to the statistics that I learned so many years ago, I remember the existence of three types of average: mean, mode and median. Mean is the familiar "sum divided by count". IIRC, the median is simply the 50th percentile. Mode is perhaps the most useful to us; it is the most common value. As a side bar to this, I believe that SETI uses a median for its quorum of three. In the case of a small quorum, that's arguably better than the mean. However, if the pseudo-redundancy can get us a large sample size, then the mode starts looking attractive. There's that phrase "large sample size" again. Statisticians usually love them.
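For a concrete comparison, here are the three averages on a hypothetical batch of claims. Python's statistics module supplies all three; note the claims are binned to 0.1 credit, since a mode over raw benchmark products would rarely see exact repeats:

```python
import statistics

# Hypothetical claims, binned to 0.1 credit; one inflated outlier.
claims = [24.9, 25.1, 25.1, 25.4, 26.0, 27.2, 95.0]

print("mean:  ", round(statistics.mean(claims), 2))  # -> 35.53, dragged up by the 95.0
print("median:", statistics.median(claims))          # -> 25.4, the 50th percentile, robust
print("mode:  ", statistics.mode(claims))            # -> 25.1, the most common value
```

The mean is the only one of the three that the single inflated claim can move, which is why the median and mode look more attractive once the sample is large.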

I believe that you are too quick to dismiss FLOPs counting and to view the two options as mutually exclusive.


I do not want to give the impression that I am dismissing FLOPs counting. I am trying to evaluate its role in providing a robust credit system that allows us to release source code and yet avoid redundancy.


My take on this (and I'm willing to listen to differing opinions, because I know they're out there) is that if you do a full source release, about the only solution that will hold up is some sort of quorum approach. I tend to agree with the idea that flops counting can help fix the current benchmark * CPU time system, with all its flaws, but I believe its strength will be weakened if you release the full source to Rosetta.
ID: 5537 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,120
RAC: 21
Message 5538 - Posted: 8 Dec 2005, 7:12:48 UTC - in response to Message 5524.  

Does anyone know of examples where other projects have done authentication?


No... the closest is Einstein, where they haven't released their code "openly", so there are no 3rd-party optimized versions, but they've let a few "outside" people contribute to their "official" applications. I'm not sure authentication is needed if everyone who has access has signed a non-disclosure form. It probably _is_ needed if parts or all of the code are generally released; otherwise you are forced into redundancy like SETI.

Okay, the original goal of this thread was to see whether there is a way to release the code without compromising the integrity of the credit system. The simplest solution, both in implementation and robustness, is redundancy. This is why David Kim and David Baker have put it forward as the most likely next step. As the number of computers running Rosetta grows, my feeling is that the incentive to eliminate redundancy also grows. In raw numbers, the number of cpu cycles being underutilized will become more and more significant. So even if it will take a while to implement pseudo-redundancy, or some other yet to be designed system, it's worth thinking about what a good design would look like.


This is actually backwards from what I was thinking - that right now, with too few hosts to "keep up" with the work, redundancy would hurt a lot, but later, when there were "plenty" of hosts, it wouldn't hurt as much.

Once more, for my benefit: "Cheating via inflated benchmarks" is something our current system is susceptible to. Redundancy, even without flop counting, would improve this by weeding out spurious credit claims. Flop counting would also solve this problem, by preventing spurious credit claims in the first place. But it would be open to problems if we release the full code, because the flop counts could be manipulated. And we would be unable to validate these flop counts without redundancy.


Yes, but flops-counting by itself, with or without code release, would still need some kind of "sanity check". This wouldn't have to be redundancy, but could be an after-the-fact range check; just look at the "top 2%" of credit claims after each batch, and see if those people _always_ seem to get the best WUs. BOINC itself is already open, so a "cheat" could be applied either there _or_ in your app. I'm anti-redundancy, and used to writing all kinds of ad-hoc reports to do data mining... I don't see it as a big issue to do the validation "manually". (It wouldn't be too hard to automate the process and only report a couple of names a month for a human to look at.) As long as the participants know it's being done, the details don't really matter.
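A rough sketch of what that after-the-fact "top 2%" check might look like, in Python. Everything here is invented for illustration (the data layout, the thresholds, the function names); a real script would pull claims out of the project database:

```python
from collections import Counter

def flag_suspect_hosts(results, top_fraction=0.02, min_batches=3):
    """results: dict mapping batch_id -> list of (host_id, claimed_credit).

    Flags hosts that land in the top slice of claimed credit in several
    independent batches -- one lucky WU won't trip this, but a benchmark
    cheat that inflates every claim will.
    """
    top_counts = Counter()
    for claims in results.values():
        ranked = sorted(claims, key=lambda c: c[1], reverse=True)
        cutoff = max(1, int(len(ranked) * top_fraction))  # the "top 2%"
        for host, _ in ranked[:cutoff]:
            top_counts[host] += 1
    return [host for host, n in top_counts.items() if n >= min_batches]
```

The point of counting across batches rather than flagging single claims is exactly the one made above: you only hand a name to a human when the same host keeps turning up at the top.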

So if we could find a way to release the code in a limited way, such that we can ensure that credit is only granted by code that has been authenticated, flop counting would be sufficient?


That is _my_ feeling. This would have to be reexamined periodically, if there were any indications that cheating was going on and not being caught.

This reveals another design constraint that I hadn't considered: the rapidity with which credit is granted. You make a very good point that faster machines will return work units more quickly, and skew the early estimates of the distribution. My naive view had been that we could wait until a large fraction of the units had returned. But it sounds like that would not be very popular. Though I am curious what the time for credit granting is like for systems even with small quorums. Presumably you don't get instant credit?


I would guess that anything over two or three days "average" would cause complaints. Einstein probably has the longest wait (and gets complaints) at about a week, Rosetta the shortest. SETI is about in the middle; about half of mine get credit almost instantly, with others waiting up _to_ a week, rarely longer. I don't know how large the "blocks" would be, what the deadline would be or if it would remain the same, what your quorum settings would be... lots of options that would affect the average time. Some people _will_ have a 10-day cache, and will return results at the last minute, whether they be slow or fast in calculating the result. If enough do this, it'll lengthen things terribly. And could one host get two results from the same WU? This isn't allowed on other projects at present, so you'd have to have at least as many WUs in progress at any time as there are hosts registered... are you even "input constrained" at all, or are there an infinite number of possible things for us to test?

If I've not completely misunderstood the workings of the "pseudo-redundancy" system, this basically boils down to "1 wu = 1 credit". There will be multiple wu-types, and therefore multiple granted credits, but still you're at "1 wu = 1 credit"


I think you are right. But the 1 credit would be accurately based on the average properties of that kind of Work Unit. You have to tell me whether this is fair or not, I honestly do not know. Half the time you would get more credit than you deserve, and half the time you would get less. The more you ran, the more it would even out. I can see not liking this, and wanting to be rewarded for exactly how much work your computer performed.


If the range is very large, this isn't going to fly. If I claim 25 and get 20, well, so be it. But if I claim 100 and get 20, I'm not going to be a happy camper. It's going to be hard to convince "me" (speaking generically) that "in the long run" it'll average out. Especially if "I" see two or three in a row where I'm the one who got less, even if it was only a little bit on all but one of them. Being right is necessary, but not always sufficient...

For any kind of redundancy and credit-averaging to be accepted, you have to be able to explain it simply. Three or four _identical_ results, and I get the median, ok, got it. Thousands of results that are all over the place on claimed credit? I think you've got a problem. When this was brought up, my thought was that a WU would be maybe five or six results, and they would be selected such that the times would be very close (seed assigned by server, with resulting time range known; similar to angle range on SETI). If the seed is assigned by the host, or if the time effect of the seed isn't known, this is (to me) a major factor _against_ pseudo-redundancy.

If you look at CPDN, they have a "fixed" amount of credit, but it is for a "fixed" amount of work; it may take me 8 hours to get a trickle and it may take you 12 hours, but we both get the same number of credits for the same number of cycles done. (My computer is just faster than yours.) Most other projects use benchmark*time (which is the inferior equivalent of flops-counting). Having some other approach that doesn't reflect "actual work", at least somewhat closely, I think is going to be disliked.

---

MY thought is that you're thinking too hard about the credit side of this. Yes, something should be done, because Rosetta is getting a reputation (I've seen it mentioned on other boards) as "the place to go to get credits easily". Not a problem from the project perspective, because it attracts users - until the users who AREN'T inflating their benchmarks get mad and leave. Simple answer is redundancy, but those who care most about the science don't like seeing the waste. Flops-counting plus some server-side sanity checks to catch the one guy who goes to the effort to get around the flops-counting code would, I think, satisfy everyone.

I still think pseudo-redundancy could be the "best" answer, but you're going to have to get the quorum size down to something manageable (in terms of delayed credit) and the range of times within each WU down to "pretty close". I don't know if you can do that.

ID: 5538
Profile Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 5544 - Posted: 8 Dec 2005, 8:55:30 UTC - in response to Message 5536.  
Last modified: 8 Dec 2005, 8:56:29 UTC

So what about granting the credit instantly: take the median of the claimed credit currently in the database (for that particular WU type).


That would be grossly unfair, given the enormous variation in computation times of the same type of WU on the same PC, let alone on lots of different computers. In addition, the median will change every time a new WU is returned... Do you want the project to keep adjusting your granted credit, based on that moving target?

Just to illustrate the variation in completion times... Here's some examples of completion times for a bunch of 1dcj_abrelax_rand_len10_jit02_omega_sim_filters WUs on one of my computers (times rounded to nearest second, ascending order):


  • 2633s
  • 3751s
  • 3786s
  • 3830s
  • 4066s
  • 4258s
  • 4368s
  • 5839s



As you can see, the longest of them took more than twice as long as the shortest (and this is just a small sample). Same computer, same WU series. What if I was unlucky enough to get lots of the long WUs?
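For what it's worth, a couple of lines of Python put numbers on that spread, using the times listed above (about a 2.2x ratio between longest and shortest):

```python
from statistics import mean, stdev

# Completion times (seconds) quoted above: same host, same WU series.
times = [2633, 3751, 3786, 3830, 4066, 4258, 4368, 5839]

spread = max(times) / min(times)   # longest run vs. shortest run
cv = stdev(times) / mean(times)    # coefficient of variation

print(f"spread = {spread:.2f}x, cv = {cv:.1%}")
```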


*** Join BOINC@Australia today ***
ID: 5544
Profile Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 5553 - Posted: 8 Dec 2005, 12:36:39 UTC - in response to Message 5524.  

Thanks to everyone for commenting on pseudo-redundancy (which I happen to like as a name.) After going through and trying respond to everyone, I'm starting to think that flop counting combined with code authentication might be the fairest system. Does anyone know of examples where other projects have done authentication?

Some responses follow.

It is probably the _best_ solution - I'm just having trouble getting my mind around all the potential complications of it, and how they could be avoided, and all the options there would be.


Okay, the original goal of this thread was to see whether there is a way to release the code without compromising the integrity of the credit system. The simplest solution, both in implementation and robustness, is redundancy. This is why David Kim and David Baker have put it forward as the most likely next step. As the number of computers running Rosetta grows, my feeling is that the incentive to eliminate redundancy also grows. In raw numbers, the number of cpu cycles being underutilized will become more and more significant. So even if it will take a while to implement pseudo-redundancy, or some other yet to be designed system, it's worth thinking about what a good design would look like.


The best 'study' of credit done in recent history is Improved Benchmarking System Using Calibration Concepts, not the least because I wrote it ... :)

But, it was peer reviewed for this rough draft. There are implementation details to be worked out yet, and some areas where I quite plainly do not have the technical expertise to ensure the intent is met (the non-repudiation certifications needed).

Redundancy as a solution to the credit problems suffers from its poor reputation. So, if redundancy is not needed for the science, it should not be considered. CPDN has the advantage that the models are all roughly the same "size", so the flat rate works for them.

All in all, FLOPS counting is the way to go. Most of us who are interested in this problem, I think, are waiting for the first field trial of the SETI@Home enhanced application, when we can start to get some real numbers over large populations of work units.

With regard to the provision of optimized clients, the results at Einstein@Home are nothing short of amazing to me. I am not at all clear why the Windows version has not been optimized yet, or if it was tried and found to not be worth the trouble. UCB's curious reluctance to provide "official" optimized clients has also been a long-time puzzler.

I have no problems with open source, but, still, we have no traceability on the quality of the results. Setting aside the speculative claim that it does not matter because of the nature of the project, it is still curious to me.

Personally I would prefer to see optimized clients followed by FLOP counting. With those in place we can then begin to look at the rigor needed to control potential exploits of the credit system. That, again, is recognized in submitted papers as important, but I think it has taken a back seat for far too long, given its importance to the participant base.

ID: 5553
Profile stephan_t
Joined: 20 Oct 05
Posts: 129
Credit: 35,464
RAC: 0
Message 5555 - Posted: 8 Dec 2005, 12:48:44 UTC - in response to Message 5524.  

I can see not liking this, and wanting to be rewarded for exactly how much work your computer performed.


You were right, I don't like this at all :-) Stats are a motivating factor only if 1) they are accurate and 2) they cannot be defrauded.

On a separate note: I used to have 4 boxes crunching Rosetta. 2 of them had to be rebuilt. Being a stats fan, it's in my interest to dismantle/rebuild them as fast as possible, to stay on top of the ladder - downtime is inconceivable. However, if the stats somehow have no 'value', then there is no 'incentive' for me to make an effort and hurry up. There is no red line pointing downward on boincstats, or a RAC number dwindling out of control - therefore no rush. Similarly, the day I reformat a drive there is no hurry to reinstall boinc... and eventually the whole thing falls by the wayside.

Whatever solution you implement, please don't take the 'competition' aspect out of it. In the meantime, since it's all mysql based, couldn't you just run a few queries that pick up the average credit per WU and pinpoint the results that are abnormally high?
Team CFVault.com
http://www.cfvault.com

ID: 5555
Scott Brown

Joined: 19 Sep 05
Posts: 19
Credit: 8,739
RAC: 0
Message 5559 - Posted: 8 Dec 2005, 13:39:10 UTC - in response to Message 5555.  


In the meantime, since it's all mysql based, couldn't you just run a few queries that pick up the average credit per WU and pinpoint the results that are abnormally high?


I like this idea, but it also rests on some basic statistical assumptions that may be problematic. First, this can only be done on workunits for which all results are completed. On incomplete results, the downward bias produced by faster machines (noted below) would result in an artificially low threshold for what is defined as too high. Second, and more importantly, such a query assumes that we would know what the distribution of results should look like. If everything were truly random, we would likely end up with a Normal or t distribution (basically bell shaped). However, without either strong theoretical reasons for identifying the nature of the distribution or substantial empirical evidence regarding what the underlying distribution is (and assuming that this does not vary by type of workunit), I am not sure that a true cutoff point could be clearly identified (for example, such a cutoff would be very difficult indeed if the distribution is bi-modal, etc.).

Furthermore, a question might also arise regarding exceptionally low claimed credits. For example (assuming a Normal distribution), if one were to throw out all results 3+ standard deviations above the mean, would one then also be obligated to do something with results claiming more than 3 standard deviations below the mean or median of the distribution? This also brings up a final point: all work units are bounded below by zero claimed credit, which means that the distribution of claims will necessarily vary as the mean claim moves away from zero (though the distribution will stabilize once claims are sufficiently far from zero to nullify this effect).
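One way to sidestep the Normality worry above is a median/MAD rule: unlike a mean/standard-deviation cutoff, it is barely moved by skew or by the extreme claims themselves. This is only an illustration of the idea, not anything the project has proposed:

```python
from statistics import median

def mad_outliers(claims, k=3.0):
    """Return claims more than k robust 'sigmas' above the median."""
    med = median(claims)
    mad = median(abs(c - med) for c in claims)  # median absolute deviation
    if mad == 0:
        return []  # degenerate case: more than half the claims identical
    # 1.4826 * MAD estimates the standard deviation for Normal data, but
    # unlike the sample stddev it is not inflated by the outliers
    # themselves -- so a contaminated batch doesn't hide its own cheats.
    return [c for c in claims if (c - med) / (1.4826 * mad) > k]
```

On symmetric data this reduces to roughly the same 3-sigma rule, but a handful of inflated claims no longer drags the threshold up with them.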
ID: 5559
Profile Jack Schonbrun

Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 5611 - Posted: 8 Dec 2005, 17:46:17 UTC

Thanks again to everyone. This discussion is extremely useful. Let me try to quickly summarize my understanding as of this moment, so you can tell me if I'm following correctly. I now think there are two related, but distinct issues:

1) The most accurate way to assign credit. (FLOP counting vs. bench * cpu time. vs. pseudo-redundancy, etc...)

2) How to catch cheaters. (True redundancy, looking for outliers, etc..)

My current impression is that catching cheaters is what we are really talking about. Much of the discussion about ways to assign credit revolves around coming up with systems that will be more cheat proof. (Within the bounds of other constraints, such as lag between job completion and credit granting.)

What might be most effective from the boinc community's perspective would be for us to implement some cheat finding scripts that examine the claimed credit and look for outliers. These hosts could then be excluded from the credit tallies.

Our method for assigning credit would not need to change, though it would be better if we switched to FLOP counting.

In a sense, we are talking about using "pseudo-redundancy" to catch cheaters, but not to assign credit. Even though WUs that differ only in random number seed can have a wide variance in FLOP counts, we should still be able to see outliers. Especially because we could do our culling after most results were in. I don't see how some lag in the cheat finding can be avoided. It requires waiting until a significant number of results are returned. Credit could still be granted instantly. It just might get taken away.

It seems that we could get an especially strong cheat detection algorithm by not simply looking for outliers in a given WU, but by looking for boxes that are consistently outliers across WUs.
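A minimal sketch of that cross-WU idea (the data layout and thresholds are invented): score each result against its own WU's median, then flag hosts whose *typical* score stays high across many WUs, so one unlucky long-running result doesn't trip the detector but systematic inflation does:

```python
from statistics import median

def host_scores(wu_claims):
    """wu_claims: dict wu_id -> list of (host_id, claimed_credit).
    Returns host_id -> list of per-WU scores (claim / that WU's median)."""
    scores = {}
    for claims in wu_claims.values():
        med = median(c for _, c in claims)
        if med == 0:
            continue
        for host, c in claims:
            scores.setdefault(host, []).append(c / med)
    return scores

def consistent_outliers(wu_claims, ratio=2.0, min_wus=3):
    """Flag hosts whose median score is high over several WUs."""
    return [h for h, s in host_scores(wu_claims).items()
            if len(s) >= min_wus and median(s) >= ratio]
```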

So, if such a system were in place, would people be comfortable with us releasing a partial version of the code?

ID: 5611
Profile Housing and Food Services

Joined: 1 Jul 05
Posts: 85
Credit: 155,098,531
RAC: 0
Message 5614 - Posted: 8 Dec 2005, 17:54:35 UTC - in response to Message 5611.  
Last modified: 8 Dec 2005, 17:56:18 UTC

I think, even outside of releasing the code, that this is a fine way to go. It addresses the blatant cheating, and will give those who are so inclined less motivation to try to work the system... all the while utilizing the full potential of the grid (no WU duplication). It's a good compromise between the two camps of people: those who want each second of CPU time given credit, and those who want each CPU cycle to do the most it can for the science.

I'm sure there are folks who would like to see the code released, but how is version control going to be carried out? Are users going to post changes on the board? Do you intend to let users compile their own software, or only suggest changes that go out to everyone?

-E

ID: 5614



©2024 University of Washington
https://www.bakerlab.org