What about Docking@home and Proteins@home?

Message boards : Number crunching : What about Docking@home and Proteins@home?



thom217
Message 30477 - Posted: 2 Nov 2006, 0:47:36 UTC

There is also a copy of the posts Caroline made on the FaD forum:

http://www.fadbeens.co.uk/phpBB2/viewtopic.php?t=248
Ingleside
Message 30478 - Posted: 2 Nov 2006, 0:50:44 UTC - in response to Message 30243.  

In my opinion the biggest weakness in BOINC was the decision to force the same degree of redundancy on every WU of a type. This is a reflection of the fact that SETI, at the time BOINC was first designed, had more computing power than it needed. Subsequent projects (including SETI, once it gets access to more input data) suffer from the fact that no degree of redundancy between 1 and 2 is possible.


Actually, all the BOINC wu-parameters are set per wu, including min_quorum and target_nresults, but the most common setup is for the project-specific wu-generator to use a config file where the BOINC wu-parameters are constant.

For example, in the Seti_Enhanced wu-generator (the splitter), the angle range of a wu decides fpops_est, fpops_bound and delay_bound (the deadline).
It would be no problem to extend this by adding a couple of lines looking something like this:
if AR < 0.4 => min_quorum = 3, target_nresults = 4
if 0.4 <= AR < 0.5 => min_quorum = 2, target_nresults = 3
if 0.5 <= AR < 1.1 => min_quorum = 3, target_nresults = 3
if 1.1 <= AR => min_quorum = 2, target_nresults = 2
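As a purely illustrative sketch (not the real splitter code), the same idea could look roughly like this in C++, with the example thresholds above and field names that simply mirror the BOINC wu-parameters min_quorum and target_nresults:

#include <cstdio>

// Sketch of a splitter helper: choose per-workunit redundancy from the
// angle range (AR), using the example thresholds above. The struct just
// mirrors the wu-parameters min_quorum / target_nresults.
struct WuRedundancy {
    int min_quorum;        // matching results needed for validation
    int target_nresults;   // copies generated up front
};

WuRedundancy redundancy_for_ar(double ar) {
    if (ar < 0.4) return {3, 4};
    if (ar < 0.5) return {2, 3};
    if (ar < 1.1) return {3, 3};
    return {2, 2};
}

int main() {
    double ar = 0.45;   // example angle range
    WuRedundancy r = redundancy_for_ar(ar);
    std::printf("AR %.2f -> min_quorum %d, target_nresults %d\n",
                ar, r.min_quorum, r.target_nresults);
    return 0;
}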

Well, I guess you've got my meaning, so I won't go any further off-topic in this thread. :)
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
Dr. Armen
Message 30495 - Posted: 2 Nov 2006, 8:04:15 UTC - in response to Message 30472.  
Last modified: 2 Nov 2006, 8:09:53 UTC

I remember there was a gentleman who was in touch with Keith Davis, the head of the Find-a-Drug project, at the time of the project's closure. He is one of the people responsible for running Chmoogle, the search engine now called eMolecules.

http://www.emolecules.com/
http://usefulchem.blogspot.com/2005/11/chmoogle.html

Jean-Claude Bradley
http://www.blogger.com/profile/6833158

He might be able to contribute to the Docking@Home database.


Hey Thom217,

Thank you very much for these links! I just checked all of them out, and they will be very useful to me. I gave the eMolecules search engine a "test drive" and was very impressed. As someone who thinks about chemistry all day, every day, I thought I would give it a fairly random but very targeted structural search on a family of compounds called pyrethroids, which I have been thinking about a lot in the last few days. This family of compounds is derived from the pyrethrins, natural insecticides from the chrysanthemum flower. This group of synthetic insecticides is commonly used in agriculture and in our homes every day (an example is allethrin in Raid spray). Pyrethroids have been considered safe for decades and are now, for the first time, being studied as possible environmental toxins in the state of California. My search turned up many derivatives of a specific pyrethroid that I did not even know existed; very impressive. For a tool freely available to the public, this is very impressive. If any of you are interested in these new environmental impact studies, here is an NPR story about these compounds:

http://www.npr.org/templates/story/story.php?storyId=6160974

I can't believe that I have never heard of eMolecules before. I am a recent newcomer to the BOINC community, and I have been very impressed and amazed at how much I can learn from all of you on the forums. As I continue to extend and develop the database of compounds for the Docking@Home project, I will contact Jean-Claude Bradley and Caroline and see if either of them can help or give me some good suggestions. Developing such a database is quite a lot of work, even if you have all of the PDB files and assume that the chemistry in all of the structures is correct. This is mainly because small-molecule potential functions are difficult to develop and implement (lots of errors), since there are so many different types of novel chemical connectivities and geometries possible. This type of work is what I spend a lot of my time on, so anything that saves me time is very much appreciated. Thank you very much.
River~~
Message 30588 - Posted: 3 Nov 2006, 20:38:40 UTC - in response to Message 30478.  

In my opinion the biggest weakness in BOINC was the decision to force the same degree of redundancy on every WU of a type. This is a reflection of the fact that SETI, at the time BOINC was first designed, had more computing power than it needed. Subsequent projects (including SETI, once it gets access to more input data) suffer from the fact that no degree of redundancy between 1 and 2 is possible.


Actually, all the BOINC wu-parameters are set per wu, including min_quorum and target_nresults, but the most common setup is for the project-specific wu-generator to use a config file where the BOINC wu-parameters are constant.

For example, in the Seti_Enhanced wu-generator (the splitter), the angle range of a wu decides fpops_est, fpops_bound and delay_bound (the deadline).
It would be no problem to extend this by adding a couple of lines looking something like this:
if AR < 0.4 => min_quorum = 3, target_nresults = 4
if 0.4 <= AR < 0.5 => min_quorum = 2, target_nresults = 3
if 0.5 <= AR < 1.1 => min_quorum = 3, target_nresults = 3
if 1.1 <= AR => min_quorum = 2, target_nresults = 2

Well, I guess you've got my meaning, so I won't go any further off-topic in this thread. :)


I think you missed my meaning. Your example indicates that all WU of a type (i.e. with the same parameters) will have the same redundancy.

Non-BOINC projects may decide the redundancy after the scheduler knows who is going to get the work, so that, for example, more-trusted and less-trusted hosts can be treated differently.

The redundancy can also be increased retrospectively if one of the hosts (or the only host) is later discovered to have been cheating elsewhere. These approaches also allow for random testing (with a pattern known before, or only after, the initial crunching).

In contrast to those schemes, BOINC insists that the need for redundancy inheres in the data of the WU, not in the crunchers. By letting the user see the degree of redundancy, it rules out random blind redundancy, which is a pity, as that is the most efficient way of spotting deliberate cheating.

R~~
Ingleside
Message 30591 - Posted: 4 Nov 2006, 0:24:12 UTC - in response to Message 30588.  
Last modified: 4 Nov 2006, 0:25:18 UTC

I think you missed my meaning. Your example indicates that all WU of a type (i.e. with the same parameters) will have the same redundancy.

Instead of using AR to decide the redundancy, you can just as easily use every 5th result, or every 50th result, or randomly choose the redundancy for each and every wu at generation time.

In contrast to those schemes, BOINC insists that the need for redundancy inheres in the data of the WU, not in the crunchers. By letting the user see the degree of redundancy, it rules out random blind redundancy, which is a pity, as that is the most efficient way of spotting deliberate cheating.

"Blind redundancy" is easily done: just make two copies of the same wu. It is very unlikely someone will detect that wuid-123 and wuid-456 are actually the same wu, especially since most projects use more-or-less random wu names...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
River~~
Message 30610 - Posted: 4 Nov 2006, 17:18:37 UTC - in response to Message 30591.  
Last modified: 4 Nov 2006, 17:37:44 UTC

I think you missed my meaning. Your example indicates that all WU of a type (i.e. with the same parameters) will have the same redundancy.

Instead of using AR to decide the redundancy, you can just as easily use every 5th result, or every 50th result, or randomly choose the redundancy for each and every wu at generation time.

In contrast to those schemes, BOINC insists that the need for redundancy inheres in the data of the WU, not in the crunchers. By letting the user see the degree of redundancy, it rules out random blind redundancy, which is a pity, as that is the most efficient way of spotting deliberate cheating.

"Blind redundancy" is easily done: just make two copies of the same wu. It is very unlikely someone will detect that wuid-123 and wuid-456 are actually the same wu, especially since most projects use more-or-less random wu names...


I agree it is possible, but that route involves writing a lot of parallel code. Naturally it is always possible to ignore what BOINC provides and do something else. Take that far enough and you don't use BOINC at all...

The point I was making is that other projects (outside of BOINC) have those features built into the database, wu generation, etc., and, in my view, it was a mistake, when designing a flexible, transferable infrastructure, for BOINC to rule those options out. With little extra effort at the design stage a much wider range of options could have been offered. Instead, doing it now involves basically a re-write from scratch.

R~~
FluffyChicken
Message 30614 - Posted: 4 Nov 2006, 19:56:29 UTC

River, why does BOINC insist we can see it? They could just not allow access to, or remove, the pages that show the redundancy used, if it was such a big issue. That is not exactly a complete rewrite.
Team mauisun.org
Ingleside
Message 30623 - Posted: 5 Nov 2006, 1:52:44 UTC - in response to Message 30610.  

I agree it is possible, but that route involves writing a lot of parallel code. Naturally it is always possible to ignore what BOINC provides and do something else. Take that far enough and you don't use BOINC at all...

The point I was making is that other projects (outside of BOINC) have those features built into the database, wu generation, etc., and, in my view, it was a mistake, when designing a flexible, transferable infrastructure, for BOINC to rule those options out. With little extra effort at the design stage a much wider range of options could have been offered. Instead, doing it now involves basically a re-write from scratch.


Well, since 02.07.2004 the BOINC default has been to not display wu/result info; a project that wants to show this info, including wu redundancy, must add <show_results/> to its config file...


As for "complete re-write"... One of the fairly resent additions (July), is the "reliable"-option, if a wu is N days old, any task from this wu can only be sent to computers meeting a minimum RAC-limit and maximum turnaround-time-limit, and the deadline can even be set shorter than originally.

So, doing another add-on to Scheduling-server can be something like this:
if computerid = on_bad_computer_list => wu.min_quorum += 1, wu.target_nresults += 1, wu.transition_time = now.

Well, instead of having a list of bad computers, it is likely better to add a host_is_unreliable field to the database... This field can even have multiple levels, including one for hosts targeted for "random" validation even if there is nothing suspicious about them.

Now, as actual code it will be a little longer than my one-liner, since it is always a good idea to do limit checking, to verify that writes to the database and so on went error-free, and to include a logging option.

Apart from the scheduling server, you also need some way to set this unreliable flag. That could be the validator keeping an error count of failed validations and triggering extra checking after N failures, a web page for manually adding suspicious hosts, a script run periodically to trigger validation of, for example, any hosts with RAC > N that have not been validated in the last N days, or a script that randomly picks hosts that have not been validated for N days...


Now, this addition can catch cheaters and permanently bad computers. But it will not catch the computer that had its once-in-a-lifetime bad result today...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
River~~
Message 30721 - Posted: 6 Nov 2006, 19:34:21 UTC - in response to Message 30623.  

...
The point I was making is that other projects (outside of BOINC) have those features built into the database, wu generation, etc., ...


...
As for "complete re-write"... One of the fairly resent additions (July), is the "reliable"-option,


Thanks Ingleside, and FC too. I am not as sure as I was that it would involve a complete re-write.

So, convince me. Suppose hosts start as 'untested' boxes, where 10% of their WU get crunched by others, those 10% picked at random from the work they are sent, and the user never knows which. If these are all OK, they upgrade to 'trusted', and only 1% are checked from then on, until one goes bad, when they fall back to untested.

If more than so many go wrong (as a %, or as a run of consecutive problems) the host is marked as dodgy, and 50% are checked. If more than so many of those go wrong, then the host is banned and all past wu that are still accessible are re-run.

I am not asking for the code in detail, but enough to sketch out how it could be done using the current db & code structures.

Now, this addition can catch cheaters and permanently bad computers. But it will not catch the computer that had its once-in-a-lifetime bad result today...


True: any statistical error checker lets some bad results through. So does 100% redundancy, as some errors are more likely than others, and if there is a 0.1% chance of an error producing erroneous output P, then there is a one in a million chance that two hosts will both produce P for any given WU. That is why LHC, for example, uses three-way redundancy and insists on a three-way match. If two match and the third does not, LHC does not go with the majority but waits for a fourth opinion to come in.

So whether you use full redundancy or statistical methods, error checking is always a trade-off against doing more independent results. Which you need depends on the task. For a project like SETI, where collecting the data costs more than crunching it and the supply of volunteers seems unending, full redundancy is used to avoid losing a single datum to an error. That is not always the best balance in other situations.

R~~
River~~
Message 30722 - Posted: 6 Nov 2006, 19:38:38 UTC - in response to Message 30614.  
Last modified: 6 Nov 2006, 19:39:22 UTC

River, why does BOINC insist we can see it? They could just not allow access to, or remove, the pages that show the redundancy used, if it was such a big issue. That is not exactly a complete rewrite.


Reaction to this went as follows:

rubbish.

No, it is a very good point.

Well, partly right, and partly wrong. Right as follows: it is not all that hard to hide the wu page and still give access to the results page. Some links would need to be hidable, but again that is not a big job.

But wrong, as follows:

the biggie, unless I am missing something, is that the BOINC result ID clearly shows you whether you are crunching the original or a copy. The original ends _0, whereas the copy ends _1. The user with a _0 result would not know whether their result has been duplicated or not (the db may not know yet, depending on when the random choices are made). However, the user with a _1 result would know it was to be compared with another. This flaw would halve the odds of catching deliberate mischief.

And you can't hide the result IDs from the client, nor from client_state hackers.

OR finally, it is a very very good point

as, to be more subtle, the _1 client_state hack might be left in to be used as a heffalump trap. We expect a host to have proportionally as many errors in its _1 results as in the _0 results that are checked. If that is not so, that itself becomes a cheat catcher.
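For illustration only, that check could be as simple as a two-proportion comparison of the error rates (the 3-sigma threshold is an arbitrary example, not anything BOINC actually does):

#include <cmath>

// Compare a host's error rate on the _0 results that happened to be
// spot-checked with its error rate on _1 results (which it knows are
// duplicates). A host gaming the suffix looks much cleaner on _1 than
// on _0; a simple two-proportion z-test flags that gap.
bool suspicious_suffix_gap(int errors_0, int checked_0,
                           int errors_1, int total_1,
                           double z_threshold = 3.0) {
    if (checked_0 == 0 || total_1 == 0) return false;
    double p0 = double(errors_0) / checked_0;
    double p1 = double(errors_1) / total_1;
    double pooled = double(errors_0 + errors_1) / (checked_0 + total_1);
    double se = std::sqrt(pooled * (1.0 - pooled) *
                          (1.0 / checked_0 + 1.0 / total_1));
    if (se == 0.0) return false;
    return (p0 - p1) / se > z_threshold;   // far more errors when unchecked
}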

R~~
FluffyChicken
Message 30726 - Posted: 6 Nov 2006, 20:37:31 UTC - in response to Message 30722.  

River, why does BOINC insist we can see it? They could just not allow access to, or remove, the pages that show the redundancy used, if it was such a big issue. That is not exactly a complete rewrite.


Reaction to this went as follows:

rubbish.

No, it is a very good point.

Well, partly right, and partly wrong. Right as follows: it is not all that hard to hide the wu page and still give access to the results page. Some links would need to be hidable, but again that is not a big job.

But wrong, as follows:

the biggie, unless I am missing something, is that the BOINC result ID clearly shows you whether you are crunching the original or a copy. The original ends _0, whereas the copy ends _1. The user with a _0 result would not know whether their result has been duplicated or not (the db may not know yet, depending on when the random choices are made). However, the user with a _1 result would know it was to be compared with another. This flaw would halve the odds of catching deliberate mischief.

And you can't hide the result IDs from the client, nor from client_state hackers.

OR finally, it is a very very good point

as, to be more subtle, the _1 client_state hack might be left in to be used as a heffalump trap. We expect a host to have proportionally as many errors in its _1 results as in the _0 results that are checked. If that is not so, that itself becomes a cheat catcher.

R~~



Never noticed the _0 / _1 before :-D

What with generally looking at a task list full of Rosetta@home _0s.
Well, I'll have to remember that one.

Double bluffing them.
Though they could keep the _0 / _1 internal to the task/result so we cannot see it (of course, more code rewriting :-)

(Side note: I'm still surprised how little encryption BOINC uses.)
Team mauisun.org
Ingleside
Message 30766 - Posted: 7 Nov 2006, 13:44:57 UTC - in response to Message 30722.  
Last modified: 7 Nov 2006, 13:46:37 UTC

But wrong, as follows:

the biggie, unless I am missing something, is that the BOINC result ID clearly shows you whether you are crunching the original or a copy. The original ends _0, whereas the copy ends _1. The user with a _0 result would not know whether their result has been duplicated or not (the db may not know yet, depending on when the random choices are made). However, the user with a _1 result would know it was to be compared with another. This flaw would halve the odds of catching deliberate mischief.

And you can't hide the result IDs from the client, nor from client_state hackers.


A _1 can also be due to the _0 erroring out, for example a download error or a crash at startup, or possibly not being returned by the deadline. If a project hasn't enabled showing of result pages, this info is unknown to the user.

But even someone getting the _0 result can report it, wait 1 minute, do another RPC, and if host credit and total credit have not increased, there is a good chance it needs to be validated against another result... Well, after only 1 result it is likely not apparent, but with 10-100 results someone deliberately trying to cheat will likely start spotting a pattern, and this would be true even if a random number had been used instead of _0, _1 and so on...


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
River~~
Message 30773 - Posted: 7 Nov 2006, 16:26:25 UTC - in response to Message 30766.  


But even someone getting the _0 result can report it, wait 1 minute, do another RPC, and if host credit and total credit have not increased, there is a good chance it needs to be validated against another result... Well, after only 1 result it is likely not apparent, but with 10-100 results someone deliberately trying to cheat will likely start spotting a pattern, and this would be true even if a random number had been used instead of _0, _1 and so on...


This is true, but it is not a problem. The user knowing the result is being validated only matters if they know before their machine reports it; once it has been reported, they are stuck with the fact that they cheated (if they did) and cannot hide it.

The reason the duplication is hidden is to avert the dishonest strategy where a user reports duff results most of the time but does an honest bit of work when they know they will be checked up on.

So I think that choosing random suffixes, together with removing the pages that show the wu, would probably work. I am therefore happy to withdraw my previous criticism.

R~~
Ingleside
Message 30778 - Posted: 7 Nov 2006, 18:26:36 UTC - in response to Message 30721.  

So, convince me. Suppose hosts start as 'untested' boxes, where 10% of their WU get crunched by others, those 10% picked at random from the work they are sent, and the user never knows which. If these are all OK, they upgrade to 'trusted', and only 1% are checked from then on, until one goes bad, when they fall back to untested.

If more than so many go wrong (as a %, or as a run of consecutive problems) the host is marked as dodgy, and 50% are checked. If more than so many of those go wrong, then the host is banned and all past wu that are still accessible are re-run.

I am not asking for the code in detail, but enough to sketch out how it could be done using the current db & code structures.


Well, let's say RANDOM returns a random number between 0 and 1, and let's add a new db-field, host.reliable, that can take the following values:
0 : new hosts/unverified.
1 : Reliable hosts.
2 : Problematic hosts.
3 : Deny_work.

Add to Scheduling-server:
if (host.reliable = 3) => deny_work else assign_result.

if (host.reliable = 0) and (RANDOM > 0.9) => wu.min_quorum += 1, wu.target_nresults += 1, wu.needs_reliable = true, wu.transition_time = now.

if (host.reliable = 2) and (RANDOM > 0.5) => wu.min_quorum += 1, wu.target_nresults += 1, wu.needs_reliable = true, wu.transition_time = now.

if (host.reliable = 1) and (not wu.needs_reliable) => if (RANDOM > 0.99) => wu.min_quorum += 1, wu.target_nresults += 1, wu.needs_reliable = true, wu.transition_time = now.


In the reliable-part of the Scheduling-server, add an extra requirement to the already present min-RAC and max-turnaround-time:
if (wu.needs_reliable = true) and (host.reliable != 1) => deny_this_result.
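Put together, a self-contained C++ sketch of this scheduler-side logic could look roughly like the following (the Wu and Host structs are stand-ins for the real BOINC records, the reliability codes and probabilities are the ones proposed above, and poking the transitioner is reduced to a comment):

#include <random>

// Stand-in types; the real scheduler works on its own workunit/host
// records, this only mirrors the fields used in the scheme above.
enum HostReliability { UNVERIFIED = 0, RELIABLE = 1, PROBLEMATIC = 2, DENY_WORK = 3 };

struct Wu {
    int  min_quorum;
    int  target_nresults;
    bool needs_reliable;   // extra copy must go to a reliable host
};

struct Host {
    int reliability;       // the proposed host.reliable db-field
};

std::mt19937 rng(std::random_device{}());

bool chance(double p) { return std::bernoulli_distribution(p)(rng); }

// Returns false if the host should get no work at all. Otherwise it may
// bump the wu's redundancy so the extra copy acts as a spot check.
bool maybe_spot_check(const Host& host, Wu& wu) {
    if (host.reliability == DENY_WORK) return false;

    double check_prob =
        host.reliability == UNVERIFIED  ? 0.10 :
        host.reliability == PROBLEMATIC ? 0.50 :
                                          0.01;   // RELIABLE

    if (!wu.needs_reliable && chance(check_prob)) {
        wu.min_quorum      += 1;
        wu.target_nresults += 1;
        wu.needs_reliable   = true;
        // here the real scheduler would also set wu.transition_time = now
    }
    return true;
}

// The extra rule for the "reliable" send path: the added copy only goes
// to a host currently marked reliable.
bool may_send_result(const Host& host, const Wu& wu) {
    return !(wu.needs_reliable && host.reliability != RELIABLE);
}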


In the validator, I am not sure of the best way to count up the errors, but the easy option is at least:
if (Invalid) and (host.reliable = 1) => host.reliable = 0

Hmm, maybe something like adding a field, host.invalid, initialized to 0:
if Valid and (host.invalid > -5) => host.invalid -= 1; if host.invalid = -5 => if host.reliable = 2 set host.reliable = 0 else host.reliable = 1.

if Invalid => host.invalid += 2; if host.invalid > 10 => if host.invalid > 20 set host.reliable = 3 else host.reliable = 2.
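Again purely as a sketch, the validator-side bookkeeping could be expressed like this (combining the host.invalid counter above with the earlier rule that any invalid result demotes a "reliable" host):

// Per-host score driven by validation outcomes: it drifts down on valid
// results and jumps up on invalid ones, and in turn drives the
// reliability state the scheduler uses.
struct HostStats {
    int reliability;   // 0 unverified, 1 reliable, 2 problematic, 3 deny work
    int invalid;       // running score, starts at 0
};

void on_result_validated(HostStats& h, bool valid) {
    if (valid) {
        if (h.invalid > -5) {
            h.invalid -= 1;
            if (h.invalid == -5)                          // 5 net valid results
                h.reliability = (h.reliability == 2) ? 0 : 1;
        }
    } else {
        h.invalid += 2;
        if (h.invalid > 20)           h.reliability = 3;  // deny work
        else if (h.invalid > 10)      h.reliability = 2;  // problematic
        else if (h.reliability == 1)  h.reliability = 0;  // lose "reliable"
    }
}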




Well, wu.needs_reliable is AFAIK not a current variable; the current usage is probably something like: if (now - wu.generated_time) > N days => needs_reliable.

With my usage of wu.needs_reliable, the extra generated copy will never be sent to another unreliable host. Also, I have made sure wu.min_quorum is never increased more than once.


For the validator part, I am more unsure of the best approach...
Hmm, to also correctly handle projects without any quorum, it may be an idea to change it to:
if Valid and (host.invalid > -5) and wu.needs_reliable => ...

A new host needs 5 valid results to change to "reliable", but for each invalid you need only 2 valid results to get back to "reliable"... Or maybe this should also be changed, so that it needs 5 more valid ones:
if (Invalid) and (host.reliable = 1) => host.reliable = 0, host.invalid = 0


Anyway, an obvious weakness with my requirement to only send the extra copy to a reliable host is that, at least at the start of a project, no hosts are reliable, so nothing will ever be validated...

Oh, as for re-validating all old results of a host that gets marked "broken": if they were validated earlier, it means they have also been assimilated, and there is a good chance any trace in the BOINC part of the system has been removed... So whatever you do depends on what you have done with the assimilated results...



As for the usefulness of this... CPDN doesn't need it, since it uses an ensemble approach where a few bogus results don't matter. Rosetta@home validates by re-calculating the energy of the final structures in each result server-side, and therefore doesn't need any redundancy. For any project that can use either of these methods, running without redundancy is not a problem.

For many projects, on the other hand, neither of these approaches is possible, and if some bogus results can influence the science, using "random redundancy" will catch cheaters and permanently bad hosts, but random errors will still slip through the cracks... So for these projects using real redundancy is better, and adding random redundancy on top has little if any extra benefit.

As for other projects that can't validate server-side and aren't using an ensemble approach, but can still tolerate a few errors... Hmm, BURP isn't science, so a few wrongly rendered frames are likely not a problem...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
River~~
Message 30814 - Posted: 8 Nov 2006, 19:25:14 UTC - in response to Message 30778.  

...
For many projects, on the other hand, neither of these approaches is possible, and if some bogus results can influence the science, using "random redundancy" will catch cheaters and permanently bad hosts, but random errors will still slip through the cracks... So for these projects using real redundancy is better, and adding random redundancy on top has little if any extra benefit.
...


I agree that if you have true redundancy, then random redundancy on top of that would be, well, redundant ;-)

But it is a myth that true redundancy catches all errors. It catches a larger proportion than random redundancy, but not all. And the cost is that the same amount of crunching produces fewer results.

Do you prefer a million results where you can show statistically, from your random redundancy tests, that there are only likely to be 100 errors in there, or do you prefer a quarter of a million results where there is still a 50% chance of one error? The answer depends on what you will use the data for; there is no one right answer. (I made those figures up, but the ballpark is about what you can achieve.)

LHC needs three-fold agreement, and crunches the typical WU just over six times. It needs the accuracy and accepts the 6x overhead of getting it.

Other projects don't need so much, just as we rarely insist on redundancy these days when using spreadsheets, etc.

River~~

[AF>Slappyto] popolito
Message 31818 - Posted: 29 Nov 2006, 15:50:33 UTC

About Proteins@home: http://biology.polytechnique.fr/proteinsathome/documentation2.php :)



