Rosetta@home

Dr. Baker's journal archive 2006

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 11591 - Posted 3 Mar 2006 6:59:12 UTC

Many of you have suggested that I give more frequent feedback on what is happening with Rosetta@home. I'm very happy to do this, and will try to make nightly posts here (I don't promise every day!). I'll also start a second thread where people can post comments and questions related to the journal, which I will try to answer as well.

It has been an extremely exciting week for the project! For the past two months Divya and I have been improving the algorithms in Rosetta based on tests on a set of ten proteins whose names you are probably quite familiar with by now. David Kim tested the improved method on an additional set of about 25 proteins. He has been so busy trying to fix the problems causing the WU errors that he hasn't had much time to analyze the results, but David and I took a quick look yesterday and the results are pretty amazing--for about half of them the lowest energy structure was quite close to the correct (native) structure. Since we hadn't used this set in improving the method, this suggests that the improvements we have made are pretty general, and that we are making significant progress towards solving the protein structure prediction problem for small proteins.

It is hard to describe how electrifying (and almost scary!) this is. The protein structure prediction problem is perhaps the longest standing problem in molecular biology. It has been known for forty years that the structures of proteins are determined by their amino acid sequences, but as recently as five or six years ago it was generally thought that the prediction problem was completely intractable, as very little progress had been made. Starting about that time we showed in the CASP blind tests that with the Rosetta low resolution structure prediction method rough models could be built for small proteins that in some cases were reasonably similar in topology to the true structure, but the predicted structures were never accurate at the atomic level. We have worked for the past five years on developing high resolution refinement methods that could take these rough models and refine them to much higher accuracy. This goal remained elusive for the first few years, but about a year and a half ago we made a breakthrough and found that we could make very accurate predictions for some proteins using a trick that involves folding not only the sequence of the protein of interest but also the sequences of a large number of evolutionarily related homologs. Using this method we made the first high accuracy ab initio structure prediction in CASP (the last target in CASP6) and did further tests which showed accurate predictions for 6 of 16 proteins; these were published in Science last year (I think there is a link to the paper on the home page).

However, this work did not achieve the goal of predicting structure accurately from the amino acid sequence of a protein alone, as we had to resort to evolutionary information. Achieving this goal has been the central aim of Rosetta@home thus far, and as I said above it is almost a "holy grail" of computational biology. So now, looking at David Kim's results and seeing that for quite a few proteins we are coming close to predicting structure from their amino acid sequences without any other information is pretty breathtaking.

Divya is now preparing figures to post on the science page which show the results on the 10 protein set we have been working on, and David will post results on his larger set once he can finish the analysis after he has implemented his clever fix to the unhandled exception errors.

For the still large number of proteins for which we are failing, it is clear that the problem is not enough sampling: even with 100,000 independent folding runs we are not coming close enough to the native structure to land in its energy minimum. So we need more cpu power! It is kind of amazing that solving such a long standing scientific problem depends so crucially on the efforts of volunteers like yourselves! I don't know how much more cpu power it will take, but if you can each recruit ten friends or relations ...
____________

Message 11640 - Posted 4 Mar 2006 4:59:50 UTC

Tonight I will try to explain what the names of the work units mean--I know it must seem quite bewildering.

David Kim has just submitted a batch of work units with names like "HB_BARCODE_30_1enh_".

The "HB", like the "HBLR_1.0" in the work units I sent out in the last several weeks, indicates that hydrogen bonds are given less weight than in our standard work units. I found last month that reducing this weight helps overcome barriers to sampling that were due to formation of beta sheet structures more regular than in most native proteins. We will probably soon make this change in all of our calculations as it seems to be a quite general improvement.

The "BARCODE_30" in this work unit and in many of the previous work units indicates that a method we developed last fall for spreading out sampling is being used. In this method, each instance of each work unit gets a randomly chosen "barcode" which directs the folding trajectory to a different part of the space we are searching. The "30" indicates that one randomly selected residue is restricted according to the barcode for every 30 residue segment in the protein.
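
To make the "one residue per 30 residue segment" idea concrete, here is a minimal sketch of how such positions could be chosen. The function name, the treatment of a shorter final segment, and the use of Python's random module are all assumptions for illustration; the post does not say how Rosetta picks the residues or what form the restriction takes.

```python
import random

def pick_barcode_positions(n_residues, segment_length=30, rng=None):
    """Pick one random residue index (0-based) from each consecutive segment.

    Hypothetical helper: the real barcode method is not described in detail
    in the post. A shorter final segment is assumed to get a position too.
    """
    rng = rng or random.Random()
    positions = []
    for start in range(0, n_residues, segment_length):
        end = min(start + segment_length, n_residues)
        positions.append(rng.randrange(start, end))
    return positions

# A 75 residue protein has segments 0-29, 30-59 and 60-74, so three positions.
print(pick_barcode_positions(75, rng=random.Random(0)))
```

Because each instance of a work unit would draw different positions, independent runs would be pushed towards different parts of the search space, which is the stated purpose of the method.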

The "1enh" indicates the name of the protein being folded. This protein is a "homeodomain", which is a fascinating family of proteins found throughout multicellular organisms. These proteins control the fundamental steps of development all the way from flies to humans! Mutations in these proteins in flies, for example, can cause legs to grow out from where the eyes normally are. If you go to www.rcsb.org and enter the four letter code for a protein running on your machine you can learn more about each protein. Maybe we can put together a summary sheet with a short description of each protein if this would be interesting.

The "_" at the end of 1enh indicates that this protein only has one chain. Some proteins have multiple different chains, and in these cases each chain is indicated by a label like "A". For example, 5croA means we are folding chain A of the transcription factor 5cro.

Divya has just submitted a series of work units with names like HOMSdi_homDB006_1di2_. Here Divya is testing how much better we can do when the multiple sequence method described in my first post is run with distributed computing. "homDB006" means this is the 6th homolog sequence being folded for the protein 1di2.
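
The naming scheme described above can be taken apart mechanically. The following is a small sketch that assumes every name has the layout "prefix, underscore, four character protein code, chain letter or underscore", which is inferred only from the examples in this post; real work unit names may carry extra fields, and the function and class names here are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

class WorkUnitName(NamedTuple):
    prefix: str            # everything before the protein code, e.g. "HB_BARCODE_30"
    pdb_code: str          # four character protein code, e.g. "1enh"
    chain: Optional[str]   # chain label, or None when the name ends in "_"

# Assumed layout: <prefix>_<4-char code><chain letter or "_">
_NAME_RE = re.compile(
    r"^(?P<prefix>.+)_(?P<pdb>[0-9][A-Za-z0-9]{3})(?P<chain>[A-Za-z_])$"
)

def parse_work_unit_name(name: str) -> WorkUnitName:
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised work unit name: {name!r}")
    chain = match.group("chain")
    return WorkUnitName(
        match.group("prefix"),
        match.group("pdb"),
        None if chain == "_" else chain,
    )

def homolog_index(prefix: str) -> Optional[int]:
    """Return the homolog number from a 'homDBnnn' field, if one is present."""
    match = re.search(r"homDB(\d+)", prefix)
    return int(match.group(1)) if match else None

print(parse_work_unit_name("HB_BARCODE_30_1enh_"))
print(homolog_index(parse_work_unit_name("HOMSdi_homDB006_1di2_").prefix))
```

The second call reports homolog number 6 for 1di2, matching the reading of "homDB006" given above.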

That's all for tonight. Tune in tomorrow for a description of the causes of work unit failures and what we are trying to do to solve these problems.



____________

Message 11677 - Posted 5 Mar 2006 7:32:23 UTC

Tonight I will describe how we are going about trying to fix the problems causing work unit errors.

The large increase in errors two weeks ago was not due to new bugs, but was an unintended byproduct of our effort to help dial-up users who wanted longer work units and the many other users who requested work units with more consistent lengths. Since most work units had previously been taking on the order of an hour or two, setting the default work unit length to 8 hours resulted in 4-8 times more structures being generated per work unit. This unmasked rare possibilities for error that were not evident before: if the chance of running into an error when generating one structure is, say, 1 out of 100, then if 10 structures are being generated most work units will be fine, but if 100 structures are generated there is a high likelihood of error. David K. has reduced the default time to 2 hours, and this seems to have increased success rates considerably.
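
The arithmetic behind this can be written out in a few lines. This sketch assumes that errors on different structures are independent and equally likely, which is a simplification and not something measured on the project; the 1 in 100 rate is the illustrative figure from the paragraph above.

```python
def chance_of_any_error(per_structure_error_rate, n_structures):
    """Probability that at least one structure in a work unit hits an error,
    assuming independent errors with the same rate for every structure."""
    return 1.0 - (1.0 - per_structure_error_rate) ** n_structures

for n_structures in (10, 100):
    print(n_structures, round(chance_of_any_error(0.01, n_structures), 3))
```

Under these assumptions a work unit with 10 structures fails roughly one time in ten, while one with 100 structures fails more often than it succeeds, which is why shortening the default run time helps.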

It is clear that different machines and platforms have errors at very different frequencies. If you have a machine that never or very rarely has errors, to reduce net traffic it would be good to increase the work unit time, but if you have a machine that still has frequent errors it is probably worth decreasing the time to one hour.

Clearly we need to find the sources of the errors. This is complicated for us because we use Linux machines for our own work, and the errors are primarily happening on Windows machines. David Kim had the excellent idea of putting in error catching logic so that if an error is encountered the program just goes on to compute the next structure rather than aborting, but unfortunately from tests on Ralph it seems that many of the errors cannot be recovered from. David is continuing to explore this, and meanwhile I have posted a bounty on finding the bugs/problems to all Rosetta developers. Also, if you are an expert and have a Windows machine with problems that you can reproduce outside of BOINC (as described in the number crunching boards a few weeks ago) and have time to go through them with the debugger, let us know.
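
The error catching idea can be illustrated with a short sketch. This is not Rosetta's code: generate_structures and fold_one are hypothetical names, and Python exceptions stand in for whatever errors the real program can trap. Low level crashes such as access violations typically cannot be caught this way, which fits the observation that many of the errors cannot be recovered from.

```python
def generate_structures(n_structures, fold_one):
    """Run fold_one(index) for each structure; on a recoverable error,
    record it and move on to the next structure instead of aborting."""
    results, failures = [], []
    for index in range(n_structures):
        try:
            results.append(fold_one(index))
        except Exception as error:  # only errors the runtime can trap
            failures.append((index, repr(error)))
    return results, failures

def demo_fold(index):
    # Stand-in for a folding run; structure 2 fails on purpose.
    if index == 2:
        raise RuntimeError("simulated failure")
    return index * 10

print(generate_structures(4, demo_fold))
```

With this pattern one failed structure costs only that structure, so the rest of the work unit can still be returned.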

We are also working with the BOINC developers to solve problems likely to be more related to BOINC than Rosetta. These include the occasional crashing on termination, and the graphics related problems. Jack Schonbrun, who developed the Rosetta@home screensaver (we hope you like it!), just moved on to a job at Lawrence Berkeley Laboratory in the same division as David Anderson and the BOINC development team, which should help us coordinate to solve these problems.

I'll get back to more exciting recent science results in the next few days. Suggestions for topics are also welcome.
____________

Message 11701 - Posted 6 Mar 2006 4:12:11 UTC

Tonight I will describe some exciting results from the last two days!

I need to first explain the idea of folding multiple "homologues". Because the fundamental machinery of life has remained very similar throughout the course of evolution, many proteins, for example in humans, are closely related to a number of proteins in many different organisms. The homeodomain protein I mentioned earlier is present in all higher organisms, for example. So most proteins belong to large families; if you line up the amino acid sequences of any two proteins in one of these families you would find that between 25 and 90% of the amino acids are identical. It is very well established that all members of a protein family have very similar structures. So if we want to predict the structure of a protein we can fold not only the protein we are interested in but also all of its family members, which are called homologues.
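
The "percent of amino acids identical" figure can be computed from two sequences that have already been lined up. This is a minimal sketch: the function name and the convention of skipping gap columns are assumptions (several definitions of sequence identity are in use), and the toy sequences are invented for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned positions with the same amino acid.

    Both inputs must be the same length (already aligned); columns where
    either sequence has a gap are skipped, which is one common convention.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# Toy alignment: 4 of 5 compared positions match.
print(percent_identity("MKTAY", "MKSAY"))
```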

The exciting result is that if we generate models for many different homologues for a given protein, and restrict our attention to those homologues where the ten lowest energy structures are very similar to each other, we find that the models produced by these "converged" runs are very close to native. It is clear that when making predictions for a single protein, if the low energy structures are all different from each other, a confident prediction cannot be made as there is no basis for choosing one prediction over another. The new result is that convergence (all low energy structures similar to each other) can be used to pick out the folding runs which generated accurate models. This also provides a way to assess the confidence of a prediction--if the folding runs for a protein which converge all produce the same structure (as in the cases I've analyzed), it is almost certain to be correct. Having criteria for evaluating the probability that a model is correct is very important when we move to prediction of structures of important proteins with currently unknown structure, something which may happen sooner than I had thought given the rapid progress we have been making.
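
One way to express the convergence test is sketched below. The function name, the input format (a list of energies plus a matrix of pairwise RMSDs between models) and especially the 2.0 cutoff are assumptions chosen for illustration; the post only says the ten lowest energy structures must be "very similar to each other".

```python
from itertools import combinations

def is_converged(energies, pairwise_rmsd, n_lowest=10, rmsd_cutoff=2.0):
    """True if the n_lowest lowest energy models are all within rmsd_cutoff
    of each other. pairwise_rmsd[i][j] is the RMSD between models i and j."""
    lowest = sorted(range(len(energies)), key=lambda i: energies[i])[:n_lowest]
    return all(
        pairwise_rmsd[i][j] <= rmsd_cutoff for i, j in combinations(lowest, 2)
    )

# Toy example with four models; models 1 and 2 have the lowest energies.
energies = [5.0, 1.0, 2.0, 9.0]
pairwise_rmsd = [
    [0.0, 6.0, 6.5, 7.0],
    [6.0, 0.0, 1.0, 8.0],
    [6.5, 1.0, 0.0, 8.5],
    [7.0, 8.0, 8.5, 0.0],
]
print(is_converged(energies, pairwise_rmsd, n_lowest=2))
print(is_converged(energies, pairwise_rmsd, n_lowest=3))
```

Note that the check never looks at the native structure, so it can be applied to proteins whose structure is unknown.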

(As a side note, our local filesystem was rearranged this weekend for non-BOINC related reasons, but this may have inadvertently affected our ability to send jobs out. If this is a problem, it will be fixed by tomorrow AM.)
____________

Message 11746 - Posted 7 Mar 2006 6:11:42 UTC

Tonight to give you a more balanced picture of what doing this kind of scientific research is like I'm going to talk about failure!

I spent most of the fall trying to achieve, in folding calculations using only a single sequence, the considerable improvement in structure prediction that we had observed using many different homologous sequences. My idea was to try to make the sequence being folded effectively a bit different each time by changing the parameters which determine how the energy is computed during the low resolution part of the search. This is along the lines of the suggestion on the comments/questions boards about whether we could generate our own sequence homologues to improve the search. Frustratingly, I was never able to really improve the results over using just the same sequence over and over again starting (as we always do) from different random starting points. (We do continue to add this variation into each run through the "rand_envpair_res_wt" and "-rand_SS_wt" flags you will see on the command lines we are sending out, as adding this variation may help in some fraction of cases.)

This is the way research is--many ideas that seem like they have to work in fact don't, and it is only by constant experimentation and testing of lots of different ideas that progress is made.
____________

Message 11773 - Posted 8 Mar 2006 6:23:31 UTC

Just a quick update tonight.

(1). David Kim has started a large set of new jobs for his 25 protein test set. Their names are HB_BARCODE_30_*****, where ***** is the name and chain id of the protein; I described the HB_BARCODE part in an earlier post. Rhiju is preparing to send out jobs for 30 homologs for each protein in a still larger set of 62 proteins. Divya is completing her homolog runs for the initial set of 10 proteins she and I have been studying. We are very excited about seeing the results for all of these calculations as they will guide our next steps. There are over 3 million jobs queued now!

(2). I'm a bit disappointed that the total cpu power has remained constant for the past weeks rather than increasing as it had up until recently. More users and hosts are joining every day, but this is not translating into increased computing capability. We aim to extend our calculations to larger proteins based on the success we have been having with proteins under 100 amino acids, but this will require a significant increase in computing power. Please let me know if there is anything I can do to help with recruiting.

(3). We are making a major push to find and eliminate the causes of those annoying access violation and other errors. We've enlisted a team of Rosetta experts to help find possible memory leaks, etc., and it is looking like Microsoft will also have a team to help us--we have been interacting quite a bit with them as they are interested in becoming more involved with scientific computing. So I'm optimistic that most issues will be resolved within the next couple of weeks.
____________

Message 11811 - Posted 9 Mar 2006 6:21:02 UTC

Good news on the error tracking front--a long time BOINC and SETI@home expert, Rom, is joining us as a consultant to help fix the "leave in memory", graphics, and other problems related to the Rosetta-BOINC interface.

Divya will be posting updates to the "top results" section (yes, I know, at long last...) before tomorrow evening, and I will discuss these in my update tomorrow.
____________

Message 11844 - Posted 10 Mar 2006 6:05:23 UTC

Tonight I will discuss the results that Divya has posted in the "Top Results" section.

The first protein is 2reb, which is the shorthand code for RecA, one of the most important proteins in DNA repair and recombination in all of life.

The energy versus RMSD plot on the left shows that Rosetta has strongly converged on the correct structure--all of the very low energy structures are less than 1.5-2 Å RMSD from the native structure, which is close to experimental error. In cases like this, where all the lowest energy structures are very similar to each other, we can be quite sure that Rosetta has found the correct solution. Congratulations to Dalephi and Team TeAm AnandTech for finding the lowest energy structure!
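
For readers unfamiliar with the x axis of these plots, RMSD (root mean square deviation) is the square root of the average squared distance between matching atoms in two structures. The sketch below assumes the two coordinate lists are already superimposed and list the same atoms in the same order; which atoms Rosetta@home uses for its plots is not stated here, and the function name is made up.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the structures are already optimally superimposed; no fitting
    is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))

# Two atoms; the second is displaced by 2 along z, so RMSD = sqrt(4 / 2).
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```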

The top two pictures on the right show a comparison of the experimentally determined structure (top) with the Rosetta predicted structure (bottom). As you can see, they are very similar. The third picture (below) is one that you haven't seen before. Here the two structures are shown superimposed on each other, with the protein sidechains in the core of the protein shown as well. You can see that the sidechains are in very similar positions in the prediction and the actual structure.

The panel below shows the energy vs rmsd plot again and on the right is a comparison of the lowest rmsd structure to the actual structure. They are again very similar. Congratulations to Jacekko and Team Rosetta@Poland for finding this structure!

The following panels show results for the protein 1dcj. 1dcj is important for cell division in bacteria, and could be a target for developing antibacterial drugs. As in the case of 2reb, the lowest energy points are all close to the correct structure, and similar to each other. Rosetta has again converged on the correct answer. The RMSD is a bit higher, but this is probably due to wiggles in the red tail in the pictures at the right, which probably does not have a well defined structure in solution. The pictures show that aside from this segment the lowest energy structure is very similar to the experimentally determined structure. The superposition in the panel below shows that the amino acid sidechains are in very similar positions in the lowest energy structure and the experimentally determined structure. Congratulations to Team_Gaol~Christian and the Dutch Power Cow for finding this structure (I hope the DPC stampede in the right direction)!

The last set of panels compares the lowest rmsd structure and the experimentally determined structure, which are again very similar. Congratulations to LocalBusinessMen (Team XtremeSystems) for finding this structure!

In these plots on the left, for clarity, we are only showing the lowest energy 5000 of the roughly 300,000 structures all of you computed for these proteins. If we could collectively do as well on all proteins, the protein folding problem would be solved!
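
Picking out the plotted subset and the two highlighted models is a simple selection step. The sketch below assumes each model is stored as an (energy, rmsd, user) tuple; that record layout and the function name are hypothetical and say nothing about how the project database is organised.

```python
import heapq

def summarize_models(models, keep=5000):
    """Return (lowest energy subset, lowest energy model, lowest rmsd model).

    models: iterable of (energy, rmsd, user) tuples (assumed layout).
    """
    models = list(models)
    if not models:
        raise ValueError("no models to summarize")
    # nsmallest returns the subset sorted from lowest to highest energy.
    plotted = heapq.nsmallest(keep, models, key=lambda m: m[0])
    lowest_energy = plotted[0]
    lowest_rmsd = min(models, key=lambda m: m[1])
    return plotted, lowest_energy, lowest_rmsd

demo = [(-120.0, 1.4, "user_a"), (-95.0, 0.9, "user_b"), (-60.0, 7.2, "user_c")]
plotted, best_energy, best_rmsd = summarize_models(demo, keep=2)
print(best_energy[2], best_rmsd[2])
```

Using heapq.nsmallest avoids sorting all of the roughly 300,000 records when only the lowest 5000 are needed.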


____________

Message 11880 - Posted 11 Mar 2006 6:25:07 UTC

Two new things for tonight.

First, Divya has posted results for a new protein, 1dtj, on the top results page. 1dtj is an RNA binding protein that is associated with an autoimmune neurological disease called paraneoplastic opsoclonus-myoclonus ataxia.

The energy versus rmsd plots on the left again show a cluster of structures with low rmsds and low energies. The rmsd spread of the low energy structures is a bit larger than in the previous examples. This is because of a long loop which is visible at the bottom of the structure superposition image. It actually associates with another protein in the experimentally determined structure, but in our calculations it folds back on the main part of the protein since there is nothing else to interact with. The much higher energy "low rmsd" structure has lower rmsd because this loop is in a position more like in the native structure, but the main part of the protein is in fact not as well predicted as in the low energy structure.

This is again a very successful prediction; congratulations to tgxiii (Team XPC) and DPaddick for finding the lowest energy and lowest rmsd structures. These results will also help us to improve Rosetta--on the right side of the energy versus rmsd plot you will see a low energy structure with high rmsd, which indicates an inaccuracy in the energy function. We are currently studying such cases to learn how to improve our energy calculations.

The second piece of news is that our new BOINC consultant Rom visited for several hours today, and is now up to speed on Rosetta@home--hopefully he will soon find the causes of some of the problems. We have also got all of our local machines running Rosetta@home and are hoping they will have errors that we can then track down, but we haven't gotten any (non trivial) errors yet, which in our case is a nuisance! There is probably some common feature to the subset of your machines having errors and I hope we can track it down.
____________

Message 11924 - Posted 12 Mar 2006 4:46:56 UTC

Divya has just posted the results for 1mky, a protein involved in a molecular switch. This was probably the hardest protein in our original test set, and the results illustrate why we are still working on improving the Rosetta algorithm and why we need more computer power. You can see on the lower left of the rmsd versus energy plot that there is a low energy, low rmsd point; this is quite a good model and is comparable to the "low rmsd" model shown on the page. However, there was very little sampling in this energy+rmsd range, and none closer to the native structure, so the very low energy minimum of the native (correct) structure was missed! Because of this, the lowest energy structure found is quite incorrect: you can see that the cyan loop in the native structure turns into a strand in the low energy structure and intercalates into the main body of the structure. Again, by studying this low energy wrong structure we will be able to improve the Rosetta energy function. Also, since the native structure is lower in energy than this (wrong) lowest energy structure, if we could have sampled more we would have come closer to the native structure and then the lowest energy structure would have been more accurate.

Congratulations to Canada_David (Team Team Commonwealth) for finding the low energy structure--it has a lot to teach us!--and to Viking69 (Team Ricoh Corp.) for finding the lowest rmsd structure. We are carrying out more runs for this protein now using the multiple sequence method I described, and we should give out a prize to the person who finds the even lower energy, close to native structures not found so far.

____________

Message 12000 - Posted 14 Mar 2006 5:16:54 UTC

Several new things today:

We will be acknowledging the lucky individuals who find the lowest energy and lowest rmsd models on the main news page in addition to the top results page. We are looking into possibilities for certificates. I assume giving bonus credits for low rmsd and low energy structures would not be a popular idea.

Today's protein is 2tif, a protein which plays a critical role in the initiation of translation. Translation is the process by which information in the genetic code is read to create proteins with specific amino acid sequences. You can see on the top results page that the low energy model has the same overall shape as the actual structure. The major differences are in the red bit at the C terminal end of the protein.

David and Keith are setting up multiple load sharing web servers to reduce the load which has caused some of the slowdowns on the web site.

Rhiju is testing an approach he and I think should yield improved results on a number of the proteins you have become familiar with. He will give you a report in a few days.
____________

Message 12085 - Posted 16 Mar 2006 6:42:35 UTC

Divya has just posted the most striking of the recent results on the top predictions page. This could be a poster child for what can be accomplished with large scale distributed computing! (The protein, 1di2, mediates interactions with double stranded RNA).

If you look at the energy vs rmsd plot for 1di2, you will see that one amazing Rosetta run produced a structure far lower in energy than any other, with an rmsd of ~1.3 Å. As you can see from the superposition on the right, this structure is essentially identical to the native structure--a nearly perfect prediction.

It is remarkable that exactly 1 run, out of the roughly 500,000 independent calculations all of you did, found the native minimum. With five fold less sampling, there would only be a one in five chance of having landed in the correct minimum, and rather than achieving an incredibly accurate prediction such as this one, the prediction would have been quite incorrect, as the next lowest energy structures are quite a bit higher in rmsd.

Of course, a level of sampling for which just one trajectory lands in the native minimum is not adequate for reliably predicting structure--this is why I keep harping on the need for more cpu time. With ten times more sampling, we would expect ten hits in the native minimum and a much lower chance for failure. Indeed, for the preceding two proteins, 1mky and 1tif, which are somewhat bigger than 1di2, we did not have enough sampling, and the native minimum was not found.
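
The sampling argument can be checked with a back of the envelope calculation. The sketch assumes every run is independent and has the same chance of landing in the native minimum, and it takes the observed 1 hit in roughly 500,000 runs as that chance, which is of course a very rough estimate from a single event; the function names are illustrative.

```python
def expected_hits(hit_rate, n_runs):
    """Average number of runs that land in the native minimum."""
    return hit_rate * n_runs

def chance_of_at_least_one_hit(hit_rate, n_runs):
    """Probability that at least one of n_runs independent runs hits."""
    return 1.0 - (1.0 - hit_rate) ** n_runs

observed_rate = 1 / 500_000  # one hit in roughly 500,000 runs for 1di2

for n_runs in (100_000, 500_000, 5_000_000):
    print(
        n_runs,
        round(expected_hits(observed_rate, n_runs), 2),
        round(chance_of_at_least_one_hit(observed_rate, n_runs), 4),
    )
```

With five fold less sampling the chance of any hit comes out a little under one in five, in line with the estimate above, while ten times more sampling gives about ten expected hits and makes a complete miss very unlikely.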

We can roughly divide the different prediction problems into three categories: first, those like 2reb, 1r69, and 1dcj, where a large number of trajectories end up in the native minimum and the prediction is accurate and confident; second, those like 1tif and 1mky, for which no trajectories end up in the native minimum, and for which much more sampling is necessary; and finally those like 1di2, which are precariously balanced between these two extremes. As we continue to post examples over the coming weeks you will see more instances of all three classes, and I will describe our strategies for improving classes II and III.

(You may have noted the "user unknown" label above the low energy plot; this is a funny sidenote. I was excited about showing this example and highlighting the lucky user who found this amazing solution, but we haven't been able to track down the relevant entry in the database; divine providence perhaps?)


____________

Message 12186 - Posted 18 Mar 2006 5:25:59 UTC

Rhiju has posted results for the protein 1elw, which is part of proteins called "molecular chaperones" which help proteins to fold. Even if you haven't looked at many of the top predictions posts, you should look at this one! The energy versus rmsd plot illustrates the power of large scale sampling: the native minimum is clearly located by a small subset of runs, and the lowest energy structure is amazingly similar to the experimentally determined structure (probably within experimental error). Look at the superposition carefully if you have time. This is a dramatic success of the Rosetta-BOINC approach. Congratulations to Markku Fagar from Finland for finding the lowest rmsd structure, which is close indeed.

On the error fixing front there has been rapid progress! I hope to have very good news in the next few days.



____________

Message 12246 - Posted 19 Mar 2006 7:10:07 UTC

For some exciting news on the error front, and a description of the problem and the solution, take a look at
http://www.romwnet.org/dasblogce/

It is amazing how fast Rom brought the error rate way down on Windows machines on Ralph, and we have spent today and yesterday making further modifications based on his suggestions that will hopefully bring the error rate down still further. You should see the improvements made thus far on Rosetta@home by Tuesday.

____________

Message 12395 - Posted 21 Mar 2006 6:34:20 UTC

Rom is now trying to track down the 1% bug, and we've been going over backtraces he has sent me. We still need more data, so if you have computers with this problem please sign them up for Ralph@home.

There is a new thread on "Can you help rosetta@home" started by Robert Brooke and Feet1st that already has a number of great suggestions on improving the project. We really need your help to make these happen as there are only a few of us here and we are trying to do the science and keep the project going at the same time (and fix the bugs and put out the fires that arise...). If you can coordinate on these projects it would be fantastic, and as I said earlier, we will throw a Rosetta@home volunteers party some time. For things like T shirts, mailing certificates, etc. I have some funds that could be used.
____________

Message 12491 - Posted 22 Mar 2006 5:01:44 UTC

Lots of neat stuff today!

Bin has his first set of results back from your computers on model refinement using his "loop relax" protocol, which we will explain soon. In these calculations, rather than starting with an extended chain he starts with a rough model based on already known structures of related proteins. The exciting result is that most of the time the lowest energy refined models are significantly closer to the correct structure over the core regions than the starting model before refinement. We will try to put up some pictures to illustrate this soon.

Rhiju had the excellent idea of incrementing the percent complete counter at the beginning of each of six distinct phases in the folding process. Currently the percent complete counter is only updated after a complete structure is generated, so when the counter is at 1% you (and we) don't know where the calculation is. Rhiju and I have modified the code, and if we can get this into the next release you will see the percent complete increase much more frequently; this will also help us to track down the remaining problems as we will be able to locate where in the code any problems are occurring.
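
A minimal sketch of the finer grained counter follows. It assumes the total number of structures in the work unit is known in advance and that the six phases are given equal weight; neither assumption comes from the post, and the function name is made up for illustration.

```python
N_PHASES = 6  # number of distinct folding phases mentioned in the post

def percent_complete(structures_done, phase_index, total_structures):
    """Progress in percent at the start of a phase.

    structures_done: structures fully finished so far.
    phase_index: 0-based index of the phase just started for the current structure.
    """
    if total_structures <= 0:
        raise ValueError("total_structures must be positive")
    if not 0 <= structures_done < total_structures:
        raise ValueError("structures_done out of range")
    if not 0 <= phase_index < N_PHASES:
        raise ValueError("phase_index out of range")
    fraction = (structures_done + phase_index / N_PHASES) / total_structures
    return 100.0 * fraction

# Halfway through the first of ten structures the counter already reads about 5%.
print(percent_complete(0, 3, 10))
```

Because the counter now moves inside each structure, a run that stalls shows which phase it reached, which is the debugging benefit described above.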

We have just released on Ralph the latest of Rom's improvements to Rosetta--these should allow us to pinpoint the source of the 1% error. See Rom's descriptions of his work in his message boards post. We are incredibly fortunate to have a true BOINC and Windows wizard solving all of these problems!

as I noted earlier, Rom has already solved the keep in memory problem. we are only waiting a day or two to release the fixed version to rosetta@home in case Rom solves more problems quickly so we don't have to release two versions in rapid succession.

I'm absolutely delighted about the response to the "Can you help Rosetta" thread--we will be able to do great things all together!!

____________

David Baker
Message 12552 - Posted 23 Mar 2006 6:16:54 UTC

The hero of the day is Keith, who spent many hours on his day off to track down the cause of the dramatic slowdowns you have probably noticed on and off on the rosetta@home web site. He found the problem and solved it, as he has described on the technical news page.

Rhiju has added links to information on each of the proteins in the top results section, which should be useful to people interested in learning more about what the proteins do.

Rhiju and I have been looking at the roughly 30% of proteins for which we are still failing to produce accurate models. It will come as no surprise to you that the problem appears to be insufficient sampling--the native structure is always much lower in energy than any of the structures found, and the structures sampled are all quite different from each other--the problem is not that they are getting trapped in a small region of the space, but that the space is so large. I'm hoping that if we can solve the 1% bug this week, and the efforts described in the "Can you help Rosetta@home" and related threads get going, we can really push to increase participation, which should bring some of these more challenging larger proteins into the solvable range.

For those who are helping to track down the 1% bug on Ralph, please read Rom's post on updating the boinc client--it turns out that we will get a complete error report for aborted "stuck" runs only if the very latest client is being used.
____________

David Baker
Message 12744 - Posted 28 Mar 2006 4:50:51 UTC

A couple of things today:

First, after consultations with many of you on the message boards, we have set the maximum allowed run time to close to 24 hours. No jobs should be getting stuck for much longer than this. The maximum length to which you can extend your work units has correspondingly dropped to 24 hours. This is of course a temporary fix, as we are still hot on the trail of the "1%" bug. It is clear that it is not an infinite loop within Rosetta, as it only occurs when rosetta is run within boinc. With better backtracing in the latest version on ralph, and a new version of boinc soon to be released that should make error tracing even more straightforward, we are optimistic about solving the problem in the near future.

Our RALPH tests have shown that Rom's fixes have solved most of the other problems, and the error rates are way down, on Windows machines in particular. You haven't seen the new version on Rosetta@home yet because every time we think we have a version ready to release, a small problem has shown up on a different platform which David has had to go back and fix. The new version includes more frequent updating of the "percent complete", which will allow you (and us) to localize more precisely where in the folding process work units are getting stuck.

I had some neat results over the last day or two taking the lowest energy structures identified in your large scale searches, and resampling the beta strand pairings allowing for a limited amount of additional variation. When I used pairing information consistent with the native structure, I found significantly lower rmsd and energy structures than earlier, and I am now excited to try this without any bias on a large scale--at the beginning of each run, a set of beta pairings, defining a beta sheet "topology" will be selected at random and used to guide the trajectory. Only a small subset will have the correct pairings, but based on this recent test these should have lower energies than the others and so should be detectable based on their energies. These runs will be called "TOP_SAMPLE" (for "topology sample")--look for them soon!
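
The random topology selection described above can be sketched in a few lines of Python. This is a toy illustration, not the TOP_SAMPLE implementation: the candidate pairings, the stand-in energy function, and the assumption that native pairings score roughly 20 units lower are all invented for the example.

```python
import random

# Hypothetical candidate beta-sheet topologies: each is a tuple of strand pairs.
CANDIDATE_TOPOLOGIES = [
    ((1, 2), (2, 3)),
    ((1, 3), (2, 3)),
    ((1, 2), (1, 3)),
]
NATIVE_TOPOLOGY = ((1, 2), (2, 3))  # known only in this toy test


def run_trajectory(topology, rng):
    """Toy stand-in for a folding run guided by the given pairings.

    Assumption for illustration only: runs using the native pairings
    reach lower energies than runs using any other pairings.
    """
    base = -100.0 if topology == NATIVE_TOPOLOGY else -80.0
    return base + rng.uniform(-5.0, 5.0)


def top_sample(n_runs, seed=0):
    """Run n_runs trajectories, each with a randomly chosen topology."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        topology = rng.choice(CANDIDATE_TOPOLOGIES)  # picked at random per run
        results.append((run_trajectory(topology, rng), topology))
    return min(results)  # the lowest-energy run and the topology it used


best_energy, best_topology = top_sample(n_runs=200)
print(best_energy, best_topology)
```

Only about a third of the toy runs use the native pairings, yet the lowest-energy run identifies them, which is the behaviour the real runs are meant to test.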

____________

David Baker
Message 12784 - Posted 29 Mar 2006 7:41:32 UTC

As you know, David Kim released the version we've been improving on RALPH today on rosetta@home.
I just checked the error rates for jobs on RALPH with this version over the last 24 hours, and they are great:

Version  OS       Total Results  Pass Rate (%)  Fail Rate (%)
496      Windows  1284           98.05          1.95
487      Darwin   50             98.00          2.00
487      Linux    178            96.63          3.37

So you should see an improvement over the next several days as the new jobs replace the older ones.

As I mentioned earlier, the new version more continuously updates the "percent complete" value, and if you do have a stuck work unit please let us know exactly what percent complete it was at as this will allow us to pinpoint where in the program the sticking occurs.
____________

David Baker
Message 12900 - Posted 1 Apr 2006 5:15:56 UTC

Version 4.83 is looking great! Here are the results for the last 24 hours on Rosetta:


Version  OS       #jobs  %success  %failure
483      Windows  84993  95.59     4.41

A large fraction of these errors are associated with downloading files; these are not so bad as the process exits immediately so no time is lost.

I had hoped to be able to track down the "1% bug" using the more continual updating scheme we put in, but so far there have been no reports of "stuck" WUs with 4.83, which we can't complain too much about!

Almost all the recent error reports have been with older WU--if these are causing problems for you please go ahead and delete them.

I'm really excited about the ideas Feet1st, Robert Brooke, and others are discussing on the threads below about recruiting more users to the project. The letter to inactive rosetta participants will be a great step in this direction. If you have time, check out their excellent work and get involved!
____________

David Baker
Message 13035 - Posted 4 Apr 2006 6:03:05 UTC

Rhiju has a new exciting approach to the search problem I'd like to tell you about. Tests of his approach have been running the past several days on your computers and the results are very promising.

Recall that our search for the lowest energy structure occurs in two stages which are clearly distinguishable when you watch the progress of the calculations on your screensaver. In the first "low resolution" stage, the protein explores a wide range of different conformations, often changing quite wildly. In the second "high resolution" stage, the range of motions is much smaller because all the atoms in the protein are represented in detail and almost all large changes would lead to impossible structures with atoms on top of each other.

In the low resolution stage, we can sample broadly and rapidly, but because of the approximate representation of the protein chain, the computed energies are not very reliable. In contrast, in the high resolution stage we can compute energies accurately, but it is very difficult to sample.

Rhiju's idea is to try to combine the best of both worlds: the accuracy of the high resolution energy calculations and the rapid and broad sampling of the low resolution calculations. You can think of the many models returned by your searches as building up a map of the high resolution "energy landscape". What Rhiju does is to take a large set of high resolution structures and their energies returned by your computers, and derive a model of the high resolution energy landscape from them. He then starts a new large scale set of low resolution runs, ADDING this modeled energy landscape to the standard low resolution energy function. These new runs can explore space broadly and rapidly, but will be guided to the regions that are low in energy according to the high resolution energy model.
As I mentioned above, his results over the past few days have been very promising, with good predictions for a number of proteins we were struggling with before.
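
Here is a minimal Python sketch of the idea of adding a modeled high resolution landscape to the low resolution energy. Everything in it is an illustrative assumption rather than Rhiju's actual method: conformations are reduced to a single coordinate x, the landscape model is a Gaussian-kernel weighted average, and the sample values are invented.

```python
import math


def fit_landscape(samples, width=1.0):
    """Build a smooth model of high-resolution energies from (x, energy) samples.

    Uses a Gaussian-kernel weighted average (an assumed, illustrative choice).
    """
    def model(x):
        weights = [math.exp(-((x - xi) / width) ** 2 / 2) for xi, _ in samples]
        total = sum(weights)
        if total == 0.0:
            return 0.0  # far from all samples: no information, so no bias
        return sum(w * e for w, (_, e) in zip(weights, samples)) / total
    return model


def low_res_energy(x):
    """Toy low-resolution energy: cheap to compute but only roughly right."""
    return 0.1 * (x - 1.0) ** 2


def guided_energy(x, model, weight=1.0):
    """Low-resolution energy PLUS the modeled high-resolution landscape."""
    return low_res_energy(x) + weight * model(x)


# Hypothetical high-resolution results: the deepest sampled point is near x = 4.
high_res_samples = [(0.0, -2.0), (1.0, -3.0), (2.0, -2.5), (4.0, -9.0), (5.0, -4.0)]
model = fit_landscape(high_res_samples)

grid = [i / 10 for i in range(0, 61)]
low_res_best = min(grid, key=low_res_energy)
guided_best = min(grid, key=lambda x: guided_energy(x, model))
print("low resolution alone prefers x =", low_res_best)
print("guided search prefers x =", guided_best)
```

In this toy, the low resolution function alone settles at its own minimum, while the combined function is pulled toward the region where the high resolution samples were lowest in energy.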

____________

David Baker
Message 13069 - Posted 5 Apr 2006 6:32:50 UTC
Last modified: 5 Apr 2006 6:33:34 UTC

Today I looked again at the results of the HBLR_1.0 runs I sent out a month and a half ago. Quite a few more results have been returned since I last analyzed these, and I was blown away by what I saw.
For example, for 1dtj, for which there is a snapshot of the results as of March 10 on the "top predictions" page, there is now a single point with very low rmsd that is much lower in energy than any other. We now have 1.1 million (!!) results returned for this protein, and one of these 1.1 million runs hit the jackpot. A couple of the other test proteins have similarly amazing "one in a million" low rmsd and low energy points, and the results for all of the proteins have gotten considerably better than earlier with the increased sampling of the energy landscape. Again, at the risk of sounding like a broken record, these results really highlight the absolutely critical role of massive distributed computing in solving the protein folding problem--in house we were able to do ~10,000 independent runs, but 1,000,000 was completely out of the question.
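
As a back-of-the-envelope illustration of why the number of runs matters so much, the short Python sketch below assumes that runs are independent and that each has a one in a million chance of landing in the native minimum. Both assumptions are simplifications made for the example; the real per-run chance varies from protein to protein.

```python
def chance_of_at_least_one_hit(p_hit, n_runs):
    """P(at least one success) for n independent runs, each succeeding with p_hit."""
    return 1.0 - (1.0 - p_hit) ** n_runs


P_HIT = 1e-6  # assumed "one in a million" per-run chance

for n in (10_000, 1_100_000):
    chance = chance_of_at_least_one_hit(P_HIT, n)
    print(f"{n:>9,} runs: {100 * chance:.1f}% chance of at least one hit")
```

Under these assumptions, 10,000 in-house runs would find such a point only about 1% of the time, while 1.1 million runs would find one roughly two times out of three.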

With our improvements over the past few months and this big increase in sampling, the prediction failures are becoming far fewer, but for the thornier problems it is clear that we are not sampling enough. With the 150 teraflops the project is aiming for, even these should fall into place.

So everybody, please look at the letter to inactive rosetta users in the threads below, and spread the word!

thanks!

David


(On a more technical note, Rom knows what the problem is with the 0 cpu time on the win98 computers, and will have it fixed soon. In the meantime, please continue to crunch with these machines; credits will be awarded and the results are being collected as usual.)
____________

David Baker
Message 13522 - Posted 12 Apr 2006 6:09:22 UTC

As you know, I mistakenly sent out a large batch of jobs without properly testing them first on RALPH. I apologize again for the trouble this caused you over the weekend.

I did get enough results back from machines which did not have problems with the jobs to see that the improvement in sidechain sampling does improve the overall search. Particularly dramatic were the 1di2 and 1dtj cases, which with the standard protocol had "1 in a million" low energy low rmsd points, but with the improved sidechain sampling protocol had many more points in these (correct) low energy minima. We are currently working to track down the source of the windows specific problem in the sidechain sampling routines, and when this is fixed we will test on ralph and then (after verifying that the error rate is low!) transition on to rosetta.

The period of frequent updates and experimenting with new methods will soon give way to putting together everything we have learned since rosetta@home started in September for the CASP7 structure prediction challenge. Here is an email I recently got from the organizers, which includes the URL for the project. The exact starting date hasn't been announced but will appear on their site soon.

From: casp@predictioncenter.org
Subject: CASP7 registration is open
Date: April 4, 2006 4:41:01 PM PDT
To: casp@predictioncenter.org
Reply-To: casp@predictioncenter.org

Dear members of the CASP community,

It is this long-awaited time of the year again!
We are starting CASP7 season. The registration is now open.
We encourage you to register at your earliest opportunity as of
the next week we will stop sending CASP-related emails to our
CASP6 distribution list and will start sending those to the people
that registered for CASP7. The early registration is especially
important to our server curators as we are planning to have server
dry run in the mid-April.

The main CASP7 web page http://predictioncenter.org/casp7/
contains the details for this round of experiment. If you can not find
answers to your questions there - please write us at
casp@predictioncenter.org .

Hope we all will be having a fruitful and enjoyable season! Good luck!

CASP7 organizers
____________

David Baker
Message 13619 - Posted 13 Apr 2006 5:16:28 UTC

Good news today:

First, Rhiju and I found the bug in the rosetta code that caused several of his jobs to get stuck. I'd describe it to you, but it is pretty arcane, and it only affected proteins of exactly 44 amino acids, so it had not been seen before. Rhiju met up with this bug as he has been following up recent observations that cutting the ends off protein sequences can significantly improve prediction results for the core of the sequence. Rhiju has cancelled the offending jobs and corrected the problem in the code, so this will not happen again.

Second, David Kim has awarded credits to those who lost valuable time during the problems last weekend.

Third, I've had excellent discussions with Janet Skeels, in the UW Office of Research. She and her office have done a wonderful job helping with publicity; see the UW home page at:

http://www.washington.edu/
____________

David Baker
Message 14427 - Posted 23 Apr 2006 1:47:59 UTC
Last modified: 24 Apr 2006 23:05:28 UTC

Hello All,

Sorry for not reporting earlier; I was away this past week with several people in my group at a meeting in Florida on designing new enzymes to catalyze any desired chemical reaction. This is a very exciting area, and hopefully we will be able to run some of our design calculations on rosetta@home after CASP7 finishes.

Bin and Rhiju have been doing wonderful things to try to keep all of you and your computers happy.
I'm a bit out of the loop on this as I've been away, but I thought I'd share with you some of the correspondence they have been cc'ing me on so you can see what the pace is like here even on a sunny spring Saturday in Seattle. The watchdog thread is Rhiju's solution to stuck jobs--any job that runs for more than a specified length of time gets killed; this was originally suggested by some of you on the message boards.






(Bin to Rhiju)
Date: April 22, 2006 4:50:27 PM PDT

OK. If you have the error info thing, then the only other change is add a
set_pose_flag(false);

before output_decoy() in get_the_hell_out().

Of course if we don't output_decoy then this is not necessary.

Thanks!

Bin

----- Original Message ----- From: "Rhiju Das"
Sent: Saturday, April 22, 2006 4:47 PM
Subject: Re: should we increase the max cpu run time to 4 days


Hi Bin:

I just wrote the same error-info thing but also haven't checked in! I'm also changing a few other things. Maybe instead of checking into SVN, can you point me to your code?

Also, I was thinking that we should not output_decoy, just a blankfile -- that way we won't get confused with incomplete decoys. This is similar behavior to DK's 5 strikes.

Cheers,
Rhiju



Subject: Re: should we increase the max cpu run time to 4 days
Date: April 22, 2006 4:43:36 PM PDT

Sounds good.

While we were at this, I'm changing a couple of things in watchdog too:

in get_the_hell_out(), before calls output_decoy(), I'm adding a set_pose_flag(false). This is because if the watchdog killed the thread while running in pose mode, output_decoy() will fail since it's not compatible with pose.

also I added different error info for score_not_change killing and twice_cpu_pref_time killing.

I can check in this changes in a minute if you can OK them.

Bin


Sent: Saturday, April 22, 2006 4:32 PM
Subject: Re: should we increase the max cpu run time to 4 days


Hi Bin:

Can you wait until the evening to resubmit your jobs? I'm making some additional changes to the watchdog to make it gracefully exit (so the user automatically gets credit), based on advice
from the message boards. I can hopefully get ralph 5.03 up and running by tonight.

Cheers,
Rhiju


On Apr 22, 2006, at 4:26 PM, Bin Qian wrote:


You are right! My outfiles are named wrong!


Sent: Saturday, April 22, 2006 4:23 PM
Subject: Re: should we increase the max cpu run time to 4 days



Hi Bin:

The -161 error looks like a file transfer error ... can you run one of these guys locally and see if the right filename is being outputted? I'm having the same errors with my jumping runs; looking into it.

Thanks,
Rhiju

On Apr 22, 2006, at 4:15 PM, Bin Qian wrote:



Hi Rhiju,

I looked at the results this morning and noticed the failures, but I think they are more than just aborting by watchdog. For example the NO_DOG jobs that had -no_watchdog flag are also failing. Actually they are not really failing - almost all the failing jobs (with or without watchdogs turned on) have the following error information:
<core_client_version>5.4.4</core_client_version>
<stderr_txt>
# random seed: 3885595
# cpu_run_time_pref: 14400
# DONE :: 1 starting structures built 5 (nstruct) times
# This process generated 5 decoys from 5 attempts
# 0 starting pdbs were skipped

</stderr_txt>
<message><file_xfer_error>
<file_name>NO_CHECK_NO_DOG_7486h002_dec129_1.pdb_408_4_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
So looks like the WU did make 5 decoys as specified (-nstruct 5), but for some reason returned with an error code -161. Actually the exit status of these WUs are all "0", which I thought meant non-error.

Wait, even the successful WUs have the following error info:
http://ralph.bakerlab.org/queue_ops/db_action.php?table=result&id=92975
<core_client_version>5.2.13</core_client_version>
<stderr_txt>
# random seed: 3885617
# cpu_run_time_pref: 3600
**********************************************************************
Rosetta score stayed the same too long. Watchdog is killing the run!
**********************************************************************

</stderr_txt>
<message><file_xfer_error>
<file_name>NO_CHECK_7486h002_dec184_1.pdb_407_17_0_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>

</message>
Is error code -161 returned by the watchdog thread?

Bin

Sent: Saturday, April 22, 2006 12:59 PM
Subject: Re: should we increase the max cpu run time to 4 days


Hi Bin:

The watchdog is working -- maybe too well. I have it shut down Rosetta if it goes for longer than twice the default cpu run time (an hour for ralph). For ralph that time was 1 hour, so all of your jobs are being aborted after two hours!

I just changed the ralph_submit time to be 3 hours; can you send out your jobs again? Also, if you want your jobs to run longer than 6 hours, you can use the flag "-cpu_run_timeout_factor <float>" which is currently set to two, for twice the default run time.

I'm realizing now that I should use the actual cpu_run_time from the boinc api, so I may make that change later today and post ralph 5.03.

Thanks,
Rhiju
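
The two watchdog conditions discussed in the correspondence above (running longer than a multiple of the preferred cpu run time, and the score not changing for too long) can be summarized in a small Python sketch. This is an illustration, not the Rosetta watchdog itself: the function name is hypothetical, and the one hour stall threshold is an invented value, since the correspondence does not say how long "too long" is.

```python
def watchdog_check(cpu_seconds, cpu_run_time_pref, seconds_since_score_change,
                   timeout_factor=2.0, max_stall_seconds=3600):
    """Return a reason string if the run should be stopped, else None.

    timeout_factor mirrors the cpu_run_timeout_factor flag mentioned above
    (currently two); max_stall_seconds is a made-up threshold for illustration.
    """
    if cpu_seconds > timeout_factor * cpu_run_time_pref:
        return "cpu time exceeded timeout_factor * cpu_run_time_pref"
    if seconds_since_score_change > max_stall_seconds:
        return "score stayed the same too long"
    return None


# With a 1 hour preference and a factor of two, a job is stopped after 2 hours.
print(watchdog_check(cpu_seconds=7300, cpu_run_time_pref=3600,
                     seconds_since_score_change=10))
# A job still inside its time budget and still making progress keeps running.
print(watchdog_check(cpu_seconds=1800, cpu_run_time_pref=3600,
                     seconds_since_score_change=10))
```

Returning a distinct reason for each condition corresponds to Bin's suggestion of separate error info for the two kinds of kill.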

____________

David Baker
Message 14521 - Posted 24 Apr 2006 5:30:45 UTC

Tonight I would just like to recommend that everybody look at Rhiju's descriptions of current work units in the "Active work units log" on these boards. They are really terrific and will give you a picture of the improvements in the science we are currently working on. On the computing side, Bin made an exciting discovery today--with his new frequent checkpointing during the relax protocol, many more structures seem to be returned than previously; this could reflect work being lost in the earlier runs when rosetta@home was temporarily interrupted. You should see this advance, together with Rhiju's new watchdog thread and other improvements along these lines, on Rosetta@home by the end of the coming week.
____________

David Baker
Message 14723 - Posted 27 Apr 2006 4:56:17 UTC

I'm delighted to see that the rosetta@home throughput has been climbing recently! This is excellent timing as CASP7 is scheduled to start very soon. When it does start, there will be some changes in the screensaver as the true structure will not be known; we will have a question mark and the target number instead of the native structure.

As a warmup, we will be running targets from CASP6 on rosetta@home this week; some of these proteins are larger than what you are used to, but tests on ralph have not shown any problems. For the larger proteins we stop after the low resolution search because of the greater memory requirements of the high resolution search.

We had planned to release the new version of rosetta@home yesterday, but there have been some lingering issues with the watchdog and preemption that have taken a bit longer than we anticipated to resolve. The hope is to send out the new version tomorrow; if not, we will wait until Monday because many of you dislike weekend releases.
____________

David Baker
Message 14962 - Posted 29 Apr 2006 4:10:09 UTC

Lots of neat stuff coming up this week!

Rhiju, Bin and David K. think they have the watchdog and checkpointing all working properly. This means
(1) No stuck jobs ever!!
(2) Little time wasted when Rosetta is taken out of memory--process resumes in the middle of whatever structure was being calculated.

I'm excited to see the results of the jobs now running that test the improved sidechain sampling. The outcome should be clear by next Monday/Tuesday. Rhiju will then be testing an exciting combination of the "jumping" protocol invented by Phil Bradley here, the star of CASP6, with fullatom refinement.

Will Sheffler, a graduate student here, has developed a "smoother" version of the energy function which we hope will make possible the finding of deep minima from further away. We should be able to start testing this next week as well.

Other good news is that Microsoft has generously agreed to cover Rom Walton's consulting fees. This means that Rom will be back soon to put in more robust backtracing on ralph and rosetta to allow him and us to track down the remaining access violation errors. He will also fix the 0 credit win98 problem as promised some time ago. He is deeply involved in getting the latest boinc release ready for prime time, so it may be a little while before you see him back in action here.
____________

David Baker
Message 15133 - Posted 1 May 2006 6:13:08 UTC

The CASP6 test proteins currently running are larger than many of the proteins we have been testing our methods with thus far, which are small by protein standards. In CASP7, which is about to start, we expect the size distribution to resemble those of the CASP6 targets proteins, and that is why we are currently running tests with these proteins. Since these proteins are longer, the calculations take longer and require somewhat more memory. With the safeguards and checkpointing Rhiju and Bin have put in place, we hope these work units do not cause any trouble for you other than taking somewhat longer to complete--please let us know if this is (or is not!) the case.
____________

David Baker
Message 15248 - Posted 2 May 2006 6:43:47 UTC

Good news tonight!

(1) Error rates are lower than ever (2% Linux, 5% Windows, 6% Mac) even though we are currently running calculations on larger proteins which are more computationally demanding. Great job Bin and Rhiju!!

(2) After going through the code today, I think we can reduce the memory requirements for the larger proteins by at least 25%; I hope to make progress on this front this week. In addition to easing the burden on lower memory machines, this may help to reduce some of the remaining low frequency errors.
____________

David Baker
Message 15600 - Posted 6 May 2006 6:05:03 UTC

Today I got several phone messages/questions about things I had said on TV last night, which I found pretty bewildering; at first I thought it must be a different David Baker, but finally tracked it down to a showing on UW TV of a videotape of a lecture I gave in the computer science department here a few months ago. I tried to watch a bit of it to see whether it might be interesting to participants, but was so horrified by the ums and uhs that I had to stop after the first minute. In any event, if you are interested in finding more about our research, and can stand the ums, you can find it at:

http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=449


Other news is that a reporter from the Associated Press is doing a story on rosetta@home and will be contacting some of the volunteers posting in the message boards.

On the science side, we have some new ideas for improving predictions for larger proteins using the low resolution part of the Rosetta folding process which you should see in action soon.

And finally, CASP7 is scheduled to begin on Monday, so look for some exciting prediction challenges coming to your computer very soon!


____________

David Baker
Message 15673 - Posted 8 May 2006 2:51:20 UTC

I just received the following email from the CASP organizers with the latest information on the CASP7 experiment, which begins on May 10:

Dear Predictors,

CASP7 will begin May 10 with a single target trial followed by two
targets on May 11.

Some additional notes on this years’ process:

MODEL ACCURACY
We will pay special attention to model accuracy predictions. These can
be submitted for own predictions in the regular CASP format (PDB
B-factor field) on a per atom basis. These should be error estimates in
Angstroms. In addition an overall score for a given model can be
submitted as follows:
REMARK SCORE 0.7
in prediction header, where SCORE range is 0.0 to 1.0 (1.0 being a
perfect model). Independently, model accuracy predictions can be
submitted on server models, usually available within few days of target
release, or on your own models in the following format:
http://predictioncenter.org/casp7/doc/casp7-format.html#QA
The deadline for accuracy predictions on server models will be the
regular deadline for that target (typically 3 weeks).

In addition, to assess model evaluation methods on all CASP models
(i.e., including human expert predictions) we will be shorly collecting
software for a later assessment. Additional details will be announced
separately.

MODEL REFINEMENT
Special attention will also be paid to model refinement. In some cases
(CM/easy targets of less than 150 residues, preferably less than 100
residues), a single model submitted during the regular prediction window
will be selected for further refinement by others. Refinement window
will then open for additional 3 weeks. The usual refinement of own
models is still encouraged (using the unrefined and regular – for
refined models - model designations).

PREDICTION WINDOWS
Prediction windows will in general be shorter than in previous CASPs
(approximately 3 weeks). This is to adhere more closely to the
target structure release timelines adopted by crystallographers and to
minimize information leaks and subsequent target cancellations. However,
to allow assessment of methods requiring longer computation times, at
least some target deadlines will be extended. In such cases we will
still strongly encourage submitting models within the 3 week prediction
window. If information leak occurs after the initial three weeks but
before the assigned prediction deadline, evaluation of models will be
limited to those submitted within the 3 week window only.

PREDICTION OF FUNCTION
The format for function predictions is as follows:
http://predictioncenter.org/casp7/doc/casp7-format.html#FN
Additional targets will be made available for this category of
prediction (targets for which experimental structures may not be
forthcoming).

MODEL QUALITY FILTERS
Human expert predictions with severely unrealistic geometry will be
rejected outright. The criteria for this are as follows:
More than 5% of CAs taking part in clashes of less than 1.9 Angstrom.
OR
More than 25% of CAs taking part in clashes of less than 3.6 Angstrom.
CA-CA clashes below these percentage values as well as segmented
predictions with more than 4 chain breaks (CAs adjacent in sequence
separated by more than 5 Angstroms) will be flagged (warnings issued).
The model will be accepted, but it might be penalized in the assessment.
Missing loops or other deletions are acceptable.

Server predictions with clashes will be accepted in all cases, but
similarly to the human expert predictions will be issued warnings and
might be penalized in the assessment.

--
CASP7 organizers
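
The model quality filters in the email above translate fairly directly into a check on CA coordinates. The Python sketch below is one possible reading of those rules, not the organizers' software: it assumes the input is a list of CA (x, y, z) positions in sequence order with no missing residues, and that "taking part in clashes" means being closer than the cutoff to any other CA. The function names are hypothetical.

```python
import math


def ca_clash_fraction(coords, cutoff):
    """Fraction of CA atoms closer than `cutoff` Angstroms to any other CA."""
    n = len(coords)
    clashing = set()
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                clashing.update((i, j))
    return len(clashing) / n if n else 0.0


def count_chain_breaks(coords, max_gap=5.0):
    """Count sequence-adjacent CA pairs separated by more than max_gap Angstroms."""
    return sum(1 for a, b in zip(coords, coords[1:]) if math.dist(a, b) > max_gap)


def check_model(coords):
    """Return 'reject', 'warn', or 'ok' for a human-expert model."""
    severe = ca_clash_fraction(coords, 1.9)  # clashes of less than 1.9 Angstrom
    mild = ca_clash_fraction(coords, 3.6)    # clashes of less than 3.6 Angstrom
    if severe > 0.05 or mild > 0.25:
        return "reject"
    if severe > 0 or mild > 0 or count_chain_breaks(coords) > 4:
        return "warn"
    return "ok"


# An extended toy chain with 3.8 Angstrom CA-CA spacing passes cleanly.
straight_chain = [(3.8 * i, 0.0, 0.0) for i in range(20)]
print(check_model(straight_chain))
```

For server predictions the same measurements would apply, but per the email a model over the rejection thresholds would only receive a warning.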


____________

David Baker
Message 15675 - Posted 8 May 2006 5:39:00 UTC


I was just looking at the results for the top teams:

Rank  Name              Members  Recent average credit  Total credit   Country
1     XtremeSystems     246      311,043.65             25,052,975.81  International
2     Dutch Power Cows  1071     261,066.85             24,927,849.52  Netherlands
3     Free-DC           154      174,032.69             28,027,746.46  International

This is serious computing power; I think the recent credit numbers correspond to roughly 3,100, 2,600 and 1,740 computers crunching more or less full time for these top three teams, which is fantastic!
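
The computer counts above follow from dividing each team's recent average credit by a per-machine figure. The Python sketch below reproduces that arithmetic under the assumption of about 100 credits per day for one computer crunching full time; that round figure is inferred from the estimates themselves and is not an official conversion.

```python
CREDITS_PER_COMPUTER_PER_DAY = 100  # assumed round figure, inferred from the estimates

recent_average_credit = {
    "XtremeSystems": 311_043.65,
    "Dutch Power Cows": 261_066.85,
    "Free-DC": 174_032.69,
}


def estimated_computers(rac, per_computer=CREDITS_PER_COMPUTER_PER_DAY):
    """Rough number of full-time computers implied by a recent average credit."""
    return rac / per_computer


for team, rac in recent_average_credit.items():
    print(f"{team}: about {estimated_computers(rac):,.0f} computers")
```

Faster or slower machines earn more or less than this per day, so the result is only an order-of-magnitude guide.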

It looks like the DPC have nearly caught up to XtremeSystems in total credit, but XtremeSystems is moving ahead faster. Why does Free-DC have the most total credit but considerably less recent credit than the other two teams?

We will have to have prizes when CASP finishes at the end of July for the top overall team and the top team during CASP. At the minimum, there will definitely be a citation in the science paper on rosetta distributed computing!
____________

David Baker
Message 15705 - Posted 9 May 2006 6:14:48 UTC

The lead article in the June issue of Scientific American describes some of our work on engineering new molecules. I've been discussing with the editors an article on rosetta@home. With CASP starting in a couple of days, we will have a parallel competition for the "top 5 teams", just as in CASP we get to submit the "top 5 models". We will be keeping track of the total credits earned by each team from May 10 to Aug 1, when CASP ends, and will describe the winning teams and their contributions to the CASP prediction efforts in the above article and in book chapters on distributed computing we will be writing at the end of the summer.
The spirit of friendly competition has made CASP exciting for the past 10 years, and it is great that this can extend now, in a very positive way, to the key problem of computing power!



____________

David Baker
Message 15754 - Posted 10 May 2006 5:20:45 UTC

In anticipation of the upcoming CASP7 targets we have made a concerted effort to reduce memory use in Rosetta, and I'm optimistic we can get below 100MB for a reasonable size (150 amino acid) protein. Rhiju found that on his Mac a run for a large protein used 150MB when the graphics were off, but well over 300MB with the graphics on. We are working to track down why the graphics are taking so much memory.

Two questions:
(1) What level of memory use are you seeing for rosetta@home (with graphics) on your computers?
(2) Should we disable the graphics in the next release to reduce memory use for larger proteins?
(At ~100MB per work unit, there should be no problems on most machines.)
____________

David Baker
Message 15757 - Posted 10 May 2006 6:58:53 UTC

CASP7 is starting tomorrow, and we are excited both about testing the methods we have been developing over the past 9 months on proteins of unknown structure, and also testing some new ideas that we think can really improve structure prediction for larger proteins. One of these ideas that seems very promising now is that we may be able to recognize even at the low resolution level, where computing is much faster, some topological features which distinguish native structures from random chain conformations.

One of the great things about CASP is that it inspires all the participants to come up with new ideas and approaches before and during the experiment. So in the next month you can expect both work units for CASP targets with unknown structures, and work units for proteins of known structure where we are testing out very recent ideas.
____________

David Baker
Message 15896 - Posted 11 May 2006 6:44:14 UTC

The first CASP7 target was released today!

Here is its amino acid sequence:

MSFIEKMIGSLNDKREWKAMEARAKALPKEYHHAYKAIQKYMWTSGGPTDWQDTKRIFGG
ILDLFEEGAAEGKKVTDLTGEDVAAFCDELMKDTKTWMDKYRTKLNDSIGRD

Can you tell from this what the three dimensional structure and function
of this protein are?

The problem with proteins, of course, is that you can't
read off directly from the sequence what the structure and function are, although
both are completely determined by the sequence (the genetic blueprint, quite literally).

We are excited because this protein looks unrelated to any protein of known structure, and is
not too much bigger than most of the proteins we've been running tests on these past months, so it
is a perfect challenge for the methods we've been developing. After some quick runs on RALPH to
make sure work units behave properly, you should see work units for this protein by the end of tomorrow!
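
As a small illustration of what can be read directly off the sequence, the sketch below computes only its length and amino acid composition, using the sequence exactly as posted above. Nothing here predicts structure or function; it just uses the Python standard library to show how little the raw string gives away.

```python
from collections import Counter

# The target sequence as posted above (the two lines joined together).
sequence = (
    "MSFIEKMIGSLNDKREWKAMEARAKALPKEYHHAYKAIQKYMWTSGGPTDWQDTKRIFGG"
    "ILDLFEEGAAEGKKVTDLTGEDVAAFCDELMKDTKTWMDKYRTKLNDSIGRD"
)

length = len(sequence)
composition = Counter(sequence)

print(f"length: {length} residues")
# Show the five most common residue types and their share of the chain.
for residue, count in composition.most_common(5):
    print(f"{residue}: {count} ({100.0 * count / length:.1f}%)")
```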




____________

David Baker
Message 16011 - Posted 12 May 2006 5:38:12 UTC

Two more CASP7 targets were released today. One of the sequences is clearly similar to a protein with known structure, and we will use the known structure as a starting point in the searches. The other new sequence, like yesterday's, is not related to any sequence with a known structure, so again we will predict its structure using the methods we've been developing here for the last months. It is a little harder though--at 200 amino acids it is larger than almost all of the test proteins we have been testing on. So we are in somewhat uncharted territory here. This is the great thing about CASP--you have to try to solve problems that you would not have otherwise attempted! By pushing people to try to solve very hard problems, CASP is a great stimulant to progress in the field.

Things are getting very busy already, and we are only on the second day!
____________

David Baker
Message 16052 - Posted 12 May 2006 14:52:29 UTC

I thought I'd answer here two good questions that were just raised in the "Comments" thread:

(1) As Rollo suggested, for our five submissions for CASP we will take the five lowest energy structures, ensuring that none are really close to each other (unless all the low energy structures are similar to each other, as we saw in our tests on some of the easiest proteins).

(2) For large proteins, we don't have any data to guide us on what prediction accuracy to expect. We did do tests on CASP6 targets, but it is clear that these were greatly limited by sampling, and for CASP7 we are doing MUCH more sampling for each target than we've done in any of our tests.
So CASP7 is very much an "experiment" for us, as it is supposed to be.

Also, Moderator9 asked me to remind everybody that since we don't know the true structures of the CASP7 targets, on the screensaver no "native structure" will show up and the rmsd cannot be computed.
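
The selection rule in point (1) can be sketched as a greedy pass over models sorted by energy. This is only an illustration under stated assumptions: the `models` layout, the 2.0 cutoff, and the simple `rmsd` helper (which assumes the structures are already superimposed) are all hypothetical and are not taken from the Rosetta code.

```python
import math


def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the two structures are already superimposed; no alignment is done.
    """
    total = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(total / len(a))


def pick_diverse_lowest(models, n=5, min_rmsd=2.0):
    """Pick up to n lowest-energy models, skipping near-duplicates when possible."""
    ranked = sorted(models, key=lambda m: m["energy"])
    chosen = []
    # First pass: accept a model only if it differs from everything chosen so far.
    for model in ranked:
        if len(chosen) == n:
            break
        if all(rmsd(model["coords"], c["coords"]) >= min_rmsd for c in chosen):
            chosen.append(model)
    # Fallback for converged runs: fill any remaining slots by energy alone.
    for model in ranked:
        if len(chosen) == n:
            break
        if not any(model is c for c in chosen):
            chosen.append(model)
    return chosen


# Toy two-atom "structures" for illustration only.
models = [
    {"name": "a", "energy": -10.0, "coords": [(0, 0, 0), (1, 0, 0)]},
    {"name": "b", "energy": -9.5, "coords": [(0, 0, 0), (1, 0.1, 0)]},
    {"name": "c", "energy": -9.0, "coords": [(0, 0, 0), (5, 0, 0)]},
]
picked = [m["name"] for m in pick_diverse_lowest(models, n=2)]
print(picked)
```

The fallback pass covers the exception mentioned in point (1): when all low energy structures resemble each other, the submission slots are still filled.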
____________

David Baker
Message 16233 - Posted 14 May 2006 6:20:09 UTC

I just got an email from CASP which I've copied below; next week is going to be busy!!



Dear CASP7 participants,

Quick update on the experiment progress.

Today (Friday) we have closed accepting server predictions for the first
target, T0283. 80 servers submitted their predictions for this target.
That's already an impressive increase from the total of 62 servers that
participated in CASP6. We are still waiting for several more servers
which curators are still working on their setup. Server predictions
will be made available through our web site as soon as we stop accepting
server corrections (3 days after the target release). Then all the
participants (servers and humans) will have an opportunity to try
themselves in assessing quality of server models (we have announced
earlier about our new QA category).

Next week we plan to release at least 8 new targets starting with two on
Monday.

We are receiving A LOT of emails these days. But amazingly, only one
predictor wrote to us about the gap in target numbering (T0284 and then
T0287). Please, don't be surprised if you see cases like this in the
future. Nothing special about it - I just prepared different targets for
release on Friday and then I had to let other 2 targets go in front of
the prepared ones because of their shorter deadline. The two skipped
targets (T0285 and T0286) will be released next week.

--
Andriy Kryshtafovych
CASP team
____________

David Baker
Message 16427 - Posted 17 May 2006 4:32:09 UTC

Rhiju and I had fun looking at the lowest energy structures returned for T283 thus far. They are very similar in the first two thirds of the protein, but for the last third we see several distinct solutions. Based on our experience with the test problems over the past three months, we expect that with more sampling one of the solutions will clearly win out, and this should be (we hope!) the correct structure.
Thus far we have about 100,000 structures returned; we hope to have 10x more sampling before submitting predictions.

We made another step forward today in reducing the Rosetta memory footprint. For a 175 amino acid residue protein, the standard ab relax protocol we are using for CASP took 222MB of virtual memory three weeks ago, and is now down to 108MB! Now a computer with only 256MB of memory should be able to comfortably process rosetta@home jobs even for larger proteins. The major memory hog now is the BOINC graphics, which can add on another 100MB or more--any experts out there who might be able to help with this? In any event, you should be able to run rosetta@home on low memory machines as long as you turn the graphics off.

A side benefit to some of the memory use reductions is that it should be relatively easy to reduce the sizes of some of the input files we send out with each work unit. Would a 30% reduction make a significant difference to dialup users?

Seven targets have been released thus far in CASP7. The list, which is updated daily, is at
http://predictioncenter.gc.ucdavis.edu/casp7/targets/cgi/casp7-view.cgi?loc=predictioncenter.org;page=casp7/.
____________

David Baker
Message 16517 - Posted 18 May 2006 6:19:44 UTC

I was asked how it was possible to reduce the memory usage on rosetta@home so dramatically.

The answer has to do with the extremely rapid pace at which the rosetta code base is evolving and the very large number of developers. I have encouraged all of the researchers leaving my group for faculty positions at other universities to continue working with and developing rosetta, and now there are six research groups in addition to mine actively developing the code. We share all of our advances through a common code repository (SVN) into which developers in all of the groups are encouraged to incorporate their changes and improvements. On a typical day, there can be as many as eight different people committing their changes into the repository. Nightly automatic benchmarks are run on a wide variety of test problems which cover the wide range of applications currently being pursued with rosetta (you can read about some of them in the links from the home page). As you can imagine, the results of these benchmarks are scrutinized carefully, and if there is anything amiss a flurry of emails goes around all the groups until any problems are resolved. You could imagine a more conservative approach to code evolution, but my philosophy is that there are so many important hard problems to solve in biology that we all benefit from incorporating all advances as soon as possible.

Now, because of all the new areas being pursued in the different groups, and the very large number of developers, the code base is constantly growing. This is occurring even as we try to make rosetta as suitable as possible for distributed computing. Up until recently, as you are all too aware, there were a number of problems with the rosetta-boinc interaction and with distributed computing with rosetta generally which occupied all of our efforts. Due to the work of Rhiju, David K., Bin and Rom, these problems have been largely solved. This has given us time to try to make rosetta even better for distributed computing--because the problems we are trying to solve are so big, we hope to ultimately reach the size of seti@home.

Early on, users told us the memory footprint was a significant problem. We didn't have time to deal with this with all the fires we were trying to put out until recently. I had time over the past three weeks to compile and pore over a list of all the arrays in rosetta that are larger than 1MB. With help from quite a number of developers, we systematically went through the list, starting at the top, and tried to reduce each as much as possible. Many of the arrays were mode specific, and could be dynamically allocated only when needed, and others could be replaced by more efficient containers. It was actually kind of fun; every few days we had cut the memory use down by a significant percentage. It can't go all that much lower, but as I mentioned yesterday, we can now cut down the size of the largest datafile we currently send out with each work unit. I hope the reductions in memory use are helping some of you; they already have made it possible for us to efficiently utilize Blue Gene processors--this was not possible a month ago because of the small amount of memory associated with each processor.
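
The two ideas described above, allocating mode-specific arrays only when needed and using more compact containers, can be sketched in Python. This is an illustration of the general technique only: Rosetta itself is not written in Python, and the class and attribute names here (`ScoringTables`, `docking_table`) are hypothetical.

```python
from array import array
from functools import cached_property


class ScoringTables:
    """Toy stand-in for a set of large, mode-specific lookup tables."""

    def __init__(self, size):
        self.size = size

    @cached_property
    def docking_table(self):
        # Allocated on first access only, so a run that never uses this
        # mode never pays for the memory. array("f") stores packed
        # single-precision floats instead of a list of Python objects.
        return array("f", [0.0]) * self.size


tables = ScoringTables(1_000_000)
allocated_before = "docking_table" in vars(tables)  # nothing allocated yet
table = tables.docking_table                         # first use triggers allocation
allocated_after = "docking_table" in vars(tables)
print(allocated_before, allocated_after, len(table))
```

The trade-off is a small delay on first use in exchange for a lower baseline footprint in every mode that does not touch the table.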





____________

David Baker
Message 16611 - Posted 19 May 2006 4:24:21 UTC

Good news for dial-up users--starting tomorrow, the data files sent out with each work unit will be MUCH smaller. David Kim found for the work unit

t287_HOMOLOG_ABRELAX_hom001_

the input files have dropped in size from 6.6MB to 2.3MB. I hope this helps!!
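
To get a feel for what the smaller input files mean on dial-up, here is a rough download-time estimate. The 56 kbit/s link speed is an assumption (a nominal modem rate; real throughput is usually lower), and 1MB is taken as 10^6 bytes with no protocol overhead or compression.

```python
def download_minutes(size_mb, link_kbit_per_s=56.0):
    """Idealized transfer time in minutes for a file of size_mb megabytes."""
    bits = size_mb * 1_000_000 * 8
    seconds = bits / (link_kbit_per_s * 1000)
    return seconds / 60


before = download_minutes(6.6)   # old input files
after = download_minutes(2.3)    # new input files
reduction = 1 - 2.3 / 6.6        # fractional size reduction

print(f"before: {before:.1f} min, after: {after:.1f} min, "
      f"size reduction: {100 * reduction:.0f}%")
```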


For people who would like to learn more about our research, but don't want to deal with the umm filled long video, there is an article which is basically a transcript of a talk I gave at the Royal Society in London last year at http://depts.washington.edu/bakerpg/: click on "publications" and then on "2006"; it is called "Prediction and design of macromolecular structures and interactions".

Also, Divya has fixed the silly problem with the text on the screensaver; the workunits going out tomorrow will have this fixed.
____________

David Baker
Message 16714 - Posted 20 May 2006 16:47:43 UTC

In response to some of the recent discussions on the science boards, today I'd like to tell you about how Rosetta is being used to help understand diseases caused by protein misfolding.

A significant fraction of human diseases are caused by proteins misfolding to form long "amyloid fibrils". These diseases range from Alzheimer's disease to infectious diseases from amyloid forming prion proteins. A huge breakthrough in the understanding of the process of protein misfolding to form amyloid fibrils was published in Nature last year from David Eisenberg's research group at UCLA. They reported the first high resolution structure of an amyloid forming peptide. It revealed a set of interactions which seem very likely to be general to most if not all amyloid structures.

We have been collaborating with Eisenberg's group to try to predict which portions of proteins known to form amyloid structures are responsible for amyloid fiber formation. We use the rosetta-design method to identify sequences compatible with a generalized model of their amyloid structure. You can read about the promising results of this work in the collaborative paper with Eisenberg's group that is posted on the "2006" portion of our home page publication list mentioned in my previous post. The next challenge which we are collaborating on is to design "caps" that will add on to fibers and prevent them from growing further. This is a good example of how basic research development can have applications to pressing medical problems that were entirely unanticipated.

On a different note, Rhiju asked me to request that users with computers with remaining problems sign up for ralph@home--the error rate is significantly lower on ralph, perhaps because there is a larger fraction of high end machines, and this makes it harder to track down the remaining issues on rosetta@home.
____________

David Baker
Message 16814 - Posted 22 May 2006 5:49:54 UTC

Donna from the AP sent me a draft of her article; it is great! Thanks to all who helped her with it! She says the AP reaches half a billion people each day--hard to beat that for publicity!

On the CASP front, the lowest energy structures for Target 283 are quite similar to one another, which makes us very excited as this kind of convergence in our tests the last few months has been a pretty good indicator that predictions are correct. We will submit these lowest energy structures for T283 this week, and then focus on the harder problems presented by T287 and T285, and the new targets likely to be released in the next few days.
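
The convergence signal described above can be sketched as a check that the lowest energy models all lie close to one another. The cutoffs (`k`, `max_rmsd`), the data layout, and the simple `rmsd` helper (which assumes pre-superimposed structures) are illustrative assumptions, not the criteria actually used for T283.

```python
import math
from itertools import combinations


def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) points (no alignment)."""
    total = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(total / len(a))


def is_converged(models, k=10, max_rmsd=2.0):
    """True if the k lowest-energy models are all within max_rmsd of each other."""
    lowest = sorted(models, key=lambda m: m["energy"])[:k]
    return all(
        rmsd(x["coords"], y["coords"]) <= max_rmsd
        for x, y in combinations(lowest, 2)
    )


# Toy two-atom models: three near-identical ones, then one low-energy outlier.
tight = [
    {"energy": -10.0 + i, "coords": [(0, 0, 0), (1.0 + 0.1 * i, 0, 0)]}
    for i in range(3)
]
outlier = {"energy": -20.0, "coords": [(0, 0, 0), (9, 0, 0)]}

print(is_converged(tight))              # the low-energy models agree
print(is_converged(tight + [outlier]))  # a distinct low-energy solution appears
```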


____________

David Baker
Message 16956 - Posted 24 May 2006 6:00:10 UTC

A quick update on CASP7:

Many of the targets are very closely related to proteins of already known structure; in fact I'm not sure why the experimentalists bothered to determine their structures! The search is pretty easy in this case, and we are not putting too much effort into these predictions (they are not so exciting).

There are four or five targets which do not appear to be related to any protein of known structure. For two of these we feel confident that we are zeroing in on the correct structure (of course we won't know for sure for a few months!). But target T296 released today was quite humbling--it has 445 amino acids! This is a dramatically bigger search problem than any we have done tests on, and it may be more a problem for the rosetta@home of next year than this year. But we are going to give it our best shot!

It has been wonderful to see the compute power increase over the past weeks. rosetta@home according to boincstats is now above 31Tflops. We hope this continues and we will do everything we can to make this possible. If it does continue, solving problems like T296 will move more and more into the range of possibility.
____________

David Baker
Message 17292 - Posted 29 May 2006 6:23:07 UTC

I wrote an internal benchmark for Rosetta last week, and Rom now has a version that uses this to compute credits. Rom suggests however that we wait until after CASP to deploy it because it may take a few iterations to make it acceptable to everybody. I don't know how difficult it will be to "get it right", but I'd like to start testing it on Ralph soon.

The new version soon to appear on ralph will also have a fix Rom put in for graphics problems; as reported on the boards, a good fraction of the errors seem to be associated with the graphics (I suspect the fact that they consume lots of memory is part of the problem), and in the new versions graphics related errors should abort the graphics but not disrupt completion of the Rosetta calculation.
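
The behavior described above, where a graphics error shuts off the graphics but lets the calculation finish, follows a common error-isolation pattern. The sketch below shows that pattern in Python with hypothetical function names (`run_work_unit`, `compute_step`, `draw_frame`); it is not the actual fix, which lives in the Rosetta and BOINC code.

```python
def run_work_unit(compute_step, draw_frame, n_steps):
    """Run n_steps of computation, drawing each result while graphics still work."""
    graphics_enabled = True
    results = []
    for step in range(n_steps):
        # The science calculation always runs.
        results.append(compute_step(step))
        if graphics_enabled:
            try:
                draw_frame(step, results[-1])
            except Exception as err:
                # Isolate the failure: turn graphics off, keep computing.
                graphics_enabled = False
                print(f"graphics disabled after error: {err}")
    return results, graphics_enabled


def flaky_draw(step, value):
    """Stand-in renderer that fails partway through the run."""
    if step == 2:
        raise RuntimeError("out of memory in renderer")


results, graphics_ok = run_work_unit(lambda s: s * s, flaky_draw, 5)
print(results, graphics_ok)
```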
____________

David Baker
Message 17335 - Posted 30 May 2006 4:52:53 UTC

The AP article on rosetta@home is out! See Ethan's post on the boards today. I think it turned out very well--what do you think? Let's hope lots of people see it.
____________

David Baker
Message 17497 - Posted 1 Jun 2006 5:40:43 UTC

Here is a press release from the scientific journal Nature on an article of ours that is appearing in the June 1 issue. I'll explain a bit more about the applications of this new Rosetta methodology in future posts; these jobs should start running on rosetta@home after CASP is completed in early August.



Featured press release entry:

Protein engineering: OK Computer (pp 656-659)

One of the great remaining problems in computational protein design involves the redesign of a DNA-modifying protein so that it recognizes, and alters, a new DNA sequence. For example, changing the specificity of a nuclease — a protein that cuts DNA at a specific site — could be beneficial for a range of biotechnological and medical applications.
In this week’s Nature, David Baker and colleagues have shown that it is possible to modify the sequence specificity of a ‘homing endonuclease’ called I-MsoI. They used a computational approach to screen a virtual library of mutant proteins and predicted which amino acids needed to be changed to re-engineer this enzyme so that it recognized, and cleaved, a new DNA sequence. The mutant protein was highly active and was able to cleave the new DNA sequence, but did not modify the original sequence. The authors hope to redesign this and other DNA-modifying enzymes to alter a range of DNA sequences, so that they could specifically target almost any sequence in the genome. These computationally designed proteins may be useful in a range of medical and biotechnological applications, including gene therapeutic and other targeted genomics applications.
____________

David Baker
Message 17566 - Posted 3 Jun 2006 6:06:13 UTC

CASP7 is really heating up! You can see the list of targets at

http://predictioncenter.gc.ucdavis.edu/casp7/targets/cgi/casp7-view.cgi?loc=predictioncenter.org;page=casp7/

We made submissions for the first two that were due yesterday (targets 284 and 287). If you have 287 work units still remaining on your computer you can delete them. Please keep all others running!

CASP7 is turning out to be an even more extensive test of rosetta@home than we expected! A much larger fraction of the proteins than we expected based on previous CASPs are both relatively small and completely unrelated to any protein of known structure. These targets are perfect for the methodology we have been developing at rosetta@home since last September when the project began. Things are exciting now, but imagine what it will be like in a couple of months when the true structures are released and rosetta developers, rosetta@home participants, and the whole world can see how good (hopefully!) the predictions are.

We will resume our user feedback by acknowledging on the home page the users who find the several lowest energy structures for each of the targets. (We can't show structure comparisons as on the "top predictions" page because we don't know the true structure!)

____________

David Baker
Message 18014 - Posted 7 Jun 2006 22:01:22 UTC

Welcome to all of our new participants! I have only very sporadic internet access as I'm out of town this next week, but I look forward to interacting with all of you here when I return. I was absolutely delighted to see the large increases of the last few days; they will really help accomplish the goals of the project! Thanks again to all of you, David
____________

David Baker
Message 18761 - Posted 16 Jun 2006 4:19:17 UTC

I was back at work today and Rhiju and Bin showed me the current sets of CASP7 targets they are working on. Most of them are much bigger than the proteins we tested on during the spring, and almost certainly require more computer time. So please recruit all of your friends and relations for the next month and a half--we are going to need every spare cycle!
____________

David Baker
Message 18854 - Posted 17 Jun 2006 14:52:08 UTC

Rosetta@Home is now resuming feedback on "top predictions". Every few days we will be acknowledging the person who found the lowest energy structure for one of the CASP targets that has run thus far. In most cases, we have tested several different approaches, which have different work unit names, and in these cases we will be highlighting the person who found the lowest energy structure for each approach. So keep a lookout for your name in the limelight!
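
Finding the lowest energy structure for each approach amounts to a grouped minimum over the returned results. A minimal sketch, assuming each result is a record with `approach`, `user`, and `energy` fields; the field names and the sample values are made up for illustration.

```python
def top_per_approach(results):
    """Return the lowest-energy result for each approach (work unit family)."""
    best = {}
    for result in results:
        current = best.get(result["approach"])
        if current is None or result["energy"] < current["energy"]:
            best[result["approach"]] = result
    return best


# Hypothetical returned results; lower energy is better.
results = [
    {"approach": "abrelax", "user": "alice", "energy": -210.4},
    {"approach": "abrelax", "user": "bob", "energy": -215.9},
    {"approach": "homolog", "user": "carol", "energy": -198.2},
    {"approach": "homolog", "user": "alice", "energy": -190.0},
]

winners = top_per_approach(results)
for approach, result in winners.items():
    print(f"{approach}: {result['user']} ({result['energy']})")
```

Because only the single best energy per approach matters, a participant with one machine can still land the winning entry, which is the lottery-like property described in this post.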

I think this is a nice addition to the credit system for following contributions as anybody can win, big or small; like a lottery if you buy more tickets you have a better chance, but the small guy can still have the magic entry.

Are there other suggestions for feedback we could give? Certificates, etc. we could think about if people would like this, but we would certainly need this to be at least in part handled by a volunteer group as we are swamped with CASP.


____________

David Baker
Message 19061 - Posted 21 Jun 2006 14:40:59 UTC

CASP targets are continuing to come in, and we have more to do than ever--our "CASP control room" with Rhiju, Bin and others furiously going through targets, pieces of paper with various information on each of the targets floating around, etc. is quite a sight!

The structures of a few of the targets have been published now, but these are all in the "comparative modeling" category where copying a known structure gives a good solution already (we are trying to refine these starting models using the high resolution part of the protocol running on rosetta@home). Calculations for these proteins didn't use rosetta@home as they are less time consuming. Our results are good compared to the automatic servers, but we won't know how they stack up compared to other participants until the meeting in November.

We did get some exciting news yesterday from an analogous prediction experiment/competition called CAPRI on protein-protein docking. For this problem, which consists of finding the lowest energy docked arrangement of two protein structures given the coordinates of the isolated proteins, our approach is very similar to that running on rosetta@home--there is an initial low resolution search followed by full atom refinement. Chu Wang, a graduate student in the group, made predictions for the most recent round of CAPRI, and they turn out to be the best made by any group:

http://capri.ebi.ac.uk/round10/R10_T26/ (scroll down to "medium predictions"; we are group 80).


Finally, to answer a question on the discussion boards, many proteins consist of multiple independently folded "domains". In many cases, it is possible to recognize from the amino acid sequence roughly where the boundaries between the domains are, and in these cases we carry out folding calculations separately on each domain. This in the end produces models for different parts of an amino acid sequence, and we then need to assemble these into one coherent structure. For this we use a protocol again very similar to what you have been running, except that the only variation allowed is in the linker between the domains, typically around 10 residues, while the intradomain structure is kept fixed (this is quite analogous to the docking problem I mentioned above).




____________

David Baker
Message 19416 - Posted 28 Jun 2006 14:12:31 UTC

CASP targets are continuing to come in and we have our hands totally full. We have less than a day for each target. This compares with the well over a year it takes to solve a structure using X-ray crystallography or NMR, often with considerable application of human intuition. Perhaps predictions would be closer in quality to experimentally determined structures if there was less of a difference in time investment!

This question came up on the number crunching boards:

With the new methodologies being developed, will there be a point at which we go beyond the needle-in-the-haystack decoys and start clustering around the actual structures of proteins?

Answer: Correct models will always be a very small fraction of the structures generated just because there are so many alternative conformations for a protein chain. But to have confidence in a prediction, there must be convergence of the lowest energy conformations on a single structure. As our methods improve and sampling (CPU power) increases, correct models will remain "needle in a haystack" in the overall population, but dominant in the population of lowest energy models.



And is this the goal before (from what I understand) the project moves into the design/docking phase?

Answer: No, while this is the solution to the structure prediction problem, it is not necessary for successful design and docking (certainly, though, more accurate prediction methods would impact both areas). We have had considerable success with both design and docking already. After CASP we will start running both docking and design calculations on rosetta@home, as well as continuing to improve our structure prediction methods.
____________

David Baker
Message 19455 - Posted 29 Jun 2006 6:51:38 UTC

Today I met with the people who design the science curriculum for Seattle Public School middle and high schools to discuss incorporating rosetta@home into middle and high school science classes. I think that participating in a real research project could be more inspiring than just learning a set of facts; I certainly never found science classes very fun or interesting--the exciting part is discovering new things more than learning about discoveries made long ago. Anyway, they were very interested and we should have some pilot projects in schools this fall.

These message boards were what gave me the idea for this--it has been really fun and rewarding to try to explain our research and answer all of your questions. As part of making the project more educational, we are working, with help from a Microsoft expert, to increase the amount of feedback participants can get on the results their computer produces. Hopefully you will see this here in not too long.
____________

David Baker
Message 19869 - Posted 7 Jul 2006 7:06:04 UTC

David Kim just put a link on the home page to an article just out in "The Scientist" which describes my group's research work and features rosetta@home. If you are interested in what is coming down the road for rosetta@home you might take a look at it.

The last CASP targets are going to be released in a couple of weeks; it has been so much work that we are all ready for CASP to be over so we can start pursuing the new ideas that have come up as we work on these concrete problems. Also, of course, we are very eager to see the actual structures, and learn what we need to work on most to improve Rosetta.
____________

David Baker
Message 20187 - Posted 14 Jul 2006 15:45:46 UTC

We desperately need as much CPU power as possible for the next two weeks--there are more than 25 CASP targets due, including some that are our best shots at really high resolution models. Frustratingly, we won't be able to do anywhere near as much sampling as we had planned for these proteins as there are so many coming due near the same time, and thus can't really expect the accuracy we had hoped for. So if it is at all possible for you to increase your rosetta@home CPU time for the next two weeks please do--it will make a huge difference for our collective efforts!
____________

David Baker
Message 20573 - Posted 19 Jul 2006 4:57:38 UTC

Several things tonight:

(1) A reporter is doing a story on rosetta@home participants. If you are interested, please respond to her at Message boards : Cafe Rosetta : Reporter ISO of Interviewees.

(2) The first actual structure for an ab initio CASP7 target was released today. Our top prediction is very close, but not perfect. The error is a shift in register of the last beta strand. This is a problem that we saw in a number of cases in the tests we ran in the spring, and will be high up on the list of methods improvements to be tackled in August when CASP is over.

(3) David Kim has put together instructions on how to save and view the predictions your computer is making. I think that many of you will find this very interesting--give it a try!

(4) Thank you all for your response to our plea for more computing power--I think we are seeing an increase even in the face of the summer heat.

(5) Please contact Jose on the message boards if you would like to know what is being done about high credit claims.

____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 21367 - Posted 29 Jul 2006 5:40:58 UTC

I just returned from our annual Rosetta developers meeting. It was a tremendous success! There were 70 people attending from all over the country and some came all the way from Europe to the conference center in the Washington Cascades.

We discussed improvements in the basic methodology in Rosetta made in the different research groups, and some of the exciting scientific advances as well. It was a terrific opportunity to get caught up and to learn about all the new capabilities created by the extended Rosetta developers community. We also discussed how to continue to keep the program intact and cohesive with all of the changes being made in so many different places all of the time (those of you who are programmers will certainly appreciate this challenge).

The meeting was also a great opportunity for beginning students in the different research groups from different institutions to meet each other and the people who have been working on developing rosetta for several years. And for those of us who have been around for a bit longer, it was a great opportunity to see old friends!

My only disappointment was that I had to skip the traditional hike/climb following the meeting because of a knee still not fully recovered from a previous trip.
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 21794 - Posted 4 Aug 2006 5:58:42 UTC

Our Gates foundation grant to develop HIV vaccines started earlier this week (Aug 1), and it has been an exciting time as we are finally able to pursue this goal actively rather than just thinking about it! We have already ordered reagents to put together the first five designs, and there are many more in the pipeline. After this first round of designs is complete, and the designs are close to being sent to our collaborators to be tested as possible vaccines, a high priority will be to extend rosetta@home to design calculations (which shouldn't be difficult, as the same underlying rosetta source code is used) so that all of you can contribute to the second round of designs.

This morning I had to wake up early to go to a BBC radio interview which I think several of you participated in as well. The more media attention, the better for the project, but I'm really not very good at this kind of thing...

Meanwhile, CASP7 has only a few days more to go and we are waiting for the solved structures to be released so we can see how well we did. So far only a few novel structures have been released; our predictions for these are good, but not at atomic resolution, probably because the proteins are longer than those in our pre-CASP tests and the search problem is correspondingly more difficult.
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 22438 - Posted 14 Aug 2006 6:22:50 UTC

You are currently carrying out calculations for the CASPR structure refinement challenge. In this case, rather than having to predict the structure of a protein from the sequence information alone, we are given both the sequence of the protein and a starting model which is not too far away. The challenge is to refine the structure to bring it closer to the actual true structure. In the landscape searching analogy, this corresponds to being told the valley the lowest elevation point lies in, but not the exact location of this point.

The refinement problem is similar in concept to the second "high resolution" stage of our standard prediction protocol. Bin uses essentially the same code to carry out these refinement calculations, except that some portions of the starting model are randomly rebuilt at a low frequency to allow a broader search around the starting model. As you have undoubtedly noticed, there is relatively little movement during the high resolution refinement protocol, and this random rebuilding broadens the search considerably.
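
As a rough illustration of the idea of mixing small refinement moves with occasional random rebuilding, here is a toy sketch in Python. It is not Rosetta code: the energy function, the step sizes, and the names `toy_energy`, `refine`, and `rebuild_prob` are all assumptions made up for this example, and a "model" is just a short list of angles.

```python
import math
import random

def toy_energy(angles):
    """Hypothetical energy with several local minima per angle; lowest at 0."""
    return sum(0.1 * a * a + 1.0 - math.cos(3.0 * a) for a in angles)

def refine(start, steps, rebuild_prob, rng):
    """Downhill search with small nudges and rare from-scratch rebuilds."""
    current = list(start)
    energy = toy_energy(current)
    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(len(trial))
        if rng.random() < rebuild_prob:
            # Rare large move: rebuild one position from scratch.
            trial[i] = rng.uniform(-math.pi, math.pi)
        else:
            # Usual small move: nudge one position slightly.
            trial[i] += rng.uniform(-0.05, 0.05)
        trial_energy = toy_energy(trial)
        # Keep the trial only if it lowers the energy.
        if trial_energy < energy:
            current, energy = trial, trial_energy
    return current, energy

rng = random.Random(1)
start = [2.0, -2.0, 2.0]
refined, refined_energy = refine(start, steps=2000, rebuild_prob=0.05, rng=rng)
```

With `rebuild_prob=0.0` the small nudges alone cannot leave the valley the starting model sits in, which mirrors the "relatively little movement" noted above; the occasional rebuild is what lets the toy search reach other valleys.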
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 23476 - Posted 19 Aug 2006 16:39:10 UTC

It is an exciting time in the lab now as we are recovering from the craziness of CASP. While Bin and Rhiju are taking an incredibly well deserved vacation, the new HIV vaccine design project is starting to come into full swing. We now have computationally designed amino acid sequences for 15 potential vaccine candidates, and we will start the process of making them next Tuesday; the first step is to synthesize genes which encode these proteins. We have also designed a whole series of novel enzymes which catalyze a wide variety of reactions, and are starting the gene synthesis process for these as well.
I'm particularly interested now in designing enzymes which destroy organophosphate compounds, which are the key ingredients in many pesticides and nerve agents. On rosetta@home, we are carrying out calculations in which we are resampling regions of the landscape found to be low energy in initial sets of runs, and we hope these will lead to significant improvements in our abilities to find global minima.

I'm very sorry about some of the not nice things being passed about on the message boards, and I'm also sorry that my efforts to calm things down haven't helped, so I will be doing my communicating with the project solely through this thread for the next week. You are all making great contributions, and I ask people who have been annoyed by what has been said on one side or the other to try to think about the big picture and what we are all trying to accomplish together.
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 23501 - Posted 19 Aug 2006 17:20:40 UTC

A couple more things:

First, I will be describing all of your efforts at a conference on Monday on grid computing:
http://www.opensciencegrid.org/events/meetings/consmeeting0806/agenda.html. (Should I show some excerpts from the message boards?) I will of course describe the great work you have all done together.

Second, to answer a question which came up on the boards--we will NOT be backdating credit totals. The new system will go into place early next week, adding on to the current totals.

Again, I thank you all for your efforts and the many people who have volunteered to help recently!
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 23742 - Posted 20 Aug 2006 15:32:23 UTC
Last modified: 20 Aug 2006 15:33:52 UTC

A BBC report on distributed computing is at http://www.bbc.co.uk/radio4/science/citizenscience.shtml
(I was interviewed, but haven't listened to it).

Longer segments on Rosetta@home were made recently by media groups in the US; I'll pass along the links as I get them.

I'm now preparing my talk for the grid computing meeting on rosetta@home; after I discuss the science and the results, I'm going to pick a few message boards threads tomorrow morning to illustrate the issues that come up--any suggestions on which ones to show? (this could be a good time to clean up recent posts)
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 24241 - Posted 22 Aug 2006 6:51:54 UTC

I showed a bit of the rosetta@home web site, including the message boards, at the grid computing meeting this morning--I think this is the part of my talk that people understood the best. If people are interested all the talks from the meeting will eventually be posted on the web.

I'm working tonight on a manuscript with my former graduate student Rich Bonneau on some of the results from HPF1 done on the world community grid. We predicted structures for all the proteins in one of the best studied eukaryotic organisms--the yeast used to make bread and beer, and then integrated these predictions with other experimental data to assign 500 proteins of previously unknown structure to protein structural families. After this is done, we will start working on the report on the structures of human proteins also done in HPF1. These efforts used the low resolution version of rosetta (which is all we had several years ago when the HPF project started); I am of course excited about HPF2 which is using the protocol we have been improving on rosetta@home (I sent Rich and the collaborators at IBM the code last March) and should produce much more accurate models.

On the credits front: we have decided to use the average amount of time for producing a structure over all rosetta@home runs for a particular work unit to determine the amount of credit to be awarded for each structure produced for that work unit. So, for example, let us suppose that all rosetta@home computers on average took 1 hour to make 1 structure for a given work unit, and that this corresponds on average to 10 credits using the standard BOINC accounting scheme. Then each computer gets 10 credits for each structure returned--a fast computer might be able to do 3 structures in an hour, and get 30 credits per hour, whereas my old slow laptop may require 2 hours to make a single structure so I would only get 5 credits per hour. I think everybody will be happy with this approach in the end, even though nobody may be very happy with it initially (I must emphasize that, contrary to some statements on the boards, no individuals or groups had any more influence on our strategy than any other, so I hope this issue can be laid to rest). I am sorry that the switch to the new system has generated so much conflict; I really didn't anticipate this, and I'm sorry that my attempts to calm things down only made things worse (again, I will post from now on only in this thread and only in this forum). In any event, we are quite set on the new credits plan, which we think will be better for everybody, so please hold off on comments or suggestions for two weeks or so until we all have a clear picture of how things are working--we do not need or want new suggestions at this point. David Kim will be posting the definitive description of the new system tomorrow on the boards.
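
The arithmetic in the example above can be written out as a short Python sketch. This only restates the numbers in the post; the function name `credits_per_hour` and the constant `CREDIT_PER_STRUCTURE` are made up for illustration and are not part of BOINC or of the actual credit system, whose definitive description is the one David Kim will post.

```python
def credits_per_hour(credit_per_structure: float, structures_per_hour: float) -> float:
    """Credit rate for a host returning `structures_per_hour` structures each hour."""
    if structures_per_hour < 0:
        raise ValueError("structures_per_hour must not be negative")
    return credit_per_structure * structures_per_hour

# From the example: the average host needs 1 hour per structure, which
# corresponds to 10 credits, so every returned structure is worth 10 credits.
CREDIT_PER_STRUCTURE = 10.0

fast_rate = credits_per_hour(CREDIT_PER_STRUCTURE, 3.0)  # 3 structures per hour
slow_rate = credits_per_hour(CREDIT_PER_STRUCTURE, 0.5)  # 1 structure every 2 hours
```

The point of the scheme is that the credit per structure is fixed by the project-wide average for the work unit, so a host's hourly rate depends only on how many structures it actually returns.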

A positive outcome of the recent animated discussions is that we have recruited a number of new moderator volunteers who will be introduced soon and who will be ensuring that the boards stay friendly (please help the project out by avoiding posting things likely to offend other participants!). Also, as I'm pretty much tapped out on corresponding with participants by email (which could easily turn into a full time occupation!), after tonight if you have issues please contact the moderators via the boards or email and they will pass unresolved issues on to me.

OK--that's enough of this topic, my posts will be back on the new science discoveries track starting tomorrow!
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 24546 - Posted 24 Aug 2006 6:26:58 UTC

Today was an exciting day for the group! In our vaccine design and enzyme design calculations, the end result is an amino acid sequence for a protein predicted to be a good vaccine or catalyst of a chemical reaction. The next step is to make a gene--a piece of DNA--that codes for the amino acid sequence. Due to advances in technology, rather than having to laboriously synthesize each gene in the lab, we can buy genes for any amino acid sequence for not too much from DNA synthesis companies, and we are lucky to be collaborating with a startup company in Boston called Codon who can make them for us quite cheaply. Today we ordered genes for 16 potential HIV vaccines, 15 potential new enzymes, and 4 potential new protein-protein complexes. I say potential above because our design calculations are not perfect, and we won't really know if these proteins act as designed until after we get the genes back in a month or so. Then we take advantage of modern molecular biology techniques to put the genes into bacteria where they direct the cells to make large amounts of the designed proteins. We can then separate the designed proteins from the rest of the stuff in the bacteria using a special tag we include in each of them that provides a good handle. Once we have the purified designed proteins, we can see whether they bind the desired antibodies in the case of the vaccine designs or catalyze the desired reactions in the case of the enzymes. In this way, we will learn about both the strengths and weaknesses of the rosetta design methodology, and hopefully have created proteins that can have a very positive effect on the world!

As Hugo pointed out, we have not quite gotten the design methodology to the point we can run it on rosetta@home, but this should be coming in not too long as several people in my group are now focusing on this. Before this, look for protein-protein docking calculations where we are trying to predict the structures of the complexes between proteins which mediate many of the basic processes important to life. Chu Wang, a graduate student in the group, is close to having his docking methodology compatible with distributed computing, and we anticipate breakthroughs in this important problem as it also seems largely limited by CPU power.

Currently running on rosetta@home are the last of the casp tests on protein structure refinement (see the casp7 website) and tests of a general approach for estimating how much compute power is necessary to find the lowest energy structure for a sequence. I will describe the basic idea behind this approach in one of my next posts.


____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 24951 - Posted 26 Aug 2006 8:14:25 UTC

Tonight I want to describe the approach we are taking, in collaboration with several other research groups, to trying to cure human diseases caused by mutations in critical genes in cell populations that are capable of self renewal. An example of such a disease is severe combined immunodeficiency, which is described at http://www.scid.net/. The idea is that if we could correct a crippling disease causing mutation in, for example, the blood cell population or the immune cell population in even a very small number of cells, then these now normal cells could divide and eventually repopulate the body with healthy normal blood or immune system cells.

To accomplish this, we would need to target specific DNA sequences around the site of the disease causing mutations in the critical genes. We are doing this using the computational design methodology we described in the Nature paper earlier this summer that I mentioned in an earlier post. We are now designing enzymes to cut precisely in the genes responsible for SCID and other diseases. When we have succeeded in creating enzymes that cut specifically within these genes, and not in other parts of the genome, our collaborators will introduce these designed enzymes into mutant cells with a copy of the normal gene that doesn't have the mutation. Cells repair breaks in their DNA by copying from identical or near identical sequences elsewhere in the genome, and it is likely that this introduced DNA would be used to repair the break, in which case the mutation would be corrected. Of course, it would still be a very long road before such an approach could be used clinically, but it is an exciting road to be getting started on!

I'll be out of internet access for the next 9 days, so keep up the great work! Just before I left today, Ben, a graduate student in the group, showed me some very exciting preliminary results from the rosetta@home jobs you have been doing; he can find very low energy low rmsd structures much more efficiently using information gleaned from a first round of searching than with our standard random search protocol. This was for one test protein; by the time I return you will have completed calculations using his new approach for a number of test proteins and we are excited to see the results!
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 26680 - Posted 13 Sep 2006 5:02:01 UTC

The solutions for many of the CASP7 prediction problems many of you worked on May-July have now been released and we are currently comparing them to the predictions we submitted earlier to the CASP organizers. Rhiju has just posted on the "top predictions" page a side by side comparison of the predicted structure for one of these proteins, target 299, with the actual structure. As you will see, it is quite close despite the considerable complexity of the structure. This was a very interesting target because Rhiju employed the full spectrum of approaches we have been developing en route to making, with all of your help, this quite excellent prediction.


____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 27512 - Posted 19 Sep 2006 5:33:55 UTC
Last modified: 20 Sep 2006 5:10:46 UTC

I am delighted to thank the four new moderators who are taking care of the rosetta@home message boards--they are doing a great job! I asked them to write a few sentences about themselves, and I'm posting these below. Again, I deeply appreciate their efforts on behalf of the project!



Mod.DE
I participated in some DC projects a few years ago (i.e. RC-64) but stopped.
February this year I found BOINC and later Rosetta where I’m feeling quite
happy at the moment. I’m kind of a computer geek but refuse to work in that
area (in order not to become a total geek) and try to finish my diploma
thesis in economics instead. I have a faible for all kind of social
interaction over the internet currently called "Web 2.0" but on the same
time think I should spend less time before my computer.

Mod.Canada
Some of the things that I am into are hunting, fishing and camping. I enjoy the "outdoors thing". I also like to tinker with DIY Audio circuits and hifi sound, along with computer mods of all kinds such as case mods. I do lots of overclocking, and have a real new found love of Computer language studies mainly being C++, and a little Perl.

I have also been battling Crohn's Disease for 6 years and have had several operations along the way. I have also come down with another life long disease called Avascular Necrosis which was caused by the large amount of steroids used to treat the Crohns disease. It is mainly looked at as a loss of blood flow to the bones, which in turn causes them to die over time.

Mod.Sense
I'm an armchair scientist that came project hopping, and found the insight into science and the high level of transparency to the research team to be compelling reasons to stay and crunch Rosetta full-time. I've stopped looking for another interesting project, and started looking for ways to help the Rosetta team achieve its goal of developing the science that will reinvent medicine, as we know it.

Mod. Tymbrimi
I've spent most of my DC time these last 4 years on F@H, but have joined teammates on other projects like D2oL and FaD. After Dr. David Baker posted an invitation for FaD participants, many of the team moved here.

I'm a fan of another infamous David. Mr. Brin. Even if he does talk about brushing mule's teeth.

It's still amazing to have started with the original release of programs with a sew on patch stating, "Boldly going where Angels fear to tread." in the 1980s, to actually helping perform scientific research with our spare cpu cycles.
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 27515 - Posted 19 Sep 2006 5:42:47 UTC

We are just starting to model the structure of the fibrils which accumulate in Alzheimer's disease. Phil Bradley here has developed really exciting new methods which allow him to model both the folding of the protein and the assembly into fibrils simultaneously. We are collaborating with people who are making experimental measurements on the fibers, which provide constraints which will be very valuable for evaluating our models. I will keep you posted on these calculations, and hopefully you will see them running on your screen savers in not too long. (In addition to its medical importance, the simultaneous folding and spiral fiber formation is really neat to watch!).
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 27651 - Posted 20 Sep 2006 5:18:24 UTC

Our first vaccine design calculations have just been sent out on rosetta@home! We are taking a number of different approaches to vaccine design; in the calculations that are just being sent out we are aiming to stabilize a portion of the HIV surface protein, GP120, in its active conformation so it can be used as a vaccine. We are keeping the portion of the protein that interacts with a cell surface receptor called CD4 constant, but redesigning the remainder to increase its stability. The lowest energy designs that are returned from all of your runs will be analyzed here, and the most promising of these sent to our collaborators at the NIH for testing as possible vaccines.
This first batch of work units is relatively small--look for work units beginning "PSH"; there will be a much larger batch next week and we will give them more informative names so you can tell what you are looking at.
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 28502 - Posted 25 Sep 2006 14:49:11 UTC

Bin has just posted on the "top predictions" page a quite stunning example of success with the "high resolution refinement" method we have been developing. For this CASP target, there was a structure of an evolutionarily related protein that was already known. In such cases, the structure prediction problem is called "comparative modeling" and the challenge is to start with the evolutionarily related structure and refine it towards the true structure (the basis of comparative modeling is the empirical observation that evolutionarily related proteins almost always have similar (but not identical) structures).

For this prediction, Bin started with the red model, and used the high resolution refinement protocol from the second step in the ab initio prediction protocol (the part which is less exciting to watch because most of the action is with the sidechains which we don't display on the screensaver). He used your computers to carry out very large numbers of independent refinement runs, and chose the lowest energy structures to submit to CASP. As you can see, his submission, the green model, is very much closer to the true structure (in blue) than the starting red model.
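
The "many independent runs, keep the lowest energy" strategy described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the quadratic energy, the run length, and the names `one_refinement_run` and `lowest_energy_models` are invented for illustration, with each seed standing in for one volunteer computer's run.

```python
import random

def one_refinement_run(start, seed, steps=200):
    """One independent toy run: random nudges, keeping only energy-lowering ones."""
    rng = random.Random(seed)
    coords = list(start)
    energy = sum(x * x for x in coords)  # hypothetical energy, lowest at the origin
    for _ in range(steps):
        i = rng.randrange(len(coords))
        trial = list(coords)
        trial[i] += rng.uniform(-0.1, 0.1)
        trial_energy = sum(x * x for x in trial)
        if trial_energy < energy:
            coords, energy = trial, trial_energy
    return energy, seed, coords

def lowest_energy_models(start, n_runs, keep):
    """Run `n_runs` independent refinements and return the `keep` lowest-energy results."""
    results = [one_refinement_run(start, seed) for seed in range(n_runs)]
    # Tuples sort by energy first; the unique seed breaks any ties.
    return sorted(results)[:keep]

start_model = [1.0, -2.0, 0.5]
best = lowest_energy_models(start_model, n_runs=20, keep=5)
```

Because the runs share nothing but the starting model, they can be handed out to different machines in any order, which is what makes this kind of search a natural fit for distributed computing.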

These changes may seem subtle, but from the point of view of designing drugs to interact specifically with a protein structure, and understanding precisely how a protein machine works, this level of accuracy really is critical. Needless to say, we would be delighted if we could consistently refine models to this level of accuracy!
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 28602 - Posted 28 Sep 2006 5:29:05 UTC

Laura has just posted a new version of her rosetta@home video--see her post in the rosetta video thread below. If you have a chance, take a look and post in her thread any suggestions you have for improving it.
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 28776 - Posted 30 Sep 2006 22:31:24 UTC
Last modified: 30 Sep 2006 23:47:54 UTC

Graduate student Chu Wang has added sidechains to the screensaver, so you will soon be able to see them flickering around during the fullatom refinement stage of the calculations. Chu has also made his flexible backbone protein-protein docking approach compatible with BOINC, so you will soon be seeing pairs of proteins searching out the lowest energy docked conformation on your screen savers. This is a very important problem as much of biology depends on precise and specific interactions between pairs of proteins.


____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 28806 - Posted 2 Oct 2006 5:30:03 UTC

Rhiju has just posted another ab initio prediction made for CASP7 using all of your computers. This is an all beta sheet protein; such proteins have traditionally been the most difficult to predict because the interactions involve residues separated by long distances along the chain. If you compare the predicted structure for this protein, CASP7 target t0316, to the recently released native structure, you will see they are very similar. The superposition with the sidechains of both the predicted and native structure shows this particularly clearly. Thanks to bwpow and JVMerlino for finding these excellent predictions!
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 29002 - Posted 7 Oct 2006 5:31:25 UTC
Last modified: 7 Oct 2006 5:35:24 UTC

David Kim and Stuart Ozer from Microsoft Research have been working hard for the past two months to help give participants more feedback on their results. The results of their efforts thus far are now available--as David describes below, you can now see both the overall results on each work unit, and the contributions you individually, or your team, have made. David has started a thread in the Science section which addresses questions which are coming up--we are delighted at the positive response so far!


from David's post:
You can now view energy vs rmsd plots for active work units. To view your results, click on the "Results" link under "Returning participants" on the home page. To view results from the top users, hosts, and teams, click on the "Rank" numbers on the respective leader lists. The data gets updated daily.

We'd like to thank Stuart Ozer from Microsoft Research for helping us develop this feature.

____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 29373 - Posted 15 Oct 2006 5:35:33 UTC

Lots of exciting stuff these days!

Rhiju just posted a spectacular CASP7 prediction--see the top predictions page. Thanks to the contributors!

Oct 14, 2006 - CASP7 target T0283
sarha1 (Team Czech National Team)
XS_DDTUNG (Team XtremeSystems)
cupra (Team XtremeSystems)
rtroll
XS_vapb400 (Team XtremeSystems)

I hope you have all had a chance to look at the results your computer has been generating (accessible from the "results" bar in the "returning participants" section of the page). We really like this reporting tool, and are starting to use it for many of our analyses.

You will have seen by now on your screensavers the addition of the protein sidechains during the "fullatom relax" stage of the simulations. This gives you a more complete picture of the "3 dimensional jigsaw" nature of the protein folding problem, where the challenge can be viewed as getting all the pieces of the puzzle to fit together perfectly with no holes.

Our dream now is to make rosetta@home interactive, so you can move the chain around if you see a possible way to solve the puzzle. We are talking with colleagues in the CS department here who are experts on video games about how to approach this. Eventually you could imagine designing proteins to cure diseases for fun and relaxation--I think it is possible that it could be made as engaging as a standard computer game. (We haven't thought about the ramifications for credits, but if you can guide the simulation yourself, you should get a higher score for finding lower energy solutions... but we won't have to cross this bridge for quite a while!)
____________

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 696
ID: 122
Credit: 559,847
RAC: 0
Message 29560 - Posted 18 Oct 2006 6:31:47 UTC

Welcome to the new participants--we are delighted at the increase in users over the past few days! Please tell all your friends and relations!

I posted a brief explanation of the "top predictor" award we post every day on the home page in answer to a question on the number crunching boards; I'm copying the question and my response here for people who might not see it there:

question

"I was absolutely thrilled to see that my team has helped the Rosetta@home project enough to be chosen as the predictor of the day. I was hoping for this for a few months now as our output has gradually increased. This is very exciting!

I am just wondering.. how does the predictor of the day relate to the top predictions, as seen on http://boinc.bakerlab.org/rosetta/rah_top_predictions.php?

Also, where can I see the actual structure which has the top prediction? It is no longer listed on http://boinc.bakerlab.org/rah_results.php?TeamID=1291.

Thanks and best wishes to all! "

my response:

Thanks and congratulations!! As Feet1st explained, you found the lowest energy structure for the indicated CASP7 work unit for CASP target 354. During CASP we experimented with a number of different strategies for each target; the different work units for the same target protein represent the different strategies. The participants acknowledged in the "top predictions" section are the people who found the lowest energy structures over all the different strategies (I can't tell whether your model was one of these because we haven't put the data together yet).

Your result, like those of the other "top predictors", was important because it gave us important feedback on the strategy being tested. Target 354, I think, was one where the predictions are really good--your model may have been one of the top 5 models submitted!

To view the predictions that you have recently made, you can follow the directions on the home page. To view the distributions of energies and rmsds for your predictions and those of your team, follow the "view your results" link from the home page.

We don't currently have a mechanism for you to retrieve the structures for predictions you made in the past because of the large amount of data we would have to have accessible, but if there is interest we might be able to do something along these lines.


____________

David Baker
Message 29620 - Posted 19 Oct 2006 5:25:46 UTC

Bin has posted in the top predictions section the rosetta@home-based blind prediction for the CASP7 target 330 structure. This is an impressive illustration of the all-atom refinement procedure we have developed to improve low resolution models. In this case, starting from a rough model based on a protein with a related sequence and known structure, shown in red, Bin carried out large numbers of independent refinement runs with all of your help. The lowest energy structure he found is shown in green, and you can see it has moved very much closer to the true structure (blue), which of course we did not know when the predictions were made.

Thanks to the following users who contributed particularly low energy models for this target:
Marko (Team Serbia - The Wild Bunch)
WindForce (Team XtremeSystems)
Ian_D
pxee (Team Poland Null-Zero Team)
csbyrosetta (Team SETI.Germany)
____________

David Baker
Message 30411 - Posted 1 Nov 2006 5:11:47 UTC

We are now actively considering several different approaches for building interactivity into rosetta@home, which I think will be tremendously exciting. In the simplest scenario, you will be able to propose moves for the computer to try out, for example rotating around a bond you select. In the current version, running on your computers, the computer selects moves at random, and then accepts those moves which reduce the energy. In the interactive version, whenever you have an idea, the computer will try out your move instead of picking randomly. This will let those of you who are interested become actively involved in the searches, which should be more fun than watching the screensaver, and I'll bet there are some protein folding geniuses out there who will be able to work wonders! We are also thinking of showing the 5-10 best models found thus far, and the names of the finders, in a panel on the screen, so that if you want you can start from one of them and try to improve it.
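
As a rough illustration of the search loop described above, here is a toy sketch in Python. It is not the Rosetta code: the energy function is a made-up stand-in, and the names `energy`, `random_move`, `search` and `proposals` are all hypothetical. It only shows the idea that a move is tried, kept if it lowers the energy, and that a move proposed by a person can be tried in place of a random one.

```python
import math
import random


def energy(angles):
    # Toy energy (an assumption for illustration only): each bond angle
    # is happiest at 1.0 radian. Rosetta's real energy function is far richer.
    return sum(1.0 - math.cos(a - 1.0) for a in angles)


def random_move(angles, rng, step=0.5):
    # Pick one bond at random and rotate it by a small random amount.
    i = rng.randrange(len(angles))
    return i, angles[i] + rng.uniform(-step, step)


def search(angles, n_steps, rng, proposals=None):
    """Try one move per step; keep it only if the energy goes down.

    `proposals` is a list of (bond_index, new_angle) moves suggested by a
    person. While any remain, they are tried instead of random moves.
    """
    angles = list(angles)
    queue = list(proposals or [])
    current = energy(angles)
    for _ in range(n_steps):
        if queue:
            i, new_angle = queue.pop(0)
        else:
            i, new_angle = random_move(angles, rng)
        old_angle = angles[i]
        angles[i] = new_angle
        trial = energy(angles)
        if trial < current:
            current = trial          # accept: the move reduced the energy
        else:
            angles[i] = old_angle    # reject: undo the move
    return angles, current


if __name__ == "__main__":
    start = [0.0] * 5
    final, e = search(start, 200, random.Random(0), proposals=[(0, 1.0)])
    print(energy(start), e)
```

In this sketch a proposed move gets no special treatment beyond being tried first; it is accepted or rejected by the same energy test as a random move.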
____________

David Baker
Message 30974 - Posted 12 Nov 2006 2:33:36 UTC

Rhiju has posted the top prediction we made for CASP7 target 354 on the top predictions page. This is one of the best ab initio structure predictions we made in the CASP7 experiment. Take a look--this is the level of accuracy we would like to be able to achieve consistently starting from amino acid sequences alone. We have been working hard making improvements to the search strategy over the last two months, and the next code update you receive (in the next few days) will have a number of improvements that I am optimistic will take us closer to the goal of consistent high accuracy structure prediction.

Also in this next code release is exciting new methodology created by Phil Bradley for modeling the amyloid fibrils associated with many human diseases. Next week Phil will be sending out work units which generate models for the amyloid fibrils associated with Alzheimer's disease. This is not only an extremely important problem from the human health perspective, but is also an interesting challenge for structure prediction--the length of the protein that forms the fibrils associated with the disease is only 41 amino acids, but instead of folding into a single independent structure on its own, it associates with many other copies of itself to form extended helical fibers. Phil has built into the Rosetta code a general treatment of symmetry which allows him to model the folding and simultaneous association of many copies of the 41 residue protein--the constraint that all copies are identical makes it possible to model these many copies without significantly increasing the size of the search space. You will be able to contribute to and observe the symmetric folding and association of the subunits of the Alzheimer's fibril very soon.
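
To see why the identical-copies constraint keeps the search space small, consider this toy sketch in Python. It is an illustration under stated assumptions, not Phil's implementation: the function `helical_copies` and the parameters `rise` and `twist_deg` are hypothetical, and the fibril is idealized as a perfect helix around the z axis. Every copy is generated from the same subunit coordinates, so the free variables are those of one subunit plus two helix parameters, however many copies are built.

```python
import math


def helical_copies(subunit, n_copies, rise, twist_deg):
    """Build n_copies of one subunit by repeating a screw operation about z.

    subunit   -- list of (x, y, z) atom coordinates for a single copy
    rise      -- translation along z per copy (hypothetical parameter)
    twist_deg -- rotation about z per copy, in degrees (hypothetical parameter)
    """
    copies = []
    for k in range(n_copies):
        theta = math.radians(k * twist_deg)
        c, s = math.cos(theta), math.sin(theta)
        copies.append(
            [(c * x - s * y, s * x + c * y, z + k * rise) for x, y, z in subunit]
        )
    return copies


def free_parameters(subunit):
    # Three coordinates per atom, plus rise and twist. Note that the number
    # of copies does not appear here.
    return 3 * len(subunit) + 2


if __name__ == "__main__":
    subunit = [(1.0, 0.0, 0.0), (2.0, 0.0, 1.0)]
    fibril = helical_copies(subunit, 4, rise=4.8, twist_deg=90.0)
    print(len(fibril), free_parameters(subunit))
```

Changing one atom in `subunit` changes it in every copy at once, which is the sense in which many copies can be modeled without a matching growth in the number of things to search over.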
____________

David Baker
Message 31497 - Posted 21 Nov 2006 6:56:45 UTC

We are now preparing for the CASP7 meeting next week. Bin and Rhiju will be giving talks describing the work all of you did on both ab initio structure prediction and comparative modeling, and I'm going to talk about high resolution refinement a bit more generally. I will give you a full report on the meeting when we return.

Last week I was at a meeting of the Howard Hughes Medical Institute, which supports much of our research. The Institute has just started a new initiative to provide resources to the scientists they support to develop science outreach and education programs. The director of this program was very enthusiastic about our goals for developing the science education side of rosetta@home, both for the general public and for high school students (perhaps as a short unit in science classes). So I hope you will see considerable evolution of the screensaver and supporting materials in this direction over the next six months.

On the interactivity front, a computer science graduate student here, Adrien, just completed the first proof of concept step and is going to give us a demo tomorrow. He will describe his plans on these boards soon.
____________

David Baker
Message 31888 - Posted 1 Dec 2006 7:33:55 UTC

I'm just back from the CASP7 meeting, which was very exciting. I will give a fuller report this weekend after I catch up on sleep and the pile of things which has accumulated. Very briefly, it turns out that many of the predictions Rhiju and Bin have posted on the "top predictions" page were the best made for these targets in the whole prediction experiment. And for the experts among you, the T283 ab initio prediction was found by the CASP7 assessor to be accurate enough to unambiguously solve (by molecular replacement) the x-ray crystallography phase problem, an absolutely unprecedented result.
____________

David Baker
Message 31923 - Posted 2 Dec 2006 1:38:30 UTC

I'll post more soon on the CASP7 results--for now you can view all of the numerical evaluation of the predictions at the CASP7 web site. The human expert evaluations will be posted there soon as well.

I just received the latest edition of the Howard Hughes Medical Institute bulletin, which contains an article about rosetta@home participants. You can download this article, and others describing other interesting work supported by the institute, from

http://www.hhmi.org/bulletin/popups/download_pdfs_nov06.html
____________

David Baker
Message 32043 - Posted 4 Dec 2006 3:00:18 UTC
Last modified: 4 Dec 2006 7:31:08 UTC

Thanks to feet1st for clarifying my original post! His improved version, with an explanation suggested by SOAN, follows:

The assessor reports are not yet posted. In the meantime, you can get an overview of the predictions you all made for CASP7 by looking at this url

Each team participating in CASP is allowed to submit up to 5 models. You can see each of our submitted models indicated by black lines. The orange lines are the models submitted by some of the other 252 groups that participated in CASP7. Not all groups submitted predictions for all of the proteins.

The y axis is a measure (RMSD) of how close the model is to the true structure for the fraction of the structure indicated on the x axis (from 0 to 100%). Each point (X,Y) on one of the lines indicates that the best predicted contiguous segment of X percent of the residues in the model represented by the line was on average Y angstroms from the experimentally determined structure.
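
For those who like to see this spelled out, here is a minimal sketch in Python of how one such line could be computed. It rests on assumptions that may not match the CASP7 evaluation: the model and the true structure are taken to be lists of C-alpha coordinates that are already superimposed, and the function names `segment_rmsd` and `best_segment_curve` are hypothetical.

```python
import math


def segment_rmsd(model, native, start, length):
    # Root-mean-square distance between matching atoms in one contiguous segment.
    total = 0.0
    pairs = zip(model[start:start + length], native[start:start + length])
    for (x1, y1, z1), (x2, y2, z2) in pairs:
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / length)


def best_segment_curve(model, native, percents=(25, 50, 75, 100)):
    """For each X percent, return (X, Y) where Y is the lowest RMSD over all
    contiguous segments covering X percent of the residues."""
    n = len(native)
    curve = []
    for pct in percents:
        length = max(1, round(n * pct / 100))
        best = min(
            segment_rmsd(model, native, start, length)
            for start in range(n - length + 1)
        )
        curve.append((pct, best))
    return curve


if __name__ == "__main__":
    native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    # Made-up model: the first half matches, the second half is 2 angstroms off.
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 2.0, 0.0), (3.0, 2.0, 0.0)]
    print(best_segment_curve(model, native, percents=(50, 100)))
```

In the made-up example the best half of the model is perfect, so the line starts at zero and rises only when the poorly predicted half has to be included.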

A perfect prediction would be a horizontal line at zero, so we would like there to be at least one black line at zero all the way across for each target (there are five black lines in each graph, one for each of the five models submitted). The first thing to note is that we are not even close to predicting protein structures perfectly! This is why we are continuing to do methods development work as a major part of Rosetta@Home, and we think the predictions would be better if they were made today than these which we made last summer. Thanks again for contributing to our efforts.

How did the Baker lab do compared to the 252 other groups participating in the CASP7 experiment? One way to look at this is to count the number of times one of the Rosetta@Home models was clearly better than the models produced by other groups. To do this, you can browse through each of the graphs and count the number of cases in which one of the black lines is clearly lower than all of the orange lines for some fraction of the structure.

My son Benjamin and I just did this quickly. Our list is the following:
targets
283
299
299D1
299D2
300
307
316D3
319
323
327
329
330
330D2
331
347D2
350
354
356
357
360
363
365
368
380

From looking at these plots, it does not seem other groups had as many "breakaway" models. If you have a minute, take a look at these plots and perhaps select a different group to see what their plots look like.
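
The counting we did by eye could be sketched in Python roughly as follows. This is only an illustration: the names `is_breakaway` and `count_breakaways`, the `margin` used to decide what "clearly lower" means, and the target labels and numbers in the example are all hypothetical, not values read off the CASP7 plots.

```python
def is_breakaway(ours, others, margin=0.5):
    """True if, at some x position, our best curve is lower than every other
    group's curve by more than `margin` angstroms.

    ours, others -- lists of curves; each curve is a list of y values
                    sampled at the same x positions.
    """
    n_points = len(ours[0])
    for i in range(n_points):
        best_ours = min(curve[i] for curve in ours)
        best_other = min(curve[i] for curve in others)
        if best_ours < best_other - margin:
            return True
    return False


def count_breakaways(targets, margin=0.5):
    # targets maps a target label to a pair (our curves, other groups' curves).
    return sum(
        1 for ours, others in targets.values() if is_breakaway(ours, others, margin)
    )


if __name__ == "__main__":
    # Made-up numbers for two hypothetical targets "A" and "B".
    targets = {
        "A": ([[1.0, 2.0]], [[3.0, 2.2]]),
        "B": ([[4.0, 4.0]], [[3.9, 4.2]]),
    }
    print(count_breakaways(targets))
```

The choice of `margin` matters: a small value counts near ties as wins, so any count produced this way should be read alongside the plots themselves.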

Also note that when our predictions are not clearly better than the others, we're often very close to the best predictions. This is indicative of Rosetta's consistently good predictions.

This "numerical evaluation" is only part of the story, and the measure used in these plots only looks at the position of the backbone Calpha atoms, not at the protein sidechains which, as you know, we spent a lot of time modeling as well. I'll give you a report on the expert evaluation and the results for the whole protein chain including sidechains in a day or two.

____________

David Baker
Message 32382 - Posted 10 Dec 2006 6:56:25 UTC

The CASP7 assessor reports are now posted at http://predictioncenter.org/casp7/meeting/talks.html. Take a look at the plot on page 18 of the report on free modelling--congratulations to all of you!


____________

David Baker
Message 32519 - Posted 12 Dec 2006 15:30:59 UTC

Yesterday and today Bill Schief and I are attending the Gates Foundation-funded Collaboration for AIDS Vaccine Discovery kick-off meeting. I am meeting many of the long time leaders in the vaccine design area and we are establishing many new exciting collaborations. Computational protein design methodology has never before been applied to vaccine design, and there are huge numbers of things to do. For example, yesterday we talked to the director of the NIH vaccine research center about designing flu vaccines, and with other experts on viruses about designing Herpes vaccines.

The presentations at the meeting are primarily from the leaders of each of the new vaccine design projects which the Gates foundation started funding several months ago, but also from Gates foundation people describing their perspective, including today a talk by Bill Gates.
____________

David Baker
Message 32736 - Posted 16 Dec 2006 8:06:50 UTC

Today I met with Nancy Hutchinson and her team of educators who work closely with teachers to design and implement high school science units on proteins and molecular biology. Nancy's team and I are very excited to work together to integrate rosetta@home into these units, and also to develop a "virtual" version of the course that would provide a comprehensive introduction to the science underlying proteins and rosetta@home and would not require a previous science background. Tim Herman, who produces beautiful models of proteins that students can hold and manipulate, will also be part of these efforts (see http://www.rpc.msoe.edu/cbm/). Nancy's program for teachers is described at
http://www.fhcrc.org/science/education/educators/sep/.

On the research front, the results from the last month of rosetta@home are really exciting. Chu Wang has made very dramatic progress in predicting the structures of protein-protein complexes: previously we were able to predict structures of complexes of two proteins given the structures of the two proteins solved independently, but only when these structures do not change significantly when the two proteins interact. Now Chu, using the computer power you are so generously contributing, has been able to generate accurate predictions for complexes where the individual protein structures do change upon binding--he does this by simultaneously optimizing both the relative position of the two proteins and their structures; needless to say this is an extremely computationally demanding problem that would not even be thinkable without rosetta@home. This is an extremely important step forward since Chu's approach should now allow the prediction of protein-protein interactions quite generally, which is a very important problem in molecular biology.

Brian Kidd just showed me very promising results on another important problem: the modelling of protein conformational changes. Many proteins change their structure and function when they bind an activating small molecule. Brian just started running his calculations on rosetta@home, and already has exciting results showing that, starting with the structure of proteins bound to such activators, if he removes the small molecule, the protein switches back to the state it is known to adopt experimentally in the absence of the small molecule. This is a step toward modeling the motions that underlie the signalling pathways used for communication within and between cells.

Finally, our continued analysis of the CASP7 rosetta@home predictions keeps getting more exciting--it turns out many of these blind predictions have an accuracy that is unprecedented, and that allows the models to be used for applications, such as solving the x-ray crystallographic phase problem, that seemed out of reach earlier. The CASP7 assessor Randy Read made the first dramatic discovery with Rhiju and Bin's prediction, and he is continuing to work with us to better define the advances that have been made.

Over the next month or two we are going to start writing papers describing all of this progress, which will highlight the contributions all of you have made, and we will also be continuing to attack new problems in the protein structure prediction and design area which you have made it possible to envision solving. Also, as I've previously discussed, we will be beginning a concerted effort to bring full interactivity to rosetta@home, so you will be able to be involved in the research even more directly in the future!

____________


Copyright © 2014 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC