Dr. Baker's journal archive 2006

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11591 - Posted: 3 Mar 2006, 6:59:12 UTC

Many of you have suggested that I give more frequent feedback on what is happening with Rosetta@home. I'm very happy to do this, and will try to make nightly posts here (I don't promise every day!). I'll also start a second thread where people can post comments/questions related to the journal which I can try to answer as well.

It has been an extremely exciting week for the project! For the past two months Divya and I have been improving the algorithms in rosetta based on tests on a set of ten proteins whose names you are probably quite familiar with by now. David Kim tested the improved method on an additional set of about 25 proteins. He has been so busy trying to fix the problems causing the WU errors that he hasn't had much time to analyze the results, but David and I took a quick look yesterday and the results are pretty amazing--for about half of them the lowest energy structure was quite close to the correct (native) structure. Since we hadn't used this set in improving the method, it suggests that the improvements we have made are pretty general, and that we are making significant progress towards solving the protein structure prediction problem for small proteins.

It is hard to describe how electrifying (and almost scary!) this is. The protein structure prediction problem is perhaps the longest standing problem in molecular biology. It has been known for forty years that the structures of proteins are determined by their amino acid sequences, but as recently as five or six years ago it was generally thought that the prediction problem was completely intractable, as very little progress had been made. Starting about this time we showed in the CASP blind tests that with the rosetta low resolution structure prediction method rough models could be built for small proteins that in some cases were reasonably similar in topology to the true structure, but the predicted structures were never accurate at the atomic level. We have worked for the past five years on developing high resolution refinement methods that could take these rough models and refine them to much higher accuracy. This goal remained elusive for the first few years, but about a year and a half ago we made a breakthrough and found that we could make very accurate predictions for some proteins using a trick that involves folding not only the sequence of the protein of interest but also the sequences of a large number of evolutionarily related homologs. Using this method we made the first high accuracy ab initio structure prediction in CASP (the last target in CASP6) and did further tests which showed accurate predictions for 6 of 16 proteins; these were published in Science last year (I think there is a link to the paper on the home page).

However, this work did not achieve the goal of predicting structure accurately from the amino acid sequence of a protein alone, as we had to resort to evolutionary information. Achieving this goal has been the central aim of rosetta@home thus far, and as I said above it is almost a "holy grail" of computational biology. So now, looking at David Kim's results and seeing that for quite a few proteins we are coming close to predicting structure from their amino acid sequences without any other information is pretty breathtaking.

Divya is now preparing figures to post on the science page which show the results on the 10 protein set we have been working on, and David will post results on his larger set once he can finish the analysis after he has implemented his clever fix to the unhandled exception errors.

It is clear that, for the still large number of proteins for which we are failing, the problem is insufficient sampling: even with 100,000 independent folding runs we are not coming close enough to the native structure to land in its energy minimum. So we need more cpu power! It is kind of amazing that solving such a long standing scientific problem depends so crucially on the efforts of volunteers like yourselves! I don't know how much more cpu power it will take, but if you can each recruit ten friends or relations ...

David Baker
Message 11640 - Posted: 4 Mar 2006, 4:59:50 UTC

Tonight I will try to explain what the names of the work units mean--I know it must seem quite bewildering.

David Kim has just submitted a batch of work units with names like "HB_BARCODE_30_1enh_".

The "HB", like the "HBLR_1.0" in the work units I sent out in the last several weeks, indicates that hydrogen bonds are given less weight than in our standard work units. I found last month that reducing this weight helps overcome barriers to sampling that were due to formation of beta sheet structures more regular than in most native proteins. We will probably soon make this change in all of our calculations as it seems to be a quite general improvement.

The "BARCODE_30" in this work unit and in many of the previous work units indicates that a method we developed last fall for spreading out sampling is being used. In this method, each instance of each work unit gets a randomly chosen "barcode" which directs the folding trajectory to a different part of the space we are searching. The "30" indicates that one randomly selected residue is restricted according to the barcode for every 30 residue segment in the protein.
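To make the "one residue per 30 residue segment" idea concrete, here is a minimal Python sketch. It is only an illustration, not the actual rosetta code: the function name, the use of consecutive non-overlapping segments, and the 54-residue example length are all assumptions, and the sketch only picks which residues would be restricted, not how the restriction is applied.

```python
import random


def choose_barcode_residues(n_residues, segment_length=30, rng=None):
    """Pick one random residue index from each consecutive segment.

    Hypothetical helper: splits the chain into consecutive segments of
    `segment_length` residues (the last one may be shorter) and returns
    one randomly chosen 0-based residue index per segment.
    """
    if rng is None:
        rng = random.Random()
    chosen = []
    for start in range(0, n_residues, segment_length):
        end = min(start + segment_length, n_residues)
        chosen.append(rng.randrange(start, end))
    return chosen


# Example with an assumed 54-residue chain: two segments, so two residues.
picks = choose_barcode_residues(54, rng=random.Random(0))
print(picks)
```

Because every copy of a work unit draws its own random picks, different copies would be steered toward different regions of the search space, which is the point of the method described above.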

The "1enh" indicates the name of the protein being folded. This protein is a "homeodomain", which is a fascinating family of proteins found throughout multicellular organisms. These proteins control the fundamental steps of development all the way from flies to humans! Mutations in these proteins in flies, for example, can cause legs to grow out from where the eyes normally are. If you go to www.rcsb.org and enter the four letter code for a protein running on your machine you can learn more about each protein. Maybe we can put together a summary sheet with a short description of each protein if this would be interesting.

The "_" at the end of 1enh indicates that this protein only has one chain. Some proteins have multiple different chains, and in these cases each chain is indicated by a label like "A". For example, 5croA means we are folding chain A of the transcription factor 5cro.

Divya has just submitted a series of work units with names like HOMSdi_homDB006_1di2_. Here Divya is testing how much better we can do when the multiple sequence method described in my first post is run with distributed computing. "homDB006" means this is the 6th homolog sequence being folded for the protein 1di2.
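The naming scheme described in this post can be summarized with a small Python sketch. This is a hypothetical parser written only from the examples quoted above; it assumes the name ends with the four letter protein code followed by one chain character, which may not hold for the full work unit names you see on your machine.

```python
def parse_work_unit_name(name):
    """Split a work unit name into the fields described in the post.

    Assumption: the last five characters are the four letter protein
    code plus a chain label, where "_" means a single-chain protein.
    """
    target = name[-5:]
    pdb_code, chain = target[:4], target[4]
    tokens = name[:-5].strip("_").split("_")
    info = {
        "pdb_code": pdb_code,
        "chain": None if chain == "_" else chain,
        # "HB" or "HBLR" marks the reduced hydrogen bond weight.
        "reduced_hbond_weight": any(t.startswith("HB") for t in tokens),
        "barcode_segment": None,
        "homolog_index": None,
    }
    for i, token in enumerate(tokens):
        if token == "BARCODE" and i + 1 < len(tokens) and tokens[i + 1].isdigit():
            # e.g. BARCODE_30: one restricted residue per 30 residue segment
            info["barcode_segment"] = int(tokens[i + 1])
        elif token.startswith("homDB") and token[5:].isdigit():
            # e.g. homDB006: the 6th homolog sequence for this protein
            info["homolog_index"] = int(token[5:])
    return info


print(parse_work_unit_name("HB_BARCODE_30_1enh_"))
print(parse_work_unit_name("HOMSdi_homDB006_1di2_"))
```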

That's all for tonight. Tune in tomorrow for a description of the causes of work unit failures and what we are trying to do to solve these problems.




David Baker
Message 11677 - Posted: 5 Mar 2006, 7:32:23 UTC

Tonight I will describe how we are going about trying to fix the problems causing work unit errors.

The large increase in errors two weeks ago was not due to new bugs, but was an unintended byproduct of our effort to help dial up users who wanted longer work units and the many other users who requested work units with more consistent lengths. Since most work units had previously been taking on the order of an hour or two, setting the default work unit length to 8 hours resulted in 4-8 times more structures being generated per work unit. This unmasked rare possibilities for error that were not evident before: if the chance of running into an error when generating one structure is, say, 1 out of 100, then if 10 structures are being generated most work units will be fine, but if 100 structures are generated there is a high likelihood of error. David K. has reduced the default time to 2 hours, and this seems to have increased success rates considerably.
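The arithmetic behind this can be checked with a few lines of Python. The sketch assumes that errors on different structures are independent and uses the illustrative 1-in-100 figure from the paragraph above, not a measured error rate.

```python
def work_unit_failure_probability(per_structure_error, n_structures):
    """Chance that at least one structure in a work unit hits an error,
    assuming each structure fails independently with the same probability."""
    return 1.0 - (1.0 - per_structure_error) ** n_structures


for n in (10, 100):
    p = work_unit_failure_probability(0.01, n)
    print(f"{n:3d} structures per work unit: {p:.1%} chance of an error")
```

Under these assumptions a 10-structure work unit fails a little under 10% of the time, while a 100-structure work unit fails roughly 63% of the time, which is why shortening the default run time helps.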

It is clear that different machines and platforms have errors at very different frequencies. If you have a machine that never or very rarely has errors, to reduce net traffic it would be good to increase the work unit time, but if you have a machine that still has frequent errors it is probably worth decreasing the time to one hour.

Clearly we need to find the sources of the errors. This is complicated for us because we use linux machines for our own work, and the errors are primarily happening on windows machines. David Kim had the excellent idea of putting in error catching logic so that if an error is encountered the program just goes on to compute the next structure rather than aborting, but unfortunately from tests on Ralph it seems that many of the errors cannot be recovered from. David is continuing to explore this, and meanwhile I have posted a bounty on finding the bugs/problems to all rosetta developers. Also, if you are an expert and have a windows machine with problems that you can reproduce outside of boinc (as described in the number crunching boards a few weeks ago) and have time to go through with the debugger, let us know.
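The error catching idea can be sketched in Python as follows. This is only an illustration of the control flow: rosetta itself is not written in Python, `run_work_unit` and `fold_one` are hypothetical names, and low-level failures such as access violations cannot in general be caught this way, which matches the observation that many of the errors are not recoverable.

```python
def run_work_unit(n_structures, fold_one):
    """Generate structures one at a time, skipping any that raise an error.

    `fold_one` is a stand-in for the routine that computes one structure.
    A failure is recorded and the loop moves on to the next structure
    instead of aborting the whole work unit.
    """
    results, failures = [], []
    for index in range(n_structures):
        try:
            results.append(fold_one(index))
        except Exception as exc:
            failures.append((index, repr(exc)))
    return results, failures


def demo_fold(index):
    # Toy stand-in: pretend the third structure hits a recoverable error.
    if index == 2:
        raise ValueError("simulated failure")
    return f"structure_{index}"


results, failures = run_work_unit(5, demo_fold)
print(len(results), "structures completed;", len(failures), "skipped")
```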

We are also working with the boinc developers to solve problems likely to be more related to boinc than rosetta. These include the occasional crashing on termination, and the graphics related problems. Jack Schonbrun, who developed the rosetta@home screensaver (we hope you like it!), just moved on to a job at Lawrence Berkeley Laboratory in the same division as David Anderson and the boinc development team, which should help us coordinate to solve these problems.

I'll get back to more exciting recent science results in the next few days. Suggestions for topics are also welcome.

David Baker
Message 11701 - Posted: 6 Mar 2006, 4:12:11 UTC

Tonight I will describe some exciting results from the last two days!

I need to first explain the idea of folding multiple "homologues". Because the fundamental machinery of life has remained very similar throughout the course of evolution, many proteins in humans, for example, are closely related to a number of proteins in many different organisms. The homeobox protein I mentioned earlier is present in all higher organisms, for example. So most proteins belong to large families; if you line up the amino acid sequences of any two proteins in one of these families you would find that between 25 and 90% of the amino acids are identical. It is very well established that all members of a protein family have very similar structures. So if we want to predict the structure of a protein we can fold not only the protein we are interested in but also all of its family members, which are called homologues.

The exciting result is that if we generate models for many different homologues for a given protein, and restrict our attention to those homologues where the ten lowest energy structures are very similar to each other, we find that the models produced by these "converged" runs are very close to native. It is clear that when making predictions for a single protein, if the low energy structures are all different from each other, a confident prediction cannot be made as there is no basis for choosing one prediction over another. The new result is that convergence (all low energy structures similar to each other) can be used to pick out the folding runs which generated accurate models. This also provides a way to assess the confidence of a prediction--if the folding runs for a protein which converge all produce the same structure (as in the cases I've analyzed), it is almost certain to be correct. Having criteria for evaluating the probability that a model is correct is very important when we move to prediction of structures of important proteins with currently unknown structure, something which may happen sooner than I had thought given the rapid progress we have been making.
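A convergence test of this kind could look roughly like the Python sketch below. It is an assumption-laden illustration rather than the analysis we actually run: models are taken to be (energy, coordinates) pairs that are already superimposed on a common frame, the RMSD is computed without any further alignment, and the 2.0 Å cutoff is an arbitrary example value.

```python
import math
from itertools import combinations


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of
    (x, y, z) points, assuming they are already superimposed."""
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def is_converged(models, n_lowest=10, max_pairwise_rmsd=2.0):
    """True if the n_lowest lowest energy models are all mutually similar.

    `models` is a list of (energy, coords) pairs; the cutoff is a
    hypothetical threshold chosen only for this example.
    """
    lowest = sorted(models, key=lambda m: m[0])[:n_lowest]
    return all(
        rmsd(a[1], b[1]) <= max_pairwise_rmsd
        for a, b in combinations(lowest, 2)
    )
```

A run flagged as converged by such a test would be kept as a confident prediction; a run whose low energy models disagree with each other would be set aside.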

(As a side note, our local filesystem was rearranged this weekend for reasons unrelated to BOINC, but this may have inadvertently affected our ability to send jobs out. If this is a problem, it will be fixed by tomorrow AM.)

David Baker
Message 11746 - Posted: 7 Mar 2006, 6:11:42 UTC

Tonight, to give you a more balanced picture of what doing this kind of scientific research is like, I'm going to talk about failure!

I spent most of the fall trying to achieve, in folding calculations using only a single sequence, the considerable improvement in structure prediction that we had observed using many different homologous sequences. My idea was to try to make the sequence being folded effectively a bit different each time by changing the parameters which determine how the energy is computed during the low resolution part of the search. This is along the lines of the suggestion on the comments/questions boards about whether we could generate our own sequence homologues to improve the search. Frustratingly, I was never able to really improve the results over using just the same sequence over and over again starting (as we always do) from different random starting points. (We do continue to add this variation into each run through the "rand_envpair_res_wt" and "-rand_SS_wt" flags you will see on the command lines we are sending out, as adding this variation may help in some fraction of cases.)

This is the way research is--many ideas that seem like they have to work in fact don't, and it is only by constant experimentation and testing of lots of different ideas that progress is made.

David Baker
Message 11773 - Posted: 8 Mar 2006, 6:23:31 UTC

Just a quick update tonight.

(1). David Kim has started a large set of new jobs for his 25 protein test set. Their names are HB_BARCODE_30_***** where ***** are the name and chain id of the protein; I described the HB_BARCODE part in an earlier post. Rhiju is preparing to send out jobs for 30 homologs for each protein in a still larger set of 62 proteins. Divya is completing her homolog runs for the initial set of 10 proteins she and I have been studying. We are very excited about seeing the results for all of these calculations as they will guide our next steps. There are over 3 million jobs queued now!

(2). I'm a bit disappointed that the total cpu power has remained constant for the past weeks rather than increasing as it had up until recently. More users and hosts are joining every day, but this is not translating into increased computing capability. We aim to extend our calculations to larger proteins based on the success we have been having with proteins under 100 amino acids, but this will require a significant increase in computing power. Please let me know if there is anything I can do to help with recruiting.

(3). We are making a major push to find and eliminate the causes of those annoying access violation and other errors. We've enlisted a team of rosetta experts to help find possible memory leaks, etc. and it is looking like Microsoft will also have a team to help us--we have been interacting quite a bit with them as they are interested in becoming more involved with scientific computing. So I'm optimistic that most issues will be resolved within the next couple of weeks.

David Baker
Message 11811 - Posted: 9 Mar 2006, 6:21:02 UTC

Good news on the front of tracking down the remaining errors--a long time BOINC and SETI@home expert, Rom, is joining us as a consultant to help fix the "leave in memory", graphics, and other problems related to the rosetta-BOINC interface.

Divya will be posting updates to the "top results" section (yes, I know, at long last...) before tomorrow evening, and I will discuss these in my update tomorrow.

David Baker
Message 11844 - Posted: 10 Mar 2006, 6:05:23 UTC

Tonight I will discuss the results that Divya has posted in the "Top Results" section.

The first protein is 2reb, which is the shorthand code for RecA, one of the most important proteins in DNA repair and recombination in all of life.

The energy versus RMSD plot on the left shows that Rosetta has strongly converged on the correct structure--all of the very low energy structures are less than 1.5-2 Å RMSD from the native structure, which is close to experimental error. In cases like this, where all the lowest energy structures are very similar to each other, we can be quite sure that Rosetta has found the correct solution. Congratulations to Dalephi and Team TeAm AnandTech for finding the lowest energy structure!

The top two pictures on the right show a comparison of the experimentally determined structure (top) with the Rosetta predicted structure (bottom). As you can see, they are very similar. The third picture (below) is one that you haven't seen before. Here the two structures are shown superimposed with each other, with the protein sidechains in the core of the protein shown as well. You can see that the sidechains are in very similar positions in the prediction and the actual structure.

The panel below shows the energy vs rmsd plot again and on the right is a comparison of the lowest rmsd structure to the actual structure. They are again very similar. Congratulations to Jacekko and Team Rosetta@Poland for finding this structure!

The following panels show results for the protein 1dcj. 1dcj is important for cell division in bacteria, and could be a target for developing antibacterial drugs. As in the case of 2reb, the lowest energy points are all close to the correct structure, and similar to each other. Rosetta has again converged on the correct answer. The RMSD is a bit higher, but this is probably due to wiggles in the red tail in the pictures at the right, which probably does not have a well defined structure in solution. The pictures show that aside from this segment the lowest energy structure is very similar to the experimentally determined structure. The superposition in the panel below shows that the amino acid sidechains are in very similar positions in the lowest energy structure and the experimentally determined structure. Congratulations to Team_Gaol~Christian and the Dutch Power Cow for finding this structure (I hope the DPC stampede in the right direction)!

The last set of panels compares the lowest rmsd structure and the experimentally determined structure, which are again very similar. Congratulations to LocalBusinessMen (Team XtremeSystems) for finding this structure!

In these plots on the left, for clarity we are only showing the lowest energy 5000 of the roughly 300,000 structures all of you computed for these proteins. If we could collectively do as well on all proteins, the protein folding problem would be solved!
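For readers curious how the points in such a plot might be chosen, here is a small Python sketch. It is not the script used for the top results page: models are assumed to be simple (energy, rmsd) pairs, and the function names are made up for this example.

```python
import heapq


def lowest_energy_subset(models, n_keep=5000):
    """Return the n_keep lowest energy (energy, rmsd) pairs, lowest first."""
    return heapq.nsmallest(n_keep, models, key=lambda m: m[0])


def headline_models(models):
    """Return the lowest energy model and the lowest rmsd model,
    i.e. the two structures highlighted for each protein."""
    lowest_energy = min(models, key=lambda m: m[0])
    lowest_rmsd = min(models, key=lambda m: m[1])
    return lowest_energy, lowest_rmsd


# Toy data standing in for the hundreds of thousands of returned models.
toy = [(-50.0, 8.0), (-90.0, 1.4), (-70.0, 0.9), (-10.0, 12.0)]
print(lowest_energy_subset(toy, n_keep=2))
print(headline_models(toy))
```

Note that the lowest rmsd model can only be identified when the native structure is already known, as it is for these test proteins; in a real prediction only the energy is available.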



David Baker
Message 11880 - Posted: 11 Mar 2006, 6:25:07 UTC

Two new things for tonight.

First, Divya has posted results for a new protein, 1dtj, on the top results page. 1dtj is an RNA binding protein that is associated with an autoimmune neurological disease called paraneoplastic opsoclonus-myoclonus ataxia.

The energy versus rmsd plots on the left again show a cluster of structures with low rmsds and low energies. The rmsd spread of the low energy structures is a bit larger than in the previous examples. This is because of a long loop which is visible at the bottom of the structure superposition image. It actually associates with another protein in the experimentally determined structure, but in our calculations it folds back on the main part of the protein since there is nothing else to interact with. The much higher energy "low rmsd" structure has lower rmsd because this loop is in a position more like in the native structure, but the main part of the protein is in fact not as well predicted as in the low energy structure.

This is again a very successful prediction; congratulations to tgxiii (Team XPC) and DPaddick for finding the lowest energy and lowest rmsd structures. These results will also help us to improve rosetta--on the right side of the energy versus rmsd plot you will see a low energy structure with high rmsd which is an indicator of an inaccuracy of the energy function. We are currently studying such cases to learn how to improve our energy calculations.

The second piece of news is that our new BOINC consultant Rom visited for several hours today, and is now up to speed on rosetta@home--hopefully he will soon find the causes of some of the problems. We have also got all of our local machines running rosetta@home and are hoping they will have errors that we can then track down, but we haven't gotten any (non trivial) errors yet, which in our case is a nuisance! There is probably some common feature to the subset of your machines having errors and I hope we can track it down.

David Baker
Message 11924 - Posted: 12 Mar 2006, 4:46:56 UTC

Divya has just posted the results for 1mky, a protein involved in a molecular switch. This was probably the hardest protein in our original test set, and the results illustrate why we are still working on improving the rosetta algorithm and why we need more computer power. You can see on the lower left of the rmsd versus energy plot that there is a low energy, low rmsd point; this is quite a good model and is comparable to the "low rmsd" model shown on the page. However, there was very little sampling in this energy+rmsd range, and none closer to the native structure, and the very low native (correct) energy minimum was missed! Because of this, the lowest energy structure found is quite incorrect: you can see that the cyan loop in the native structure turns into a strand in the low energy structure and intercalates into the main body of the structure. Again, by studying this low energy wrong structure we will be able to improve the rosetta energy function. Also, since the native structure is lower in energy than this (wrong) lowest energy structure, if we could have sampled more we would have come closer to the native structure and then the lowest energy structure would have been more accurate.

Congratulations to Canada_David (Team Team Commonwealth) for finding the low energy structure (it has a lot to teach us!), and to Viking69 (Team Ricoh Corp.) for finding the lowest rmsd structure. We are carrying out more runs for this protein now using the multiple sequence method I described, and we should give out a prize to the person who finds the even lower energy, close to native structures not found so far.


David Baker
Message 12000 - Posted: 14 Mar 2006, 5:16:54 UTC

Several new things today:

We will be acknowledging the lucky individuals who find the lowest energy and lowest rmsd models on the main news page in addition to the top results page. We are looking into possibilities for certificates. I assume giving bonus credits for low rmsd and low energy structures would not be a popular idea.

Today's protein is 1tif, a protein which plays a critical role in the initiation of translation. Translation is the process by which information in the genetic code is read to create proteins with specific amino acid sequences. You can see on the top results page that the low energy model has the same overall shape as the actual structure. The major differences are in the red bit at the C terminal end of the protein.

David and Keith are setting up multiple load sharing web servers to reduce the load which has caused some of the slowdowns on the web site.

Rhiju is testing an approach he and I think should yield improved results on a number of the proteins you have become familiar with. He will give you a report in a few days.

David Baker
Message 12085 - Posted: 16 Mar 2006, 6:42:35 UTC

Divya has just posted the most striking of the recent results on the top predictions page. This could be a poster child for what can be accomplished with large scale distributed computing! (The protein, 1di2, mediates interactions with double stranded RNA).

If you look at the energy vs rmsd plot for 1di2, you will see that one amazing rosetta run produced a structure far lower in energy than any other, with an rmsd of ~1.3 Å. As you can see from the superposition on the right, this structure is essentially identical to the native structure--a nearly perfect prediction.

It is remarkable that exactly 1 run, out of the roughly 500,000 independent calculations all of you did, found the native minimum. With fivefold less sampling, there would only be a one in five chance of having landed in the correct minimum, and rather than achieving an incredibly accurate prediction such as this one, the prediction would have been quite incorrect, as the next lowest energy structures are quite a bit higher in rmsd.

Of course, a level of sampling for which just one trajectory lands in the native minimum is not adequate for reliably predicting structure--this is why I keep harping on the need for more cpu time. With ten times more sampling, we would expect ten hits in the native minimum and a much lower chance for failure. Indeed, for the preceding two proteins, 1mky and 1tif, which are somewhat bigger than 1di2, we did not have enough sampling, and the native minimum was not found.
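The sampling argument can be worked through numerically with a short Python sketch. It assumes each run independently lands in the native minimum with probability 1 in 500,000; that rate is estimated from a single hit, so it is itself very uncertain and the numbers are only indicative.

```python
def p_at_least_one_hit(hit_rate, n_runs):
    """Probability that at least one of n_runs independent runs
    lands in the native minimum."""
    return 1.0 - (1.0 - hit_rate) ** n_runs


hit_rate = 1 / 500_000  # one hit observed in roughly 500,000 runs

for n_runs in (100_000, 500_000, 5_000_000):
    expected_hits = hit_rate * n_runs
    chance = p_at_least_one_hit(hit_rate, n_runs)
    print(f"{n_runs:>9,d} runs: expect {expected_hits:4.1f} hits, "
          f"{chance:.1%} chance of at least one")
```

With fivefold less sampling the chance of at least one hit comes out at about 18%, close to the "one in five" quoted above; at the current level it is only about 63%; with ten times more sampling a complete miss becomes very unlikely.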

We can roughly divide the different prediction problems into three categories: first, those like 2reb, 1r69, and 1dcj, where a large number of trajectories end up in the native minimum and the prediction is accurate and confident; second, those like 1tif and 1mky, for which no trajectories end up in the native minimum, and for which much more sampling is necessary; and finally those like 1di2, which are precariously balanced between these two extremes. As we continue to post examples over the coming weeks you will see more instances of all three classes, and I will describe our strategies for improving the second and third classes.

(You may have noted the "user unknown" label above the low energy plot; this is a funny sidenote. I was excited about showing this example and highlighting the lucky user who found this amazing solution, but we haven't been able to track down the relevant entry in the database; divine providence perhaps?)



David Baker
Message 12186 - Posted: 18 Mar 2006, 5:25:59 UTC

Rhiju has posted results for the protein 1elw, which is part of proteins called "molecular chaperones" which help proteins to fold. Even if you haven't looked at many of the top predictions posts, you should look at this one! The energy versus rmsd plot illustrates the power of large scale sampling: the native minimum is clearly located by a small subset of runs, and the lowest energy structure is amazingly similar to the experimentally determined structure (probably within experimental error). Look at the superposition carefully if you have time. This is a dramatic success of the rosetta-boinc approach.
Congratulations to Markku Fagar from Finland for finding the lowest rmsd structure, which is close indeed.

On the error fixing front there has been rapid progress! I hope to have very good news in the next few days.




David Baker
Message 12246 - Posted: 19 Mar 2006, 7:10:07 UTC

For some exciting news on the error front, and a description of the problem and the solution, take a look at
http://www.romwnet.org/dasblogce/

It is amazing how fast Rom brought the error rate way down on Windows machines on Ralph, and we have spent today and yesterday making further modifications based on his suggestions that will hopefully bring the error rate down still further. You should see the improvements made thus far on rosetta@home by Tuesday.


David Baker
Message 12395 - Posted: 21 Mar 2006, 6:34:20 UTC

Rom is now trying to track down the 1% bug, and we've been going over backtraces he has sent me. We still need more data, so if you have computers with this problem please sign them up for ralph@home.

There is a new thread on "Can you help rosetta@home" started by Robert Brooke and Feet1st that already has a number of great suggestions on improving the project. We really need your help to make these happen as there are only a few of us here and we are trying to do the science and keep the project going at the same time (and fix the bugs and put out the fires that arise...). If you can coordinate on these projects it would be fantastic, and as I said earlier, we will throw a rosetta@home volunteers party some time. For things like T shirts, mailing certificates, etc. I have some funds that could be used.

David Baker
Message 12491 - Posted: 22 Mar 2006, 5:01:44 UTC

Lots of neat stuff today!

Bin has his first set of results back from your computers on model refinement using his "loop relax" protocol, which we will explain soon. In these calculations, rather than starting with an extended chain he starts with a rough model based on already known structures of related proteins. The exciting result is that most of the time the lowest energy refined models are significantly closer to the correct structure over the core regions than the starting model before refinement. We will try to put up some pictures to illustrate this soon.

Rhiju had the excellent idea of incrementing the percent complete counter at the beginning of each of six distinct phases in the folding process. Currently the percent complete counter is only updated after a complete structure is generated, so when the counter is at 1% you (and we) don't know where the calculation is. Rhiju and I have modified the code, and if we can get this into the next release you will see the percent complete increase much more frequently; this will also help us to track down the remaining problems, as we will be able to locate where in the code any problems are occurring.
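To make the idea concrete, here is a minimal sketch of per-phase progress reporting. It is only an illustration and not the actual Rosetta code: the function name, its arguments, and the assumption that each of the six phases counts equally toward one structure are all hypothetical.

```python
N_PHASES = 6  # six phases per structure, as described above; equal weighting is an assumption


def percent_complete(structures_done: int, total_structures: int, phase: int) -> float:
    """Percent complete on entering `phase` (0-based) of the next structure.

    Hypothetical helper: each finished structure counts as one unit of work,
    and each phase of the structure in progress counts as 1/N_PHASES of a unit.
    """
    if total_structures <= 0:
        raise ValueError("total_structures must be positive")
    if not 0 <= structures_done < total_structures:
        raise ValueError("structures_done must be in [0, total_structures)")
    if not 0 <= phase < N_PHASES:
        raise ValueError("phase must be in [0, N_PHASES)")
    # Work in integer units of phases so the result is not affected by
    # accumulating floating-point error.
    phases_done = structures_done * N_PHASES + phase
    return 100.0 * phases_done / (total_structures * N_PHASES)
```

With a scheme like this, a run of ten structures would report a new value at the start of every phase instead of only ten times in total, so a stuck run could be tied to one phase of one structure.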

We have just released on Ralph the latest of Rom's improvements to Rosetta--these should allow us to pinpoint the source of the 1% error. See Rom's descriptions of his work in his message boards post. We are incredibly fortunate to have a true BOINC and Windows wizard solving all of these problems!

As I noted earlier, Rom has already solved the "keep in memory" problem. We are only waiting a day or two to release the fixed version to Rosetta@home in case Rom solves more problems quickly, so that we don't have to release two versions in rapid succession.

I'm absolutely delighted about the response to the "Can you help Rosetta" thread--we will be able to do great things all together!!

David Baker
Message 12552 - Posted: 23 Mar 2006, 6:16:54 UTC

The hero of the day is Keith, who spent many hours on his day off tracking down the cause of the dramatic slowdowns you have probably noticed on and off on the Rosetta@home web site. He found the problem and solved it, as he has described on the technical news page.

Rhiju has added links to information on each of the proteins in the top results section, which should be useful to anyone interested in learning more about what the proteins do.

Rhiju and I have been looking at the roughly 30% of proteins for which we are still failing to produce accurate models. It will come as no surprise to you that the problem appears to be insufficient sampling--the native structure is always much lower in energy than any of the structures found, and the structures sampled are all quite different from each other--the problem is not that they are getting trapped in a small region of the space, but that the space is so large. I'm hoping that if we can solve the 1% bug this week, and the efforts described in the "Can you help Rosetta@home" and related threads get going, we can really push to increase participation, which should bring some of these more challenging larger proteins into the solvable range.

For those who are helping to track down the 1% bug on Ralph, please read Rom's post on updating the BOINC client--it turns out that we will get a complete error report for aborted "stuck" runs only if the very latest client is being used.

David Baker
Message 12744 - Posted: 28 Mar 2006, 4:50:51 UTC

A couple of things today:

First, after consultations with many of you on the message boards, we have set the maximum allowed run time to close to 24 hours. No jobs should be getting stuck for much longer than this. The maximum length to which you can extend your work units has correspondingly dropped to 24 hours. This is of course a temporary fix, as we are still hot on the trail of the "1%" bug. It is clear that it is not an infinite loop within Rosetta, as it only occurs when Rosetta is run within BOINC. With better backtraces in the latest version on Ralph, and a new version of BOINC soon to be released that should make error tracing even more straightforward, we are optimistic about solving the problem in the near future.

Our RALPH tests have shown that Rom's fixes have solved most of the other problems, and the error rates are way down, on Windows machines in particular. You haven't seen the new version on Rosetta@home yet because every time we think we have a version ready to release, a small problem has shown up on a different platform, which David has had to go back and fix. The new version includes more frequent updating of the "percent complete", which will allow you (and us) to localize more precisely where in the folding process work units are getting stuck.

I had some neat results over the last day or two taking the lowest energy structures identified in your large scale searches and resampling the beta strand pairings, allowing for a limited amount of additional variation. When I used pairing information consistent with the native structure, I found structures with significantly lower RMSD and energy than earlier, and I am now excited to try this without any bias on a large scale: at the beginning of each run, a set of beta pairings defining a beta sheet "topology" will be selected at random and used to guide the trajectory. Only a small subset will have the correct pairings, but based on this recent test these should have lower energies than the others and so should be detectable based on their energies. These runs will be called "TOP_SAMPLE" (for "topology sample")--look for them soon!
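The selection logic can be sketched roughly as follows. This is an illustration only, not the Rosetta implementation: the function names are hypothetical, a topology is represented as any hashable label (for example a tuple of strand pairs), and the `fold` callable stands in for a full folding trajectory that returns a final energy.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple


def sample_topologies(
    topologies: Sequence[Hashable],
    fold: Callable[[Hashable, random.Random], float],
    n_runs: int,
    seed: int = 0,
) -> List[Tuple[Hashable, float]]:
    """Run `n_runs` trajectories, each guided by a randomly chosen topology.

    `fold` is a placeholder for a folding run: it takes the chosen topology
    and a random number generator and returns the final energy.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        topology = rng.choice(topologies)  # chosen once, at the start of the run
        results.append((topology, fold(topology, rng)))
    return results


def rank_topologies(results: Sequence[Tuple[Hashable, float]]) -> List[Tuple[Hashable, float]]:
    """Return (topology, lowest energy seen) pairs, lowest energy first."""
    best = {}
    for topology, energy in results:
        if topology not in best or energy < best[topology]:
            best[topology] = energy
    return sorted(best.items(), key=lambda item: item[1])
```

If the runs with the correct pairings really do reach lower energies, the correct topology should appear at the top of the ranked list even though only a small fraction of the runs sampled it.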

David Baker
Message 12784 - Posted: 29 Mar 2006, 7:41:32 UTC

As you know, David Kim today released on Rosetta@home the version we've been improving on RALPH.
I just checked the error rates for jobs on RALPH with this version over the last 24 hours, and they are great:

Version  OS       Total Results  Pass Rate (%)  Fail Rate (%)
496      Windows  1284           98.05          1.95
487      Darwin   50             98.00          2.00
487      Linux    178            96.63          3.37

So you should see an improvement over the next several days as the new jobs replace the older ones.
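For anyone who wants to check the arithmetic in the table above, the rates are just percentages of the total result count. Here is a small sketch; the function name is hypothetical, and the pass/fail counts used in the usage note are inferred from the published percentages rather than taken from the server logs.

```python
from typing import Tuple


def pass_fail_rates(passed: int, failed: int) -> Tuple[float, float]:
    """Return (pass rate, fail rate) in percent, rounded to two decimals."""
    total = passed + failed
    if total <= 0:
        raise ValueError("need at least one result")
    pass_rate = round(100.0 * passed / total, 2)
    # Derive the fail rate from the pass rate so the two always sum to 100.
    fail_rate = round(100.0 - pass_rate, 2)
    return pass_rate, fail_rate
```

For the Windows row, 1284 total results split as 1259 passed and 25 failed (an inferred split) reproduces the 98.05 and 1.95 figures.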

As I mentioned earlier, the new version updates the "percent complete" value more continuously. If you do have a stuck work unit, please let us know exactly what percent complete it was at, as this will allow us to pinpoint where in the program the sticking occurs.

David Baker
Message 12900 - Posted: 1 Apr 2006, 5:15:56 UTC

Version 4.83 is looking great! Here are the results for the last 24 hours on Rosetta:


Version  OS       #jobs  %success  %failure
483      Windows  84993  95.59     4.41

A large fraction of these errors are associated with downloading files; these are not so bad as the process exits immediately so no time is lost.

I had hoped to be able to track down the "1% bug" using the more continual updating scheme we put in, but so far there have been no reports of "stuck" WUs with 4.83, which we can't complain too much about!

Almost all the recent error reports have been with older WUs--if these are causing problems for you, please go ahead and delete them.

I'm really excited about the ideas Feet1st, Robert Brooke, and others are discussing on the threads below about recruiting more users to the project. The letter to inactive Rosetta participants will be a great step in this direction. If you have time, check out their excellent work and get involved!



©2024 University of Washington
https://www.bakerlab.org