Message boards : Rosetta@home Science : Comments/questions on Rosetta@home journal
Author | Message |
---|---|
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
With regard to making more CPUs available, I think the single biggest issue is the disparity between credits awarded to different machines, whether that's between OS's or from the use of 'optimised' BOINC clients. If that could be fixed somehow (and I'm not suggesting that's a small job!) then I believe there'd be a lot more interest from some of the big competitive teams. I read somewhere that that was TSCRussia's main reason for not joining here. Good idea! Any idea about who to contact at the BBC? |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Btw, HPF (running on IBM's World Community Grid and UD's grid.org) is soon going into Phase-II. It's using the Rosetta software, which we're helping improve here. It looks like they're now using the "full atom relax" mode? Yes--I just sent them the version we'd improved based on all of your results up to the "HBLR" series of runs. In contrast to HPF1, which used our older low resolution model, they will be doing full atom refinement on all structures. Rich Bonneau was a graduate student in my group and now has moved on to be a professor at NYU. I just talked to him today and he updated me on the hectic life of a starting assistant professor. He is going to come back to Seattle for a week next month so we can finalize the scientific report on HPF1. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I'd like to ask about the distribution of results on the RMSD / energy charts. It is small on this scale--for a high quality crystal structure less than 0.5-1. Dimitris--could you send me an email? |
Johnathon Send message Joined: 5 Nov 05 Posts: 120 Credit: 138,226 RAC: 0 |
Mr Baker, You asked about contacting the BBC? Go here: http://www.bbc.co.uk/sn/hottopics/climatechange/ and click on the "contact us" link in the left hand menu. (You may have to scroll down a bit). That'll let you contact the BBC science & nature team, re their climate change project. Unfortunately I can't give you a direct link, because of the way they're doing their feedback system. HTH Johnathon |
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
All those cool new top prediction plots sort of got me to fool around with what from the point of view of the experts probably is the silly idea of an amateur-biochemist. I will let you know anyway: ;-) The goal (e.g., for CASP7) seems to be to find the structures which are closest to the native structure in terms of rmsd for proteins where the native structure isn't known yet and the rmsd therefore can't be calculated directly. Since the native structure is thought to be the one with the lowest energy we therefore assume the lowest energy structure found by Rosetta to be our best prediction of the native structure. However, looking, e.g., at the energy vs. rmsd plot of 1mky, the correlation between energy and rmsd seems less than perfect. The point I want to make in this post is that the rms distance of each structure found by Rosetta to its nearest neighbour in the search space (rmsd_nearest) may in fact be a better predictor of rmsd to native (rmsd_native) than energy. A number of people on the forum have asked (e.g., Dimitris a few posts down in this thread) why the points on the top prediction plots don't cluster close to rmsd=0 but rather somewhere further to the right. This mainly is due to the large number of free parameters or in other words the high dimensionality of the search space, meaning that, due to the one-dimensional representation on the plot, there is _much_ more volume to be searched at higher rmsd values than at low values (there is 0 search volume at rmsd=0). If one could look at the distribution of 'points' in the multi-dimensional search space directly, the highest density of 'points' could very well be much closer to the native structure. 
In the remainder of this post I will try to discuss this in a quantitative way: If d is the number of free parameters (the number of dimensions of the search space), then the (d-1)-dimensional volume, V, that needs to be searched at each rmsd_native (the surface of a d-dimensional sphere) can be expressed as
V(rmsd_native) ~ rmsd_native^(d-1) [1]
(all constants omitted). Assuming that F(rmsd_native) is the density of structures along the rmsd_native axis (the number of points at each rmsd_native), then the density, p, of structures in the d-dimensional search space can be expressed as
p ~ F(rmsd_native)/V(rmsd_native) ~ F(rmsd_native) x rmsd_native^-(d-1) [2]
Let's now consider the rms distance of each structure to its nearest neighbour, rmsd_nearest: The volume in the d-dimensional search space corresponding to rmsd_nearest (volume of d-dimensional sphere with radius rmsd_nearest) is
V(rmsd_nearest) ~ rmsd_nearest^d [3]
(again omitting all constants). The expectation value of V(rmsd_nearest) must be inversely proportional to the local density of structures (if the density is reduced by a factor of two, the volume needs to be doubled to obtain the same probability for finding a structure in V):
V(rmsd_nearest) ~ 1/p [4]
which using [3] gives
rmsd_nearest ~ p^(-1/d) [5]
Combining equations [2] and [5], we now can relate rmsd_nearest to rmsd_native:
rmsd_nearest ~ F(rmsd_native)^(-1/d) x rmsd_native^(1-1/d) [6]
Since d is very large (> 100 ?), the first term, F(rmsd_native)^(-1/d), as well as the exponent of the second term, 1-1/d, are both very close to 1, leading to the approximate relationship
rmsd_nearest ~ rmsd_native [7]
i.e., rmsd_nearest is approximately proportional to rmsd_native, with a small downward hump where F(rmsd_native) is largest. The approximation should be better for larger d (larger proteins). So, it really seems like rmsd_nearest may be useful as a predictor of rmsd_native. 
I also calculated the count-statistics errors on rmsd_nearest (since this is already getting pretty long I will only give the results). In addition to the nearest neighbour (n=1) I also did this for the nth nearest neighbour which leads to somewhat smaller errors. For d=100 I find the following relative 90% error ranges: n=1: [0.974,1.014] n=5: [0.990,1.008] Even for n=1 the errors seem to be pretty small, much smaller than the scatter in the energy vs. rmsd_native plots; rmsd_nearest may thus indeed be a useful predictor of rmsd_native. Of course all of this relies on the assumption that the structures found by Rosetta, except for the F(rmsd_native) dependence, are more or less evenly distributed in the search space - which most likely is not the case. If the distribution of structures is sort of clumpy the scatter in the rmsd_nearest vs. rmsd_native relation may well be so large as to render it useless. I guess the only way to find out would be to plot rmsd_nearest vs. rmsd_native using real data (unless of course the experts tell me that all of this is silly and I made some obvious conceptual mistake ;-) ... |
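The closing suggestion in the post above, plotting rmsd_nearest against rmsd_native, can be tried on synthetic data before touching real decoys. The following is only a minimal sketch under loud assumptions: structures are treated as plain parameter vectors, the "native" is placed at the origin, and decoys are drawn from a smooth Gaussian cloud with no clumps. The function names (`rms_distance`, `nth_nearest_distance`, `simulate_correlation`) are made up for illustration and are not part of Rosetta.

```python
import math
import random


def rms_distance(a, b):
    """Rms distance between two structures given as equal-length parameter lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def nth_nearest_distance(points, n=1):
    """Rms distance from each structure to its n-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other structure, smallest first.
        others = sorted(rms_distance(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[n - 1])
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_correlation(n_structures=200, d=50, seed=0):
    """Correlation of rmsd_native with rmsd_nearest for synthetic decoys."""
    rng = random.Random(seed)
    # Assumption: native at the origin, decoys scattered smoothly around it.
    decoys = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_structures)]
    origin = [0.0] * d
    rmsd_native = [rms_distance(p, origin) for p in decoys]
    rmsd_nearest = nth_nearest_distance(decoys, n=1)
    return pearson(rmsd_native, rmsd_nearest)


if __name__ == "__main__":
    print(simulate_correlation())
```

Passing `n=5` to `nth_nearest_distance` gives the 5th-nearest-neighbour variant mentioned in the post. Real decoy sets are likely to be clumpy, so a result on this toy cloud says nothing about whether the relation survives on actual Rosetta output.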
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
All those cool new top prediction plots sort of got me to fool around with what from the point of view of the experts probably is the silly idea of an amateur-biochemist. I will let you know anyway: ;-) |
James Send message Joined: 8 Jan 06 Posts: 21 Credit: 11,697 RAC: 0 |
David, The new 'top prediction of the day' feature on the main page is great. I do think it will lead to a greater feeling of 'involvement' (which a lot of people are interested in - look only at stats, teams, etc). It is also interesting to see how close the predictions are on a daily basis. One further 'enhancement' to the 'top prediction' would be a profile of the user team merged into the graphical section of the 'further details' part (you're using two links). What I mean is that you should make it 1 link rather than 2, nix the user info save for a link to their computer and add their profile info at the top of the prediction graphics page. That way it would not distract from the main page 'user of the day' but also give some more info on who the user (or team in this case) actually is. It also reinforces that the top result is the result of the day. The two links are, perhaps, a bit 'inefficient'? If you could script it - it would likely be easier, given that this will be daily, to generate userinfo followed by the plots. People are vain - they like their profiles displayed. This way you have the user of the day and the prediction of the day and both get prominence (user of the day more so, really). Just a thought - but integration in the form of one link might be a good idea. That way people don't just click on the graphics. The BBC may or may not help you. The climate prediction deal was a big one, but you MUST speak to one of the people at climate prediction (Oxford folks I believe). They will pretty readily help you out I would think. They are the only way you will get noted in any real way on BBC or climateprediction.net's web site (they are also now running seasonal). They run the project and the site - convince them:) It won't take much. Also, perhaps ask to do a 'project update' on boinc synergy so you get a main page review. That will draw some users. 
This would be particularly useful for Ralph - many of the experimental projects need users that boinc synergy people are willing to go for. Troll for hosts there? I would have said it was a great idea to move it had I not been enjoying the island life for over a week:) Meanwhile I had a power outage here stopping all WUs for a week. I suppose that's the price I paid to take a vacation to a quiet island with no internet access. |
hugothehermit Send message Joined: 26 Sep 05 Posts: 238 Credit: 314,893 RAC: 0 |
If you look at the energy vs rmsd plot for 1di2, you will see that one amazing rosetta run produced a structure far lower in energy than any other, and has an rmsd of ~1.3 Å. As you can see from the superposition on the right, this structure is essentially identical to the native structure--a nearly perfect prediction. Does that mean that this result could be applied in the real world? I realise that this is an already known structure, but if it wasn't, could it be assumed to be so close to the correct structure that other scientists could use it with confidence? Is it within the x-ray crystallography error threshold? It is remarkable that exactly 1 run, out of the roughly 500,000 independent calculations all of you did, found the native minimum. With five fold less sampling, there would only be a one in five chance of having landed in the correct minimum, and rather than achieving an incredibly accurate prediction such as this one, the prediction would have been quite incorrect as the next lowest energy structures are quite a bit higher in rmsd. I've re-posted my ABC (Australia) request for a show as they didn't get back to me, but that still begs the question I asked in my E-mail, can the Rosetta@Home servers handle the load if they make and air a programme on Rosetta@Home? You were having difficulties with just us for a while, I would hate to think that a premature recruitment drive turned lots of people off because of limited server capabilities, if you think your servers can handle the load and they (the ABC) don't get back to me in the next week I'll give them a call to ask what's happening. Great news about the hit on the protein and great journal. Edit: added a few words to clarify, wish I could spell |
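The "one in five chance" figure quoted above can be checked with a short calculation. This is a rough sketch, not the project's own estimate: it assumes every run is independent and hits the native minimum with the same probability, taken here as 1 in 500,000 because exactly one hit was seen in roughly 500,000 runs. A single observed hit is a very noisy basis for that probability, so the numbers are only indicative.

```python
def prob_at_least_one_hit(n_runs, p_hit):
    """Chance that at least one of n_runs independent runs finds the minimum."""
    return 1.0 - (1.0 - p_hit) ** n_runs


# Assumed per-run hit probability: one hit observed in ~500,000 runs.
P_HIT = 1 / 500_000

full_sampling = prob_at_least_one_hit(500_000, P_HIT)   # about 0.63
fifth_sampling = prob_at_least_one_hit(100_000, P_HIT)  # about 0.18

if __name__ == "__main__":
    print(f"500,000 runs: {full_sampling:.3f}")
    print(f"100,000 runs: {fifth_sampling:.3f}")
```

With a fifth of the sampling the chance comes out near 0.18, which is consistent with the "one in five" in the quote. Under the same assumption, even the full 500,000 runs had only about a 63% chance of producing a hit.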
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
If you look at the energy vs rmsd plot for 1di2, you will see that one amazing rosetta run produced a structure far lower in energy than any other, and has an rmsd of ~1.3 Å. As you can see from the superposition on the right, this structure is essentially identical to the native structure--a nearly perfect prediction. |
Marky-UK Send message Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
(2). I'm a bit disappointed that the total cpu power has remained constant for the past weeks rather than increasing as it had up until recently. More users and hosts are joining every day, but this is not translating into increased computing capability. I know there's probably not much you can do about it, but I could probably double our team's throughput if the memory requirements of Rosetta weren't so high. I have a quad 2 GHz Xeon PC here that I can't run Rosetta on because it hasn't got enough memory to run four WUs at once, so it's crunching for a different project with a lower memory requirement (at the moment SIMAP). |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
There was a comment about how some of the participants are running with less than 512 megs of ram per cpu, and they're having more than a normal amount of failures with larger WUs. Is it possible to get the system Ram and number of system cpus - for each machine and then give out the larger WUs to only those with around 512Megs per cpu or higher? And if not.. do we have to ask the Boinc developers to add that possibility to the newer Boinc clients? |
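The rule proposed above (send the larger WUs only to machines with roughly 512 MB of RAM per CPU or more) is easy to state as code. The sketch below is purely illustrative: the `Host` record and every function name are hypothetical, and nothing here reflects how the BOINC scheduler actually stores or exposes host information.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host record; not a BOINC data structure."""
    name: str
    ram_mb: int
    n_cpus: int


def ram_per_cpu_mb(host):
    """RAM available per CPU if every CPU runs one work unit."""
    return host.ram_mb / host.n_cpus


def eligible_for_large_wu(host, min_mb_per_cpu=512):
    """True if the host meets the assumed per-CPU memory threshold."""
    return ram_per_cpu_mb(host) >= min_mb_per_cpu


def split_hosts(hosts, min_mb_per_cpu=512):
    """Partition hosts into (large-WU capable, small-WU only)."""
    large = [h for h in hosts if eligible_for_large_wu(h, min_mb_per_cpu)]
    small = [h for h in hosts if not eligible_for_large_wu(h, min_mb_per_cpu)]
    return large, small


if __name__ == "__main__":
    hosts = [Host("quad-1gb", 1024, 4), Host("quad-2gb", 2048, 4)]
    large, small = split_hosts(hosts)
    print([h.name for h in large], [h.name for h in small])
```

A four-CPU machine with 1 GB in total falls below the threshold at 256 MB per CPU, which is the kind of case described a few posts up. Whether the server can see RAM and CPU counts per host is exactly the open question in the post.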
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
(2). I'm a bit disappointed that the total cpu power has remained constant for the past weeks rather than increasing as it had up until recently. More users and hosts are joining every day, but this is not translating into increased computing capability. This is probably a damn fool question, but what happens if you create a custom config and use that to limit the number of CPUs? |
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
I had a good look at the new 1elw plot. Great ! The obvious question of course is: what causes the horizontal gaps ? Is that because of the peculiar shape of the protein ? I also noticed a cluster of three points at about 8 A rmsd with energies only slightly higher than the top prediction. I wonder whether these structures actually cluster in parameter space or just appear close in the plot? If the latter is the case this might be an example where analyzing the density of the structures in parameter space could help to distinguish the 'correct' from the 'wrong' low energy structures. Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky). Perhaps in the high energy, close structures most parts of the protein are at about the same energy as in the native structure with just a few outliers that drive the total energy up ? Looking at the low energy prediction of 1mky the reverse might be the case, with just a small part of the protein being very tightly bound and contributing very low energies to the total, while all of the rest is only loosely bound ? Assuming that it is possible to determine the energy contribution of each residue separately, I wonder whether it wouldn't make sense to get rid of the outliers (both high and low) when calculating the total energy (just for analyzing the results, not for minimizing the energy)? Combining this with the previous discussion on the distances of structures in parameter space, I am guessing that perhaps the median energy of the residues, divided by the rms distance of the structure to its nearest neighbor might be a promising predictor of rmsd to native ? 
Taking this one step further (again assuming that it is possible to determine the contribution of each residue to the total energy), wouldn't it make sense to study the distribution of these energies (say, number of residues with energies < E plotted vs. E)? Perhaps these distributions have characteristic properties for structures close to native? They might for example be relatively flat, such that each residue is bound equally well, giving the protein a stable shape? Perhaps the distributions can be parameterized in a simple way (power-law, exponential...?), providing an additional quantity (in addition to energy and distance of structures in parameter space) to characterize the structures ? Oh well, time to shut up - trying to be creative when one lacks essentially all the relevant background knowledge probably isn't that helpful. ;-) |
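The per-residue ideas in the post above (drop the outliers before summing, look at the median, and plot the number of residues with energy below E against E) could be prototyped along these lines. This is a sketch only: it assumes a per-residue energy list is available for each model, which the post itself flags as an open question, and the function names and the example numbers are invented for illustration.

```python
import bisect
import statistics


def trimmed_total_energy(residue_energies, n_trim=1):
    """Total energy after dropping the n_trim lowest and n_trim highest residues."""
    ordered = sorted(residue_energies)
    if 2 * n_trim >= len(ordered):
        raise ValueError("n_trim removes every residue")
    return sum(ordered[n_trim:len(ordered) - n_trim])


def median_energy(residue_energies):
    """Median per-residue energy, which ignores a few extreme residues."""
    return statistics.median(residue_energies)


def cumulative_counts(residue_energies, thresholds):
    """For each threshold E, the number of residues with energy strictly below E."""
    ordered = sorted(residue_energies)
    return [bisect.bisect_left(ordered, e) for e in thresholds]


if __name__ == "__main__":
    # Made-up energies: four well-placed residues and one bad outlier.
    energies = [-3.0, -2.0, -2.0, -1.0, 10.0]
    print(sum(energies), trimmed_total_energy(energies), median_energy(energies))
    print(cumulative_counts(energies, [-2.5, 0.0, 11.0]))
```

In the made-up example the single bad residue pulls the plain total up to 2.0, while the trimmed total (-5.0) and the median (-2.0) still reflect the well-placed residues. As the post says, this would only be for analyzing results, not for the energy that gets minimized.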
R/B Send message Joined: 8 Dec 05 Posts: 195 Credit: 28,095 RAC: 0 |
I really enjoy this thread and the D.Baker journal thread. I don't know of another BOINC project that provides this large amount of feedback from the project staff. Wonderful. Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I had a good look at the new 1elw plot. Great ! The obvious question of course is: what causes the horizontal gaps ? Is that because of the peculiar shape of the protein ? I also noticed a cluster of three points at about 8 A rmsd with energies only slightly higher than the top prediction. I wonder whether these structures actually cluster in parameter space or just appear close in the plot? If the latter is the case this might be an example where analyzing the density of the structures in parameter space could help to distinguish the 'correct' from the 'wrong' low energy structures. Again, very good ideas here! A graduate student in my group, Will Sheffler, is investigating the distributions of energies and interactions in native structure compared to the low energy structures you are generating. He may at some point give you a full description, but a clear result so far is that while native structures are pretty uniformly packed, with relatively low energies for all interactions, some of the low energy wrong models are much less uniformly packed, with clusters of atoms making very low energy interactions and others making relatively poor interactions. In this case, while such a (wrong) model may have an overall energy close to that of the native structure a histogram of the per residue energies would clearly distinguish the two. Will is working on two approaches to get at this--first, a model evaluation approach that explicitly takes these per residue interaction energy distributions into account, and second, improvements to rosetta which will penalize these overly compact portions of models that are now getting overly low energies. He should be testing the latter approach on rosetta@home in the next few weeks. |
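The point made in the reply above, that two models can share a total energy while their per-residue histograms look quite different, can be shown with a toy example. This is a hedged illustration with invented numbers; it is not Will Sheffler's evaluation method, and the helper names are hypothetical.

```python
import math
import statistics


def total_energy(residue_energies):
    """Sum of the per-residue energies."""
    return sum(residue_energies)


def energy_spread(residue_energies):
    """Population standard deviation of the per-residue energies."""
    return statistics.pstdev(residue_energies)


def histogram(residue_energies, bin_width=1.0):
    """Count residues per energy bin; keys are the lower bin edges."""
    counts = {}
    for energy in residue_energies:
        edge = math.floor(energy / bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up numbers: evenly packed model vs. one with over-tight and poor parts.
    uniform_model = [-2.0, -2.0, -2.0, -2.0]
    uneven_model = [-5.0, -5.0, 1.0, 1.0]
    for model in (uniform_model, uneven_model):
        print(total_energy(model), energy_spread(model), histogram(model))
```

Both toy models total -8.0, but the evenly packed one has zero spread while the uneven one splits into two separate bins. A simple spread measure like this is only one of many ways such a histogram could be summarized.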
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky)... WOW - very interesting ! I would be delighted to hear more about Will's work - whenever he has something definite to report (good luck with your work, Will, I am keeping my fingers crossed that these ideas will work out as intended !). |
will sheffler Send message Joined: 20 Mar 06 Posts: 3 Credit: 0 RAC: 0 |
Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky)... Hi Hoelderlin. I am indeed thinking about some of the things you mentioned. I agree that it's important to look at distributions of scores. This can help to see if a few sub-structures (residues, atoms, etc) have really bad scores and others are pretty good, or if sub-scores are pretty even overall. One could imagine that if there are a few really bad scores, the structure has some kind of serious flaw. An example of such a local problem which never happens in real proteins is a crack or hole. We haven't had very good luck picking out holes, and this is one case where breaking scores down by residue or even by atom is helpful. Just last Friday I was looking into a hole-detector based on scores for individual atoms. Take a look here if you are interested. http://www.gs.washington.edu/~wsheffle/boinc/ |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
Hi Hoelderlin. I am indeed thinking about some of the things you mentioned. I agree that it's important to look at distributions of scores. This can help to see if a few sub-structures (residues, atoms, etc) have really bad scores and others are pretty good, or if sub-scores are pretty even overall. One could imagine that if there are a few really bad scores, the structure has some kind of serious flaw. An example of such a local problem which never happens in real proteins is a crack or hole. We haven't had very good luck picking out holes, and this is one case where breaking scores down by residue or even by atom is helpful. Just last Friday I was looking into a hole-detector based on scores for individual atoms. Take a look here if you are interested. Correct me as you see fit.. but we're currently working with atoms or groups of atoms (residues) to create molecules. The current energy scoring is based on the atoms, but while it's producing great residues with low scores, we now need another energy scoring algorithm to score how the residues fit together and form the molecule. And while the ribbon and string visual representation of the molecule allows us to see the "this is good and this is bad" versions - it's tough (at least for me) to tell what's better or worse about the atom based models that follow. While I can't read the lower set of models - is what you're saying, pointing to basically creating a library of higher level residues, recognizing when we've come across a known residue and substituting the known structure - and then building the unknown molecule by making known and unknown residues fit together best? |
will sheffler Send message Joined: 20 Mar 06 Posts: 3 Credit: 0 RAC: 0 |
Correct me as you see fit.. but we're currently working with atoms or groups of atoms (residues) to create molecules. The current energy scoring is based on the atoms, but while it's producing great residues with low scores, we now need another energy scoring algorithm to score how the residues fit together and form the molecule. While I can't read the lower set of models - is what you're saying, pointing to basically creating a library of higher level residues, recognizing when we've come across a known residue and substituting the known structure - and then building the unknown molecule by making known and unknown residues fit together best? I'm sorry the models aren't very clear. The second set of pictures with all the atoms shown as large spheres is intended to show the hole/crack in the model with incorrect topology. In the middle of the protein - where the yellow and green meet - there's a fairly sizeable gap. It's definitely tough to see these kinds of features from a static 2D image, but you should be able to pick out a little spot in the right hand image where you can see entirely through the protein. These kinds of cracks/holes aren't something that happens very often in real proteins, but they are surprisingly hard to detect and prevent using just our current energy function. This isn't to say that our main energy function isn't good -- it's very good -- just that different methods of evaluating structures can be useful. |
R/B Send message Joined: 8 Dec 05 Posts: 195 Credit: 28,095 RAC: 0 |
You just hop on that radio show if you can, Dr. Baker. 700 radio stations with millions of listeners to your 1 hour phone interview has got to bring in new support. You'll pick up enthusiastic support; I assure you. Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers. |
©2024 University of Washington
https://www.bakerlab.org