Comments/questions on Rosetta@home journal

Author	Message
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 12034 - Posted: 15 Mar 2006, 4:45:25 UTC - in response to Message 12023.
Btw, HPF (running on IBM's World Community Grid and UD's grid.org) is soon going into Phase-II. It's using the Rosetta software, which we're helping improve here. It looks like they're now using the "full atom relax" mode?
Richard Bonneau, 13-Mar-2006 (source): "The proposed project (HPF phase-2) will refine, using Rosetta in a mode that accounts for greater atomic detail, the structures resulting from the first phase of the Human Proteome Folding Project (HPF phase1). The project will focus on human secreted proteins (proteins in the blood and the spaces between cells). These proteins can be important for signaling between cells and are often key markers for diagnosis. These proteins have even ended up being useful as drugs (when synthesized and given by doctors to people lacking the proteins). The project will also focus on key secreted pathogenic proteins. This project dovetails with efforts at the ISB in Seattle to support predictive, preventative and personalized medicine (under the assumption that these secreted proteins will be key elements of this medicine of the future). This project continues where the Human Proteome Folding Project leaves off. With the Human Proteome Folding Project we aimed to get protein function. With the second phase we would aim to increase the resolution of a select subset of human proteins. Better resolution is important for a number of applications including but not limited to virtual screening of drug targets with docking procedures and protein design. The second phase of the project will also serve to improve our understanding of the physics of protein structure and advance the state of the art in protein structure prediction (help us to further develop our program, Rosetta).
The two main objectives are to: 1) obtain higher resolution structures for specific human proteins and pathogen proteins and 2) further explore the limits of protein structure prediction by further developing Rosetta structure prediction. Thus, the project would address two very important parallel imperatives, one biological and one biophysical. The Human Proteome Folding Project Phase-2 will use the computer power of millions of computers to predict the shape of human proteins for which researchers currently know little. From this detailed shape scientists hope to learn about the function of these proteins, as the shape of proteins is inherently related to how they function in our bodies. This database of protein structures and putative functions will let scientists take the next steps in understanding how diseases that involve these proteins work. Proteins are the most important molecules in living beings. Just about everything in your body involves or is made out of proteins. Protein structure is key to understanding the functions of this diverse class of biomolecule. Thus we hope that our work on HPF 1 and HPF 2 will contribute to critical public infrastructure to the biological and biomedical community."
Yes--I just sent them the version we'd improved based on all of your results up to the "HBLR" series of runs. In contrast to HPF1, which used our older low resolution model, they will be doing full atom refinement on all structures. Rich Bonneau was a graduate student in my group and has now moved on to be a professor at NYU. I just talked to him today and he updated me on the hectic life of a starting assistant professor. He is going to come back to Seattle for a week next month so we can finalize the scientific report on HPF1.
ID: 12034

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 12041 - Posted: 15 Mar 2006, 5:41:46 UTC - in response to Message 12022.
I'd like to ask about the distribution of results on the RMSD / energy charts. Somehow I had imagined there'd be a "clustering" of those 5000 (I think that's the ones you're charting nowadays, right?) best results in the bottom left corner of the chart.
Yes, for the first few proteins Divya posted, this is what was observed. But this is a bigger protein, and finding the very lowest energy minimum (the correct structure) is like finding a needle in a haystack.
But, looking at the charts, it seems that most results stay far away (due to local minima keeping them away?) and only a few "lucky outliers" actually approach the global energy minimum. Using the "planet exploration analogy", it looks as if a handful of lucky explorers somehow "fall into a hole" and discover the lowest energy point. So this is why you need more explorers (CPUs).
Exactly! Wait until you see Divya's post for tonight--a case where one lucky explorer did land in the (correct) global minimum! For the 1tif case we just didn't have enough sampling to find the global minimum.
Also, maybe you could also plot the energy of the "native" (experimentally derived) structure on the charts, for reference, like you did in the past?
Good idea--we will start doing this again as it does make the problem clearer.
What is the error in experimentally (X-ray crystallography or NMR) derived structures?
It is small on this scale--for a high quality crystal structure, less than 0.5-1 Å.
Dimitris--could you send me an email?
ID: 12041

Johnathon Joined: 5 Nov 05 Posts: 120 Credit: 138,226 RAC: 0	Message 12046 - Posted: 15 Mar 2006, 8:05:51 UTC Last modified: 15 Mar 2006, 8:06:24 UTC
Mr Baker, you asked about contacting the BBC? Go here: http://www.bbc.co.uk/sn/hottopics/climatechange/ and click on the "contact us" link in the left hand menu. (You may have to scroll down a bit.) That'll let you contact the BBC science & nature team regarding their climate change project. Unfortunately I can't give you a direct link, because of the way they're doing their feedback system.
HTH
Johnathon
ID: 12046

Hoelder1in Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0	Message 12070 - Posted: 15 Mar 2006, 21:11:35 UTC Last modified: 15 Mar 2006, 21:20:13 UTC
All those cool new top prediction plots sort of got me to fool around with what, from the point of view of the experts, probably is the silly idea of an amateur biochemist. I will let you know anyway: ;-)
The goal (e.g., for CASP7) seems to be to find the structures which are closest to the native structure in terms of rmsd for proteins where the native structure isn't known yet and the rmsd therefore can't be calculated directly. Since the native structure is thought to be the one with the lowest energy, we therefore assume the lowest energy structure found by Rosetta to be our best prediction of the native structure. However, looking, e.g., at the energy vs. rmsd plot of 1mky, the correlation between energy and rmsd seems less than perfect. The point I want to make in this post is that the rms distance of each structure found by Rosetta to its nearest neighbour in the search space (rmsd_nearest) may in fact be a better predictor of rmsd to native (rmsd_native) than energy.
A number of people on the forum have asked (e.g., Dimitris a few posts down in this thread) why the points on the top prediction plots don't cluster close to rmsd=0 but rather somewhere further to the right. This mainly is due to the large number of free parameters or, in other words, the high dimensionality of the search space, meaning that, due to the one-dimensional representation on the plot, there is _much_ more volume to be searched at higher rmsd values than at low values (there is 0 search volume at rmsd=0). If one could look at the distribution of 'points' in the multi-dimensional search space directly, the highest density of 'points' could very well be much closer to the native structure.
In the remainder of this post I will try to discuss this in a quantitative way: If d is the number of free parameters (the number of dimensions of the search space), then the (d-1)-dimensional volume, V, that needs to be searched at each rmsd_native (the surface of a d-dimensional sphere) can be expressed as
V(rmsd_native) ~ rmsd_native^(d-1)   [1]
(all constants omitted). Assuming that F(rmsd_native) is the density of structures along the rmsd_native axis (the number of points at each rmsd_native), then the density, p, of structures in the d-dimensional search space can be expressed as
p ~ F(rmsd_native) / V(rmsd_native) ~ F(rmsd_native) x rmsd_native^(-(d-1))   [2]
Let's now consider the rms distance of each structure to its nearest neighbour, rmsd_nearest: The volume in the d-dimensional search space corresponding to rmsd_nearest (volume of a d-dimensional sphere with radius rmsd_nearest) is
V(rmsd_nearest) ~ rmsd_nearest^d   [3]
(again omitting all constants). The expectation value of V(rmsd_nearest) must be inversely proportional to the local density of structures (if the density is reduced by a factor of two, the volume needs to be doubled to obtain the same probability of finding a structure in V):
V(rmsd_nearest) ~ 1/p   [4]
which using [3] gives
rmsd_nearest ~ p^(-1/d)   [5]
Combining equations [2] and [5], we now can relate rmsd_nearest to rmsd_native:
rmsd_nearest ~ F(rmsd_native)^(-1/d) x rmsd_native^(1-1/d)   [6]
Since d is very large (> 100 ?), the first term, F(rmsd_native)^(-1/d), as well as the exponent of the second term, 1-1/d, are both very close to 1, leading to the approximate relationship
rmsd_nearest ~ rmsd_native   [7]
i.e., rmsd_nearest is approximately proportional to rmsd_native, with a small downward hump where F(rmsd_native) is largest. The approximation should be better for larger d (larger proteins). So, it really seems like rmsd_nearest may be useful as a predictor of rmsd_native.
I also calculated the count-statistics errors on rmsd_nearest (since this is already getting pretty long I will only give the results). In addition to the nearest neighbour (n=1) I also did this for the nth nearest neighbour, which leads to somewhat smaller errors. For d=100 I find the following relative 90% error ranges:
n=1: [0.974, 1.014]
n=5: [0.990, 1.008]
Even for n=1 the errors seem to be pretty small, much smaller than the scatter in the energy vs. rmsd_native plots; rmsd_nearest may thus indeed be a useful predictor of rmsd_native. Of course all of this relies on the assumption that the structures found by Rosetta, except for the F(rmsd_native) dependence, are more or less evenly distributed in the search space - which most likely is not the case. If the distribution of structures is sort of clumpy, the scatter in the rmsd_nearest vs. rmsd_native relation may well be so large as to render it useless. I guess the only way to find out would be to plot rmsd_nearest vs. rmsd_native using real data (unless of course the experts tell me that all of this is silly and I made some obvious conceptual mistake ;-) ...
ID: 12070
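
For anyone who wants to try the rmsd_nearest idea on real data, here is a minimal sketch of the quantity itself. It assumes a precomputed symmetric matrix of pairwise rmsd values between structures; the function name `nth_nearest_distance` and the toy matrix are made up for illustration and are not part of Rosetta.

```python
def nth_nearest_distance(pairwise, n=1):
    """Return, for each structure, the distance to its n-th nearest neighbour.

    pairwise is a symmetric square matrix (list of lists) of rmsd values,
    with pairwise[i][j] the rmsd between structures i and j.
    """
    result = []
    for i, row in enumerate(pairwise):
        # Sort the distances to all *other* structures, smallest first.
        others = sorted(d for j, d in enumerate(row) if j != i)
        result.append(others[n - 1])
    return result


# Toy example with three structures (hypothetical values).
toy = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
rmsd_nearest = nth_nearest_distance(toy, n=1)
rmsd_second = nth_nearest_distance(toy, n=2)
```

Plotting `rmsd_nearest` against rmsd_native for a real set of decoys would be the test proposed in the post; the sketch only computes the first of those two quantities.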

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 12086 - Posted: 16 Mar 2006, 6:59:41 UTC - in response to Message 12070. Last modified: 16 Mar 2006, 16:46:34 UTC
All those cool new top prediction plots sort of got me to fool around with what, from the point of view of the experts, probably is the silly idea of an amateur biochemist. I will let you know anyway: ;-)
The goal (e.g., for CASP7) seems to be to find the structures which are closest to the native structure in terms of rmsd for proteins where the native structure isn't known yet and the rmsd therefore can't be calculated directly. Since the native structure is thought to be the one with the lowest energy, we therefore assume the lowest energy structure found by Rosetta to be our best prediction of the native structure. However, looking, e.g., at the energy vs. rmsd plot of 1mky, the correlation between energy and rmsd seems less than perfect. The point I want to make in this post is that the rms distance of each structure found by Rosetta to its nearest neighbour in the search space (rmsd_nearest) may in fact be a better predictor of rmsd to native (rmsd_native) than energy.
Excellent points! In fact, you anticipated what I am trying out currently. I take the lowest energy subset of structures, measure the rmsd between each pair of structures, and identify those with the most other structures close to them (within an rmsd threshold of 1.5 to 3.0). I can then ask whether structures in these densely populated regions are more likely to be close to the native structure, and the answer, as you anticipate, is usually yes. Just today I found, going further, that if I predicted structures for not only the native sequence, but also 30 homologs, and selected those homologs with densely populated low energy regions, these were almost always the ones that had produced the best predictions.
This is very nice, because it suggests that the presence of densely populated low energy neighborhoods is an indicator of both model accuracy and prediction confidence. (If anyone is interested in playing with the actual data from these plots, just let me know and we will make it available.)
A number of people on the forum have asked (e.g., Dimitris a few posts down in this thread) why the points on the top prediction plots don't cluster close to rmsd=0 but rather somewhere further to the right. This mainly is due to the large number of free parameters or, in other words, the high dimensionality of the search space, meaning that, due to the one-dimensional representation on the plot, there is _much_ more volume to be searched at higher rmsd values than at low values (there is 0 search volume at rmsd=0). If one could look at the distribution of 'points' in the multi-dimensional search space directly, the highest density of 'points' could very well be much closer to the native structure.
In the remainder of this post I will try to discuss this in a quantitative way: If d is the number of free parameters (the number of dimensions of the search space), then the (d-1)-dimensional volume, V, that needs to be searched at each rmsd_native (the surface of a d-dimensional sphere) can be expressed as
V(rmsd_native) ~ rmsd_native^(d-1)   [1]
(all constants omitted). Assuming that F(rmsd_native) is the density of structures along the rmsd_native axis (the number of points at each rmsd_native), then the density, p, of structures in the d-dimensional search space can be expressed as
p ~ F(rmsd_native) / V(rmsd_native) ~ F(rmsd_native) x rmsd_native^(-(d-1))   [2]
Let's now consider the rms distance of each structure to its nearest neighbour, rmsd_nearest: The volume in the d-dimensional search space corresponding to rmsd_nearest (volume of a d-dimensional sphere with radius rmsd_nearest) is
V(rmsd_nearest) ~ rmsd_nearest^d   [3]
(again omitting all constants). The expectation value of V(rmsd_nearest) must be inversely proportional to the local density of structures (if the density is reduced by a factor of two, the volume needs to be doubled to obtain the same probability of finding a structure in V):
V(rmsd_nearest) ~ 1/p   [4]
which using [3] gives
rmsd_nearest ~ p^(-1/d)   [5]
Combining equations [2] and [5], we now can relate rmsd_nearest to rmsd_native:
rmsd_nearest ~ F(rmsd_native)^(-1/d) x rmsd_native^(1-1/d)   [6]
Since d is very large (> 100 ?), the first term, F(rmsd_native)^(-1/d), as well as the exponent of the second term, 1-1/d, are both very close to 1, leading to the approximate relationship
rmsd_nearest ~ rmsd_native   [7]
i.e., rmsd_nearest is approximately proportional to rmsd_native, with a small downward hump where F(rmsd_native) is largest. The approximation should be better for larger d (larger proteins). So, it really seems like rmsd_nearest may be useful as a predictor of rmsd_native.
I also calculated the count-statistics errors on rmsd_nearest (since this is already getting pretty long I will only give the results). In addition to the nearest neighbour (n=1) I also did this for the nth nearest neighbour, which leads to somewhat smaller errors. For d=100 I find the following relative 90% error ranges:
n=1: [0.974, 1.014]
n=5: [0.990, 1.008]
Even for n=1 the errors seem to be pretty small, much smaller than the scatter in the energy vs. rmsd_native plots; rmsd_nearest may thus indeed be a useful predictor of rmsd_native. Of course all of this relies on the assumption that the structures found by Rosetta, except for the F(rmsd_native) dependence, are more or less evenly distributed in the search space - which most likely is not the case. If the distribution of structures is sort of clumpy, the scatter in the rmsd_nearest vs. rmsd_native relation may well be so large as to render it useless. I guess the only way to find out would be to plot rmsd_nearest vs. rmsd_native using real data (unless of course the experts tell me that all of this is silly and I made some obvious conceptual mistake ;-) ...
ID: 12086
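
The neighbour-counting procedure described in the reply above (take the lowest energy subset, compare every pair, find the structures with the most close neighbours) can be sketched roughly as follows. This is only an illustration under assumptions: the function name `most_populated`, the subset size `n_low`, and the toy data are invented here, and the default threshold of 2.0 is simply a value inside the 1.5 to 3.0 range mentioned in the post. It is not the actual Rosetta clustering code.

```python
def most_populated(energies, pairwise, n_low=3, threshold=2.0):
    """Find the low-energy structure with the most low-energy neighbours.

    energies: list of total energies, one per structure.
    pairwise: symmetric matrix of rmsd values between structures.
    n_low: how many of the lowest-energy structures to keep (assumed value).
    threshold: rmsd cutoff for counting two structures as neighbours.
    """
    # Indices of the n_low lowest-energy structures.
    low = sorted(range(len(energies)), key=lambda i: energies[i])[:n_low]
    # For each kept structure, count the other kept structures within the cutoff.
    counts = {
        i: sum(1 for j in low if j != i and pairwise[i][j] <= threshold)
        for i in low
    }
    best = max(low, key=lambda i: counts[i])
    return best, counts


# Toy data: four structures; the last one has high energy and is filtered out.
energies = [-10.0, -9.0, -8.0, -1.0]
pairwise = [
    [0.0, 1.0, 5.0, 0.5],
    [1.0, 0.0, 1.5, 0.5],
    [5.0, 1.5, 0.0, 0.5],
    [0.5, 0.5, 0.5, 0.0],
]
best, counts = most_populated(energies, pairwise)
```

In this toy case structure 1 sits in the most densely populated low-energy neighbourhood even though structure 0 has the lowest energy, which is the situation the reply describes as informative.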

James Joined: 8 Jan 06 Posts: 21 Credit: 11,697 RAC: 0	Message 12123 - Posted: 17 Mar 2006, 3:00:28 UTC - in response to Message 12086.
David, the new 'top prediction of the day' feature on the main page is great. I do think it will lead to a greater feeling of 'involvement' (which a lot of people are interested in - just look at stats, teams, etc.). It is also interesting to see how close the predictions are on a daily basis.
One further 'enhancement' to the 'top prediction' would be a profile of the user or team merged into the graphical section of the 'further details' part (you're using two links). What I mean is that you should make it 1 link rather than 2, nix the user info save for a link to their computer, and add their profile info at the top of the prediction graphics page. That way it would not distract from the main page 'user of the day' but would also give some more info on who the user (or team in this case) actually is. It also reinforces that the top result is the result of the day. The two links are, perhaps, a bit 'inefficient'? If you could script it, it would likely be easier, given that this will be daily, to generate the user info followed by the plots. People are vain - they like their profiles displayed. This way you have the user of the day and the prediction of the day and both get prominence (user of the day more so, really). Just a thought - but integration in the form of one link might be a good idea. That way people don't just click on the graphics.
The BBC may or may not help you. The climate prediction deal was a big one, but you MUST speak to one of the people at climate prediction (Oxford folks, I believe). They will pretty readily help you out, I would think. They are the only way you will get noted in any real way on the BBC or on climateprediction.net's web site (they are also now running seasonal). They run the project and the site - convince them :) It won't take much.
Also, perhaps ask to do a 'project update' on BOINC Synergy so you get a main page review. That will draw some users. This would be particularly useful for Ralph - many of the experimental projects need users, and BOINC Synergy people are willing to go for that. Troll for hosts there? I would have said it was a great idea to move it had I not been enjoying the island life for over a week :) Meanwhile I had a power outage here stopping all WUs for a week. I suppose that's the price I paid to take a vacation to a quiet island with no internet access.
ID: 12123

hugothehermit Joined: 26 Sep 05 Posts: 238 Credit: 314,893 RAC: 0	Message 12128 - Posted: 17 Mar 2006, 6:17:23 UTC Last modified: 17 Mar 2006, 6:22:20 UTC
If you look at the energy vs rmsd plot for 1di2, you will see that one amazing rosetta run produced a structure far lower in energy than any other, and has an rmsd of ~1.3 Å. As you can see from the superposition on the right, this structure is essentially identical to the native structure--a nearly perfect prediction.
Does that mean that this result could be applied in the real world? I realise that this is an already known structure, but if it wasn't, could it be assumed to be so close to the correct structure that other scientists could use it with confidence? Is it within the x-ray crystallography error threshold?
It is remarkable that exactly 1 run, out of the roughly 500,000 independent calculations all of you did, found the native minimum. With five fold less sampling, there would only be a one in five chance of having landed in the correct minimum, and rather than achieving an incredibly accurate prediction such as this one, the prediction would have been quite incorrect as the next lowest energy structures are quite a bit higher in rmsd. Of course, a level of sampling for which just one trajectory lands in the native minimum is not adequate for reliably predicting structure--this is why I keep harping on the need for more cpu time. With ten times more sampling, we would expect ten hits in the native minimum and a much lower chance for failure. Indeed, for the preceding two proteins, 1mky and 1tif, which are somewhat bigger than 1di2, we did not have enough sampling, and the native minimum was not found.
I've re-posted my ABC (Australia) request for a show as they didn't get back to me, but that still begs the question I asked in my e-mail: can the Rosetta@Home servers handle the load if they make and air a programme on Rosetta@Home? You were having difficulties with just us for a while, and I would hate to think that a premature recruitment drive turned lots of people off because of limited server capabilities. If you think your servers can handle the load and they (the ABC) don't get back to me in the next week, I'll give them a call to ask what's happening.
Great news about the hit on the protein and great journal.
Edit: added a few words to clarify, wish I could spell
ID: 12128

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 12134 - Posted: 17 Mar 2006, 7:31:46 UTC - in response to Message 12128.
If you look at the energy vs rmsd plot for 1di2, you will see that one amazing rosetta run produced a structure far lower in energy than any other, and has an rmsd of ~1.3 Å. As you can see from the superposition on the right, this structure is essentially identical to the native structure--a nearly perfect prediction.
Does that mean that this result could be applied in the real world? I realise that this is an already known structure, but if it wasn't, could it be assumed to be so close to the correct structure that other scientists could use it with confidence? Is it within the x-ray crystallography error threshold?
It is remarkable that exactly 1 run, out of the roughly 500,000 independent calculations all of you did, found the native minimum. With five fold less sampling, there would only be a one in five chance of having landed in the correct minimum, and rather than achieving an incredibly accurate prediction such as this one, the prediction would have been quite incorrect as the next lowest energy structures are quite a bit higher in rmsd. Of course, a level of sampling for which just one trajectory lands in the native minimum is not adequate for reliably predicting structure--this is why I keep harping on the need for more cpu time. With ten times more sampling, we would expect ten hits in the native minimum and a much lower chance for failure. Indeed, for the preceding two proteins, 1mky and 1tif, which are somewhat bigger than 1di2, we did not have enough sampling, and the native minimum was not found.
I've re-posted my ABC (Australia) request for a show as they didn't get back to me, but that still begs the question I asked in my e-mail: can the Rosetta@Home servers handle the load if they make and air a programme on Rosetta@Home? You were having difficulties with just us for a while, and I would hate to think that a premature recruitment drive turned lots of people off because of limited server capabilities. If you think your servers can handle the load and they (the ABC) don't get back to me in the next week, I'll give them a call to ask what's happening.
Great news about the hit on the protein and great journal.
Thanks, Hugo. We can get new servers to split the load if necessary--it is a problem we would love to have. But I'm thinking we should get all the errors out of the system as much as possible before starting a big recruitment drive; I'm optimistic there will be real progress by the end of the month.
Edit: added a few words to clarify, wish I could spell
ID: 12134
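
The sampling argument quoted above (one hit in roughly 500,000 runs, about a one in five chance with five times fewer runs, about ten expected hits with ten times more) can be checked with a few lines of arithmetic. The sketch below assumes every run is independent and has the same chance of landing in the native minimum, and it takes that chance to be 1/500,000 purely because one hit was observed. That estimate from a single event is very uncertain, so the numbers are illustrative only; the function name is made up.

```python
import math


def prob_at_least_one_hit(p_hit, n_runs):
    """Probability that at least one of n_runs independent runs is a hit.

    Computes 1 - (1 - p_hit)**n_runs in a numerically stable way.
    """
    return -math.expm1(n_runs * math.log1p(-p_hit))


# Assumed per-run hit probability, estimated from 1 hit in ~500,000 runs.
p_hit = 1.0 / 500_000

# Five times less sampling: roughly a one in five chance of any hit.
p_fivefold_less = prob_at_least_one_hit(p_hit, 100_000)

# Ten times more sampling: about ten expected hits, failure becomes unlikely.
expected_hits = p_hit * 5_000_000
p_tenfold_more = prob_at_least_one_hit(p_hit, 5_000_000)
```

Under these assumptions the chance of missing the native minimum entirely drops from about 80% at 100,000 runs to well under 0.01% at 5,000,000 runs, which is the point the journal entry makes about needing more cpu time.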

Marky-UK Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0	Message 12136 - Posted: 17 Mar 2006, 8:18:40 UTC Last modified: 17 Mar 2006, 8:18:50 UTC
(2). I'm a bit disappointed that the total cpu power has remained constant for the past weeks rather than increasing as it had up until recently. More users and hosts are joining every day, but this is not translating into increased computing capability.
I know there's probably not much you can do about it, but I could probably double our team's throughput if the memory requirements of Rosetta weren't so high. I have a quad 2 GHz Xeon PC here that I can't run Rosetta on because it hasn't got enough memory to run four WUs at once, so it's crunching for a different project with a lower memory requirement (at the moment SIMAP).
ID: 12136

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 12160 - Posted: 17 Mar 2006, 19:56:12 UTC
There was a comment about how some of the participants are running with less than 512 MB of RAM per CPU, and they're having more than a normal amount of failures with larger WUs. Is it possible to get the system RAM and the number of system CPUs for each machine, and then give out the larger WUs only to those with around 512 MB per CPU or higher? And if not, do we have to ask the BOINC developers to add that possibility to the newer BOINC clients?
ID: 12160
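
The rule suggested in this post boils down to a simple per-host check. The sketch below is purely hypothetical: the `Host` record, its field names, and the function are invented for illustration and say nothing about how the BOINC scheduler actually stores host information or assigns work.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host record; not a BOINC data structure."""
    name: str
    ram_mb: int
    n_cpus: int


def eligible_for_large_wu(host, min_mb_per_cpu=512):
    """True if the host has at least min_mb_per_cpu of RAM per CPU."""
    return host.n_cpus > 0 and host.ram_mb / host.n_cpus >= min_mb_per_cpu


# Example hosts (made-up numbers).
hosts = [
    Host("quad_xeon", ram_mb=1024, n_cpus=4),   # 256 MB per CPU
    Host("desktop", ram_mb=1024, n_cpus=1),     # 1024 MB per CPU
    Host("laptop", ram_mb=512, n_cpus=1),       # 512 MB per CPU
]
large_wu_hosts = [h.name for h in hosts if eligible_for_large_wu(h)]
```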

dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0	Message 12178 - Posted: 18 Mar 2006, 3:15:14 UTC - in response to Message 12136.
(2). I'm a bit disappointed that the total cpu power has remained constant for the past weeks rather than increasing as it had up until recently. More users and hosts are joining every day, but this is not translating into increased computing capability.
I know there's probably not much you can do about it, but I could probably double our team's throughput if the memory requirements of Rosetta weren't so high. I have a quad 2 GHz Xeon PC here that I can't run Rosetta on because it hasn't got enough memory to run four WUs at once, so it's crunching for a different project with a lower memory requirement (at the moment SIMAP).
This is probably a damn fool question, but what happens if you create a custom config and use that to limit the number of CPUs?
ID: 12178

Hoelder1in Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0	Message 12196 - Posted: 18 Mar 2006, 17:26:40 UTC Last modified: 18 Mar 2006, 17:29:33 UTC
I had a good look at the new 1elw plot. Great! The obvious question of course is: what causes the horizontal gaps? Is that because of the peculiar shape of the protein? I also noticed a cluster of three points at about 8 Å rmsd with energies only slightly higher than the top prediction. I wonder whether these structures actually cluster in parameter space or just appear close in the plot? If the latter is the case, this might be an example where analyzing the density of the structures in parameter space could help to distinguish the 'correct' from the 'wrong' low energy structures.
Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky). Perhaps in the high energy, close structures most parts of the protein are at about the same energy as in the native structure, with just a few outliers that drive the total energy up? Looking at the low energy prediction of 1mky, the reverse might be the case, with just a small part of the protein being very tightly bound and contributing very low energies to the total, while all of the rest is only loosely bound? Assuming that it is possible to determine the energy contribution of each residue separately, I wonder whether it wouldn't make sense to get rid of the outliers (both high and low) when calculating the total energy (just for analyzing the results, not for minimizing the energy)?
Combining this with the previous discussion on the distances of structures in parameter space, I am guessing that perhaps the median energy of the residues, divided by the rms distance of the structure to its nearest neighbor, might be a promising predictor of rmsd to native? Taking this one step further (again assuming that it is possible to determine the contribution of each residue to the total energy), wouldn't it make sense to study the distribution of these energies (say, number of residues with energies < E plotted vs. E)? Perhaps these distributions have characteristic properties for structures close to native? They might for example be relatively flat, such that each residue is bound equally well, giving the protein a stable shape? Perhaps the distributions can be parameterized in a simple way (power-law, exponential...?), providing an additional quantity (in addition to energy and distance of structures in parameter space) to characterize the structures?
Oh well, time to shut up - trying to be creative when one lacks essentially all the relevant background knowledge probably isn't that helpful. ;-)
ID: 12196

R/B Joined: 8 Dec 05 Posts: 195 Credit: 28,095 RAC: 0	Message 12249 - Posted: 19 Mar 2006, 7:59:42 UTC
I really enjoy this thread and the D.Baker journal thread. I don't know of another BOINC project that provides this large amount of feedback from the project staff. Wonderful.
Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.
ID: 12249

David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 12314 - Posted: 20 Mar 2006, 5:02:18 UTC - in response to Message 12196. I had a good look at the new 1elw plot. Great ! The obvious question of course is: what causes the horizontal gaps ? Is that because of the peculiar shape of the protein ? I also noticed a cluster of three points at about 8 A rmsd with energies only slightly higher than the top prediction. I wonder whether these structures actually cluster in parameter space or just appear close in the plot? If the latter is the case this might be an example where analyzing the density of the structures in parameter space could help to distinguish the 'correct' from the 'wrong' low energy structures. Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky). Perhaps in the high energy, close structures most parts of the protein are at about the same energy as in the native strucuture with just a few outliers that drive the total energy up ? Looking at the low energy prediction of 1mky the reverse might be the case, with just a small part of the protein being very tightly bound and contributing very low energies to the total, while all of the rest is only loosely bound ? Assuming that it is possible to determine the energy contribution of each residue separately, I wonder whether it wouldn't make sense to get rid of the outliers (both high and low) when calculating the total energy (just for analyzing the results, not for minimizing the energy)? 
Combining this with the previous discussion on the distances of structures in parameter space, I am guessing that perhaps the median energy of the residues, divided by the rms distance of the structure to its nearest neighbor might be a promising predictor of rmsd to native ? Taking this one step further (again assuming that it is possible to determine the contribution of each residue to the total energy), wouldn't it make sense to study the distribution of these energies (say, number of residues with energies < E plotted vs. E)? Perhaps these distributions have characteristic properties for structures close to native? They might for example be relatively flat, such that each residue is bound equally well, giving the protein a stable shape? Perhaps the distributions can be parameterized in a simple way (power-law, exponential...?), providing an additional quantity (in addition to energy and distance of structures in parameter space) to characterize the structures ? Oh well, time to shut up - trying to be creative when one lacks essentially all the relevant background knowledge probably isn't that helpful. ;-) Again, very good ideas here! A graduate student in my group, Will Sheffler, is investigating the distributions of energies and interactions in native structures compared to the low energy structures you are generating. He may at some point give you a full description, but a clear result so far is that while native structures are pretty uniformly packed, with relatively low energies for all interactions, some of the low energy wrong models are much less uniformly packed, with clusters of atoms making very low energy interactions and others making relatively poor interactions. In this case, while such a (wrong) model may have an overall energy close to that of the native structure, a histogram of the per-residue energies would clearly distinguish the two. 
Will is working on two approaches to get at this: first, a model evaluation approach that explicitly takes these per-residue interaction energy distributions into account, and second, improvements to Rosetta that will penalize these overly compact portions of models that are now getting overly low energies. He should be testing the latter approach on Rosetta@home in the next few weeks. ID: 12314 · Rating: 0 · rate: /
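To make the per-residue idea from the exchange above concrete, here is a minimal sketch in Python. It is only an illustration, not Rosetta code and not Will's method: the energy values are invented, the units are arbitrary, and the names `summarize` and `cumulative_counts` are hypothetical. It compares two made-up models with the same total energy, one uniformly packed and one with a few very low energy residues next to poorly packed ones, using the quantities suggested in the question (median, a total with outliers trimmed, and the number of residues with energy below E).

```python
from statistics import median, pstdev

def summarize(per_residue):
    """Summary statistics for a list of per-residue energies (invented values)."""
    values = sorted(per_residue)
    k = max(1, len(values) // 10)  # drop the lowest and highest 10% as outliers
    return {
        "total": sum(values),
        "median": median(values),
        "spread": pstdev(values),          # how uneven the packing is
        "trimmed_total": sum(values[k:-k]),
    }

def cumulative_counts(per_residue, thresholds):
    """Number of residues with energy < E, for each E in thresholds."""
    return [sum(1 for e in per_residue if e < t) for t in thresholds]

# Two hypothetical 10-residue models with the same total energy.
uniform_model = [-2.0, -2.1, -1.9, -2.0, -2.2, -1.8, -2.0, -2.1, -1.9, -2.0]
uneven_model = [-5.0, -4.8, -5.2, -0.2, 0.1, -0.3, -4.9, 0.2, 0.0, 0.1]

for name, model in (("uniform", uniform_model), ("uneven", uneven_model)):
    print(name, summarize(model), cumulative_counts(model, [-4, -1, 1]))
```

In this toy case the totals agree, so total energy alone cannot separate the two models, while the spread and the cumulative counts differ sharply. Whether the same holds for real decoys is exactly what the analysis described above would have to show.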

Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0	Message 12315 - Posted: 20 Mar 2006, 5:40:53 UTC - in response to Message 12314. Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky)... Again, very good ideas here! A graduate student in my group, Will Sheffler, is investigating the distributions of energies and interactions in native structures compared to the low energy structures you are generating. He may at some point give you a full description, but a clear result so far is that while native structures are pretty uniformly packed, with relatively low energies for all interactions, some of the low energy wrong models are much less uniformly packed, with clusters of atoms making very low energy interactions and others making relatively poor interactions. In this case, while such a (wrong) model may have an overall energy close to that of the native structure, a histogram of the per-residue energies would clearly distinguish the two. Will is working on two approaches to get at this: first, a model evaluation approach that explicitly takes these per-residue interaction energy distributions into account, and second, improvements to Rosetta that will penalize these overly compact portions of models that are now getting overly low energies. He should be testing the latter approach on Rosetta@home in the next few weeks. WOW - very interesting ! I would be delighted to hear more about Will's work - whenever he has something definite to report (good luck with your work, Will, I am keeping my fingers crossed that these ideas will work out as intended !). ID: 12315 · Rating: 0 · rate: /

BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 12347 - Posted: 20 Mar 2006, 20:15:11 UTC - in response to Message 12343. Hi Hoelderlin. I am indeed thinking about some of the things you mentioned. I agree that it's important to look at distributions of scores. This can help to see if a few sub-structures (residues, atoms, etc) have really bad scores and others are pretty good, or if sub-scores are pretty even overall. One could imagine that if there are a few really bad scores, the structure has some kind of serious flaw. An example of such a local problem which never happens in real proteins is a crack or hole. We haven't had very good luck picking out holes, and this is one case where breaking scores down by residue or even by atom is helpful. Just last Friday I was looking into a hole-detector based on scores for individual atoms. Take a look here if you are interested. http://www.gs.washington.edu/~wsheffle/boinc/ Correct me as you see fit, but we're currently working with atoms or groups of atoms (residues) to create molecules. The current energy scoring is based on the atoms, but while it's producing great residues with low scores, we now need another energy scoring algorithm to score how the residues fit together and form the molecule. And while the ribbon and string visual representation of the molecule allows us to see the "this is good and this is bad" versions - it's tough (at least for me) to tell what's better or worse about the atom based models that follow. While I can't read the lower set of models - is what you're saying pointing to basically creating a library of higher level residues, recognizing when we've come across a known residue and substituting the known structure - and then building the unknown molecule by making known and unknown residues fit together best? ID: 12347 · Rating: 0 · rate: /

R/B Send message Joined: 8 Dec 05 Posts: 195 Credit: 28,095 RAC: 0	Message 12405 - Posted: 21 Mar 2006, 8:01:04 UTC You just hop on that radio show if you can, Dr. Baker. 700 radio stations with millions of listeners to your 1 hour phone interview has got to bring in new support. You'll pick up enthusiastic support; I assure you. Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers. ID: 12405 · Rating: 0 · rate: /

Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0	Message 12406 - Posted: 21 Mar 2006, 8:15:59 UTC - in response to Message 12343. Hi Hoelderlin. I am indeed thinking about some of the things you mentioned. I agree that it's important to look at distributions of scores. This can help to see if a few sub-structures (residues, atoms, etc) have really bad scores and others are pretty good, or if sub-scores are pretty even overall. One could imagine that if there are a few really bad scores, the structure has some kind of serious flaw. An example of such a local problem which never happens in real proteins is a crack or hole. We haven't had very good luck picking out holes, and this is one case where breaking scores down by residue or even by atom is helpful. Just last Friday I was looking into a hole-detector based on scores for individual atoms. Take a look here if you are interested. http://www.gs.washington.edu/~wsheffle/boinc/ Hi Will, I had a look at your web page. Intriguing ! Two things I wanted to mention: Could it be that despite having a lower total energy the prediction on the right sits in a shallower local minimum (less energy needed to change the shape) than the more tightly packed structure on the left ? Also, I am not sure how the interaction with the surrounding medium (water) is treated in the energy calculation (I seem to remember something about implicit and explicit solvent models). Would the fact that the hole seems just large enough for individual water molecules (about the size of your 2.4 A probe) to fit through have any effect on the energy calculation ? ID: 12406 · Rating: 0 · rate: /

Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0	Message 12538 - Posted: 23 Mar 2006, 0:45:57 UTC - in response to Message 12196. Hi Hoelderlin! You also had a good question about the striking gaps in the 1elw plot... we've been running Rosetta with "score filters". If a client gets about halfway through a run and doesn't make it below a certain energy, we stop the run -- we'd rather you use your computer cycles to start another simulation instead of spending time on a simulation that will likely not get a very low energy! This happens again one more time during the simulation; hence the two gaps in the plot. I had a good look at the new 1elw plot. Great ! The obvious question of course is: what causes the horizontal gaps ? Is that because of the peculiar shape of the protein ? ID: 12538 · Rating: 0 · rate: /
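For readers who want to picture the "score filter" idea described above, here is a rough Python sketch. It is an assumption-laden illustration rather than Rosetta's actual implementation: the function name `run_with_filters`, the checkpoint steps, the cutoff values and the energy trajectories are all made up. The only thing taken from the post is the logic: at two checkpoints during a run, the run is stopped if its energy is not below a cutoff.

```python
def run_with_filters(trajectory, filters):
    """Walk through a run and abort early at a checkpoint that is not passed.

    trajectory: list of energies, one per step (invented numbers).
    filters: dict mapping a step index to an energy cutoff; the run
             continues past that step only if its energy is below the cutoff.
    Returns (completed, steps_used, last_energy).
    """
    for step, energy in enumerate(trajectory):
        cutoff = filters.get(step)
        if cutoff is not None and energy >= cutoff:
            return False, step + 1, energy  # stop here, free the CPU for a new run
    return True, len(trajectory), trajectory[-1]

# Two hypothetical checkpoints: one about halfway, one later in the run.
filters = {4: -50.0, 7: -80.0}

good = [0.0, -20.0, -40.0, -55.0, -60.0, -70.0, -80.0, -90.0, -95.0, -100.0]
slow = [0.0, -5.0, -10.0, -20.0, -30.0, -35.0, -40.0, -45.0, -50.0, -55.0]
stalls = [0.0, -20.0, -40.0, -55.0, -60.0, -65.0, -70.0, -75.0, -78.0, -79.0]

for name, traj in (("good", good), ("slow", slow), ("stalls", stalls)):
    print(name, run_with_filters(traj, filters))
```

In this sketch the "slow" run is dropped at the first checkpoint and the "stalls" run at the second, so only the "good" run spends the full ten steps.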

Sean Kiely Send message Joined: 31 Jan 06 Posts: 65 Credit: 43,992 RAC: 0	Message 12577 - Posted: 23 Mar 2006, 17:19:14 UTC Hi Will: Thank you for the interesting details about the issue of implicit vs. explicit solvation. Would there be any possible benefit to modeling explicit solvation (very sparsely) during processing? Maybe once or twice partway through especially promising structures? I wonder if it might keep us from wandering onto paths where subtle solvation issues keep us away from achieving best energy minima? Sort of a "solvation sanity check"? I recognise, of course, that modeling complexity issues or sheer processing costs may make this unworkable! Sean ID: 12577 · Rating: 0 · rate: /