Comments/questions on Rosetta@home journal

Message boards : Rosetta@home Science : Comments/questions on Rosetta@home journal



David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12032 - Posted: 15 Mar 2006, 4:41:01 UTC - in response to Message 12021.  

With regard to making more CPUs available, I think the single biggest issue is the disparity between credits awarded to different machines, whether that's between OSes or from the use of 'optimised' BOINC clients. If that could be fixed somehow (and I'm not suggesting that's a small job!) then I believe there'd be a lot more interest from some of the big competitive teams. I read somewhere that that was TSCRussia's main reason for not joining here.

It might also be worth getting in touch with the BBC to get a mention on their website regarding their BOINC-based climate change project. I expect there'll be literally millions of people reading about their project, and many of those won't fit the criteria for that project but would be welcomed here (I think they want PCs that are on pretty much 24/7). I'd expect the BBC to be pretty good like that.

HTH
Danny


good idea! any idea about who to contact at the BBC?
ID: 12032
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12034 - Posted: 15 Mar 2006, 4:45:25 UTC - in response to Message 12023.  

Btw, HPF (running on IBM's World Community Grid and UD's grid.org) is soon going into Phase-II. It's using the Rosetta software, which we're helping improve here. It looks like they're now using the "full atom relax" mode?

Richard Bonneau, 13-Mar-2006 (source):
"The proposed project (HPF phase-2) will refine, using Rosetta in a mode that accounts for greater atomic detail, the structures resulting from the first phase of the Human Proteome Folding Project (HPF phase1). The project will focus on human secreted proteins (proteins in the blood and the spaces between cells). These proteins can be important for signaling between cells and are often key markers for diagnosis. These proteins have even ended up being useful as drugs (when synthesized and given by doctors to people lacking the pro-teins). The project will also focus on key secreted pathogenic protein. This project dove-tails with efforts at the ISB in Seattle to support predictive, preventative and personalized medicine (under the assumption that these secreted proteins will be key elements of this medicine of the future).

This project continues where the Human Proteome Folding Project leaves off. With the Human Proteome Folding Project we aimed to get protein function. With the second phase we would aim to increase the resolution of a select subset of human proteins. Better resolution is important for a number of applications including but not limited to virtual screening of drug targets with docking procedures and protein design. The second phase of the project will also serve to improve our understanding of the physics of protein structure and advance the state of the art in protein structure prediction (help us to further develop our program, Rosetta).

The two main objectives are to: 1) obtain higher resolution structures for specific human proteins and pathogen proteins and 2) further explore the limits of protein structure prediction by further developing Rosetta structure prediction. Thus, the project would address two very important parallel imperatives, one biological and one biophysical.

The Human Proteome Folding Project Phase-2 will use the computer power of millions of computers to predict the shape of human proteins about which researchers currently know little. From this detailed shape scientists hope to learn about the function of these proteins, as the shape of proteins is inherently related to how they function in our bodies. This database of protein structures and putative functions will let scientists take the next steps in understanding how diseases that involve these proteins work. Proteins are the most important molecules in living beings. Just about everything in your body involves or is made out of proteins. Protein structure is key to understanding the functions of this diverse class of biomolecule. Thus we hope that our work on HPF 1 and HPF 2 will contribute critical public infrastructure to the biological and biomedical community."




Yes--I just sent them the version we'd improved based on all of your results up to the "HBLR" series of runs. in contrast to hpf1, which used our older low resolution model, they will be doing full atom refinement on all structures.

Rich Bonneau was a graduate student in my group and now has moved on to be a professor at NYU. I just talked to him today and he updated me on the hectic life of a starting assistant professor. He is going to come back to Seattle for a week next month so we can finalize the scientific report on HPF1.

ID: 12034
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12041 - Posted: 15 Mar 2006, 5:41:46 UTC - in response to Message 12022.  

I'd like to ask about the distribution of results on the RMSD / energy charts.



Somehow I had imagined there'd be a "clustering" of those 5000 (I think that's the ones you're charting nowadays, right?) best results in the bottom left corner of the chart.

Yes, for the first few proteins Divya posted, this is what was observed. but this is a bigger protein, and finding the very lowest energy minimum (the correct structure) is like finding a needle in a haystack.

But, looking at the charts, it seems that most results stay far away (due to local minima keeping them away?) and only a few "lucky outliers" actually approach the global energy minimum. Using the "planet exploration analogy", it looks as if a handful of lucky explorers somehow "fall into a hole" and discover the lowest energy point.

So this is why you need more explorers (CPUs).

Exactly! wait until you see Divya's post for tonight--a case where one lucky explorer did land in the (correct) global minimum! for the 1tif case we just didn't have enough sampling to find the global minimum.

Also, maybe you could plot the energy of the "native" (experimentally derived) structure on the charts, for reference, like you did in the past?

good idea--we will start doing this again as it does make the problem clearer.


What is the error in experimentally (X-ray crystallography or NMR) derived structures?


it is small on this scale--for a high quality crystal structure, less than 0.5-1 Å.

Dimitris--could you send me an email?

ID: 12041
Johnathon

Joined: 5 Nov 05
Posts: 120
Credit: 138,226
RAC: 0
Message 12046 - Posted: 15 Mar 2006, 8:05:51 UTC
Last modified: 15 Mar 2006, 8:06:24 UTC

Mr Baker,
You asked about contacting the BBC?
go here: http://www.bbc.co.uk/sn/hottopics/climatechange/
and click on the "contact us" link in the left hand menu.
(You may have to scroll down a bit).
That'll let you contact the BBC science & nature team, re their climate change project.
Unfortunately I can't give you a direct link, because of the way they're doing their feedback system.

HTH

Johnathon
ID: 12046
Hoelder1in

Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 12070 - Posted: 15 Mar 2006, 21:11:35 UTC
Last modified: 15 Mar 2006, 21:20:13 UTC

All those cool new top prediction plots sort of got me to fool around with what, from the experts' point of view, is probably a silly amateur-biochemist idea. I will let you know anyway: ;-)

The goal (e.g., for CASP7) seems to be to find the structures which are closest to the native structure in terms of rmsd, for proteins where the native structure isn't known yet and the rmsd therefore can't be calculated directly. Since the native structure is thought to be the one with the lowest energy, we therefore assume the lowest energy structure found by Rosetta to be our best prediction of the native structure. However, looking, e.g., at the energy vs. rmsd plot of 1mky, the correlation between energy and rmsd seems less than perfect. The point I want to make in this post is that the rms distance of each structure found by Rosetta to its nearest neighbour in the search space (rmsd_nearest) may in fact be a better predictor of rmsd to native (rmsd_native) than energy.

A number of people on the forum have asked (e.g., Dimitris a few posts down in this thread) why the points on the top prediction plots don't cluster close to rmsd=0 but rather somewhere further to the right. This mainly is due to the large number of free parameters or in other words the high dimensionality of the search space, meaning that, due to the one-dimensional representation on the plot, there is _much_ more volume to be searched at higher rmsd values than at low values (there is 0 search volume at rmsd=0). If one could look at the distribution of 'points' in the multi-dimensional search space directly, the highest density of 'points' could very well be much closer to the native structure. In the remainder of this post I will try to discuss this in a quantitative way:

If d is the number of free parameters (the number of dimensions of the search space), then the (d-1)-dimensional volume, V, that needs to be searched at each rmsd_native (the surface of a d-dimensional sphere) can be expressed as

V(rmsd_native) ~ rmsd_native^(d-1) [1] (all constants omitted)

Assuming that F(rmsd_native) is the density of structures along the rmsd_native axis (the number of points at each rmsd_native), then the density, p, of structures in the d-dimensional search space can be expressed as

p ~ F(rmsd_native)/V(rmsd_native) ~ F(rmsd_native) x rmsd_native^-(d-1) [2]

Let's now consider the rms distance of each structure to its nearest neighbour, rmsd_nearest: The volume in the d-dimensional search space corresponding to rmsd_nearest (volume of d-dimensional sphere with radius rmsd_nearest) is

V(rmsd_nearest) ~ rmsd_nearest^d [3] (again omitting all constants)

The expectation value of V(rmsd_nearest) must be inversely proportional to the local density of structures (if the density is reduced by a factor of two, the volume needs to be doubled to obtain the same probability of finding a structure in V):

V(rmsd_nearest) ~ 1/p [4]

which using [3] gives

rmsd_nearest ~ p^(-1/d) [5]

Combining equation [2] and [5], we now can relate rmsd_nearest to rmsd_native:

rmsd_nearest ~ F(rmsd_native)^(-1/d) x rmsd_native^(1-1/d) [6]

Since d is very large (> 100 ?), the first term, F(rmsd_native)^(-1/d), as well as the exponent of the second term, 1-1/d, are both very close to 1, leading to the approximate relationship

rmsd_nearest ~ rmsd_native [7]

i.e., rmsd_nearest is approximately proportional to rmsd_native, with a small downward hump where F(rmsd_native) is largest. The approximation should be better for larger d (larger proteins). So it really seems like rmsd_nearest may serve as a predictor of rmsd_native.

I also calculated the count-statistics errors on rmsd_nearest (since this is already getting pretty long I will only give the results). In addition to the nearest neighbour (n=1) I also did this for the nth nearest neighbour which leads to somewhat smaller errors. For d=100 I find the following relative 90% error ranges:

n=1: [0.974,1.014]
n=5: [0.990,1.008]

Even for n=1 the errors seem to be pretty small, much smaller than the scatter in the energy vs. rmsd_native plots; rmsd_nearest may thus indeed be a useful predictor of rmsd_native.

Of course all of this relies on the assumption that the structures found by Rosetta, except for the F(rmsd_native) dependence, are more or less evenly distributed in the search space - which most likely is not the case. If the distribution of structures is sort of clumpy, the scatter in the rmsd_nearest vs. rmsd_native relation may well be so large as to render it useless. I guess the only way to find out would be to plot rmsd_nearest vs. rmsd_native using real data (unless of course the experts tell me that all of this is silly and I made some obvious conceptual mistake ;-) ...)
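[Editor's note: as a quick numerical sanity check on relation [6] above - my own sketch, not project code, with a made-up function name - one can evaluate the prefactor and exponent for a few values of d and watch the relation become nearly linear:]

```python
def nearest_neighbour_scaling(F, rmsd_native, d):
    """Relation [6]: rmsd_nearest ~ F^(-1/d) * rmsd_native^(1 - 1/d).

    F is the density of structures along the rmsd_native axis and d the
    dimensionality of the search space (all constants omitted, as in the
    post, so only the shape of the relation is meaningful).
    """
    return F ** (-1.0 / d) * rmsd_native ** (1.0 - 1.0 / d)

# As d grows, both the prefactor and the exponent approach 1, so
# rmsd_nearest tracks rmsd_native almost proportionally:
for d in (10, 100, 1000):
    print(d, round(nearest_neighbour_scaling(F=500.0, rmsd_native=8.0, d=d), 2))
```

For d=10 the prefactor still distorts the relation noticeably, but by d=100 (a plausible scale for protein torsion-angle spaces) the predicted rmsd_nearest is already within about 10% of rmsd_native itself.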
ID: 12070
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12086 - Posted: 16 Mar 2006, 6:59:41 UTC - in response to Message 12070.  
Last modified: 16 Mar 2006, 16:46:34 UTC

All those cool new top prediction plots sort of got me to fool around with what, from the experts' point of view, is probably a silly amateur-biochemist idea. I will let you know anyway: ;-)

The goal (e.g., for CASP7) seems to be to find the structures which are closest to the native structure in terms of rmsd, for proteins where the native structure isn't known yet and the rmsd therefore can't be calculated directly. Since the native structure is thought to be the one with the lowest energy, we therefore assume the lowest energy structure found by Rosetta to be our best prediction of the native structure. However, looking, e.g., at the energy vs. rmsd plot of 1mky, the correlation between energy and rmsd seems less than perfect. The point I want to make in this post is that the rms distance of each structure found by Rosetta to its nearest neighbour in the search space (rmsd_nearest) may in fact be a better predictor of rmsd to native (rmsd_native) than energy.






Excellent points! in fact, you anticipated what I am trying out currently. I take the lowest energy subset of structures, measure the rmsd between each pair of structures, and identify those with the most other structures close to them (within an rmsd threshold of 1.5 to 3.0).
I can then ask whether structures in these densely populated regions are more likely to be close to the native structure, and the answer as you anticipate is usually yes. Just today I found, going further, that if I predicted structures for not only the native sequence, but also 30 homologs, and selected those homologs with densely populated low energy regions, these were almost always the ones that had produced the best predictions. This is very nice, because it suggests that the presence of densely populated low energy neighborhoods is an indicator of both model accuracy and prediction confidence.

(if anyone is interested in playing with the actual data from these plots just let me know and we will make it available)
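[Editor's note: a minimal sketch of the neighbour-counting procedure described above - my own illustration, not the actual Rosetta analysis code. It also assumes the structures are already superimposed; a real implementation would first do an optimal superposition (e.g. the Kabsch algorithm) before computing each pairwise rmsd.]

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) coordinates, assumed to be already superimposed."""
    sq = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(sq / len(a))

def rank_by_neighbour_density(structures, threshold=2.0):
    """For each structure, count how many others lie within `threshold`
    rmsd, and return the indices sorted densest-first.  The idea from
    the post: structures sitting in densely populated low-energy
    regions are more likely to be close to the native structure."""
    counts = [sum(1 for j, t in enumerate(structures)
                  if i != j and rmsd(s, t) <= threshold)
              for i, s in enumerate(structures)]
    return sorted(range(len(structures)), key=lambda i: -counts[i])
```

Applied to the lowest-energy subset with a threshold in the 1.5-3.0 Å range mentioned above, the top-ranked indices would pick out the densest low-energy neighbourhoods.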
ID: 12086
James

Joined: 8 Jan 06
Posts: 21
Credit: 11,697
RAC: 0
Message 12123 - Posted: 17 Mar 2006, 3:00:28 UTC - in response to Message 12086.  

David,
The new 'top prediction of the day' feature on the main page is great. I do think it will lead to a greater feeling of 'involvement' (which a lot of people are interested in - just look at the stats, teams, etc.). It is also interesting to see how close the predictions are on a daily basis.

One further 'enhancement' to the 'top prediction' would be a profile of the user team merged into the graphical section of the 'further details' part (you're using two links).

What I mean is that you should make it 1 link rather than 2, nix the user info save for a link to their computer and add their profile info at the top of the prediction graphics page. That way it would not distract from the main page 'user of the day' but also give some more info on who the user (or team in this case) actually is.

It also reinforces that the top result is the result of the day. The two links are, perhaps, a bit 'inefficient'? If you could script it - it would likely be easier, given that this will be daily, to generate userinfo followed by the plots.

People are vain - they like their profiles displayed. This way you have the user of the day and the prediction of the day and both get prominence (user of the day more so, really).

Just a thought - but integration in the form of one link might be a good idea. That way people don't just click on the graphics.

The BBC may or may not help you. The climate prediction deal was a big one, but you MUST speak to one of the people at climateprediction.net (Oxford folks, I believe). They will pretty readily help you out, I would think. They are the only way you will get noted in any real way on the BBC's or climateprediction.net's web site (they are also now running seasonal). They run the project and the site - convince them:) It won't take much.

Also, perhaps ask to do a 'project update' on BOINC Synergy so you get a main page review. That will draw some users. This would be particularly useful for Ralph - many of the experimental projects need users that BOINC Synergy people are willing to go for. Troll for hosts there?

I would have said it was a great idea to move it had I not been enjoying the island life for over a week:) Meanwhile I had a power outage here stopping all WUs for a week. I suppose that's the price I paid to take a vacation to a quiet island with no internet access.


ID: 12123
hugothehermit

Joined: 26 Sep 05
Posts: 238
Credit: 314,893
RAC: 0
Message 12128 - Posted: 17 Mar 2006, 6:17:23 UTC
Last modified: 17 Mar 2006, 6:22:20 UTC

If you look at the energy vs rmsd plot for 1di2, you will see that one amazing rosetta run produced a structure far lower in energy than any other, and has an rmsd of ~1.3 Å. As you can see from the superposition on the right, this structure is essentially identical to the native structure--a nearly perfect prediction.


Does that mean that this result could be applied in the real world? I realise that this is an already known structure, but if it wasn't, could it be assumed to be so close to the correct structure that other scientists could use it with confidence? Is it within the x-ray crystallography error threshold?


It is remarkable that exactly 1 run, out of the roughly 500,000 independent calculations all of you did, found the native minimum. With fivefold less sampling, there would only be a one in five chance of having landed in the correct minimum, and rather than achieving an incredibly accurate prediction such as this one, the prediction would have been quite incorrect, as the next lowest energy structures are quite a bit higher in rmsd.

Of course, a level of sampling for which just one trajectory lands in the native minimum is not adequate for reliably predicting structure--this is why I keep harping on the need for more cpu time. with ten times more sampling, we would expect ten hits in the native minimum and a much lower chance for failure. Indeed, for the preceding two proteins, 1mky and 1tif, which are somewhat bigger than 1di2, we did not have enough sampling, and the native minimum was not found.


I've re-posted my ABC (Australia) request for a show as they didn't get back to me, but that still raises the question I asked in my e-mail: can the Rosetta@Home servers handle the load if they make and air a programme on Rosetta@Home? You were having difficulties with just us for a while, and I would hate to think that a premature recruitment drive turned lots of people off because of limited server capabilities. If you think your servers can handle the load and they (the ABC) don't get back to me in the next week, I'll give them a call to ask what's happening.

Great news about the hit on the protein and great journal.


Edit: added a few words to clarify, wish I could spell
ID: 12128
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12134 - Posted: 17 Mar 2006, 7:31:46 UTC - in response to Message 12128.  

I've re-posted my ABC (Australia) request for a show as they didn't get back to me, but that still begs the question I asked in my e-mail: can the Rosetta@Home servers handle the load if they make and air a programme on Rosetta@Home? You were having difficulties with just us for a while, and I would hate to think that a premature recruitment drive turned lots of people off because of limited server capabilities. If you think your servers can handle the load and they (the ABC) don't get back to me in the next week, I'll give them a call to ask what's happening.

Great news about the hit on the protein and great journal.


Thanks, Hugo. We can get new servers to split the load if necessary--it is a problem we would love to have. But I'm thinking we should get all the errors out of the system as much as possible before starting a big recruitment drive; I'm optimistic there will be real progress by the end of the month.


ID: 12134
Marky-UK

Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 12136 - Posted: 17 Mar 2006, 8:18:40 UTC
Last modified: 17 Mar 2006, 8:18:50 UTC

(2). I'm a bit disappointed that the total cpu power has remained constant for the past weeks rather than increasing as it had up until recently. More users and hosts are joining every day, but this is not translating into increased computing capability.

I know there's probably not much you can do about it, but I could probably double our team's throughput if the memory requirements of Rosetta weren't so high. I have a quad 2 GHz Xeon PC here that I can't run Rosetta on because it hasn't got enough memory to run four WUs at once, so it's crunching for a different project with a lower memory requirement (at the moment SIMAP).
ID: 12136
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 12160 - Posted: 17 Mar 2006, 19:56:12 UTC

There was a comment about how some of the participants are running with less than 512 MB of RAM per CPU, and they're having a higher than normal rate of failures with larger WUs. Is it possible to get the system RAM and number of CPUs for each machine, and then give out the larger WUs only to those with around 512 MB per CPU or higher? And if not, do we have to ask the BOINC developers to add that capability to the newer BOINC clients?


ID: 12160
dgnuff

Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 12178 - Posted: 18 Mar 2006, 3:15:14 UTC - in response to Message 12136.  

(2). I'm a bit disappointed that the total cpu power has remained constant for the past weeks rather than increasing as it had up until recently. More users and hosts are joining every day, but this is not translating into increased computing capability.

I know there's probably not much you can do about it, but I could probably double our team's throughput if the memory requirements of Rosetta weren't so high. I have a quad 2 GHz Xeon PC here that I can't run Rosetta on because it hasn't got enough memory to run four WUs at once, so it's crunching for a different project with a lower memory requirement (at the moment SIMAP).


This is probably a damn fool question, but what happens if you create a custom config and use that to limit the number of CPUs?
ID: 12178
Hoelder1in

Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 12196 - Posted: 18 Mar 2006, 17:26:40 UTC
Last modified: 18 Mar 2006, 17:29:33 UTC

I had a good look at the new 1elw plot. Great! The obvious question of course is: what causes the horizontal gaps? Is that because of the peculiar shape of the protein? I also noticed a cluster of three points at about 8 Å rmsd with energies only slightly higher than the top prediction. I wonder whether these structures actually cluster in parameter space or just appear close in the plot? If the latter is the case, this might be an example where analyzing the density of the structures in parameter space could help to distinguish the 'correct' from the 'wrong' low energy structures.

Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky). Perhaps in the high energy, close structures most parts of the protein are at about the same energy as in the native structure, with just a few outliers that drive the total energy up? Looking at the low energy prediction of 1mky, the reverse might be the case, with just a small part of the protein being very tightly bound and contributing very low energies to the total, while all of the rest is only loosely bound? Assuming that it is possible to determine the energy contribution of each residue separately, I wonder whether it wouldn't make sense to get rid of the outliers (both high and low) when calculating the total energy (just for analyzing the results, not for minimizing the energy)? Combining this with the previous discussion on the distances of structures in parameter space, I am guessing that perhaps the median energy of the residues, divided by the rms distance of the structure to its nearest neighbor, might be a promising predictor of rmsd to native?

Taking this one step further (again assuming that it is possible to determine the contribution of each residue to the total energy), wouldn't it make sense to study the distribution of these energies (say, number of residues with energies < E plotted vs. E)? Perhaps these distributions have characteristic properties for structures close to native? They might for example be relatively flat, such that each residue is bound equally well, giving the protein a stable shape? Perhaps the distributions can be parameterized in a simple way (power-law, exponential...?), providing an additional quantity (in addition to energy and distance of structures in parameter space) to characterize the structures?

Oh well, time to shut up - trying to be creative when one lacks essentially all the relevant background knowledge probably isn't that helpful. ;-)
ID: 12196 · Rating: 0 · rate: Rate + / Rate - Report as offensive
R/B

Send message
Joined: 8 Dec 05
Posts: 195
Credit: 28,095
RAC: 0
Message 12249 - Posted: 19 Mar 2006, 7:59:42 UTC

I really enjoy this thread and the D.Baker journal thread. I don't know of another BOINC project that provides this large amount of feedback from the project staff. Wonderful.
Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.


ID: 12249 · Rating: 0 · rate: Rate + / Rate - Report as offensive
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Send message
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12314 - Posted: 20 Mar 2006, 5:02:18 UTC - in response to Message 12196.  

I had a good look at the new 1elw plot. Great! The obvious question of course is: what causes the horizontal gaps? Is that because of the peculiar shape of the protein? I also noticed a cluster of three points at about 8 A rmsd with energies only slightly higher than the top prediction. I wonder whether these structures actually cluster in parameter space or just appear close in the plot? If the latter is the case, this might be an example where analyzing the density of the structures in parameter space could help to distinguish the 'correct' from the 'wrong' low energy structures.

Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky). Perhaps in the high energy, close structures most parts of the protein are at about the same energy as in the native structure, with just a few outliers that drive the total energy up? Looking at the low energy prediction of 1mky, the reverse might be the case, with just a small part of the protein being very tightly bound and contributing very low energies to the total, while all of the rest is only loosely bound? Assuming that it is possible to determine the energy contribution of each residue separately, I wonder whether it wouldn't make sense to get rid of the outliers (both high and low) when calculating the total energy (just for analyzing the results, not for minimizing the energy)? Combining this with the previous discussion on the distances of structures in parameter space, I am guessing that perhaps the median energy of the residues, divided by the rms distance of the structure to its nearest neighbor, might be a promising predictor of rmsd to native?

Taking this one step further (again assuming that it is possible to determine the contribution of each residue to the total energy), wouldn't it make sense to study the distribution of these energies (say, number of residues with energies < E plotted vs. E)? Perhaps these distributions have characteristic properties for structures close to native? They might for example be relatively flat, such that each residue is bound equally well, giving the protein a stable shape? Perhaps the distributions can be parameterized in a simple way (power law, exponential...?), providing an additional quantity (in addition to energy and distance of structures in parameter space) to characterize the structures?

Oh well, time to shut up - trying to be creative when one lacks essentially all the relevant background knowledge probably isn't that helpful. ;-)



Again, very good ideas here! A graduate student in my group, Will Sheffler, is investigating the distributions of energies and interactions in native structures compared to the low energy structures you are generating. He may at some point give you a full description, but a clear result so far is that while native structures are pretty uniformly packed, with relatively low energies for all interactions, some of the low energy wrong models are much less uniformly packed, with clusters of atoms making very low energy interactions and others making relatively poor interactions. In this case, while such a (wrong) model may have an overall energy close to that of the native structure, a histogram of the per-residue energies would clearly distinguish the two. Will is working on two approaches to get at this--first, a model evaluation approach that explicitly takes these per-residue interaction energy distributions into account, and second, improvements to Rosetta which will penalize these overly compact portions of models that are now getting overly low energies. He should be testing the latter approach on Rosetta@home in the next few weeks.
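For readers who want the gist of the histogram argument in code, a toy sketch (the names and the simple spread statistic are illustrative only, not what Will actually computes): two models can have the same total energy while their per-residue energy distributions look completely different.

```python
import numpy as np

def per_residue_histogram(res_energies, bins=10, e_range=(-3.0, 1.0)):
    """Histogram of per-residue energies: expected to be narrow for a
    uniformly packed near-native model, spread out or bimodal for a
    wrong model with some very tight and some poor interactions."""
    hist, _ = np.histogram(res_energies, bins=bins, range=e_range)
    return hist

def packing_uniformity(res_energies):
    """Standard deviation of per-residue energies: small when packing
    is uniform (native-like), large when it is uneven."""
    return float(np.std(np.asarray(res_energies, dtype=float)))
```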

ID: 12314 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile Hoelder1in
Avatar

Send message
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 12315 - Posted: 20 Mar 2006, 5:40:53 UTC - in response to Message 12314.  

Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky)...

Again, very good ideas here! A graduate student in my group, Will Sheffler, is investigating the distributions of energies and interactions in native structures compared to the low energy structures you are generating. He may at some point give you a full description, but a clear result so far is that while native structures are pretty uniformly packed, with relatively low energies for all interactions, some of the low energy wrong models are much less uniformly packed, with clusters of atoms making very low energy interactions and others making relatively poor interactions. In this case, while such a (wrong) model may have an overall energy close to that of the native structure, a histogram of the per-residue energies would clearly distinguish the two. Will is working on two approaches to get at this--first, a model evaluation approach that explicitly takes these per-residue interaction energy distributions into account, and second, improvements to Rosetta which will penalize these overly compact portions of models that are now getting overly low energies. He should be testing the latter approach on Rosetta@home in the next few weeks.

WOW - very interesting! I would be delighted to hear more about Will's work - whenever he has something definite to report (good luck with your work, Will, I am keeping my fingers crossed that these ideas will work out as intended!).
ID: 12315 · Rating: 0 · rate: Rate + / Rate - Report as offensive
will sheffler

Send message
Joined: 20 Mar 06
Posts: 3
Credit: 0
RAC: 0
Message 12343 - Posted: 20 Mar 2006, 18:57:37 UTC - in response to Message 12315.  

Also, studying some of the previous plots once more, mostly the ones where the prediction didn't work so well, I was wondering what might cause structures very close to the native structure to have high energies (e.g., 1tif) while some structures far away from the native structure seem to have low energies (1mky)...

Again, very good ideas here! A graduate student in my group, Will Sheffler, is investigating the distributions of energies and interactions in native structures compared to the low energy structures you are generating. He may at some point give you a full description, but a clear result so far is that while native structures are pretty uniformly packed, with relatively low energies for all interactions, some of the low energy wrong models are much less uniformly packed, with clusters of atoms making very low energy interactions and others making relatively poor interactions. In this case, while such a (wrong) model may have an overall energy close to that of the native structure, a histogram of the per-residue energies would clearly distinguish the two. Will is working on two approaches to get at this--first, a model evaluation approach that explicitly takes these per-residue interaction energy distributions into account, and second, improvements to Rosetta which will penalize these overly compact portions of models that are now getting overly low energies. He should be testing the latter approach on Rosetta@home in the next few weeks.

WOW - very interesting! I would be delighted to hear more about Will's work - whenever he has something definite to report (good luck with your work, Will, I am keeping my fingers crossed that these ideas will work out as intended!).


Hi Hoelder1in. I am indeed thinking about some of the things you mentioned. I agree that it's important to look at distributions of scores. This can help to see if a few sub-structures (residues, atoms, etc.) have really bad scores while others are pretty good, or if sub-scores are pretty even overall. One could imagine that if there are a few really bad scores, the structure has some kind of serious flaw. An example of such a local problem, which never happens in real proteins, is a crack or hole. We haven't had very good luck picking out holes, and this is one case where breaking scores down by residue or even by atom is helpful. Just last Friday I was looking into a hole detector based on scores for individual atoms. Take a look here if you are interested.

http://www.gs.washington.edu/~wsheffle/boinc/
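For illustration only, here is a very crude stand-in for that idea -- flag atoms with unusually few neighbors within a cutoff. This is nothing like the actual per-atom scoring, just the general shape of a detector that looks for locally underpacked regions:

```python
import numpy as np

def underpacked_atoms(coords, radius=5.0, min_neighbors=8):
    """Indices of atoms with fewer than min_neighbors other atoms
    within radius (angstroms) -- a crude proxy for atoms lining a
    crack or internal hole."""
    c = np.asarray(coords, dtype=float)
    # Pairwise distance matrix (fine for small examples; O(n^2) memory).
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    n_neighbors = (d < radius).sum(axis=1) - 1  # subtract self
    return np.where(n_neighbors < min_neighbors)[0]
```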
ID: 12343 · Rating: 0 · rate: Rate + / Rate - Report as offensive
BennyRop

Send message
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 12347 - Posted: 20 Mar 2006, 20:15:11 UTC - in response to Message 12343.  

Hi Hoelder1in. I am indeed thinking about some of the things you mentioned. I agree that it's important to look at distributions of scores. This can help to see if a few sub-structures (residues, atoms, etc.) have really bad scores while others are pretty good, or if sub-scores are pretty even overall. One could imagine that if there are a few really bad scores, the structure has some kind of serious flaw. An example of such a local problem, which never happens in real proteins, is a crack or hole. We haven't had very good luck picking out holes, and this is one case where breaking scores down by residue or even by atom is helpful. Just last Friday I was looking into a hole detector based on scores for individual atoms. Take a look here if you are interested.

http://www.gs.washington.edu/~wsheffle/boinc/


Correct me as you see fit, but we're currently working with atoms or groups of atoms (residues) to create molecules. The current energy scoring is based on the atoms, but while it's producing great residues with low scores, we now need another energy scoring algorithm to score how the residues fit together and form the molecule.
And while the ribbon and string visual representations of the molecule allow us to see the "this is good and this is bad" versions - it's tough (at least for me) to tell what's better or worse about the atom-based models that follow.

While I can't read the lower set of models - is what you're saying basically pointing to creating a library of higher-level residues, recognizing when we've come across a known residue and substituting the known structure - and then building the unknown molecule by making known and unknown residues fit together best?


ID: 12347 · Rating: 0 · rate: Rate + / Rate - Report as offensive
will sheffler

Send message
Joined: 20 Mar 06
Posts: 3
Credit: 0
RAC: 0
Message 12369 - Posted: 21 Mar 2006, 3:23:59 UTC - in response to Message 12347.  

Correct me as you see fit, but we're currently working with atoms or groups of atoms (residues) to create molecules. The current energy scoring is based on the atoms, but while it's producing great residues with low scores, we now need another energy scoring algorithm to score how the residues fit together and form the molecule.
And while the ribbon and string visual representations of the molecule allow us to see the "this is good and this is bad" versions - it's tough (at least for me) to tell what's better or worse about the atom-based models that follow.


While I can't read the lower set of models - is what you're saying basically pointing to creating a library of higher-level residues, recognizing when we've come across a known residue and substituting the known structure - and then building the unknown molecule by making known and unknown residues fit together best?

I'm sorry the models aren't very clear. The second set of pictures with all the atoms shown as large spheres is intended to show the hole/crack in the model with incorrect topology. In the middle of the protein - where the yellow and green meet - there's a fairly sizeable gap. It's definitely tough to see these kinds of features from a static 2D image, but you should be able to pick out a little spot in the right hand image where you can see entirely through the protein. These kinds of cracks/holes aren't something that happen very often in real proteins, but they are surprisingly hard to detect and prevent using just our current energy function. This isn't to say that our main energy function isn't good -- it's very good -- just that different methods of evaluating structures can be useful.
ID: 12369 · Rating: 0 · rate: Rate + / Rate - Report as offensive
R/B

Send message
Joined: 8 Dec 05
Posts: 195
Credit: 28,095
RAC: 0
Message 12405 - Posted: 21 Mar 2006, 8:01:04 UTC

You just hop on that radio show if you can, Dr. Baker. 700 radio stations with millions of listeners to your 1 hour phone interview has got to bring in new support. You'll pick up enthusiastic support; I assure you.
Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.


ID: 12405 · Rating: 0 · rate: Rate + / Rate - Report as offensive