Large Homology Modeling Benchmark

Message boards : Rosetta@home Science : Large Homology Modeling Benchmark

Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 55937 - Posted: 21 Sep 2008, 22:47:10 UTC
Last modified: 22 Sep 2008, 0:49:42 UTC

Over the next 4 months we will be testing some brand new ROSETTA code developed to address the problem of homology modeling in a standardized fashion. We will be using BOINC to run the code with various parameters and sub-algorithms on a comprehensive set of problems to try and come up with a good approach that yields consistent results.

What is Homology Modeling (a.k.a. Comparative Modeling)?

Homology modeling is an approach to protein structure prediction that makes use of known structures in the Protein Data Bank (PDB) that are likely to be similar to the structure we want to predict.
Unlike in the ab initio problem, in comparative modeling we search the database of known protein structures (the PDB) for structures whose sequences are similar to that of our query protein (its cousins, so to speak) and use these structures as starting points for our modeling. The basic premise is that a protein will have a similar but not identical structure to its close relatives. Thus these relatives should be good starting points for the prediction process.
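
To make the "cousins" idea concrete, here is a minimal Python sketch of ranking candidate templates by percent sequence identity to the query. It is only an illustration, not the search actually used in the pipeline: it assumes the query/template pairs have already been aligned by some other tool (gaps written as "-"), and the template names and sequences are made up.

```python
from __future__ import annotations


def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent of gap-free alignment columns in which both residues match."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if q == t:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def rank_templates(alignments: dict[str, tuple[str, str]]) -> list[tuple[str, float]]:
    """Sort (hypothetical) template names from most to least similar to the query."""
    scored = [(name, percent_identity(q, t)) for name, (q, t) in alignments.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy, made-up alignments: name -> (aligned query, aligned template).
toy_alignments = {
    "templateA": ("MKTAYIAK", "MKTAYLAK"),
    "templateB": ("MKTAYIAK", "MRT-YLGK"),
}
ranking = rank_templates(toy_alignments)
```

In this toy case templateA ranks first because it differs from the query at a single position. Real template detection relies on much more sensitive methods than raw identity, so treat this purely as a picture of the idea.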

Read more about it here:
http://en.wikipedia.org/wiki/Homology_modeling



Why is it important?
Structural information about protein molecules is key to understanding the mechanisms behind normal and abnormal cell function, and is thus fundamental to understanding human disease. However, even today the number of known high-resolution structures lags far behind the number of known protein sequences obtained from the efforts of genomic sequencing over the last decade. Closing this gap is one of the most important challenges in modern biochemistry and has given rise to a number of Structural Genomics projects employing X-ray crystallography and Nuclear Magnetic Resonance (NMR) in a high-throughput fashion. However, despite the scale of the efforts undertaken, many proteins must be rejected because they resist being expressed, purified, crystallized or solved, leaving only a fraction for which the lengthy process of structure determination succeeds. Many structural genomics projects thus aim primarily to solve at least one representative member of each structural fold family.

This last point is important: if we only have structural information that is coarsely dispersed across the protein families, we need a consistent and reliable method to take a protein of unknown structure and use its closely related cousins to predict its structure!


What are the problems?
While comparative modeling is conceptually a very simple approach, we face many problems. When searching for template structures, it is not always obvious how to align the query sequence onto the template structure. We thus have to try various alignments and see which ones give us better structures with lower energies.
Further, we usually find multiple templates with varying similarity, and one of the problems is to find a way to productively integrate information from multiple templates into one prediction.
In many cases the templates are missing loops that are present in the query protein (the one we’re trying to predict the structure of), and thus we need to model these portions de novo, i.e. very much like ab initio but applied only to a small stretch of the protein. We call this process “loopmodelling” or “looprelax”.
Although most of the differences between the templates and the query structure are in the loops, we often also see differences in those regions that do align to the template. Thus we also let those aligned parts vary, but subject to constraints derived from the templates.
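
As a rough illustration of the missing-loop problem, the Python sketch below takes one query/template alignment and lists the stretches of the query that have no template residue to copy from, i.e. the parts that would have to be built de novo. The function name and the toy sequences are invented for this example, and it assumes gaps are written as "-"; the real loop definitions in the pipeline may well be chosen differently.

```python
from __future__ import annotations


def loop_regions(aligned_query: str, aligned_template: str) -> list[tuple[int, int]]:
    """Return 1-based, inclusive query residue ranges with no aligned template residue."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    regions: list[tuple[int, int]] = []
    start = None  # first query residue of the currently open unaligned stretch
    pos = 0       # running query residue number
    for q, t in zip(aligned_query, aligned_template):
        if q == "-":
            continue  # template-only column: there is no query residue here
        pos += 1
        if t == "-":
            if start is None:
                start = pos  # an unaligned stretch begins
        elif start is not None:
            regions.append((start, pos - 1))  # the stretch ended at the previous residue
            start = None
    if start is not None:
        regions.append((start, pos))  # stretch runs to the end of the query
    return regions


# Toy alignment (made up): the template lacks two stretches of the query.
toy_query = "MKTAYIAKQRQISFVK"
toy_template = "MKT---AKQRQ--FVK"
regions = loop_regions(toy_query, toy_template)  # [(4, 6), (12, 13)]
```

Residues 4-6 and 12-13 of the toy query have nothing to be copied from, so they are the candidates for loop modeling, while the remaining residues could start from the template coordinates.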

All these problems will be addressed in this large homology modeling benchmark.



What are we doing here?
We are developing a unified, automatic pipeline for comparative protein structure modeling incorporating a variety of techniques developed in the Baker Laboratory over the past 5 years. It comprises the detection of homologs, the selection of suitable templates, the building and subsequent refinement of models and selection of final predictions, including estimation of confidence.

We've been constructing a pretty large benchmark of about 20 prediction problems from the past CASP5, CASP6 and CASP7 experiments. As soon as the initial testing is done we will extend this to about 60-80 targets covering the wide range of typical modeling problems. We will develop, test and optimize the modeling engine on this comprehensive benchmark.
This process will require a large number of iterations, during which the approach is tested, the successes and failures analyzed, and the process adjusted to improve its performance at each iteration. The large number of cases in the benchmark and the complexity of the modeling pipeline make comprehensive testing computationally extremely intensive; however, due to the breadth of modeling problems included in the benchmark, we expect to arrive at a modeling pipeline that will be applicable to a large number of real-life cases!

The work units for this project will start with hombench_ followed by the run name, the method name and a target number.
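
For anyone who wants to sort their tasks by method, a small Python sketch of splitting such a name into its parts is given below. Only the hombench_ prefix and the order of the parts come from the description above; the underscore separator between parts, the sample name and the function name are assumptions made for illustration, and real task names may carry additional suffixes.

```python
from typing import NamedTuple, Optional


class WorkUnitName(NamedTuple):
    run: str
    method: str
    target: str


PREFIX = "hombench_"


def parse_hombench_name(name: str) -> Optional[WorkUnitName]:
    """Split 'hombench_<run>_<method>_<target>' into parts (assumed '_' separator)."""
    if not name.startswith(PREFIX):
        return None  # not a benchmark work unit
    fields = name[len(PREFIX):].split("_")
    if len(fields) < 3 or not all(fields):
        return None  # too few parts, or an empty part
    # Assumption: the run name is the first field and the target number the last;
    # anything in between is treated as the method name.
    return WorkUnitName(run=fields[0], method="_".join(fields[1:-1]), target=fields[-1])


# Hypothetical example name, not taken from a real task list.
parsed = parse_hombench_name("hombench_run01_looprelax_17")
```

With the hypothetical name above, the parts come out as run "run01", method "looprelax" and target "17"; names from other projects return None.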


Thank you for your contribution!

So thank you for crunching! This is an extremely exciting project for us, since we are finally putting together a number of different, separate algorithms developed in our lab and others, as well as incorporating our experiences from the CASP7 and CASP8 competitions.


Stay tuned!
We'll edit this first post to add information about the algorithm, and we will use this thread to post what’s currently going on, what sort of problems we're trying to address, and what the jobs we are submitting are doing. Feel free to ask questions too :)

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 55938 - Posted: 21 Sep 2008, 22:50:28 UTC

Initial testing has begun!

I've submitted a first set of jobs using the methods of fold constraints (we derive a set of constraints and fold from scratch using those constraints) and looprelax (we start with a rigid core, taken from each template, and model in the loops).

The purpose here is mainly to test that the automatic machinery to set up, submit and analyse the benchmark works. There may well still be stupid errors occurring here. I've extensively tested all this locally on our computing facilities in the lab first, but you never know.

Wheee!

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
Michael G.R.

Joined: 11 Nov 05
Posts: 264
Credit: 11,247,510
RAC: 0
Message 55952 - Posted: 22 Sep 2008, 16:58:04 UTC

Sounds awesome! I feel like every new incremental improvement in the Rosetta code brings us closer to a big scientific/medical breakthrough.
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,685,151
RAC: 75
Message 56132 - Posted: 30 Sep 2008, 21:17:49 UTC
Last modified: 30 Sep 2008, 21:19:07 UTC

Related paper by Christopher Kauffman, Huzefa Rangwala, George Karypis; University of Minnesota:
Improving Homology Models for Protein-Ligand Binding Sites
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Hubington

Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56188 - Posted: 3 Oct 2008, 12:16:45 UTC - in response to Message 55937.  
Last modified: 3 Oct 2008, 12:36:26 UTC

>Over the next 4 months we will be testing some brand new ROSETTA code developed to address the problem of homology modeling in a standardized fashion. We will be using BOINC to run the code with various parameters and sub-algorithms on a comprehensive set of problems to try and come up with a good approach that yields consistent results.


Just a thought, but isn't this what RALPH is for?

The idea behind RALPH was also that people who signed up for it would be actively keeping an eye on these things so that they could report issues to you, while the Rosetta users are more your well-meaning but don't-really-want-to-get-too-involved sorts.

The other advantage is that RALPH users sign up knowing they aren't doing actual work but simply testing different ways in which the work can be done, to try to find a better way of doing it. As a result they accept that at times someone is going to get it a little wrong and a 3 hour packet will take 40+ hours. Your average Rosetta user, on the other hand, sees this and thinks the project is buggy and wasting the resources they are kindly donating, and potentially either drops the project or becomes disillusioned with the whole grid computing idea and knocks the whole thing on the head.

To Quote R L Casey
This is research, and I am reminded of the saying "If we (really) knew what we were doing, it wouldn't be research."!


I agree it is research; however, the research this project was set up for was protein folding, not discovering how to write the code to do simulated protein folding. RALPH was set up to research just that, with a much smaller community who are prepared to be more involved with their feedback.

Given that someone went to the trouble of actually separating the two elements to run them in partnership, I can't understand why it isn't being used in that way.
Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 56204 - Posted: 3 Oct 2008, 19:27:19 UTC - in response to Message 56188.  

Hubington,

I think our usage of the word "test" is slightly different. The idea here is *not* to test the code itself; that has already been done locally and on RALPH
(although the long-running WUs are an issue I hadn't encountered locally, so I'm looking into it).

The "testing" really refers to working out *scientifically* how best to solve the homology modelling problem. Research is largely a process of trial and error, so we try conceptually different approaches, or different ideas, and we apply them to a large number of folding problems. We analyse the results and, based on our analyses, we attempt to improve our methodologies. **Ultimately** we will end up with a method that is state of the art. But it's a long, tedious road. The same is true of *every* other discovery in history: it takes years of failure before making progress.


>Just a thought, but isn't this what RALPH is for?

RALPH's purpose is to make sure the software runs smoothly (i.e. bug free). Its purpose is not to assess the performance of the algorithms as such.


I hope the purpose of this project is a little clearer now :)




http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56206 - Posted: 3 Oct 2008, 19:46:59 UTC - in response to Message 56188.  
Last modified: 3 Oct 2008, 20:14:36 UTC

>I think our usage of the word "test" is slightly different.

So when they say "test" a new approach, they really are gathering a statistically significant number of results and then comparing the quality of the protein model produced with prior approaches.
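
A minimal Python sketch of that kind of comparison is shown below: it pairs up per-target quality scores from an older and a newer approach and summarises how often, and by how much, the newer one does better. The scores and target names are invented, and a "higher is better" quality score is an assumption; the project's own analysis is surely more involved.

```python
from statistics import mean


def compare_methods(scores_old: dict, scores_new: dict) -> dict:
    """Summarise per-target differences (new minus old) over shared targets."""
    shared = sorted(set(scores_old) & set(scores_new))
    if not shared:
        raise ValueError("the two methods have no targets in common")
    diffs = [scores_new[t] - scores_old[t] for t in shared]
    return {
        "targets": len(shared),
        "mean_gain": mean(diffs),              # average improvement per target
        "wins": sum(d > 0 for d in diffs),     # targets where the new method is better
        "losses": sum(d < 0 for d in diffs),   # targets where it is worse
    }


# Invented scores for illustration only (higher = better model).
old_scores = {"T1": 0.50, "T2": 0.62, "T3": 0.40}
new_scores = {"T1": 0.55, "T2": 0.60, "T3": 0.52, "T4": 0.70}
summary = compare_methods(old_scores, new_scores)
```

Here only the three targets both approaches ran are compared, and T4 is ignored. With three targets nothing can be concluded, which is exactly why a large benchmark is needed before one approach can be called better than another.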


>Just a thought, but isn't this what RALPH is for?
>...they accept that at times someone is going to get it a little wrong and a 3 hour packet will take 40+ hours.


Unfortunately, it's a lot like me asking you to estimate how long it will take you to get from here to downtown. You can give me a number, but when you actually set out to do it, your times can vary dramatically. And you can't necessarily foresee the delays until you get there. Yet, on your journey there are points where you can pretty well see you are getting nowhere fast, and potentially try a different route. The intelligence to make that decision to end pursuit of one route and try another is part of what they are talking about when they say they are working to make all models have more consistent runtimes.

So, sort of like adding the smarts to avoid known construction areas, and having enough gas in the tank before you begin the journey (or factoring the time for a fill-up into your estimate). But you never know, you could always get a flat tire along the way.
Rosetta Moderator: Mod.Sense




©2022 University of Washington
https://www.bakerlab.org