Message boards : Rosetta@home Science : What we have learned thus far
Author | Message |
---|---|
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
The past few weeks have been incredibly exciting for us as we have watched all of you explore a protein folding landscape far too large to map adequately using traditional computing methods. As often happens in scientific research, we didn't find what we expected. We had hoped that with the amount of computing power you all are bringing to bear on the problem, it might be possible to find the global energy minimum at the experimentally observed structure (rmsd close to zero). Instead, as you can see from the plots that David has posted for the four protein landscapes you have investigated so far, nobody has yet landed within 1 Å rmsd of the actual structure. To solve this problem we need to improve our search methods, and you need to recruit more crunchers. We will return to these four proteins in several months, by which time I'm optimistic that both we and you will have done what is necessary to solve these four problems perfectly. I should say, however, that the 2 Å rmsd models that you have found in the past weeks are quite close to the correct structures, as you can see in the pictures, and would be accurate enough for understanding how the proteins work. For the next month we will be focusing on improving our sampling methods on smaller proteins, and, due to the decreased size and also due to efforts made by the Rosetta community across the country to reduce the memory requirements in response to your concerns, you should see considerably lower memory usage. Now for the unexpected results! If we take the lower energy predictions and cluster them into groups based on the similarity of the structures, we find that the best models (lowest rmsd) are collected together in the largest, or one of the largest, clusters. So while we can't necessarily identify the best predictions based on the energy, as we had hoped would be possible, we can identify them based on their proximity to many other predictions.
An explanation for this is that because the landscape is so big, there are very many more ways of being far from the global minimum than close to it, and due to fluctuations in the energy landscape, some of the very large number of faraway structures by chance end up lower in energy than the lowest rmsd structures. However, these faraway models are all different from each other, so when we cluster, we correctly pick out the low rmsd models, which are close together near (but not at) the bottom of the global minimum. In the context of the landscape exploration analogy, a number of independent explorers have found the rim of Death Valley, but are not quite as low in elevation as a number of isolated explorers exploring crevices and valleys all over the world. The control center receiving reports from all the explorers can see the large number of reports coming from the vicinity of Death Valley, and might send more explorers to search for the bottom of it. With your help we may try a similar strategy to get to the lowest point on the folding energy landscapes. |
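The clustering observation above can be illustrated with a deliberately simplified sketch. Everything here is invented for illustration (toy 1-D "structures", a made-up `cluster_models` helper, synthetic data); it is not Rosetta's actual pipeline. Isolated decoys each seed their own tiny cluster, while near-native models pile up together, so the largest cluster flags them even when no single model reaches the bottom:

```python
import random

def cluster_models(models, cutoff):
    """Greedy clustering: each model joins the first cluster whose
    seed is within `cutoff`, otherwise it seeds a new cluster."""
    clusters = []  # list of (seed, members)
    for m in models:
        for c in clusters:
            if abs(m - c[0]) <= cutoff:
                c[1].append(m)
                break
        else:
            clusters.append((m, [m]))
    return clusters

random.seed(1)
# Toy landscape: scattered decoys plus a tight pile near x = 2.0
decoys = [random.uniform(-50, 50) for _ in range(100)]
near_native = [2.0 + random.gauss(0, 0.3) for _ in range(40)]
models = decoys + near_native

clusters = cluster_models(models, cutoff=1.0)
largest = max(clusters, key=lambda c: len(c[1]))
center = sum(largest[1]) / len(largest[1])
print(round(center, 1))  # the largest cluster sits near the "native" basin
```

In the real setting the "distance" is a pairwise rmsd between full 3-D structures rather than a 1-D difference, but the principle of picking the broadest basin is the same.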
UBT - Halifax--lad Send message Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
I'm glad you're concentrating on fixing some of the problems with Rosetta, such as memory use and checkpoints. If you want more users, that is possibly the way to go: create more checkpoints and ensure the work units will run in 256 MB of memory. Not everyone has a high/medium spec computer, and error messages keep people away; fixing them may spur more people to crunch over here. Join us in Chat (see the forum) Click the Sig Join UBT |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Except it begs the question I asked in other threads ... If we are not exploring the space correctly, maybe we are not covering the space "correctly". When I look at the graphs, what *I* see is that all the explorers are clustered together. Though there is some "spread", there is not a sparse but full coverage of the space. If I followed the discussion in the other thread, the next spot to check is based in part on the prior spots checked. Since we cannot ever have enough computing power to exhaustively search all possibilities, we have to get smarter ... :) But you know that... this is the reason I was asking the questions I did, not so much in an expectation that I would discover fire, but that I might spark a thought for you ... But, since data used to be my life... there is the possibility that I MAY be able to offer a differing view ... again, simplistically ... the graph implies that we have a known and bounded space. Is that correct? Have you considered using a multiple spawn search? Where instead of having one "center", you have several widely spaced ... the coverage is great where explored, but again, I saw an implication that we may be exploring too close to each other ... in other words, the first point is fine, but the next should be some minimum distance away, with the intent to cover as much ground as possible ... from there you can take the most promising candidates from pass 1 to try with pass 2, to fill in the spaces near those points that look the "best"... Again, ideally the search step size should be 1/2 the detection range expected ... I guess what *I* want to know is, are you interested in outside help? Or just our CPU time? Either way is fine, but, as I said, I used to do data ... and I did not like the first graphs, not because they did not find "Norman", but because the spatial coverage stunk ... and that was why I asked if you watched the evolution over time ... *I* still think that MIGHT be revealing ... but then again, it is that data thingie ... |
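Paul's suggestion of keeping each new start point some minimum distance from the previous ones can be sketched as simple dart-throwing (rejection) sampling. This is a hypothetical 1-D illustration, not anything from the Rosetta code base:

```python
import random

def spread_samples(n, lo, hi, min_dist, max_tries=10000):
    """Pick n start points in [lo, hi], each at least `min_dist`
    from every point already chosen ("dart throwing")."""
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        x = random.uniform(lo, hi)
        if all(abs(x - p) >= min_dist for p in points):
            points.append(x)
    return points

random.seed(0)
pts = spread_samples(10, 0.0, 100.0, min_dist=4.0)
print(len(pts))  # 10 starts
gap = min(abs(a - b) for i, a in enumerate(pts) for b in pts[i + 1:])
print(gap >= 4.0)  # every pair respects the minimum spacing
```

In hundreds of dimensions, rejection sampling like this becomes expensive, which is part of why spreading the search out is harder than it looks in 1-D.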
Honza Send message Joined: 18 Sep 05 Posts: 48 Credit: 173,517 RAC: 0 |
Thanks for the feedback, much appreciated. I believe it's always of benefit to share feedback... whereas feedback is usually lacking on the project team side - which is NOT the case with Rosetta :-) |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
For the next month we will be focusing on improving our sampling methods on smaller proteins, and, due to the decreased size and also due to efforts made by the rosetta community across the country to reduce the memory requirements in response to your concerns, you should see considerably lower memory usage. Thanks for the wonderful report and feedback! :) You have certainly gotten me excited! I do not know if BOINC is capable of this, but perhaps there is some way to send whatever units might need more memory to higher-memory host machines, while the other units go to the 256 MB machines? That seems the best of both worlds. Thanks again for the great report, and I look forward to following this journey with you into the future! Regards, Bob P. |
Jeff Send message Joined: 21 Sep 05 Posts: 20 Credit: 380,889 RAC: 0 |
Excellent communications. *THAT* is one way you are going to get more and *KEEP* existing members. Thank you. Jeff's Computer Farm |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
Thank you for the timely and interesting feedback. As others have said, it is a great way to keep people with you throughout the journey! Cheers, Rog. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
For the next month we will be focusing on improving our sampling methods on smaller proteins, and, due to the decreased size and also due to efforts made by the rosetta community across the country to reduce the memory requirements in response to your concerns, you should see considerably lower memory usage. Thanks for the wonderful report and feedback! :) You have certainly gotten me excited! I do not know if BOINC is capable of this, but perhaps there is some way to send whatever units might need more memory to higher memory host machines, while the other units go to the 256mb machines? That sort of seems the best of both worlds perhaps. We would really like to be able to do this, as it would be the best of both worlds! We will indeed have a mix of low memory and high memory jobs (folding calculations for small and medium sized proteins) over the next several months. But I don't know if BOINC has any mechanism for directing work units according to the amount of host memory--does anybody know? |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
I cannot recall ... um, yes it does ... rsc_mem_bound See ... http://boinc.berkeley.edu/work.php or http://boinc-doc.net/boinc-wiki/index.php?title=Work_Unit_Record Mostly I like the Wiki as my source, not entirely sure why, especially as you can go from there back to the UCB to see if there is anything I missed ... |
Honza Send message Joined: 18 Sep 05 Posts: 48 Credit: 173,517 RAC: 0 |
I do not know if BOINC is capable of this, but perhaps there is some way to send whatever units might need more memory to higher memory host machines, while the other units go to the 256mb machines? That sort of seems the best of both worlds perhaps. This topic has already been discussed on projects that should consider WU parameters - namely CPDN (due to its long processing time, HD space, and possibly memory requirements) and BURP (memory requirements). On CPDN, we discussed that the user should be able to specify what type of job he is allowing to download and process (i.e. slab model, sulphur cycle, coupled model etc.). This sounds good for the user but may cause trouble for the project team. Apart from that, some users have different machines with different specifications (low vs. high-end). Anyway, as I understand it, when the BOINC client contacts the scheduler, the server side must 'know' which host it is sending the WU to, hence the machine specification must be known via the host database (hostID). The problem in the mechanism was that BOINC immediately wrote the ResultID-HostID link even when downloading failed. [remember the ghost units?] It should have changed with BOINC 5.x, or had already been fixed in a later version of 4.x. Perhaps a simple threshold - e.g. "allow 256+ MB RAM jobs" - in a user profile would do the job? The number of CPUs may play a role as well... |
UBT - Halifax--lad Send message Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
If BOINC isn't capable of determining what WUs to give to people with a small amount of memory or to people with huge amounts, you can always contact David Anderson over at SETI; he probably wouldn't mind developing a new version of BOINC able to do so. Join us in Chat (see the forum) Click the Sig Join UBT |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
If BOINC isn't capable of determining what WU's to give to people with a small amount of memory or to people with huge amounts you can always contact David Anderson over at SETI he prob wouldn't mind developing a new version of BOINC able to do so. This link on the BOINC mainpage may be what you are looking for. http://boinc.berkeley.edu/email_lists.php Regards, Bob P. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 137 |
I have a slight problem. I saw this... >>> would be accurate enough for understanding how the proteins work. ... at the top. Now from reading around Folding@Home, it would seem to me that a very small misfold can have a very large consequence. Can someone tell me how to reconcile the statement and my feeling about folding? I also agree with Paul. I think it would be a good idea to clearly state the actual problems you are trying to solve with this in absolute terms. I have found in 20+ years of data processing experience that someone looking fresh at a data problem, who is not weighed down with baggage, i.e. what is "known" (when in fact the "known" is often simply theory), can often cut straight to the chase. The preconceptions cloud the view of the problem. Somebody from a completely unrelated discipline can sometimes see a solution to the real problem, simply because their thought process when looking at the data set is different. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I have a slight problem. I saw this... >>> would be accurate enough for understanding how the proteins work. ... at the top. Now from reading around Folding@Home, it would seem to me that a very small misfold can have a very large consequence. Small changes in protein structures can cause disease, it is true. But to understand what the function of a protein is, a model with an accuracy of 2 Å will often be good enough. I also agree with Paul. I think it would be a good idea to clearly state the actual problems you are trying to solve with this in absolute terms. The problem is extremely easy to state: we are trying to find the protein structure with the lowest energy. Each protein has a unique amino acid sequence, and for this sequence, there is a unique lowest energy structure. The preconceptions cloud the view of the problem. Somebody from a completely unrelated discipline can sometimes see a solution to the real problem, simply because their thought process when looking at the data set is different. Agreed--it would be terrific if you provided insights here. We would be happy to make available to you, Paul, and any others the sets of structures (and energies) sampled so far. We try to spread sampling out as much as possible--we will be experimenting with different ways of accomplishing this over the next few weeks. |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
The preconceptions cloud the view of the problem. Somebody from a completely unrelated discipline can sometimes see a solution to the real problem, simply because their thought process when looking at the data set is different. I'm undisciplined, but here are my thoughts, using your explorers analogy :) As I understand it, we're dropping our explorers at random points on the globe. We keep dropping more explorers at more random points, they each report back with their findings, and we look for the best result. But how about we drop 1,000 explorers at random initially, pick the most promising 1% (or 10%) of their reports, and send more explorers to those areas, rather than continuing to drop them in random locations? I mean, once we know the lowest point is NOT in the Himalayas, why keep sending our resources to that region? In other words, build a broad sample initially, examine the initial results, then concentrate the resources on the sequences that hold the most promise. A staged approach, extending what's being done on the individual computers (with their 10+2 models). But maybe that's already being done or I've lost the plot... *** Join BOINC@Australia today *** |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
The preconceptions cloud the view of the problem. Somebody from a completely unrelated discipline can sometimes see a solution to the real problem, simply because their thought process when looking at the data set is different. This is of course an excellent suggestion. We are experimenting with a number of ways of building up a picture of the landscape from an initial set of samples to guide subsequent sampling. This turns out to be harder than it might at first seem. There are roughly five rotatable bonds (variables) for each amino acid in a protein, so for a 100 residue protein, the surface which we are mapping out has ~500 dimensions, which as you can imagine makes things difficult. We are trying several ways of projecting from this very high dimensional space to a lower dimensional space where building a map is more feasible. We could, as you suggest, send the second round of explorers out to the most promising regions identified by the first set of 1,000, but this is a bit of a risk, because none of the first set may have come anywhere close, and we would be putting more energy into exploring false minima. There is a balance between diversification--exploring brand new areas--and intensification--searching more thoroughly in the most promising regions identified thus far--which we are trying to optimize. |
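The staged approach being discussed can be sketched as a two-pass toy search: broad random sampling first (diversification), then local refinement around the best seeds (intensification). The 1-D `energy` function below is an invented bumpy stand-in, not a real folding energy, and the stage sizes are arbitrary:

```python
import random

def energy(x):
    # Invented 1-D "landscape": a broad funnel toward x = 3.0
    # plus small high-frequency bumps (false local minima).
    return (x - 3.0) ** 2 + 2.0 * abs((x * 2.3) % 1.0 - 0.5)

random.seed(2)

# Stage 1 (diversification): broad random sampling of the whole range
stage1 = [random.uniform(-20.0, 20.0) for _ in range(500)]
stage1.sort(key=energy)
seeds = stage1[:10]  # keep the most promising 2%

# Stage 2 (intensification): refine locally around each seed
refined = []
for s in seeds:
    refined.extend(s + random.gauss(0, 0.5) for _ in range(50))

best = min(stage1 + refined, key=energy)
```

As the post notes, the risk is that if none of the stage-1 samples land in the true funnel, stage 2 just polishes false minima; the toy works only because stage 1 is dense relative to this tiny landscape.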
Mike Tyka Send message Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
The idea of 'spreading out' the search to avoid re-searching already explored areas, as well as that of slowly concentrating a wide search on the areas of highest promise, is not new. Global optimisation of the kind needed for protein structure prediction is a common problem in many areas of science, and over the last few decades people have been developing new methods which can be more efficient than plain Monte Carlo. One of these, which we are currently investigating for use in Rosetta, is a method developed by Harold Scheraga (Cornell University) called 'Conformational Space Annealing'. The idea is to evolve many structures (a "library" of structures) concurrently, using a measure of diversity to keep the search diverse at first and then slowly allowing it to concentrate on the lower energy areas until finally (so one hopes) the global energy minimum is found. To keep the analogy with the explorers - you send a thousand explorers onto your planet but give each a mobile phone, so they know how far they are from other explorers nearby. You instruct them at first not to approach each other closer than some distance. As they explore, you then slowly decrease that distance. Because the explorers will prefer going to low-lying areas, as they're allowed to come closer, they will cluster in the low 'basins'. I guess the analogy is not quite complete, since in the real algorithm explorers can both die and have children, but the basic idea is the same.. ;) The problem in terms of implementation is that of communication between processes - on a BOINC-like architecture it is hard for processes to exchange information on the fly - so this kind of approach to the problem is better suited to loosely interconnected parallel machines which can communicate reasonably frequently (say once every 30 seconds). Mike http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
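A heavily simplified, single-process sketch of the CSA idea described above: a library of candidates, a diversity cutoff that is slowly annealed toward zero, and children that either displace a nearby neighbour (if better) or the worst library member (if they land in new territory). The 1-D `energy` function and all parameters are invented for illustration; the real algorithm works on full protein conformations:

```python
import math
import random

def energy(x):
    # Invented bumpy 1-D landscape, global minimum at x = 0
    return x * x + 3.0 * math.sin(5.0 * x) ** 2

random.seed(3)
library = [random.uniform(-10.0, 10.0) for _ in range(30)]
d_cut = 5.0  # diversity distance, annealed toward zero

for generation in range(60):
    parent = random.choice(library)
    child = parent + random.gauss(0, 1.0)  # "have children"
    nearest = min(library, key=lambda m: abs(m - child))
    if abs(nearest - child) < d_cut:
        # Child is close to an existing member: they compete,
        # and the lower-energy one survives.
        if energy(child) < energy(nearest):
            library[library.index(nearest)] = child
    else:
        # Child is in new territory: it competes with the
        # worst member of the whole library instead.
        worst = max(library, key=energy)
        if energy(child) < energy(worst):
            library[library.index(worst)] = child
    d_cut *= 0.95  # slowly let explorers approach one another

best = min(library, key=energy)
```

The annealed `d_cut` is what keeps the library spread out early on and lets it condense into the low basins later, mirroring the shrinking phone-distance in the explorer analogy.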
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 137 |
>>> once we know the lowest point is NOT in the Himalayas, You need to be careful with extrapolating between the problem and the analogy. The surface topography of a planet is not crafted by the same forces that shape a protein. It is therefore reasonable to assume that the lowest point on the planet is not going to be in the heart of an area of tectonic crustal thickening due to plate boundary collision. I would imagine that a protein with two very large low energy "chunks" connected together may have hundreds of relatively low energy structures with minor adjustments within the two "chunks", but a small modification in the short simple section that connects them might produce a huge change. Extrapolated to the analogy, there could well be a very deep hole in the middle of the Himalayas. The analogy is there to help visualise the problem. In practice, if you wanted to find the lowest point on a planet, you would not randomly place explorers, you'd use an orbiting radar altimeter. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
>>> once we know the lowest point is NOT in the Himalayas, Adrian is right--even very close to the lowest point, there can be very high mountains. This is because the lowest energy structure is very tightly packed, like a jigsaw puzzle, and hence random perturbations of atoms in any direction can cause extremely unfavorable atomic overlaps. This makes the landscape extremely bumpy. There is a not too technical discussion of the difficulty of the search in Bradley, P., Misura, K. M., Baker, D. (2005). Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868-1871. A pdf of this paper can be obtained from the publications section at /depts.washington.edu/bakerpg. |
divyab Send message Joined: 20 Oct 05 Posts: 6 Credit: 0 RAC: 0 |
In addition to the CSA method Mike described above, we are also developing other algorithms that address some of the sampling concerns you mentioned. Currently, if Rosetta is to generate 10,000 structures, each of those 10,000 runs starts at a random point in the conformational space. Monte Carlo sampling and then some local optimization is performed, and the structure at the local minimum is reported. Instead, what we would like to do is use the information from the first 5,000 runs to guide the sampling for the next 5,000 runs. Thus we will use the information in the local minima to help guide us toward a global minimum, or at the very least, find areas that either appear promising or are undersampled. The protein conformational space is extremely high dimensional - too high to feasibly globally optimize. We are thus reducing the dimensionality using Principal Component Analysis, and attempting to fit simple energy surfaces (parametric and non-parametric) to this reduced dimensional space. Minimizing these fitted functions gives us new areas to sample. Additionally, we can look at this reduced dimensional space for areas in which sampling is very low. We can then identify those areas for further sampling. |
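A minimal sketch of the PCA step described above, with synthetic data standing in for real conformations: high-dimensional samples whose variance is concentrated in a few underlying directions are projected onto their top principal components via an SVD of the centered data matrix. All shapes and variances here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for sampled conformations: 500 points in 50 dimensions,
# with nearly all variance generated by 2 hidden directions plus noise.
latent = rng.normal(size=(500, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 50))
samples = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

# PCA via SVD of the centered data: rows of Vt are principal axes
centered = samples - samples.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)

# Project into the top-2 principal subspace; sampling density (and
# fitted energy surfaces) can then be examined in this reduced space.
projected = centered @ Vt[:2].T
print(projected.shape)  # (500, 2)
```

Real conformational data has ~500 dimensions per protein rather than 50, but the mechanics of centering, decomposing, and projecting are the same; the reduced space is where one would look for promising or undersampled regions.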
©2024 University of Washington
https://www.bakerlab.org