Posts by James Thompson

21) Message boards : Rosetta@home Science : How do we interpret the terminology? (Message 39090)
Posted 6 Apr 2007 by James Thompson
Post:
I and others are looking for definitions of the terms used in the task output; terms like nstruct, model, decoy, attempt.
What does it mean to have "1 starting structure built 30(nstruct) times?"
Or "This process generated 20 decoys from 20 attempts."?
Is a decoy the same as a model? Decoy sounds clandestine!

Could someone from the project help us out here? Or could a moderator expedite this? Seems like it would be nice to have these in the Q&A that is being built.



Here are some quick definitions for you folks:

nstruct - an abbreviation for "number of structures," which is the number of structures that will be created within a Rosetta run.
model - a predicted model for a given protein (or RNA, thanks to Rhiju's work!)
decoy - a decoy is also a model for a given protein (or RNA). This term came about during the early days of energy function design, when people first started writing energy functions to discriminate between "native" structures, which are correct structures solved by experimental methods, and "decoy" structures, which are computationally created structures that are incorrect to some degree. I found the term to be odd when I first encountered it as well, but I've since grown accustomed to it. In most contexts, model and decoy can be used interchangeably.
attempt - an attempt to create a structure. Some structures created by Rosetta are obviously incorrect; these are caught by filters that measure various characteristics of the energy function, and they are thrown away rather than saved. This filtering is usually done during the ab initio phase of our runs, which is extremely fast compared to the full-atom refinement and scoring of the decoy.
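To show how these terms fit together, here is a toy sketch of a Rosetta-like run loop (not actual Rosetta code; the stage functions are placeholders invented for this example):

```python
import random

def run_rosetta_like_protocol(nstruct=30):
    """Toy sketch of how nstruct, attempts, filters, and decoys relate.

    build_low_res_model, passes_filters, and full_atom_refine are
    placeholders standing in for the real (much more involved) stages.
    """
    decoys = []
    attempts = 0
    while len(decoys) < nstruct and attempts < 10 * nstruct:
        attempts += 1
        model = build_low_res_model()            # fast ab initio stage
        if not passes_filters(model):            # obviously wrong -> discarded, not saved
            continue
        decoys.append(full_atom_refine(model))   # slow full-atom stage
    print(f"This process generated {len(decoys)} decoys from {attempts} attempts.")
    return decoys

# Placeholder stages so the sketch runs on its own.
def build_low_res_model():
    return {"score": random.uniform(-50.0, 50.0)}

def passes_filters(model):
    return model["score"] < 0.0                  # keep only plausible-looking models

def full_atom_refine(model):
    model["score"] -= random.uniform(0.0, 10.0)  # pretend refinement improves the score
    return model

run_rosetta_like_protocol(nstruct=30)
```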

Cheers,

James
22) Message boards : Number crunching : Predictor of the day (Message 38018)
Posted 19 Mar 2007 by James Thompson
Post:
Our talented sysadmin Keith Laidig has won Predictor of the Day twice in a row as of today. Starting today, I'm going to disallow members of team Baker Lab from winning Predictor of the Day; Keith especially might have a bit of an unfair advantage. :)

Cheers,

James
23) Message boards : Number crunching : Predictor of the day (Message 37690)
Posted 12 Mar 2007 by James Thompson
Post:
Think there may be a slight glitch in the scripts - the user Charles has predicted the lowest energy structure for the same workunit 3 days in a row now.


Thanks for bringing this to my attention. The bug in picking a predictor of the day is now fixed, and there are now new predictors for March 10th and 11th. Feel free to e-mail me directly with any more predictor of the day related issues, my e-mail address is tex - at - u.washington.edu.

Cheers,

James
24) Message boards : Number crunching : Predictor of the day (Message 37482)
Posted 5 Mar 2007 by James Thompson
Post:
Thanks for getting POTD going again.

There's an error in the POTD RSS feed, caused by the '&' in the current POTD's team name.

http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fboinc.bakerlab.org%2Frosetta%2Frah_rss_potd.php


Thanks for the heads-up. I just fixed the offending invalid XML, and the W3 validator appears to be happy.
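For anyone curious about the underlying issue: a bare '&' is not legal inside XML, so it has to be escaped as &amp; (and similarly for '<' and '>'). The feed itself is generated by a PHP script, but here is a minimal sketch of the idea in Python, with a made-up team name:

```python
from xml.sax.saxutils import escape

team_name = "Crunchers & Friends"   # hypothetical team name containing '&'

# Escaping turns '&' into '&amp;' so the RSS feed stays valid XML.
item_title = f"<title>Predictor of the Day: {escape(team_name)}</title>"
print(item_title)
# <title>Predictor of the Day: Crunchers &amp; Friends</title>
```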
25) Message boards : Number crunching : Predictor of the day (Message 37439)
Posted 5 Mar 2007 by James Thompson
Post:
The 'predictor of the day' seems to have stopped since Feb 13th - will this be restarted at some point? I think it's a useful incentive to some crunchers (myself included...)



Hi Proxima,

With the help of David Kim, I've just finished reworking the Predictor of the Day so that the reporting will be automatic and no longer require intervention from the scientists in our lab. Thanks for bringing this to my attention, and feel free to e-mail me with any questions or comments you have about it.

Cheers,

James
26) Message boards : Rosetta@home Science : Does the N-terminus fold first? (Message 35250)
Posted 22 Jan 2007 by James Thompson
Post:
...if they fold the "wrong" part of the protein first, they might become kinetically trapped in the incorrect conformation.


That sounds a lot like the ranger's presentation of why the Gunnison River is where it is. They said that if it had "known" then what it "knows" now, having ground its way down through hundreds of feet of granite, it would have chosen a different route. But now that it is so deep, the river cannot take the "easier" route.

no matter what route the protein takes... given enough time it will find the lowest energy native conformation.


So, that makes it sound like no matter what random number our machine is given to start with, it should be able to find the lowest energy structure... but we don't want the process to take forever, so we use a "shotgun" approach and fire out 100,000 different starting points. And for some of them, the "route to take" will be comparatively short.

I've been wondering: it would seem that if you start at random configuration 1 and test a few possible atomic twists, you might reach the same configuration as someone who starts at random number 2. I have read that there are on average roughly 3 to the n possible conformations, where n is the number of amino acids in the protein. So a 100-residue protein is a modest size, and 3 to the 100th power is about 5.15e+47, a 5 followed by 47 more digits. (Can I call that roughly the square root of a googol? Did you know a "googol" is a 1 with 100 zeros after it?) So ANYWAY, my question is: when I complete one of the roughly 100,000 models done for a detailed protein study, how many of these 5.15e+47 possible conformations have we actually tested? And how common is it for my processing of random model 1 to eventually reach the identical configuration that someone else reaches while processing random model 2?



Those are excellent questions. In the protocol we're currently publishing for our structure prediction studies, we actually test for structural convergence of low-energy models. Put another way, if we see 20 of the 50 lowest-energy predictions with very low RMSD to each other, that would be a very high-confidence prediction. Enumerating the possibilities for various proteins is something we're currently working on inside the lab.
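To make that more concrete, here is a minimal sketch of such a convergence check, assuming we already have energies and coordinates for a set of models (the RMSD here is a plain coordinate RMSD with no superposition, purely for illustration; the real analysis uses proper structural alignment):

```python
import numpy as np

def rmsd(a, b):
    """Root-mean-square deviation between two (N, 3) coordinate arrays."""
    return np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1)))

def convergence_fraction(models, n_lowest=50, cutoff=3.0):
    """Fraction of the n_lowest-energy models within `cutoff` Angstroms RMSD
    of the single lowest-energy model."""
    by_energy = sorted(models, key=lambda m: m["energy"])[:n_lowest]
    best = by_energy[0]["coords"]
    close = sum(1 for m in by_energy if rmsd(m["coords"], best) < cutoff)
    return close / len(by_energy)

# Example with random toy "models": a high fraction would suggest convergence.
rng = np.random.default_rng(0)
models = [{"energy": rng.normal(), "coords": rng.normal(size=(100, 3))}
          for _ in range(200)]
print(f"{convergence_fraction(models):.2f} of low-energy models agree")
```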

The degrees of freedom that most people talk about for protein structure prediction are the phi, psi, and omega angles, which are torsion angles along the protein backbone. For a protein of N residues there are 3N of these angles, and even a coarse three-state sampling per residue gives on the order of 3^N possible conformations (as you point out), which gets big very quickly. Even worse, these degrees of freedom do not always give you the resolution you need to discriminate properly folded from improperly folded conformations, so this low-resolution search isn't sufficient to solve the problem on its own. However, you can use it to very quickly throw away highly unlikely conformations, so that you can save your more detailed (i.e., higher-resolution) inspections for conformations that are more reasonable.


I took a graduate class in artificial intelligence. We were asked to write a program to solve the priests and wolves problem. You have... what was it? 3 priests and 5 wolves on opposite sides of a river, and you have to ferry them around to reverse which bank each group is on. But if you have too many wolves in the boat they will eat the priests, and if there are ever more wolves than priests on either shore, the wolves will eat the priests. Basically you have to make a LOT of virtual trips back and forth across the river to find a sequence of crossings that solves the problem with no one getting killed.

My proud moment was raising my hand when the assignment was due and the instructor asked if anyone had completed it. You see, in 1986, the computer we were using didn't have enough memory to store the "game tree" that was the obvious way to complete the assignment. The rest of the class wrote programs that would work in theory, but maxed out the memory and crashed before producing an answer.

The way I solved the problem was to keep a global record of configurations I had reached before. I mean, if you are presently studying a situation with 2 priests and 2 wolves on the left bank, 1 priest and 1 wolf in the boat, and 2 wolves on the right bank... it doesn't really matter how you arrived at that configuration. If you've been there before and concluded there was no amount of shuffling that could make it work, there's nothing more to learn from it. So I stashed that configuration away, and from that point forward, before I took my virtual trip across the river, I'd check the list. If the configuration I was about to reach had already been tried, I'd save the trip!

The other students were actually reaching the same point in the simulation more than once and processing it all over again. They were finding that you could reach the same configuration in scores of different ways, some of which took dozens of additional trips. In fact, if you send the boat back with the same number of priests and wolves as it just came with, you can work yourself into an infinite loop! And consume more memory than the machine had, and crash your job.

I realize that there are too many conformations to stash away in a list, and it would probably take more time to maintain such a list and to compare to it prior to each proposed twist and turn... but I was wondering if there is some way to apply the same concept to Rosetta. To end the pursuit when you reach a point that is already known (by others) not to be fruitful.


That's a fun story; it's very gratifying to figure out things like that on your own. You may not know it, but you rediscovered memoization, which does exactly what you're describing: it's a way to avoid re-computing the same things many times. The problem (as you point out) is that the state space for protein folding is huge, and no single computer can come close to storing all of the possible answers. You could presumably store everything in a huge database and send results to and from Rosetta clients, but the heterogeneous nature of Rosetta@home makes this logistically hard. We still have a lot of users on dial-up.
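For anyone curious, here is a minimal sketch of that visited-set trick applied to a river-crossing search. This is a generic breadth-first search, not anything Rosetta actually runs, and the default numbers correspond to the classic 3-and-3 version of the puzzle:

```python
from collections import deque

def solve_crossing(priests=3, wolves=3, boat_size=2):
    """Breadth-first search over river-crossing states, never revisiting a
    state we've seen before -- the same trick as keeping a memo table."""

    def safe(p, w):
        # Priests are never outnumbered by wolves on either bank.
        return (p == 0 or p >= w) and \
               (priests - p == 0 or priests - p >= wolves - w)

    start = (priests, wolves, 1)      # (priests on left, wolves on left, boat on left?)
    goal = (0, 0, 0)
    visited = {start}
    queue = deque([(start, [])])
    while queue:
        (p, w, boat), path = queue.popleft()
        if (p, w, boat) == goal:
            return path               # list of (priests, wolves) boat loads
        sign = -1 if boat else 1      # boat on left moves people off the left bank
        for dp in range(boat_size + 1):
            for dw in range(boat_size + 1 - dp):
                if dp + dw == 0:
                    continue
                np_, nw = p + sign * dp, w + sign * dw
                state = (np_, nw, 1 - boat)
                if 0 <= np_ <= priests and 0 <= nw <= wolves \
                        and safe(np_, nw) and state not in visited:
                    visited.add(state)   # never expand the same configuration twice
                    queue.append((state, path + [(dp, dw)]))
    return None

print(solve_crossing())
```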

There are a variety of tricks for conserving computational power, and we're definitely interested in pursuing them.
27) Message boards : Number crunching : Servers down? (Message 33437)
Posted 26 Dec 2006 by James Thompson
Post:
I am having a lot of trouble uploading over the last couple of days. Here is a list of the recent messages:

12/25/2006 7:47:44 PM|rosetta@home|Started upload of file s026__BOINC_ABRELAX_NEWRELAXFLAGS_hom002__1462_16588_0_0
12/25/2006 7:48:39 PM|rosetta@home|Started upload of file s023__BOINC_ABRELAX_NEWRELAXFLAGS_hom016__1456_16650_0_0
12/25/2006 7:52:55 PM||Project communication failed: attempting access to reference site
12/25/2006 7:52:57 PM|rosetta@home|Temporarily failed upload of s026__BOINC_ABRELAX_NEWRELAXFLAGS_hom002__1462_16588_0_0: http error
12/25/2006 7:52:57 PM|rosetta@home|Backing off 1 minutes and 5 seconds on upload of file s026__BOINC_ABRELAX_NEWRELAXFLAGS_hom002__1462_16588_0_0
12/25/2006 7:52:58 PM||Access to reference site succeeded - project servers may be temporarily down.
12/25/2006 7:53:48 PM||Project communication failed: attempting access to reference site
12/25/2006 7:53:49 PM|rosetta@home|Temporarily failed upload of s023__BOINC_ABRELAX_NEWRELAXFLAGS_hom016__1456_16650_0_0: http error
12/25/2006 7:53:49 PM|rosetta@home|Backing off 20 minutes and 3 seconds on upload of file s023__BOINC_ABRELAX_NEWRELAXFLAGS_hom016__1456_16650_0_0
12/25/2006 7:53:50 PM||Access to reference site succeeded - project servers may be temporarily down.
12/25/2006 7:54:03 PM|rosetta@home|Started upload of file s026__BOINC_ABRELAX_NEWRELAXFLAGS_hom002__1462_16588_0_0
12/25/2006 7:54:08 PM|rosetta@home|Finished upload of file s026__BOINC_ABRELAX_NEWRELAXFLAGS_hom002__1462_16588_0_0
12/25/2006 7:54:08 PM|rosetta@home|Throughput 2274 bytes/sec
12/25/2006 8:13:52 PM|rosetta@home|Started upload of file s023__BOINC_ABRELAX_NEWRELAXFLAGS_hom016__1456_16650_0_0
12/25/2006 8:16:43 PM|rosetta@home|Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
12/25/2006 8:16:43 PM|rosetta@home|Reason: Requested by user
12/25/2006 8:16:43 PM|rosetta@home|Reporting 1 tasks
12/25/2006 8:19:04 PM||Project communication failed: attempting access to reference site
12/25/2006 8:19:04 PM|rosetta@home|Temporarily failed upload of s023__BOINC_ABRELAX_NEWRELAXFLAGS_hom016__1456_16650_0_0: http error
12/25/2006 8:19:04 PM|rosetta@home|Backing off 2 hours, 56 minutes and 17 seconds on upload of file s023__BOINC_ABRELAX_NEWRELAXFLAGS_hom016__1456_16650_0_0
12/25/2006 8:19:06 PM||Access to reference site succeeded - project servers may be temporarily down.

Is there some way to resolve all this?

Thanks,
Karl Loucks



I'm currently in Texas for the holidays and I just forced BOINC to report an old workunit and grab a new one. Can you try again and confirm that the problem is still present for you?
28) Message boards : Rosetta@home Science : CASP 7 Results (Message 31177)
Posted 15 Nov 2006 by James Thompson
Post:
Hello,

Later this month David Baker and the rest of the CASP team (including myself) will travel to California to attend the CASP meeting (which runs from November 26-30th). David tells me that a large part of the meeting is spent huddling around laptops, looking at score distributions for the various groups and their various submissions. I think it will be very interesting to see how many targets each group submitted, as there were many more targets in this CASP than in any previous one.

Also, I'm happy that you noticed Target 0354. I worked on that one quite a lot, and am very proud of our accuracy in that prediction. I also know that we would not have achieved our level of accuracy on so many targets without your help. Thank you very much for your time.


Hello again!

We were excited to see David Baker's post about Target 0354, "This is one of the best ab initio structure predictions we made in the CASP7 experiment", not least because we helped to find it. I see the CASP 7 meeting runs from November 26th until the 30th. Does anyone know when we will find out the rankings of how well each group did? For instance, when will we know the other groups' structure predictions for each protein? I am curious to see how Rosetta@home fared this year. Best wishes to all projects of course!
29) Message boards : Rosetta@home Science : A thought about method (Message 30963)
Posted 11 Nov 2006 by James Thompson
Post:
There are people in the group who have tried evolutionary methods similar to what you describe. As I understand it, one large problem comes when trying to implement crossover of two structures: an attempt at crossing over will sometimes leave atoms trying to occupy the same space. As we get closer to the real solution, each organism should be more tightly packed, so these clashes will likely become more frequent. Without recombination between organisms, genetic algorithms are probably not much better than random searches. I do know that people within our group have successfully used recombination of structures in the past, but recombination-style approaches are not currently part of our standard structure prediction protocol.

Here's another way to think about it: genetic algorithms are a very general approach to function optimization, since they place no explicit mathematical requirements on the function being optimized. Rosetta's energy function, however, has been designed to have some useful properties that can be exploited by more specialized optimization methods. One example is that the energy function is first-order differentiable, so we can use gradient descent to find local minima in the neighborhood of the current trajectory.
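To make the gradient-descent point concrete, here is a toy sketch using a made-up one-dimensional "energy" function (nothing like Rosetta's real energy function): because the function is differentiable, we can simply follow the slope downhill to the nearest local minimum from wherever the current trajectory happens to be.

```python
def energy(x):
    # Toy differentiable "energy" with two local minima near x = -2 and x = +2.
    return (x ** 2 - 4) ** 2 + 0.5 * x

def d_energy(x):
    # Analytic first derivative of the toy energy.
    return 4 * x * (x ** 2 - 4) + 0.5

def gradient_descent(x, step=0.01, iters=1000):
    """Follow the negative gradient to the local minimum near the start point."""
    for _ in range(iters):
        x -= step * d_energy(x)
    return x

# Two different starting points fall into two different local minima.
for start in (-3.0, 3.0):
    x_min = gradient_descent(start)
    print(f"start {start:+.1f} -> local minimum near x = {x_min:+.3f}, "
          f"energy = {energy(x_min):.3f}")
```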

Feel free to ask more questions if any of that doesn't make sense!

Has the use of a genetic algorithm (in the computer science sense) been considered? For those not familiar with this, it tries to evolve better solutions to a problem through "natural selection".

You start with a population pool of randomly generated solutions.

Members of the pool are randomly selected and bred to produce an offspring solution which inherits its characteristics at random from its parents. A small random mutation might also be added at this point.

All the offspring are evaluated for "fitness" against some predefined criterion.

The total population pool is then culled, dropping those of lowest fitness.

Repeat the breeding etc. The fitness of the pool should increase with time. When things seem to have stabilised, take your fittest individual.

The above process can be repeated from scratch with a new random pool to see how good your previous solution was.

Rosetta could work well with this. The Rosetta main computers could keep track of the pool, and do the breeding. Offspring could then be sent out as starting positions for the normal process (which would do some random jiggling) and come back with an energy to be used for the fitness evaluation.

The breeding could be done by selecting a random point along the chain and using one parent's angles up to that point and the other's thereafter. Some rearrangement might be needed if this resulted in the molecule passing through itself.

Tim

30) Message boards : Number crunching : What's up with the home page. (Message 30788)
Posted 7 Nov 2006 by James Thompson
Post:
David Kim and I just fixed this today. Thanks for the heads up!

Yes, I did see that, but what I'm getting at is that the top predictor on the main page is for the 6th of November, yet when you click on "more" the last one listed there is for the 1st of November. What happened to the rest?

31) Message boards : Rosetta@home Science : Rosetta@home Active WorkUnit(s) Log (Message 30090)
Posted 27 Oct 2006 by James Thompson
Post:
Starting tomorrow we will be sending out our first workunits for our newest ab-initio structure prediction project. These workunits will look like this:

s001__BOINC_ABRELAX_SAVE_ALL_OUT_hom001_

We are collaborating with a number of structural genomics centers so that we can attempt to predict structures for proteins whose structures will be experimentally solved within the next six months. This project is very obviously inspired by the CASP competitions, and some of us in the lab have started calling this "CASP all-the-time." It will be very useful for us to have a running benchmark of our methods on absolutely new crystal structures.
32) Message boards : Number crunching : why is this machine failing so much? (Message 29839)
Posted 23 Oct 2006 by James Thompson
Post:
It is an old PII Inspiron 7000 laptop. I have reinstalled the OS several times, the most recent reinstall was Friday evening. It does not error out for Docking or SETI. So it is something unique to Rosetta.

I will try running the test referenced anyway, probably tomorrow when I get a chance. Where does one get Memcheck86+? A quick google didn't turn anything up. This is a linux box, will it run on linux?


Hi Zombie,

I don't know much about memcheck86, but I have used memtest86 many times in the past. I would run it from a Linux LiveCD (such as Knoppix, http://www.knoppix.org), as that doesn't require installing any programs on your laptop. Here's a link to an article that describes the process in detail:

http://software.newsforge.com/software/06/06/27/206209.shtml?tid=91&tid=132

33) Message boards : Rosetta@home Science : Rosetta@home store (Message 29777)
Posted 21 Oct 2006 by James Thompson
Post:
Here's the consensus that I have from talking to people:

- We need to talk with people from both UW and HHMI about how to do this within the confines of our grant agreements.
- We'd definitely like to have some sort of t-shirt (or possibly a dressier polo shirt) to wear at the CASP conference that members of the lab will be attending in late November.

I'll try to keep this in the lab group consciousness, as I know many people would get a kick out of Rosetta@Home schwag (myself included).
34) Message boards : Rosetta@home Science : Rosetta@home store (Message 29721)
Posted 20 Oct 2006 by James Thompson
Post:
I don't think we currently have this set up, but I'll bring this up at our group meeting in ~45min.
35) Message boards : Rosetta@home Science : MD in Rosetta@Home? (Message 26466)
Posted 9 Sep 2006 by James Thompson
Post:
Feet1st: "NMR structure prediction" is (I think) what R@H uses for "full atom relax". I don't know how this ties in with NMR (which it must do somehow). The description here says it "Incorporates NMR data into the basic Rosetta protocol for rapid structure determination at moderate to high resolution and speeds up the process of NMR structure prediction."


Actually, that's not quite correct. NMR stands for Nuclear Magnetic Resonance, the absorption of energy by atomic nuclei held in a magnetic field of a certain strength. This phenomenon is exploited in NMR spectroscopy, a technique used to experimentally determine the structures of proteins. NMR spectroscopy (usually abbreviated as just NMR) gives a series of constraints describing which atoms are in close contact with each other. Structure determination by NMR consists of elucidating these constraints and using them to build a structural model that makes sense.

We can use these constraints from NMR in our full-atom relax protocols in order to shrink the size of the conformational space: if we can pin two different parts of the protein close together, there are far fewer possible conformations for the protein to explore.
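As a rough picture of how such constraints can be used (an illustrative sketch only, not the actual Rosetta constraint machinery), each NMR-derived contact can be turned into a penalty that is added to the score whenever two atoms drift farther apart than the data allows, so conformations that violate the experimental constraints score badly:

```python
import numpy as np

def constraint_penalty(coords, constraints, weight=10.0):
    """Toy harmonic penalty for NMR-style distance constraints.

    coords      -- (N, 3) array of atom positions
    constraints -- list of (atom_i, atom_j, max_distance) tuples
    """
    penalty = 0.0
    for i, j, max_dist in constraints:
        dist = np.linalg.norm(coords[i] - coords[j])
        if dist > max_dist:
            # Quadratic penalty for violating the experimental contact.
            penalty += weight * (dist - max_dist) ** 2
    return penalty

def physical_energy(coords):
    # Placeholder for the real (much more complicated) energy function.
    return 0.0

def total_score(coords, constraints):
    # Hypothetical combination: physical energy plus constraint violations.
    return physical_energy(coords) + constraint_penalty(coords, constraints)

# Example: two toy constraints between atoms 0-5 and 2-9 on random coordinates.
coords = np.random.default_rng(0).normal(size=(10, 3))
print(total_score(coords, [(0, 5, 4.0), (2, 9, 4.0)]))
```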
36) Message boards : Rosetta@home Science : Rosetta@home Active WorkUnit(s) Log (Message 26411)
Posted 9 Sep 2006 by James Thompson
Post:
Over the next two days I'll be adding some new jobs to the Rosetta@Home queue that will attempt to use a slightly modified methodology for ab initio prediction. The jobs look like this:

2vik__BARCODE_SEARCH_BARCODE_FROM_FRAGS_ABINITIO_barcode_from_frags

That batch of workunits (and three others) are currently running on Ralph.

We use a method called a barcode to constrain certain residues of a protein to adopt a specific conformation. This barcode ensures that the conformation stays fixed throughout an entire run. The barcode_from_frags variant tries to infer which residues to constrain by examining the distribution of conformations of analogous sections in proteins of known structure.
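As a very rough illustration of the idea (a toy sketch; this is not how barcodes are actually implemented in Rosetta), you can think of it as a sampler that simply refuses to perturb the constrained residues, so their conformation survives the entire run:

```python
import random

def sample_with_barcode(torsions, constrained, n_moves=1000, step=10.0):
    """Toy torsion-angle sampler that never touches 'barcoded' residues.

    torsions    -- list of backbone torsion angles (degrees), one per residue
    constrained -- set of residue indices whose conformation must stay fixed
    """
    torsions = list(torsions)
    for _ in range(n_moves):
        i = random.randrange(len(torsions))
        if i in constrained:
            continue                      # barcoded residue: leave it alone
        torsions[i] += random.uniform(-step, step)
    return torsions

angles = [180.0] * 50                     # hypothetical 50-residue toy protein
fixed = {10, 11, 12, 30}                  # residues held in a specific conformation
result = sample_with_barcode(angles, fixed)
print([result[i] for i in sorted(fixed)]) # unchanged: [180.0, 180.0, 180.0, 180.0]
```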

And yes, I know that barcode_from_frags is in there twice and I'll try to fix that before my runs on Rosetta@Home commence. :)
37) Message boards : Rosetta@home Science : Are the results of SIMAP interesting for Rosetta? (Message 26116)
Posted 5 Sep 2006 by James Thompson
Post:
We do not currently use the results of SIMAP in our laboratory. However, searching a sequence database for sequences similar to a given query sequence is a very common task in computational biology, and there are many applications that use this kind of search. First, let me give you the ten-second rundown on algorithms for aligning two protein (or DNA) sequences:

- Smith-Waterman: an exhaustive algorithm guaranteed to find the best alignment given a scoring system.
- FASTA: a heuristic approximation of Smith-Waterman that looks for "seeds" of true matches by finding submatches of a given length (usually 3-5 for protein sequences, or 10-12 for DNA sequences).
- BLAST: another heuristic algorithm that improves on the speed of FASTA without significantly decreasing its ability to find good matches.

These algorithms can be extended to compare one protein against many proteins, as one might do with a CASP target of unknown function. However, exhaustive Smith-Waterman quickly becomes impractical at that scale, since the full dynamic programming is far too slow to run against a large sequence database.
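For the curious, here is a compact sketch of Smith-Waterman local alignment with a simple match/mismatch score and a linear gap penalty (real tools use substitution matrices such as BLOSUM62 and affine gap penalties, so this only shows the shape of the dynamic programming):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap        # gap in sequence b
            left = score[i][j - 1] + gap      # gap in sequence a
            score[i][j] = max(0, diag, up, left)  # 0 lets a local alignment restart
            best = max(best, score[i][j])
    return best

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))  # small toy protein sequences
```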

SIMAP is using the FASTA algorithm for computing similarity between proteins, while for most of our purposes we use BLAST (and most often a variant of BLAST known as PSI-BLAST). BLAST is definitely faster than FASTA, and for similar sequences they give the same results. The speed gains from using BLAST are especially significant for us, since we're running a very large number of comparisons on our own hardware.

Certainly the idea of pre-computing protein similarities is a good one. However, when performing a PSI-BLAST search, pre-computing these similarities presents a number of problems. I am not quite sure why SIMAP decided on FASTA rather than BLAST as I have not reviewed that project extensively.

I need to run right now, but if I have time later today or tomorrow I'll post more on why we use PSI-BLAST rather than BLAST, and I'll tell you folks about a project similar to SIMAP that tries to accomplish the same goal using 3-dimensional structures of proteins.

Hope that this makes sense! Cheers,

James Thompson

*bump*

Now that CASP is over and the transition to the new credit system has been successfully completed, maybe someone from the project team can find the time to answer whether BOINC SIMAP (Similarity Matrix of Proteins) is of any use for Rosetta:

http://webclu.bio.wzw.tum.de/cgi-bin/simap/start.pl
and the BOINC project is here
http://boinc.bio.wzw.tum.de/boincsimap/
38) Message boards : Number crunching : out of work (Message 25840)
Posted 1 Sep 2006 by James Thompson
Post:
I've submitted 11,000 new workunits to the queue; the home page should be updated soon. I'll put a quick message into the Active Workunits thread in the next hour. Thank you all for keeping an eye on this, and my apologies for letting the workunit queue run dry.

Increasing the buffer is a great idea, I'll talk to David Kim about doing this.

Cheers,

James

I get this message.

2006-09-01 16:40:33|rosetta@home|No work from project

Anybody?

Anders n

39) Message boards : Number crunching : Changelog, RMSD and native structure (Message 24184)
Posted 21 Aug 2006 by James Thompson
Post:
1) OK, CASP was a nice reason, but it has been over for 3 weeks now!

This is true, but the utility of CASP is not yet over for our lab. We're still submitting workunits on some proteins to investigate how increased sampling and different approaches influence our results. Also, while CASP is over, not all of the structures have been released (fewer than half of the proteins that I've worked on have been released, and I'm very interested in them), which means the CASP workunits are still a blind test for us! And, as Feet1st points out, CASPR is still ongoing and the lion's share of our workunits is directed towards that.

2) C'mon, it would take less than a minute to add a few short sentences to the change log file: just what is new and what was improved in the current version compared to the previous one. The only reason I can see for them not doing that is that no progress has been made...

If people are interested in what's being added to the Rosetta application, I'll include more detail on the changes when I update the application later this week. There are at least two scientifically interesting updates in the next version: changes that allow Rosetta@Home to be used efficiently and effectively for our vaccine design trials, and a new term that will measure how tightly a given structure is packed together.
40) Message boards : Number crunching : Crunching question (non points related!!!) (Message 22923)
Posted 18 Aug 2006 by James Thompson
Post:
I checked, and the computer has PC100 RAM.

A friend said I could replace that with some PC133 RAM I had around, and I did, going from 128 MB to 384 MB. Now things are faster, and Rosetta is attached to it!

Thanks to everyone for their help, especially James Thompson for his generous offer.


No problem. Thanks for running Rosetta, and congratulations on finding such a good deal on a spare crunching machine.




