BioInfo:Helen Lee

Extricating Protein Structure for Protein Structure Prediction

Introduction

Why has protein structure prediction fascinated biologists and computational biologists for the past three decades? Part of that fascination comes from the need to determine a given protein's function, and part from the fact that the sequence encodes a specific structure. Proteins "are instrumental in almost everything cells do" [2]; their functions include transport, storage, signaling, movement, and defense. The function of a protein depends on its structure, so knowing the structure can lead to discovery of the function. Because a protein's amino acid sequence folds into a specific structure, one should in principle be able to predict the function of a protein from its sequence. Knowledge of the sequence, structure, and function of a protein can lead to powerful designer drugs to fight disease. The HIV protease inhibitors are one such example.

Tremendous effort has centered on predicting the structure of proteins from their sequence. This type of prediction is called ab initio prediction [6]. Before the current methods of prediction were developed, scientists tried to predict structures from physicochemical principles, specifically with molecular dynamics methods [6]. Rost states, "in practice, however, such approaches are frustrated by the enormous complexity of the calculation (requiring many order of magnitude more computation that is currently feasible) and by inaccuracies in the experimental determination of basic parameters." Molecular dynamics methods can therefore be far more difficult than necessary. Knowledge-based methods for structure prediction can be far faster and do not require as many parameters. A knowledge-based method uses known structures whose sequences are similar to the target to predict an unknown structure.

Current Methods

Current methods for ab initio prediction that have proved far more successful than molecular dynamics include homology modeling (of which there are many variants) and threading. Most of these methods use some sort of evolutionary information to guide them, and this is especially true of homology modeling. Homology modeling takes proteins of known structure that are close evolutionary relatives of the target and attempts to predict the unknown structure from the known one. Remote homology modeling is similar, except that the pairwise sequence identity between target and template is less than twenty-five percent. Such relationships cannot be recognized with normal homology methods, but threading methods can detect them. A threading method places the target sequence onto the template backbone and then evaluates the resulting structure. These evaluations are done with mean-force potentials or environmental classifications.
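The twenty-five percent threshold can be made concrete with a short sketch. The function names, the gap character, and the choice to count identity only over columns where both sequences have a residue are assumptions made for illustration; other definitions of percent identity use a different denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where both sequences have a residue.

    Assumption: both strings are rows of the same pairwise alignment,
    so they have equal length and use `gap` for insertions/deletions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which neither row is a gap.
    paired = [(a, b) for a, b in zip(aligned_a, aligned_b)
              if a != gap and b != gap]
    if not paired:
        return 0.0
    matches = sum(1 for a, b in paired if a == b)
    return 100.0 * matches / len(paired)


def is_remote_homolog(aligned_a: str, aligned_b: str,
                      threshold: float = 25.0) -> bool:
    """True when identity falls below the threshold quoted in the text."""
    return percent_identity(aligned_a, aligned_b) < threshold
```

For example, `percent_identity("ACDE-", "ACDFG")` compares four paired columns, three of which match.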

Structure prediction can be broken down into three categories, according to Rost and O’Donoghue [6]: 1D, 2D, and 3D prediction. 1D prediction predicts the secondary structure of the protein, that is, elements such as α-helices, β-sheets, and loop regions. No xyz coordinates are predicted. 2D prediction builds on the 1D predictions and attempts to predict inter-residue distances and interactions. Finally, 3D prediction predicts the three-dimensional structure of all atoms contained within the protein, including their interactions with one another. Methods for 3D prediction have used information from 1D and 2D predictions as well as distance geometry methods [6]. Methods created by Dunbrack and Simons have performed well at CASP [4,11] when predicting 3D structure. CASP (Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction) is a meeting held to assess the current prediction methods for protein structure.
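To illustrate what a 2D description of a structure looks like, the sketch below computes an inter-residue distance matrix and a contact map from one representative coordinate per residue. This is only an illustration of the target of a 2D prediction, not a prediction method; the function names and the 8 Angstrom cutoff are assumptions chosen for the example.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates.

    `coords` is a list of (x, y, z) tuples, one per residue
    (for example the alpha-carbon position of each residue).
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]


def contact_map(coords, cutoff=8.0):
    """Boolean matrix: True where two residues lie within `cutoff`.

    The 8.0 default is an illustrative assumption, not a value
    taken from the text.
    """
    return [[d <= cutoff for d in row] for row in distance_matrix(coords)]
```

The distance matrix is symmetric with a zero diagonal, so a predictor only needs to estimate the entries above the diagonal.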

Current Application

After the successful predictions at CASP3, Karplus hopes to extend the SAM-T98 prediction method to 3D-structure prediction. The SAM-T98 method was able to "predict structures as well as all but five of the structure based methods in CASP3" using sequence information and no structural information [5]. SAM-T98 is an iterative method for detecting remote homologues. The new prediction method will not only use SAM-T98 but will also add structural information about the sequence being predicted. Using a combination of techniques from Simons [9,10,11] and Dunbrack [4], the method is intended to predict the complete three-dimensional structure. Dunbrack uses a technique of stripping the side chains from a homologous structure and replacing them with the side chains of the target sequence. Simons, on the other hand, uses a simulated annealing process to predict the interactions between the non-aligned positions of the homologue and the target. The combination of all three techniques (homology detection, side chain replacement for aligned positions, and simulated annealing for unaligned positions) will hopefully enhance the already good protein structure predictions. The important bridge between the proposed method of 3D prediction and the current method of 1D prediction is the collection of structures on which to base the 3D predictions.

Currently, Karplus and Dunbrack are working on an automated way to collect semi-threaded sequences. This involves setting up both the sequence to be threaded and the backbone onto which it will be threaded. The collection technique has two distinct steps: the extraction step and the translation step. The extraction step looks through the aligned sequences for replacements; those alignment columns are noted and passed to the translation step. The translation step takes the template sequence in its three-dimensional form and replaces the side chain of each amino acid with that of the corresponding target amino acid, if the two differ. The translation step attempts to leave as much of the template backbone in place as possible.

The extraction step starts with an alignment file. An alignment file was chosen as the input so that many different alignment programs could be evaluated within the overall system. The file is read to determine the alignment columns between the target and template sequences. At this point only sequence information is in use. Next, the structure of the template is read from a file in PDB (Protein Data Bank) [13] format. The sequence is inherent in that format, so reading the structure also yields a sequence. If the template sequence from the alignment and the sequence from the template PDB file are not the same, the two must themselves be aligned. Once a link has been made between the target sequence and the template PDB sequence, extraction of the alignment columns can begin. Only replacement columns are considered; inserts and deletes are ignored. The alignment columns of the target sequence and the template PDB sequence are then passed to the translation step.
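The core of the extraction step can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes the alignment has already been reduced to two gapped strings of equal length, that "-" marks a gap, and the function name and tuple layout are hypothetical.

```python
def replacement_columns(target_aln: str, template_aln: str, gap: str = "-"):
    """Collect alignment columns in which both sequences have a residue.

    Returns a list of (target_index, template_index, target_res,
    template_res) tuples, where the indices are 0-based positions in
    the ungapped target and template sequences. Columns containing a
    gap in either row (inserts and deletes) are skipped.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length")
    columns = []
    t_idx = 0  # position in the ungapped target sequence
    p_idx = 0  # position in the ungapped template sequence
    for t_res, p_res in zip(target_aln, template_aln):
        if t_res != gap and p_res != gap:
            columns.append((t_idx, p_idx, t_res, p_res))
        # Advance each counter only when its row holds a residue.
        if t_res != gap:
            t_idx += 1
        if p_res != gap:
            p_idx += 1
    return columns
```

Keeping both indices in each tuple preserves the link between target positions and template positions, which the later steps need in order to find the right ATOM records and to write the index file.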

The translation step receives its input from the extraction step in the form of alignment columns. The corresponding ATOM records in the template PDB file are located and written to another PDB file. This new PDB file is incomplete: it contains only the ATOM records for the aligned residue positions. Only aligned segments are written because the new file is used as input to the translation program, which only needs to change the residues that are aligned, that is, those that are evolutionarily equivalent. The translation tool used is SCWRL (Side Chain replacement With a Rotamer Library) [1]. SCWRL is a tool for "adding side chains to a protein backbone." It is used here to replace the side chains of the template sequence with those of the target sequence. To accomplish this, a SCWRL sequence file is written that tells SCWRL which side chains to replace and which to retain. Residues that differ in the alignment columns are replaced, while those that are the same are kept. Information on the SCWRL sequence file format may be obtained from the SCWRL web site [8]. The output from SCWRL is a PDB file with the backbone from the template PDB file and the side chains from the target sequence. This PDB file is then converted once more into a format compatible with the larger project. It is accompanied by an index file, which gives the target sequence numbers of the residues contained in the translated PDB file.
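The two pieces of input prepared for SCWRL can be sketched as below. Both functions are hypothetical helpers written for illustration. The first relies on the fixed-column layout of PDB ATOM records (chain identifier in column 22, residue sequence number in columns 23 to 26) and ignores insertion codes. The second assumes the convention documented for later SCWRL releases, in which an upper-case letter requests a new side chain and a lower-case letter keeps the input coordinates; the documentation for the release in use [8] should be checked before relying on it.

```python
def select_atom_records(pdb_lines, keep_residues, chain=None):
    """Return the ATOM records whose residue number is in `keep_residues`.

    `pdb_lines` is an iterable of PDB-format lines, `keep_residues` a set
    of integer residue sequence numbers, and `chain` an optional one-letter
    chain identifier. Insertion codes are ignored in this sketch.
    """
    selected = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if chain is not None and line[21] != chain:
            continue
        res_num = int(line[22:26])  # PDB columns 23-26
        if res_num in keep_residues:
            selected.append(line)
    return selected


def scwrl_sequence(columns):
    """Build a one-letter sequence string from (target, template) pairs.

    Assumed convention: upper case = replace the side chain with the
    target residue, lower case = residue is unchanged, keep it as is.
    """
    return "".join(
        t.upper() if t.upper() != p.upper() else t.lower()
        for t, p in columns
    )
```

For the aligned pairs `[("A", "A"), ("E", "F"), ("D", "D")]`, only the middle residue differs, so it is the only one written in upper case.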

SCWRL

As described by the SCWRL web site [8], "SCWRL is a program for adding side chains to a protein backbone based on the backbone-dependent rotamer library." Dunbrack and Cohen [3] describe rotamers as "local minima on potential energy maps with frequencies predictable from conformational analysis or organic molecules." A "backbone dependent rotamer library" records the preferences of the rotamers as a function of the φ and ψ angles of the backbone. The φ angle, according to Stryer [12], is "the degree of rotation at the bond between the nitrogen and α carbon atoms of the main chain," and the ψ angle is "that between the α carbon and carbonyl carbon atoms." Thus, the rotamer library records which side chain conformations each amino acid prefers for given backbone φ and ψ angles. SCWRL uses this information to try to minimize side chain to backbone clashes and side chain to side chain clashes.
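The φ and ψ angles are dihedral angles defined by four consecutive backbone atoms, and can be computed from atomic coordinates as in the sketch below. The function names are assumptions made for illustration; the atom ordering follows the usual convention (previous carbonyl carbon, nitrogen, α carbon, carbonyl carbon for φ; nitrogen, α carbon, carbonyl carbon, next nitrogen for ψ), with 0 degrees for the cis arrangement and 180 degrees for trans.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    y = b2_len * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """Rotation about the N to alpha-carbon bond."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """Rotation about the alpha-carbon to carbonyl-carbon bond."""
    return dihedral(n, ca, c, n_next)
```

A backbone-dependent rotamer library can then be thought of as a lookup keyed on the residue type and its (φ, ψ) pair, typically binned into intervals.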

SCWRL was chosen for two reasons. The first is its success at CASP3: in Dunbrack's experiments [4] where SCWRL was used, in "8 of 11 cases, [their] predictions [were] above or well above the median." The second is that it provides the much-needed bridge between the previous work [5] and the current work being undertaken by Karplus.

Conclusion

Protein structure prediction is extremely hard. After three decades of experiments aimed at unlocking the relationship between sequence and structure, there is still no easy way to accurately predict the structure from the sequence. This project hopes to aid in enhancing a very good 1D-prediction tool and moving it toward 3D prediction. It collects homologous sequences and converts their structures to the target sequence, using the template backbone.

Bibliography

1. M.J. Bower, F.E. Cohen, and R.L. Dunbrack Jr. Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: A new homology modeling tool. Journal of Molecular Biology, 267(5):1268-1282, 1997.

2. N.A. Campbell. Biology. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA, 3rd edition, 1993.

3. R.L. Dunbrack Jr. and F.E. Cohen. Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Science, 6(8):1661-1681, 1997.

4. R.L. Dunbrack Jr. Comparative modeling of CASP3 targets using PSI-BLAST and SCWRL. Proteins: Structure, Function, and Genetics Supplement 3, 34(1):81-87, 1999.

5. K. Karplus, C. Barrett, M. Cline, M. Diekhans, L. Grate, and R. Hughey. Predicting protein structure using only sequence information. Proteins: Structure, Function, and Genetics Supplement 3, 34(1), 1999.

6. B. Rost and S. O'Donoghue. Sisyphus and prediction of protein structure. Bioinformatics, 13(4):345-356, 1997.

7. SAM: Sequence Alignment and Modeling System. http://www.cse.ucsc.edu/research/compbio/sam.html

8. SCWRL homepage. http://www.cmpharm.ucsf.edu/~bower/scwrl/scwrl.html

9. K.T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Journal of Molecular Biology, 268(1):209-225, 1997.

10. K.T. Simons, I. Ruczinski, C. Kooperberg, B.A. Fox, C. Bystroff, and D. Baker. Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins: Structure, Function, and Genetics Supplement 3, 34(1):82-95, 1999.

11. K.T. Simons, R. Bonneau, I. Ruczinski, and D. Baker. Ab initio protein structure prediction of CASP III targets using Rosetta. Proteins: Structure, Function, and Genetics Supplement 3, 34(1):171-176, 1999.

12. L. Stryer. Biochemistry. W.H. Freeman and Company, New York, 4th edition, 1995.

13. The RCSB Protein Data Bank. http://www.rcsb.org/pdb/