Gloria Hsu

MBB 452a

Professor Gerstein

Protein Structure Alignment and Pattern Recognition

As protein sequence databases grow with automated sequencing systems, organizing these sequences in some useful order becomes an increasingly difficult problem. Ideally, machines would sequence an uncharacterized protein and computers would determine its structure. From these results we could construct a database of structures, verify them, and have computers, again, match them across similar motifs to narrow down the possible functions of each protein. Indeed, in the ideal scenario, the possible functions could then be automatically matched and screened for interaction with proteins involved in disease control, and new drugs could theoretically be found with virtually no human intervention. Such a utopia has yet to be created, however. While automated sequencing is possible, determination of higher-order structure is still imperfect, to say the least, and the alignment of similar protein structures is still under study. The databases, while presenting a huge array of information, are thus still not disgorging as much practical information as they theoretically could.

One of the particularly fascinating aspects of the automated classification systems is the superimposition of three-dimensional structures. This is interesting because the alignment of similar structures can demonstrate the common ancestry of proteins that have similar primary sequences. In proteins with dissimilar sequences, alignment is useful for its ability to identify proteins with similar functions, as well as to determine exactly which part of the protein’s substructure is necessary for such functions. Current structural classification methods include the manual clustering of structures, as in SCOP (Structural Classification of Proteins), as well as algorithmic methods based upon distances and vectors. The more commonly used distance-based methods include STRUCTAL, which uses iterative dynamic programming to optimize the alignment for an orientation that minimizes the distance between corresponding points on two proteins, and DALI, which uses distance matrices to locate similar distance patterns between two proteins. Vector-based methods such as VAST, on the other hand, represent secondary structure elements as arrows, locate similar vectors on the two proteins, and score similarity based on a maximum subset of similar vectors. There are also programs that use both distance and vector methods, such as LOCK, which first superimposes the two proteins based upon secondary structure vectors, then refines this superimposition by minimizing the distance between atoms on the two proteins.
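To make the distance-matrix idea concrete, here is a minimal sketch of why such matrices are convenient for comparison: the table of distances between points within one structure does not change when the structure is rotated or moved, so two structures can be compared without first superimposing them. This is not the DALI algorithm itself, which searches for similar sub-patterns within the matrices; the coordinates are made up, and the function names (distance_matrix, matrix_difference, rotate_z) are hypothetical helpers chosen for illustration.

```python
import math

def distance_matrix(coords):
    """Return the matrix of pairwise Euclidean distances between points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def matrix_difference(d1, d2):
    """Mean absolute difference between two equally sized distance matrices."""
    n = len(d1)
    total = sum(abs(d1[i][j] - d2[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)

def rotate_z(coords, angle):
    """Rotate points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

# Toy "backbone" of five points (invented coordinates, not a real protein).
backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0),
            (8.0, 4.0, 2.0), (9.0, 7.5, 3.0)]

# The same shape, rotated and shifted to a different place in space.
moved = [(x + 10.0, y - 4.0, z + 1.0) for x, y, z in rotate_z(backbone, 1.0)]

# A genuinely different shape: the backbone stretched along x.
stretched = [(2.0 * x, y, z) for x, y, z in backbone]

same = matrix_difference(distance_matrix(backbone), distance_matrix(moved))
different = matrix_difference(distance_matrix(backbone), distance_matrix(stretched))

print(f"moved copy:     {same:.6f}")       # essentially zero
print(f"stretched copy: {different:.6f}")  # clearly nonzero
```

The moved copy gives a difference of essentially zero, while the stretched copy does not. A real comparison would also have to decide which residues correspond to which, a step this sketch sidesteps by using two point sets of equal length in the same order.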

Essentially, what these programs try to do is to recreate human matching skills mathematically and reliably. While it is possible to manually sort and categorize proteins into their classes, folds, superfamilies and families, it is impractical to do so at the database level, when the number of protein fold categories is on the order of thousands. Thus, it is far more economical to encode a simplified version of this matching ability into a computer, which can then do the elementary combing of the data within the databases and bring any interesting new patterns to the attention of human minds.

One cannot help but think, though, that these computers would be even more efficient and effective, brilliant, even, if imbued with more human recognition properties than what is already encoded within their algorithms. What if, say, the DALI program were cross-referenced with a protein fold database, and could learn to recognize additional, yet-unnamed folds besides? What if the STRUCTAL program could "remember" a pattern of distances it has come across before, and thus require fewer adjustment steps before settling upon the optimum configuration? One would not expect even the most brilliant of human minds to be able to recognize all the possible three-dimensional manifestations of all the known folds, much less to recognize new patterns at the same time. But what of computers specialized for such a function? The programs presently available already have X-ray-like abilities in their focus on the backbone of the protein, and their artificial memories unquestionably surpass those of humans in the ability to maintain fully three-dimensional images of known protein folds. What they seem to lack, however, are pattern recognition abilities and a learning curve. Whereas these programs can "recognize" similarity between proteins through their ability to optimize superimposed orientations, they still rely on pair-wise comparison of individual proteins, instead of the wholesale scanning and recognition of patterns that humans seem to be able to do intuitively, albeit on a much smaller scale. A program could instead be made to scan the specifications of a multitude of proteins, not matching them up in 1-1 matrices, but scanning each of them once, automatically labeling the folds already identified, and learning from the mistakes it makes on proteins that have already been categorized manually. This automation of the already computer-controlled programs, through linkage to databases for reference and self-checks, would perhaps allow for the growth of a more sophisticated system.
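As a rough illustration of what such a scan-once, self-correcting program might look like, the sketch below keeps one running "average" description per known category, guesses a label for each protein as it passes, and then checks the guess against the manually assigned label and updates itself. Everything here is an assumption made for illustration: the FoldLabeler class, the two-number description of each protein (fraction of helix, fraction of strand), the labels, and the data are all invented, and no existing program is claimed to work this way.

```python
import math

class FoldLabeler:
    """Nearest-centroid labeler that updates itself from curated labels."""

    def __init__(self):
        self.centroids = {}  # label -> mean feature vector seen so far
        self.counts = {}     # label -> number of curated examples seen

    def predict(self, features):
        """Return the label whose centroid is closest, or None if untrained."""
        if not self.centroids:
            return None
        return min(self.centroids,
                   key=lambda label: math.dist(self.centroids[label], features))

    def learn(self, features, true_label):
        """Fold one curated example into the running mean for its label."""
        n = self.counts.get(true_label, 0)
        old = self.centroids.get(true_label, [0.0] * len(features))
        self.centroids[true_label] = [(o * n + f) / (n + 1)
                                      for o, f in zip(old, features)]
        self.counts[true_label] = n + 1

    def scan(self, features, true_label):
        """Guess once, then learn from the curated label; report if the guess was right."""
        guess = self.predict(features)
        self.learn(features, true_label)
        return guess == true_label

# Invented examples: (fraction helix, fraction strand) with a manual label.
curated = [
    ((0.80, 0.05), "all-alpha"),
    ((0.10, 0.60), "all-beta"),
    ((0.75, 0.10), "all-alpha"),
    ((0.05, 0.70), "all-beta"),
]

labeler = FoldLabeler()
results = [labeler.scan(features, label) for features, label in curated]
print(results)

# A new, unlabeled protein is compared against the stored centroids only,
# not against every protein seen so far.
new_guess = labeler.predict((0.70, 0.15))
print(new_guess)
```

The first two guesses are wrong because the program has not yet seen both categories; after that it labels correctly. For simplicity this sketch updates on every curated example rather than only on its mistakes, and a two-number summary is far too crude for real folds, but it shows the shape of the idea: each protein is read once and compared to a small stored memory rather than to every other protein.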

My speculations end here. Not being a computer-oriented person, I do not know if these ideas are actually feasible within the scope of normal lab computations. I merely note that, while computer programs are powerful in their own right as analytical tools in search of related proteins, they can be further empowered if backed up by a database of known tertiary motifs, and immeasurably strengthened by the ability to "see" certain motifs without the individual 1-1 matching of proteins. The strength of computers is in their impeccable memories and consistent action. While an AI may be too much to hope for, a program that is able to correct and learn from its mistakes through self-analysis is perhaps not.

References:

Altman, Russ B. (1999) Lecture notes, MIS 214/CS 274, Stanford University.

Gerstein, M. (1999) Lecture notes, MB&B 452a, Yale University.

Gerstein, M. & Levitt, M. (1998) "Comprehensive Assessment of Automatic Structural Alignment against a Manual Standard, the Scop Classification of Proteins," Protein Science 7: 445-456.

Holm, L. and Sander, C. (1993). "Protein Structure Comparison by Alignment of Distance Matrices." J. Mol. Biol. 233: 123-128.

Singh, Amit P. (1999). Lecture notes, Biochemistry 218/MIS 231, Stanford University.