MB&B Bioinformatics Project
Program for the Comparison of Protein or Nucleic Acid Sequences with User Defined Parameters



The key concept in the field of bioinformatics is the comparison of two or more sets of biological data to determine the similarities and differences between them, and thereby arrive at some understanding of their relationship. One of the first and still most widely used examples of this sort is the comparison of protein or nucleic acid sequences. To compare two sequences accurately, you must take into consideration the substitution, gain, and loss of amino acids or nucleotides within them. The end result of this comparison is an alignment. With enough comparisons, statistical data can be accumulated, indicating how significant a given alignment score is.

I have written a program that compares two protein or nucleic acid sequences and produces a global alignment for them. The program is flexible and can take many factors into account. It is not limited by sequence length, although longer sequences will of course take longer to align.

The first step in the program is to build a similarity matrix. Every residue in the first sequence is compared to every residue in the second sequence, cross-referenced against a substitution matrix, and the resulting similarity score is stored in a new matrix. The program can make use of many different types of substitution matrices. I have coded in an ordinary identity matrix, in which identical residues score 1 and all other pairs score 0, as well as (what appears to be) a BLOSUM matrix which was printed in the class notes.
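The similarity-matrix step can be illustrated with a minimal sketch. It is written in Python rather than the Perl of the original program, and the names identity_score and build_similarity_matrix are hypothetical, chosen only for this example. Only the identity scoring is shown; a BLOSUM-style table could be passed in as a different score function.

```python
def identity_score(a, b):
    """Identity substitution scoring: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def build_similarity_matrix(seq1, seq2, score=identity_score):
    """Score every residue of seq1 against every residue of seq2.

    Returns a list of rows: one row per residue of seq1,
    one column per residue of seq2.
    """
    return [[score(a, b) for b in seq2] for a in seq1]


# Illustrative input only; these sequences are not from the original project.
sim = build_similarity_matrix("ABC", "AC")
for row in sim:
    print(" ".join(str(value) for value in row))
```

Passing the scoring rule as a function keeps the matrix-building code independent of which substitution matrix is in use.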

The second step is to generate a sum matrix from the similarity matrix. The program works from the end of the sequences to the beginning, although in retrospect I believe working from the beginning to the end may be just as effective. This step takes into account a user-supplied gap opening penalty and gap extension penalty.
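One way to picture the sum-matrix step is the sketch below, again in Python rather than Perl. It assumes the classic end-to-beginning summation in which each cell adds its own similarity score to the best reachable cell in the next row or column, and it assumes a gap of n residues costs gap_open + gap_extend * (n - 1). Both the recurrence and the names gap_cost and build_sum_matrix are assumptions made for illustration; the original program may compute its sums differently.

```python
def gap_cost(length, gap_open, gap_extend):
    """Assumed affine penalty for a gap spanning `length` residues (length >= 1)."""
    return gap_open + gap_extend * (length - 1)


def build_sum_matrix(sim, gap_open, gap_extend):
    """Accumulate scores from the end of both sequences toward the beginning."""
    rows, cols = len(sim), len(sim[0])
    total = [row[:] for row in sim]  # last row and last column keep their raw scores
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            # No gap: continue on the diagonal.
            best = total[i + 1][j + 1]
            # Gap in the first sequence: skip residues of the second sequence.
            for k in range(j + 2, cols):
                penalty = gap_cost(k - j - 1, gap_open, gap_extend)
                best = max(best, total[i + 1][k] - penalty)
            # Gap in the second sequence: skip residues of the first sequence.
            for k in range(i + 2, rows):
                penalty = gap_cost(k - i - 1, gap_open, gap_extend)
                best = max(best, total[k][j + 1] - penalty)
            total[i][j] = sim[i][j] + best
    return total


# Identity scores for the illustrative sequences "ABC" and "AC".
sim = [[1, 0], [0, 0], [0, 1]]
for row in build_sum_matrix(sim, gap_open=0.5, gap_extend=0.5):
    print(row)
```

This form scans a whole row and column for every cell, so it is simple to read but slower than necessary on long sequences.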

The third and final step is to trace back from the highest value in the matrix to the end of the consensus sequence, generating an alignment as it goes. The alignment generated indicates both gaps within sequences and alignments of identical residues between sequences.
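The trace-back step might look roughly like the following Python sketch. It assumes a sum matrix built from the end of the sequences toward the beginning, with a gap of n residues costing gap_open + gap_extend * (n - 1); the function names, the tie-breaking rule (prefer no gap), and the hand-built example matrix are all assumptions for illustration rather than details of the original Perl program.

```python
def gap_cost(length, gap_open, gap_extend):
    """Assumed affine penalty for a gap spanning `length` residues (length >= 1)."""
    return gap_open + gap_extend * (length - 1)


def traceback(seq1, seq2, total, gap_open, gap_extend):
    """Follow the best path through the sum matrix and return the two aligned strings."""
    rows, cols = len(seq1), len(seq2)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # Start at the highest value in the matrix.
    i, j = max(cells, key=lambda rc: total[rc[0]][rc[1]])
    # Residues before the starting cell are shown against gaps.
    top = seq1[:i] + "-" * j
    bottom = "-" * i + seq2[:j]
    while True:
        top += seq1[i]
        bottom += seq2[j]
        if i == rows - 1 or j == cols - 1:
            break
        # Candidate next cells: diagonal first, so ties prefer no gap.
        candidates = [(total[i + 1][j + 1], i + 1, j + 1)]
        for k in range(j + 2, cols):
            candidates.append((total[i + 1][k] - gap_cost(k - j - 1, gap_open, gap_extend), i + 1, k))
        for k in range(i + 2, rows):
            candidates.append((total[k][j + 1] - gap_cost(k - i - 1, gap_open, gap_extend), k, j + 1))
        _, ni, nj = max(candidates, key=lambda c: c[0])
        # Skipped residues are written opposite gap characters.
        top += "-" * (nj - j - 1) + seq1[i + 1:ni]
        bottom += seq2[j + 1:nj] + "-" * (ni - i - 1)
        i, j = ni, nj
    # Residues after the last aligned pair are also shown against gaps.
    top += seq1[i + 1:] + "-" * (cols - 1 - j)
    bottom += "-" * (rows - 1 - i) + seq2[j + 1:]
    return top, bottom


# Hand-built sum matrix for "ABC" vs "AC" (identity scores, both penalties 0.5).
total = [[1.5, 0], [1, 0], [0, 1]]
top, bottom = traceback("ABC", "AC", total, 0.5, 0.5)
marks = "".join("|" if a == b else " " for a, b in zip(top, bottom))
print(top)     # ABC
print(marks)   # | |
print(bottom)  # A-C
```

The middle line marks identical residues, and the dash in the third line marks the gap opposite residue B.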

In addition, I have included several other minor subroutines, including printing and cleanup functions. I have also gone to the trouble of making the interface simple and including a demonstration subroutine. This is my first attempt at programming anything more complex than a CGI script. I have attempted to make the various subroutines modular, although this is somewhat complex in Perl. The program was coded using MacPerl 5.2.0r4 and tested successfully on the pantheon. A sample file is included. The efficiency of my program could probably be improved, but it works.



Created: 1999