Hui-Teng Cheng

Genomics and Bioinformatics MBB452a

Final Project 121400

Bioinformatic tools for searching candidate sequences of a putative novel tumor necrosis factor receptor in mice

This project reviews the bioinformatic tools used to search for candidate gene sequences of a putative tumor necrosis factor (TNF) receptor. Previous genetic studies in mice have suggested that lymphotoxin alpha homotrimers may bind a novel TNF receptor to exert their effect on maintaining the development of a secondary lymphoid organ, the mesenteric lymph node. Currently, three members of the receptor family are known to bind lymphotoxins. Searching the expressed sequence tag (EST) database with a homology-based method offers an opportunity to identify candidate sequences for the novel receptor. In this paper, I review the gene-finding methods used for this project and present preliminary results.

Identifying new genes in genomic sequence databases is an important problem in bioinformatics (1). The major approach to this problem is the homology-based method; the alternative, the ab initio method, uses the general characteristics of protein-coding genes to measure the coding potential of exonic versus intronic and intergenic sequences. A database search starts from a query sequence, which serves as the key for retrieving the data of interest. To find the novel receptor, the ideal query is the ligand-binding sequence. A reasonable assumption is that all three receptors share similar ligand-binding sequences, because they bind similar ligand proteins; this has in fact been confirmed experimentally. Even without experimental data, however, the ligand-binding sequence can be located by aligning all three receptor sequences and taking the conserved region as a motif, which then serves as the query sequence for further searches.
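As a minimal sketch of the "align, then take the conserved region" idea, the snippet below marks the fully conserved columns in one block of the three-receptor alignment. The rows are columns 51 to 100 of the alignment shown in Figure 1; the function names are illustrative, and requiring identity in every row is a simplifying assumption (a practical motif definition would also tolerate conservative substitutions).

```python
# Columns 51-100 of the ClustalW alignment in Figure 1.
ALIGNMENT = {
    "TNR2_MOUSE": "YYDRKAQMCCAKCPPGQYVKHFC-NKTSDTVCADCEASMYTQVWNQFRTC",
    "TNRC_MOUSE": "YYEPMHDVCCSRCPPGEFVFAVC-SRSQDTVCKTCPHNSYNEHWNHLSTC",
    "TNR1_MOUSE": "VHSKNNSICCTKCHKGTYLVSDCPSPGRDTVCRECEKGTFTASQNYLRQC",
}


def is_conserved(column):
    """True when every row carries the same residue and it is not a gap."""
    return len(set(column)) == 1 and column[0] != "-"


def conserved_columns(rows, offset=0):
    """Return (alignment position, residue) for each fully conserved column."""
    rows = list(rows)
    return [
        (i + 1 + offset, column[0])
        for i, column in enumerate(zip(*rows))
        if is_conserved(column)
    ]


def consensus(rows):
    """Consensus line: the residue where conserved, '.' elsewhere."""
    rows = list(rows)
    return "".join(c[0] if is_conserved(c) else "." for c in zip(*rows))


hits = conserved_columns(ALIGNMENT.values(), offset=50)
print(consensus(ALIGNMENT.values()))
print(hits)
```

In this block most of the fully conserved columns are cysteines, which is what one would expect of a cysteine-rich ligand-binding region.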

When searching databases for homologous sequences, one method is to use position-specific information from a sequence alignment. An essential step of such “profile” methods is building a position-specific scoring model against which other sequences are compared. The problem with profiles is that the models are complicated by many free parameters, such as how to set the position-specific residue scores (2). Hidden Markov models (HMMs) have been introduced to address these problems. The main idea is that an HMM is a probabilistic model: it describes a probability distribution over a potentially infinite number of possible sequences.
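To make the notion of a position-specific scoring model concrete, here is a small sketch that builds log-odds scores per alignment column and scores a candidate segment against them. The three aligned segments are columns 59 to 66 of Figure 1; the uniform background frequency (1/20) and the pseudocount of 1 are assumptions chosen only for illustration, and they are exactly the kind of free parameters mentioned above.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Per-column log-odds scores (in bits) against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        residues = [r for r in column if r in AMINO_ACIDS]  # ignore gaps
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the column scores for an ungapped segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Columns 59-66 of Figure 1 (TNR2, TNRC, TNR1); no gaps in this stretch.
motif = ["CCAKCPPG", "CCSRCPPG", "CCTKCHKG"]
pssm = build_pssm(motif)

print(round(score(pssm, "CCSKCPPG"), 2))  # motif-like segment
print(round(score(pssm, "AAAAAAAA"), 2))  # unrelated segment
```

Changing the pseudocount or the background shifts every score, which illustrates why a probabilistic treatment such as an HMM is attractive.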

The other sequence alignment method is the “pairwise” method, which has three main aspects: the comparison algorithm, the scoring matrix, and the gap penalty. The program I used for multiple alignment, ClustalW, builds on pairwise alignment. It consists of three main stages: 1) all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences; 2) a guide tree is calculated from the distance matrix; 3) the sequences are progressively aligned according to the branching order in the guide tree (3). During progressive alignment, a series of pairwise alignments is carried out to align larger and larger groups of sequences, following the branching order in the guide tree. At each step a full dynamic programming algorithm is used with a residue weight matrix and penalties for opening and extending gaps. Because the alignment is progressive, gaps that are present in earlier alignments remain fixed when new sequences are added. The dynamic programming step is the most demanding part of the alignment strategy in terms of processing time and memory usage.
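The first stage can be sketched as follows: a global dynamic programming alignment (Needleman-Wunsch) for each pair, then a distance for each pair, then the closest pair as the first join of a guide tree. This is a deliberately simplified illustration and not ClustalW's implementation: it assumes match/mismatch scores instead of a residue weight matrix, a single linear gap penalty instead of separate opening and extension penalties, and it stops before building the full tree. The segments are columns 59 to 82 of Figure 1 with gaps removed.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def distance(a, b):
    """1 minus the fraction of identical columns in the pairwise alignment."""
    _, x, y = global_align(a, b)
    identical = sum(1 for p, q in zip(x, y) if p == q)
    return 1.0 - identical / len(x)


# Columns 59-82 of Figure 1 with gap characters removed.
SEGMENTS = {
    "TNR2": "CCAKCPPGQYVKHFCNKTSDTVC",
    "TNRC": "CCSRCPPGEFVFAVCSRSQDTVC",
    "TNR1": "CCTKCHKGTYLVSDCPSPGRDTVC",
}

distances = {
    (p, q): distance(SEGMENTS[p], SEGMENTS[q]) for p, q in combinations(SEGMENTS, 2)
}
closest = min(distances, key=distances.get)
print(distances)
print("first pair to join in the guide tree:", closest)
```

The table filled by `global_align` has (n+1) by (m+1) cells, which is why this step dominates time and memory as sequences and groups grow.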

The scoring matrix is chosen according to the evolutionary divergence between the aligned sequences. There are two approaches to constructing scoring matrices. The first, the PAM (Point Accepted Mutation) matrices, or Dayhoff matrices, are derived using the point accepted mutation model of evolution, in which all positions are equally mutable (4). A series of closely related proteins (<15% amino acid differences) was aligned, and the alignments were used to calculate the probability that a mutation changes a residue into each of the other possible residues. The probabilities are then normalized to represent the average mutational change that takes place when 1 residue out of 100 undergoes mutation (1% accepted mutation, or PAM1). The PAM1 matrix is multiplied by itself repeatedly to obtain probability matrices for other degrees of evolutionary change. The number in the matrix name (e.g., 30 in PAM30) refers to the evolutionary distance (PAM30 = 30 accepted mutations per 100 residues); greater numbers mean greater distance.
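The extrapolation from PAM1 to larger distances can be illustrated with a toy example. The sketch below uses a made-up three-letter alphabet with uniform mutation probabilities (not Dayhoff's data) and raises the PAM1-like matrix to the 30th and 250th power. It also shows that the PAM number counts accepted mutations rather than observed differences: repeated changes at the same position mean that fewer than 30% of positions differ at PAM30.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_power(m, n):
    """Multiply the matrix by itself n times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def observed_difference(matrix, freqs):
    """Fraction of positions expected to differ, given residue frequencies."""
    return 1.0 - sum(f * matrix[i][i] for i, f in enumerate(freqs))


# Toy PAM1: each residue stays unchanged with probability 0.99 (assumed values).
PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
FREQS = [1 / 3, 1 / 3, 1 / 3]

pam30 = mat_power(PAM1, 30)
pam250 = mat_power(PAM1, 250)

for name, matrix in (("PAM1", PAM1), ("PAM30", pam30), ("PAM250", pam250)):
    percent = 100 * observed_difference(matrix, FREQS)
    print(name, round(percent, 1), "% of positions differ")
```

The scoring matrices used for alignment are then obtained as log-odds values from such probability matrices, a step not shown here.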

The other scheme is the BLOSUM (Blocks Substitution Matrix) family of matrices. They are derived from substitutions observed in more than 2,000 blocks of aligned sequences from the PROSITE and SWISS-PROT databases (5). Proteins belonging to the same family are aligned to isolate regions of local similarity, called blocks. The occurrences of each amino acid pair in each column of each block alignment are counted and used to determine the probability of one amino acid substituting for another at a given position. Sequences within blocks are clustered according to percentage sequence identity in order to weight the sequences in the count. This yields a range of scoring matrices that differ in the percentage identity used for the weighting. The number in the matrix name (e.g., 62 in BLOSUM62) refers to the percent identity threshold used for clustering: sequences at least that similar are grouped and counted as one. Thus, greater numbers correspond to higher percentage identity, i.e., smaller evolutionary distance. Gap penalties can be set by various schemes, which are not discussed in detail here.
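The pair-counting step can be sketched on a toy block. The example below counts residue pairs column by column, converts the counts to observed pair frequencies, and reports log-odds scores in half-bit units, which is the scaling used for BLOSUM matrices. The five-sequence block is invented for illustration, and the sketch omits the identity-based clustering described above, so every sequence is counted with equal weight.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_scores(block):
    """Half-bit log-odds scores for the residue pairs observed in one block."""
    # Count unordered residue pairs within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Marginal residue frequencies implied by the pair frequencies.
    p = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Invented toy block: 5 aligned sequences, 4 columns, residues L and K only.
TOY_BLOCK = ["LKLK", "LKLK", "LKLK", "LKLL", "KKLK"]

scores = blosum_scores(TOY_BLOCK)
for pair, value in sorted(scores.items()):
    print(pair, value)
```

In this toy block identical pairs score positive and the L/K substitution scores negative, because it is observed less often than the residue frequencies alone would predict. With clustering, the three identical sequences would be merged and the counts would change accordingly.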

BLAST is also a pairwise method for sequence alignment. The concepts of dynamic programming, scoring matrix and gap penalty are the same as described above, although BLAST speeds up the search with a word-matching heuristic rather than running full dynamic programming against every database sequence. The statistic reported by BLAST is the E-value. The Expect value (E) is the number of hits with an equal or better score that would be expected by chance when searching a database of a particular size; the lower the E-value, the more significant the match. PSI-BLAST is a more sensitive tool for finding homologous sequences in evolutionarily distant organisms. It starts from a single query sequence and collects homologous sequences by a BLAST search. The homologous sequences are aligned with the query, and a model is built from the alignment. The model is then searched against the database, and newly found homologues are added to the alignment and the model. The updated model is used to search the database again, until no new homologues are discovered. Because the process is iterative, the sensitivity for retrieving homologous sequences is increased.
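The dependence of the E-value on score and database size can be written down directly. The sketch below uses the standard relation between a normalized bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m and n are the (effective) query and database lengths. The lengths used here are hypothetical placeholders, so the printed numbers are not meant to reproduce the values in Figures 2 and 3.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit that good: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


QUERY_LEN = 50             # hypothetical query length in residues
DB_LEN = 1_000_000_000     # hypothetical database size in residues

for bits in (30, 60, 300):
    e = expect_value(bits, QUERY_LEN, DB_LEN)
    print(bits, "bits -> E =", f"{e:.2e}", " P =", f"{p_value(e):.2e}")
```

Each additional bit of score halves E, and doubling the database doubles E, which is why the same alignment can be significant in a small database and unconvincing in a large one.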

Running ClustalW on the three TNF receptors identified a cysteine-rich consensus region (Figure 1). Using this region of TNF receptor 1 as the query sequence, I searched the NCBI mouse EST database with the tBLASTn algorithm and the default scoring matrix, BLOSUM62 (6,7). A total of 115 EST hits were retrieved, some of which are shown in Figure 2. To examine whether homologues of the TNF receptor are present in distant organisms, PSI-BLAST was run against the Drosophila genome using the same query sequence. Three sequences were found, all with high E-values (0.65 to 6.8), which indicates low similarity (Figure 3).
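The contrast between the two searches can be made explicit by parsing the hit lines and applying an E-value cutoff. The lines below are copied from Figures 2 and 3; the parsing function and the cutoff of 1e-5 are illustrative assumptions, not settings used in the searches.

```python
MOUSE_EST_HITS = [
    "gb|AI119338.1|AI119338  uf03d04.y1 Sugano mouse liver mlia M...   306  9e-83",
    "gb|AI226579.1|AI226579  uj10b12.y1 Sugano mouse kidney mkia ...   286  6e-77",
    "gb|AI326643.1|AI326643  mo02b02.y1 Stratagene mouse lung 937...   275  2e-73",
    "gb|BF164835.1|BF164835  601777753F1 NCI_CGAP_Lu29 Mus muscul...   219  1e-56",
]
DROSOPHILA_HITS = [
    "gi|7293489|gb|AAF48864.1|  (AE003509) CG6531 gene product [Drosop...    30  0.65",
    "gi|7298412|gb|AAF53636.1|  (AE003656) CG7527 gene product [Drosop...    27  3.2",
    "gi|7294049|gb|AAF49404.1|  (AE003526) CG9715 gene product [Drosop...    26  6.8",
]


def parse_hit(line):
    """Split a hit line into (identifier, bit score, E-value)."""
    description, bits, e_value = line.rsplit(None, 2)
    identifier = description.split()[0]
    return identifier, float(bits), float(e_value)


def significant(lines, cutoff=1e-5):
    """Keep hits whose E-value is at or below the (assumed) cutoff."""
    hits = [parse_hit(line) for line in lines]
    return [hit for hit in hits if hit[2] <= cutoff]


print(len(significant(MOUSE_EST_HITS)), "of", len(MOUSE_EST_HITS), "mouse EST hits pass")
print(len(significant(DROSOPHILA_HITS)), "of", len(DROSOPHILA_HITS), "Drosophila hits pass")
```

All four listed mouse ESTs pass the cutoff, whereas none of the Drosophila hits do, consistent with the interpretation that the Drosophila matches are no better than chance.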

Results:

Figure 1. Multiple sequence alignment of three TNF receptors to localize the ligand-binding sequence

              1         11        21        31        41
TNR2_MOUSE    -MAP--AALWVALVF-ELQLWATGHTVPAQVVLTP-YKPEPGYECQISQE
TNRC_MOUSE    MRLPR-ASSPCGLAWGPLLLGLSGLLVASQPQLVPPYRIENQTCWDQDKE
TNR1_MOUSE    MGLPTVPGLLLSLVLLALLMGIHPSGVTGLVPSLG-DREKRDSLCPQGKY

              51        61        71        81        91
TNR2_MOUSE    YYDRKAQMCCAKCPPGQYVKHFC-NKTSDTVCADCEASMYTQVWNQFRTC
TNRC_MOUSE    YYEPMHDVCCSRCPPGEFVFAVC-SRSQDTVCKTCPHNSYNEHWNHLSTC
TNR1_MOUSE    VHSKNNSICCTKCHKGTYLVSDCPSPGRDTVCRECEKGTFTASQNYLRQC

              101       111       121       131       141
TNR2_MOUSE    LSCS-SSCTTDQVEIRACTKQQNRVCACEAGRYCALKTHS-GSCRQC--M
TNRC_MOUSE    QLCRPCDIVLGFEEVAPCTSDRKAECRCQPGMSCVYLDN---ECVHCEEE
TNR1_MOUSE    LSCKTCRKEMSQVEISPCQADKDTVCGCKENQFQRYLSETHFQCVDC---

              151       161       171       181       191
TNR2_MOUSE    RLSKCGPGFGVASSRAPNG-NVLCKACAPGTFSDTT--SSTDVCRPHRIC
TNRC_MOUSE    RLVLCQPGTEAEVTDEIMDTDVNCVPCKPGHFQNTS--SPRARCQPHTRC
TNR1_MOUSE    --SPCFNGTVTIPCKETQN--TVCN-CHAGFFLRESECVPCSHCKKNEEC

Figure 2. Part of the results of mouse EST searching (total 115 hits)

Sequences producing significant alignments:               Score (bits)  E value
gb|AI119338.1|AI119338  uf03d04.y1 Sugano mouse liver mlia M...   306  9e-83
gb|AI226579.1|AI226579  uj10b12.y1 Sugano mouse kidney mkia ...   286  6e-77
gb|AI326643.1|AI326643  mo02b02.y1 Stratagene mouse lung 937...   275  2e-73
gb|BF164835.1|BF164835  601777753F1 NCI_CGAP_Lu29 Mus muscul...   219  1e-56

Figure 3. The result of PSI-BLAST search in Drosophila genome

gi|7293489|gb|AAF48864.1|  (AE003509) CG6531 gene product [Drosop...    30  0.65
gi|7298412|gb|AAF53636.1|  (AE003656) CG7527 gene product [Drosop...    27  3.2
gi|7294049|gb|AAF49404.1|  (AE003526) CG9715 gene product [Drosop...    26  6.8

References

1. Searls DB. Bioinformatics tools for whole genomes. Annu Rev Genomics Hum Genet 2000 1:251-79
2. Eddy SR. Hidden Markov models. Curr Opin Struct Biol 1996 6:361-65
3. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994 22:4673-80
4. Dayhoff M et al. In: Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, p. 345. National Biomedical Research Foundation, Silver Spring, Maryland, 1978
5. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992 89:10915-19
6. Altschul SF et al. Basic local alignment search tool. J Mol Biol 1990 215:403-10
7. NCBI website, www.ncbi.nlm.nih.gov