back to bioinformatics website....

  http://bioinfo.mbb.yale.edu/course/

 

 or visit Ranjit's presentation on Structural Alignment Methods...

http://bioinfo.mbb.yale.edu/course/projects/talk-5/

Email: ranjit.bindra@yale.edu


Ranjit S. Bindra, 4/20/98

MB&B 447b

 

The Linkage between Bio and Informatics

The relevance of bioinformatics to linkage analysis

 

 

Introduction...

Bioinformatics refers to the application of computer technology to genomic analysis. When thinking of examples of bioinformatics, one is quick to refersto such techniques as DNA sequence alignment or protein structure prediction. However, bioinformatics also spans other facets of scientific research such as the construction of digital libraries that serve as knowledge bases for biological literature, and the complex simulation of metabolic pathways. The field of molecular genetics and linkage analysis has attributed much of its rapid growth to relevant advances in bioinformatics. Linkage analysis is a powerful method of genetic mapping that compares the inheritance pattern of an observed trait with the inheritance patterns of chromosomal regions. Using this technique, one can confirm both the existence of a trait-causing gene locus, as well as its approximate location; using only data on the phenotypes and familial relations of individuals in a pedigree1. Much of the research in this field relies on complex statistical analyses of genetic data, as well as myriads of simultaneous searches and data comparisons. Specifically, linkage analysis requires extensive calculations of complex quantities such as likelihood ratios and recombination frequencies that are virtually impossible to decipher without sophisticated computer programs. This paper briefly examines both the mathematical complexities of linkage analysis, as well as various significant advances in bioinformatics that have facilitated the development of more complex analyses of genetic data.

 


 Mathematics of linkage techniques...

Mathematically speaking, linkage analysis is based on a probability ratio test which calculates a likelihood ratio (LR) in the basic form: LR = prob.(Data|M1)/prob.(Data|M0)2. M1 is the model assigning a trait-causing gene to a given locus which "fits" the data better than the M0, the null hypothesis assuming no linkage. The logarithm of the likelihood ratio, Z = log10(LR), is commonly used as a measure of linkage strength; the traditional threshold T is set at Z>3. The lod score is simply the decimal logarithm of the odds for various values of recombination against no linkage. A maximum likelihood (ML) model is unequivocally accepted over the null hypothesis if it results in a lod score significantly above the threshold. In the case of a simple Mendelian disease, the ML model might have a single unknown parameter specifying recombination frequency (theta) between the disease gene and a marker. Thus, the null hypothesis corresponding to no linkage specifies theta to be 0.5 (50% recombination), and Z is calculated from the equation shown below.

 

Z = log10 [f (yi ; q) / f (yi ; 0.5)]

 

In this equation, f(yi; theta) corresponds to a calculated likelihood defining the probability of occurrence of the ith family3. This calculation is rather complex, but the algorithm(s) involved will now be discussed in order to further elucidate the mathematical basis of linkage analysis.

Construction of a likelihood algorithm involves first defining a relationship between phenotype and genotype. Assume k = the smallest number of unique genotypes causing variation in a given trait being measured. Since each genotype can correspond to a range of phenotypic values, a probability density function must be assigned to each uth genotype such that a total of k (potentially redundant) functions exist (where u = 1,2,,k) . These functions are denoted as gu(x), where x is the given trait being measured. For a single sibship, the likelihood is first expressed as the weighted sum over all genotypes where each summand corresponds to the product of pstu (the probability of having genotype u with parental genotypes s and t) and gu(x)4,5. This equation is shown below, where the values of x for size n sibship are x1, x2,,xn.

 

n ni=1P Eu=1 Pstu gu(xi)

 

While gu(x) specifies the genotype-phenotype relationship, the genetic transition matrix (P) which determines pstu is solely dependent on the genetic mechanism of inheritance. As shown in figure 2, the simplest matrix involves two alleles at one autosomal locus whereas the matrix resulting from a similar scenario with two 'different' heterozygotes6 is slightly more complex.

 

S

T

1=AA

2=Aa

3=aa

1=AA

1,0,0

.5,.5,0

0,1,0

2=Aa

.5,.5,.5

.25,.5,.25

0,.5,.5

3=aa

0,1,0

0,.5,.5

0,0,1

but...'different' heterozygotes yields:

S

T

1=AA

2=Aa

3=aA

4=aa

1=AA

1,0,0,0

.5,.5,0,0

.5,.5,0,0

0,1,0,0

2=Aa

.5,0,.5,0

.25,.25,.25,.25

.25,.25,.25,.25

0,.5,0,.5

3=aA

.5,0,.5,0

.25,.25,.25,.25

.25,.25,.25,.25

0,.5,0,.5

4=aa

0,0,1,0

0,0,.5,.5

0,0,.5,.5

0,0,0,1

 

Computers enter the realm of molecular genetics...

 

LIPED was one of the first linkage programs used to facilitate the calculation of likelihoods. A large pedigree is divided into subunits and analyzed using a version of the previously mentioned algorithm, and the calculated lod scores are then summed in order to determine the lod scores for the entire pedigree. The size of the pedigree is m and the ancestral individuals of the kindred are labeled with the lowest numbers i= 1,2,,m. As shown below, LIPED uses a similar likelihood equation in which each summation extends over all distinct genotypes g1-gm7 (in this example-equation, g represents genotype and x refers to phenotype).

 

 

L = Eg1 ...E gm m ni=1 P(xi|gi) P(gi|..)

 

 

In this equation, P(xi|gi) represents the probability that the genotype produces the given phenotype while P(gi| ..) refers to the probability of the given configuration of genotype(s). As stated by Jurg Ott: "the obvious advantage of Elston and Stewart's representation [shown above] is that the summation for individual m can be computed first and the result attached as a factor to the appropriate term in the summation for his parents. Individual m is then no longer needed in further computations. This elimination is repeated for each individual."8 It should be noted that in this manner, the likelihood is computed recursively beginning first with the most recent generation. LIPED requires two quantitative sets of information in order to accurately obtain a likelihood value: 1) population gene frequencies at both loci, as well as recombination fractions, and 2) a genetic transition matrix which thus allows the calculation of P(x|g). In terms of jointly calculating the number of genotypes (k for multiple alleles at two loci) the following equation can be used: k = a1a2(a1a2 + 1)/2 where a1 and a2 represent the number of alleles at each loci. For example, two loci each with two unique alleles corresponds to a total number of possible genotypes k = 10. The following example is given in an article on LIPED by Ott using the pedigree shown below.

 

P(x4,x5,x10,x11|g4) = P(x4|g5)[Eg5 P(x5|g5)P(g5) Eg10 P(x10|g10) P(g10|g4g5) Eg11 P(x11|g11) P(g11|g4g5)]

 

A small subsection of the pedigree is used to test the relationship of individual 4's genotype to the observed phenotypes x4, x5, x10, and x11. The expression enclosed by the brackets is calculated for each of the k number of genotypes for individual 4, and the result is multiplied into the probability P(x4|g4) thus creating a "modified" probability, termed P(x4*|g4). This process of subdivision and calculation is repeated recursively through the pedigree, and each subgroup is effectively removed from the next calculation since the relevant information is stored in each P(x*|g)9. The likelihood is then determined through a final summation as shown below. This likelihood can then be used in the lod score calculation to test for linkage of a given trait-causing gene/genotype.

 

  L = P(x1,x2...,x15) = E P(x1*|g1)P(g1)

 

As stated by Guo et al., "Advances in biology and molecular genetics have generated so much data that the availability of statistical techniques has become a bottleneck in the process of the mapping of quantitative traits."10 These advances include the consideration of multiple genetic markers simultaneously, increases in pedigree size, and larger numbers of alleles per marker. Algorithmic modifications of older linkage programs have significantly ameliorated the problem, specifically in the case of the popular analysis program, LINKAGE. Cottingham et al. describes several improvements made without resorting to parallel computation, which synthesize biological principles and computer science techniques11.

The two-point analysis program in the LINKAGE program (referred to as LODSCORE) is the main focus of their work. LODSCORE analyzes a list of family pedigrees along with phenotype data for each individual. A set of loci is specified for each run, and the program computes an approximate recombination frequency (theta) for each loci pair. The program accounts for uncertainties in genotypes or phase by assigning conditional probabilities to several genotypes that any pedigree member could have. LODSCORE then computes joint genotypes based on pairs of previously calculated joint haplotypes. An inner and outer (iterating) loop acts to update the probabilities of each joint genotype per individual. Furthermore, the traversal algorithm is modified to allow upward and downward analysis; thus one can update probabilities for parents based on those for their children and vice versa. The first improvement to LINKAGE was the elimination of impossible genotypes and subsequent termination of the relevant loop iteration. It should be noted that the set of feasible genotypes for the mother are pre-computed and cached for reference. Another improvement was that the bi-directional traversal of the pedigree was limited to conventional bottom-up (recursive) analysis when the number of children was only 1; this eliminates redundancy in this case. Boolean logic is also used to replace real number expression, and also to abbreviate feasible versus impossible haplotypes in the formation of genotypes12. These are some examples of how mere adjustments to the combinatorial part of the code can significantly impact the speed of analysis programs.

Three-point and other multi-locus analysis programs were quick to follow the initial versions of LODSCORE and LINKAGE, as they allow the calculation of genetic risks using several linked markers. When given three linked loci, the conventional approach involves pairwise lod-score and recombination computation with subsequent gene order inference using the calculated recombination rates. Multipoint analysis considers these three loci jointly, and it is only necessary to calculate two recombination rates assuming a specific mapping function. As asserted by Lanthrop et al.: "This approach has several advantages: multilocus analysis is the only way of accounting for the nonindependence of the recombination estimates; it affords increased precision of the estimated location of a new locus on the genetic map; and, under the assumption of no interference, it also reduces the problem of estimating multiple parameters to that of estimating a single parameter."13

Advances in data-processing power and speed have also resulted from the combination of various analysis techniques. For example, Guo et al. propose a Monte Carlo approach for combined segregation and linkage analysis14. Although linkage analysis alone can confirm the existence of trait-causing gene/gene-locus as well as its approximate location, it cannot reveal information about the mode of inheritance. Segregation analysis, on the other hand, can reveal the presence of major genes implicated in disease onset but it cannot localize them. Combining both techniques to calculate likelihood ratios provides an excellent technique for handling complex genetic models and data from homogenous pools associated with large pedigrees. In this method, a Gibbs sampler is utilized within the Monte Carlo EM algorithm in order to obtain current parameter values of theta and also to estimate the required conditional expectations. As described clearly by Guo: "Beginning from any realization of polygenic values and combined major genotypes [-] that is consistent with phenotypic and marker observations [-] the polygenic values and genotypes are updated, for each individual in the pedigree in turn (in random order), by sampling from the local conditional distribution at parameter values theta, given observed data (if any) and the polygenic values and genotypes of all members in the pedigree."15 As in conventional Monte Carlo analysis, the entire process is iterated until the parameter estimates no longer show directional trends over each iteration. Furthermore, an initial genotypic configuration is determined followed by a "burn-in" period of Gibbs sampler cycling, which decreases the random effects of the chosen starting point that often lead to unnecessary iterations.

 


 Final notes...

Thus, there have been many advances in the computational aspects of molecular genetics during the past decade. Such developments have resulted in the mainstay of linkage analysis as a powerful bioinformatics tool. As automated genotyping and improvements in the human genetic map lead to the exponential increase of available data sets for analysis, computational effort will indeed become the rate-limiting factor in linkage analyses. Thus, vast improvements and developments in the computer science of molecular genetics (and more specifically linkage analysis) - such as the utilization of neural nets and extensive parallel computing - will further strengthen the link between biology and computer informatics in this field.

 


 

 Footnotes

 

1. Lander, E.S., Schork, N.J. Genetic dissection of complex traits. Science 265, 2037 (1994).

2. Lander et al.

3. Morton, N.E. The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am. J. Hum. Genet. 8, 80 (1956).

4. Ott, J. Estimation of the recombination fraction in human pedigrees: efficient computation of the likelihood for human linkage studies. Am. J. Hum. Genet. 26, 588 (1974).

5. Elston, R.C., Stewart, J. A general model for the analysis of pedigree data. Hum. Heredity 21, 523 (1971).

6. Elston et al.

7. Ott et al.

8. Ott et al.

9. Ott et al.

10. Guo, S.W., Thompson, E.A. A Monte Carlo method for combined segregation and linkage analysis. Am. J. Hum. Genet. 51, 1111 (1992).

11. Cottingham, R.W., Idury, R.M., Schaffer, A.A. Faster sequential genetic linkage computations. Am J. Hum. Genet. 53, 252 (1993).

12. Cottingham et al.

13. Lathrop, G.M., Lalouel, J-M., Julier, C., Ott, J. Strategies for multilocus linkage analysis in humans. Proc. Natl. Acad. Sci. USA 81, 3443 (1984).

14. Guo et al.

15. Guo et al.