“In Biology nothing makes sense except in the light of evolution.”[1] What about Junk DNA[2]? Non-coding DNA[3] [4] (also known as selfish[5], ignorant, parasitic[6] and incidental DNA[7]) includes introns [8] [9] [10] [11] [12] [13], transposable elements, pseudogenes, repeat elements, satellites, UTRs, hnRNAs, LINEs and SINEs,[14] as well as unidentified junk,[15] and makes up approximately 97% of the human genome.[16] Some scientists were so overwhelmed by the amount of non-coding DNA that they referred to the genome as “…a collection of non-coding regions interrupted by small coding regions.”[17] Junk DNA is found in all forms of life, which makes it an exciting evolutionary phenomenon.[18] [19] [20] Thus, with the heralded Human Genome Project on the horizon and the powerful tools of the bioinformatician at hand, the prospect of shedding light on this problem is very appealing.[21] [22]


In every theory of Junk one must tread carefully and distinguish the cause of its existence from its consequences.[23] The simplest theory claims that Junk DNA is just that: an accumulation of non-functional sequence. This useless DNA grows[24] in the genome until the cost of replicating it becomes too great to bear. On this view, organisms that develop slowly can tolerate more junk, and may even use it to their advantage by slowing development through a longer cell cycle.[25] [26]


Some scientists have posited that junk has only a passive purpose.[27] Total genome size is related to a number of organismal and cellular traits, suggesting that there is a selective advantage in larger genomes, including those that result from junk DNA filler. One theory claims that Junk acts as a bodyguard, absorbing harmful chemicals that would otherwise damage genuine genes. This has been refuted; in fact, larger genomes are subjected to more physical and chemical damage, which outweighs the bodyguard function.[28] Moreover, it has been shown that most mutations that reduce viability occur in non-coding DNA,[29] possibly indicating that Junk plays an active role in the genome.


Another hypothesis casts Junk as a sink for DNA-tropic proteins, thereby buffering the effect of intracellular solute concentrations on nuclear machinery. This energy-independent function of Junk could allow for a reduced basal metabolic rate and therefore open up an evolutionary niche.[30]


Researchers studying Junk tend to favour searching repeat elements for function. Repeat elements are thought to be involved in chromosomal integrity. Many are also found in heterochromatin and may be involved in centromeric activity and chromosome pairing, both of which bear on evolutionary divergence and speciation through the manipulation of chromosomes.[31] The Alu sequence,[32] which makes up the largest class of SINEs,[33] [34] is just one of the many transposable elements that together account for roughly 35% of our genome. Past studies have indicated no selective pressure on the non-functional Alu repeats.[35] A recent theory suggests that the hypomethylation of Alus in sperm, compared with the female germ line, implies that Alus (and their mRNAs, through the action of PKR) may be involved in signaling events in early embryogenesis.[36] Using computers it is possible to determine consensus sequences for Alu insertion sites, which would help resolve their functions. Moreover, one could use phylogenetic tools to confirm the selective placement of Alu repeats across species.
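As a rough illustration of what determining a consensus sequence involves, the sketch below takes a set of already aligned, equal-length flanking sequences and reports the most common base at each position. The function name and the example sequences are invented for illustration; real insertion-site data would first need to be extracted and aligned, which is not shown here.

```python
from collections import Counter


def consensus(aligned_sites):
    """Return the most common base at each column of pre-aligned sequences."""
    if not aligned_sites:
        raise ValueError("need at least one sequence")
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("sequences must be pre-aligned to the same length")

    result = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        # Pick the highest count; ties are broken alphabetically so the
        # result is deterministic.
        best = min(counts, key=lambda base: (-counts[base], base))
        result.append(best)
    return "".join(result)


# Hypothetical flanking sequences, made up for this example only.
sites = ["TTAAAA", "TTAAGA", "TCAAAA", "TTAAAT"]
print(consensus(sites))
```

A simple majority vote like this discards information about how strongly each position is conserved; a position weight matrix would keep it, at the cost of a slightly longer example.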


Additionally, non-repetitive Junk may also have functions that can be elucidated. Non-coding DNA has been found to have a high GC content.[37] Theoretically, ORFs within these regions would have higher adaptive plasticity, as they could use GC-rich DNA to vary their final products for selective advantage via alternative splicing.[38] [39] Junk sequences may also function as cis-acting transcriptional regulators.[40] [41] One theory posits that the base-pair distribution in non-coding DNA affects gene transcription through a thermodynamic process. Moreover, the movement of transposable Junk results in a dynamic system of gene activation, which allows the organism to adapt to its environment without redesigning its hardwired system of gene activators.[42] [43]
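GC content is simple to measure, and a minimal sketch of the calculation is given here. It computes the fraction of G and C bases in a sequence and in sliding windows along it; the function names, the window parameters and the choice to ignore ambiguous bases such as N are assumptions of this example, not details taken from the cited study.

```python
def gc_content(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) in seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def windowed_gc(seq, window, step):
    """GC content of each full window of the given size along seq."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence, invented for illustration.
example = "AAGGCCTTGCGCATAT"
print(gc_content(example))
print(windowed_gc(example, window=4, step=4))
```

Comparing the windowed values for annotated coding and non-coding stretches of the same genome would be one way to check a claim about GC enrichment.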


Even random Junk may not be just random junk. Powerful algorithms are currently searching for homologous sequences in Junk to find promoter functions in our DNA netherworlds.[44] These algorithms are designed to detect sequences that are highly conserved across evolution and are therefore probably functional.[45] [46] Statistical models have also been used to study the randomness of the spread and loss of repeat sequences throughout the genome.[47] [48]



In 1994 it was proposed[49] [50] that Junk might be similar to natural languages since it follows, among other things,[51] Zipf’s law.[52] [53] The researchers cited this as possible proof that one or more structured languages exist in our Junk. This idea was refuted in many letters that claimed, among other things, that Junk does not fit Zipf’s law any better than coding DNA does.[54][55][56][57][58][59][60] Nevertheless, supporters of Zipf’s law maintain their stance.[61] [62] [63]
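A Zipf analysis of this kind can be sketched in a few lines. The example below treats every overlapping stretch of a fixed length as a "word", ranks the words by how often they occur, and fits a straight line to log frequency against log rank by least squares. The word length, the function names and the use of an ordinary least-squares fit are assumptions made for this sketch; it is not a reconstruction of the procedure in the 1994 paper.

```python
import math
from collections import Counter


def rank_frequencies(seq, word_length):
    """Counts of overlapping words of word_length, sorted from most to least common."""
    words = (
        seq[i:i + word_length]
        for i in range(len(seq) - word_length + 1)
    )
    return sorted(Counter(words).values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy sequence, invented for illustration; real input would be a long
# stretch of genomic DNA.
toy = "ACGTACGTTTGACACACGTAGGCTAACGT" * 20
print(zipf_slope(rank_frequencies(toy, word_length=3)))
```

As the critics cited above point out, a roughly linear log-log plot is not evidence of language on its own, so any such fit would need to be compared against shuffled or randomly generated control sequences.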


Similarly, researchers using a variety of statistical techniques[64] have detected long-range correlations in Junk DNA.[65] This may be attributable to the newfound functions of many intergenic non-coding regions, which include replication, chromosome segregation, recombination, chromosome stability and interaction with the nuclear matrix. All of these require the ‘high redundancy, low information’ sequence inherent in Junk.[66]
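One simple way to look for correlations between distant positions is an autocorrelation function, sketched below. The sequence is converted to a numeric signal (purines as +1, pyrimidines as -1) and the normalized correlation between positions a given distance apart is computed. This is a generic technique chosen for illustration and is not necessarily the method used in the studies cited; the purine/pyrimidine encoding and the function names are assumptions of the example.

```python
def purine_signal(seq):
    """Map purines (A, G) to +1 and pyrimidines (C, T) to -1, skipping other symbols."""
    mapping = {"A": 1, "G": 1, "C": -1, "T": -1}
    return [mapping[base] for base in seq.upper() if base in mapping]


def autocorrelation(signal, lag):
    """Normalized autocorrelation of signal at the given lag."""
    n = len(signal)
    if lag < 0 or lag >= n:
        raise ValueError("lag must be between 0 and len(signal) - 1")
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal) / n
    if variance == 0:
        return 0.0
    covariance = sum(
        (signal[i] - mean) * (signal[i + lag] - mean)
        for i in range(n - lag)
    ) / (n - lag)
    return covariance / variance


# Toy sequence, invented for illustration.
signal = purine_signal("AGGCTTAGCCATAGGCTTAACC" * 10)
for lag in (1, 10, 50):
    print(lag, autocorrelation(signal, lag))
```

A correlation that decays slowly with distance, for example as a power law rather than exponentially, is what is usually meant by a long-range correlation.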


Obviously, there are functions hidden in Junk DNA. While genetics and biochemistry have previously been the main thrust of this research, I believe that bioinformatics resources will be much more powerful. The same tools that have been used in the past to delve into the secrets of coding DNA can be used on Junk. A database of Junk sequences should be set up so that one can perform informatics experiments on the data; such a database should have few biases and redundancies.[67] The experiments could include BLAST searches,[68] sequence analysis, phylogenetic analysis and even secondary structure analysis.[69] [70] [71] Molecular modeling may define docking sites in Junk secondary structure for the binding of DNA-associated proteins. Junk-specific alignment programs will have to be written to take into account the high mutation rate of Junk. Furthermore, other statistical analyses,[72] along the same lines as the failed Zipf’s law experimentation, will definitely prove extremely useful in uncovering Junk’s hidden treasures. Moreover, computational analysis of Junk will reveal other useful information regarding evolutionary progress,[73] and evolutionary knowledge is the key to understanding our biology.[74]


[1] Theodosius Dobzhansky (as heard at the Genetics Graduate Student Seminar)

[2] This pejorative name for the silent majority of DNA was coined by Susumu Ohno in the early 1970s. See: Kuska B Journal of the National Cancer Institute 90, 1032 (1998)

[3] See: Brosius J, Gould SJ PNAS 89, 10706 (1992) for a failed attempt to conjure up new nomenclature

[4] As opposed to non-coding RNA, which is another topic unto itself. See for example Askew DS, Xu F Histology and Histopathology 14, 235 (1999)

[5] These are not the same. See:  Ohno S Yomo T Electrophoresis 12, 103 (1991)

[6] Orgel et al Nature 288, 645 (1980)

[7] Dover G Doolittle WF Nature 288, 646 (1980)

[8] Tycowski KT et al Genes Development 7A, 1176 (1993) for an example of a coding gene found in an intronic region of a different gene

[9] Hall DL et al Canadian Journal of Statistics 26, 455 (1998) found evidence of non-randomness in introns, using sequence alignment programs and clustering algorithms, but do not speculate as to the function

[10] Introns may have a second gene regulatory control mechanism that has yet to be worked out. See Mattick J Current Opinion in Genetics and Development 4, 823 (1994)

[11] Moore MJ Nature 379, 402 (1996): snoRNAs are encoded by introns. Also see the article by Tycowski et al Nature 379, 464 (1996)

[12] Kuska B Journal of the National Cancer Institute 90, 1125 (1998) Introns are 33% repetitive


[13] Gardiner K Gene 205, 39 (1997) Many lower eukaryotes have the same gene with the same function without introns.

[14] Nowak R Science 263, 608 (1994)

                Satellites: repeats at the ends and centers of chromosomes

                UTRs: untranslated regions, DNA that is transcribed into RNA but not translated

                SINEs: short interspersed elements, e.g. Alu

                LINEs: long interspersed elements

                hnRNA: heterogeneous nuclear RNA; 25% is immature mRNA, the other 75% is a mystery

[15] The ORFans of the yeast genome are thought to be non coding DNA as well.

                See  Mackiewicz et al Nucleic Acids Research 27, 3503 (1999)

[16] Nowak R Science 263, 608 (1994)

[17] Provata A Almirantis Y Physica A 247, 482 (1997) (emphasis mine)

[18] In Viruses see for example:  Maki et al Journal of Genetic Virology 77, 453 (1996)

[19] In Bacteria see for example: Higgins CF et al Gene 72, 3 (1988)

[20] In Plants see for example: Kubis S Annals of Botany 82, 45 (1998)

[21] See the editorial by Koonin, who puts it on the top ten list of things to do for bioinformatics. Koonin Bioinformatics 15, 265 (1999)

[22] See talk next week EM Rubin (Dec 14, 1999)

[23] Edgell et al Current Biology 6, 385 (1996)

[24] Via mechanisms such as transposition, slippage, gene conversion, unequal crossing over, etc. See Vinogradov AE Journal of Theoretical Biology 193, 197 (1998)

[25] Pagel M et al Proceedings of the Royal Society of London Biological Sciences 249, 119 (1992)

[26] See Orgel LE et al Nature 288, 645 (1980) who claim the opposite effect on cell development

[27] Many papers report that missing or disrupted Junk actively results in disease. See for example Epplen JT et al Cytogenetics and Cell Genetics 80, 75 (1998)

[28] Hsu, TS Bioessays 14, 785 (1992)

[29] Tachida H Japanese Journal of Genetics 68, 549 (1993)

[30] Vinogradov AE Journal of Theoretical Biology 193, 197 (1998)

[31] Dimitri P, Junakovic N Trends in Genetics 15, 123 (1999). And thus an important function that must be conserved.

[32] A 282 nt consensus sequence. See: Schmid CW Progress in Nucleic Acid Research and Molecular Biology 53, 283 (1996)

[33] Other types include the Mariner, which is many orders of magnitude lower in copy number. See: Robertson HM, Martos R Gene 205, 219 (1997)

[34] They are present in primates at 5×10^5 to 1×10^6 copies per cell. See: Vansant G, Reynolds WF PNAS 92, 8229 (1995)

[35] Deininger PL, Batzer MA Molecular Genetics and Metabolism 67, 183 (1999)

[36] Schmid CW Nucleic Acids Research 26, 4541 (1998)

[37] Raghavan S et al Journal of Molecular Evolution 45, 485 (1997)

[38] Guigo R, Fickett JW Journal of Molecular Biology  13, 51 (1995)

[39] Possibly what is meant in Jain HK Nature 288, 647 (1980)

[40] Brahmachari SK et al Gene 190, 17 (1997)

[41] Lipman DJ Nucleic Acids Research 15, 3580 (1997) Non-coding regions may be involved in mRNA stability, again influencing the function of genes post-transcriptionally.

[42] Sandler U, Wyler A Journal of Theoretical Biology 193, 85 (1998)

[43] See Zuckerkandl E Gene 205, 323 (1997) who, as a major proponent of functional junk DNA, also proposes that Junk is involved in sectorial gene repression

[44] Ohler U et al Bioinformatics 15, 362 (1999) as an example of such a search

[45] Duret L, Bucher P Current Opinion in Structural Biology 7, 399 (1997)

[46] Donehower LA et al Nucleic Acids Research 17, 699 (1989)

[47] Ohta T Nature  292, 648 (1981)- occur randomly (Cell control independent?)

[48] Charlesworth B et al Nature 371, 215 (1994) non random loss (Cell control?)

[49] Mantegna RN et al Physical Review Letters 73, 3169 (1994)

[50] Flam F Science 266, 1320 (1994)  Many papers on the subject cite this news article.

[51] i.e. Shannon’s redundancy function. This theory states that a language can lose words or letters and still be decipherable. Shannon computed this redundancy using the concept of entropy.

[52] Simply put, if one ranks the words of a language by how often they occur and plots frequency of occurrence against rank, the result is linear on a double logarithmic scale, with a slope of -z. This is the case for all natural languages.

[53] For an interesting usage of this phenomenon, see S Singh The Code Book: The Evolution of Secrecy from Mary, Queen of Scots to Quantum Cryptography, Doubleday 1999

[54] Konopka AK, Martindale C Science 268, 789 (1995)

[55] Bonhoeffer et al Science 271, 14 (1996) and Bonhoeffer et al Physical Review Letters 76, 1977 (1996) claim that the results are due solely to unequal nucleotide frequencies in coding vs. non-coding DNA

[56] Israeloff NE et al Physical Review Letters 76, 1976 (1996) claimed that there was no control study backing up the results. They did their own study and found that Zipf analysis could not differentiate between language and power-law noise

[57] Voss RF Physical Review Letters 76, 1978 (1996) claims that the paper ignores the fact that, while Zipf’s law exists, it provides no useful information about a language. Random sequences are also found to follow Zipf’s law.

[58] Tsonis AA et al Journal of Theoretical Biology 184, 25 (1997) As opposed to the other short letters, this paper is a little more in depth, claiming both biological and statistical proof

[59] Attard GS et al Europhysics Letters 36, 391 (1996) The paper is somewhat misleading, as the reader might mistake it for supporting Mantegna et al. But the paper’s conclusions are that any language found in Junk DNA is that of opportunistic elements that exploit the structure of Junk DNA and not the sequences themselves.

[60] Chatzidimitriou-Dreismann CA et al Nucleic Acids Research 24, 1676 (1996) used computer simulations on both natural and artificial sequences to arrive at their conclusions

[61] See also Stanley et al Nuovo Cimento della Società Italiana di Fisica D 16, 1339 (1994) for support of the Zipf law theory.

[62] Mantegna et al Physical Review Letters 76, 1980 (1996)

[63] Chechetkin VR, Lobzin VV Physics Letters A 222, 354 (1996) Supports the usage of the Shannon Redundancy Function as proof that there is information in Junk DNA, although what it is remains unknown

[64] for example the expansion modification system

[65] Li W, Kaneko K Europhysics Letters 17, 655 (1992) and Li W et al Physica D  75, 392  (1994)

[66] Frontali C, Pizzi E Gene 232, 87 (1999)

[67] This is a major problem in coding databases.  See Altschul, SF et al Nature Genetics 6, 119 (1994)

[68] PSI-BLAST is probably more useful when dealing with Junk as it allows for gaps, which are likely since polymerase slippage is responsible for a lot of Junk. Altschul SF Nucleic Acids Research 25, 3389 (1997)

[69] See McMurray CT PNAS 96, 1823 (1999) for the current debate on different possibilities in DNA secondary structure

[70] Lakhotia SC Indian Journal of Biochemistry and Biophysics 33, 93 (1996)

[71] Morozov Syu et al Journal of Biomolecular Structural Dynamics 11, 837 (1994) for effects of DNA structure on replication

[72] I would assume that harnessing the power of the NSA’s cipher cracking computers and analysts would be very useful in this field. See http://www.nsa.gov:8080/programs/msp/grants.html

[73] See for example Almirantis Y Journal of Theoretical Biology 196, 297 (1999) and Elder JF, Turner BJ Quarterly Review of Biology 70, 297 (1995)

[74] See first endnote