Using Detrended Fluctuation Analysis to identify exons

Some researchers studying Universality, etc. have applied some of their tools to the identification of exons within chromosomes. In particular, Peng, et al. [1] developed a technique called Detrended Fluctuation Analysis (DFA) to identify scale-invariant sequences of numeric values. They then converted DNA base sequences to numeric sequences by using the bases to direct a random walk. For example, each walk started with a value of zero and then each base was examined, in turn. If a base was a pyrimidine, they added 1 to the walk value; if a purine, they subtracted 1. The walk value at each step was recorded as part of the sequence.

They applied DFA to a groups of coding and non-coding DNA sequences and observed significantly different values for each group. Using their program dfa.c, the following table was generated:

DNA sequenceStatistic
Sequence slopeSequence correlation Walk slopeWalk correlation
first 1000 random bases .55.998 1.47.998
Human coagulation
(M11314)
.52.997 1.61.999
Human p53.50.9951.61.999
Chrom 22:first 1000 bases.54.9981.72.999

Two applications have been built to help evaluate the ability of DFA to distinguish exons from non-coding regions within a chromosome. The program detrend-sequence-and-detrend-walk-from-matrix.pl breaks a chromosome into contiguous segments and performs a DFA on

Look-at-chromosome.pl reads the data from the first program and displays it for the user. The user may enter a Boolean expression to use to predict which segments contain or are part of exons.

For example, the 1Mb region of chromosome 22 beginning with base 13000001 includes 21 genes composed of 90 exons distributed among the 10,000 100-base segments of the region. The following expression:

(   
   ( walk_slope > 1.44 ) and ( walk_slope < 1.65 )  
)   
and 
( GC_percentage > .52 )  
generated 491 guesses which found 39 of the 90 exons in the range examined. That is, choosing 5% of the segments identified 43% of the exons. (Of course, some guesses may be correct even though they have not yet been identified officially.) Changing the first "and" to an "or" resulted in 1120 guesses that identified 54 of 90 exons (11% identifed 60%).

The following graph shows the

Note that segment number "1" along the X-axis in this graph corresponds to the segment beginning at base number "13000001". Note also that gene ranges plotted in blue overlay exon ranges plotted in red, so that genes composed of a single exon show no exons, and one side of an exon beginning or ending a gene range will be blue.

For chunks intersecting exons, the mean sequence slope is around .61 and the mean walk slope is around 1.55. Both slopes are roughly normally distributed. Presumably, chunks not-intersecting exons have slightly different mean values.

This facility is available as a a web-based portal, which allows the user to plot any of the prediction criteria.

More about DFA

DFA was first proposed in: Peng C-K, Buldyrev SV, Havlin S, Simons M, Stanley HE, Goldberger AL. Mosaic organization of DNA nucleotides. Phys Rev E 1994;49:1685-1689, and was obtained from Physionet The DFA algorithm (informally) takes input data series of length N and divides it into a series of equal-sized intervals, or "boxes," each containing n entries, and for each box:
  1. a regression line is determined using the data in each interval, and
  2. the variation of the data with respect to the regression line is determined.
Finally, a root mean square variation (relative to the local trend), F(n), is calculated from variations for the individual boxes of size n, and the process is repeated for a group of box sizes, n (ranging from 2 (usually) up to N/4). The natural logs of the resulting value pairs, (n, F(n)), are then regressed to determine the slope of the variation with respect to box size.

According to Peng, et al. in "Quantification of scaling exponents and crossover phenomena in nonstationary heartbeat time series" (CHAOS, Vol. 5, No. 1, 1995), a straight line "indicates the presence of scaling," and the slope of the line indicates the nature of that scaling:

Possible directions for investigation

It seems reasonable that other genic stuctures should also display non-scaling autocorrelations. For example, enhancer regions, microORFs, pseudo-genes, and conserved non-genic sequences (CNGs) (Dermitzakis, Emmanouil T., et al., Evolutionary Discrimination of Mammalian Conserved Non-genic Sequences, Science, 302, 2003) probably yield false positives. The current program does not identify such regions.

It would also be interesting to know if this approach identifies large exons more accurately than small exons. Large exons typically include one or more data segments that do not overlap with non-translating regions, whereas small exons commonly include bases organized into both exon and non-translating sequences.

It might be possible to use multiple regressions on this data to identify better prediction expressions. In particular, it seems that a collection of expressions, each tailored for a specific G+C proportion would yield the most accurate predictions. It might also be useful to compare the 100-base segmentation results with results derived using segments of other sizes, possibly using both sets of data for prediction.

Finally, recent work with images of stromatolites suggests that images of stomatolites compress better than images of similarly stratified sedimentary rocks. Presumably that is due to the presence of patterns in the stomatolites that are more susceptible to compression than are whatever patterns may obtain in sedimentary deposits.

It might also be the case that cDNA gene sequences compress differently than non-coding DNA sequences. To test that idea quickly, cDNA sequences for 2 different genes were downloaded from NCBI and compressed using 2 different compression techniques common to contemporary computing. The compression ratios obtained for these genes sequences were then compared with the ratios obtained for bases 13,000,001 on within chromosome 22 as well as several random sequences of different lengths. Here are those results:

DNA sequenceFile size
OriginalCompressedRatio
Chromosome 22336399068785821 .2612
500000 random bases500500 139577.2789
10000 random bases10010 3104.3101
Human coagulation
(M11314)
90302689 .2978
Human p531755618.3521
1000 random bases1001 390.390

It appears that compression ratios of this sort are a function of string length, rather than indigenous patterns.

References:

  1. Peng C-K, Buldyrev SV, Havlin S, Simons M, Stanley HE, Goldberger AL. "Mosaic organization of DNA nucleotides," Phys Rev E 1994;49:1685-1689.
    http://prola.aps.org/abstract/PRE/v49/i2/p1685_1
  2. Peng, et al., "Quantification of scaling exponents and crossover phenomena in nonstationary heartbeat time series" (CHAOS, Vol. 5, No. 1, 1995.
    http://reylab.bidmc.harvard.edu/publications/Chaos/phsg.pdf
  3. H.E. Stanley, V. Afanasyev, L. A. N. Amaral, S. V. Buldyrev, A. L. Goldberger, S. Havlin, H. Leschhorn, P. Maass, R. N. Mantegna, C.-K. Peng, P. A. Prince, M. A. Salinger, M. H. R. Stanley, and G. M. Viswanathan, "Anomalous Fluctuations in the Dynamics of Complex Systems: From DNA and Physiology to Econophysics," Physica A 224, 302-321 (1996).
    http://polymer.bu.edu/hes/articles/saabghlmmppssv96.pdf
  4. Goldberger, et al., "Fractal dynamics in physiology: Alterations with disease and aging."
    http://reylab.bidmc.harvard.edu/heartsongs/pnas-2002-99-2466.pdf

For more information about this activity contact Michael Grobe.