Some researchers studying Universality and related topics have applied some of their tools to the identification of exons within chromosomes. In particular, Peng, et al. [1] developed a technique called Detrended Fluctuation Analysis (DFA) to identify scale-invariant sequences of numeric values. They then converted DNA base sequences to numeric sequences by using the bases to direct a random walk. Each walk started with a value of zero, and each base was then examined in turn. If a base was a pyrimidine, they added 1 to the walk value; if a purine, they subtracted 1. The walk value at each step was recorded as part of the sequence.
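The walk construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `dna_walk` is hypothetical, and skipping ambiguous characters such as "N" is an assumption, since the text does not say how such bases were handled.

```python
PYRIMIDINES = set("CT")  # cytosine, thymine
PURINES = set("AG")      # adenine, guanine


def dna_walk(bases):
    """Return the walk value recorded after each base of the sequence."""
    value = 0  # each walk starts at zero
    walk = []
    for base in bases.upper():
        if base in PYRIMIDINES:
            value += 1  # pyrimidine: step up
        elif base in PURINES:
            value -= 1  # purine: step down
        else:
            continue  # assumption: ignore anything that is not A, C, G, or T
        walk.append(value)
    return walk


print(dna_walk("CTAGGA"))
```

The list returned here is the "walk"; the underlying series of +1 and -1 steps is the corresponding "sequence".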
They applied DFA to groups of coding and non-coding DNA sequences and observed significantly different values for each group. The following table was generated with their program dfa.c:
| DNA sequence | Sequence slope | Sequence correlation | Walk slope | Walk correlation |
|---|---|---|---|---|
| first 1000 random bases | .55 | .998 | 1.47 | .998 |
| Human coagulation (M11314) | .52 | .997 | 1.61 | .999 |
| Human p53 | .50 | .995 | 1.61 | .999 |
| Chrom 22: first 1000 bases | .54 | .998 | 1.72 | .999 |
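For readers unfamiliar with DFA, the following is a minimal pure-Python sketch of the general technique (first-order detrending). It is not a port of dfa.c, and the function names and the choice of box sizes are assumptions made for illustration. The series is integrated into a profile, the profile is cut into boxes of n points, a least-squares line is removed from each box, and the slope of log F(n) against log n is reported.

```python
import math
import random


def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fluctuation(profile, box_size):
    """RMS residual of the profile around a separate linear trend per box."""
    xs = list(range(box_size))
    n_boxes = len(profile) // box_size  # a trailing partial box is dropped
    total = 0.0
    for b in range(n_boxes):
        box = profile[b * box_size:(b + 1) * box_size]
        slope, intercept = linear_fit(xs, box)
        total += sum((y - (slope * x + intercept)) ** 2
                     for x, y in zip(xs, box))
    return math.sqrt(total / (n_boxes * box_size))


def dfa_slope(values, box_sizes=(4, 8, 16, 32, 64)):
    """Slope of log F(n) versus log n; box sizes are an arbitrary choice."""
    mean = sum(values) / len(values)
    profile, running = [], 0.0
    for v in values:
        running += v - mean  # integrate the mean-subtracted series
        profile.append(running)
    log_n = [math.log(n) for n in box_sizes]
    log_f = [math.log(fluctuation(profile, n)) for n in box_sizes]
    slope, _ = linear_fit(log_n, log_f)
    return slope


# Usage: an uncorrelated +1/-1 series and the walk built from it.
rng = random.Random(0)
steps = [rng.choice((1, -1)) for _ in range(1000)]
walk, position = [], 0
for s in steps:
    position += s
    walk.append(position)
print("sequence slope:", round(dfa_slope(steps), 2))
print("walk slope:", round(dfa_slope(walk), 2))
```

Because the walk is already the running sum of the sequence, applying the analysis to the walk integrates the data a second time, which would explain why the walk slopes in the table sit roughly one unit above the sequence slopes.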
Two applications have been built to help evaluate the ability of DFA to distinguish exons from non-coding regions within a chromosome. The program detrend-sequence-and-detrend-walk-from-matrix.pl breaks a chromosome into contiguous segments and performs a DFA on
For example, the 1Mb region of chromosome 22 beginning with base 13000001 includes 21 genes composed of 90 exons distributed among the 10,000 100-base segments of the region. The following expression:
( ( walk_slope > 1.44 ) and ( walk_slope < 1.65 ) ) and ( GC_percentage > .52 )
generated 491 guesses, which found 39 of the 90 exons in the range examined. That is, choosing 5% of the segments identified 43% of the exons. (Of course, some guesses may be correct even though the corresponding exons have not yet been identified officially.) Changing the first "and" to an "or" resulted in 1120 guesses that identified 54 of the 90 exons (11% of the segments identified 60% of the exons).
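The way such an expression turns per-segment statistics into guesses, and the way guesses are scored against known exons, can be sketched as follows. The data layout (tuples of segment number, walk slope, and G+C fraction), the function names, and the sample values are all hypothetical; only the thresholds, the 100-base segment length, and the starting base come from the text.

```python
SEGMENT_LENGTH = 100     # bases per segment, as in the text
REGION_START = 13000001  # first base of segment number 1


def is_exon_guess(walk_slope, gc_percentage):
    """The prediction expression quoted above."""
    return (walk_slope > 1.44 and walk_slope < 1.65) and gc_percentage > 0.52


def segment_range(number):
    """First and last base (inclusive) of a 1-based segment number."""
    start = REGION_START + (number - 1) * SEGMENT_LENGTH
    return start, start + SEGMENT_LENGTH - 1


def score(segments, exons):
    """Return (number of guesses, number of exons hit by some guess).

    segments: iterable of (segment_number, walk_slope, gc_percentage)
    exons: list of (first_base, last_base) ranges, inclusive
    """
    guesses = [n for n, slope, gc in segments if is_exon_guess(slope, gc)]
    found = set()
    for n in guesses:
        seg_start, seg_end = segment_range(n)
        for index, (exon_start, exon_end) in enumerate(exons):
            if seg_start <= exon_end and exon_start <= seg_end:  # overlap
                found.add(index)
    return len(guesses), len(found)


# Made-up example values, for illustration only.
sample_segments = [(1, 1.50, 0.60), (2, 1.70, 0.60), (3, 1.50, 0.40)]
sample_exons = [(13000050, 13000150), (13000250, 13000260)]
print(score(sample_segments, sample_exons))
```

Counting an exon as found when any guessed segment overlaps it is an assumption about how hits were tallied; the text does not spell out the rule.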
The following graph shows the
Note that segment number "1" along the X-axis in this graph corresponds to the segment beginning at base number "13000001". Note also that gene ranges plotted in blue overlay exon ranges plotted in red, so that genes composed of a single exon show no exons, and one side of an exon beginning or ending a gene range will be blue.
For chunks intersecting exons, the mean sequence slope is around .61 and the mean walk slope is around 1.55. Both slopes are roughly normally distributed. Presumably, chunks that do not intersect exons have slightly different mean values.
This facility is available as a web-based portal, which allows the user to plot any of the prediction criteria.
According to Peng, et al. in
"Quantification of scaling exponents and crossover
phenomena in nonstationary heartbeat time series" (CHAOS, Vol. 5,
No. 1, 1995),
a straight line "indicates the presence of scaling," and the slope of
the line indicates the nature of that scaling:
Possible directions for investigation
It seems reasonable that other genic structures should also display
non-scaling autocorrelations. For example, enhancer regions,
microORFs, pseudo-genes, and conserved non-genic sequences (CNGs)
(Dermitzakis,
Emmanouil T., et al., Evolutionary Discrimination of
Mammalian Conserved Non-genic Sequences, Science, 302, 2003) probably
yield false positives. The current program does not identify such
regions.
It would also be interesting to know if this approach identifies large
exons more accurately than small exons. Large exons typically
include one or more data segments that do not overlap with non-translating
regions, whereas small exons commonly include bases organized into both
exon and non-translating sequences.
It might be possible to use multiple regressions on this data to identify
better prediction expressions. In particular, it seems that a collection
of expressions, each tailored for a specific G+C proportion, would yield
the most accurate predictions. It might also be useful to compare the
100-base segmentation results with results derived using segments of
other sizes, possibly using both sets of data for prediction.
Finally, recent work with images of stromatolites suggests that images of
stromatolites compress better than images of similarly stratified
sedimentary rocks. Presumably that is due to the presence
of patterns in the stromatolites that are more susceptible to
compression than are whatever patterns may obtain in sedimentary
deposits.
It might also be the case that cDNA gene sequences compress differently
than non-coding DNA sequences.
To test that idea quickly, cDNA sequences for 2 different genes
were downloaded from NCBI and compressed using 2 different compression
techniques common to contemporary computing. The compression ratios
obtained for these gene sequences were then compared with the
ratios obtained for the bases from 13,000,001 onward within chromosome 22, as well
as for several random sequences of different lengths.
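A comparison of this kind can be sketched with Python's standard library. The text does not name the two compression techniques that were used, so zlib and bz2 below are stand-ins chosen as an assumption, and the helper names are hypothetical. The ratio is taken as compressed size divided by original size.

```python
import bz2
import random
import zlib


def compression_ratios(text):
    """Compressed size / original size for two stand-in compressors."""
    data = text.encode("ascii")
    return {
        "zlib": len(zlib.compress(data, 9)) / len(data),
        "bz2": len(bz2.compress(data)) / len(data),
    }


def random_bases(length, seed=0):
    """Uniformly random A/C/G/T string (hypothetical helper)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


# Usage: see how the ratio changes with the length of a random sequence.
for length in (1000, 10000, 100000):
    print(length, compression_ratios(random_bases(length)))
```

Running the same function over sequences of different lengths makes it possible to separate the effect of length from the effect of sequence content.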
Here are those results:
| DNA sequence | Original file size | Compressed file size | Ratio |
|---|---|---|---|
| Chromosome 22 | 33639906 | 8785821 | .2612 |
| 500000 random bases | 500500 | 139577 | .2789 |
| 10000 random bases | 10010 | 3104 | .3101 |
| Human coagulation (M11314) | 9030 | 2689 | .2978 |
| Human p53 | 1755 | 618 | .3521 |
| 1000 random bases | 1001 | 390 | .390 |
It appears that compression ratios of this sort are a function of
string length, rather than indigenous patterns.
References:
http://prola.aps.org/abstract/PRE/v49/i2/p1685_1
http://reylab.bidmc.harvard.edu/publications/Chaos/phsg.pdf
http://polymer.bu.edu/hes/articles/saabghlmmppssv96.pdf
http://reylab.bidmc.harvard.edu/heartsongs/pnas-2002-99-2466.pdf
For more information about this activity contact
Michael Grobe.