Adv Bioinformatics. 2012; 2012: 287486.

Published online 2012 July 20. doi: 10.1155/2012/287486

PMCID: PMC3418640

Gaetano Pierro^{*}

System Biology, PhD School, University of Salerno, Via Ponte Don Melillo, 84084 Fisciano, Italy

*Gaetano Pierro: Email: ti.liamtoh@orreiponateag

Academic Editor: Ramana Davuluri

Received 2012 February 29; Revised 2012 May 16; Accepted 2012 June 7.

Copyright © 2012 Gaetano Pierro.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The complexity of the nucleotide sequences in chromosome 3 of *Caenorhabditis elegans* (*C. elegans*) is studied and compared with that of some random sequences. Moreover, by using parameters related to complexity, such as fractal dimension, frequency, and the indicator matrix, a first classification of the sequences of *C. elegans* is given. In particular, the sequences with the highest and lowest fractal dimension are singled out. It is shown that the low fractal dimension sequences share many features with the random sequences.

*Caenorhabditis elegans* (*C*. *elegans*) is a transparent nematode about 1 mm long. Thanks to its simple anatomy, it has been adopted as a model organism in genetic research. Early studies on *C*. *elegans* began in 1962 with work on cell lineage and apoptosis [1, 2]. *C*. *elegans* has 2 distinct sexes, the hermaphrodite and the male. The male is very rare in nature (approximately 0.05% of the population). The hermaphrodite has 959 cells and the male 1031. At the chromosomal level, the hermaphrodite is XX and the male is X0. *C*. *elegans* reproduces in 2 distinct ways: by mating or, in the case of the hermaphrodite, by self-fertilization. The life cycle of *C*. *elegans* consists of 4 larval stages (from L1 to L4); however, under harsh environmental conditions, such as lack of food, *C*. *elegans* remains in the L3 larval stage until conditions improve.

The sequencing of the *C*. *elegans* genome was completed in 2002. *C*. *elegans* has 5 autosomes plus the sex chromosome X. In total, the genome comprises nearly 100 million base pairs and 19000 genes [3–5]. A multifractal analysis of the *C*. *elegans* genome has shown that chromosome 3 has the strongest multifractal characteristics, while the sex chromosome X is the least multifractal [6]. In this work we analyze, for the first time, the different types of sequences belonging to the genome of *C*. *elegans*, focusing on those that show fractal characteristics. Chromosome 3 has been studied in detail because of its asymmetric and inhomogeneous statistical characteristics. Through the analysis of this chromosome, we investigate which features make it more “complex” from a biostatistical point of view, using statistical parameters such as the complexity, the fractal dimension, the correlation matrix, and the nucleotide frequency. The concept of fractality in biology is further clarified.

On chromosome 3 of *C*. *elegans*, 2780 genes have been identified. In this paper, almost all nucleotide sequences located on chromosome 3 were analyzed and compared with random sequences. In particular, it will be shown that the nucleotide sequences with a low fractal dimension have features in common with random sequences of low fractal dimension. Moreover, the sequence with the highest fractal dimension is close to the random sequence with high fractal dimension and, in particular, shows a high frequency of cytosine.

From a mathematical point of view, a fractal is a geometric object characterized by self-similarity; that is, it repeats its structure in the same way at different scales. A more rigorous definition of a fractal is based on four properties: self-similarity, fine structure, irregularity, and noninteger dimension [7]. The fractal dimension is a parameter that quantifies the degree of complexity or disorder by measuring the unsmoothness of the object. It provides a measure of the amount of information contained in a sequence: a higher value corresponds to a higher information content. Generally, this value ranges between 1 and 2, so that a higher value corresponds to a higher complexity. Fractality has been observed and measured in pathology and cancer models [8, 9], in the branching of blood vessels and the irregularity of the contours of tumor cells [10, 11], in the analysis of complete genomes [12], in the correlation analysis of protein sequences [13], in tissue pathology [14], in exons and introns [15], and in nuclei [16], and it is involved in blood cancer [17, 18].

On chromosome 3 of *C*. *elegans*, 2780 genes have been singled out [19]. Some of them are very short (less than about 50 nucleotides) and thus useless for any statistical analysis, and some are still under investigation, so that some nucleotides are not yet properly identified. For this reason, only sequences of significant length were selected, the shortest being about 100 nucleotides. In particular, we investigated 100 genes (whole sequence), 85 repeat sequences, 71 noncoding sequences (introns), and 100 coding sequences (exons lacking UTRs). For comparison, 100 random sequences of 100 nucleotides were generated. All sequences were downloaded from the National Center for Biotechnology Information [19]. A simple formula to estimate the fractal dimension, based on the correlation matrix, has been given in [20, 21]: the fractal dimension is defined as the average of the number *p*(*n*) of 1's in the randomly taken *n* × *n* minors of the *N* × *N* correlation matrix *u*_{hk} (see also [20–24]).

In particular, let

$$\begin{array}{c}{\mathcal{A}}_{4}=\left\{A,C,G,T\right\}\end{array}$$

(1)

be the finite set (alphabet) of nucleotides and *x* ∈ 𝒜_{4} any member of the 4-symbol alphabet.

A DNA sequence is the finite symbolic sequence 𝒮(*N*) = ℕ × 𝒜_{4} so that

$$\begin{array}{c}\mathcal{S}\left(N\right)={\left\{{x}_{h}\right\}}_{h=1,\dots ,N},\quad N<\infty\end{array}$$

(2)

being

$$\begin{array}{c}{x}_{h}=\left(h,x\right)=x\left(h\right),\quad \left(h=1,2,\dots ,N;\; x\in {\mathcal{A}}_{4}\right)\end{array}$$

(3)

the nucleic acid *x* at the position *h*.

Let 𝒮_{1}(*N*), 𝒮_{2}(*N*) be two DNA sequences, the indicator function [20, 22–26] is the map

$$\begin{array}{c}u:{\mathcal{S}}_{1}\left(N\right)\times {\mathcal{S}}_{2}\left(N\right)\to \left\{0,1\right\}\end{array}$$

(4)

such that the correlation matrix

$$\begin{array}{c}{u}_{hk}=u\left({x}_{h},{x}_{k}\right)=\left\{\begin{array}{ll}1,& \text{if }{x}_{h}={x}_{k},\\ 0,& \text{if }{x}_{h}\ne {x}_{k},\end{array}\right.\quad \left({x}_{h}\in {\mathcal{S}}_{1}\left(N\right),\; {x}_{k}\in {\mathcal{S}}_{2}\left(N\right)\right)\end{array}$$

(5)

is a matrix of 0's and 1's showing the existence of correlation. When 𝒮_{1}(*N*) = 𝒮_{2}(*N*), the indicator function shows the existence of autocorrelation on the same sequence.
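As an illustration of definitions (4) and (5), the following Python sketch builds the indicator matrix of two short sequences. The function name `indicator_matrix` and the toy sequence are assumptions introduced here for illustration only; they are not taken from the original analysis.

```python
def indicator_matrix(s1, s2=None):
    """Return the 0/1 matrix u[h][k] = 1 if s1[h] == s2[k], else 0.

    If s2 is omitted, the sequence is compared with itself, which
    gives the autocorrelation case described in the text.
    """
    if s2 is None:
        s2 = s1
    return [[1 if a == b else 0 for b in s2] for a in s1]


# Toy example (hypothetical sequence, not from chromosome 3).
for row in indicator_matrix("ACGA"):
    print(row)
```

In the autocorrelation case the matrix is symmetric with 1's on the main diagonal, since every nucleotide matches itself.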

The probability distribution of nucleotides can be defined by the frequency

$$\begin{array}{c}{p}_{X}\left(n\right)=\frac{1}{n}\sum _{i=1}^{n}{u}_{Xi},\quad \left(X\in {\mathcal{A}}_{4}\right)\end{array}$$

(6)

that the nucleic acid *X* is found at the position *n*, where *u*_{Xi} = *u*(*X*, *x*_{i}). This value can be approximated by the frequency count (on the indicator matrix) of the nucleotide distribution before *n* [20, 21, 23, 24]. With *p*(*n*) denoting the number of 1's in the randomly taken *n* × *n* minors of the correlation matrix, the fractal dimension is then estimated by

$$\begin{array}{c}D=\frac{\mathrm{1}}{\mathrm{2}}\frac{\mathrm{1}}{N}\sum _{n=\mathrm{2}}^{N}\frac{\mathrm{log}p\left(n\right)}{\mathrm{log}n}.\end{array}$$

(7)
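A minimal Python sketch of formula (7) is given below. It rests on one loudly stated assumption: *p*(*n*) is taken as the number of 1's in the leading *n* × *n* block of the autocorrelation matrix, standing in for the "randomly taken" minors, whose sampling scheme is not specified here. The function name `fractal_dimension` is likewise hypothetical, and the natural logarithm is used (the ratio of logarithms does not depend on the base).

```python
import math


def fractal_dimension(seq):
    """Estimate D = (1/2) * (1/N) * sum_{n=2}^{N} log p(n) / log n.

    Assumption: p(n) is the count of 1's in the leading n x n block
    of the indicator matrix of seq compared with itself.
    """
    N = len(seq)
    if N < 2:
        raise ValueError("sequence must contain at least 2 symbols")
    counts = {}   # occurrences of each symbol in the current prefix
    ones = 0      # number of 1's in the leading n x n block
    total = 0.0
    for n, x in enumerate(seq, start=1):
        c = counts.get(x, 0)
        # The new row and column add 2*c matches plus the diagonal entry.
        ones += 2 * c + 1
        counts[x] = c + 1
        if n >= 2:
            total += math.log(ones) / math.log(n)
    return 0.5 * total / N


print(fractal_dimension("AAAA"))  # homopolymer: p(n) = n**2 for every n
```

For a sequence made of a single repeated symbol every block is full of 1's, so each term of the sum equals 2 and the estimate reduces to (*N* − 1)/*N*; this gives a simple sanity check of the implementation rather than a biologically meaningful value.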

In order to have a measure of complexity, for an *n*-length sequence, we use the following definition [20–24]:

$$\begin{array}{c}K=\mathrm{log}{\left(\frac{n!}{{a}_{n}!{c}_{n}!{g}_{n}!{t}_{n}!}\right)}^{\mathrm{1}/n}\end{array}$$

(8)

with

$$\begin{array}{c}{a}_{n}=\sum _{h=\mathrm{1},\dots ,n}u\left(A,{x}_{h}\right),{c}_{n}=\sum _{h=\mathrm{1},\dots ,n}u\left(C,{x}_{h}\right),{g}_{n}=\sum _{h=\mathrm{1},\dots ,n}u\left(G,{x}_{h}\right),{t}_{n}=\sum _{h=\mathrm{1},\dots ,n}u\left(T,{x}_{h}\right).\end{array}$$

(9)
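Definitions (8) and (9) can be sketched in Python as follows. The names `complexity` and `complexity_curve` are assumptions made for this example, and the natural logarithm is assumed because the base is not stated. Log-factorials are evaluated with `math.lgamma` so that long sequences do not require huge integers.

```python
import math

ALPHABET = "ACGT"


def complexity(seq):
    """K = (1/n) * log( n! / (a_n! c_n! g_n! t_n!) ), natural log assumed."""
    n = len(seq)
    if n == 0 or set(seq) - set(ALPHABET):
        raise ValueError("expected a non-empty sequence over A, C, G, T")
    log_multinomial = math.lgamma(n + 1)  # log(n!)
    for base in ALPHABET:
        # seq.count(base) plays the role of a_n, c_n, g_n, t_n in (9).
        log_multinomial -= math.lgamma(seq.count(base) + 1)
    return log_multinomial / n


def complexity_curve(seq):
    """K evaluated on each prefix of seq (lengths 1..N)."""
    return [complexity(seq[:n]) for n in range(1, len(seq) + 1)]


print(complexity("AAAA"), complexity("ACGT"))
```

Evaluating *K* on successive prefixes is one plausible way to obtain a complexity curve along a sequence; whether the published curves were produced exactly this way is not stated.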

By using formula (7), the fractal dimension has been computed for each sequence of nucleotides, and the results are shown in Tables 1 and 2. In particular, the sequences with the maximum and minimum fractal dimension among the whole gene, coding, noncoding, repeat, and random sequences have been singled out.

From these computations, we can see that the AT-rich repeat sequence (69826–69901) has the lowest fractal dimension, 1.24155. This can be explained by the fact that the sequence consists largely of only 2 nucleotides, so that it has low variability and low complexity. The sequence with the highest fractal dimension is also a repeat sequence, CER 16-2-i-CE, with a fractal dimension of 1.31280. Although there are some fluctuations, due to the fact that random generation by a computer is in fact pseudorandom, the fractal dimensions of the random sequences are localized around 1.28, which is intermediate between the maximum and minimum values obtained over all sequences examined. Further information about the heterogeneity of the data is given by the complexity parameter (8). In Figure 1, the complexity curves of the sequences with maximum and minimum fractal dimension are plotted.

Figure 1: Curves of min-max complexity: (a) whole gene, (b) noncoding, (c) coding, (d) repeats, and (e) random sequences.
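The random reference set used for comparison (100 pseudorandom sequences of 100 nucleotides) could be generated along the following lines. This is only a sketch: the uniform choice among the four nucleotides, the fixed seed, and the function name `random_sequences` are assumptions, since the generator actually used is not described.

```python
import random


def random_sequences(count=100, length=100, seed=0):
    """Return `count` pseudorandom nucleotide strings of `length` symbols.

    Assumption: each position is drawn uniformly from A, C, G, T.
    A dedicated Random instance with a fixed seed keeps runs reproducible.
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choice("ACGT") for _ in range(length))
        for _ in range(count)
    ]


sequences = random_sequences()
print(len(sequences), len(sequences[0]))
```

Fixing the seed makes the pseudorandom nature of the data explicit: the same seed always reproduces the same set, while different seeds give the fluctuations mentioned in the text.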

We investigated the complexity of the nucleotide sequences. In all cases, the curve of higher complexity corresponds to the sequence with the highest fractal dimension. Thus, complexity and fractal dimension appear to be equivalent parameters for studying the complexity of a sequence. These results depend on the distribution of nucleotides. By using definition (6), we can compute the frequency distribution on a sequence; the frequencies for each nucleotide (adenine, cytosine, guanine, thymine) are shown below. In Figure 2, the max-min curves for frequencies on the whole gene sequence are plotted. It can be seen that, in this case, adenine and cytosine tend to have the same value, while thymine and guanine maintain a significant distance between the max and min curves. Max-min frequency curves for noncoding sequences are shown in Figure 3. Taking into account the fractal dimensions given in Tables 1 and 2, we observe that the higher frequency of cytosine corresponds to the higher fractal dimension. Thymine, instead, is more frequent in sequences with low fractal dimension. In Figure 4, the max-min frequency curves for coding sequences are drawn. Also in this case, adenine and thymine are more frequent in the sequence with lower fractal dimension and, as before, cytosine is more frequent in the sequence with higher fractal dimension. Repeat and random sequences are shown in Figures 5 and 6, respectively. For the repeats, adenine is more frequent in the low fractal sequence, while cytosine is more frequent in the high fractal sequence. For the random sequences, cytosine is more frequent in the sequence with the highest fractal dimension.
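The frequency curves discussed above follow directly from definition (6). The Python sketch below computes, for a chosen nucleotide, the running frequency over each prefix of a sequence; the name `frequency_curve` and the toy sequence are assumptions made for illustration.

```python
def frequency_curve(seq, base):
    """Return [p_X(1), ..., p_X(N)] for nucleotide X = base, as in (6).

    p_X(n) is the fraction of the first n positions occupied by X.
    """
    curve = []
    count = 0
    for n, x in enumerate(seq, start=1):
        if x == base:
            count += 1
        curve.append(count / n)
    return curve


# Toy example (hypothetical sequence).
for base in "ACGT":
    print(base, frequency_curve("ACCA", base))
```

Plotting such curves for the sequences with maximum and minimum fractal dimension would give max-min frequency plots of the kind described in the text.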

From the frequency analysis and the fractal dimensions in Tables 1 and 2, we can see that there is a correspondence between the frequencies of nucleotides and the fractal dimension: sequences with a lower fractal dimension have, in most cases, a higher frequency of adenine and thymine, while cytosine is more frequent in high fractal sequences. Almost the same holds for random sequences, especially for thymine and cytosine. According to (5), the indicator map of an *N*-length sequence can be represented by an *N* × *N* sparse matrix of binary values {0, 1}, and this matrix can be visualized by the (autocorrelation) dot-plots [20, 22] of Figures 7, 8, 9, 10, and 11. Panels (a) of these figures show the sequences (of Table 1) with the maximum fractal dimension, while panels (b) show the sequences of Table 2 with the minimum fractal dimension. Also in these plots, the distribution of nucleotides gives rise to some typical patterns.

Figure 7: Autocorrelation plots on the whole gene sequences corresponding to max and min values of fractal dimension in (a) and (b), respectively.

Figure 8: Autocorrelation plots on the noncoding sequences corresponding to max and min values of fractal dimension in (a) and (b), respectively.

Figure 9: Autocorrelation plots on the coding sequences corresponding to max and min values of fractal dimension in (a) and (b), respectively.

Figure 10: Autocorrelation plots on the repeat sequences corresponding to max and min values of fractal dimension in (a) and (b), respectively.

Figure 11: Autocorrelation plots on the random sequences corresponding to max and min values of fractal dimension in (a) and (b), respectively.

All sequences with low fractal dimension (panels (b) of Figures 7 to 11) show a marked presence of nucleotide correlation; this feature is less pronounced in the sequences with higher fractal dimension, where we expect the sequence to have a more complex structure.

In this work, the different types of sequences (repeats, coding, noncoding, whole gene, random) of chromosome 3 of *C*. *elegans* (the chromosome with the highest fractality) have been analyzed by means of statistical parameters such as the indicator matrix, complexity, frequency, and fractal dimension. Our aim was to give a statistical classification of these sequences and to understand their complexity as a function of the distribution of nucleotides. By using (7), the fractal dimension of all sequences was obtained. In detail, it was observed that the repeat sequences (which do not code for proteins) have the highest variability, since they attain both the minimum and the maximum over all the sequences of *C*. *elegans* examined. This leads us to analyze the role and the functional meaning of the repeats within gene sequences. We then verified the equivalence between the fractal dimension and the complexity parameter, since the sequences with the highest fractality also appear to have a greater degree of complexity. From the frequency distribution of nucleotides, it was noticed that adenine is more frequent in sequences with a lower fractal dimension and, in particular, in the sequence with the lowest fractal dimension overall (AT rich). This result seems to depend on the fact that the sequence is made up of only 2 nucleotides, adenine and thymine. Cytosine, instead, appears to be the most frequent nucleotide in the sequences with the highest fractal dimension, in particular in the sequence CER 16-2-i-CE. These results lead us to conjecture that there is a correlation between the fractal dimension and the frequency of nucleotides such as adenine and cytosine.

The information content of a sequence of nucleotides depends on how the nucleotides are distributed, so that two sequences with the same nucleotides arranged in two different permutations might have two different complexities (fractal dimensions). In future work, this aspect of the internal organization of the sequence will be further analyzed. Moreover, these results must be confirmed in other organisms that are evolutionarily distant from each other, to better investigate the findings obtained so far. At present, the results have been compared with some random sequences, which have a random nucleotide distribution, and in that case we obtained a significant correspondence with the complexity of the nucleotide sequences.
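The remark about permutations can be illustrated with a small Python sketch. It compares two toy sequences with identical nucleotide counts but different arrangements, using an estimate of (7) in which *p*(*n*) is assumed to be the number of 1's in the leading *n* × *n* block of the autocorrelation matrix. The sequences, the function name, and that choice of minor are all assumptions made for this example.

```python
import math
from collections import Counter


def fractal_dimension(seq):
    """D = (1/2)(1/N) sum_{n=2}^{N} log p(n) / log n, with p(n) taken
    (by assumption) as the number of 1's in the leading n x n block."""
    counts = Counter()
    ones = 0
    total = 0.0
    for n, x in enumerate(seq, start=1):
        ones += 2 * counts[x] + 1  # new row, new column, diagonal entry
        counts[x] += 1
        if n >= 2:
            total += math.log(ones) / math.log(n)
    return 0.5 * total / len(seq)


blocked = "AAAACCCC"      # same composition ...
alternating = "ACACACAC"  # ... different arrangement

print(Counter(blocked) == Counter(alternating))
print(fractal_dimension(blocked), fractal_dimension(alternating))
```

Under these assumptions the two estimates differ even though the composition is identical, because the sum in (7) runs over all prefix lengths and therefore depends on the order in which the nucleotides appear.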

1. Brenner S. The genetics of Caenorhabditis elegans. *Genetics*. 1974;77(1):71–94.

2. Kenyon C. The nematode Caenorhabditis elegans. *Science*. 1988;240(4858):1448–1453.

3. Hodgkin J, Horvitz HR, Jasny BR, Kimble J. C. elegans: sequence to biology. *Science*. 1998;282(5396):2011.

4. Bird AF, Bird J. *The Structure of Nematodes*. San Diego, Calif, USA: Academic Press; 1991.

5. Riddle DL, Blumenthal T, Meyer RJ, Priess JR. *C. elegans II*. New York, NY, USA: Cold Spring Harbor Laboratory Press; 1997.

6. Velez PE, Garreta LE, Martinez E, et al. The Caenorhabditis elegans genome: a multifractal analysis. *Genetics and Molecular Research*. 2010;9(2):949–965.

7. Mandelbrot B. *The Fractal Geometry of Nature*. San Francisco, Calif, USA: W. H. Freeman & Co; 1982.

8. Cross SS. Fractals in pathology. *Journal of Pathology*. 1997;182(1):1–8.

9. Baish JW, Jain RK. Fractals and cancer. *Cancer Research*. 2000;60(14):3683–3688.

10. Cross SS, Cotton DWK. The fractal dimension may be a useful morphometric discriminant in histopathology. *Journal of Pathology*. 1992;166(4):409–411.

11. Goldberger AL, West BJ. Fractals in physiology and medicine. *Yale Journal of Biology and Medicine*. 1987;60(5):421–435.

12. Yu ZG, Anh V, Lau KS. Measure representation and multifractal analysis of complete genomes. *Physical Review E*. 2001;64(3):031903.

13. Yu ZG, Anh V, Lau KS. Multifractal and correlation analyses of protein sequences from complete genomes. *Physical Review E*. 2003;68(2):021913.

14. Losa GA, Nonnenmacher TF. Self-similarity and fractal irregularity in pathologic tissues. *Modern Pathology*. 1996;9(3):174–182.

15. Xiao Y, Chen R, Shen R, Sun J, Xu J. Fractal dimension of exon and intron sequences. *Journal of Theoretical Biology*. 1995;175(1):23–26.

16. McNally JG, Mazza D. Fractal geometry in the nucleus. *The EMBO Journal*. 2010;29(1):2–3.

17. Adam RL, Silva RC, Pereira FG, Leite NJ, Lorand-Metze I, Metze K. The fractal dimension of nuclear chromatin as a prognostic factor in acute precursor B lymphoblastic leukemia. *Cellular Oncology*. 2006;28(1-2):55–59.

18. Ferro DP, Falconi MA, Adam RL, et al. Fractal characteristics of May-Grünwald-Giemsa stained chromatin are independent prognostic factors for survival in multiple myeloma. *PLoS ONE*. 2011;6(6):e20706.

19. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/genbank/

20. Cattani C. Fractals and hidden symmetries in DNA? *Mathematical Problems in Engineering*. 2010;2010:507056 (31 pages).

21. Cattani C, Pierro G. Complexity on acute myeloid leukemia mRNA transcript variant. *Mathematical Problems in Engineering*. 2011;2011:379873 (16 pages).

22. Cattani C. Wavelet algorithms for DNA analysis. In: Elloumi M, Zomaya AY, editors. *Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications*. Chapter 35. New York, NY, USA: John Wiley & Sons; 2010. pp. 799–842. (Wiley Series in Bioinformatics).

23. Cattani C. On the existence of wavelet symmetries in archaea DNA. *Computational and Mathematical Methods in Medicine*. 2012;2012:673934 (21 pages).

24. Cattani C. Complexity and symmetries in DNA sequences. In: Elloumi M, Zomaya AY, editors. *Handbook of Biological Discovery*. Chapter 5. New York, NY, USA: John Wiley & Sons; 2012. pp. 700–742. (Wiley Series in Bioinformatics).

25. Voss RF. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. *Physical Review Letters*. 1992;68(25):3805–3808.

26. Voss RF. Long-range fractal correlations in DNA introns and exons. *Fractals*. 1992;2(1):1–6.

Articles from Advances in Bioinformatics are provided here courtesy of **Hindawi Publishing Corporation**
