|Home | About | Journals | Submit | Contact Us | Français|
A genome space is a moduli space of genomes. In this space, each point corresponds to a genome. The natural distance between two points in the genome space reflects the biological distance between these two genomes. Currently, there is no method to represent genomes by a point in a space without losing biological information. Here, we propose a new graphical representation for DNA sequences. The breakthrough of the subject is that we can construct the moment vectors from DNA sequences using this new graphical method and prove that the correspondence between moment vectors and DNA sequences is one-to-one. Using these moment vectors, we have constructed a novel genome space as a subspace in RN. It allows us to show that the SARS-CoV is most closely related to a coronavirus from the palm civet not from a bird as initially suspected, and the newly discovered human coronavirus HCoV-HKU1 is more closely related to SARS than to any other known member of group 2 coronavirus. Furthermore, we reconstructed the phylogenetic tree for 34 lentiviruses (including human immunodeficiency virus) based on their whole genome sequences. Our genome space will provide a new powerful tool for analyzing the classification of genomes and their phylogenetic relationships.
Comparative genomics at the sequence level has existed for ~20 years, since the time genome sequencing started in earnest. Already there have been many proposals to compare genomes. Boore and Brown1 used gene order to study the evolutionary relationships of metazoan mitochondrial genomes. Snel et al.2 later constructed the phylogenetic tree for completely sequenced prokaryotic genomes based on gene content. In Snel et al.'s method, the similarity between two genomes is defined as the number of genes that they have in common divided by their total number of genes. However, such techniques are time-consuming as they must first identify all the genes in one genome. These approaches, together with G + C content, edit distance, and reversal and rearrangement distances,3–5 compare genomes using only partial genomic information. Thus, these results are usually controversial because single-gene sequences generally do not contain enough information to construct an evolutionary history of organisms.
In Koonin's editorial6 about the emerging paradigm and open problems in comparative genomics, he pointed out ‘… but within these superfamilies, there is nothing like a straight one-to-one correspondence between genomes, and in distant genomes, most of the members may not be orthologous’. This sentence enlightened us to get a novel idea of comparing genomes. We can construct a genome space. In this space, each point corresponds to a genome uniquely. The natural distance between two points in the genome space reflects the biological distance between these two genomes. Currently, there is no method to represent genomes by a point in a space without losing biological information.
We introduce graphical representation of DNA sequence to construct the genome space. The graphical representation of DNA sequence provides a simple way of viewing, sorting, and comparing various gene structures. Thus, it is an attractive and promising research direction. The first important method in this direction is due to Hamori.7 He used a three-dimensional curve to represent a DNA sequence. Gates8 later constructed a two-dimensional graphical representation that is simpler than the Hamori curve. However, Gates' graphical representation has high degeneracy. Recently, we reported a new two-dimensional graphical representation of gene sequences9 which has no circuit or degeneracy, so that the correspondence between gene sequences and gene graphs is one-to-one. In this way, the original DNA sequence can be recovered from its graph mathematically without loss of biological information. In this paper, we make a minor modification of our previous method and obtain a new graphical representation approach for DNA sequences. The breakthrough of the subject is that we can construct the moment vectors from DNA sequences using this new graphical method, and we can prove that the correspondence between moment vectors and DNA sequences is one-to-one. The novelty and uniqueness of our approach is that by using these moment vectors of DNA sequences, we have constructed a genome space as a subspace in Euclidean space. Each genome sequence can be represented as a point in this space. Therefore, this genome space can be used to make comparative analysis to study the clustering and phylogenetic relationship among genomes. The biological (evolutionary) distance between two genomes can be obtained through the Euclidean distance among the corresponding points in the genome space.
We constructed a new DNA sequence graph in two quadrants of the Cartesian coordinate system, with pyrimidines (C and T) in the first quadrant and purines (A and G) in the fourth quadrant. As shown in Fig. 1, the vectors corresponding to the four nucleotides G, A, T, and C are as follows:
Nucleotide vector system based on G(1, −2/3), A(1, −1/3), T(1, 1/3), and C(1, 2/3).
Here, we should emphasize that the specific ordering of the four nucleotides in the Cartesian coordinate system is related to the GC content of genomes (as it is explained in the ‘Discussion’ section). Points in the graphical representation are obtained by the sum of vectors representing nucleotides in the sequence. In Fig. 2, we give the graphical representation of whole mitochondrial genome sequences of human, common chimpanzee, Norway rat, and hedgehog, which are based on the vector system shown in Fig. 1. Since human and chimpanzee belong to the same order (primates), their mitochondrial genome graphical representations are very similar visually. These four DNA sequence graphical curves have no circuits or degeneracy and the correspondence between the sequence and the graphical curve can be mathematically proved to be one-to-one.9
Motivated by our previous work for protein sequence analysis,10 we use moment vectors to characterize a DNA graphical curve. Given the graphical curve of a DNA sequence, which can be represented by a sequence of points (1, y1), (2, y2), … , (n, yn), we can compute a sequence of numbers 1 − y1, 2 − y2, … , n − yn. Conversely, if we know the sequence of numbers 1 − y1, 2 − y2, … , n − yn, we can recover the graph (1, y1), (2, y2), … , (n, yn). Therefore, we want to find a sequence of numbers, each of which uses the global information of the sequence of numbers 1 − y1, 2 − y2, … , n − yn in such a way that this new sequence of numbers determines and is determined by the sequence of numbers 1 − y1, 2 − y2, … , n − yn. For this purpose, we consider the moments which are defined as follows:
where n is the number of nucleotides contained in a DNA sequence, and (xi, yi) represents the position of the ith nucleotide in the DNA graphical curve. According to this definition, each DNA sequence has an n-dimensional moment vector (M1, M2, … , Mn) associated with it.
The crucial point in this paper is that the correspondence between a DNA sequence and its moment vector obtained from its sequence graph is one-to-one. To obtain this conclusion, we need to prove the following theorem.
Consider the set of DNA sequences having the same number (n) of nucleotides. Then, the correspondence between a DNA sequence and its n-dimensional moment vector (M1, M2, … , Mn) is one-to-one.
We have demonstrated that the correspondence between a DNA sequence and its graphical curve is one-to-one.9 In order to prove the theorem, we will need to prove that the correspondence between a DNA graphical curve and its moment vector is one-to-one.
By the definition, one DNA sequence graph has an n-dimensional moment vector (M1, M2, … , Mn). Hence, we need to demonstrate that from any given DNA moment vector, we can recover the DNA curve, which means all (xi, yi) (i = 1, 2, … , n) can be recovered from any given DNA moment vector.
xi is the x-coordinate value of ith nucleotide on a DNA graph. Based on our assignment, xi should be equal to i. yi is the y-coordinate value of ith nucleotide on a DNA graph. The next step is to obtain yi from moment vector. Let zi = xi − yi, then the moments can be simplified as:
To solve for zi, let δj = Mjnj, then the δj can be obtained by Mj and n. Clearly, δj and zi have the below relation:
These z1, z2, … , zn are roots of the polynomial
Let sd (d = 1, 2, … , n) be the elementary symmetric polynomials in z1, z2, … , zn, i.e. , , , … , sn = z1z2···zn, then s1 = −an−1, s2 = an−2, … , sn = (−1)na0. By using Newton's famous identities11: where d = 1, 2, … , n, ai can be obtained by δj as shown below:
As a result, the coefficients of the polynomial can be confirmed, and the set of all roots can be obtained. Next, we need to identify each root z1,z2, … , zn.
Because the y-coordinate values of four nucleotides are between –1 and 1, xi − yi ≥ 0. As we have defined that the position of kth nucleotide on a graph is (xk,yk) or (k,yk), the position of (k + 1)th nucleotide on a graph (xk+1, yk+1) can be represented as (k + 1, yk + uk+1), where uk+1 may be any of y-coordinate value of the four nucleotides. Thus, zk+1 = xk+1 − yk+1 = (k + 1) − (yk + uk+1) = (k − yk) + (1 − uk+1) ≥(k − yk). Because zk = xk − yk = k − yk, zk+1 ≥ zk. As a consequence, zi is increasing and each root can be identified by this property, which means each value of yi can be obtained. With all (xi,yi), a DNA graph can be recovered.
Therefore, we have successfully proved that the correspondence between a DNA sequence and its moment vector obtained from its sequence graph is one-to-one.
We have already obtained a good numerical characterization, moment vector, to represent a DNA sequence. Now, we will use this tool to construct a genome space. Here, we emphasize that the structure of genomes is complicated. It may be single-stranded or double-stranded and in a linear or circular structure. Thus, we should consider the different structures when constructing the genome space.
For the simplest genome structures, linear single-strand forms, we can treat them as linear DNA sequences. That is, every genome corresponds to a general DNA sequence. Thus, we can utilize our moment vector to construct the genome space. In order to use whole genome information to make comparative analysis among genomes, we can use the first N components (M1, M2, … , MN) of the moment vector of a genome sequence graph to represent a genome as a point in N-dimensional space. Thus, we obtain an N-dimensional genome space as a subspace in RN. Using the Euclidean distance between two points as an index for comparison, we can perform phylogenetic and clustering analysis for genome sequences in this genome space.
For the circular single-strand genomes, the construction of genome space is more complicated because we do not know which point is the start point in this circular DNA sequence. In this case, we treat every point as the start point in this circular sequence of length n, and then we get n linear single-strand genomes. For every linear single-strand genome sequence, we can compute its n-dimensional moment vector. Then, we take average by n for these n n-dimensional moment vectors to get a normalized moment vector (M1, M2, … , Mn). For circular single-strand genomes, we use the first N components (M1, M2, … , MN) of this normalized moment vector to represent a genome as a point in an N-dimensional space. Thus, we obtain an N-dimensional genome space as a subspace in RN.
For the double-stranded genomes, we need to point out that the moment vector of reverse complementary sequence is not the same as the original sequence. Generally, when meeting the double-stranded genomes, we treat them as two single-stranded genomes. We use the above method (linear or circular) to get two n-dimensional moment vectors for these two single-stranded sequences, and then take average to get a general moment vector (M1, M2, … , Mn). By using the first N components (M1, M2, … , MN) of this general moment vector to represent a genome as a point in N-dimensional space. Thus, we obtain an N-dimensional genome space as a subspace in RN. Here, we need to point out that the two strands of some genomes (e.g. mitochondrial genomes, some bacterial genomes) are differentiated by their nucleotide content, which are called the heavy strand and the light strand, respectively. The two strands have different masses because one has a higher proportion of heavier nucleic acids and its complement a lower proportion. In this case, we just treat them as the single-stranded (by using the heavy strand) genomes to make the genome space.
To verify that the biological distance obtained in this way truly incorporates biological utility, we apply our new genome space to the phylogenetic analysis of organisms. Most existing methods for phylogenetic inference using biological sequences can be divided into two groups. The algorithms in the first group utilize various distance measures12–15 which are based on different models of nucleotide substitution or amino acid replacement, and then transform the distance matrix into a tree. In the second group of approaches, instead of building a tree, the tree that can best explain the observed sequences under the evolutionary assumption is found by evaluating of different topologies. This category includes parsimony16–18 and maximum likelihood methods.19–21 All these methods require a multiple alignment of the sequences and assume some sort of evolutionary model, which require human intervention. Thus, the results are usually controversial. However, our genome space does not need sequence alignment and any evolutionary model. It is totally automatically generated and avoids computation repetition.
First, we consider the phylogeny of mammals. Mitochondrial DNA is not highly conserved and has a rapid mutation rate, thus it is very useful for studying the evolutionary relationships of organisms.22 We extracted 35 complete mammalian mitochondrial genome sequences from the GenBank, each of which has length of more than 16 000 nucleotides. Moreover, they have double-stranded and circular structures. As mentioned in the previous section, because we have already known the gene content of both strands of these genomes, we just treat them as the single-stranded (by using the heavy strand) circular genomes. For this case, we treat every point as the start point in this circular sequence of length n, and then we get n linear single-strand genomes. For every linear single-strand genome sequence, by using the nucleotide vector system shown in Fig. 1, we can compute its n-dimensional moment vector. Then, we take average by n for these n n-dimensional moment vectors to get a normalized moment vector (M1, M2, … , Mn). Here, we use the first 60 components of the moment vector (M1, M2, … , M60) to characterize these 35 genome graphical curves and obtained 35 points in 60-dimensional genome space. By computing the Euclidean distances between these points, we got the distance matrix for these 35 organisms. The phylogenetic tree for them shown in Fig. 3 is generated using UPGMA program in the MEGA 4 package.23 The last 10 mammals are grouped into a cluster because they are primates, and the phylogenetic relationship among them coincides with those found by Raina et al.24 We also found that Norway rat, vole, and squirrel are grouped into a cluster for the reason that they are rodent species.
Phylogenetic tree of the mitochondrial genome sequences of 35 mammal species. This tree was reconstructed by the first 60 components of the moment vector (M1, M2, … , M60). The accession numbers of these 35 species in the GenBank are as follows: ...
In order to further illustrate the efficiency of our genome space, we then focus on the origins and evolution of human immunodeficiency virus (HIV). HIV is a lentivirus that can lead to acquired immune deficiency syndrome (AIDS), a condition in humans in which the immune system begins to fail, leading to life-threatening opportunistic infections. To develop the anti-HIV drugs and vaccines, the research into the origins and evolution of this virus has become very important. Rambaut et al.25 reconstructed the phylogenetic tree of the primate lentiviruses including HIV-1, HIV-2, and the simian immunodeficiency viruses (SIVs). It was discovered that the two human viruses are related to different SIVs and therefore have different evolutionary origins. However, in Rambaut et al.'s paper, the tree was constructed using the maximum likelihood method on an alignment of the nucleotide sequences of polymerase gene in these lentiviruses. Generally, a single-gene sequence does not possess enough information to construct an evolutionary history of organisms, thus we use our new method to reconstruct the phylogenetic tree based on the whole genome sequences.
The genomes of lentiviruses are single-stranded linear RNA. RNA has the base uracil (U) rather than thymine (T) that is present in DNA. In fact, these RNA genome sequences downloaded from GenBank have already been transformed into DNA sequences (change U by T). Thus, we treat them as linear DNA sequences. Using the nucleotide vector system shown in Fig. 1, the sequence graphs of the genomes of the 33 lentiviruses were obtained. Here, we use the first 12 components of the moment vector to characterize these 33 genome graphical curves, and thus we obtained 33 twelve-dimensional vectors. These 33 vectors can be viewed as 33 points in a 12-dimensional genome space. By computing the Euclidean distance between these points, we reconstructed the phylogenetic tree of these primate lentiviruses (Fig. 4) using UPGMA program in the MEGA 4 package.23 The figure illustrates that both the HIV-1 and HIV-2 lineages fall within that of the SIVs which are isolated from other primates, thus they represent the independent cross-species transmission events. In agreement with Rambaut et al.'s result, our phylogenetic tree shows that HIV-1 group N is a new HIV-1 type between the HIV-1 groups M and O (somewhat closer to the M group). It also indicates that HIV-1 groups M, N, and O are closely related to SIVs from chimpanzees (Pan troglodytes troglodytes and Pan troglodytes schweinfurthii) because there is a mixing of the HIV-1 and SIV lineages. Moreover, our result suggests that P. t. troglodytes (SIVcpz2 and SIVcpz3 in Fig. 4) is the primary reservoir for HIV-1 group O, which is not clear in Rambaut et al.'s evolutionary tree. In Rambaut et al.'s evolutionary tree, the position of HIV-1 group O is just between P. t. troglodytes and P. t. schweinfurthii, but not closer to either of them. Our suggestion coincides with those found by Gao et al.26 In Gao et al.'s work, the authors concluded that the HIV-1 group O is closely related to the just one of these SIVcpz lineages, found in P. t. troglodytes. For HIV-2, we find that the SIVs that are closely related to HIV-2 are not only SIVsm (sooty mangabey monkey) and SIVmac (macaque) as shown by Rambaut et al.'s tree, but also SIVsun (sun-tailed monkey).
Evolutionary tree of the primate lentiviruses. This tree was reconstructed by using the first 12 components of the moment vector (M1, M2, … , M12). The subtypes of HIV-1 and HIV-2 are shown. The virus abbreviations, their primate hosts and genome ...
In addition, we apply our genome space to another field of virology: the taxonomy of coronavirus. To study the classification and phylogeny of coronaviruses clearly, we apply our genome space to a large set of 30 complete coronavirus genomes from GenBank, including the two newly sequenced human coronaviruses, HCoV-NL63 and HCoV-HKU1, along with four genomes from Flaviviridae and Togaviridae which are not coronaviruses (outgroups). The coronavirus genomes are also single-stranded linear RNA. So, similar to the above lentivirus case, we treat these coronavirus genomes as linear DNA sequences. Their abbreviation, accession number, description, and classification are shown in Table 1. First, we used our two-dimensional genome space (actually, it is a two-dimensional plane with the first two moments M1 and M2 being x-axis and y-axis) to characterize these ‘34’ virus genomes and calculated ‘34’ points in Fig. 5A. Four groups, group 1, group 2, group 3, and outgroups, can be seen in this figure as four distinct clusters. To study the classification for coronavirus clearly, we expanded it and obtained Fig. 5B.
Based on genotypic and serological characterization, coronaviruses are divided into three distinct groups, with human coronavirus 229E (HCoV-229E) being a group 1 coronavirus and human coronavirus OC43 (HCoV-OC43) being a group 2 coronavirus. There is one additional branch, the group 3 coronaviruses, which are found exclusively in birds. These three distinct groups of coronaviruses clearly form three distinct clusters in our two-dimensional genome space. We also find that the newly discovered human coronavirus NL63 is very close to the human coronavirus 229E, and thus it is classified into group 1. This result is the same as the identification by van der Hoek et al.27 Our two-dimensional genome space reveals that for human SARS-CoV, the most closely related coronavirus is from a small infected mammal, the palm civet (not a bird as initially suspected). This result coincides with those found by Guan et al.28 and Wang et al.29 In Guan et al.'s work, the authors also suggested that SARS viruses were isolated from Himalayan palm civets found in a live-animal market in Guangdong, China. Similarly, in Wang et al.'s work, the researchers found six palm civets at the restaurant were positive for SARS-associated coronavirus. They think that SARS cases at the restaurant were the result of recent interspecies transfer from the putative palm civet reservoir and not the result of continued circulation of SARS-CoV in the human population.
For another newly discovered human coronavirus HCoV-HKU1, Woo et al.30 classified it into group 2 because HCoV-HKU1 contains certain features that are characteristic of group 2 coronaviruses. However, these authors also stated that the proteins of HCoV-HKU1 formed distinct branches in the phylogenetic trees, indicating that HCoV-HKU1 is a distinct member within the group and is not very closely related to any other known member of group 2 coronaviruses. In our two-dimensional genome space, we found that HCoV-HKU1 is an individual coronavirus between the SARS group (which we suggest as group 4) and the traditional group 2, thus we propose that HCoV-HKU1 belongs to a new group 5. By computing the distances between the genomes in Fig. 5A, we also reconstructed the phylogenetic tree of these coronaviruses (Fig. 6) using UPGMA program in the MEGA 4 package.23 Thus, the evolutionary relationship of these genomes is clearer.
Phylogenetic tree of the coronavirus genomes. This tree was reconstructed by the first two components of the moment vector (M1, M2).
It should be pointed out that the construction of our genome space depends on four parameters (the y-coordinates of the A, C, T, and G in Fig. 1). If we change these four parameters, we shall have different embedding of the genome space. Because the GC content of DNA molecule is found to be variable with different organisms, GC content should be considered when we assign the y-coordinate values of nucleotide vectors. The genomes of three groups in this paper have low GC content. Most of them have 30–50% GC content, thus we assign larger y-coordinate absolute values for G and C. However, the y-coordinate values of the four nucleotides must be between −1 and 1 to assure that the correspondence between a DNA sequence and its corresponding moment vector is one-to-one. Therefore, in order to obtain a universal genome space for all species, further studies will be needed to determine the universal y-coordinate values.
In the study of mammalian mitochondrion, lentiviruses (including HIV), and coronavirus genomes, we used 60, 12, and 2 moments to construct the genome space, respectively. Here, we should emphasize that we do not need to calculate all the moments to determine the biological information of genomes. Remember that in the central limit theorem in probability and statistics, the limiting process is Gaussian. For Gaussian, the first two moments determine the density function. Thus, we just use the first N moments to get the results, where N is much less than n (the length of genome). Thus, for coronavirus genomes, we only used the first two components of the moment vector (M1, M2) because these two moments have allowed us to obtain the stable classified result—when higher moments are included the relationship of being close or farther away remains unchanged. To make this point clearer, we also use the first 20 components of the moment vector (M1, M2, … , M20) to construct the 20-dimmensional genome space. By computing the Euclidean distances between these points in this genome space, we reconstructed the phylogenetic tree of these coronaviruses (Fig. 7). Comparing Figs 6 and and7,7, we found that the classification relationship of these genomes are the same—group 1, group 2, group 3, group 4, group 5, and outgroups can still be seen in the two trees as six distinct clusters. This means that when using the higher moments, the relationship of being close or farther away remains unchanged. In other word, two moments are already enough to give the right classifying relationship for these genomes. For the same reason, we used the first 60 components of the moment vector (M1, M2, … , M60) and the first 12 components of the moment vector (M1, M2, … , M12) to generate the genome space for mammalian mitochondrion and lentiviruses genomes and obtained the stable phylogenetic analysis result.
Phylogenetic tree of the coronavirus genomes. This tree was reconstructed by the first 20 components of the moment vector (M1, M2, … , M20).
Gene rearrangements play important roles in evolution. The order of genes and transcription regions are changed during evolution by gene rearrangements such as DNA inversions and transpositions,31 which do not affect the gene content of the chromosomes. For example, inversions of large genomic fragments are often observed even between closely related species. The phylogenetic analysis based on the gene order is challenging as it requires detailed complex gene order data in genomes and intensive computation. In order to test whether our genome space is stable for genomic rearrangement, we consider the following simulated experiment. We choose the human mitochondrion genome and then invert its two genes ATPase 6 and Cytochrome oxidase III to get a simulated genome, which we called human AC genome. We can treat this new genome as the result of the inversion of genes from the human mitochondrion genome. Next, we randomly generate a genome sequence which has the same length and nucleotide content as the human mitochondrion genome, which we called random genome. Thus, we have three genomes of the same length and nucleotide content: human, human AC, and random. In addition, we also choose the common chimpanzee mitochondrion genome as comparison. By using the 60-dimmensional genome space, we calculate the distance among these four genomes and get a distance matrix in Table 2. In Table 2, we found that the distance between human and human AC is very small. This means that even some genomic rearrangement happened in a genome, the new produced genome still has very close distance from the original genome. Thus, in our genome space, the original genome and its genomic rearrangement genome represent different points but have very close distance. Large-scale genomic rearrangements in closely related organisms make the genome sequences very different from the original genome sequences. Although the moment vectors of the original and rearranged DNA sequences are different, if we treat double DNA genomes as two single-stranded genomes to get two n-dimensional moment vectors for these two single-stranded sequences and take the average to get a general moment vector (M1, M2, … , Mn), the computed distances among related species by gene rearrangements will be small. The genome spaces constructed by the moment vector have advantages over the gene order phylogenetic analysis since the genome spaces do not depend on gene orders and are capable of using all gene families. We will apply the moment vector method on the detailed phylogenetic analysis using the large-scale genomic rearrangement data.
In this paper, we report a two-dimensional graphical representation for DNA sequences. A moment vector system to represent a DNA sequence is introduced, and the correspondence between a DNA sequence and its moment vector is mathematically proven to be one-to-one. With this moment vector system, each genome sequence can be represented as a point in a Euclidean space, and the genome space is constructed as a subspace of this Euclidean space. Genomes with close evolutionary relationship and similar properties plot close together in this genome space. Thus, it will provide a new powerful tool for analyzing the classification of genomes and their phylogenetic relationships. Our method is easier and quicker in handling whole or partial genomes than multiple alignment methods. There are two major advantages to our method. (i) Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for any subsequent application, whereas in multiple alignment methods, realignment is needed for add-on new sequences. (ii) One can have global comparison of all genomes simultaneously, which no other existing method can achieve. Furthermore, in our method, the results in two-dimensional genome space can be displayed and viewed graphically; this is user-friendly and allows even non-expert to understand the relationship among different genomes via viewing the graph of genome space.
We gratefully acknowledge Professor Yuen-Ling Chan from University of Chicago, Department of Biochemistry and Molecular Biology who read our paper and gave some constructive comments. We also thank Dr Max Benson for critically reading and editing the manuscript.
Edited by Hiroyuki Toh