Genome duplication (GD) has permanently shaped the architecture and function of many higher eukaryotic genomes. The angiosperms (flowering plants) are outstanding models in which to elucidate consequences of GD for higher eukaryotes, owing to their propensity for chromosomal duplication or even triplication in a few cases. Duplicated genome structures often require both intra- and inter-genome alignments to unravel their evolutionary history, also providing the means to deduce both obvious and otherwise-cryptic orthology, paralogy and other relationships among genes. The burgeoning sets of angiosperm genome sequences provide the foundation for a host of investigations into the functional and evolutionary consequences of gene and GD. To provide genome alignments from a single resource based on uniform standards that have been validated by empirical studies, we built the Plant Genome Duplication Database (PGDD; freely available at http://chibba.agtec.uga.edu/duplication/), a web service providing synteny information in terms of colinearity between chromosomes. At present, PGDD contains data for 26 plants including bryophytes and chlorophyta, as well as angiosperms with draft genome sequences. In addition to the inclusion of new genomes as they become available, we are preparing new functions to enhance PGDD.
Most higher organisms pass through different ploidy levels at different stages of development (1,2) and continuously produce aberrant unreduced gametes at low rates. However, the extreme rarity of genome duplications (GDs) in the evolutionary history of extant lineages, occurring only once in many (sometimes hundreds of) millions of years, shows that the vast majority of GD events quickly go extinct. For the rare survivors, classical views suggest that GD is potentially advantageous as a primary source of genes with new (3,4) or modified functions (5).
The angiosperms (flowering plants) are an outstanding model in which to elucidate consequences of GD in higher eukaryotes. Gene-order conservation in vertebrates is evident after hundreds of millions of years of divergence (6,7). However, the two major branches of the angiosperms (eudicots and monocots), estimated to have diverged 125–140 MYA (8) to 170–235 MYA (9), show much more rapid structural evolution, owing largely to their propensity for chromosomal duplication and subsequent gene loss (10), fragmenting ancestral linkage arrangements across multiple chromosomes (11–13). All angiosperm genomes published to date have shown evidence of paleopolyploidy (14). Although new data from yeast (15–17) and Paramecium (18) are shedding valuable light on consequences of GD in microbes, these consequences are expected to be very different in organisms with small effective population sizes such as angiosperms, mammals and other higher eukaryotes (19,20). For example, neofunctionalization is much more likely to occur in large populations, which contain more targets for mutations conferring new beneficial function. In contrast, subfunctionalization is improbable in large populations, as a partially subfunctionalized allele (the first step in the process) is more likely to be silenced by secondary mutations before reaching fixation by drift (19).
A host of investigations into the functional and evolutionary consequences of gene and GD may be empowered by genome alignments from a single resource based on uniform standards that have been validated by empirical studies. Algorithms commonly used in vertebrate genome alignments focus on identifying orthologous regions, as GDs are rare and ancient, and paralogous regions are often so diverged as to be unrecognizable. However, to reveal the consequences of the more recent and more frequent GDs in angiosperms and other taxa, identifying paralogous regions is of central importance, necessitating the use of multiple alignments, both within and among genomes. To tackle such problems, we implemented a multiple gene-order alignment tool, MCScan, which better reflects the true relationships among angiosperm genomes, in which GDs are frequently superimposed on speciations (21). Further, to empower comparative and functional studies across (and potentially beyond) the burgeoning set of plant genome sequences available, we built the Plant Genome Duplication Database (PGDD), a web service providing synteny information in terms of gene colinearity, both within and between genomes.
Besides PGDD, comparative genomic data are available from public databases such as CoGe (22), Phytozome (23), GreenPhylDB (24) and PLAZA (25). CoGe (22) provides comparative data across all species in any state of assembly by computing on the fly. Although this allows greater flexibility for the user, non-specialists who are searching for a well-curated resource may find it cumbersome to use. For green plants, Phytozome (23) and GreenPhylDB (24) provide well-controlled micro-synteny and gene family evolution data, but they do not support macro-synteny data. PLAZA (25), like PGDD, provides fine macro-synteny data in plants, as well as micro-synteny and gene family data. However, the colinearity data in PGDD and PLAZA differ because, to identify colinear gene pairs, PLAZA adopted i-ADHoRe (26), whose power and precision differ from those of MCScan (27), which is used in PGDD.
For the past 5 years, PGDD has provided data on syntenic relationships based on colinear blocks between plants and has contributed to research on topics such as the evolution of gene families (28–34), annotation (35–38) and polyploidy events (39–43). PGDD also provides an easily linked web resource that can be readily integrated into other external informatics portals, including TAIR (44), the Legume Information System (45) and PopGenIE (46). In the past year alone, we have developed a new pipeline to promptly merge new genome data into the database and nearly tripled the number of genomes archived. At present, PGDD contains data for 26 plants including bryophytes and chlorophytes, as well as angiosperms (Table 1).
At present, PGDD contains colinear block information within and between the genomes of 26 plants (Table 1), most recently updated to include the banana genome sequence published in August 2012 (47). Among them, 16 genomes were downloaded from the homepages of the institutes that led the sequencing of each genome, such as RAP-DB (Rice annotation project database; http://rapdb.dna.affrc.go.jp/) and BRAD (The Brassica database; http://brassicadb.org/brad/). Data for the remaining 10 plants, mostly sequenced by the US Department of Energy Joint Genome Institute, were downloaded from the Phytozome database (23). To build PGDD data, three types of files are used: a coding DNA sequence file, a protein sequence file and a general feature format (GFF) file containing annotation data for the sequences on chromosomes.
There are four major steps to add a new genome into PGDD (Figure 1A) in a pipeline consisting of 18 scripts. In the first step, scripts determine basic information such as the length of chromosomes and prepare data files. For example, one script extracts gene information from a GFF file and makes a browser extensible data (BED) file to simply determine gene loci. In the second step, similar protein pairs between two plants are determined by BLASTP with an e-value cut-off of 1e–5. In the third step, colinear blocks between the plants are determined by MCScan (27), using the BED file containing loci information and the file containing the pairs of similar proteins created in the second step. In the post-processing step, additional data are calculated and determined. For example, Ks values between pairs of orthologous/paralogous genes are determined by Clustal W (48), PAL2NAL (49) and the yn00 program of the PAML package (50) in this step. Additionally, text files containing all information about colinear blocks are created. Finally, all new block information included in the text files is imported into MySQL, and parameters and contents in PGDD web pages are modified for the new data by scripts.
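The first two steps can be illustrated with a minimal Python sketch. It is not the PGDD pipeline code: the function names are hypothetical, the GFF input is assumed to use GFF3-style `ID=` attributes, and the BLASTP results are assumed to be in the standard 12-column tabular layout with the e-value in the 11th column. Dropping self-hits is also an assumption made here for simplicity.

```python
def gff_to_bed(gff_text, feature_type="gene"):
    """Return BED6 rows (chrom, start, end, name, score, strand) for gene features."""
    rows = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        # GFF3 attributes look like "ID=g1;Name=foo"; the ID is used as the gene name.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # GFF coordinates are 1-based inclusive; BED is 0-based half-open.
        rows.append(
            (cols[0], int(cols[3]) - 1, int(cols[4]), attrs.get("ID", "."), 0, cols[6])
        )
    return rows


def filter_blast_hits(blast_tabular, evalue_cutoff=1e-5):
    """Keep (query, subject, evalue) for non-self hits at or below the cut-off."""
    hits = []
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject, evalue = cols[0], cols[1], float(cols[10])
        if query != subject and evalue <= evalue_cutoff:
            hits.append((query, subject, evalue))
    return hits
```

The BED rows and the filtered protein pairs correspond to the two inputs that the third step passes to the colinearity detection program.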
All colinear blocks and related data provided by PGDD are stored in a MySQL database (http://www.mysql.com). There are three major tables in the database: block, locus and chromosome. The block table contains lists of gene pairs with additional information such as colinear block number and Ks value, whereas the locus table contains information about each locus, such as the functional description determined by BLAST against the non-redundant GenBank database and the position of the locus in a chromosome. Information for each chromosome, such as the number of genes in the chromosome, is stored in the chromosome table. Besides MySQL, we maintain up-to-date protein sequences stored as a BLAST database file for use in the BLAST search function of PGDD.
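A three-table layout of this kind might be sketched as follows. This is only an illustration: the column names are hypothetical (the actual PGDD schema is not described here), the rows are toy placeholders rather than real data, and Python's built-in `sqlite3` module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical columns; only the three table names come from the text.
SCHEMA = """
CREATE TABLE chromosome (
    chrom_id   TEXT PRIMARY KEY,
    species    TEXT NOT NULL,
    gene_count INTEGER NOT NULL
);
CREATE TABLE locus (
    locus_id    TEXT PRIMARY KEY,
    chrom_id    TEXT NOT NULL REFERENCES chromosome(chrom_id),
    start_pos   INTEGER NOT NULL,
    end_pos     INTEGER NOT NULL,
    description TEXT
);
CREATE TABLE block (
    block_no INTEGER NOT NULL,
    locus_a  TEXT NOT NULL REFERENCES locus(locus_id),
    locus_b  TEXT NOT NULL REFERENCES locus(locus_id),
    ks       REAL
);
"""


def build_demo_db():
    """Create an in-memory database filled with toy placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO chromosome VALUES (?, ?, ?)",
        [("a_chr1", "species_a", 2), ("b_chr1", "species_b", 2)],
    )
    conn.executemany(
        "INSERT INTO locus VALUES (?, ?, ?, ?, ?)",
        [
            ("a_g1", "a_chr1", 100, 900, "kinase"),
            ("a_g2", "a_chr1", 1500, 2400, "transporter"),
            ("b_g1", "b_chr1", 300, 1100, "kinase"),
            ("b_g2", "b_chr1", 2000, 2800, "transporter"),
        ],
    )
    conn.executemany(
        "INSERT INTO block VALUES (?, ?, ?, ?)",
        [(1, "a_g1", "b_g1", 0.52), (1, "a_g2", "b_g2", 0.61)],
    )
    conn.commit()
    return conn


def pairs_in_block(conn, block_no):
    """List gene pairs of one colinear block, ordered along the first genome."""
    cursor = conn.execute(
        "SELECT b.locus_a, b.locus_b, b.ks, la.description "
        "FROM block AS b JOIN locus AS la ON la.locus_id = b.locus_a "
        "WHERE b.block_no = ? ORDER BY la.start_pos",
        (block_no,),
    )
    return cursor.fetchall()
```

Keeping gene pairs in one table and per-locus annotation in another means a functional description is stored once per locus, however many colinear blocks the locus belongs to.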
At the home page, PGDD shows a table containing information about all plants in the current version, including the name of a plant, version of genome used, number of genes, original URL to download the data and primary citation for the genome (Figure 2A). Additionally, the table provides related web links, such as taxonomy information at NCBI, so that users can easily get related information for each plant.
There are three major functions to show gene colinearity: Dot-plot, Locus-search and Map-view. These three functions provide means to visualize macro-synteny, micro-synteny and gene family evolution, respectively, which are often the most commonly needed types of information in comparative genomics research. The main menu to select a web page corresponding to each function is to the right of the table. In addition to the major functions, a file containing colinear block information within a plant or between any 2 of the 26 plants can be downloaded from the download page.
Dot plots are used to show colinear blocks between two plant genomes in macro-scale as a two-dimensional image, so researchers can see the overall view of all blocks. For example, the dot plot in Figure 2B shows overall colinear blocks between rice and sorghum including both the orthologous regions and matching regions derived from a shared pan-cereal duplication event (ρ) (51). Each point represents a matched gene pair. Interpretation of a dot plot is not always straightforward because diverse events in evolutionary history are overlaid onto the same plot. Thus, many options are provided to modify the plot through filtering subsets of gene pairs, e.g. to show only a narrow range of synonymous substitutions (Ks values) of gene pairs as a proxy to separate the gene pairs by age. Using rice-sorghum as an example, applying a Ks filter of 0.4–0.7 renders signal from the orthologous gene pairs more prominent on the dot plot. Additionally, an enlarged dot plot between specific chromosomes is available by clicking each small box for each chromosome in the genome-wide plot. Besides the dot plot, users can see the list of gene pairs in colinear blocks by clicking on the segment in the enlarged plot. With the list containing both name and inferred functions of the genes, users can compare colinear blocks with single-gene resolution.
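The Ks filter can be illustrated with a short Python sketch. The function name and the `(gene_a, gene_b, ks)` tuple layout are assumptions made for this example, and treating both bounds as inclusive is also an assumption; the default range is the 0.4–0.7 window used for rice-sorghum above.

```python
def filter_pairs_by_ks(pairs, ks_min=0.4, ks_max=0.7):
    """Keep gene pairs whose Ks lies within [ks_min, ks_max].

    `pairs` is an iterable of (gene_a, gene_b, ks) tuples; pairs with no
    Ks estimate (ks is None) are dropped.
    """
    return [
        (gene_a, gene_b, ks)
        for gene_a, gene_b, ks in pairs
        if ks is not None and ks_min <= ks <= ks_max
    ]
```

Because Ks serves as a proxy for the age of a gene pair, narrowing the window keeps pairs from one event (for example, a speciation) while hiding pairs that derive from older duplications.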
Many gene-level studies have compared a few genes included in colinear blocks (28–38). Locus search is a function to find colinear blocks containing a specific locus (Figure 2C) and to show the fine structure of each block. By typing the locus name in the text box and clicking the ‘submit’ button, a user can search for colinear blocks containing the locus. Locus-search results can be divided into two parts: an alignment image and a list of genes in the image. In the alignment image, PGDD shows genes in colinear blocks, so users can easily determine gene-level changes such as insertion and deletion of genes. The list of genes below each alignment image shows not only the inferred function of each gene but also the Ks and Ka values of the gene pair. Thus, the user can easily determine evolutionary distance and possible changes in function between genes.
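The lookup behind such a search can be sketched in a few lines of Python. This is an illustration under assumptions, not the PGDD implementation: the function name and the `(block_no, gene_a, gene_b, ks)` record layout are hypothetical.

```python
from collections import defaultdict


def find_blocks_with_locus(pairs, locus):
    """Return {block_no: [(gene_a, gene_b, ks), ...]} for blocks containing `locus`.

    `pairs` is an iterable of (block_no, gene_a, gene_b, ks) records. Every
    pair of a matching block is returned, so neighbouring genes can be
    inspected alongside the query locus.
    """
    by_block = defaultdict(list)
    for block_no, gene_a, gene_b, ks in pairs:
        by_block[block_no].append((gene_a, gene_b, ks))
    return {
        block_no: members
        for block_no, members in by_block.items()
        if any(locus in (gene_a, gene_b) for gene_a, gene_b, _ in members)
    }
```

Returning whole blocks rather than only the matching pair mirrors the purpose of the search, which is to show the query gene in the context of its colinear neighbours.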
In many cases, a researcher seeks information about a locus with only a nucleotide or protein sequence, without additional information such as the locus name. A typical BLAST search returns a list of likely homologues in the target genomes but lacks a global view of how the hits are distributed. To support such cases, PGDD provides the Map-view function. In the corresponding web page, users can search for a locus in PGDD by similarity to a nucleotide or protein sequence (Figure 2D). The page contains a text box to type or paste a sequence, buttons to choose a BLAST program depending on the sequence type and a text box to set the e-value cut-off. The search result can be divided into two parts: a list of names of loci that are similar to the user's input sequence and an image showing the positions of those loci on chromosomes. In the image, each grey vertical bar represents a chromosome, and each green arrow shows the position of a locus that is similar to the sequence. The user can see detailed information for a locus and the alignment image of its colinear blocks by clicking a blue locus name in the list above the image. In the detailed information page, the user can get protein and nucleotide sequences of genes in the locus, as well as descriptions of the genes.
The user can download a file containing colinear block information between two plants by choosing the two plants in the combo box and clicking the ‘download’ button. To decrease file size, the file is compressed by gzip, a popular file format that can easily be decompressed by many widely used programs. The file is written in comma-separated values (CSV) format and can be read and handled by most spreadsheet programs such as Microsoft Excel and Calc in LibreOffice (http://www.libreoffice.org/). The file contains not only gene pairs in colinear blocks but also additional data such as Ka and Ks values of the pairs.
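A downloaded file could also be read programmatically, for example with Python's standard `gzip` and `csv` modules. The sketch below is hedged: the function name is hypothetical, and it assumes the file starts with a header row containing columns named `ka` and `ks`, which may not match the real column layout, so the names should be adjusted to the actual file.

```python
import csv
import gzip


def read_colinear_pairs(source):
    """Read a gzip-compressed CSV of colinear gene pairs into a list of dicts.

    `source` is a file path or a binary file object. The column names `ka`
    and `ks` are assumptions for illustration; empty values become None.
    """
    rows = []
    with gzip.open(source, "rt", newline="") as handle:
        for record in csv.DictReader(handle):
            for key in ("ka", "ks"):
                value = record.get(key)
                # Convert numeric columns; keep missing estimates as None.
                record[key] = float(value) if value else None
            rows.append(record)
    return rows
```

Reading through `gzip.open` avoids decompressing the file to disk first, which is convenient when several pairwise comparisons are processed in a batch.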
To facilitate investigations into the functional and evolutionary consequences of gene and GD, we have determined and provided colinear blocks in plants from a single resource based on uniform standards. Many programs have been developed to determine colinear blocks, with different sensitivities and specificities in colinear block prediction (26,27). Among them, the current version of PGDD uses MCScan, which shows consistent, highly accurate prediction (27). PGDD has provided data used in much research (28–43,52–54) for the past 5 years, and in the past year alone, PGDD has been used by researchers from 111 countries with a total of 713 254 accession logs.
While continually adding new genome data to PGDD, we are also preparing new functions to enhance PGDD. For example, at present, users can access data in PGDD only by connecting to the web site or by downloading colinear block data files. To allow other web services or programs to access PGDD data via the internet, we plan to add OpenAPI functions that enable web sites to interact with each other and to build RESTful web services that make the data easily accessible to clients over HTTP. Besides developing the OpenAPI and RESTful web services, we plan to develop interfaces to link multiple data sources such as the VISTA (55) suite of programs and databases for multi-way analysis of genomic sequences, and CoGe (22), a web application to display the homologous regions across multiple genomes. These new functions and the integration of multiple data sources are intended to further enhance the PGDD database as a platform to study many evolutionary questions.
A.H.P. appreciates funding from the National Science Foundation [NSF: DBI 0849896, MCB 0821096, MCB 1021718]; Resources and technical expertise from the University of Georgia (in part); Georgia Advanced Computing Resource Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer. Funding for open access charge: NSF.
Conflict of interest statement. None declared.