|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: YW XW AHP. Performed the experiments: YW SPF. Analyzed the data: YW XW FAF AHP. Contributed reagents/materials/analysis tools: HT XT. Wrote the paper: YW XW AHP.
Both single gene and whole genome duplications (WGD) have recurred in angiosperm evolution. However, the evolutionary effects of different modes of gene duplication, especially regarding their contributions to genetic novelty or redundancy, have been inadequately explored.
In Arabidopsis thaliana and Oryza sativa (rice), species that deeply sample botanical diversity and for which expression data are available from a wide range of tissues and physiological conditions, we have compared expression divergence between genes duplicated by six different mechanisms (WGD, tandem, proximal, DNA based transposed, retrotransposed and dispersed), and between positional orthologs. Both neo-functionalization and genetic redundancy appear to contribute to retention of duplicate genes. Genes resulting from WGD and tandem duplications diverge slowest in both coding sequences and gene expression, and contribute most to genetic redundancy, while other duplication modes contribute more to evolutionary novelty. WGD duplicates may more frequently be retained due to dosage amplification, while inferred transposon mediated gene duplications tend to reduce gene expression levels. The extent of expression divergence between duplicates is discernibly related to duplication modes, different WGD events, amino acid divergence, and putatively neutral divergence (time), but the contribution of each factor is heterogeneous among duplication modes. Gene loss may retard inter-species expression divergence. Members of different gene families may have non-random patterns of origin that are similar in Arabidopsis and rice, suggesting the action of pan-taxon principles of molecular evolution.
Gene duplication modes differ in contribution to genetic novelty and redundancy, but show some parallels in taxa separated by hundreds of millions of years of evolution.
Whole-genome duplications (WGDs) have occurred in the lineages of plants , animals ,  and fungi , , with possible consequences including evolution of novel or modified gene functions , , , , and/or provision of “buffer capacity” ,  or genetic redundancy that increases genetic robustness , , , , , . Genome duplication may also increase opportunities for nonreciprocal recombination , , , permitting or causing duplicated genes to evolve in concert for a period of time. Rapid DNA loss and restructuring of low-copy DNA , , , , retrotransposon activation , ,  and epigenetic changes , , , , ,  following WGD may further provide materials for evolutionary change.
Genes may be duplicated by several mechanisms in addition to WGDs, which have been collectively referred to as small scale duplications  or single gene duplications , . Tandem duplicates are consecutive in the genome while proximal duplicates are near one another but separated by a few genes. These two gene duplication modes are presumed to arise through unequal crossing over  or localized transposon activities . Dispersed duplicates are neither adjacent to each other in the genome nor within homeologous chromosome segments . Distant single gene transposition may explain the widespread existence of dispersed duplicates within and among genomes . Distant single gene transposition duplication (referred to as distantly transposed duplication) may occur by DNA based or RNA based mechanisms . DNA transposons such as packmules (rice) , helitrons (maize) , and CACTA elements (sorghum)  may relocate duplicated genes or gene segments to new chromosomal positions (referred to as DNA based transposed duplication). RNA based transposed duplication, often referred to as retrotransposition, typically creates a single-exon retrocopy from a multi-exon parental gene, by reverse transcription of a spliced messenger RNA. It is presumed that the retrocopy duplicates only the transcribed sequence of the parental gene, detached from the parental promoter. The new retrogene is often deposited in a novel chromosomal environment with new (i.e. non-ancestral) neighboring genes and, having lost its native promoter, is only likely to survive as a functional gene if a new promoter is acquired , .
Classical population genetic theory suggests that a likely consequence of gene duplication is reversion to single copy (singleton), unless at least one gene copy evolves new function . More recently, the subfunctionalization model, which proposes that duplicated gene copies might both be retained if they partition the functions of the ancestral gene between them, has described an important modification of the classical model , . Some studies also show evidence to support the value of genetic redundancy per se , , , , , , , ,  or dosage balance , , , .
The angiosperms (flowering plants) are an outstanding model in which to elucidate the consequences of gene duplication. All angiosperms are now thought to be paleopolyploids , many of which underwent multiple WGDs , . Traces of past WGDs can often be detected from pairwise syntenic alignments through software such as ColinearScan  and multiple alignments using MCScan . Arabidopsis, selected as the first angiosperm genome to be sequenced due to its small genome size and minimal DNA sequence duplication, has experienced two ‘recent’ WGDs, i.e. since its divergence from other members of the Brassicales clade (α and β), and a more ancient triplication (γ) shared with most if not all eudicots , , . Likewise, rice appears to have experienced at least two WGDs, one shared with most if not all cereals (ρ), and another more ancient event (σ) . Single gene duplications in angiosperms are also widespread , , .
One avenue for systematic investigation of functional divergence between duplicate genes is comparison of their spatiotemporal expression profiles, comparing degrees of divergence with proxies of duplication age such as synonymous substitution rates (Ks) between duplicate genes. In Arabidopsis, the rate of protein sequence evolution is asymmetric in >20% of duplicate pairs and functional diversification of surviving duplicate genes has been proposed to be a major feature of the long-term evolution of polyploids . Arabidopsis genes created by large-scale duplication events are more evolutionarily conserved in gene expression than those created by small-scale duplication or those that do not lie in duplicate segments, and the time since duplication is correlated with functional divergence of genes . Further, there may be also a strong positive correlation between expression divergence and non-synonymous mutation (Ka) in Arabidopsis, and the different modes (segmental, tandem and dispersed) of duplication may affect patterns of expression divergence . Arabidopsis duplicated genes show greater expression diversity than singleton genes across closely related species and allopolyploids . In rice, expression correlation is significantly higher for gene pairs from WGDs or tandem duplications than dispersed duplications, and expression divergence is closely related to divergence time .
Though many studies have investigated the functional divergence and retention of duplicate genes, conclusions are often contradictory, e.g. gene retention has been attributed to either neofunctionalization ,  or genetic redundancy , , , , , , and expression divergence between duplicate genes has been suggested to be either time dependent ,  or selection dependent . The fates of duplicate genes may be influenced by different modes of gene duplication, which have been suggested to retain genes in a biased manner . With much richer expression and annotation data available now than for most prior studies, and improved ability to discern various mechanisms of gene duplication, we find merit in re-examining some existing hypotheses and exploring some new hypotheses regarding the consequences of gene duplication. Here, we related multiple types of genomic data to gene expression divergence in two angiosperm species, Arabidopsis and Oryza (rice), to formally test possible evolutionary patterns (hypotheses). A far richer volume of analyzed microarray data than was available in prior studies improves the robustness of statistical analyses.
A total of 4,566 Affymetrix Arabidopsis Genome ATH1 Arrays and 508 Affymetrix GeneChip Rice Genome Arrays were used to generate the expression profiles of 22,810 Arabidopsis genes and 27,910 rice genes. We classified gene duplications into six modes: WGD, tandem, proximal, DNA based transposed, retrotransposed and dispersed duplication, according to the procedure shown in Figure 1 and described in methods. Note that in this study, a gene may have up to five potential duplication relationships, depending on the number of BLASTP hits. For WGD duplicates, redundant duplication relationships were removed using co-linearity restrictions. If a gene was created by single gene duplications, all possible duplication relationships were considered. However, redundant duplication relationships in single gene duplications did not enlarge the gene set created by each duplication mode. In a distantly transposed duplication, one duplicate gene is the parental (ancestral) copy while the other is the transposed (derived) copy, at a novel locus. Dispersed duplications, which we cannot attribute to specific mechanisms, are regarded as a control group. The number of pairs of duplicate genes and number of unique genes (i.e. number of created genes) in each mode of duplication is summarized in Table 1. A total of 2,981 α, 1,161 β and 417 γ WGD duplicate pairs in Arabidopsis; and 1,712 ρ and 568 γ WGD duplicate pairs in rice, have expression profiles. In this study, the degree of similarity between the expression profiles of a pair of genes across all experiments is measured by the Pearson's correlation coefficient (r). To express in positive values the evolution of gene expression between duplicates or orthologs, we use the term “expression divergence”, measured by , .
Expression divergence between duplicate genes was compared across modes of duplication (Figure 2). The trends of expression divergence between duplicates in Arabidopsis and rice are very similar: DNA based transposed duplication≈retrotransposed duplication > dispersed duplication > proximal duplication > WGD≈tandem duplication (both ANOVA model involving all duplication modes and Tukey's HSD test between adjacent duplication modes are significant at α=0.05). Although retrotransposed duplications have a little higher average expression divergence than DNA based transposed duplications, the difference is not significant (P-value>0.05). WGDs result in a little higher expression divergence than tandem duplications in Arabidopsis but the difference is not significant in rice.
Despite the relatively fast evolution of gene expression shown by distantly transposed duplications, a tendency toward co-expression between genes duplicated by all modes can be observed by comparison with 10,000 randomly selected gene pairs (Figure 2). Furthermore, we used r<0.371 and r<0.621 (95% quantile of the r values obtained from random gene pairs) as criteria for determining that two duplicate genes have diverged in expression in Arabidopsis and rice respectively , . The proportions of divergent expression between genes duplicated by different modes are shown in Table 2. All these data suggest that the extent of expression divergence of retained duplicates is affected by the duplication mechanism: WGD and tandem duplicates are more likely to maintain their original expression patterns, proximal duplications show intermediate divergence, and distantly transposed duplications tend to have the biggest changes of gene expression profiles.
Computationally, genetic redundancy may be inferred from simultaneous conservation in protein sequences that determine molecular functions, and expression patterns which determine biological processes , . WGD and tandem duplicates tend to be simultaneously conserved in protein sequences (using 25% quartile of Ka of all duplicate pairs, i.e. <0.329 in Arabidopsis and <0.383 in rice, as criteria) and in gene expression (using in Arabidopsis and in rice as criteria), while distantly transposed and dispersed duplicates have a random association (assuming that conservation in protein sequences and gene expression were independent in the pooled duplicate genes) between these parameters, and proximal duplicates fall in between (Table 3).
Expression levels differ between the genes created by different duplication modes (Figure 3). WGD and dispersed duplicates have higher gene expression levels than tandem, proximal and distantly transposed duplications (2-sample t-tests are significant at α=0.05). The higher expression of WGD duplicates is consistent with their retention due to dosage amplification, a theory which has been proven in yeast , , . Potentially transposon mediated gene duplications including tandem, proximal and distantly transposed duplications tend to be associated with lower gene expression levels than other duplication modes (Figure 3). Dispersed duplication, with unclear genetic mechanisms so far, is associated with gene expression levels comparable to WGD.
Since its divergence from other Brassicales, Arabidopsis experienced two WGDs (α and β), while sharing a more ancient genome triplication (γ) with all rosids and perhaps all eudicots , , . Rice has experienced two WGDs: the ρ event shared with all Poaceae, and the more ancient σ event . Although expression divergence has been compared between WGD and single gene duplications , , , the combinational effects of different WGD events on expression divergence have not been addressed. We propose that WGD events themselves, together with the subsequent ‘adaptation’ of the resulting genome to the newly-duplicated state, may accelerate evolution, contributing to variation in expression divergence sometimes attributed to time (usually measured by Ks) alone , .
To further investigate the combinational effects of multiple WGD events, we compared the expression divergence of duplicates from different WGD events (Figure 4). Not surprisingly, expression divergence between the WGD duplicates of more ancient events tends to be larger: γ duplicates > β duplicates > α duplicates in Arabidopsis, and σ duplicates > ρ duplicates in rice (both ANOVA model involving all WGD events and Tukey's HSD test between adjacent WGD events are significant at α=0.05). Next, we fitted a curve between expression divergence and Ks for each WGD event using a smooth spline with 10 degrees of freedom available in R packages (Figure 4). We found no significant correlation between expression divergence and Ks within the more ancient Arabidopsis β duplicates (r=0.036, P-value=0.241) or γ duplicates (r=−0.008, P-value=0.883), or rice σ duplicates (r=0.045, P-value=0.307) but correlations are significant within the most recent Arabidopsis α duplicates (r=0.126, P-value=) and rice ρ duplicates (r=0.105, P-value=). Further, we conducted a power analysis for these correlations. We found that at α=0.05, the non-significant correlations (β, γ and σ duplicates) did not have higher power than conventionally desired (>0.8) while significant correlations (α and ρ duplicates) had power greater than 0.98, confirming that the relationship between expression divergence and Ks differs among different WGD events.
WGD events themselves influence gene expression divergence, with more ancient WGD duplicated genes likely to have greater expression divergence than more recent duplications, even if both have similar Ks (Figure 5). To support this hypothesis statistically, we coded the α, β and γ events by 1, 2 and 3 in Arabidopsis and the ρ and σ events by 1 and 2 in rice. Then different linear regression models of expression divergence on Ks and/or WGD codes were fit in Arabidopsis and rice respectively. All regression models and their coefficients were statistically significant. For both Arabidopsis and rice, the model which counts both Ks and the number of WGD events that duplicate genes underwent results in the highest adjusted R2 and lowest Akaike information criterion (AIC) (Table 4) with significant nonzero slopes of all coefficients, supporting the hypothesis that WGD events themselves, in addition to Ks, can lead to increased expression divergence between duplicates.
Selection after WGD events may constrain expression divergence of some duplicates. To examine this question, we studied the 25% of WGD duplicate pairs with most conserved expression at each WGD event. At a P-value threshold of 0.05 by Fisher's exact test (corrected for multiple tests), specific GO terms/Pfam domains were associated with conserved expression at each WGD event, and some recurred across different WGD events, e.g. transcription factor activity (GO:0003700) and ribosome (GO:0005840) for Arabidopsis α and γ and rice ρ events; protein biosynthesis (GO:0006412) for Arabidopsis α and β and rice ρ events (Table S1). In contrast, WGD duplicates with divergent expression (25% of pairs with highest d values at each event) showed little or no enrichment of specific GO terms/Pfam domains and functional terms did not recur between different WGD events.
In that most angiosperms share most genes, changes in expression may be fundamental to angiosperm biodiversity. Previous studies have associated duplicated genes with greater expression diversity than singletons in closely related species of both animals  and plants . However, it has been difficult to extend such comparisons to more distant species such as Arabidopsis, a eudicot, and rice, a monocot, due to greater difficulty discerning orthology or paralogy. To facilitate the comparison of gene expression data generated by different microarray platforms, we adopted a conceptual framework of comparing co-expression patterns across species  (see Methods). Further, we restricted our study to 2,012 gene pairs suggested both by DNA sequence similarity and by synteny/collinearity to be orthologs between Arabidopsis and rice, downloaded from the PGDD database , . The comparison of expression divergence between different types of orthologs shows the following trend: duplicate-duplicate>singleton-duplicate>singleton-singleton (Figure 6), with P-values of 0.049 between duplicate-duplicate and singleton-duplicate and 0.010 between singleton-duplicate and singleton-singleton using two-sample t-tests. This finding supports that singletons are more conserved in expression than duplicated genes, consistent with the hypothesis that one consequence of gene duplication is increased expression diversity.
Divergence in coding sequences can be denoted by Ks, which indicates putatively-neutral mutations that are synonymous at the amino acid level, or by Ka, which indicates altered amino acids suggestive of the action of selection on gene function. The correlations between expression divergence and coding sequence divergence in angiosperms have been widely discussed , ,  but conclusions were inconsistent: Casneuf et al. and Li et al. suggested that Ks is closely correlated with gene expression divergence, while Ganko et al. found little correlation. Since microarray data contain a high level of noise and previous studies often relied on small sets of microarray data or only one species, our analysis of “all arrays” and two highly-divergent species may have broader inference space.
The distributions of Ka or Ks differ markedly for different gene duplication modes, but are relatively consistent in Arabidopsis and rice (Figure 7). Tandem/proximal and WGD duplicates have qualitatively lower Ks (putatively reflecting younger age) than distantly transposed (DNA and RNA) or dispersed duplicates, the distinction being much clearer in the small genome of Arabidopsis (Figure 7A) than the 3× larger and more repeat-rich genome of rice (Figure 7B). Within these qualitative distinctions, quantitative differences among the categories are also evident and largely consistent, with relative Ks (putatively age) of duplications following the trend of: dispersed > distantly transposed > WGD > proximal > tandem (both ANOVA model involving all duplication modes and Tukey's HSD test between adjacent duplication modes are significant at α=0.05). Retrotransposed duplicates differ slightly in the two taxa, being similar to DNA based transposed duplicates in Arabidopsis, and to dispersed duplicates in rice. The trend of Ka shows the same qualitative distinction as that of Ks (Figure 7C and 7D), but differing in the quantitative trend with amino-acid altering mutation frequencies being retrotransposed > dispersed > DNA based transposed > proximal≈WGD≈tandem (both ANOVA model involving all duplication modes and Tukey's HSD test between adjacent duplication modes are significant at α=0.05). WGD duplicates are more functionally constrained, with higher Ks but equal or lower Ka than proximal duplicates. These data do not show the conventional L-shaped distribution for dispersed and distantly transposed duplicates, because the filters employed in gene selection focus this analysis only on genes that have survived a long time, implying that the genes serve important functions.
Relationships between coding sequence divergence and expression divergence are heterogeneous, and differ among gene duplication modes. For WGD duplicates, expression divergence is significantly correlated with both Ka and Ks in both Arabidopsis and rice, although the strength of the correlations is progressively weaker for more ancient duplications and in some cases reaches non-significance (Table 5). Expression divergence is also significantly correlated with both Ka and Ks among proximal duplicates. Tandem duplicates differ in the two taxa, with those of rice resembling WGD genes with expression divergence significantly correlated with both Ka and Ks, and those of Arabidopsis resembling distantly transposed duplications with marginal and sometimes non-significant correlation.
While age and functional divergence are more closely related to expression divergence in WGD genes than those resulting from other duplication modes, this does not reflect a lack of expression divergence among other gene duplicates. Indeed, proximal duplication is associated with higher expression divergence than WGD, despite its smaller average Ks. Likewise, DNA based transposed duplication is associated with higher expression divergence than dispersed duplication, despite smaller Ks (Table 6).
In partial summary, expression divergence between duplicate genes may be affected by duplication modes, as well as by the ‘age’ (Ks) of the duplicated genes, i.e. gene expression divergence may differ among duplication modes at the same Ks or Ka levels. To further validate this claim, we fit a smooth spline curve between expression divergence and Ks or Ka for each duplication mode (Figure 8). While these curves fluctuate markedly, at fixed Ks or Ka levels distantly transposed duplications (for example) are generally associated with higher expression divergence between duplicates than WGD or tandem duplications.
Epigenetic mechanisms such as DNA methylation have been suggested to potentially differentiate newly arisen duplicate genes ,  as well as orthologous genes across closely related species . Transcriptional silencing has often been associated with DNA methylation in promoter regions , . Using data on genome-wide DNA methylation status for both Arabidopsis and rice , we examined whether DNA methylation status in promoter regions is related to expression divergence between duplicates or between orthologs. This comparison carries an inherent assumption that methylation patterns are relatively static and generally apply to all of the microarray studies. A gene promoter region was considered to be methylated if two or more adjacent probes are methylated within the region . Proportions of pairs of duplicates that differ in DNA methylation status in promoter regions, separated by gene duplication modes, are summarized in Table 7. Distantly transposed duplications appear somewhat more likely to differ in DNA methylation status than other duplication modes. However, the duplicate genes that differ in DNA methylation status in promoter regions do not have more divergent expression than those that have the same DNA methylation status, within any duplication mode (negative data are not shown). Likewise, different methylation status among orthologs also showed no significant relationship to expression divergence, although we confirmed that singletons are a little more likely to be methylated in promoter regions than duplicates (Table 8), as proposed by others . These analyses suggest that the mechanisms by which DNA methylation status affects expression divergence between homologous genes may be complicated, and direct association may not be informative for unraveling such mechanisms.
The diversity of gene duplication mechanisms and patterns of gene expression divergence raise questions about how gene families expand and how their members have been retained in the history of evolution. WGD duplicates are differentially retained across different gene functional classifications , , , . However, we suggest that gene families may be more informative units than functional terms for investigating patterns of gene origin, as duplication relationships in gene families are clearer. Based on our findings above, both functional divergence and redundancy may contribute to retention of duplicate genes. Furthermore, because the degrees of functional diversification are not equal across gene families and gene duplication modes add additional heterogeneity to patterns of functional divergence, it is possible that gene family members may have non-random patterns of origin, e.g. the gene families with high functional diversification may be enriched with distantly transposed duplications while those families contributing to genetic redundancy are likely to be enriched with WGD duplications.
To examine these questions, we investigated the gene duplication modes of 126 Arabidopsis and 24 rice published gene families of 10 or more genes, available at TAIR (http://www.arabidopsis.org/) and Michigan State University (http://rice.plantbiology.msu.edu/) respectively. By using Bonferroni-corrected Fisher's exact test, we found that 64 (50.8%) Arabidopsis gene families and 19 (79.2%) rice gene families are enriched for at least one gene duplication mode at α=0.05 (Table S2). For example, DNA based transposed duplications are enriched in disease resistance gene homologs and the cytochrome P450 gene family (Figure 9 A–C). Disease resistance gene homologs, most of which have nucleotide binding site-leucine rich repeat (NBS-LRR) domains, express at different levels and tissue specificities, and function in diverse biological processes in Arabidopsis . P450s also express in many tissues in a tissue specific manner and are involved in diverse metabolic processes , . The cytochrome P450 family also shows enrichment for DNA based transposed duplications in rice. Thus, these two gene families may have achieved functional and expression diversity through some combination of transposition activity and retention of distantly transposed duplicates. Interestingly, these two families are also enriched with proximal duplications, again often associated with greater expression divergence than WGD despite generally similar coding sequence divergence.
WGD duplicates are enriched in other gene families, such as the cytoplasmic ribosomal protein gene family, and C2H2 zinc finger proteins (Figure 9 D–F). In Arabidopsis, a large number of ribosomal genes are co-regulated . C2H2 zinc finger proteins have been shown to be involved in some basic biological processes such as transcriptional regulation, RNA metabolism and chromatin-remodeling . Furthermore, C2H2 zinc finger proteins are enriched with retained WGD duplicates in both Arabidopsis and rice. Our analyses suggest that gene family members may have common non-random patterns of origin, that recur independently in different evolutionary lineages (such as monocots, and dicots, studied here), and that such patterns may result from specific biological functions and evolutionary needs.
In two species that sample a wide range of tissues and physiological conditions in major angiosperm lineages diverged by about 140–170 million years  and affected by at least 5 different genome duplication events, we have compared expression divergence between positional orthologs and between genes duplicated by several additional mechanisms. Both neo-functionalization and genetic redundancy can result in retention of duplicate genes. WGD duplicates generally are more frequently associated with genetic redundancy than genes resulting from other duplication modes, partly due to dosage amplification. Tandem duplications also contribute to genetic redundancy, while other duplication modes are more frequently associated with evolutionary novelty. Potentially transposon mediated gene duplications tend to reduce gene expression levels. Expression divergence between duplicates is discernibly related to duplication modes, WGD events, Ka, Ks, and possibly the DNA methylation status of their promoter regions. However, the contribution of each factor is heterogeneous among duplication modes, and new factors as well as combinatorial effects of different factors are worth further investigation. Gene loss may retard inter-species expression divergence, as singletons are generally more conserved in gene expression than duplicates. Members of different gene families have non-random patterns of origin, and such patterns may be similar between Arabidopsis and rice.
The use of large volumes of data and inclusion of as many genes as possible may help to mitigate factors specific to particular developmental states, noise associated with microarray data, and bias reflecting features specific to particular gene families. For example, we have found that the correlations between expression divergence and Ks are not consistent within gene duplication modes (Figure 5 and and8).8). For WGD duplicates, significant correlations only exist in those generated by recent WGD events - if only relatively ‘young’ WGD duplicates are studied, the correlations may be overestimated. Moreover, such correlations are not uniformly distributed among Ks levels - at low Ks levels (<1), all duplication modes may show correlations.
We find evidence for duplicate gene retention by both neo-functionalization and genetic redundancy, seemingly at opposite ends of the spectrum of possible fates of duplicated gene pairs. Genetic redundancy has clear biological significance, i.e. provision of buffering capacity ,  and/or dosage balance , , , , and seems most closely related to WGD or tandem duplicates. The origins of genetic novelty, of clear biological significance in occupation of new niches or adaptation to new environments, may lie more with the greater expression divergence and more independent evolution of distantly transposed and dispersed duplications. Proximal duplication is more balanced in its contributions to genetic novelty and redundancy than other gene duplication modes.
Detailed delineation of gene duplication modes reveals some new trends. Prior studies classified genes into as few as two types (anchors generated by polyploidy, and non-anchors generated by single gene duplication ), or as many as three types (segmental, tandem and dispersed: ). In this study, we have attempted to distinguish DNA/RNA based transposed from dispersed duplication, and proximal from tandem duplication. DNA based transposed duplications tend to evolve faster in expression while having smaller Ks than dispersed duplicates. Tandem duplicates diverge slower in gene expression than proximal duplicates. Proximal duplicates tend to diverge faster in expression than WGD duplicates, though concerted evolution  may homogenize their coding sequences.
Our analyses suggest that it may be inappropriate to make generalizations about levels and patterns of expression divergence across gene duplication modes. Ks, putatively a proxy for age, seems to be related to expression divergence only within a subset of duplication modes and largely only among younger duplicates. Ka, putatively a proxy for functional change, also shows statistically significant and heterogeneous relationships to expression divergence. The level of these correlations is very low, even in recent WGD duplicates.
Although expression divergence between duplicates is often significantly correlated with coding sequence divergence, it is well known that gene expression is also regulated by other genomic regions such as promoters, 5′UTRs, and 3′UTRs. The correlations between expression divergence and nucleotide substitution rates (μ) of different genomic regions for pairs of duplicates are summarized in Table S3. WGD duplicates show significant correlations between expression divergence and nucleotide substitution rates in all three regions. These correlations become marginal and often non-significant among tandem duplicates. Expression divergence of proximal duplicates is more closely associated with divergence in promoters, 5′UTRs and 3′UTRs than coding sequences. Expression divergence of DNA based transposed duplicates seem to be most related to Ka and μ of 3′UTRs. Expression divergence of dispersed duplicates is very slightly correlated with Ka but not with other substitution rates. Retrotransposed duplication is least related to any type of sequence divergence, consistent with its general separation of a gene from its native regulatory elements.
In partial summary, expression divergence between duplicate genes may be affected by different and multiple genetic factors depending on the causal duplication mechanism. For pairs of orthologs between Arabidopsis and rice, expression divergence seems only correlated with Ka (Table 5 and Table S3). Single gene duplications including translocated and tandem/proximal duplications have been suggested to be much more prone to promoter disruption than WGD . We examined this hypothesis using >45% sequence identity as criterion for determining duplicated (non-disrupted) promoter regions, finding proximal duplicates to have higher proportions of duplicated promoter regions than WGD duplicates (Table 9). This finding seems to contradict the greater expression divergence of proximal duplicates than WGD duplicates. Thus, we note that each of the investigated genetic/epi-genetic factors may only explain a small portion of the variation of expression divergence between duplicate genes, and perhaps only for certain duplication modes. New factors that may affect expression divergence and how different factors work together are worth investigation.
WGD is often associated with speciation in plants , . If ancestral polyploidy was attendant with speciation, new species would have likely initially faced very small Ne (i.e. effective population size), weak selection, high drift and high mutational load. This could put a premium on buffering, but allow little chance for beneficial mutations. On the other hand, small-scale duplications may have been only infrequently associated with speciation, if at all. Thus they might be more likely to arise in established populations with larger Ne and more efficient selection, all putting a greater premium on evolutionary novelty to attain fixation. A hypothesis worthy of further investigation is that non-random associations between duplication mode and population size have shaped which specific genes and functional variations are retained.
Genome annotations were obtained from TAIR (http://www.arabidopsis.org) for Arabidopsis, and from the Rice Genome Annotation Project data (http://rice.plantbiology.msu.edu) for rice. Gene structures were retrieved using ENSEMBL Biomart (http://plants.ensembl.org/biomart/martview).
To reliably assess the expression divergence between duplicates or between orthologs, we used as many publicly available microarray datasets as possible, all of which were obtained from NCBI's GEO (http://www.ncbi.nlm.nih.gov/geo/). At the time of retrieval, 6,009 samples existed for the Affymetrix Arabidopsis ATH1 Genome Array (GEO platform GPL198), of which 800 were not available and a total of 5,209 CEL files were downloaded. 550 CEL files for the Affymetrix GeneChip Rice Genome Array (GEO platform GPL2020) were downloaded, of which 13 were removed due to incorrect array types. For both Arabidopsis and rice raw expression data, RMA normalization was performed using the RMAExpress software (http://rmaexpress.bmbolstad.com) across the entire dataset. Outliers were detected using the arrayQualityMetrics  Bioconductor package, which implements three different statistical tests to identify outliers. A total of 443 and 29 samples were detected as outliers and removed in Arabidopsis and rice respectively. Thus, 4,566 and 508 samples remained for Arabidopsis and rice, respectively. The annotation files (Release 30) of these two arrays were downloaded from the Affymetrix website (http://www.affymetrix.com), containing 22,810 Arabidopsis genes and 27,910 rice genes. For a gene, there may be multiple probe sets or multiple types of probe sets available on the array. However, a general rule for selection of a probe set that best represents the gene's expression profile has not been resolved yet , . In this study, inclusion or exclusion of “sub-optimal” probe sets with suffix “_s_at” or “_x_at” that are suspected of potential cross-hybridization (may be not sub-optimal in practice according to ref. , ) had only trivial effects. Thus, to survey as many genes as possible, all types of probe sets were considered, and for a gene with multiple probe sets, we used the first probe set according to alphabetic sorting to represent its expression profile.
Similarity between the expression profiles of two duplicate genes within species was initially measured by either Pearson's (denoted by PCC or r) or Spearman's correlation coefficient. Note that all replicate chips were retained and correlations were computed across all individual chips. These two measures generated highly consistent results, and thus we only showed the statistics measured by Pearson's correlation coefficient. The expression divergence between two duplicate genes or orthologs was measured by , .
Orthologous gene pairs compared between Arabidopsis and rice were restricted to 2,012 pairs of orthologs located at corresponding loci in paired syntenic blocks between Arabidopsis and rice as identified by MCScan , and having expression profiles on the arrays. To assess the expression conservation (EC) for a pair of Arabidopsis-rice orthologs, we adopted a conceptual framework of comparing co-expression patterns across species  implemented in several other studies similar to ours , , , , . In this study, the framework can be described as:
The populations of potential gene duplications in Arabidopsis or rice were identified using BLASTP. Only the top five non-self protein matches that met a threshold of were considered. Genes without BLASTP hits that met a threshold of were deemed singletons. Pairs of WGD duplicates were downloaded from the PGDD database , . Pairs of α, β, γ duplicates in Arabidopsis and pairs of ρ, σ duplicates in rice were obtained from published lists , . Single gene duplications were derived by excluding pairs of WGD duplicates from the population of gene duplications. Tandem duplications were defined as being adjacent to each other on the same chromosome. Proximal duplications were defined as non-tandem genes within 20 annotated genes of each other on the same chromosome .
The remaining single gene duplications (after deducting tandem and proximal duplications) were searched for distant single gene-transposed duplications. To accomplish this aim, genes at ancestral chromosomal positions need to be discerned by aligning syntenic blocks within and between species , . Angiosperm syntenic blocks were downloaded from the Plant Genome Duplication Database (PGDD), available at http://chibba.agtec.uga.edu/duplication. At the time of retrieval, PGDD provided syntenic blocks within and between 10 species including Arabidopsis thaliana, Carica papaya, Prunus persica, Populus trichocarpa, Medicago truncatula, Glycine max, Vitis vinifera, Brachypodium distachyon, Oryza sativa, Sorghum bicolor, Zea mays , . An Arabidopsis or rice gene locus was regarded as ancestral if the resident gene along with any of its homologous genes (paralogs/orthologs) occur at corresponding loci within any pair of syntenic blocks in PGDD. Using this criterion, the population of Arabidopsis/rice genes was divided into two subsets: genes at ancestral loci and genes that were transposed. For a pair of distantly transposed duplicate genes, we required that one copy was at its ancestral locus and the other was at a non-ancestral locus, named the parental copy and transposed copy respectively. If the parental copy has more than two exons and the transposed copy is intronless, we inferred that this pair of duplicate genes occurred by retrotransposition (RNA based transposition). If both copies have a single exon, the pair of duplicates was unclassified. For other cases of a pair of distantly transposed duplicate genes, we inferred that the duplication occurred by DNA based transposition. The remaining single gene duplications in the population, i.e. after deducting WGD, tandem, proximal, DNA based transposed and retrotransposed duplications from the BLASTP output, were classified as dispersed duplications. After pairs of duplicate genes in each duplication mode were identified, we assigned a unique origin to each duplicated gene, according to the following order of priority: WGD>tandem>proximal>retrotransposed>DNA based transposed>dispersed.
GO/Pfam enrichment analysis was performed using Fisher's exact test. The P-value was calculated for the null hypothesis that there is no association between a subset of genes and a particular functional/domain category and was corrected with the total number of terms to account for multiple comparisons.
Coding sequence divergence between a pair of genes was denoted by either non-synonymous (Ka) or synonymous (Ks) substitution rates. Protein sequences were aligned using Clustalw  with default parameters. The protein alignment was then converted to DNA alignment using the “Bio::Align::Utilities” module of the BioPerl package (http://www.bioperl.org/). Ka and Ks were estimated by Nei-Gojobori statistics , available through the “Bio::Align::DNAStatistics” module of the BioPerl package. Note that the “Bio::Align::DNAStatistics” module may generate invalid Ka or Ks for some duplicate gene pairs due to mis-alignments, which were ruled out from related analysis. All levels of valid Ka or Ks values were considered in related statistical analyses. Because distributions of Ka or Ks were centered at low levels (~1.0), in related figures, to improve their clarity, we only displayed Ka or Ks values between 0 and 2.0.
The promoter region of a gene was restricted to a maximum of 1,000 bp upstream of the transcription start site (TSS) or less if the nearest adjacent upstream gene is closer than 1,000 bp. For a pair of genes, the divergence of promoter sequences was indicated by their Jukes-Cantor nucleotide substitution rate (μ) , which is available through the “Bio::Align::DNAStatistics” module of the BioPerl package. The divergence in 5′UTR and 3′UTR is also measured by nucleotide substitution rates (μ). Note that the “Bio::Align::DNAStatistics” module may not output μ if the distance between two input nucleotide sequences is too near or too far. Duplicate gene pairs lacking estimation of μ in the promoter region, 5′UTR or 3′UTR were removed from related analysis.
Arabidopsis and rice genome-wide DNA methylation data were obtained from GEO (accession number: GSE21152) . We chose this study, which provided DNA methylation for both Arabidopsis and rice, because the systematic errors between species should be smaller than in data from separate studies. A gene methylated in the promoter region is defined by the presence of two or more adjacent methylated probes within the promoter DNA sequence , .
Lists of published gene families were obtained from TAIR (http://www.arabidopsis.org/browse/genefamily/index.jsp) for Arabidopsis, and from the Rice Genome Annotation Project data (http://rice.plantbiology.msu.edu/annotation_community_families.shtml) for rice. Only families with more than nine genes were considered. Arabidopsis disease resistance gene homologs were downloaded from the NIBLRRS Project website (http://niblrrs.ucdavis.edu/). The Rice Cytochrome P450 gene family was downloaded from the Cytochrome P450 homepage  .
Enriched GO terms and Pfam domains associated with the duplicates of conserved or divergent expression at each WGD event.
List of investigated gene families and their enrichments with modes of gene duplication.
Correlations between expression divergence and different types of sequence divergence.
We thank Prof. Michael R. Freeling for helpful advice and Barry Marler for IT support. We thank Xinyu Liu for statistical advice.
Competing Interests: The authors have declared that no competing interests exist.
Funding: AHP appreciates funding from the National Science Foundation (NSF: DBI 0849896, MCB 0821096, MCB 1021718). This study was supported in part by resources and technical expertise from the University of Georgia Research Computing Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.