Expansion of the PG family in Arabidopsis and rice
To investigate the relationships among PGs and the extent of lineage-specific expansion in rice and Arabidopsis, we identified PGs from the GenBank polypeptide records and the genomes of Arabidopsis and rice (Oryza sativa). All PGs identified contain glycosyl hydrolase 28 (GH28) domains that are approximately 340 amino acids long and encompass approximately 75% of the average PG coding sequence (for lists of genes used in this analysis, see Figure and Additional data files 1, 2 and 8). According to the phylogenetic relationships of bacterial, fungal, metazoan, and plant PGs (Additional data file 3), we found that the 66 Arabidopsis and 59 rice PGs fall into three distinct groups (Figure , groups A, B, and C). Sixteen of the rice PGs contain more than one GH28 domain and were regarded as mis-annotated tandem repeats. It should be noted that the rice PGs were derived from the shotgun sequencing of the O. sativa ssp. indica genome, which was estimated to be 95% complete [23]. We identified the nodes that lead to Arabidopsis-specific and rice-specific clades and predict that these represent the divergence point between these two species. We have designated the clades defined by such nodes as AO (Arabidopsis-Oryza) orthologous groups. For example, in the A3 clade there exists one Arabidopsis subclade and one rice subclade, and we predict that only one ancestral A3 sequence was present before the divergence between Arabidopsis and rice. However, gene losses could have occurred, and therefore some PGs may have been present in the Arabidopsis-rice common ancestor but later lost in either Arabidopsis or rice (Figure , arrowheads). Therefore, Arabidopsis (A, indicating loss(es) in rice) and rice (O, indicating loss(es) in Arabidopsis) clades were also identified based on their sister group relationships to the AO clades. Since the clades that we defined are most likely orthologous groups (Figure , red circles), the number of clades indicates that there were at least 21 ancestral PGs before the Arabidopsis-rice split. Further expansion of this gene family occurred after the split, as suggested by the duplication events in the lineage-specific branches that reside within each clade. It should be noted that some clades, such as the A1 clade, were not defined based on the AO clade-based criteria because the nodes within had relatively low bootstrap support (<50%). If we assume that these less well-supported nodes are correct, there were 27 ancestral PGs.
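The clade bookkeeping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the gene identifiers are invented, the "At"/"Os" prefix convention is assumed for convenience, and species composition alone stands in for the tree-based criteria (node position, sister-group relationships, and bootstrap support) that were actually used to define the clades.

```python
from collections import Counter

def species_of(gene_id):
    """Map a gene ID to a species via a hypothetical prefix convention."""
    if gene_id.startswith("At"):
        return "Arabidopsis"
    if gene_id.startswith("Os"):
        return "rice"
    raise ValueError(f"unrecognized gene ID: {gene_id}")

def classify_clade(members):
    """Label a clade AO (both species), A (Arabidopsis only) or O (rice only)."""
    if not members:
        raise ValueError("a clade must contain at least one gene")
    species = {species_of(g) for g in members}
    if species == {"Arabidopsis", "rice"}:
        return "AO"
    return "A" if species == {"Arabidopsis"} else "O"

# Invented clades for illustration; these are not the clades from the phylogeny.
clades = {
    "clade_1": ["At_pg1", "At_pg2", "Os_pg1"],   # both species present
    "clade_2": ["At_pg3"],                       # implies a loss in rice
    "clade_3": ["Os_pg2", "Os_pg3"],             # implies a loss in Arabidopsis
}

labels = {name: classify_clade(members) for name, members in clades.items()}
label_counts = Counter(labels.values())

# Each clade is assumed to descend from one gene in the common ancestor,
# so the number of clades is a lower bound on the ancestral gene count.
min_ancestral = len(clades)
print(labels, dict(label_counts), min_ancestral)
```

Counting clades gives a lower bound rather than an exact number because poorly supported nodes may hide additional ancestral lineages, which is why two estimates (21 and 27) are reported.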
Figure 1 The phylogeny of Arabidopsis and rice PGs. The amino acid sequences for the glycosyl hydrolase 28 family motif were aligned. The phylogeny was generated using the neighbor-joining algorithm with 1,000 bootstrap replicates. Sequences are color-coded according ...
Duplication mechanisms accounting for the PG family expansion
Examination of the locations of the Arabidopsis PGs on all five chromosomes indicates that many PGs are non-randomly distributed (Figure ). More than one third of the Arabidopsis PGs (24 of 66) have at least one related sequence within ten predicted genes, and these 24 genes fall into nine clusters that range from two to four genes per cluster (Figure , column cluster). In most cases, these physically associated PGs are from the same clades; however, there are five exceptions, including genes in clusters 1d, 2b and 3a (Figure ). In these cases, some members within the cluster are not each other's closest relatives. Apart from these 24 tandemly repeated sequences, all remaining PGs are at least 100 genes apart. This bimodal distribution of PG physical distances, together with the relationships between closely linked genes, suggests that the 24 closely linked PGs are derived from tandem duplications.
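The distance criterion for tandem clusters can be illustrated with a short sketch. It assumes each PG is described by its ordinal position among the predicted genes on one chromosome; the gene names and positions below are invented, and the function name `tandem_clusters` is hypothetical. The sketch uses position only and does not check whether clustered genes are close relatives.

```python
def tandem_clusters(pg_positions, max_gap=10):
    """Group PGs on one chromosome whose gene-order indices lie within max_gap.

    pg_positions maps a gene name to its ordinal position among all predicted
    genes on the chromosome. Neighbouring PGs are chained together (single
    linkage), and only groups with at least two members are returned.
    """
    clusters = []
    current = []
    for gene, idx in sorted(pg_positions.items(), key=lambda kv: kv[1]):
        # Close the running group when the next PG is too far away.
        if current and idx - current[-1][1] > max_gap:
            if len(current) > 1:
                clusters.append([g for g, _ in current])
            current = []
        current.append((gene, idx))
    if len(current) > 1:
        clusters.append([g for g, _ in current])
    return clusters

# Invented positions: three PGs close together, one isolated PG, then a pair.
example = {"pgA": 100, "pgB": 103, "pgC": 110, "pgD": 400, "pgE": 900, "pgF": 905}
found = tandem_clusters(example)
print(found)
```

Applied per chromosome, a grouping of this kind separates closely linked PGs from the remaining PGs, which in the analysis above are at least 100 genes apart.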
Figure 2 Mechanisms of Arabidopsis PG family expansion. The locations of Arabidopsis PGs are indicated on the Arabidopsis chromosomes. The tandem clusters are also indicated. They are color-coded based on the following scheme: PGs found in both duplicated regions ...
In addition to tandem duplications, it has been shown that the Arabidopsis genome is the product of several rounds of polyploidization, or whole-genome duplications [17]. To determine the contribution of these large-scale duplications, we mapped Arabidopsis PGs to the duplicated blocks established in two independent studies. The first dataset, from the Arabidopsis Genome Initiative [17], contains 31 blocks (AGI blocks), and 40 Arabidopsis PGs fall in 16 of the AGI blocks (Figure , indicated in red and green). Blocks from the second dataset, from Blanc et al., are designated as BHW (after Blanc, Hokamp, Wolfe) blocks, and 19 PGs were found in 10 BHW blocks (Figure , shaded). The AGI and BHW blocks were identified using different approaches, and their combined use increases the coverage of duplicated regions. As a result, nearly 90% (59 out of 66) of Arabidopsis PGs are covered by the 26 AGI and BHW blocks.
Within these 26 duplicated blocks, 29 PGs are found in both duplicated regions of ten block pairs. To investigate the origin of PGs in these ten block pairs, we conducted similarity searches between the regions of each pair to determine whether PGs mapped to the corresponding duplicated regions, and whether their neighboring genes were arranged collinearly (Figure ; see also Additional data file 4 for all comparisons). Sixteen PGs in five of these block pairs are clearly located in such collinear regions, indicating that they were derived from large-scale duplication of their associated blocks. For example, AGI block 23a contains nine PGs in six corresponding duplicated regions that show extensive collinearity (Figure ). In Figure , At2g41850 and At3g57510 are flanked by paralogous genes that are arranged collinearly, indicating that they were products of a block duplication. This is also true for a tandem cluster of four PGs and a PG singleton shown in Figure . Interestingly, At3g57790 corresponds to At2g43210, a potential pseudogene lacking the signal peptide and the bulk of the PG catalytic domain (Figure ). We also observed that there are 23 duplicated block pairs with asymmetrical distribution of PGs (Additional data file 4). Among them, 16 block pairs have PGs on only one of the blocks (Figure and Additional data file 4): ten for AGI and six for BHW blocks. For the remaining seven block pairs, the PGs are found on both blocks but are not arranged in a collinear fashion. Taken together, these findings clearly indicate that many members of the PG family are derived from large-scale duplication events. However, a substantial number of the resulting duplicates were not retained.
Figure 3 Collinearity of PGs in AGI block 23a. After locating areas with similarities in block 23a (see also Additional data file 4), six distinct PG-containing regions were defined. (a) At2g40310 does not have a PG in the collinear region. (b) At2g41850 and ...
PG expression in Arabidopsis tissues
The size of the plant PG family and the patterns of PG duplication in Arabidopsis indicate that the PG family expanded in both Arabidopsis and rice after their divergence. The continuous expansion of this gene family raises an intriguing question about the mechanisms of duplicate retention and the functions of the retained duplicates in plants. Since retention may be due to functional divergence between duplicate copies, it is possible that PG functional divergence can be attributed, in part, to expression divergence. To evaluate the degree of expression divergence between PG duplicates, we analyzed the expression of all 66 Arabidopsis PGs in five tissue types (flowers, siliques, inflorescence stems, rosette and cauline leaves, and roots) with RT-PCR (Figure and Additional data file 5). PCR reactions were repeated at least three times for each gene in each tissue type, and all primers were tested using genomic DNA as a positive control (see Figure ). In addition, PCR products of 40 of the 43 PGs with detectable products were sequenced to verify their identity. We found that 23 PGs did not have detectable RT-PCR products in any of the five tissue types tested. We further tested the expression of these 23 PGs in a T87 suspension culture cell line that had previously been shown to express >60% of genes [24]. Only one PG (At2g43860) was detected. To rule out the possibility of faulty primer design, a second primer set was designed for each of these 23 PGs, but none led to detectable products.
Figure 4 The phylogeny and expression patterns of Arabidopsis PGs. The phylogeny was generated using all Arabidopsis PGs with Erwinia peh1 as the outgroup. The clade classification, cluster and block designation are also shown. The levels of transcripts are classified ...
Figure 5 RT-PCR of PGs in five tissue types. The competitive RT-PCR, using both cDNA and gDNA templates, is demonstrated. The expression pattern of PGs in the clade A14 is variable except for At1g23470, which has no detectable expression in any of the five tissue types. ...
To complement the RT-PCR approach, we also examined publicly available expression tags, including full-length cDNAs, expressed sequence tags (ESTs), and massively parallel signature sequencing (MPSS) tags (Additional data file 6). The presence of RT-PCR products or other expression tags is shown in Figure (far right-hand panel). Among these four different expression measures, the RT-PCR approach detects the highest number of PGs. Of the 43 PGs with RT-PCR products, only 30 are supported by other expression tags. In addition, only three PGs have cDNA, ESTs, and/or MPSS tags but no RT-PCR products. These findings indicate that RT-PCR is the most sensitive approach, with a relatively low false-negative rate. For further analyses, we consider a PG expressed if at least two out of three of the RT-PCR reactions had detectable products (42 PGs) or if its expression is supported by the presence of either a cDNA or an EST (three PGs). Based on these criteria, 45 PGs had detectable expression (Figure ). Approximately 50% of these expressed PGs are found in all five tissues, and 20% have a relatively higher level of expression in more than one tissue. In addition, more than 50% of expressed PGs have a high level of expression in floral tissues, 40% in root tissue, 16% in stem and 12% in silique. Only nine PGs (approximately 20%) are found in only one tissue type (Figure ). These findings indicate that most PGs have rather wide expression patterns and that the expression level seems to be generally higher in floral tissues. The complexity of the expression patterns represented in Figure emphasizes the need for additional interpretation, and is the basis for the statistical analyses of the expression data described below.
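The criterion for calling a PG expressed can be written as a small predicate. This is a minimal sketch: the function name `is_expressed`, its parameters, and the example records are hypothetical, and the data are invented rather than taken from the additional data files. The threshold is parameterized on the assumption that three replicate reactions are scored per gene.

```python
def is_expressed(rtpcr_replicates, has_cdna=False, has_est=False, min_positive=2):
    """Apply the expression call described in the text.

    A gene counts as expressed if at least min_positive RT-PCR replicates
    gave a detectable product, or if a cDNA or an EST supports it. MPSS
    evidence alone is deliberately not used, following the stated criteria.
    """
    positives = sum(1 for detected in rtpcr_replicates if detected)
    return positives >= min_positive or has_cdna or has_est

# Invented records: (gene label, replicate results, cDNA support, EST support).
records = [
    ("gene_1", [True, True, False], False, False),   # 2 of 3 replicates
    ("gene_2", [True, False, False], False, False),  # 1 of 3, no tag support
    ("gene_3", [False, False, False], False, True),  # EST support only
]

calls = {name: is_expressed(reps, cdna, est) for name, reps, cdna, est in records}
print(calls)
```

Under these rules a gene with a single positive replicate and no cDNA or EST support is not counted, which is how 43 PGs with RT-PCR products can yield 42 RT-PCR-based calls.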
Effects of duplication mechanisms on gene expression
While it was anticipated that more closely related genes would tend to have similar expression patterns, we did not find a significant correlation between the synonymous substitution rate (Ks) and the expression profile (Figure ). In addition to evaluating the relationship between Ks and expression correlation using all PG pairs, we reached the same conclusion after partitioning the data into within-clade pairs (r = -0.119, p = 0.39), between-clade pairs (r = 0.002, p = 0.58), or reciprocal best matches (r = -0.4389, p = 0.12). This finding indicates that expression patterns have diverged quickly after PG duplications. In particular, significantly fewer PGs in tandem clusters were expressed when compared with those not in clusters (Table ; Fisher's exact test; p = 0.0326). In several cases, the tandemly duplicated regions have one relatively highly expressed gene while the rest have either low expression levels or no RT-PCR products. For example, in the 1b tandem cluster of clade A14, At1g23460 is highly expressed while At1g23470 does not have any detectable expression. By contrast, we found that related PGs in duplicated blocks tend to have similar expression patterns at the tissue level. For example, in block 11d clade A14, At1g23460 and At1g70500 have nearly identical expression profiles (Figure ). We selected 18 PG pairs that were derived from tandem or large-scale block duplication to compare their expression divergence. Among the nine pairs in large-scale duplicated blocks, the expression pattern is significantly different in only one pair (Table ). Among the nine pairs derived from tandem duplications, the t-test could only be conducted for four pairs because several of the tandem duplicates had no detectable expression. In addition to two pairs with significant differences (p < 0.05), three pairs with only one of the tandem duplicates expressed are also classified as pairs showing expression divergence. Therefore, excluding two pairs with no expression for both duplicates, five out of seven tandem pairs have divergent expression. Significantly fewer PG pairs derived from tandem duplications have similar expression patterns compared with those derived from large-scale duplications (Fisher's exact test; p < 0.01). Therefore, tandemly duplicated PGs have higher levels of expression divergence than PGs derived from large-scale duplications. These findings suggest that duplication mechanisms contribute to divergence of expression patterns differently.
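For readers unfamiliar with the test used here, the following is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 table. The function name is hypothetical, and the two-sided convention (summing all tables no more probable than the observed one) is an assumption, since the sidedness of the reported tests is not stated. The example table is read from the pair counts in the text (similar versus divergent expression, block versus tandem pairs); the published table may tabulate pairs differently, so the value computed here should not be expected to reproduce the reported p-values.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    With the margins fixed, the top-left cell follows a hypergeometric
    distribution. The p-value sums the probabilities of all tables that are
    no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probs = [prob(x) for x in range(lowest, highest + 1)]
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= p_observed * (1 + 1e-9))

# Rows: block-duplicated pairs, tandem pairs (counts as read from the text).
# Columns: similar expression, divergent expression.
p_pairs = fisher_exact_two_sided(8, 1, 2, 5)
print(p_pairs)
```

An exact test is the natural choice for tables this small, where the large-sample approximation behind a chi-squared test would be unreliable.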
Figure 6 Expression of PGs shared among tissues and the correlation between expression patterns and the Ks. (a) Overlapping expression of PGs - the majority of expressed PGs are found in all five tissues tested. (b) Pairwise comparisons of tissues with PGs - the ...
Distribution and expression of Arabidopsis PG genes in duplicated regions
Expression (RT-PCR) of Arabidopsis PG genes in different clades
Developmentally regulated expression divergence among PGs expressed in abscission zone
So far, our expression analyses were performed in five widely different tissues. To further expand our understanding of PG expression, we examined 43 of the expressed PGs in the abscission zones of flowers and developing siliques at five developmental stages during floral organ abscission (Figure ). During the abscission process there are discrete stages when cell wall loosening and cell wall dissolution occur, thus providing an excellent biological system in which to look at more subtle changes in the regulation of cell separation. Indeed, this analysis allowed us to discern differences in expression between PGs that had initially been regarded as similar due to limitations in resolution (Figure ). For example, at the tissue level, At1g23460 and At1g70500, from block 11d clade A14, were regarded as having nearly identical expression profiles. However, when we examined five stages of abscission, these genes had distinct profiles (Figure and Additional data file 7).
Figure 7 RT-PCR on floral organ abscission zones representing five unique stages of development. Expression of 43 PGs is examined in the abscission zones at five different stages of floral organ abscission as determined by position on the inflorescence, where ...
We determined that there are nine unique patterns of expression for the PGs during the five stages of abscission, which are shown in Figure and Additional data file 7. Eight PGs display high levels of expression at anthesis, low levels during the events of cell separation, and high levels post-abscission, as depicted in Figure . These genes are all from independent clades except for two pairs: At1g19170 and At3g42950 (B8), and At2g23900 and At3g48950 (B6). In Figure , seven PGs show initial high expression at anthesis that decreases steadily during abscission, while in Figure , PG expression (At1g02460, At1g56710, and At3g61490) initially decreases right before abscission and then increases after the loss of floral organs, or during what is described as post-abscission repair. In Figure , two PGs (At1g23460 and At1g10640) have very low or undetectable expression during anthesis that increases continually during abscission. Other patterns include ten PGs with constitutive expression (Figure ) and six PGs with no expression (Figure ). Last, we observed three patterns of expression that correlated with unique changes during the process of abscission (Figure ). In Figure , high levels of gene expression correlate with cell wall loosening, or the earliest steps of abscission, while in Figure the highest levels of gene expression correlate with cell separation, or loss of floral organs. In Figure , it is only at around positions 10 and 11 that we observe detectable gene expression, and this correlates with predicted stages of cell repair [25].
Taken together, expression divergence between PGs that show no difference at the tissue level was revealed when we examined PG expression at different developmental stages of abscission, thus indicating that duplication mechanisms contribute to divergence of expression differently. Our findings also provide candidate PGs important for different abscission stages. More importantly, the expression divergence between duplicate genes in general appears to be underestimated in expression studies due to limitations in resolution.