|Home | About | Journals | Submit | Contact Us | Français|
Users may view, print, copy, download and text and data- mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
We present the 207 Mb genome sequence of the outcrosser Arabidopsis lyrata, which diverged from the self-fertilizing species A. thaliana about 10 million years ago. It is generally assumed that the much smaller A. thaliana genome, which is only 125 Mb, constitutes the derived state for the family. Apparent genome reduction in this genus can be partially attributed to the loss of DNA from large-scale rearrangements, but the main cause lies in the hundreds of thousands of small deletions found throughout the genome. These occurred primarily in non-coding DNA and transposons, but protein-coding multi-gene families are smaller in A. thaliana as well. Analysis of deletions and insertions still segregating in A. thaliana indicates that the process of DNA loss is ongoing, suggesting pervasive selection for a smaller genome.
Genome sizes in angiosperms range from 64 Mb in Genlisea1 to an enormous 149 Gb in Paris2-4. Two major processes increase genome size: polyploidization and transposable elements (TE) proliferation. Processes that counteract genome expansion include the loss of entire chromosomes, as well as deletion-biased mutations due to unequal homologous recombination and illegitimate recombination5-9. Recent work comparing two cereals, rice and sorghum, has begun to shed light on some of these processes10. However, these species are separated by 60 to 70 million years, making it difficult to disentangle the different evolutionary forces at work.
An exciting opportunity to understand what drives differences in genome size over shorter time scales is offered by the genus Arabidopsis in the Brassicaceae. The genome of the self-incompatible perennial A. lyrata is larger than 200 Mb, near the family average11,12, while the self-compatible annual A. thaliana has one of the smallest angiosperm genomes, at about 125 Mb, even though the two species diverged only about 10 million years ago13-15. Compared to the difference between the two species, variation within A. thaliana is much less11.
A high-quality genome sequence for the partially inbred A. lyrata strain MN47 was assembled from approximately 8.3x coverage of dideoxy sequencing reads, making use of information from genetic maps and chromosome painting16-19 (Online Methods). The final assembly included 206.7 Mb of sequence, 90% of which are included in eight large scaffolds covering the majority of each of the eight chromosomes, and another large scaffold of 1.9 Mb representing one of the centromeres. Based on cytological observations20, the centromeric gaps were estimated to span 17.2 Mb. A combination of de novo predictions, homology to A. thaliana features, and RNA sequencing was used to annotate the genome. In A. lyrata, we predicted 32,670 protein-coding genes, compared to 27,025 genes in A. thaliana21.
Since overall sequence identity between A. lyrata and A. thaliana is greater than 80% (Supplementary Fig. 1), the two genomes could be easily aligned (Fig. 1a). Genetic mapping16,18,19 has revealed 10 major rearrangements, including two reciprocal translocations and three chromosomal fusions, that led to the A. thaliana karyotype of five chromosomes, compared to the ancestral state of eight, as found in A. lyrata and other Brassicaceae. Although centromeric regions are difficult to assemble, we could identify the syntenic region in A. thaliana that corresponds to the chromosome 4 centromere of A. lyrata. The entire centromere has been lost, with only two remnants of satellite repeats in the 1.4 kb intergenic region between the genes At2g26570 and At2g26580 (Supplementary Fig. 2).
Apart from chromosomal-scale changes, approximately 90% of the two genomes have remained syntenic, with the great majority in highly conserved collinear arrangements (Fig. 1b and Supplementary Fig. 1d). The run length distribution of collinear gene pairs is bimodal, with a first peak of fragments of five or fewer collinear gene pairs (Fig. 1c), reflecting an abundance of small-scale rearrangements (<10 kb), including single gene transpositions. Windows containing a breakpoint in collinearity are enriched for TEs and other repeats (Supplementary Table 1), in agreement with repetitive elements often being associated with chromosomal rearrangements and transposed genes22-27, although they might not necessarily be causal28. Two thirds of the 154 inversions identified between the two species are flanked by inverted repeats (Supplementary Table 2).
Despite this overall similarity in gene arrangement, the two genomes are strikingly different in size. A whole-genome alignment reveals that more than 50% (<114 Mb) of the A. lyrata genome appears to be missing from the A. thaliana reference genome. In contrast, only about 25% (<30 Mb) of the A. thaliana genome is absent from A. lyrata (Fig. 1d; Supplementary Fig. 1e; Supplementary Table 3). Nevertheless, the distribution across different sequence classes is similar: half of the unalignable sequences are in TEs, and a quarter in intergenic regions. The net effect of these changes is that the A. thaliana genome is ~80 Mb smaller than the A. lyrata genome, with a much higher fraction of genic sequences, 42% instead of 29%, even though the total gene count is smaller (Fig. 1e). The apparent shrinkage of the A. thaliana genome is not simply due to a few chromosome-scale changes: only 10% of the size difference is attributable to the three missing centromeres; the rest is due to hundreds of thousands of smaller insertions and deletions, spanning all classes of sites. Strikingly, while large differences much more often correspond to sequences only found in A. lyrata, this is not true for very small insertions and deletions (Fig. 2). This is in stark contrast to other genomes from other closely related species, but with similarly sized genomes, such as chimpanzee and human29.
Although rearrangements are correlated with genome shrinkage (rearranged regions are on average shorter in A. thaliana than are collinear regions; Fig. 3 and Fig. 4a), unalignable sequences are found throughout the genome. An analysis of collinear gene pairs confirmed that in most cases, intergenic regions in A. lyrata are longer than their counterparts in A. thaliana (Fig. 4b). Introns behave similarly, although the difference is smaller13.
The gene content of A. thaliana is ~17% lower than that of A. lyrata, but without major differences in Gene Ontology (GO) distribution. Similarly, divergence patterns for different gene families between the two species mirror those of within-Arabidopsis thaliana polymorphism levels30,31. The combined gene sets of A. lyrata and A. thaliana result in 12,951 MCL32 clusters, with fewer singletons in A. thaliana (Fig. 4c). Among the 8,794 shared multi-gene MCL clusters (Fig. 4d), clusters that are smaller in A. thaliana outnumber those that are smaller in A. lyrata (1,797 to 612). F-box and NB-LRR genes are examples of gene families with particularly high birth and death rates in plants30,31,33-35. Arabidopsis lyrata has 596 F-box and 187 NB-LRR genes, compared to 502 and 159, respectively, in A. thaliana. The trend of fewer genes in A. thaliana is supported by a broader comparison of the Arabidopsis gene set with those of two other dicots36-38. Arabidopsis lyrata has 114 ortholog clusters39 shared with poplar and grapevine but not A. thaliana, while A. thaliana has only 45 clusters found in poplar and grapevine but not A. lyrata. Similarly, A. lyrata has 875 clusters not detected in any of the other three species, while A. thaliana has only 156 species-specific clusters (Supplementary Table 4 and Supplementary Fig. 3).
As in other taxa, TEs make an important contribution to the change in genome size (Fig. 1d), and TEs comprise a larger fraction of the A. lyrata genome (Fig. 1e). Without an outgroup, one cannot infer directly how much such patterns are shaped by different TE activity levels or the differential purging of ancestral TEs since speciation. To obtain an estimate of relative activity levels, one can exploit the molecular clock to estimate the average age of long terminal repeat (LTR) retrotransposons40 (Fig. 1e). Using the experimentally determined mutation rate in A. thaliana14, we calculated the mean and median age in A. thaliana to be 3.1 and 2.1 million years, respectively, compared to 1.1 and 0.6 million years in A. lyrata (Fig. 5a). In agreement with previous estimates41, this suggests that LTR retrotransposons have been recently more active in A. lyrata. A phylogenetic analysis also supports a greater expansion of specific LTR retrotransposon clades in A. lyrata (Fig. 5b). Coupled with higher activity levels of TEs in A. lyrata, we find that TEs are differently distributed in the two species, with A. lyrata having a higher proportion of genes with a TE nearby than A. thaliana (Fig. 5c), and this distance is skewed towards larger values in A. thaliana (Supplementary Table 5 and Supplementary Fig. 4). Together, these observations are consistent with a model under which selection purges TEs with deleterious effects on adjacent genes, such that TEs more distant from genes preferentially survive42, with TE elimination having been more efficient in A. thaliana. In addition, there is the possibility that TEs in A. lyrata have experienced less natural selection because they are on average younger.
The evidence presented so far points to A. thaliana having suffered a large number of deletions throughout its genome. We can use within-species polymorphisms to shed light on the process by which this has happened. If the A. thaliana genome continues to shrink, we would expect fewer segregating insertions than deletions. Using the A. lyrata genome as a proxy in determining the derived state among a set of insertion and deletion polymorphisms found throughout the genome of 95 A. thaliana individuals43, we find a clear excess of deletions over insertions, with 2,685 fixed and 852 segregating deletions, compared to 1,941 fixed and 106 segregating insertions. Furthermore, among the fixed differences, deletions are on average longer than insertions (Fig. 6a). If selection were not involved, and if this pattern were only due to mutational bias favoring deletions44,45, deletion and insertion polymorphisms should have similar allele frequencies in the A. thaliana population. However, segregating insertions are on average found in fewer individuals than are deletions or single-nucleotide polymorphisms. Deletions are often found in the majority of individuals, and many are approaching fixation in A. thaliana (Fig. 6b). This pattern suggests that deletions are favored over insertions because of selection, rather than simple mutational bias, thus leading to a smaller genome.
The pattern of divergence between the two genomes supports this hypothesis. While more deletions have occurred on the A. thaliana than the A. lyrata lineage, the bias towards deletions becomes stronger the longer the missing sequence, and it is absent for sequences shorter than 5 bp or so (Fig. 2). This is consistent with a model where long deletions are selectively favored in A. thaliana, whereas short deletions are not. We acknowledge that without an outgroup to reconstruct the ancestral state shared by the ancestor of both A. lyrata and A. thaliana, one cannot accurately determine whether all changes are derived in A. thaliana.
In summary, we have presented a high-quality reference genome sequence for A. lyrata, which will be a valuable resource for functional, evolutionary and ecological studies in the genus Arabidopsis. Several processes contribute to the remarkable difference in genome size between the predominantly selfing A. thaliana and the outcrossing A. lyrata. In just a few million generations, numerous chromosomal rearrangements have occurred, consistent with theoretical predictions of rearrangements that reduce fitness in heterozygotes being fixed much more easily in strongly selfing species46. Though A. thaliana has 17% fewer genes than A. lyrata, much of the genome size difference seems to be due to reduced TE activity and/or more efficient TE elimination in A. thaliana, especially near genes, as well as shortening of non-TE intergenic sequences and introns in A. thaliana. Specifically, by making the reasonable assumption that the A. lyrata allele presents in the majority of cases the ancestral state, we find that segregating deletions at non-coding sites in A. thaliana are skewed towards higher allele frequencies, and that both fixed and polymorphic deletions are more common than insertions. Together, this suggests pervasive selection for a smaller genome in A. thaliana. Apart from apparent advantages for species with smaller genomes that have been inferred from meta analyses47, the transition to selfing might be an important factor in this process46. In addition, a shorter life span may allow a reduction of the genetic repertoire and thus contribute to the smaller genome of A. thaliana as well.
What role, if any, genome expansion might play in A. lyrata can be addressed once detailed A. lyrata polymorphism information as well as closely related outgroup genomes become available, such as the one from Capsella rubella, which is currently being assembled. A complete understanding of the processes behind genome contraction and expansion over short time scales will also require better knowledge of mutational events, and a deeper understanding of the distribution of, and selection on, non-coding regulatory sequences42. For both, high-quality whole genome sequences of additional Arabidopsis relatives will be an important tool.
Arabidopsis lyrata strain MN47 was derived by forced selfing from material collected in Michigan, USA, by Dr. Charles Langley (UC Davis). It was inbred six times before extracting DNA for sequencing. Libraries with various insert sizes including fosmids and BACs were dideoxy sequenced on ABI 3730XL capillary sequencers. Reads were assembled with Arachne48, and collinearity information was integrated with marker information from genetic maps16,18,19 to reconstruct the eight linkage groups. Additional details and specifics are presented in the Supplementary Note.
MCL (mcl-06-058 package; http://micans.org/mcl/src/) was used with default parameters (−I 2, −S 6) based on clustering of hits with E-value ≤ 10−5. MCL uses a Markov cluster algorithm that attempts to overcome many of the difficulties with protein sequence clustering, such as the presence of multi-domain proteins, peptide fragments and proteins with very common domains. The method has been used for a variety of animal genomes49-51.
Orthologous gene clusters were computed from OrthoMCL comparisons39 of four dicotyledonous species with finished genomes: A. thaliana and A. lyrata, Populus trichocarpa36 and Vitis vinifera37,38. A search for potentially missed genes in both Arabidopsis genomes resulted in minor adjustments of the OrthoMCL clusters. Instead of 10,573, 10,878 clusters now contained at least one gene of each the four species, and instead of 5,699, 5,800 clusters were Arabidopsis-specific. To determine deleted or newly generated orthologs (by OrthoMCL definition) between the two species, we focused on clusters specific for either A. lyrata or A. thaliana. For both species, there are two cluster types, those that are supported by members in P. trichocarpa and/or V. vinifera (supported specific cluster, SSC), and clusters exclusively found in one of the Arabidopsis species (exclusive specific cluster, ESC). We did not consider 2,939 and 6,103 unclustered genes (singletons) in A. thaliana and A. lyrata, respectively.
In our initial analysis, we detected 354 SSCs and 161 ESCs for A. thaliana, and 168 SSCs and 833 ESCs for A. lyrata. Whole genome projects, however, may contain false positive as well as missed or incomplete/partial gene calls that impose difficulties for OrthoMCL to detect orthologous relationships. To ensure that genes from the previously detected SSCs were indeed specific for one of the Arabidopsis species, we re-evaluated absence or presence of specific gene calls in the two genome sequences. Previously missed genes detected by GenomeThreader were added to each of the gene sets and the OrthoMCL analysis was repeated.
Using F-box PF00646.hmm as HMM profile with hmmsearch (E-value ≤ 10−5), 394 hits were found from in A. thaliana and 461 hits in A. lyrata. Alignment of these sequences was optimized with the PF00646 seed using ClustalX 2.052. The final alignment was produced by aligning with hmmalign against PF00646.hmm, to construct an Arabidopsis specific HMM F-box profile. With this HMM profile, 502 hits were found in A. thaliana, and 596 hits in A. lyrata. hmmalign was used to align all of these against PF00646.hmm.
A blastp search (E-value ≤ 10−10) performed with the NB domain (based on HMMEMIT, from http://niblrrs.ucdavis.edu/At_RGenes/). The NB domains of the retrieved proteins, 142 in A. thaliana and 162 in A. lyrata, were aligned using ClustalX52. This alignment was used to develop an Arabidopsis-specific HMM profile, which was used to search the complete set of proteins encoded by both the two genomes (cut off E ≤ 10−5).
PAUP* version 4.0b1053 was used to reconstruct phylogenetic trees with neighbor-joining method.
To develop de novo repeat libraries for both species, we used RepeatModeler (version Beta 1.0.3, http://www.repeatmasker.org/RepeatModeler.html). To reduce false positives, unclassified repeats were compared to annotated genes, eliminating all that had at least 80% identity to annotated genes over at least 80 bp (GenBank: Green plant GB all [protein]; blastx with E-value ≤ 10−10). The remaining RepeatModeler predictions were classified with the 80-80-80 rule54, grouping repeats if they shared at least 80% identity over at least 80% of the aligned sequence, which had to be at least 80 bp long. The identified repeats were appended to RepBase (Arabidopsis library - RM database version 20080611), resulting in a final library with 1,152 repeat units. The final libraries were used to annotate TEs using RepeatMasker version 3.2.5.
Intact LTR retrotransposons were identified de novo using LTR_STRUC55 with default parameters. Based on the sequence divergence between the two LTRs of the same element, insertion times were estimated. All LTR pairs were aligned using MUSCLE56, and the distance K between them calculated with the Kimura two-parameter model using the distmat program implemented in the EMBOSS package (http://emboss.sourceforge.net/). The insertion time T was calculated as T = K/(2) (r), with r as the rate of nucleotide substitution. The molecular clock was set based on the observed mutation rate of 7 × 10−9 per site per generation (assumed to equal one year)14.
LTR retrotransposons can be classified into Ty1/copia-like and Ty3/gypsy-like elements57. We classified repeats using RepBase (version 13.08, http://www.girinst.org/server/RepBase/) and blastn (E-value ≤ 10−10), and by direct comparison against the JCVI/TIGR plant repeat database (http://blast.jcvi.org/euk-blast/index.cgi?project=plant). All intact LTR retrotransposons were compared with blastx (E-value ≤ 10−10) against a conserved 156 amino acid segment corresponding to the reverse transcriptase domain58 of Ty1/copia-like and Ty3/gypsy-like sequences, and this segment was then used for phylogenetic reconstruction, with PAUP* version 4.0b1053 and neighbor-joining method. As outgroup sequence, we used yeast the reverse transcriptase domain from yeast Ty1 and Ty3 elements, respectively58.
Genome wide collinearity was detected by running i-ADHoRe59 on the core-orthologous genes, allowing the identification of breakpoints including inversions and nested inversions. For each inversion, 10 kb up- and downstream of the delimiting breakpoints were compared to each other using blastn (word size 4), tblastx (word size 1) and SSEARCH60-62. Tblastx outperforms blastn for coding regions. In non-coding regions, SSEARCH is more sensitive than blastn, but computationally less efficient, and hence most useful for comparison of shorter sequences. Only one hit per strand was reported. Therefore, for each pair of inversion flanking regions, all combinations of repeats and protein coding genes were evaluated. Default settings were used for gap penalties. An E-value of ≤ 0.01 was considered as indicating similarity between the up- and downstream regions.
To investigate nucleotide divergence of intergenic regions around coding genes (Supplementary Figure 1b), we extracted for each syntenic gene pair the 2 kb sequences 5′ of the start codon and 3′ from the stop codon. If the neighboring gene was closer than 2 kb, the extracted sequence was accordingly trimmed. Coding sequences of syntenic genes were also analyzed. Global alignments of syntenic sequences were generated using the Needleman-Wunsch algorithm as implemented in the EMBOSS package 5.0 (default parameters). Sequence identity of coding regions was measured over the full-length alignment. To investigate whether divergence of intergenic sequence is affected by relative orientation to neighboring genes, upstream sequences were split into head-to-tail and head-to-head groups, and downstream sequences into tail-to-head and tail-to-tail groups.
To identify fixed insertions and deletions among 1,238 fragments that had been amplified by PCR and sequenced in 95 A. thaliana individuals43, two representative sequences for each fragment were first constructed to represent the insertion and deletion states among all segregating indels. The representative sequence consisting of insertions was then queried against the A. lyrata genome with both BLAT63 (−maxGap=100 −extendThroughN −minIdentity=80) and BLAST64 (−e .00001 −F F −G −5 −E −1). Based on the longest hit from the union of hits obtained by both methods, the representative sequences for each alignment were profile-aligned with the A. lyrata allele with MAFFT65. Fixed insertions and deletions were identified in the resulting alignment.
A similar procedure to that described above was used to identify the A. lyrata allele (presumed ancestral state) for each polymorphic indel in A. thaliana. Instead of querying the entire fragment, we queried each insertion allele along with 25 bps flanking each side, against the A. lyrata genome using BLAT. For each polymorphic indel, we filtered for the best hit that spanned both sides of the indel site (by at least 3 bps) and reported each indel as either a derived insertion (if the A. lyrata allele was a deletion in the resulting profile alignment) or a derived deletion (if the A. lyrata allele was not a deletion).
The assembly and annotation (Entrez Genome Project ID 41137) are available from GenBank (accession number ADBK00000000) and from JGI's PHYTOZOME portal (http://www.phytozome.net/alyrata.php). Seeds of the MN47 strain have been deposited with the Arabidopsis Biological Resource Center under accession number CS22696.
The U.S. Department of Energy Joint Genome Institute (JGI) provided sequencing and analyses under the Community Sequencing Program supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We are particularly grateful to Dan Rokhsar and Kerrie Barry for providing leadership for the project at JGI. We thank Justin Borevitz, Anne Hall, Charles Langley, June Nasrallah, Barbara Neuffer, Outi Savolainen and Stephen Wright for contributing to the initial sequencing proposal submitted to the Community Sequencing Program at JGI, Christa Lanz and Kenneth Lett for technical assistance, and Peter Andolfatto and Rod Wing for comments on the manuscript. This work was supported by NSF DEB-0723860 (B.S.G.), NSF DEB-0723935 (M.N.), NSF MCB-0618433 (J.C.C.), NSF IOS-0744579 (M.E.N.), NIH GM057994 (J.B.), grant GABI-DUPLO 0315055 of the German Federal Ministry of Education and Research (K.F.X.M.), ERA-PG grant ARelatives from the Deutsche Forschungsgemeinschaft (D.W.) and Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT) and the Inter-University Network for Fundamental Research (P6/25, BioMaGNet) (Y.V.d.P.), a Gottfried Wilhelm Leibniz Award of the DFG (D.W.), the Austria Academy of Sciences (M.N.), and the Max Planck Society (D.W. and Y.-L.G.).
Methods and any associated reference are available in the online version of the paper at http:/www.nature.com/naturegenetics/.
Note: Supplementary information is available on the Nature Genetics website.
J.B., J.C.C., B.S.G., I.V.G., Y.-L.G., K.F.X.M., M.N., Y.V.d.P. and D.W. conceived the study; M.E.N. provided the biological material; J.C., J.-F.C., R.M.C., N.F., J.G. and Y.-L.G. performed the experiments; E.G.B., J.A.F., N.F., H.G., Y.-L.G., G.H., J.D.H., T.T.H., R.P.O., S.O., P.P., A.A.S., J.S., K.S., M.S., X.W., and L.Y. analyzed the data; and Y.-L.G., T.T.H., M.N. and D.W. wrote the paper with contributions from all authors.
COMPETING INTEREST STATEMENT
The authors declare no competing financial interests.