|Home | About | Journals | Submit | Contact Us | Français|
Sequence analysis of organelle genomes has revealed important aspects of plant cell evolution. The scope of this study was to develop an approach for de novo assembly of the carrot mitochondrial genome using next generation sequence data from total genomic DNA.
Sequencing data from a carrot 454 whole genome library were used to develop a de novo assembly of the mitochondrial genome. Development of a new bioinformatic tool allowed visualizing contig connections and elucidation of the de novo assembly. Southern hybridization demonstrated recombination across two large repeats. Genome annotation allowed identification of 44 protein coding genes, three rRNA and 17 tRNA. Identification of the plastid genome sequence allowed organelle genome comparison. Mitochondrial intergenic sequence analysis allowed detection of a fragment of DNA specific to the carrot plastid genome. PCR amplification and sequence analysis across different Apiaceae species revealed consistent conservation of this fragment in the mitochondrial genomes and an insertion in Daucus plastid genomes, giving evidence of a mitochondrial to plastid transfer of DNA. Sequence similarity with a retrotransposon element suggests a possibility that a transposon-like event transferred this sequence into the plastid genome.
This study confirmed that whole genome sequencing is a practical approach for de novo assembly of higher plant mitochondrial genomes. In addition, a new aspect of intercompartmental genome interaction was reported providing the first evidence for DNA transfer into an angiosperm plastid genome. The approach used here could be used more broadly to sequence and assemble mitochondrial genomes of diverse species. This information will allow us to better understand intercompartmental interactions and cell evolution.
To date, 23 mitochondrial genomes in seed plants have been fully sequenced and analyzed http://www.ncbi.nlm.nih.gov/Genomes/. These mitochondrial genomes are extremely variable in size, ranging from 221kb (Brassica napus) to 2,740kb (Cucumis melo). Sequence analysis revealed that the most abundant portion of the mitochondrial genomes is non-coding , which includes “promiscuous” DNA of plastid and nuclear origin , as well as sequences of horizontal origin from foreign genomes [3-6]. Structural analysis, through use of Southern hybridization or paired-end data, revealed a high frequency of intra- and intermolecular recombination due to accumulation of repetitive sequences. This process has generated a structurally dynamic assemblage of genome configurations within a species [7-9] and a scrambling of gene order within closely related species . This dynamic organization of the plant mitochondrial genome provides a powerful model for the study of genome structure and evolution. In addition, the increasing availability of plant organelle and nuclear genome sequence data provides an understanding of the mechanisms driving plant genome evolution. Indeed, there is a strong structural and functional interaction among plastid, mitochondrial, and nuclear genomes [11,12]. Transfer of DNA among these three compartments in higher plants has been reported, with exception of transfer into the plastid genome [13,14].
Despite its importance, technical obstacles of DNA isolation and sequence assembly limit the sequencing of mitochondrial genomes. Conventional approaches to mitochondrial genome sequencing involve extraction and enrichment of mitochondrial DNA, cloning, and sequencing. Large repeats and the dynamic mitochondrial genome organization complicate sequence assembly.
The development of next generation sequencing technologies (NGS), such as the Roche and Illumina platforms, provides a new opportunity for rapid characterization of mitochondrial genomes. Non-enriched whole genome DNA libraries, both shotgun and paired-end, include plastid and mitochondrial DNA that is sequenced along with the nuclear DNA during the sequencing run. NGS technologies have already been used for sequencing the small mitochondrial genome of nematodes [15,16], human  and fish  with no library enrichment. Recently, sequencing data from non-enriched libraries has been successfully used to assemble plastid genomes of wild and domesticated rice, mung bean, date palm, and milkweed [19-22]. The major limitations for use of this approach on de-novo assembly of mitochondrial genomes are the ability to overcome assembly problems related to large repeat regions, presence of promiscuous DNA, and sequence ambiguity due to sequencing technologies.
The aim of this study was to demonstrate how next generation sequence data from total genomic DNA can be used to de-novo assemble the mitochondrial genome of carrot (Daucus carota). In addition, intergenic sequence analysis provides evidence of a rare transfer of DNA into the plastid genome. This is the first report of mitochondrial genome transfer into an angiosperm plastid genome. The strategy used in this study has broad application to explore more mitochondrial genomes, to further investigate intra-cellular genome interaction and genome evolution.
Five different sets of 813,770, 814,668, 771,864, 704,918 and 692,688 454 shotgun reads (Table1) with an average reads length ranging 354 to 419 nt, corresponding to an estimated nuclear genome coverage of 0.6×/set were used for initial assembly and 570,590 3kb paired-end reads were used for connection verification. In addition, 50,598,879 Illumina reads of length 100 nt were used to correct homopolymer ambiguity. Each sequences set was independently assembled, each producing from 36,602 to 25,841 contigs (Table1). Plastid assemblies resulted in five master circles with length ranging from 155,771 to 155,849 nt (Table1). These five assemblies were then aligned with Kalign , generating a consensus sequence of 155,834 nt (Table1). Alignment of the de novo assembly with the published carrot plastid genome  showed full-length coverage, with 99% identity relative to the published sequence including 49 nt of difference in cumulative sequence SNPs, and 433 nt of cumulative indels, with a maximum indel length of 20 nt.
Mitochondrial assemblies were based on contig connections. In order to visualize these contig connections we developed and used bb.454contignet [http://www.vcru.wisc.edu/simonlab/sdata/software/]. The tool allows visualization of connections between gsAssembler contigs, along with contig size and average read coverage. This visualization allowed us to establish single and repeat contigs.
With 15 single copy and 7 repeated contigs from sequence set 4 the mitochondrial genome could be arranged in two possible master circles of 281,042 nt (Table1, Figure1, Additional file 1: Figure S1). Sequence analysis identified four large repeat regions, five single copy regions, and 12 possible connections between repeat sequences and flanking regions in the master circles. Sequence sets 1, 2, 3, and 5 had some small gaps in their assemblies, due to 3, 1, 1, and 3 missing connections, respectively (Additional file 2: Figure S2). Locations of these gaps were never shared between assemblies. Alignment of the five assemblies gave a complete consensus sequence of 281,079 nt (Table1, Additional file 2: Figure S2). In order to confirm the 12 possible connections between repeat to repeat and repeat to single copy regions, we performed PCR and sequenced all possible amplicons spanning those regions. Sequence of these amplicons confirmed all 12 expected connections (Additional file 1: Figure S1). As a second verification, we mapped a set of 570,590 3kb whole genome paired-end reads onto the mitochondrial assembly, with both ends ≥50 nt aligned. Mapped reads with ≥95% similarity and ≥85% of the length matching the assembled sequence, and reads that aligned at least once within a range of 1,000 to 5,000 nt of each other, were considered in agreement with the master circle assembly and 9,134 reads mapped at least once within this range covering the entire assembled genome (Figure2, green lines). By contrast, reads appearing to be in disagreement (Figure2, red lines) were alternative mappings near repeat region borders (Figure2, examples 1, 2, white numbers) or with regions with plastid similarity that are within the expected range in the plastid genome and outside of the range in the mitochondrial genome (Figure2, examples 3, 4, 5). These results confirmed repeat connections as well as contig connections.
In the case of sequences of ambiguous source, common to both mitochondrial and plastid genomes, we determined the true mitochondrial-version of these common sequences. We amplified and sequenced three regions with size of 1,298, 1,545 and 2,257 nt using the Sanger method. PCR results confirmed the size of the expected amplicons and these sequences were then incorporated into the mitochondrial assembly.
As 454 pyrosequencing is known to include insertion/deletion errors due to homopolymeric repeats , we corrected homopolymer regions of≥5nt by mapping 51 million Illumina reads (corresponding to a 10x nuclear genome coverage) with quality ≥28, excluding the three regions common to plastid and mitochondrion, independently sequenced by the Sanger method (see above). In total 1,593 homopolymer sequences were detected, 23 in coding and 1,570 in non-coding regions (Table2). After mapping, 40 errors were corrected, 6 in coding and 34 in non-coding regions, for a total length of 102 nt and an error rate of 1.14%.
In addition to homopolymers, we also looked for ambiguity due to SNPs or single base insertions/deletions. Overall, no ambiguities were detected resulting in a final assembly of 281,132 nt.
Assemblies of mitochondrial and plastid genomes in this study were carried out using 454 sequence data from whole genome libraries (see Materials and Methods). In order to evaluate plastid and mitochondrial genome coverage obtained from all sets of shotgun 454 and Illumina reads used in this study, we mapped sequences against the assembled plastid and mitochondrial genomes using GNUMAP .
Mapping of five sets of 454 shotgun reads (813,770, 814,668, 771,864, 704,918 and 692,688 reads) gave a coverage ranging from 123× to 173× for plastid, and from 16× and 23× for mitochondrial genomes (Additional file 3, Table S1). Overall, 249,302 and 50,802 reads mapped in the plastid and mitochondrial genome suggesting that about 6.5% and 1.3% of the sequenced DNA was of plastid and mitochondrial origin, respectively.
For Illumina sequences, out of 51 million reads (corresponding to a 10× nuclear genome coverage), 6,949,870 sequences were successfully mapped to the plastid genome and 1,365,825 sequences were mapped to the mitochondria giving a 4,385x and 452x coverage, respectively (Additional file 3: Table S1). This suggests that about 13.7% and 2.7% of the sequenced DNA was of plastid and mitochondrial origin, respectively.
Analysis of GC content across the genomes indicate that in regions where the GC content is high the Illumina read coverage is reduced (Additional file 4: Figure S3 A1, A2). By contrast, high GC content did not affect 454 read coverage (Additional file 4: Figure S3 B1, B2).
These results confirmed that unenriched whole genome sequencing is a practical approach for de novo assembly of higher plant plastid and mitochondrial genomes.
To evaluate the consistency of the assembly and possible recombination across repeats, a Southern blot experiment was carried out. Total genomic DNA was digested with two enzymes, XhoI and Eam1105I and hybridized with probes designed within repeats 1–3 and 4. Consistent with the assembled genome sequence, an expected hybridization pattern for both XhoI and Eam1105I was observed (Figure3, A). In addition, the digestion pattern of repeat 4 with both XhoI and Eam1105I was consistent with the expected size fragments (8,653 nt and 13,726 nt for XhoI and 8,980 nt and 24,224 nt for Eam1105 I) of the recombinant sub-circles 1 and 2 (Figure3B, Additional file 5: Figure S4).
The 281,132 nt mitochondrial genome presented in Figure4 is one of the two possible master circle conformations. After Brassica napus (221,853 nt) and Silene latifolia (253,413 nt), the carrot mitochondrial genome is one of the smallest mitochondrial genomes sequenced to date among the angiosperms. The overall GC content of carrot (45.4%) is comparable to other angiosperms [27,28] (Table3). Considering alignments with minimum length of 50 nt, the portion of the genome exhibiting 80% or more similarity with Nicotiana, Vitis and Carica constitutes 47.6%, 45.0% and 44.5%, respectively. Intergenic spacer regions represent the largest part of the genome with 224,526 nt (79.9%). Coding-exons represent 16.2% (46,063 nt) of the genome, and RNA-coding genes constitute 3.7% (10,547 nt), with rRNA accounting for 3.0% (8,565 nt) and tRNA accounting for 0.7% (1,982 nt) of the genome (Table3).
A BLAST search against a local mitochondrial protein and nucleotide database identified 44 putatively functional protein-coding genes and 3 genes coding for ribosomal RNAs (Additional file 6: Table S2). For all of these genes, open reading frames (ORFs) were detected, confirming the quality of the consensus sequence. Distribution of genes is illustrated in Figure4. As in other angiosperm mitochondrial genomes, most of the known genes encode for proteins of the electron transport chain, with 9 subunits of complex I (nad12344L567 and 9), one subunit of complex III (cob), three subunits of complex IV (cox12 and 3), five subunits of complex V (atp1468 and 9) and four genes involved in the biogenesis of cytochrome c (ccmBccmCccmFc and ccmFn). Truncated copies of atp1 and atp9 were detected, confirming observations previously reported by [29,30]. In addition, two conserved ORFs, coding for a maturase (matR) mapped within the nad1 intron and another ORF named mttB coding for a transporter protein, were detected.
The number of ribosomal and succinate dehydrogenase proteins is usually variable among different species (Additional file 6: Table S2). Annotation of the carrot mitochondrial genome allowed identification of 9 putatively functional genes coding for ribosomal proteins (rpl5, rpl10, rpl16, rps1, rps3, rps4, rps7, rps12 and rps13) of mitochondrial origin and one, rpl2 of plastid origin. The genome appears to lack functional copies of the succinate dehydrogenase genes sdh3 and sdh4.
Because they are in repeat regions, one gene (atp8) is duplicated, and six genes occur in triplicate (ccmB, cox3, nad3, rpl5, rps1 and rps12). Detection of ORFs allowed identification of 19 group II introns, 7 of which are trans-spliced.
BLAST searches against a local nucleotide mitochondrial database and detection with tRNA scan-SE allowed identification of genes coding for 18 tRNAs (Table4). One is a pseudogene, coding for a plastid-derived tRNA-Ile (UAU) with a deletion in the coding sequence. Out of the 17 tRNAs, four are of plastid origin and 13 of mitochondrial origin. These tRNA genes recognize 15 amino acids (Phe, Met, Ser, Pro, Tyr, His, Gln, Asn, Lys, Asp, Glu, Cys, Trp, Ser and Gly). Thus, tRNA genes for six amino acids are missing in the carrot mitochondrial genome.
In order to investigate the presence of the seven missing mitochondrial genes (sdh3, sdh4, rpl2, rps2, rps10, rps14 and rps19) in the nuclear genome we aligned the corresponding sequences from the mitochondrial genome of the species most closely related to carrot (see Materials and Methods) against a local database containing all the assembled sequences and raw reads used in this study. For three genes, sdh3, rps10 and rps14, a corresponding contig with conserved ORFs was detected in the nuclear genome (data not shown). None or insignificant matches were identified for the remaining four genes, sdh4, rpl2, rps2 and rps19 suggesting that these genes were not present in this dataset.
After annotation of conserved genes, we searched for other ORFs and plastid regions in the intergenic sequences and ORFs longer than 300 nt were annotated (Figure4, orf26–60). In total, 35 ORFs ranging from 300 to 1,122 nt in length were detected (Figure4). Out of these, 3 ORFs were conserved among mitochondrial genomes, with the best match of orf58 to orf147 of Nicotiana tabacum, of orf59 to orf115b of Brassica napus and of orf60 to orf122 of Beta vulgaris. Out of the remaining ORFs, 28 (orf26–53) had unknown function, and 4 (orf 54–57) were similar to truncated copies of hypothetical proteins from Vitis vinifera and Arabidopsis thaliana nuclear genes.
BLAST analysis of intergenic sequences against a local database containing all published plant plastid genomes allowed detection of 16 fragments with similarity to plastid sequences, with length ranging from 45 to 1,298 nt, accounting for 5,701 nt (2.0% of the genome) (Figure4, green box).
With a total of 74 repeats ranging from 37 to 14,749 nt, the carrot mitochondrial genome has the lowest number of repeats among the sequenced plant mitochondrial genomes (Table5, Figure4, red and blue line). All but one are dispersed repeats. Most of the repeats (about 90%) are between 20 and 202 nt in length accounting for just 2.0% of the total genome coverage. Nine large repeats ranging from 4,220 to 14,749 nt account for 44.0% of the genome. Analysis of repeat orientation allowed the detection of 52 direct repeats (Figure4, blue line) and 22 inverted repeats (Figure4, red line). The insertion of the large repeat 1, between repeat 2 and 3, forms a 35kb super-repeat. After wild cabbage , this is the largest repeat region described in eudicot mitochondrial genomes to date.
Recently Goremykin and colleagues , while analyzing the Vitis vinifera mitochondrial genome, detected two sequences of 74 (Figure5, region 2) and 126 nt (Figure5, the left terminal part of region 3) which were similar to the carrot plastid genome (positions 99,364-99435 and 99,437-99561 of GenBank: NC 008325). Both sequences are contained in a 1,452 nt carrot plastid sequence (position 99,297-100,748) that is absent in other published plastid genomes. BLAST analysis revealed that the 74 nt sequence is similar to cox1 gene, suggesting a transfer of DNA from the mitochondrial to the plastid genome. Given this unexpected finding and no experimental confirmation, the authors concluded that an evaluation of this region of the carrot plastid assembly would be needed. In order to refer to this sequence we coded it as Daucus carota Mitochondrial Plastid sequence (DcMP).
Here we confirm the presence of this region in the carrot plastid genome in the inbred carrot line B493B analyzed in this study (Figure5ADcMP region 1, 2, 3). In addition, analysis of intergenic sequences of the carrot mitochondrial genome revealed that this entire sequence is present in the mitochondrial genome as three separate pieces. Comparison of D. carota to Petroselium crispum [GenBank: NC 015821] revealed that the carrot plastid genome lacks a 339 nt sequence (Figure5A, position 101,297-101,636 of P. crispum plastid genome) present in the P. crispum and other more taxonomically distant species because of the replacement with the DcMP region.
In order to clarify the origin of this region, we first designed primers specific of flanking regions 1, 2 and 3 of the plastid genome or region 3 of the mitochondrial genome. We then amplified and sequenced these regions in different Apiaceae species. PCR results revealed a consistent conservation of the region 3 in the mitochondrial genomes; in fact an expected amplicon of 1,935 nt was amplified across all species (Figure5B, left). By contrast, a polymorphic pattern has been obtained when the plastid flanking sequences were used to anchor the primers (Figure5B, right). A short fragment of 500 nt was amplified in all non Daucus species including parsley (Petroselium crispum), fennel (Foeniculum vulgare), celery (Apium graveolens), plains eryngo (Eryngium planum) and 400 nt in coriander (Coriandrum sativum). The only exception was for cumin (Cuminum cyminum) which has the same pattern as carrot. With the exception of D. aureus, D. crinitus and D. muricatus, where amplicons of about 2,000, 2,000 and 1,100 nt were obtained, respectively, other Daucus species and cumin produced the amplicon of expected size, 1,618 nt.
Sanger sequence of above described PCR products confirmed the presence of the large region 3 in the mitochondrial genome of all Daucus and non Daucus species. It also confirmed, the absence of the insertion of regions 1, 2 and 3 in the plastid genome of non Daucus species (Figure5C). These sequences and its taxonomic distribution in Daucus and close relative cumin [64; see Discussion] provide strong evidence of a transfer of DNA into the plastome. In addition a few variants were observed. An insertion observed in the plastid DcMP region of wild species D. aureus, D. crinitus and D. glochidiatus is similar to a mitochondrial sequence contiguous to the DcMP region 3 (Figure5 A and C, region 4). This suggests that this region now identified as region 4 was part of the original insertion in the plastid genome of the ancestral progenitor of carrot. Sequence analyses of the plastid D. murucatus DcMP region revealed that the shorter amplicon is due to a deletion of regions 1 and 2.
To investigate the sequence characteristics that might provide a possible explanation of this transfer in the carrot plastid genome, we aligned the plastid and mitochondrial sequences of regions 1, 2 and 3 against NCBI nucleotide and protein databases. The alignment demonstrated that region 1 was an intergenic sequence with no similarity to any other nucleotide or protein sequence present in databases. As previously reported by Goremykin and colleagues , region 2 is part of the cox1 gene. It is more interesting that blastx alignment of region 3 revealed multiple matches with nuclear retrotransposable elements from different species including Vitis vinifera [GenBank: CAN77047 and CAN79321] and Populus trichocarpa [GenBank: XP 002329638 and XP 002334157] (Additional file 7: Figure S5).
Next generation sequencing technologies have revolutionized molecular biology by making whole genome sequencing projects possible for any species. Low coverage whole genome shotgun sequencing has proven a valuable approach for rapid and easy acquisition of a large amount of sequence data at relatively low cost. Low coverage data is useful for marker acquisition as well as the assembly of plastid genomes [20,21,31-33]. Despite its power, few examples of the application of NGS on de novo assembly of plant mitochondrial genomes have been reported. Recently Straub and colleagues  attempted to assemble the milkweed mitochondrial genome using Illumina data from unenriched whole genome DNA. The assembly resulted in a partial mitochondrial genome assembly of 115 contigs. Here we demonstrate how 454 data from whole genome DNA library can be used for a complete de novo assembly of the mitochondrial genome of Daucus carota. Adequate coverage, sufficient read length, and application of a bioinformatics analysis using a tool such as bb.454contignet were crucial for the final assembly.
Our sequencing library was prepared from whole genome DNA. The fraction of reads of plastid and mitochondrial genome origin obtained from such a library can vary depending on the age of tissue, DNA extraction method, relative size of the organelle and nuclear genomes, and species . Adequate coverage is always necessary for a complete plastid and mitochondrial genome assembly, but this coverage likely can be obtained, even with a low nuclear genome coverage. The fact that we were near the lower limit can be seen from a comparison of the five assembly replicates, which suggests that if the coverage is too low, the complete assembly of the mitochondrial genome can begin to fail apparently due to isolated gaps in the sequence due to random chance.
In our study, leaf tissue was collected from plants without any pretreatment (such as darkness) that could reduce the amount of plastid or mitochondrial DNA. The same DNA extraction was sequenced by both platforms. We observed approximately twice the organellar fraction from Illumina sequencing, relative to 454. The interpretation of this observation is difficult as many factors could be involved. Here we observed that Illumina is more sensitive to regions with variable GC content than is 454. In a range of GC content of 20–50% other authors reported a positive correlation between GC contents and sequence coverage [35,36]. Since organellar DNA is typically higher in GC content than nuclear DNA, we suggest this as one of possible cause of the higher representation of organellar fraction from Illumina sequencing.
Development of new bioinformatic tools is often a crucial point for interpreting DNA/RNA sequence information into biological information . Repeat regions are the primary difficulty in genome assembly, and the length of reads relative to the length of repeats is a key factor in the ability to assemble a genome . Reads from repeats longer than the read length could assemble to multiple possible sequences, some of which may not exist in vivo. These cannot be collapsed into a single unambiguous sequence, thereby causing breaks in contigs, where one end of a contig can then lead to two or more adjacent contigs which continue the sequence. Some assemblers such as gsAssembler make this connection information available. We developed and used bb.454contignet http://www.vcru.wisc.edu/simonlab/sdata/software/ to parse, filter, and then graphically visualize these contig connections. This tool was very useful for our small-scale assembly project, and it could be suitable for other applications, such as assembly of Bacterial Artificial Chromosome sequences, or specific regions in a whole genome assembly. As confirmation of the validity of our procedure, another group has successfully assembled another genome using this tool . A similar tool, ABySS-Explorer has been developed for assemblies generated by ABySS . However, our results demonstrate the utility of shotgun 454 sequences with a size of 300–500 nt, making the experiment cost efficient, and we confirmed the assembly by PCR, Southern blot and mapping 3kb paired end data.
Full de novo assembly of the mitochondrial genome can be very difficult due to repeat regions as well as its dynamic organization due to frequent recombination events. Recent advances in Illumina and Roche 454 sequencing technologies have allowed increasing read length up to 150 and 1000 nt, respectively [http://www.illumina.com, http://my454.com]. In addition, both NGS technologies officially support paired sequence data of up to 5kb for Illumina platform and up 20kb for Roche 454 platform. Comparison of the distribution of the number and the length of the repeat sequences among sequenced mitochondrial genomes and the whole range of longest sequencing data achievable by both technologies demonstrates the power of NGS to overcome most of the assembly ambiguities due to repeats across these genomes. Moreover, use of long shotgun sequences alone, such as 454 reads, would resolve assembly ambiguity across a large portion of repeat regions detected among these mitochondrial genomes (Additional file 8: Figure S6). By contrast, use of short shotgun reads such as Illumina could still result in a high number of contigs and connections that could make the assembly unreliable. Paired sequence data can support and confirm assembly, and supply information about structural variants due to recombination across repeats. The ongoing development of third generation sequencing technologies such as PacBio [http://www.pacificbiosciences.com/], able to produce longer reads, will facilitate sequencing of new organelle genomes.
The usefulness of new organelle genome sequences will be reduced if their quality is reduced by sequence errors. Next generation sequencing technology is known to generate a sequence error rate higher than the traditional Sanger method . Most of these errors can be corrected with high sequence coverage. However, errors due to inaccurate determination of homopolymer length, particularly when ≥5 nt, are characteristic of pyrosequencing (Roche 454), and are not resolvable by higher coverage. These errors can result in an inaccurate sequence annotation if the error occurs in a reading frame. In our study, homopolymer ambiguity was detected in both coding and non coding sequences. Availability of Illumina sequences allowed us to correct these homopolymer errors as described above. After corrections, ORF detection confirmed contiguous reading frames of all of the genes predicted by BLAST alignment, an additional verification of the correctness of the finished sequences. Combining these two technologies for this type of correction has already been described . The high number (1,593) of homopolymer regions observed in this study would make verification of every homopolymer region by Sanger sequencing uneconomical. By contrast, the low number of ambiguities in coding regions  in such a small scale sequencing project would allow the use of a Sanger sequencing approach as an alternative cost-efficient way to correct ambiguity in those regions, and insure correct annotation of genes.
Carrot is the first mitochondrial genome sequenced among the large euasterid II clade, and only the second, along with N. tabacum, among asterids. The size of the genome (281,132 nt) is similar to a previous estimation (255,000 nt) made by Robinson and Wolyn  based on restriction digestion mapping. Sequence analysis revealed that the closest mitochondrial genome in terms of similarity was tobacco, confirming their phylogenetic relationship in the euasterid II clade. The gene content of the carrot mitochondrial genome confirms the previous report of Adams et al.  where the authors used Southern hybridization to survey mitochondrial gene presence or loss across 280 angiosperms. The major factor accounting for gene loss in the mitochondrial genomes is transfer to the nuclear genome. Presence or absence of mitochondrial genes in the nuclear genome have been reported for several species [45-48], including the transfer of rps10 and a lack of sdh3 and sdh4 in the nuclear genome of carrot. In contrast with these data, our results revealed the presence of a nuclear copy of sdh4 in the carrot genome. In addition, we confirmed the transfer of rps10 from the mitochondria into the nucleus and detected an additional nuclear copy of rps14. We also found a replacement of the original mitochondrial rpl2 with a copy of the plastid rpl2 in the mitochondria. The high level of similarity between the rpl2 nucleotide sequences in these two organellar genomes (99%) suggests a recent translocation. The lack of any sequence similar to the mitochondrial rpl2 gene in the nuclear genome suggests that the original mitochondrial version has been lost. According to Kubo and Arimura , the replacement of a mitochondrial gene by a duplicated plastid counterpart could be due to the complete loss of the gene in either mitochondrial or nuclear genomes. However, detection of the ORFs for nuclear sdh3rps10rps14 and plastid rpl2 suggests the possibility of functionality. Further work will be needed to determine whether these nuclear copies are expressed.
The structure of angiosperm mitochondrial genomes is frequently characterized by repeat sequences . The number and the size of these repeats is important, as they influence the size of the genome, and they are the sites of intragenomic recombination, underlining evolutionary changes in mitochondrial genome organization and structural dynamism in vivo[52,53]. A high proportion of smaller repeat sequences explained (<1kb) the size of the largest mitochondrial genome of Cucumis melo. In contrast, in maize , variable genome size among different genotypes is mainly caused by large (>1kb) sequence duplication. The structure of the carrot mitochondrial genome presented here revealed that nearly half (46%) of the genome is composed of large repeated sequences. The fact that carrot has the least number of repeat sequences among sequenced mitochondrial genomes and the presence of four large repeats (1–14kb), we deduced that the increase of the carrot mitochondrial genome size mainly occurred by large sequence duplication.
Intragenomic recombination is an active phenomenon in the mitochondrial genome of angiosperms [53,54]. Recombination frequency seems to depend on the size of the repeats. Large size direct repeats (>1kb) are associated with high frequency of homologous recombination producing sub genomic molecules observed in other mitochondrial genomes [53,54]. Our Southern blot results provide evidence of recombination across a large repeat 4 (Figure3), and the in vivo co-existence of sub-genomic circles. Smaller repeats, however, suggest very low recombination frequency, producing aberrant asymmetric recombinant molecules that are generally present at very low copy number and hard to detect [55,56]. In this study, we tested a possible recombination across short repeats using next generation sequencing data from Roche 454 and Illumina platforms and our results did not support any recombination events (data not shown). It was shown recently that mitochondrial intergenomic recombination required stretches of similarity longer than 500 nt and remained under control of a nuclear gene msh1. Inactivation of msh1 lead to increased frequency of recombination and allowed recombination between repeats only 50 nt-long .
Intercompartmental (plastid, mitochondrion and nucleus) DNA migration is a known phenomenon in plant cell evolution. DNA transfer among these compartments resulted in a functional relocation of organellar genes in the early phase of organelle evolution [13,56-60]. Theoretically, six intercompartmental DNA migrations are possible, and at least four have been observed in angiosperms (Additional file 9: Figure S7). Transfers of DNA from mitochondrion to nucleus and from plastid to mitochondrion are discussed above (see Introduction and Results section). DNA migration from plastid to nucleus has been described in Arabidopsis, soybean and other species [61,62]. Nuclear DNA has been transferred to the mitochondrial genome of Cucumis melo and other species [10,63]. In contrast, no evidence of transfer of DNA from nucleus or mitochondria to plastid has been reported in angiosperms . Smith  investigated the presence of mitochondrial DNA in the plastid genome of 42 species including 11 angiosperm, and found a complete absence of mtDNA like sequences in any of the plastid genome of these species. Recently Goremykin and colleagues  reported a fragment of mitochondrial DNA in Vitis with similarity to a sequence of 1452 nt present only in the carrot plastid genome.
In this study, we confirmed the presence of this sequence in the plastid genome of carrot, and discovered that it was also present as three non-contiguous sequences in the mitochondrial genome. PCR results and sequence analysis clearly document conservation of a large portion of this sequence (DcMP region 3, Figure5B) across all mitochondrial genomes of the diverse Apiaceae species examined, and a “gain pattern” only in Daucus and Cuminum plastid genomes. This evidence strongly suggests a transfer of DNA into the plastid genome. In addition, the observation of the insertion of the DcMP region 4 in the plastid genome of wild Daucus species D. aureusD. crinitus and D. glochidiatus led us to hypothesize that in the ancestor of carrot the DcMP regions 1, 2, 3 and 4 were contiguous when the transfer event occurred. Assuming a single transfer event, the loss of the sequence contiguity in the mitochondrial genome could be due to a recombination of that genome, as is often observed in plant mitochondrial genomes. According to Downie et al. , Cuminum is the closest relative to carrot of the taxa examined here. The PCR patterns obtained across species for the plastid copy of DcMP concur with that classification and suggest that the presence of the DcMP region can be a distinctive feature of species from the Daucus clade.
The fact that the DcMP3 sequence has no matches in sequence data bases makes interpretation of its origin difficult. Limited similarity with a retrotransposon element (in the range of 50–60%; Additional file 7: Figure S5) of Vitis vinifera and other species suggests that a retrotransposable element might have been moved from the nuclear genome into the mitochondrial and then into the plastid genome, or directly and independently from the nuclear genome to plastid and mitochondrial genomes. The fragment of 125 nt at the 5′ end of DcMP3 is very conserved across several even more diverse angiosperm mitochondrial genomes, which suggests a mitochondrial origin, and also suggests that the transfer occurred from mitochondrial to plastid genome. However, considering the exceptional DNA up take system of plant mitochondrial genomes [4,65], other sources of DNA donors via horizontal transfer or transfer from the nuclear genome cannot be excluded. Rare insertions of retrotransposon elements into the plastid genome have been observed in the alga Chlamydomonas reinhardtii, but based on our knowledge this is the first evidence of transfer of DNA from mitochondrial genome to plastid genome, and the potential presence of a possible retrotransposon in the plastid genome of a flowering plant.
Our results confirmed that unenriched whole genome sequencing is a practical approach for de novo assembly of higher plant mitochondrial genomes. Sequence analysis confirmed the dynamic organization of the mitochondrial genomes in carrot as in other mitochondrial genomes. This first report of a DNA insertion from the mitochondria in the angiosperm plastid genome suggests that new and unknown mechanisms in intercompartimental genome interactions may exist. The new approach developed here for assembly of the mitochondrial genome used here could be used more broadly to sequence and assemble mitochondrial genomes of diverse plant species. This would supplement the currently limited number of sequenced mitochondrial genomes, and may allow us to better understand intra- and inter-genomic DNA transfers and recombination, and this new instance of transfer of DNA to plastid genomes could possibly have occurred in other angiosperms.
A single male fertile plant of USDA carrot inbred line B493B  was used in this study (Additional file 10: Table S3). Leaf tissue was lyophilized, and extracted with DNeasy Plant Maxi Kit (Qiagen). The extracted DNA was analyzed for potential degradation by gel electrophoresis, and DNA concentration was quantified using Pico Green (Invitrogen, Pisley, UK).
DNA samples used for PCR verification of DNA transfer (Coriandrum sativumPetroselium crispum cv “Hamburg”, Cuminum cyminumApium graveolens ssp. repaceum cv ‘Brilliant’, Anethum graveolensFoeniculum vulgareDaucus carota ssp. sativusDaucus muricatusDaucus glochidiatusDaucus crinitusDaucus capillifoliusDaucus aureusDaucus sahariensisDaucus pusillus, and Eryngium planum) (Additional file 10) were extracted as described by Murray and Thompson  and quantified using a NanoDrop spectrophotometer (NanoDrop Technologies).
454 sequencing was performed with a GS-FLX platform (Roche, CT, USA), and Illumina sequencing was performed with the HiSeq 2000 platform (Illumina, San Diego, CA). In both cases, the sequencing was done at the University of Wisconsin, Biotechnology Center (University of Wisconsin, Madison, USA). Shotgun libraries were prepared and sequenced according to the manufacturers’ instructions. For 454 sequences sets of 813,770, 814,668, 771,864, 704,918 and 692,688 shotgun reads corresponding to an estimated nuclear genome coverage of 0.6×/set were used for initial assembly and 570,590 3kb paired-end reads were used for connections verification. In addition, 50,598,879 Illumina reads were used to correct homopolymer ambiguity.
Each set of 454 sequences was independently assembled using gsAssembler v.2.6 (Newbler) (454 Life Sciences Corp, CT, USA). Parameter settings are listed in additional file 11: Table S4. In order to identify contigs of plastid and mitochondrial origin, assembled sequences were aligned against the carrot plastid genome [GenBank: DQ898156] and tobacco mitochondrial genome [GenBank: NC006581] using MUMmer 3.22 ; http://mummer.sourceforge.net/. Contigs with similarity to atpA for plastid, or atp1 for mitochondrion, were used as starting points for a de novo assembly of contigs. In order to visualize the contig connection information in the 454ContigGraph.txt file produced by gsAssembler, we created the program bb.454contignet (http://www.vcru.wisc.edu/simonlab/sdata/software/). Parameters are listed in additional file 11: Table S4. In order to generate a consensus sequence the five replicate assemblies, sequences were aligned using kalign  and a consensus generated using a custom Perl program.
To verify contig and repeat connections, sequences spanning those connections were amplified and sequenced using primer pair 1–12 listed in Additional file 11: Table S5. Fragments were amplified in a 20μl PCR reaction: 13μl water, 2μl 10x DNA polymerase buffer, 0.8μl dNTPs (2.5mM each), 1μl 5μM of each primer, 0.2μl Taq polymerase (MBI, Fermentas, USA) and 2μl of genomic DNA (~50ng). Amplification conditions were: initial denaturation at 94°C for 2min., followed by 35 cycles of 94°C for 45sec., Tm (°C) for 1.0min., 72°C for 1.0min. and 20sec., and a final step at 72°C for 10.0min. Electrophoresis was carried out for 2–3 hours at 100V on 2% agarose TAE gels supplemented with 0.2μg/ml of ethidium bromide. PCR products were then sequenced in a 5 μl reaction including, 1.75μl of water, 1μl of 5μM primer, 0.75μl 5× BigDye®3.1 sequencing buffer, 0.5μl of BigDye®3.1 ready reaction mix and 1μl of PCR product, previously diluted 1:10 with water. Amplification conditions were: 25 cycles of 96°C for 10sec., and 58°C for 2min., and a final step at 72°C for 5.0min. The sequences were analyzed on an ABI 3730xl DNA Analyzer and analyzed using Sequencher software version 4.8 (GeneCodes Corporation, Ann Arbor, MI).
Sequences similar to plastid and mitochondrial genomes longer than 358 nt (which corresponds to the average read length) were amplified using primer pairs 13–18 reported in additional file 12: Table S5. PCR assay and sequence analysis were carried out as described above.
In addition, the consensus sequence was compared using blastn to a local database containing a set of 3kb 454 paired-end reads and results were filtered and visualized with Circos (http://circos.ca/).
To correct for homopolymer length errors and evaluate genome coverage, both Illumina and 454 shotgun reads were mapped to the mitochondrial consensus sequence using GNUMAP v.3.0.0 , http://dna.cs.byu.edu/gnumap/. Homopolymer regions of ≥5nt were corrected by a custom Perl program when a majority of Illumina reads indicated a different length, in most cases of correction this became 1 nt longer.
The DNA was digested in separate tubes with the Fermentas enzymes XhoI, Eam1105I (AhdI) and KspAI (HpaI). Each digestion was performed overnight in 37°C in 40μl containing approx. 1μg of DNA. After restriction the samples were supplemented with 8μl of the 6X Loading Dye Solution (Fermentas) and run in a 0.7% agarose gel in TAE buffer, during the first hour at 0.9V/cm and then for the next 15h at 0.5V/cm. After electrophoresis the gel was subjected to depurination, denaturation, neutralization and capillary blotting using procedures described in Roche DIG Application Manual. The probes were synthesized using long PCR carried out in 25μl containing: 2.5μl 10x Long PCR Buffer with MgCl2, 2.5μl dNTPs (2.5mM each) , 1μl 1μM of each primer, 0.2μl of Long PCR Enzyme Mix (MBI, Fermentas, USA), and 2μl (70μM) of alkali-labile DIG-11-dUTP (Roche, Applied Science) and 2μl of genomic DNA (~4ng). Probes were synthesized using primer pairs 20–22 listed in additional file 12: Table S5 with the following PCR conditions: initial denaturation at 94°C for 2min.; 10 cycles of 94°C for 20sec., 57°C for 30sec., 68°C for 15min.; 25 cycles of 94°C for 20sec., 57°C for 30sec., 68°C for 15min. with an automatic 10sec. extension in each consecutive cycle and finally 10min. in 68°C. Three μl of the labeling reaction were examined in the standard agarose gel against the unlabelled control (the reaction without DIG-11-dUTP). The remaining portion of the labeling reaction was added to 16ml of the standard hybridization buffer, the resulting solution was incubated in a water bath at 95°C for 10min. (probe denaturation) and afterwards immediately applied to the membrane of ca. 400cm2.
A preliminary annotation of proteins, rRNA, and tRNA was carried out using BLAST with a local database of nucleotide and protein sequences of all published mitochondrial genomes. Consistency of ORFs was then tested using Open Reading Frame Finder http://www.ncbi.nlm.nih.gov/gorf/gorf.html. The tRNA genes were detected with tRNAscan-SE , http://lowelab.ucsc.edu/tRNAscan-SE/. Any ORFs longer than 300 nt and not overlapping conserved genes were searched with the NCBI nucleotide and protein databases and annotated on the mitochondrial sequence. Sequence of sdh3sdh4rpl2rps10 and rps19 from the Nicotiana tabacum mitochondrial genome [GenBank: NC_006581] and rps2rps14 from the Vitis vinifera mitochondrial genome [GenBank: NC_012119] were used to investigate the presence of those genes in the nuclear genome of carrot. In order to annotate sequences common to plastid and mitochondrial genomes, known conserved mitochondrial genes were first masked in our mitochondrial consensus sequence, which was then queried against a local database containing sequences of all published plastid genomes. Sequences with minimum length of 40 nt and 90% identity not interrupted by masked sequences were added to our annotation.
Regions repeated within the mitochondrial genome were detected with blastn with the following parameters: minimum length 40 nt; percent identity 90.
The complete sequence of the annotated carrot mitochondrial genome was deposited in the NCBI organelle genome database with accession number JQ248574.
Mitochondrial and plastid DNA insertions were amplified using primer pairs 15, 16 and 19 listed in additional file 12: Table S5. PCR assay and sequence analysis were carried out as described above. Sequences were then searched against the NCBI nucleotide and protein database using default parameters.
The authors declare that they have no competing interests.
MI: designed experiments, carried out PCR validations and Sanger sequencing, annotated the genome, interpreted the data and wrote most sections of the manuscript; DSe: designed experiments, carried out most of the bioinformatic analyses, wrote relevant parts of the methods and revised the paper; MS: carried out Southern experiments, wrote the relative methods and revised the paper; DG: participated in data interpretation and revision of manuscript; DSp: provided taxonomic advice and reviewed the paper; PWS: developed the plant material, participated in the experimental design and initiation of the experiments, data interpretation, writing and revision of several sections of the manuscript. All authors read and approved the final manuscript.
Figure S1. Connections verification. A. Schematic representation of the order of single copy regions (A, B, C, D, E) and repeated regions (R1-R4) into two possible master circles, Mc 1 and Mc 2; B: PCR results of all possible region connections; letter above each lane indicate the location of the primer pair (relative to each region) used for PCR. MW: 1kb DNA molecular weight; C-=negative control.
Figure S2. Comparison of the five mitochondrial genome assemblies from sequence sets 1–5. Different colors identify different contigs. Triangles indicate missing connections between contigs. Letters in the consensus sequence indicate single copy regions (A-E) and repeated regions (R1-R4).
Table S1. Summary of Illumina and 454 read coverage.
Figure S3. Coverage plots displaying the Illumina (A) and 454 (B) read coverage (Red line) and GC content (Green line) across the complete plastid (A1, B1) and mitochondrial (A2, B2) genomes. Windows in the graphs indicate regions with higher GC content.
Figure S4. Original Southern blot pictures for repeat 1-3-4. Unrelated samples should be ignored.
Table S2. Comparison of gene content in mitochondrial genomes.
Figure S5. Results of alignment of the DcMP 3 sequence against the NCBI protein database.
Figure S6. Potentiality of next generation sequencing data for assembly of plant mitochondrial genomes. Number and size of repeat regions among representative sequenced plant mitochondrial genomes compared to the longest read length available for the two most utilized next generation sequencing technologies.
Figure S7. Intercompartmental DNA transfer. Schematic representation of all possible and observed directions of intercompartmental DNA transfer in angiosperm genomes. Black solid arrows indicate previously observed DNA transfers; black dotted arrow indicates unobserved DNA transfer; Red solid arrow indicates the mitochondrial-to-plastid transfer identified in carrot.
Table S3. Plant material used in this study.
Table S4. Parameters used for analysis programs.
Tables S5. Primers used in this study.
This work was supported by the Agricultural Research Service, United States Department of Agriculture, vegetable seed companies and production industry.