Conceived and designed the experiments: MY XZ ISAM JY. Performed the experiments: MY XZ YY QY DZ. Analyzed the data: MY GL JY. Contributed reagents/materials/analysis tools: MY GL KC ISAM. Wrote the paper: MY.
Date palm (Phoenix dactylifera L.), a member of the Arecaceae family, is one of the three major economically important woody palms (the other two being oil palm and coconut), and its fruit is a staple food in Middle Eastern and North African nations, as well as in many other tropical and subtropical regions. Here we report a complete sequence of the date palm chloroplast (cp) genome based on pyrosequencing.
After extracting 369,022 cp sequencing reads from our whole-genome-shotgun data, we put together an assembly and validated it with intensive PCR-based verification, coupled with PCR product sequencing. The date palm cp genome is 158,462 bp in length and has a typical quadripartite structure of the large (LSC, 86,198 bp) and small single-copy (SSC, 17,712 bp) regions separated by a pair of inverted repeats (IRs, 27,276 bp). Similar to what has been found among most angiosperms, the date palm cp genome harbors 112 unique genes and 19 duplicated fragments in the IR regions. The junctions between LSC/IRs and SSC/IRs show different features of sequence expansion in evolution. We identified 78 SNPs as major intravarietal polymorphisms within the population of a specific cp genome, most of which were located in genes with vital functions. Based on RNA-sequencing data, we also found 18 polycistronic transcription units and three highly expression-biased genes—atpF, trnA-UGC, and rrn23.
Unlike most monocots, date palm has a typical cp genome similar to that of tobacco—with little rearrangement and gene loss or gain. High-throughput sequencing technology facilitates the identification of intravarietal variations in cp genomes among different cultivars. Moreover, transcriptomic analysis of cp genes provides clues for uncovering regulatory mechanisms of transcription and translation in chloroplasts.
The chloroplast (cp), believed to have arisen from endosymbiosis between a photosynthetic bacterium and a non-photosynthetic host, is the photosynthetic organelle that provides essential energy for plants and algae. A given cell, such as that of a plant leaf, often contains 400 to 1,600 copies of the cp genome. Other biological activities also take place in the chloroplast, including the production of starch, certain amino acids and lipids, vitamins, and certain pigments in flowers, as well as several key pathways of sulfur and nitrogen metabolism. In angiosperms, most cp genomes are circular DNA molecules ranging from 120 to 160 kb in length and have a quadripartite organization consisting of two copies of inverted repeats (IRs) of ~20–28 kb in size, which divide the rest of the genome into a large single-copy region (LSC; 80–90 kb) and a small single-copy region (SSC; 16–27 kb). Usually, the gene content of angiosperm cp genomes is rather conserved, encoding 4 rRNAs, ~30 tRNAs, and ~80 unique proteins. Since the completion of the first cp genome of tobacco (Nicotiana tabacum), there have been many discoveries of rearrangements and IR expansions within cp genomes; perhaps the most remarkable is that of Pelargonium x hortorum, in which numerous rearrangements and large IR expansions were found. Gene losses have also been found frequently in angiosperm cp genomes.
The date palm (Phoenix dactylifera L.), a member of the Arecaceae family, is one of the most economically important woody plants cultivated in the Middle East and North Africa. Its fruit provides a staple food across the Asian (mainly the Arabian Peninsula, Iran, and Pakistan), African, European, and American tropics. A recent report suggests that there are more than 340 cultivars in Saudi Arabia and nearly 2,000 cultivars around the world. The total number of date palm trees grown in the world is ~100 million, producing 15 million metric tons of fruit each year. The cultivated hybrids of date palm are mostly diploid (2n=36) and propagated from offshoots. Because the cp genome is maternally inherited, deeper knowledge of its structure, sequence variation, and evolution provides useful information for developing propagation technologies, such as cytoplasmic breeding and transgenic insertion.
In the past 10 years, we have witnessed a dramatic increase in the number of complete cp genomes. To date, 132 complete land plant cp genomes have been deposited in GenBank Organelle Genome Resources, although only 22 of them are monocot cp genomes, and most of these genomes were sequenced using capillary sequencers. With the emergence of next-generation sequencers, new approaches to genome sequencing have gradually been proposed because of their high throughput, time savings, and low cost. Here, we report the complete cp genome sequence of date palm, the first from a member of the Arecaceae family, obtained from an elite cultivar, Khalas (Al-Hssa Oasis, Saudi Arabia), using one of the next-generation sequencing methods, pyrosequencing (Roche GS FLX). We also describe details of genome assembly, annotation, and comparative analysis, as well as information on sequence variations and transcriptomics for the date palm cp genome.
Using Roche GS FLX system, we carried out five sequencing runs to generate 4,169,506 raw reads (347 bp in average read length) for the project. After screening the reads through alignment with reference cp genomes and extensive contig extension efforts (see Materials and Methods for details), we collected 369,022 cp-genome-related reads (8.8% of total reads and an average of 384 bp in length), reaching 1,081x coverage on average over the cp genome. After validating the homopolymer regions and the junctions between single-copy and IR regions with PCR-based confirmation, we obtained a complete cp genome sequence of 158,462 bp in length.
Typically, two types of errors are characteristic of pyrosequencing data processing: one is associated with contig ends and the other involves heterogeneous insertions/deletions (indels) arising from homopolymeric repeats. The first is basically a sequence quality issue and is usually overcome by increasing coverage (here we have 369,022 cp reads) and removing low-quality reads. The second is intrinsic to pyrosequencing and cannot be improved by increasing coverage; we therefore performed alternative experiments to correct the erroneous homopolymer calls found in the assembly. Based on the distribution of homopolymers (n>=7) in the preliminary cp genome assembly, we designed 151 pairs of PCR primers to validate all homopolymer runs in the entire assembly by capillary sequencing (Table S1). The result was very satisfactory: we added 117 base pairs in 108 homopolymers, and all previously uncertain or missed nucleotides, either A or T (except for one homopolymer run of 13 cytosines), were validated. The number of recovered nucleotides was larger than that encountered when sequencing the mungbean cp genome, where only 49 bp were lost in the initial assembly (74.6x raw data). Since we have 161 homopolymers longer than 7 bp and 181 of exactly 7 bp, the potential for error is too large to ignore, and a simple increase in sequence coverage is not likely to be useful.
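The homopolymer screen described above can be illustrated with a minimal Python sketch that lists every single-base run of at least 7 bp in a sequence. The function name `homopolymer_runs` and the toy sequence are hypothetical and are not taken from our pipeline; only the n>=7 threshold comes from the text.

```python
import re

def homopolymer_runs(seq, min_len=7):
    """Return (start, end, base, length) for every run of a single base
    that is at least min_len long (0-based, end-exclusive coordinates)."""
    # One captured base followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]

# Toy sequence (hypothetical), not real date palm data.
toy = "ACGT" + "A" * 8 + "CG" + "T" * 7 + "GGGGGG" + "C" * 13
runs = homopolymer_runs(toy)

# Split the runs the same way the text does: longer than 7 bp vs exactly 7 bp.
longer_than_7 = [r for r in runs if r[3] > 7]
exactly_7 = [r for r in runs if r[3] == 7]
```

Coordinates reported this way could be used to decide where validation primers need to be placed; primer design itself is outside the scope of this sketch.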
The date palm cp genome is a typical circular double-stranded DNA molecule, and it shares a common quadripartite structure with the vast majority of angiosperms: a pair of IRs (27,276 bp each) separated by the LSC (86,198 bp) and SSC (17,712 bp) regions (Figure 1). It encodes 131 predicted functional genes; 112 are unique and 19 are duplicated in the IR regions. Among the 112 unique genes, we identified 79 protein-coding, 29 transfer RNA, and 4 ribosomal RNA genes. Of the genome sequence, 50.93%, 1.79%, and 5.71% encode proteins, tRNAs, and rRNAs, respectively, whereas the remaining 41.57% is non-coding and consists of introns, intergenic spacers, and pseudogenes. Similar to other cp genomes, the date palm cp genome is also AT-rich (62.77%), and the values vary slightly among the non-coding, protein-coding, tRNA, and rRNA sequences, whose A+T contents are 66.60%, 61.03%, 57.94%, and 52.19%, respectively.
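As a quick consistency check on the figures above, the short Python sketch below recomputes the genome length from the region lengths and sums the reported sequence-class percentages. The variable names are illustrative only; all numbers are those given in the text.

```python
# Region lengths in bp, as reported in the text.
LSC, SSC, IR = 86_198, 17_712, 27_276

# The IR is present in two copies (IRa and IRb).
genome_length = LSC + SSC + 2 * IR
print(genome_length)  # 158462

# Reported percentages of the genome by sequence class.
fractions = {"protein": 50.93, "tRNA": 1.79, "rRNA": 5.71, "non-coding": 41.57}
total_percent = round(sum(fractions.values()), 2)
print(total_percent)  # 100.0
```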
Similar to the tobacco cp genome, the date palm cp genome has 18 intron-containing genes among the 112 unique genes; almost all contain a single intron, except for two genes, ycf3 and clpP, whose exons are separated by two introns. The gene rps12 is trans-spliced; one of its exons (the 5′ exon) is in the LSC region and the other two reside in the IR regions, separated by an intron. The introns of all protein-coding genes share the same splicing mechanism as Group II introns. In the IR regions, both ycf15 and ycf68 have become pseudogenes due to internal stop codons in their coding sequences (CDS). Specifically, the CDS of ycf15 is interrupted by a stop codon 57 bp downstream of the start codon, whereas several stop codons appear in ycf68. Similar mutations are known to occur in the cp genomes of other species. Another pseudogene is the copy of ycf1 at the boundary of IRb and SSC, which results from the incomplete duplication of the normal copy of ycf1 at the IRa and SSC boundary (Figure 1).
A total of 22,950 codons represent the coding capacity of all protein-coding genes of the date palm cp genome (Table 1). Among these codons, 2,001 (8.72%) encode isoleucine and 271 (1.18%) encode cysteine, which are the most and the least frequent amino acids, respectively. These extremes are the same as those observed in nuclear genomes. The base compositions at each codon position are slightly biased: 54.4%, 61.8%, and 70.6% for the first, second, and third positions, respectively.
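The codon tally above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for Table 1: the function `amino_acid_usage` and the toy CDS are hypothetical, and the codon table uses the standard amino acid assignments, so genes with non-canonical start codons would need separate handling.

```python
from collections import Counter
from itertools import product

# Standard codon-to-amino-acid assignments, in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def amino_acid_usage(cds_list):
    """Count sense codons per amino acid over a list of in-frame CDS strings."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            aa = CODON_TABLE.get(cds[i:i + 3])  # None for ambiguous codons
            if aa and aa != "*":                # skip stop codons
                counts[aa] += 1
    return counts

def percent(count, total):
    """Percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)

# Toy CDS (hypothetical): Met-Ile-Ile-Cys-stop.
usage = amino_acid_usage(["ATGATTATCTGTTAA"])

# Percentages recomputed from the counts reported in the text.
ile_pct = percent(2001, 22950)  # 8.72
cys_pct = percent(271, 22950)   # 1.18
```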
There are some exceptional cases among start codons. We identified ACG as the start codon in rpl2 and ndhD and GUG as the start codon in rps19. Such non-canonical starts have been detected in other angiosperms and even in the tree fern Alsophila spinulosa, where 20 genes start with ACG. We also found an unconventional start codon in cemA, which encodes a heme-binding protein functioning in the inner chloroplast envelope membrane. Multiple alignments of cp genomes show that an 8-bp homopolymer (AAAAAAAA) downstream of the start codon is common among certain monocots, such as rice. However, in date palm (and also in oil palm and Yucca), this homopolymer is 10 bp in length, resulting in a 2-bp frame shift that changes ATG to AAT and GAA (Figure 2). Moreover, a stop codon (TAG) 12 bp upstream makes the start codon of cemA in date palm rather ambiguous. Our transcriptome analysis demonstrated that its mRNA is part of a polycistronic transcription unit, together with psaI, ycf4, and petA (Part 7). However, whether this gene can be translated into protein remains unclear because of the obscurity of its start codon. To inspect this gene further, we aligned its orthologs from 74 angiosperms and 3 gymnosperms (Ginkgo, Pinus, and Welwitschia) and found a start codon, either at the original position or upstream, in all species other than date palm, oil palm, Yucca, and Lemna (Figure S1). Although ATG is typically the start codon, GUG is also used in 14 species. Most of the 14 species are either basal angiosperms or gymnosperms, except for two in the Acoraceae family (Acorus calamus and Acorus americanus). These results suggest that the start codon of the cemA gene must have endured strong selective pressure during the early evolution of angiosperms but that this pressure is somewhat relaxed among certain plant taxa.
Since the tobacco cp genome is often regarded as unrearranged, we compared our assembly with it and observed a high degree of synteny between the two cp genomes. Two minor exceptions were found, namely additional copies of rps19 and trnH-GUG around the junction between the LSC and IRs, due to IR expansions in the date palm cp genome. The gene content is also nearly identical to that of tobacco; only a single gene, infA, which is common among monocots, has degenerated into a pseudogene in tobacco. However, structural rearrangements and gene loss-and-gain events are quite frequent among other monocot cp genomes, mainly in the Poaceae family, where the canonical order is disrupted by three inversions in the LSC region and rpl23 is translocated from the IR to the LSC region. Indels are also frequently found in Poaceae, such as the intron loss in rpoC1 and the insertion in rpoC2. Other variations are the gene losses (by deletion or pseudogenization) of accD, ycf1, and ycf2 in Poaceae cp genomes. We calculated the vestigial lengths of these three pseudogenes and found that their sequence variations can explain in part the cp genome size difference between Poaceae species (137,851 bp on average) and date palm, since the average size of the Poaceae cp genomes is ~14 kb shorter than that of date palm. In addition, gene-loss events often occur in other monocot families (cp genomes of 22 monocot species are available in public databases, and 15 of them are from Poaceae species). Only one Typhaceae cp genome has been published recently, and it has the same gene content and order as date palm. Among the remaining cases, Lemna and Dioscorea each lost a single gene, infA and rps16, respectively; two Acoraceae members lost accD; and Phalaenopsis and Oncidium lost most of their ndh genes. Rearrangements also occurred in Dioscorea, such as the inversion of the SSC.
Nevertheless, similar to that of tobacco, the date palm cp genome appears less rearranged and shows very limited gene loss-and-gain, especially when compared to these monocots.
Around the borders of JLB and JLA (the LSC/IRb and LSC/IRa junctions), date palm has the same structure as Poaceae cpDNAs; specifically, at JLB, rpl22 and the rps19 adjacent to its 5′-end fall completely within the LSC and IRb, respectively, whereas at JLA, another copy of rps19 in IRa adjoins psbA in the LSC at its 3′-end. However, surrounding JSB and JSA (the SSC/IRb and SSC/IRa junctions), the gene order of the date palm assembly is similar to those of Amborella and certain dicots (i.e., tobacco, Panax, and Arabidopsis): JSB is located between the ycf1 pseudogene and ndhF, whereas JSA resides in the 3′ region of the normal ycf1 gene.
Among other monocots, various degrees of IR-to-LSC expansion were identified. In Dioscorea and Acorus, only the 5′-end of rps19 is included in IRb, generating an rps19 pseudogene in IRa, whereas in Lemna, a contraction was detected as IRb shrank into the 3′-end of rpl2. Other than Lemna, all monocots studied thus far possess a copy of trnH-GUG (at the 5′-end of rps19 in IRb) that is absent in tobacco and Amborella. Therefore, based on the two-step hypothesis of IR expansion, we deduced that the formation of IR-LSC boundaries among monocots (except Lemna) occurs in the following manner: trnH-GUG is duplicated in IRb as the first step of inclusion of trnH-GUG to IRa; and along with the second expansion step, which recruits rps19 to IRb, another copy of rps19 (or a pseudogene) is generated in IRa. However, in Phalaenopsis and Oncidium, the expansion was unusual in that IRb extended to the 3′-end of the rpl22 gene (Figure 3).
At the two boundaries of the SSC, the general structure revealed in Amborella and dicots (i.e., tobacco, Panax, and Arabidopsis) is that ycf1 spans JSA and the ycf1 pseudogene is adjacent to JSB. Among monocots, two Acoraceae members share this structure; and, similar to Arabidopsis, a small expansion occurred in date palm, which formed a 55-bp overlap between ndhF and the ycf1 pseudogene at JSB. In Dioscorea, this structure is inverted (the normal ycf1 at JSB and the corresponding ycf1 pseudogene at JSA) because of the inversion of the SSC. In Phalaenopsis, there is a short contraction of the IRs such that ycf1 is completely included in the SSC. A special case among monocots is found in Lemna: its pattern of IR expansion into the SSC is similar to that in the Poaceae family if we exclude the complete duplicate of ycf1. Since we can find symmetrically degenerated ycf1 in several Poaceae cp genomes (e.g., at positions 99,623–100,458 and 114,660–115,495 in IRb and IRa, respectively, in Oryza sativa) at the position corresponding to that in Lemna, we speculate that the ancestors of Poaceae initially experienced a similar IR expansion into the SSC. Therefore, there are two types of IR/SSC junctions in the early evolution of monocots: one is little changed, as in date palm and the Acoraceae members, which show no obvious expansion and share a similar structure with Amborella or tobacco; the other experienced an apparent expansion that first included the whole duplicates of rps15 and ycf1 in IRb, after which two possible expansions occurred. The first is the incorporation of the 5′-end of ndhH, resulting in an incomplete copy of ndhH (a pseudogene) in IRb. This expansion happened in Lemna and probably in most Poaceae. The second is found in cp genomes of the Panicoideae subfamily of Poaceae, in which IRb expanded into the SSC to include the 3′-end of ndhF, accompanied by an incomplete copy of ndhF (a pseudogene) at JSA.
After these expansions, ycf1 became nonfunctional, resulting in the present Poaceae structure (Figure 4).
Gene order at the four junctions is known to vary among different cp genomes due to the expansion or contraction of the IR regions. The general process of IR changes has also been surveyed for monocot cp genomes. We proposed the most likely evolutionary route of the IRs and summarized the patterns of expansion at the IR/LSC and IR/SSC junctions in comparison to Amborella and tobacco. The basal monocots in our analysis are members of the Acoraceae family, whereas the higher monocots belong to Poaceae. However, at the IR/LSC junctions, Lemna has a more contracted IR structure than Acorus and the basal angiosperm Amborella, and a larger expansion occurred in Orchidaceae than in Poaceae. At the IR/SSC junctions, we observed obvious expansion in Lemna and all Poaceae members, whereas little expansion was noticed in all other monocots, and there is even an IR contraction in Phalaenopsis. Therefore, the expansion rate may not be in accordance with the taxonomic relationships among monocots.
Using REPuter, we identified 11 forward and inverted repeats of 30 bp or longer with a sequence identity greater than 90% (Table 2). Three pairs of forward repeats were found in coding regions, while seven pairs were identified in either intergenic or intronic regions. The remaining pair lies mostly within the two trnS genes. Compared to other species, the number of repeats in the date palm cp genome is fairly low; for example, in the Poaceae family, there are 19–37 forward and inverted repeats ranging in size from 30 bp to 60 bp, and in some dicots, such as Gossypium and Citrus, the numbers of repeats are 54 and 29, respectively. The repeats in the date palm assembly are also much shorter: the longest repeat is only 39 bp in length, whereas a much longer 91-bp repeat was found in the Poaceae family.
Small inversions (SIs) between short inverted repeats are quite interesting upon examination. SIs vary in length from 5 to 50 bp and are flanked by a pair of short inverted repeats (distinct from the two large IR regions) ranging from 11 to 24 bp in size. SIs can generally be identified through pairwise comparison of sequences from closely related taxa. A comprehensive set of 16 SIs has been reported previously for plant chloroplasts. Two additional SIs have been found recently in the intergenic regions psbA-trnH and psbC-trnS. Here, we also identified a new SI in the date palm cp genome after performing alignments with its orthologs in other monocots (Table 3). This SI is located in the psaB coding region, is 63 bp in length, and has a stem and loop of 13 and 37 bp, respectively, in most analyzed cases. The loop regions in the aligned sequences have the same orientation across different taxa and have at least 90% sequence identity to that of date palm. Sequence variations in the inverted-repeat (stem) region mostly occurred as G-to-A single nucleotide polymorphisms (SNPs), which decrease the stability of the secondary structures. In Lemna and Acorus, three SNPs were found in the stem region, making the stem-loop structure rather unstable, as their free energy values are −1.49 and −1.1, respectively; in tobacco and Panax, numerous variations seriously disrupt the secondary structure. Because of their large size and the variations in the loop sequences, the SIs adopt different loop configurations when folded into secondary structures. The alteration is conspicuous in Phalaenopsis and Agrostis, whose loops are even longer (41 and 47 bp, respectively) than most of the others (37 bp) due to variation near the two ends of the loop in the stem region (Figure 5). However, this did not affect the free energy of their secondary structures. Phalaenopsis has a free energy (−11.39) almost identical to that of date palm (−11.12), whereas the free energy of Agrostis (−7.31) is slightly less negative than that of the Pooideae subfamily (−9.14).
In addition, we detected 29 SIs when searching for IRs; nine SIs were confirmed and the rest were all putative because we could not find homologous sequences among other monocots (Table S2). They were mostly located in intergenic or intronic regions and formed stem-loop structures. Excluding the previously mentioned SI in psaB, we found three putative SIs in coding regions; one, in ccsA, was already known. If these putative SIs prove to be common in the genome, they may provide phylogenetic information or may even play functional roles in stabilizing their corresponding mRNAs.
A plant cell often contains multiple copies of the cp genome, and the chloroplasts can be regarded as a population with high genetic heterogeneity. Therefore, when thousands of high-quality cp sequence reads are aligned, polymorphic sites can be detected readily with software tools and confirmed experimentally. A similar phenomenon is also observed in mitochondrial genomes. The sequence variations within a variety (or cultivar or subspecies), often discovered by high-coverage sequencing, can be separated into major and minor intravarietal genotypes within a chloroplast or mitochondrial genome assembly based on sequence counts. Variations can be further defined, based on experiments over large population samples from different varieties and subspecies of the same species, as intervarietal or intersubspecific genotypes when one of the alleles becomes unique to a certain fraction of the samples.
High throughput next-generation sequence technologies provide us a great opportunity to collect a huge number of raw reads for surveying intravarietal polymorphism within cp genomes. Using the 369,022 raw 454 cp reads, we surveyed each locus in date palm cp genome for possible intravarietal SNPs. A total of 113 SNPs were originally detected, and after an exclusion of homopolymer runs, the remaining 78 SNPs were considered to be stable SNPs. Among them, 16 fell into intergenic or intronic regions (Table S3) and the remaining were all located in protein-coding regions spreading over 23 genes (Table 4). There are 26 transitions and 52 transversions and more transversions are seen in both non-coding (93.8%) and coding (59.7%) regions. In the CDS regions, 29 and 31 mutations were synonymous and nonsynonymous substitutions, respectively. Major or minor genotypes occurred simultaneously in two SNPs of rpoC2 at positions 21,213 and 21,215 in genome. In addition to these intravarietal SNPs, we also detected a 4-bp intravarietal Indel in intergenic region of accD and psaI from the position 61,482 to 61,485. This Indel, characterized as “TAGA”, however, is defined as a minor genotype, because in the total of 255 original reads (quality value greater than 20) in these four sites, only 75 displayed as “TAGA” insertion.
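The two classifications used above (transition versus transversion, and major versus minor genotype by read count) can be sketched as follows. This is an illustrative Python sketch only; the function names are hypothetical, and the only data taken from the text are the read counts for the "TAGA" indel (75 of 255 reads).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(ref, alt):
    """Classify a single-base substitution as a transition (purine<->purine
    or pyrimidine<->pyrimidine) or a transversion (purine<->pyrimidine)."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not ({ref, alt} <= (PURINES | PYRIMIDINES)):
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def genotype_fraction(supporting_reads, total_reads):
    """Fraction of reads at a site that support a given genotype."""
    return supporting_reads / total_reads

# The "TAGA" insertion is supported by 75 of 255 reads, i.e. it is the minor genotype.
taga_fraction = genotype_fraction(75, 255)
is_minor = taga_fraction < 0.5
```

For example, the T–G changes reported for ycf2 and rpoC2 would be classified as transversions by `substitution_type("T", "G")`.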
In general, SNPs in the cp genome occurred at a rate of 1 in 1,400 bases. However, the 78 SNPs were not randomly distributed across the whole genome; a majority of them were clustered in the LSC region. In the IRs, we found only one SNP, in ycf2, a T–G mutation at position 92,696. In the SSC region, we detected only two SNP sites: one in an intergenic region at position 122,758 (a T–G mutation) and the other in ndhA. The latter is similar to one locus in rpoC2 (genome position 21,163), with a T–G mutation changing a leucine codon (TTA) into a stop codon (TGA). The percentage of the minor genotype in each of these two genes was about 11%, with around 100 original supporting reads. These mutations, which occurred at the 5′-end of rpoC2 and the 3′-end of ndhA, could terminate translation prematurely and therefore make these two genes nonfunctional.
The intravarietal SNPs in the LSC region could be classified primarily into four functional gene categories: transcription (rpoB, rpoC1, and rpoC2), photosynthesis (psa and psb proteins), energy metabolism (atp, pet, and ndh proteins), and translation (rps proteins). Since chloroplasts can be considered a population within the cell, major-allelic variations that change amino acid sequences and translation efficiencies may affect or regulate gene functions. Furthermore, it has been shown in rice that the intravarietal major genotypes in one subspecies are often the minor genotypes in another subspecies, or vice versa. Since subspecies among date palm cultivars are not yet well characterized, we used the 81 available CDSs of the oil palm cp genome for a brief comparative analysis. We adopted a procedure to avoid read contamination from nuclear or mitochondrial genomes and sequence artifacts (Materials and Methods); therefore, these intravarietal variations are unquestionably characteristic of the Khalas cultivar. Although date palm and oil palm belong to two different genera, we indeed found 19 sites whose cp minor genotypes in date palm are major genotypes in oil palm, or vice versa (Table 4). This result confirms that some of these genotypes (major in one and minor in the other) between the two species are real, and these intravarietal cp genome variations may also prove to be real intervarietal variations among date palm cultivars if a large survey is carried out in the future.
Our study is the first to quantify intravarietal polymorphisms using next-generation sequencing technology. A recent analysis reported a similar discovery of 40 SNP sites in Lolium perenne. The intravarietal polymorphisms are indicators of the heterogeneous nature of the chloroplast population in a given species, and we have demonstrated in rice that the minor genotypes are also chloroplast in origin. Since the effort to classify date palm cultivars is at an early stage, these intravarietal polymorphisms provide useful markers for the evaluation of different subspecies. We expect to be able to validate some of these variations in future genetic studies of date palm.
We generated transcriptomic data, comprising 1,076,222, 833,875, and 465,456 raw reads for leaf, root, and bud, respectively (data not shown), for the cp transcriptome analysis. The transcripts identified from leaf were the most abundant: 19,052 and 306,154 reads for protein-coding and RNA genes, respectively. Most of the genes (99) were transcribed in at least two tissues, although we failed to detect transcripts for four genes (trnQ-UUG, rpl32, ndhG, and ndhE) in any tissue, possibly due to sampling biases (see discussion below). More than half of the 112 unique genes were only scarcely detected in buds. The transcription-translation apparatus of chloroplasts has a number of prokaryote-like features; for instance, cp genes are often co-transcribed. In our assembly, we found 18 polycistronic transcripts including genes for 63 proteins, 5 tRNAs, and all rRNAs (Table 5).
Since cDNA identification for the chloroplast genome is not as quantitative as it should be, owing to the lack of a polyA tail that provides a handle for mRNA purification, the interpretation of cp gene transcription is rather qualitative. What we observed here are the transcripts whose RNAs are co-purified with the mRNA preparation, similar to rRNAs, as both are so abundant. It is also this unbiased co-purification that provides us with useful information about chloroplast transcripts. We indeed observed interesting expression patterns among the date palm cp transcripts. First, one protein-coding gene, atpF, has 15,121, 3,030, and 91 corresponding cDNA reads, accounting for 79.4%, 66.8%, and 21.9% of the total cp-related reads in leaf, root, and bud, respectively. Although the fractions may reflect the real proportions of the gene's abundance in the three tissues, the reason for its unusual abundance is obvious from the assembly: there is a polyA-like tract that is 22 bp long and situated 57 bp upstream of the atpF gene. Since our reverse transcription was primed with a polyT adaptor, atpF transcripts are enriched in the cDNA libraries. However, due to the prokaryote-like features of the chloroplast genome, there is no obvious polyA structure downstream of other genes. Furthermore, we found that the polycistronic transcript rps2-atpI-atpH-atpF-atpA-trnR-trnG had only 50 transcripts containing both atpF and atpA, and this is obviously a result of enrichment related to the polyA-like tract (Figure S2). In fact, other than atpF, the fraction of total transcripts with adaptor sequences accounts for about 72% of all CDS reads, which indicates that our CDS transcripts are indeed non-specifically co-purified with polyA mRNAs rather than enriched by polyT affinity purification. For example, downstream of atpA, there is an 11-bp (positions 10,718–10,728 in the genome), rather degenerate polyA unit, whereas only 23 reads with adaptors were detected for atpA.
Besides, the extremely high number of atpF transcripts suggests that this gene may not simply be co-transcribed with other genes as a polycistronic transcript but may also be transcribed independently. Second, two rRNA genes, rrn23 and rrn16, also display high abundance. In leaf, the numbers for rrn23 and rrn16 are 280,106 and 28,721, respectively (Table 5). There is no polyA structure downstream of these RNA genes; therefore, their transcripts may simply be co-purified with mRNA, and such enrichment may reflect their real abundance in leaf. Third, the number of trnA-UGC transcripts is also notable, accounting for 90.9% of all tRNA transcripts. It is curious that this gene has such a high copy number in the cell, since alanine accounts for only 5.32% of the amino acids encoded by the date palm cp genome. The reason remains to be elucidated.
In summary, we present here the first complete cp genome from the Arecaceae family, obtained by pyrosequencing. The genome is A+T rich, and we had to design primers to verify most of the homopolymer runs. As a rather regular cp genome, its gene content and structure are similar to those of tobacco, and its IR expansion occurred into the LSC but not the SSC region. Its number of repeated sequences is lower than those of Poaceae cp genomes and certain dicots. The high coverage of raw cp reads demonstrated that intravarietal SNPs exist not only in intergenic regions but also in coding genes, where these variations may have been functionally selected. Some of the major intervarietal SNPs may serve as genetic markers for classifying subspecies when appropriately developed as simple genotyping assays. Most protein-coding genes of the cp genome are transcribed as polycistronic transcripts, and transcriptome analyses need to be designed specifically to avoid cross-contamination between nuclear polyA-containing and organellar polyA-lacking transcripts. Our data are useful for the study of date palm biology and pave the way for further molecular investigations.
We collected fresh green leaves from an adult plant of Khalas (a common cultivar of date palm) grown in Al-Hass Oasis, Kingdom of Saudi Arabia, and extracted genomic DNA from 50 g of leaves according to a CTAB-based protocol. We used 5 µg of purified DNA for constructing libraries according to the manufacturer's manual for GS FLX Titanium.
Since our original sequence reads are a mixture of DNA from the nucleus and organelles, we separated the cp reads from the total reads based on known cp genome sequences. We assembled the filtered sequence reads into non-redundant contigs using Newbler, a de novo sequence assembly software package, and aligned the contigs to the references, including those of tobacco, rice, and orchid, and the coding sequences from oil palm. We obtained 42 cp genome contigs with a collective length of 115 kb. Using a Perl script, we iteratively elongated both ends of each contig against the total raw reads until the 42 contigs emerged as preliminary cp sequence scaffolds.
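The idea of iterative contig elongation can be illustrated with the simplified Python sketch below. It is not the Perl script used in this study: the function `extend_right`, its parameters, and the exact-match overlap rule are assumptions made for illustration, and a real implementation would need to tolerate sequencing errors and handle repeats.

```python
def extend_right(contig, reads, min_overlap=20, max_rounds=1000):
    """Greedily extend the 3' end of a contig with reads whose prefix
    exactly matches the current contig suffix over >= min_overlap bases.
    max_rounds guards against endless extension around a circular genome."""
    for _ in range(max_rounds):
        for read in reads:
            # Try the longest overlap first; the read must add at least one base.
            longest = min(len(read) - 1, len(contig))
            k = next((k for k in range(longest, min_overlap - 1, -1)
                      if contig.endswith(read[:k])), None)
            if k is not None:
                contig += read[k:]
                break
        else:
            return contig  # no read extends the contig any further
    return contig

# Toy data (hypothetical), using a short overlap so the example stays readable.
seed = "ACGTACGGTA"
toy_reads = ["CGGTATTGCA", "TTGCAGGC"]
extended = extend_right(seed, toy_reads, min_overlap=5)
```

The 5′ end could be extended in the same way by applying the function to the reverse complement of the contig and of the reads.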
To avoid assembly errors from homopolymer runs (a characteristic of the pyrosequencing technology) and to acquire a high-quality complete cp genome sequence, we designed 151 pairs of primers covering the preliminary cp genome assembly. Based on a published procedure, we sequenced PCR products using the BigDyeV3.1 Terminator Kit for 3730XL (Life Technologies) and assembled the high-quality sequences into the complete cp genome using the Phred-Phrap-Consed package. The four junctions between the single-copy segments and inverted repeats were also confirmed by PCR product sequencing. The final date palm chloroplast sequence has been deposited in GenBank (accession number GU811709).
The genome was annotated using DOGMA, coupled with manual corrections for start and stop codons. Intron positions were determined based on those of the tobacco cp genome. The transfer RNA genes were identified using DOGMA and tRNAscan-SE (version 1.23). Certain intron-containing genes (i.e., trnK-UUU, petB, and petD), in which exons are too short to be detected with software tools, were identified by comparison with the corresponding exons in tobacco and other cp genomes. The functional classification of cp genes followed CpBase (http://chloroplast.ocean.washington.edu/). The cemA sequences of various species were downloaded from GenBank and aligned using MUSCLE.
We used REPuter to assess both direct (forward) and inverted (palindromic) repeats within the date palm cp assembly. The identity and the size of the repeats were limited to no less than 90% (Hamming distance equal to 3) and 30 bp in unit length, respectively. We verified the identified repeats manually based on intragenomic comparisons.
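The correspondence between the two thresholds is simple arithmetic: a 30-bp unit at 90% identity tolerates 30 × 10% = 3 mismatches. The Python sketch below makes the criterion concrete with a naive window-by-window scan. It is not REPuter's algorithm, it is far too slow for routine use on a whole cp genome, and all function names and parameters are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def max_mismatches(unit, identity_pct):
    """Mismatches allowed for a unit length at a given percent identity.

    Integer arithmetic avoids floating-point rounding (30 bp at 90% -> 3).
    """
    return unit * (100 - identity_pct) // 100


def find_repeats(seq, unit=30, max_dist=3):
    """Naive scan for non-overlapping forward and palindromic repeat pairs.

    Returns (i, j, kind, distance) tuples, where i and j are 0-based window
    starts. Runtime is quadratic in the sequence length.
    """
    hits = []
    last = len(seq) - unit
    for i in range(last + 1):
        a = seq[i:i + unit]
        for j in range(i + unit, last + 1):
            b = seq[j:j + unit]
            d = hamming(a, b)
            if d <= max_dist:
                hits.append((i, j, "forward", d))
            d = hamming(a, revcomp(b))
            if d <= max_dist:
                hits.append((i, j, "palindromic", d))
    return hits
```

Calling `max_mismatches(30, 90)` returns 3, matching the Hamming distance stated above.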
To identify possible small inversions (SIs), we first searched for IRs 11 to 24 bp in length with REPuter, and then collected those repeats separated by less than 50 bp as candidate SIs. Next, we evaluated the likely secondary structure of these SIs using MFOLD (version 3.2), and discarded those that do not form an obvious stem-loop structure. Finally, we ran BLAST with the remaining putative SIs against all monocot cp genomes, and retained only the qualified hits in the final SI list.
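The first filtering step (short inverted repeats separated by fewer than 50 bp) can be sketched in Python as follows. This is only an illustration with exact matching; it does not reproduce REPuter, and the secondary-structure and BLAST filters are not covered. The function name and the tuple layout of the result are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def small_inversion_candidates(seq, min_stem=11, max_stem=24, max_loop=49):
    """List exact inverted repeats whose arms lie close together.

    Returns (start, stem_length, loop_length) tuples with 0-based starts.
    For each start position only the longest qualifying stem is reported.
    """
    hits = []
    for i in range(len(seq)):
        for stem in range(max_stem, min_stem - 1, -1):
            left = seq[i:i + stem]
            if len(left) < stem:
                continue  # not enough sequence left for this stem length
            right_from = i + stem
            # The right arm must start within max_loop bases of the left arm.
            j = seq.find(revcomp(left), right_from, right_from + max_loop + stem)
            if j != -1:
                hits.append((i, stem, j - right_from))
                break
    return hits
```

Each reported candidate would still need to be checked for a plausible stem-loop structure and compared across other cp genomes, as described above.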
We used all 369,022 cp reads for intravarietal SNP discovery. Using the Consed graphical interface, we identified genome loci with more than one nucleotide type as candidates for intravarietal SNPs and assigned major and minor alleles. We only considered loci with a nucleotide sequencing quality value greater than 20.
To make our results more convincing and to eliminate false positives caused by contamination from nuclear and mitochondrial sequences, we adopted the following measures: (1) the number of aligned reads for each selected locus must be above 50; (2) the percentage of the maximal minor genotype (if there are two or more minor genotypes) must be above 10%; (3) if there were gaps in the reads at an aligned locus, we did not count those reads (a gap may be a one-base-pair indel), and after removing these reads, the number of remaining reads must be more than 50; and (4) we excluded sites overlapping with homopolymer runs. Errors often occur at the two ends of a homopolymer run and become visible because of the high coverage of the original reads; for example, a sequence “AAAAAAAAAC” often displays a minor genotype for C in position 10, but this C is often an artifact due to sequence errors intrinsic to pyrosequencing.
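The four criteria can be expressed as a small filter applied to the base counts at each locus. The Python sketch below is a hedged illustration, not the procedure carried out in Consed. The function name and input layout are hypothetical, the homopolymer flag is assumed to be computed elsewhere, and the minor-allele percentage is assumed to be taken over the ungapped reads, which the text does not state explicitly.

```python
def call_intravarietal_snp(base_counts, in_homopolymer=False,
                           min_depth=50, min_minor_frac=0.10):
    """Apply the depth, minor-allele and homopolymer filters to one locus.

    `base_counts` maps each observed base to its number of ungapped,
    quality-filtered reads, e.g. {"A": 90, "G": 20}. Reads with a gap at
    the locus are assumed to have been removed already (criterion 3), which
    also makes the raw-depth check (criterion 1) redundant here.
    Returns (major_base, minor_base) or None if the locus is rejected.
    """
    if in_homopolymer:
        return None  # criterion 4: skip sites overlapping homopolymer runs
    depth = sum(base_counts.values())
    if depth <= min_depth:
        return None  # criteria 1 and 3: more than 50 ungapped reads
    ranked = sorted(base_counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None  # only one nucleotide type observed
    (major, _), (minor, minor_count) = ranked[0], ranked[1]
    if minor_count / depth <= min_minor_frac:
        return None  # criterion 2: largest minor allele must exceed 10%
    return major, minor
```

For example, a locus with 90 reads supporting A and 20 supporting G passes all filters, whereas 95 A and 5 G fails the 10% minor-allele threshold.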
We purified total RNA from leaf, root, and flower bud of cultivar Khalas and sequenced the cDNAs using the Roche GS FLX system according to the manufacturer's protocol. The total RNA was reverse-transcribed using an oligo-dT adaptor. The adaptor sequences are:
After reverse transcription, the cDNAs were sheared into small fragments to construct sequencing libraries, followed by emulsion PCR and sequencing. We mapped all raw reads to the date palm cp genome assembly using both Newbler and BLAST. We acquired non-redundant contigs and singlets by assembling the mapped reads, and collected only those covering over 80% of a gene with 99% identity. If a contig or singlet contained several different genes or a single gene, we designated it a polycistronic or monocistronic transcription unit, respectively. The poly- and monocistronic transcripts were estimated from the assembly results for leaf, where chloroplasts are most abundant. These transcription units were carefully inspected manually using the Consed graphical interface to avoid assembly mistakes and contigs interrupted by intron-containing genes.
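The assignment rule can be sketched in Python as follows, under stated assumptions: contigs and genes are given as half-open genome intervals, the 99% identity check is assumed to have been applied upstream, and the function name, gene names, and coordinates are hypothetical placeholders rather than values from this study.

```python
def classify_transcription_units(contigs, genes, min_cov=0.8):
    """Label each contig by the number of genes it covers.

    `contigs` and `genes` map names to (start, end) half-open intervals on
    the cp genome. A gene counts as covered when the contig overlaps more
    than `min_cov` of its length. Contigs covering no gene are omitted.
    """
    units = {}
    for contig, (c_start, c_end) in contigs.items():
        covered = []
        for gene, (g_start, g_end) in genes.items():
            overlap = min(c_end, g_end) - max(c_start, g_start)
            if overlap > min_cov * (g_end - g_start):
                covered.append(gene)
        if len(covered) > 1:
            units[contig] = ("polycistronic", covered)
        elif covered:
            units[contig] = ("monocistronic", covered)
    return units


# Hypothetical coordinates, for illustration only.
genes = {"geneA": (100, 200), "geneB": (250, 400), "geneC": (1000, 1100)}
contigs = {"c1": (90, 410), "c2": (1010, 1100), "c3": (150, 260)}
units = classify_transcription_units(contigs, genes)
```

Here `c1` spans two genes and is labeled polycistronic, `c2` covers 90% of one gene and is labeled monocistronic, and `c3` is dropped because it covers no gene sufficiently. Intron-containing genes would need exon-level intervals, which is one reason manual inspection remains necessary.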
A complete list of all primer pairs used for the date palm chloroplast genome.
(1.08 MB PDF)
The locations and sequences of all putative small inversions in the date palm chloroplast genome.
(0.01 MB PDF)
The intravarietal SNPs found in non-coding regions.
(0.01 MB PDF)
Multiple alignments of cemA genes from 77 different cp genomes.
(0.02 MB PDF)
Partial alignments of 454 transcriptome reads with the date palm cp genome downstream of atpF and upstream of atpA. Comments are marked with thick blue lines and blue font.
(0.25 MB TIF)
We thank Dr. Yongjun Fang as well as Mr. Xiaoguang Yu and Guangyu Zhang for their useful suggestions and excellent technical support in sequencing experiments.
Competing Interests: The authors have declared that no competing interests exist.
Funding: The project is supported by a project grant (KACST 428-29) from King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia. The funder had no role in study design and analysis.