|Home | About | Journals | Submit | Contact Us | Français|
Despite the extensive phenotypic variation that characterizes the Gesneriaceae family, there is a lack of genomic resources to investigate the molecular basis of their diversity. We developed and compared the transcriptomes for two species of the Neotropical lineage of the Gesneriaceae.
Illumina sequencing and de novo assembly of floral and leaf samples were used to generate multigene sequence data for Sinningia eumorpha and S. magnifica, two species endemic to the Brazilian Atlantic Forest. A total of 300 million reads were used to assemble the transcriptomes, with an average of 92,038 transcripts and 43,506 genes per species. The transcriptomes showed good quality metrics, with the presence of all eukaryotic core genes, and an equal representation of clusters of orthologous groups (COG) classifications between species. The orthologous search produced 8602 groups, with 15–20% of them annotated using BLAST tools.
This study provides the first step toward a comprehensive multispecies transcriptome characterization of the Gesneriaceae family. These resources are the basis for comparative analyses in this species-rich Neotropical plant group; they will also allow the investigation of the evolutionary importance of multiple metabolic pathways and phenotypic diversity, as well as developmental programs in these nonmodel species.
The understanding of the genetic basis of phenotypic and ecological divergences between species is a central question in evolutionary biology (Coyne and Orr, 2004). The investigation of these questions in nonmodel plant species is motivated by their patterns of ecological and phenotypic differentiation during lineage diversification (Elmer and Meyer, 2011), but represents a challenge due to the lack of genomic resources. This challenge can now be overcome with the advent of next-generation sequencing (NGS) methods that allow a better integration of the patterns of molecular, chromosomal, and epigenetic evolution into plant speciation and diversification (Rieseberg and Wendel, 2004; Lexer and Widmer, 2008). Recent studies highlighted the value of genomic resources for the phylogenomics of highly diverse lineages (Yang et al., 2015) and the advances and future research of NGS into broad-scale patterns of diversification (Lexer et al., 2013).
One of the strategies that has revolutionized NGS projects is whole-transcriptome sequencing, or RNA-Seq (Wang et al., 2009). This technique directly accesses the expressed protein-coding genes in a sample, allowing the investigation of differences in gene expression and sequence between conditions, populations, or species, without any previous knowledge of the biological system. The increased adoption of this technique to address ecological and evolutionary questions relies partially on its applicability to nonmodel organisms and its growing toolkit and bioinformatics support (Orsini et al., 2013; Wolf, 2013).
The Ligeriinae subtribe, previously named Sinningieae (Weber et al., 2013), is a clade of Gesneriaceae endemic to the Brazilian Atlantic Forest, which includes three genera (Sinningia Nees, Vanhouttea Lem., and Paliavana Vand.) and 91 species. Ligeriinae shows a large variation in habit, inflorescence form, and corolla shape that evolved during the ca. 30 Myr history of this clade (Perret et al., 2003, 2013). Previous studies on the mechanisms of speciation in Ligeriinae suggested that geographical isolation and repeated adaptations to different functional groups of pollinators and environments have promoted their diversification and geographic expansion (Perret et al., 2006, 2007). Although these results and the recent advances in the knowledge of the evolutionary history in the Gesneriaceae (Serrano-Serrano et al., 2015; Roalson and Roberts, 2016) provided new insights on the diversification of tropical plants, the scarce genomic resources for this group still limit our understanding of the molecular mechanisms underlying their evolution. Recent efforts for generating genomic or transcriptomic resources in the Gesneriaceae have been focused on the Old World members of the family. The leaf transcriptome and the genome of Boea hygrometrica (Bunge) R. Br. have provided evidence for the regulation of physiological and cellular functions during plant dehydration (Zhu et al., 2015; Xiao et al., 2015). Similarly, transcriptomic resources from seedlings of Streptocarpus rexii (Bowie ex Hook.) Lindl. have contributed to the available genomic data for this family (Chiara et al., 2013). In contrast, data for New World members are restricted to two still unpublished National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) projects (i.e., Achimenes Pers. species [accession no. PRJNA340450] and Sinningia speciosa (Lodd.) Hiern [accession no. PRJNA282582]) that will soon complement the efforts for the exploration of the genomics of Neotropical Gesneriaceae.
Here, we contribute to these resources by developing novel NGS resources in the subtribe Ligeriinae. This tribe is particularly suited to address questions about the mechanisms generating morphological and ecological differentiation, such as changes in growth habits or the evolution of floral traits associated with pollinator types (Perret et al., 2003, 2007). This article describes the generation of de novo reference transcriptomes for two perennial tuber species within the Dircaea clade: Sinningia magnifica (Otto & A. Dietr.) Wiehler and S. eumorpha H. E. Moore, with around 9 Myr divergence (Serrano-Serrano et al., 2017b). The data produced include nucleotide and translated protein sequences for the assembled genes, annotation tables that will pave the way for future investigations including population genomic studies, development of new phylogenetic markers, and comparative transcriptomics in these species and their related lineages. We discuss the main differences in the transcriptomic data between the two species, providing information on functional categories that may be the groundwork for the study of the genomic regions associated with the morphological evolution of this clade of Gesneriaceae.
All plants were cultivated in the greenhouses of the Botanical Garden of Geneva in Switzerland (Conservatoire et Jardin botaniques de la Ville de Genève [CJB]; Fig. 1). Tissue material was collected from young leaves and flowers; flowers were sampled at three developmental time points based on the percentage of the total flower size measured from the receptacle to the end of the corolla tube: bud1 = 0–33%, bud2 = 33–75%, and flower = 75–100%. We collected each species at every floral stage and leaf material from two biological replicates represented by different accessions (S. eumorpha accession no.: 200807305 and 20090391J0; S. magnifica accession no.: AC3615 and AC23105). Each sample contained multiple floral tissues, such as sepals, the limb portion of the flower, flower tube, anthers including the filament, and the stigma including the style, which were all combined. All samples were collected from May 2013 to November 2014, and were immediately frozen in liquid nitrogen and stored at −80°C. RNA was extracted with the QIAGEN RNeasy Plant Kit (catalog no. 74904; QIAGEN, Hilden, Germany) and treated with DNase I (QIAGEN) according to the manufacturer’s instructions. Illumina TruSeq (San Diego, California, USA) stranded paired-end mRNA libraries were performed using 2 μg of total RNA, following the library prep kit instructions (protocol version 15031047, Revision D, September 2012) for 300-bp fragments. Library concentration, integrity, and size were determined with the Agilent Fragment Bioanalyzer 2100 (Agilent Technologies, Santa Clara, California, USA) and Qubit fluorometric quantitation (Thermo Fisher Scientific, Waltham, Massachusetts, USA). Eight libraries were constructed for each species (three floral and one leaf, in two replicates each). Illumina sequencing was performed with 100 cycles of paired-end reads in a HiSeq 2500 at the Lausanne Genomic Technologies Facility (Lausanne, Switzerland).
We generated Illumina raw reads from each tissue and replicate that were pooled. Sequencing was even between the two species, with 316 and 332 million reads for S. eumorpha and S. magnifica, respectively. Raw reads were trimmed and filtered with a minimum length of 80 nucleotides and an average quality score higher than 20, using the FASTX-Toolkit (version 0.0.13.2, http://hannonlab.cshl.edu/fastx_toolkit/). Reads have been deposited in the NCBI SRA under BioProject ID PRJNA349129 (Biosamples SAMN05930205: Illumina reads for S. eumorpha [TaxId: 189015], and SAMN05930295: Illumina reads for S. magnifica [TaxId: 189031]). De novo transcriptomes for each species were assembled using the Trinity pipeline (version 2.0.3; Grabherr et al., 2011) using a minimum contig length of 200. The large amount of raw reads from all libraries per species was concatenated and in silico normalized to a maximum coverage of 50 with the Trinity settings –normalize_max_read_cov. We filtered the lowest 5% of the distribution of transcript lengths and a minimum of 1 FPKM (fragments per kilobase of transcript per million reads mapped) to obtain the final transcript set per species (Table 1). The transcriptome sequences are available from the Dryad Digital Repository as a FASTA file (http://dx.doi.org/10.5061/dryad.4r5p1; Serrano-Serrano et al., 2017a). Open reading frames (ORFs) were predicted with TransDecoder (Trinity plug-ins version 2.0.3; Haas et al., 2013). All genes and ORFs (predicted proteins) were annotated using BLASTX and BLASTP against the SwissProt database, producing matches with high-quality annotations. Sequence contaminants were screened and removed using the BLAST results that matched bacterial, fungal, or any other nonplant genetic material. BLAST annotations were filtered to avoid spurious hits using a threshold for the E-value and identity higher than 1 × 10−6 and 55%, respectively. Trinotate (version 2014.07.07; http://trinotate.github.io) was used to integrate the functional annotation, selecting one unique top BLAST hit and gene ontology (GO) annotation. Gene ontologies were plotted and compared between species using the WEb Gene Ontology annotation plot (WEGO) webtool (http://wego.genomics.org.cn/cgi-bin/wego/index.pl; Ye et al., 2006). The assembly statistics are presented in Table 1.
The number of Trinity transcript clusters (hereafter called “genes”), transcripts, and ORFs, and the distribution of mean and total sizes of transcripts were similar between species, although these values were overall higher for S. eumorpha when compared with S. magnifica (Table 1; Fig. 2A, ,2B).2B). The comparison of libraries showed that a large proportion of genes are shared by all stages and tissues (18,636 and 18,386 for S. eumorpha and S. magnifica, respectively; see Fig. 3). Genes found uniquely in leaves, flower buds, and anthetic flowers ranged between 1418 to 6810 and 1234 to 4345 for S. magnifica and S. eumorpha, respectively (Fig. 3). We performed GO enrichment (http://geneontology.org/) for these unique genes (Table 2 shows the results with P < 0.05), which represent the specific GO functional categories enriched at a specific developmental time (flowers vs. early and late buds) and in particular tissues (leaves vs. all flower stages). For instance, the genes uniquely expressed in early stages of flower development (B1) included categories related to reproductive developmental processes and regulation of transcription for both species. For the middle flower stage (B2), a unique GO category (xylan biosynthetic pathway) was enriched in both species. This pathway produces xylan, which is an important polymer found in plant cell walls that is likely related to growth processes. The GO categories at the anthetic flower stage showed strong differences between species. The genes uniquely expressed in flowers of S. eumorpha were enriched for categories such as translation, as well as metabolism of heterocycle macromolecules. In contrast, S. magnifica showed fewer GO categories enriched (response to hormones and inorganic anion transport). The GO categories found between the leaf libraries showed similarities in functions related to hormone responses.
The annotation of the two transcriptomes showed that only a low proportion of the genes have significant BLAST matches with existing curated protein and nucleotide sequences (maximum of 13% and 20% for BLASTX and BLASTP, respectively; see the BLAST annotation file in the Dryad Digital Repository [http://dx.doi.org/10.5061/dryad.4r5p1; Serrano-Serrano et al., 2017a]). This pattern is common for nonmodel species, and BLAST tools are known to fail for annotation of around 75% of the genes for de novo assemblies (Chiara et al., 2013; DeBiasse and Kelly, 2016). Gene ontologies indicated that the proportion of assembled genes associated with each functional category is similar between species, and that the de novo assemblies are comparable (Fig. 4). High and low abundance terms are mostly shared between species, except for three categories at the cellular component level (envelope, membrane-enclosed lumen, and symplast), one category at the molecular function level (electron carrier), and six categories at the biological process level (biological regulation, immune system process, locomotion, multiorganism process, pigmentation, and rhythmic process) that have significantly different gene counts (P < 0.001). These categories most likely contain pathways and genes with a different pattern of expression between the two species.
The two assembled transcriptomes were examined for quality and completeness using two measures: the ortholog hit ratio (OHR; O’Neil et al., 2010) and the Core Eukaryotic Genes Mapping Approach (CEGMA) analysis (http://korflab.ucdavis.edu/datasets/cegma/; Parra et al., 2007). The OHR is computed as the percentage of a gene in the transcriptome that matches a putative ortholog in a related species (Solanum lycopersicum L. [tomato] with ca. 79 Myr divergence; http://www.timetree.org/). It is calculated by dividing the length of the putative coding region by the total length of the orthologous gene. We performed a BLASTX of our genes against the set of transcripts from tomato (International Tomato Annotation Group [ITAG] version 2.4 predicted protein database: 34,725 sequences [December 2014]) and considered the best hit with an E-value < 1 × 10−6 to be the putative orthologs. For the CEGMA analysis, we used clusters of orthologous groups (COGs) for eukaryotes to search the 458 highly conserved core proteins that matched our predicted ORFs. The evaluation of the transcriptome completeness performed using the CEGMA analysis showed that 99.6% of the core eukaryotic genes mapped to the generated genes on each species. Additionally, the evaluation of the OHR indicated that the assembled genes covered a large proportion of the putative reference orthologs in tomato (Fig. 2C, ,2D).2D). For S. eumorpha and S. magnifica, 51.4% and 51.3% of genes have an overlap with the reference orthologous genes higher than 0.8.
The analysis with OrthoMCL (version 2.0.9; Li et al., 2003) identified 8602 orthologous groups between the two species, with 2274 putative single-copy genes (data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.4r5p1; Serrano-Serrano et al., 2017a). This number is comparable with other surveys between closely related species (Zhang et al., 2013). These orthologous groups provide the information for future molecular evaluations and gene-specific analyses. Simple sequence repeat (SSR) markers were identified using the MicroSAtellite Identification Tool (MISA; Thiel et al., 2003), with a minimum number of repeats of six, five, and five for dinucleotides, trinucleotides, and tetranucleotides, respectively. A total of 3428 and 2966 SSRs were identified for S. eumorpha and S. magnifica, respectively. These SSRs are located in 1968 and 2236 genes for S. eumorpha and S. magnifica, respectively, which indicates that approximately 2% of the genes contain at least one SSR marker (1.98% and 2.61% for S. eumorpha and S. magnifica, respectively). Most of these genes contain a single SSR, and the most common SSRs are trinucleotides for both species (Table 1). Although we are unable to currently test these markers at the population level because of the scarcity of samples, we provide the SSRs as a valuable resource for marker development (data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.4r5p1; Serrano-Serrano et al., 2017a).
Our results provide large-scale sequence data for two related species (only six nuclear genes were available for the Ligeriinae previous to this work), and expression and population-level genomic analyses can now be developed. The resources we developed will facilitate the investigation of ecological and evolutionary questions within the Sinningia genus and the Gesneriaceae family. The transcriptomes described in this study generated at least 40,000 Trinity “genes” per species, and 8600 potentially orthologous groups were identified between them. The quality and quantity of data are comparable between the two species and to other plant studies (Zhang et al., 2013; Chapman, 2015), providing valuable transcriptomic data for the Gesneriaceae family.
The number of genes uniquely present in one of the developmental stages differed between species. We found a larger number of genes at the anthetic flower stage compared with other stages (6810 genes in Fig. 3A) in S. eumorpha. In contrast, a higher number of unique genes in S. magnifica was found in middle buds (4345 in Fig. 3B). These differences may indicate that most of the stage-specific expression occurs later in the floral development of S. eumorpha compared with S. magnifica. We should be careful, however, because the tissue identity and growth phases have not been characterized in detail for the species in question. The functional categories associated with the stage-specific genes also differed between species mostly at the anthetic flower stage (Table 2), pointing again to the fact that the late stages of flower development could carry relevant genes for the morphological differentiation between species. Figure 4 showed additional categories where the whole assembled transcriptomes differed between species, such as pigmentation, driving our attention toward flower color, one of the most distinctive traits between the species (Fig. 1C, ,1E1E).
The sequences reported here constitute the raw material for designing probes and primers for phylogenetic and population studies. Additional steps are essential to complement any genomic or experimental survey for this plant family. For instance, the identification of differences in gene expression between taxa and the evaluation of evolutionary constraints or divergent use of genes will help our understanding of the evolution of reproductive isolation and habitat adaptation in plants (Lexer and Widmer, 2008). Furthermore, the identification of candidate genes for flower evolution and the investigation of floral developmental genetics will contribute to our understanding of flower diversity (Chanderbali et al., 2016). The two Gesneriaceae species that we sampled will enable these genetic and comparative analyses, and will provide unprecedented research opportunities for the study of Neotropical plant diversity.