454 sequencing, assembly, and sequence analysis
A cDNA sample was prepared from coral larvae as illustrated in Figure , and sequenced using the 454 GS-Flx platform. This single sequencing run produced 628,649 reads, with an average sequence length of 232 bases (SD = 55, range = 30–474). An overview of the sequencing and assembly is given in Table . After removal of adaptor sequences, 623,267 reads remained, with an average length of 216 ± 58 bases. This comparison revealed that 99% of the reads produced contained useful sequence data. A total of 146 Mb of raw sequence data were generated, of which 135 Mb remained after adaptor trimming (92% of sequenced bases). Size-selection to remove outlier reads (unusually long and unusually short) reduced this to 134 Mb (599,248 reads; 95% of the original number of reads). The size distribution for these trimmed, size-selected reads is shown in Fig , revealing that 92% of trimmed reads fell between 100 and 300 bp in length.
Figure 1 Diagram of cDNA synthesis and 454 library preparation procedures. Three fragment types are produced by sonication: 5', internal, and 3'. Ligation with the partially double-stranded adaptors A and B produces, for each fragment type, certain adaptor configurations (more ...)
Summary of sequencing, assembly, and analysis
Figure 2 Overview of A. millepora transcriptome sequencing and assembly. (A) Size distribution of 454 sequencing read sizes after removal of adaptor sequences and outliers. (B) Size distribution of assembled sequences after assembly and contig joining. Note the (more ...)
Assembly of the trimmed, size selected reads along with the publicly available EST sequences (NCBI's dbEST) for A. millepora produced 44,444 contigs, with 62,657 reads remaining as singletons. Contigs ranged from 86 to 7,830 bp in size, with an average of 440 bp and an N50 of 693 bp (i.e., 50% of the assembled bases were incorporated into contigs 693 bp or longer). The size distribution for these contigs is shown in Fig . The assembly produced a substantial number of large contigs: 11,813 contigs were >500 bp in length, and 4,654 were >1 kb. Sequencing coverage ranged from 1 to 65, with an average coverage of 5. As expected for a randomly fragmented transcriptome, there was a positive relationship between the length of a given contig and the number of reads assembled into that contig (Fig ). The majority of assembled reads were incorporated into the top 20,000 largest contigs, which included 74% of the bases sequenced and accounted for 47% of the total assembled length (Fig ).
Singleton sequences (i.e., reads that were not assembled into contigs) were retained on the basis that although some singletons might represent artifacts or contaminants, others are likely fragments of transcripts present at low levels in the original sample. This interpretation is supported both by analysis of sequence similarity and by PCR validation of singletons. 5,330 of the 62,657 singletons had significant blast matches in the Swiss-Prot database, and nearly half of these top hits (2,404) were not found among the top hits from contigs in this same database. This finding suggests that singletons contain sequence information not found among contigs, consistent with the possibility that they represent unique genes expressed at levels low enough to hinder adequate sampling. To confirm that these sequences were present in our experimental material (i.e., were not simply sequencing artifacts), we randomly selected 10 singletons for primer design. PCR analysis confirmed, for 8 of the 10 tested, that these sequences are present in the original cDNA (additional file 1
). These 8 primer pairs each produced a single band of the expected molecular weight, and the others gave either no product or multiple bands. A large number of cycles (35–40) were required to produce detectable products, consistent with the possibility that these sequences represent genes expressed at low levels. Sanger sequencing of the eight successful PCR products confirmed the identity of five, with the other three matching poorly to the target singletons (Table ).
Validation of singleton sequences and the contig joining procedure by PCR amplification and Sanger sequencing.
Of the 107,101 assembled sequences (contigs plus singletons), 24,850 had significant matches in public protein sequence databases (nr). These blast matches correspond to 17,707 different unique accession numbers, of which 2,331 were each matched by multiple queries without overlap. These 2,331 subject sequences correspond to 5,327 different query sequences (~2.3 queries matched each subject, on average). For each of these protein sequences, all queries matching that subject were joined together in the appropriate order and strand (sense/antisense) to produce a scaffold, a construct intended to represent the collection of fragments originating from a single transcript. Overall, the contig joining process condensed the number of sequences from 107,101 to 104,005. To validate the contig joining procedure, ten sets of sequences that had been joined in this fashion (i.e., 10 scaffolds) were randomly selected for validation. Note that because the gaps in scaffolds are of unknown length, the scaffold sequences provide minimum expectations for PCR product size. All ten scaffolds were successfully amplified by PCR (additional file 1
). Sanger sequencing of these PCR products confirmed the identity of eight scaffolds at both ends, and the other two at only one end (Table ). These findings confirm that the scaffolds produced from our contig joining procedure successfully recapitulate cDNA sequences in most cases. This strategy successfully identified different parts of the same transcript for >2,000 transcripts in our study, and should be widely applicable for other transcriptome sequencing efforts in emerging model organisms.
The preceding steps, including assembly of raw reads and contig joining, were expected to reduce redundancy among sequences (i.e., more than one sequence per gene). To characterize any redundancy that remained, assembled sequences were clustered using two parallel procedures based on (1) nucleotide sequence similarity among the assembled sequences themselves, and (2) protein sequence similarity among the query sequences' best blast matches. The final set of 93,466 merged clusters was obtained from the union of these protein and nucleotide sequence clusters. Excluding the large clusters matching transposable elements, sequence clusters ranged from 1 to 64 sequences in size, with most containing only 1–2 sequences and a small number of clusters containing more than 2. The largest clusters could indicate paralogs, highly-divergent alleles, well-conserved gene families, alternative splice variants, or simply high levels of expression with correspondingly high sequencing error due to increased sampling. Regardless of the biological interpretation of sequence clusters, the clustering of 104,005 assembled sequences into 93,466 clusters indicates that relatively little redundancy remained after assembly and joining. Annotation of sequences clusters allows the rapid identification of all sequences with similarity to a particular gene of interest, facilitating studies of sequence polymorphisms and closely related genes.
Functional annotation of the transcriptome
Because the significance of sequence similarity depends in part on the length of the query sequence, short sequencing reads obtained from next-generation sequencing (e.g., the ~216 bp average trimmed read length obtained in this study) frequently cannot be matched to known genes [21
]. Most of the 104,005 assembled sequences analyzed here are short (61% ≤ 250 bp), and correspondingly few have significant blast matches in NCBI's nr database (23.9%). The proportion of sequences with matches in public databases is greater among the longer assembled sequences, as demonstrated by applying a series of increasing minimum size cutoffs and recording the number of queries with significant matches (additional file 2
). Among sequences ≥ 300 bp (i.e., longer than a single read), significant matches were found in nr for 62%, and among those longer than 1 kb, the proportion increased to 88.8% (Table ).
Summary of annotation of the A. millepora larval transcriptome.
Assembled sequences were assigned gene names based on the gene product and gene name annotation of the best blast match for that sequence. This procedure successfully assigned gene names for 15,860 sequences among the entire dataset; for 9,464 of the sequences ≥ 300 bp in length; and for 3,889 of the sequences ≥ 1 kb in length (Table ). Among the 15,860 annotated best hits, 11,633 different gene names were assigned, providing a rough estimate of the number of different genes expressed in these libraries. This is probably an underestimate because many sequences lacked matches in public sequence databases (Table ) and were therefore not assigned gene names.
Analysis of protein domains revealed that 12,785 of the assembled sequences matched profiles in NCBI's conserved domains database, corresponding to 4,475 different domains. Most domains were found in only 1–2 sequences, with a small number appearing more frequently. The top 20 most frequently detected domains include conserved domains associated with transcription factors, growth factors, and signaling pathways. The utility of domain annotation is that it allows researchers to quickly select all genes sharing a common domain; for example, selecting the 60 different sequences matching the domain COG5576 (homeodomain) retrieves putative homologs of 51 different genes with known roles in development and morphogenesis (e.g., SIX4, PAX3, dll, and engrailed).
Gene Ontology terms were assigned to 17,902 assembled sequences based on sequence similarity with known proteins annotated with GO terms (UniProt-TrEMBL database). For each sequence, the specific annotated terms were mapped to higher-level (i.e., more general) parent terms to provide a broad overview of the groups of genes cataloged in this transcriptome for each of the three ontology vocabularies. The hierarchical structure of these vocabularies allows the selection of sets of genes involved in a specific process at the desired level of detail. For example, GO annotation revealed 320 sequences implicated in general stress responses, including heat shock factor binding proteins (Table ). Within this set of general stress response genes, subsets of genes were associated with specific stressors (i.e., higher level GO terms), including heat stress (e.g., heat shock protein 70), oxidative stress (e.g., catalase), and response to wounding (e.g., phospholipase A2-activating protein). These searches were based on processes that are the focus of ongoing research in coral biology, and identified a number of genes that had not previously been investigated in the context of coral stress responses (Table ). These GO annotations provide a valuable new resource for investigation of specific processes, functions, or cellular structures involved in coral stress responses.
Candidate genes identified based on GO annotation of A. millepora larval transcriptome.
Pathways and complexes
To evaluate the completeness of our transcriptome library and the effectiveness of our annotation procedure, we searched the annotated sequences for the genes involved in a set of metabolic pathways and protein complexes shared among animal phyla. These simple text searches were based on standard gene names or synonyms. This confirmed that our data include annotated sequences for all genes in the five major pathways considered here (Table ). In a similar set of searches for components of essential protein complexes, almost all the genes were found (91–100%; Table ). The presence of these essential cellular process genes suggests that these annotated sequences accounts for nearly the complete coral larval transcriptome.
Genes from essential metabolic pathways and macromolecular complexes annotated in larval transcriptome.
To further evaluate the depth of coverage in this library, we searched the annotated sequences for sets of regulatory genes known to be present in anemone Nematostella vectensis, the most closely related organism for which an assembled draft of the genome sequence is available. The rationale for these searches was that the presence of these genes in anemone suggests their presence in coral, and their successfully identification in the coral larval transcriptome would support the completeness of this sequencing effort. For a set of 24 genes associated with eight major intercellular signaling pathways, 23 were successfully identified in the transcriptome sequences (Table ). Activin-like kinase was notably absent from this list, but this finding does not imply the absence of this gene in A. millepora; it could simply indicate that the gene is not expressed, or expressed at a very low level, in the developmental stages sampled in this study.
Intercellular signaling pathway genes annotated in larval transcriptome.
A similar search was conducted for a set of transcription factors known to be present in N. vectensis, based on conserved domains characteristic of each transcription factor family. Of the 20 families investigated, these searches identified annotated sequences corresponding to 18 (Table ). These text searches were based on domain identifiers, domain descriptions, and gene names. No matches were found for two families of transcription factors: the cold-shock DNA binding domain (pfam00313), and the paired amphipathic helix repeat (pfam02671). The lack of detection of these genes is not evidence that they are absent in the coral larvae; it might indicate their low expression, incomplete sequencing coverage, or inadequate annotation. Nevertheless, the overall success in identifying the known sets of genes associated with these pathways and complexes indicates that the collection of sequences established in this study provides a reasonably complete description of the coral larval transcriptome.
Major transcription factor families identified by conserved domain annotation of larval transcriptome
Comparisons with Nematostella vectensis
Sequence comparisons revealed broad similarity between the coral sequences described here and the protein sequences predicted from the anemone genome. Of the 104,005 assembled coral sequences, 20,619 have significant matches among the anemone predicted proteins (blastx with adjusted e-value threshold of 10-4). These correspond to 11,572 different predicted proteins. Reciprocal blast searches, in which the anemone sequences were queried against the coral sequences, revealed that 20,976 of the 27,273 N. vectensis proteins had significant similarity with coral sequences (tblastn), corresponding to 9,802 different coral sequences. Comparison of the best blast matches from these searches identified 8,515 unambiguous orthologs between coral and anemone, based on significant reciprocal best matches. Because the coral sequences represent partial transcripts, this is likely an underestimate of the true number of anemone orthologs expressed in this developmental stage.
To investigate whether the coral larval transcriptome contains "coral-specific" genes not found in the anemone, the assembled coral sequences were first compared with public databases. 22,158 coral sequences had significant matches in nr with an e-value ≤ 10-4
; these represent sequences with reasonably strong matches to previously identified proteins in other organisms. A list of all coral sequences that showed even weak similarity to anemone proteins was compiled by searching the predicted proteins (blastx) and the genome assembly (tblastx), using a permissive threshold of e ≤ 0.01 (n = 28,190 sequences). Comparison of these results revealed that 748 coral sequences matched known genes in other organisms but had no significant similarity to anemone proteins from the draft genome. To reduce false positives stemming from the fact that many assembled sequences were not complete cDNAs, the top blast match for each of these (i.e., a complete protein sequence for that gene, from a different organism) was queried against the anemone proteins with a permissive threshold of e ≤ 0.01. After eliminating the sequences with matches in this search, 207 sequences remained, corresponding to 95 different annotated coral genes. These analyses suggest that these coral genes lack orthologs in the anemone. Because the draft assembly of the anemone genome is not complete [24
], further experimental work would be required to verify the absence of these candidate coral-specific genes in the anemone.
SNP detection and validation
Using the QualitySNP program, we identified 33,433 high quality SNPs and 6,820 indels within 14,613 CAP3-assembled contigs (Fig ). These predicted SNPs included 22,312 transitions and 10,823 transversions. The overall frequency of all SNP types in the transcriptome, including indels, was 1 per 207 bp. These predictions provide an extensive set of genetic markers for reef-building corals that will enable future studies of genetic connectivity and genetic mapping at a previously unprecedented level of detail.
Classification of single nucleotide polymorphisms (SNPs) identified from 454 sequences. Overall frequency of these SNP types in the larval transcriptome is one per 207 bp.
Twenty of these predicted SNPs (corresponding to 18 different genes) were selected for validation using PCR and Sanger sequencing, and 14 of these tests (70%) were successful (additional file 3
). Because the genes selected for validation include several previously described stress responses genes [14
], the SNPs confirmed in these analyses may be valuable for investigating the genetic determinants of stress-gene expression. Overall, the successful experimental validation of the majority of computationally predicted SNPs confirms the utility of mining 454 transcriptome sequences for genetic markers.