Spawning and sample collection
Our N. vectensis culture was founded with adults from Mark Martindale’s laboratory (University of Hawaii). Animals were kept in 12 parts per thousand seawater (Nematostella Medium: NM) at 16°C, fed newly hatched Artemia twice weekly, and cleaned once a week. Females were kept in female-only bowls.
A total of two replicates time course were collected in this study. For each time course, a single pool of embryos, spawned from the same parents at the same time, was sampled over the course of 10 days. Spawning was induced by feeding female-only bowls and mixed bowls (with males and females) oyster, followed by a water change and placement on a light box attached to a timer [30
]. They were exposed to 8 hours of light, with bowl water temperatures increasing to above 24°C. After light exposure, animals were moved to a dimly lit room, and any eggs spawned overnight were removed. Bowl water was changed to room temperature 0.2 μm filtered NM.
Females began to spawn approximately two hours after light exposure ended. Every 30 minutes newly spawned eggs were moved to small mesh bottom cups in NM from mixed male/female bowls (which contained sperm), kept at 18-23°C. Time of fertilization was considered to be at the time cups were placed in mixed water. Eggs were allowed to sit in water from mixed bowls for one hour. NM from the mixed male/female bowls was changed once over the course of collection. In N. vectensis, eggs are secreted in a gelatinous matrix, which must be removed before embryos can be sorted. Embryos were de-jellied by rocking them for 15–30 minutes in 40 ml NM, with 1.6 g cystine and 12 drops 5M NaOH. Embryos were rinsed 3 times with 0.2 μm 18°C filtered NM, divided and placed in 0.2% gelatin coated dishes. A total of six dishes, one for each time point, were prepared. Each dish had approximately 500–1000 embryos. These dishes were kept at 18°C for the duration of development.
We sampled six time points during each replicated time series. The target sampling points were zygote (2 HPF), early blastula (7 HPF), mid-blastula (12 HPF), early gastrula (24 HPF), planula (5 DPF), and young polyp (10 DPF). The exact sampling times had minor variation, and were 2.50, 7.23, 12.23, 23.60, 125.42, and 240.07 HPF for the first replicate spawning, and 2.55, 7.30, 12.48, 23.63, 125.50, 240.13 HPF for the second replicate. Prior to sampling for gene expression, any anomalous or un-cleaved embryos were removed (after 2 HPF), and the remaining embryos were rinsed with 0.2 μm 18°C NM. The fertilization rate was higher than 90% for both replicates. For each time point, approximately 500–1000 embryos were placed in RNAse-free non-stick tubes, excess liquid was drawn off, and they were snap frozen on liquid nitrogen.
mRNA extraction & HiSeq preparation
Messenger RNA was extracted directly from tissue with Dynabeads from the mRNA Direct Kit (Invitrogen) with only one round of bead hybridization. mRNA was treated with Turbo DNase (Ambion) and concentrated by ethanol precipitation. Re-suspended mRNA samples were quantified with Qubit. Samples were prepared for multiplex sequencing using Illumina TruSeq RNA Sample Prep Kits A and B (part # FC1221002 (kit B), Lot: 5781467) according to manufacturer's instructions, with an RNA fragmentation time of 8 minutes at 94°C.
All twelve samples were sequenced in a single lane on the Illumina HiSeq 2000 at the Brown Genomics Core Facility. Reads were single-end 50bp, with a separate read to sequence the sample index. Reads were de-multiplexed according to their index sequences with Casava version 1.8 (Illumina). Reads that did not pass the Illumina chastity filter were discarded.
Reference and mapping
Filtered transcript predictions from the Joint Genome institute (JGI) N. vectensis
genome project (http://ftp.jgi-psf.org/pub/JGI_data/Nematostella_vectensis/v1.0/annotation/transcripts.Nemve1FilteredModels1.fasta.gz
) were used as reference sequences [1
]. The original JGI transcriptome file has 27,273 predicted transcripts, with a mean contig length of 1,092 nucleotides. Some of these transcripts contain fragments of ribosomal RNA sequences, which, due to the high expression of ribosomal RNA, could complicate analyses of differential gene expression. We therefore removed transcripts that matched any of the following sequences according to a blastn search with an e-value less than 1e-40: 28S (extracted from the N. vectensis
genome), 18S (GenBank: AF254382.1) and the mitochondrial genome (including 16S and 12S; GenBank: DQ643835.1). 762 transcripts matched these sequences and were removed, though some gene predictions that include rRNA fragments that did not match with these stringent criteria are still present. The 18S, 28S, and mitochondrial sequences itemized above were then added to the reference, so that the number of reads that mapped to these high-abundance sequences could be quantified. This produced a reference of 26,514 transcripts. The modified reference, including single copies of the ribosomal sequences and the mitochondrial genome, is provided as Additional file 1
Our sequence reads were mapped to the reference using bowtie 2.2.0 beta3, with the --very-sensitive-local and -a flags. The --very-sensitive-local increases sensitivity at the cost of computational resources, while -a returns all possible mappings for a single read, rather than just the top mapping. A list of full commands used can be found in (Additional file 7
). Counts were generated from the bowtie2 map file using an in-house script (Additional file 8
). This script does not count any read that maps to more than one reference sequences, and multiple mappings to the same reference sequence are only counted once. This reduces the impact of misassembled transcripts on the derivation of read counts.
Statistical analyses of expression
Statistical analyses were performed with R version 2.15.2. The R code for the analyses are in Additional files 9
. The matrix with raw read counts, normalized counts, statistical analyses of changes in expression, top blast hit, and other annotations is in Additional file 2
. This file includes all reads present in the filtered transcript predictions from the N. vectensis
genome project, except redundant reads matching our query ribosomal sequences, which were removed (as discussed in “Reference and mapping”). One copy each of 18S, 28S and the mitochondrial genome were manually added after statistical analysis. This matrix can be used to examine genes and time points not presented in the main text.
We tested the significance of differential expression between each pair of adjacent sampling time points, using the R library edgeR version 3.0.4 [31
]. Since there were six sampling time points, there were five intervals that were tested. Transcripts without read counts or with very low read counts were filtered out before performing the test. This filtering strategy aimed at keeping transcripts with an average read count of at least 1 count per million (CPM) for replicates of a particular time point (keep <− rowSums(cpm(d) > 1) >= 2). TMM normalization was applied to account for compositional difference between libraries using the function calcNormFactors [32
]. Dispersion was estimated using the functions estimateCommonDisp and estimateTagewiseDisp. Testing for differential expression was done using the function exactTest. P-values were corrected for multiple testing using p.adjust and Bonferroni correction. We considered genes with an adjusted p-values below 0.05 as differentially expressed.
We refer to an increase or decrease in read counts between time points for a gene as an increase or decrease in expression of the gene. We acknowledge that this is a simplification of terminology, as numerous biological factors can influence mRNA abundance in cells, including mRNA degradation rate, so that read counts are not a strict measure of expression.
Temporal patterns of expression for individual transcripts were categorized and visualized with Short Time-series Expression Miner (STEM) version v1.3.8 [33
]. These analyses were performed on normalized count data, averaged between replicates. Normalized data were generated by multiplying each count by a normalization multiplier (generated by dividing 1 million by the multiple of the library size and a normalization factor generated using the EdgeR package (using the calcNormFactors function)). Data were input with the Normalized Data option. The data were fit to 50 possible expression profiles, with the maximum unit change between two time points set to two. The STEM output is in Additional file 6
, and the profiles it produced are shown in Additional file 4
STEM profiles only measure abundance, and do not assess significance. To identify genes with a significant peak of expression at a particular time point, we first used STEM profiles to identify all genes with a pattern of interest, then evaluated the significance of the change in abundance for each transcript, keeping only those transcripts that met our significance criteria.
The reference was compared with blastx to the metazoan sequences in the NCBI nr protein database with an e-value cutoff of 1e-5. Specific GO annotations were then produced with the blast2go command line interface [34
] and a local instance of the blast2go database (version b2g_may10). The blast2GO output was modified to fit the gene2GO format used by the R package topGO version 2.10.0 (Additional file 11
]. We used functionality of topGO to build acyclic graphs for the three domains, biological process (BP), molecular function (MF) and cellular compartment (CC), based on the blast2go annotations. After building the graphs the complete set of annotations were exported using the function genesInTerm. The resulting file (Additional file 12
) in combination with the reference matrix (Additional file 2
) allows for the retrieval and subset analysis of transcripts that received a particular GO term.
A gene set enrichment analysis was performed using the R package GOseq version 1.10.1 [36
]. Adjusted p-values from the edgeR analysis and a cutoff of 0.05 were used to construct a numeric vector of differentially expressed genes. Annotations in Additional file 12
were loaded as category mappings. Weightings for each gene, depending on its length, were obtained using the probability weighting function nullp and over and underrepresented GO categories were calculated using the Wallenius approximation. P-values were adjusted using p.adjust and the Benjamini and Hochberg method. We considered categories with an adjusted p-value below 0.05 as enriched. The complete list of enriched GO terms for each GOseq analysis are in Additional file 5
. The R code for this analysis can be found in Additional file 10