Our normalized S. crassipalpis library was first subjected to a one-quarter plate titration run on the 454 GS-FLX machine, which yielded 1,512,233 total bases from 5,727 sequence reads that assembled into 554 contigs and 3,182 singletons. Of these contigs and singletons, 1,465 were identifiable at E ≤ 0.001 using BLASTX against the NCBI non-redundant protein database (NR).
We followed this one-quarter plate preliminary run with a full-plate production run that yielded 72,816,811 total bases from 281,537 sequence reads and combined the results of both runs into a single dataset, which was assembled de novo. Files containing our results are available from the National Center for Biotechnology Information Short Read Archive (accession numbers SRR005065 and SRR006884). After filtering adaptors and low-quality sequences [22] we were left with 74,329,044 total bases from 207,110 sequence reads with an average length of 241 bases. These data were assembled into 20,995 contigs with a mean length of 332 bp and a range of 30 to 2,958 bp, as well as an additional 31,056 singletons, for a total of 52,051 high-quality sequences. A file of all assembled contigs and singletons is available from the authors upon request.
Subjecting these sequences to a BLASTX search against the NCBI-NR protein database yielded 19,609 well-identified sequences with at least one hit at E ≤ 0.0001 (37%), 15,241 poorly-identified sequences with hits between E = 0.0001 and E = 10, and 32,433 with no useful hits (E > 10). Subjecting these sequences to a BLASTN search against the NCBI-NT sequence database yielded 6,321 well-identified sequences with a hit at E ≤ 0.0001 (12%), 44,268 poorly-identified sequences with hits between E = 0.0001 and E = 10, and 1,453 with no useful hits (E > 10). Well-identified sequences (E<0.001) from both searches were combined and compared with each other using BLASTN, reducing the well-identified sequences to 11,757 unique gene elements, or unigenes. To estimate how many of these unigenes represented unique transcripts, as opposed to different non-contiguous fragments of the same transcript, we produced a file of the top 50 hits for each unigene and used these lists to find accession numbers that overlapped among different unigenes at E<0.001, using a custom Excel macro followed by manual editing. This procedure identified 9,317 sequences that are likely independent transcripts (Table ).
Summary of Sarcophaga crassipalpis EST data.
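The overlap step described above was carried out with a custom Excel macro and manual editing, neither of which is reproduced here. As a rough illustration of the underlying idea only, the following Python sketch groups unigenes that share at least one hit accession, using a union-find structure in place of the macro. The function name, the input layout, and the toy accession labels are all hypothetical, and the sketch assumes that the hit lists have already been filtered by E-value.

```python
from collections import defaultdict


def group_by_shared_accessions(hits):
    """Group unigenes that share at least one hit accession.

    `hits` maps a unigene ID to the set of accession numbers among its
    top hits (assumed to be E-value filtered already). Returns a list of
    sets of unigene IDs; each set is one putative transcript.
    """
    parent = {unigene: unigene for unigene in hits}

    def find(u):
        # Follow parent links to the group representative (path halving).
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    first_seen = {}  # accession -> first unigene observed with that hit
    for unigene, accessions in hits.items():
        for acc in accessions:
            if acc in first_seen:
                # Shared accession: merge the two groups.
                parent[find(unigene)] = find(first_seen[acc])
            else:
                first_seen[acc] = unigene

    groups = defaultdict(set)
    for unigene in hits:
        groups[find(unigene)].add(unigene)
    return list(groups.values())


# Hypothetical toy input, not data from this study.
toy_hits = {
    "U1": {"ACC_A", "ACC_B"},
    "U2": {"ACC_B", "ACC_C"},
    "U3": {"ACC_D"},
    "U4": {"ACC_C"},
}
toy_groups = group_by_shared_accessions(toy_hits)
print(len(toy_groups), "putative independent transcripts")
```

Merging on any single shared accession is deliberately permissive and will tend to over-merge members of large gene families, which is one reason a manual editing step such as the one described above remains useful.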
While the ~9,000 identified transcripts are certainly fewer than the entire transcriptome, we acquired many sequences of interest. Because our groups, among others, are particularly interested in diapause and stress biology, we compiled a brief list of genes that we expect to play a role in diapause (Additional file 1). Clearly, even this abbreviated list will provide fodder for years of functional genomics work on these candidate genes for our group and others.
Our major goal was to generate a substantial representative sample of the transcriptome. To assess whether our EST project was broadly representative of our expectations for the Sarcophaga transcriptome, we used the annotations for each of our contigs and singletons to assign it to one of 14 major GO categories for Biological Process, producing assignments for 9,468 out of a total of 52,051 transcripts (Fig. ). We also compared the distribution of our sequences among these 14 GO functional groups within Biological Process to the distribution of the 15,183 predicted genes for Drosophila melanogaster, 5,053 of which are placed into those same functional groups (Fig. , FlyBase, version 2008_07). Although there are noticeable differences between the Drosophila and Sarcophaga data in the percentage of transcripts within a few sub-categories, concordance in the overall distributions suggests that our library sampled widely across sub-categories and provides a good representation of the S. crassipalpis transcriptome. As sequencing technologies improve, further exploration of the S. crassipalpis transcriptome will likely include sequencing of new libraries made from key tissues during focal life stages that may not have high representation in total body extracts, such as brains from diapausing pupae or ovaries from pre- and post-reproductive females. It will likely also include deeper sequencing, both to reveal rare transcripts and to provide the additional information necessary to identify more of the unannotated transcript fragments in our current whole-body library.
Sarcophaga crassipalpis sequences were classified into one of 14 major sub-categories within the Biological Processes GO category.
Figure 2 A comparison of the distribution across 14 major Biological Process GO sub-classes in our Sarcophaga crassipalpis library versus predicted ESTs from Drosophila melanogaster. The sub-categories are CS = cell communication (signaling), RP = Regulation of ...
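The comparison of GO distributions rests on converting per-category counts to percentages of the assigned sequences in each species. A minimal Python sketch of that arithmetic follows. Only the four totals quoted in the text are taken from the study; the per-category counts and the function names are hypothetical placeholders, since the per-category values are shown only in the figures.

```python
def to_percentages(counts):
    """Convert per-category counts to percentages of the assigned total."""
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


def percentage_differences(counts_a, counts_b):
    """Per-category difference in percentage points (a minus b)."""
    pct_a, pct_b = to_percentages(counts_a), to_percentages(counts_b)
    return {category: pct_a[category] - pct_b.get(category, 0.0)
            for category in pct_a}


# Fractions of sequences placed in the 14 categories (totals from the text).
sarcophaga_assigned = 9468 / 52051
drosophila_assigned = 5053 / 15183
print(f"Sarcophaga assigned: {sarcophaga_assigned:.1%}")
print(f"Drosophila assigned: {drosophila_assigned:.1%}")

# Hypothetical per-category counts for illustration only.
toy_sarcophaga = {"CS": 120, "RP": 300, "other": 580}
toy_drosophila = {"CS": 90, "RP": 210, "other": 200}
print(percentage_differences(toy_sarcophaga, toy_drosophila))
```

Working in percentages of the assigned sequences, rather than raw counts, keeps the two species comparable even though very different fractions of each dataset received a GO assignment.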
Because our EST database appears broadly representative of the S. crassipalpis transcriptome, we compared it to the available transcriptomes of D. melanogaster and A. gambiae to assess whether coding sequences were evolving at equivalent rates across GO categories. Non-synonymous rates of substitution (dN) were roughly gamma-distributed across genes within each GO category for all three pairwise comparisons. Median values across all GO categories were generally lowest in comparisons of S. crassipalpis to D. melanogaster (Fig. ), reflecting their close relationship within Cyclorrhapha compared with the relatively deep phylogenetic split between Cyclorrhapha and Culicomorpha. The 95% confidence intervals of the median (notches in Fig. ) did not differ substantially across categories within any of the three pairwise comparisons (similarly shaded bars in Fig. ), suggesting largely homogeneous rates of amino acid substitution. However, the median dN for Response to stimuli had the highest (D. melanogaster to A. gambiae; S. crassipalpis to A. gambiae) or second highest (S. crassipalpis to D. melanogaster) value in the rank ordering among GO categories across all three pairwise comparisons. This consistently high dN across multiple comparisons of species that diverged on the order of 200 million years ago [23] suggests that genes regulating responses to stimuli may evolve relatively rapidly across dipteran taxa. This divergence meshes well with obvious differences in sensory strategies among these three species, which must identify and locate very different nutritional resources (rotting flesh, live animals for blood meals, and rotting fruit). In addition, responses to pheromonal and behavioural cues associated with reproduction vary widely across taxa and are likely under strong diversifying selection.
Figure 3 Box plots of the distributions of non-synonymous substitution rate (dN) values in each GO category. The pairwise comparisons are: S. crassipalpis to D. melanogaster (a), S. crassipalpis to A. gambiae (b), and A. gambiae to D. melanogaster (c). The upper ...
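The text does not state how the notches were computed. A common convention for notched box plots, assumed here purely for illustration, places the notch at the median ± 1.57 × IQR / √n, which gives an approximate 95% interval for the median. The Python sketch below applies that convention to a list of dN values; the function name and the toy values are hypothetical.

```python
import math
import statistics


def median_notch(values):
    """Approximate 95% interval for the median under the assumed
    notched-box-plot convention: median +/- 1.57 * IQR / sqrt(n).

    Returns (lower, median, upper).
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median, median + half_width


# Hypothetical dN values for one GO category, not data from this study.
toy_dn = [0.1, 0.2, 0.3, 0.4, 0.5]
lower, median, upper = median_notch(toy_dn)
print(f"median = {median:.3f}, notch = ({lower:.3f}, {upper:.3f})")
```

Under this convention, two categories whose notches do not overlap have medians that differ at roughly the 5% level, which is the sense in which overlapping notches across categories point to largely homogeneous substitution rates.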
In addition to the focal animal, transcriptomics projects can also provide insights into parasites, pathogens, and other associated microorganisms. Because we used whole organisms grown in non-sterile culture conditions, we expected a portion of our library to include microorganismal sequences, some of which are directly associated with S. crassipalpis and some of which are inadvertent environmental contaminants. We identified only 160 non-metazoan sequences that appeared to be independent transcripts with an E<0.0001 from our BLASTX and BLASTN searches. Although we expect that flesh flies have a rich microorganismal community, our selection of poly-A RNA as the starting material for our library likely eliminated many potential microbial sequences. Few hits were ascribed to any single microorganismal taxon. Roughly one-third of the sequences had high sequence similarity to various eukaryotic microorganisms, including yeasts (e.g., Candida, Saccharomyces), filamentous fungi (e.g., Aspergillus, Neurospora), and protozoans (e.g., Plasmodium, Theileria). Sequences of apparent bacterial origin included many genera with species that have been associated with insects either as pathogens or as normal flora in the gut (e.g., Burkholderia and Escherichia), but these sequences could just as easily represent environmental contamination. Perhaps most interesting of the non-metazoan sequences were those for transposable elements, particularly sequences in the Mariner-Tc1 family and several retroviral elements. With future development, such sequences may provide insight into new functional genomics tools for Sarcophaga, potentially including transgenic modification.
Another interesting feature of our 454 sequence data is that a number of sequences appear to contain repeat motifs (i.e., microsatellites) or single nucleotide polymorphisms (SNPs). We searched for microsatellite motifs within our sequences using the program MSATCOMMANDER [24] to identify sequences with di-, tri-, tetra-, penta-, and hexanucleotide repeats. Our search for repeat motifs revealed 521 sequences containing some such motif from the total set of 52,053 sequences (using a minimum repeat length of seven, Additional file 2). The high levels of variation typically found for microsatellite markers have led to their widespread use for a broad range of studies such as genome mapping, parentage analysis, and analyses of gene flow and genetic structure. Because our library was constructed from a long-standing laboratory colony expected to harbor little standing variation compared to field populations, we did not screen any of these loci for allelic variability. Similarly, to identify SNPs in our contigs we used the program PolyBayesShort, a version of the PolyBayes program [25] that has been optimized for data from short read platforms such as 454, under the default parameters. We found a total of 12,090 SNPs that were distributed among 3,436 of our 20,995 contigs, yielding an average of 7.23 SNPs per 10 kb. However, because our samples were derived from a long-standing, inbred laboratory colony, our SNP estimates are likely a small fraction of the variation existing in S. crassipalpis populations in the field. We hope that identification of candidate microsatellite and SNP loci will be useful for others interested in field-based population studies of S. crassipalpis and closely related Sarcophaga species, some of which are important for forensic studies.
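To make the repeat-motif criterion concrete, the following Python sketch scans a sequence for di- to hexanucleotide motifs repeated in tandem. It is not MSATCOMMANDER and does not reproduce that program's algorithm or options. It assumes that "minimum repeat length of seven" means at least seven tandem copies of the motif, and the function names and toy sequence are hypothetical.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit
    (for example, 'ATAT' is two copies of 'AT' and is rejected)."""
    return motif not in (motif + motif)[1:-1]


def find_microsatellites(seq, min_copies=7, motif_sizes=range(2, 7)):
    """Return (start, motif, copies) for each perfect tandem repeat.

    min_copies=7 reflects the assumed reading of the minimum repeat
    length; motif_sizes covers di- to hexanucleotide motifs.
    """
    seq = seq.upper()
    hits = []
    for k in motif_sizes:
        # One motif of length k followed by at least min_copies - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return hits


# Hypothetical toy sequence, not data from this study.
toy_seq = "GGG" + "CA" * 8 + "TTT"
print(find_microsatellites(toy_seq))
```

The sketch detects only perfect repeats; interrupted or compound microsatellites, which dedicated tools can report, would need additional logic.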