For E. propertius, the large sequences produced by the 454 FLX Titanium allowed for the formation of a 14.6 Mbp assembly from 176 Mbp of EST sequence, with average contig coverage of 10× and average contig length of 753 bp. Similar results were obtained for P. zelicaon.
Comparisons to Bombyx mori suggest that our final assemblies are of high quality. Because βt was generally low for highly covered contigs, and nearly all variant regions had fewer variants than the number of genotypes sequenced, we see little evidence for over-assembly and paralog collapse. Further, the fact that amongst the assemblies tested we have not seen a point of diminishing returns in terms of average ortholog hit ratio (Table ) suggests that even more aggressive assemblers may produce more accurate assemblies for such diverse datasets.
Clustering results and comparison to the P. xuthus mitochondrial genome indicate the presence of ribosomal RNA in at least the P. zelicaon dataset. Although mitochondrial genes (e.g. ND5 and ATP6) are polyadenylated and appropriately found in our datasets, ribosomal RNAs are not, and hence should be considered contamination. While such unigenes can easily be filtered after assembly, the fact that many of these were clustered via hits to a protein predicted dataset (for H. erato) highlights the need for well annotated and curated reference datasets.
Clustering results also reveal that greater than 90% of unigenes had no similarity with other unigenes, indicating thorough assemblies. We searched for five single copy genes present in B. mori
]. For those E. propertius
homologues we found, assembly of unigenes was fairly complete (only one contig associated with each gene) and clustering was accurate (only one cluster contained an extra singleton). For P. zelicaon
, assembly was less complete (multiple contigs per gene), although in only one case were contigs split across two clusters. Coverage of found genes was around 50%, with the exception of the low coverage for PGD. For both organisms, no evidence was found for two of the five genes; based on similar analysis for M. cinxia
ESTs, this appears to be the result of low expression in larval samples.
Determining the breadth of coverage of the transcriptomes is difficult, given how little is known about butterfly genomes. Unigenes for both E. propertius
and P. zelicaon
hit ~ 8 K (of ~ 14.5 K total) unique B. mori
predicted proteins. For both species, at least 9 K unigenes hit one of B. mori
, H. erato
, or D. melanogaster
. Excluding the largest clusters, these hits were distributed roughly evenly between high and low coverage contigs and singletons, supporting previous studies suggesting that singletons and low coverage contigs are biologically valuable [14
]. Gene ontology term analysis reveals that 24 high level categories are present for both species in levels similar to that for B. mori
. Thus, although we cannot speculate on how many transcripts exist in these transcriptomes, they appear to be sampled broadly.
As expected in whole larval samples, we identified unigenes representative of plants, bacteria, fungi, and other non-lepidopteran sources [11
]. Interestingly, a single EST from both species hit to Wolbachia
, a symbiotic bacteria known to affect population dynamics and hypothesized to be present in E. propertius
Vera et al. compared unigene length to length of the best hit protein to estimate completeness of transcript discovery [11
]. Unfortunately, this also includes untranslated regions in unigenes, artificially inflating the desired measure. Our alternative, the ortholog hit ratio, provides a more conservative estimate of the effectiveness of gene discovery and speaks to assembly quality. Greater than 5% of unigenes had a ratio > 0.8 and greater than 10% of unigenes had a ratio > 0.5 for both species. We conclude that at least ten percent of our putative B. mori
orthologs capture approximately 50% of their corresponding silkworm genes.
The effects of alternative splicing on the ortholog hit ratio depend on the abilities of the assembler, as well as on whether the reference B. mori
protein set contains orthologs to alternatively spliced transcripts. Since many assemblers split contigs at ambiguous or repetitive regions [55
], alternative splicing will likely result in a low ortholog hit ratio for the alternative version, even if the alternative ortholog exists in the reference dataset. If the alternative ortholog does not exist in the reference dataset, either the alternative segment will match a subregion of the original transcript form (resulting in a low ortholog hit ratio), or there may be no hit at all (resulting in an undefined ortholog hit ratio). Since these issues serve to reduce ortholog hit ratios rather than inflate them, the conservativeness of the ortholog hit ratio approach is preserved.
While estimating genetic diversity accurately (e.g. computing θ
]) is difficult given the essentially unknown number of natural alleles contributing to the population sample for each contig, some relative comparisons between species should be possible. For example, using a SNP calling criterion similar to our strict criterion and a source dataset similar to ours, Vera et al. estimate 12.6 SNPs per 1,000 base pairs in the M. cinxia
], whereas we estimate 5.89 SNPs/1,000 bp for E. propertius
and 9.28 SNPs/1,000 bp for P. zelicaon
While Novaes et al. caution against comparing β
diversity estimates across assemblies--as they depend on sequencing depth, SNP calling criteria, and other factors--the E. propertius
and P. zelicaon
datasets were collected and processed in nearly identical fashion. Average β
statistics were higher in P. zelicaon
than in E. propertius
, despite smaller sample sizes for P. zelicaon
(owing to the relative difficulty of specimen collection). These comparative diversity results, both in terms of raw SNP counts per 1,000 bases and β
, support previous findings that overall genetic polymorphism is higher for P. zelicaon
than E. propertius
As was found for Eucalyptus grandis
, the β
distributions are all right-skewed (Table ), suggesting purifying selection for the majority of genes [12
is slightly negatively correlated with the number of species hit (0,1,2, or 3 of B. mori
, H. erato
, and D. melanogaster
) for E. propertius
, suggesting that lineage and species specific genes are more diverse for E. propertius
= -0.154, p
< 0.0001). A similar, but weaker and non-significant, trend is found for P. zelicaon
= -0.0254, p
As has been noted before [38
], different assembly programs can produce very different results, as seen in Table . While none of the assembly programs currently in widespread use are designed for ecoinformatics, Liang et al. have suggested that CAP3 is the best choice for ESTs [56
]. However, Liang et al. did not consider the Celera Assembler, and our results suggest that new versions of the Celera Assembler may be more appropriate for data containing a diversity of genotypes.
For further comparison, we also assembled the E. propertius
and P. zelicaon
EST sets with the recently released Newbler assembler version 2.3 (Roche 454 Life Sciences), which has options specifically for transcriptome data. For E. propertius
, Newbler produced 19,110 contigs of average length 637 bp and 36,848 singletons with average (uncleaned) length 314 bp. For P. zelicaon
, 25,336 contigs of average length 730 bp and 20,926 singletons of average (uncleaned) length 297 bp were produced. Newbler version 2.3 also produces a set of sequences known as "isotigs," arrangements of contigs meant to represent splice forms (similar to [55
]). For E. propertius
, 11,677 such isotigs with average length 1,238 bp were produced. 17,520 isotigs of average length 1,309 bp were produced for P. zelicaon
Another factor in successful transcriptome assembly is the sequencing technology used. In our application, the 454 Titanium chemistry sequencer produced average read lengths of about 400 bp. In contrast, the older 454 GS-20 platform used by Vera et al. produced reads averaging 110 bp for the M. cinxia
]. To assess the effects of sequencing technology, we obtained M. cinxia
ESTs from the Sequence Read Archive (SRA: SRR000670 and SRR000671) and cleaned and assembled them similarly to our datasets. (The original assembly by Vera et al. used SeqmanPro, a proprietary assembler from DNAStar.) After cleaning, 575,313 ESTs of average length 100 bp remained. Our assembly produced 34,921 contigs (average length 141 bp), and 27,468 singletons (average length 81 bp). The fact that this assembly size is different from that produced by Vera et al. indicates that reanalysis of data may be important as new bioinformatics tools and assemblers become available.
Comparison between the above M. cinxia assembly and that for P. zelicaon or E. propertius is complicated by multiple factors. First, these are different species with different patterns of diversity and expression. Second, even though the number of cleaned ESTs is similar, the shorter read lengths for M. cinxia ESTs provide less total sequence data, affecting the number of contigs obtained. Nevertheless, the similar aspects of these datasets (including that they were all sourced from several individuals) does suggest that longer read lengths can improve assembly quality.