We have used massively parallel pyrosequencing to characterize the transcriptome of Zoarces viviparus. However, sequencing and assembly of a eukaryotic transcriptome without any genomic data a priori is a notoriously difficult undertaking [36]. Based on 400,000 reads we assembled ~50,000 putative transcripts, a strikingly high number compared with the number of genes found in the five sequenced fish genomes (20,000–25,000). One obvious reason for the high number of transcripts is the short length of the reads, which may result in several assembled contigs and singlets for each gene. Another important factor is alternative splicing, in which a single gene can give rise to several transcripts by combining and removing different parts (usually exons). This mechanism is common in higher eukaryotes [37], including fish [38], and the number of expressed transcripts is therefore substantially higher than the number of genes in the genome.
Another issue is the limited read length of massively parallel pyrosequencing. Even though the reads are substantially longer than those of other high-throughput sequencing techniques, such as SOLiD and Illumina sequencing [29], the difference compared with Sanger-based sequencing is notable. In this study, we used a FLX Gene Sequencer, which generates reads of 250 bases, only a fourth of the 1000-base sequences that can be produced by Sanger sequencing [36]. The short reads make the assembly more difficult and the resulting contigs shorter. However, the average contig length (~400 bases) from our assembly is slightly longer than in previous studies [27]. This difference is likely due to the single tissue targeted in this study, but also to the advanced processing pipeline, in which genomically uninteresting regions, such as repeats, are removed.
For transcriptome sequencing, massively parallel pyrosequencing exhibits a high sensitivity compared with traditional Sanger-based approaches [33]. The large number of generated reads allows detection of low-abundance transcripts, and the cDNA cloning step, in which mRNA molecules are inserted into bacterial vectors, is no longer required [30]. Transcripts that were previously hard to sequence can therefore be detected, and previous studies have reported a considerable number of reads aligning to genomic regions not previously annotated [32]. Among our assembled transcripts, ~19,000 could be aligned to at least three of the five available fish genomes, and 3% of these matched regions without any prior annotation. The expression levels of these unknown transcripts were at least as high as those of transcripts aligning to annotated regions, and they aligned to regions that are evolutionarily conserved between distant species. Hence, these transcripts are likely to represent novel transcribed elements, such as not yet described fish-specific genes.
Massively parallel pyrosequencing can currently only process DNA stretches of limited length. A fragmentation step is therefore applied and, as a consequence, the reading direction, traditionally indicated by poly-A tail primers, is lost. The assembly pipeline thus needs to be flexible and build contigs based on reads from both directions. Nevertheless, the direction of transcription will still be unknown for the resulting contigs. For BLAST-based sequence alignments, knowledge of the direction of transcription is not necessary, since the comparison can easily be done for both strands. However, when constructing gene expression microarrays, the direction of transcription is crucial for proper probe design, both for placing the probe on the correct strand and for evaluating probe specificity. Unfortunately, estimating the direction of transcription is intricate for short stretches of mRNA, especially for sequences without a satisfactory annotation. Even though we complemented our BLAST-based annotations with estimates from FrameFinder, an application used to identify open reading frames, a substantial part of our transcripts lacks a reliable estimate of the direction of transcription. These transcripts may therefore be represented by probes in both directions in future gene expression assays, which circumvents the problem at the cost of space on the microarray. Finally, it is worth mentioning that the gene expression levels on our evaluation microarray are not good predictors of the direction of transcription. Only ~60% of the well-annotated transcripts have probes with a higher expression level in the correct direction than in the erroneous direction (Additional file 3). This result is, however, far from surprising, since the probes for the two directions will have different thermodynamic properties and specificity and therefore different intensity levels [39].
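The strand problem described above can be illustrated with a minimal sketch. It assumes a deliberately simple heuristic: compare the longest stop-free reading frame on each strand, and fall back to probes in both orientations when the difference is small. This is not FrameFinder's algorithm, and the function names and the `min_margin` threshold are hypothetical choices made only for illustration.

```python
# Illustrative sketch only: a naive longest-ORF heuristic for guessing the
# transcribed strand of a contig. It is NOT the method used by FrameFinder.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_length(seq: str) -> int:
    """Longest run of stop-free codons (in codons) over the three frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 1
                best = max(best, run)
    return best


def guess_strand(contig: str, min_margin: int = 10) -> str:
    """Return '+', '-' or '?' for the likely coding strand.

    min_margin (in codons) is an arbitrary, hypothetical threshold.
    """
    forward = longest_orf_length(contig)
    reverse = longest_orf_length(reverse_complement(contig))
    if forward - reverse >= min_margin:
        return "+"
    if reverse - forward >= min_margin:
        return "-"
    return "?"


def probe_templates(contig: str) -> list[str]:
    """Sequences to design probes from; both strands when the call is '?'."""
    strand = guess_strand(contig)
    if strand == "+":
        return [contig]
    if strand == "-":
        return [reverse_complement(contig)]
    return [contig, reverse_complement(contig)]
```

For a contig with an ambiguous call, `probe_templates` returns both orientations, which mirrors the trade-off noted in the text: the strand question is sidestepped at the cost of two probe sets per transcript.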
The correlation between the developed eelpout microarray and the number of reads from the massively parallel pyrosequencing was estimated at 62%. This number is strikingly high considering both the normalization of the cDNA library prior to sequencing and the non-linearity typically exhibited by microarrays. Previous studies comparing hybridization-based and high-throughput sequencing-based gene expression measurement techniques have reported correlations of 46%–72% (46–62% [40]; 72–75% [41]). In contrast to these studies, which used established, previously evaluated microarrays from well-sequenced species, the eelpout microarray was designed from scratch based on less than half of the transcriptome. Thus, the correlation shows that there is good concordance between the developed eelpout microarray and the transcriptome sequence data, which validates both the assembled transcripts and the corresponding oligonucleotide probes.
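A comparison of this kind can be sketched as follows. The sketch assumes a Pearson correlation between log-transformed read counts and log-transformed probe intensities; the text does not state which correlation measure or transformation was used, so both are assumptions, and the function names are hypothetical.

```python
# Illustrative sketch only: the correlation measure (Pearson) and the log2
# transformations are assumptions, not taken from the study.
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def read_array_correlation(read_counts: list[int],
                           intensities: list[float]) -> float:
    """Correlate per-transcript read counts with microarray intensities.

    A pseudocount of 1 is added to read counts so that transcripts with
    zero reads can be log-transformed; intensities must be positive.
    """
    log_reads = [math.log2(c + 1) for c in read_counts]
    log_intensities = [math.log2(v) for v in intensities]
    return pearson(log_reads, log_intensities)
```

Working on a log scale is a common way to dampen the influence of a few highly expressed transcripts, but a rank-based measure such as Spearman correlation would be an equally reasonable choice given the non-linearity of microarray signals.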
The present study demonstrates the relative ease with which de novo transcriptome sequencing can be done today. The performance of modern high-throughput sequencers is high enough that a single run suffices to cover a substantial part of the transcriptome of a higher eukaryote. As a consequence, the assembled contigs become long enough to design reliable microarray probes for large-scale gene expression analysis. This development will undoubtedly spur genome-based research in many biological disciplines, especially in fields where a wide range of wildlife species are studied [42]. One prime example is ecology, where tools such as gene expression assays are needed to efficiently study the low-level mechanisms of ecosystems, while related techniques such as single nucleotide polymorphism (SNP) analysis can be used to unravel parts of the complex interplay between genetic variation and fitness in wild populations [43].
The Zoarces viviparus microarray developed in this study will constitute a valuable resource for marine environmental research in Northern Europe. The platform will allow molecular-level responses caused by exposure to various toxicants to be correlated with the unique morphological and physiological characters of Z. viviparus. In particular, the reproductive output, in the form of size, number and quality of the embryos, can be related to gene expression changes on an individual basis. The stationary behavior of this fish species, in combination with the sensitivity of gene expression biomarkers, will form a unique instrument for accurate and robust monitoring of aquatic ecosystems. The generated sequence data for the Z. viviparus transcriptome will also form a basis for studying molecular and physiological adaptations to viviparity in fish.