Nucleic Acids Res. 2010 September; 38(17): e171.
Published online 2010 August 3. doi:  10.1093/nar/gkq667
PMCID: PMC2943614

A practical, bioinformatic workflow system for large data sets generated by next-generation sequencing


Transcriptomics (at the level of single cells, tissues and/or whole organisms) underpins many fields of biomedical science, from understanding basic cellular function in model organisms, to the elucidation of the biological events that govern the development and progression of human diseases, and the exploration of the mechanisms of survival, drug-resistance and virulence of pathogens. Next-generation sequencing (NGS) technologies are contributing to a massive expansion of transcriptomics in all fields and are reducing the cost, time and performance barriers presented by conventional approaches. However, bioinformatic tools for the analysis of the sequence data sets produced by these technologies can be daunting to researchers with limited or no expertise in bioinformatics. Here, we constructed a semi-automated, bioinformatic workflow system, and critically evaluated it for the analysis and annotation of large-scale sequence data sets generated by NGS. We demonstrated its utility for the exploration of differences in the transcriptomes among various stages and both sexes of an economically important parasitic worm (Oesophagostomum dentatum) as well as the prediction and prioritization of essential molecules (including GTPases, protein kinases and phosphatases) as novel drug target candidates. This workflow system provides a practical tool for the assembly, annotation and analysis of NGS data sets, including for researchers with limited bioinformatic expertise. The custom-written Perl, Python and Unix shell computer scripts used can be readily modified or adapted to suit many different applications. This system is now utilized routinely for the analysis of data sets from pathogens of major socio-economic importance and can, in principle, be applied to transcriptomics data sets from any organism.


Transcriptomics is the molecular science of examining, simultaneously, the transcription of all genes at the level of the cell, tissue and/or whole organism, allowing inferences regarding cellular functions and mechanisms. The ability to measure the transcription of thousands of genes simultaneously has led to major advances in all biomedical fields, from understanding the basic function in model organisms, such as the free-living nematode Caenorhabditis elegans (1–3) or the vinegar fly, Drosophila melanogaster (4–6), to studying molecular events associated with the development and progression of human diseases, including cancer (7–9) and neurodegenerative disorders (10–12), to the exploration of the mechanisms of survival, drug-resistance and virulence/pathogenicity of bacteria (13,14) and other socioeconomically important pathogens, such as parasites (15–20). For more than a decade, transcriptomes have been determined by sequencing expressed sequence tags (ESTs) using the conventional Sanger method (21,22), whereas levels of transcription have been established quantitatively or semi-quantitatively by real-time polymerase chain reaction (PCR) (23) and/or cDNA microarrays (24). The use of these technologies has been accompanied by an increasing demand for analytical tools for the efficient annotation of nucleotide sequence data sets, particularly within the framework of large-scale EST projects (25). With a substantial expansion of EST sequencing has come the development of algorithms for sequence assembly, analysis and annotation, in the form of individual programs (26–28) and integrated pipelines (29,30), some of which have been made available on the worldwide web (29,31,32). However, the cost and time associated with large-scale sequencing using a conventional (Sanger) method and/or the design of customized analytical tools (e.g. cDNA microarray) have driven the search for alternative methods for transcriptomic studies (33).

In the last few years, there has been a massive expansion in the demand for and access to low-cost, high-throughput sequencing, attributable mainly to the development of next-generation sequencing (NGS) technologies, which allow massively parallelized sequencing of millions of nucleic acids (33,34). These sequencing platforms, such as 454/Roche (35) and Illumina/Solexa (36), have transformed transcriptomics by decreasing the cost, time and performance limitations presented by previous approaches. This situation has resulted in an explosion of the number of EST sequences deposited in databases worldwide, the majority of which are still awaiting detailed functional annotation. However, the high-throughput analysis of such large data sets has necessitated significant advances in computing capacity and performance, and in the availability of bioinformatic tools to distil biologically meaningful information from raw sequence data.

Sequences generated by NGS are significantly shorter (454/Roche: ~400 bases; Illumina/ABI-SOLiD: ~60 bases) than those determined by Sanger sequencing (0.8–1 kb), which poses a challenge for assembly. In addition, the data files generated by these technologies are often gigabytes to terabytes (1 × 10^9 to 1 × 10^12 bytes) in size, substantially increasing the demands placed on data transfer and storage, such that many web-based interfaces are not suited for large-scale analyses. The bioinformatic processing of large data sets usually requires access to powerful computers and support from bioinformaticians with significant expertise in a range of programming languages (e.g. Perl and Python). This situation has limited the accessibility of high-throughput sequencing technologies to some (smaller) research groups, and has thus restricted somewhat the ‘democratization’ of large-scale genomic and/or transcriptomic sequencing. Clearly, user-friendly and flexible bioinformatic pipelines are needed to assist researchers from different disciplines and backgrounds in accessing and taking full advantage of the advances heralded by NGS. Increasing the accessibility to high-throughput sequencing will have major benefits in a range of areas, including the investigation of pathogens. The exploration of the transcriptomes of pathogens has major implications in improving our understanding of their development and reproduction, survival in and interactions with the host, virulence, pathogenicity, the diseases that they cause and drug resistance (17–20,37–39), and has the potential to pave the way to novel approaches for treatment, diagnosis and control. 
In the present study, we (i) constructed a semi-automated, bioinformatic workflow system for the analysis and annotation of large-scale sequence data sets generated by NGS, (ii) demonstrated its utility by profiling differences in the transcriptome of an economically important parasite, Oesophagostomum dentatum (Strongylida), throughout its development, and (iii) indicated the broader applicability of this system to different types of transcriptomic data sets.


Sequence data sets

For this study, original cDNA sequence data sets representing four distinct developmental stages of O. dentatum [i.e. third-stage (L3) and fourth-stage (L4) larvae as well as adult female and male worms] were produced and stored as described previously (40). Total RNA (10 µg) from each stage and/or sex was used to construct a normalized cDNA library; each library was sequenced using a Genome Sequencer™ (GS) Titanium FLX (Roche Diagnostics) as described previously (18). FASTA files and associated files, with short-read sequence quality scores of each data set, were extracted from each SFF file; sequence adaptors were clipped using the ‘sff_extract’ software.

Bioinformatic components for the construction of the workflow system

Five components (1–5), documented in a series of peer-reviewed, international publications, were selected based on the parameters of general applicability, ease of use, versatility and efficiency. Once constructed, the workflow system was applied to the analysis of the O. dentatum data sets.


The Contig Assembly Program (CAP3 v.3; 31) was used to cluster sequences (with quality scores) into contigs and singletons from individual or combined (i.e. pooled) data sets, employing a minimum sequence overlap of 40 nucleotides and an identity threshold of 90%. This program was selected to enable the assembly of relatively long sequences and to remove redundant short-reads (41).
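The assembly thresholds stated above can be made explicit as a command line. The following is a minimal sketch, not the authors' script: it assumes a `cap3` executable on the PATH, a hypothetical input file `pooled_reads.fasta`, and that the overlap-length and percent-identity cut-offs are passed with CAP3's `-o` and `-p` options (CAP3 is understood to pick up quality scores from a file named after the input with a `.qual` suffix). The command is only built and printed here, not executed.

```python
import shlex


def build_cap3_command(fasta_path, min_overlap=40, min_identity=90):
    """Return the argument list for a CAP3 run with the stated thresholds.

    The defaults mirror the values given in the text (overlap of 40
    nucleotides, identity of 90%). The lower bounds checked below are the
    ones CAP3 is understood to enforce for these two options.
    """
    if min_overlap <= 15:
        raise ValueError("overlap length cut-off must be greater than 15")
    if min_identity <= 65:
        raise ValueError("percent identity cut-off must be greater than 65")
    return ["cap3", fasta_path, "-o", str(min_overlap), "-p", str(min_identity)]


# Hypothetical file name, used for illustration only.
command = build_cap3_command("pooled_reads.fasta")
print(shlex.join(command))
```

Keeping the thresholds as named parameters makes it straightforward to record, alongside each assembly, which settings produced it.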

Similarity searching

BLASTn and BLASTx algorithms (42) were used to compare contigs and singletons with sequences available in public databases [i.e. NCBI and the EMBL-EBI Parasite Genome Blast Server; April 2010], to identify putative homologues in a range of other organisms (cut-off: <1E-05). For nematodes, WormBase (release WS200) was interrogated extensively for relevant information on C. elegans orthologues/homologues, including transcriptomic, proteomic, RNA interference (RNAi) phenotypic and interactomic data.
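As an illustration of how an E-value cut-off of this kind can be applied, the sketch below filters BLAST results and keeps the best hit per query. It assumes the common 12-column tabular BLAST output, in which the E-value is the eleventh field; the sequence identifiers are invented for the example, and this is not the code used in the study.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue)} for the lowest-E-value hit per query.

    Only hits with an E-value strictly below max_evalue are considered,
    matching a cut-off written as '<1E-05'.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Invented example rows (12 tab-separated columns each).
example = "\n".join([
    "contig1\tCE_geneA\t85.0\t200\t30\t0\t1\t200\t1\t200\t2e-30\t150",
    "contig1\tCE_geneB\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-06\t60",
    "contig2\tCE_geneC\t60.0\t50\t20\t0\t1\t50\t1\t50\t0.5\t20",
])
print(best_hits(example))
```

In this toy input, `contig2` has no hit below the cut-off and is therefore absent from the result.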

Prediction and annotation of peptides

The program ESTScan (32) was used to conceptually translate peptides from assembled contigs and singletons. InterProScan (27) and gene ontology (GO; 43) were used to classify peptides (based on their putative function/s). Biological pathways were inferred from C. elegans for each peptide using the KEGG Orthology-Based Annotation System software (KOBAS; 44) and displayed using the iPath tool (45).

In silico subtraction

A BLASTn algorithm, employing a stringent cut-off (cut-off: <1E-15; 17), was used to examine differential transcription between data sets by subtraction in silico. Peptides corresponding to transcripts that were unique to a particular data set were assigned parental (i.e. level 1) InterPro terms and compared, using a BLASTp algorithm (cut-off: <1E-15), with peptides inferred from the assembly of sequences from combined data sets. The subtraction approach allows qualitative (not quantitative) differences between or among samples to be established.
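The logic of the subtraction step can be sketched as a set difference: a sequence is treated as unique to a data set if it has no BLASTn match below the cut-off in any other data set. The function and identifiers below are hypothetical and serve only to illustrate the principle under that assumption; the actual scripts may differ.

```python
def unique_to_data_set(sequence_ids, hits_against_others, max_evalue=1e-15):
    """Return the IDs with no hit below max_evalue in the other data sets.

    sequence_ids: all sequence identifiers in the data set of interest.
    hits_against_others: iterable of (query_id, evalue) pairs from BLASTn
        searches of those sequences against the remaining data sets.
    """
    matched = {query for query, evalue in hits_against_others if evalue < max_evalue}
    return sorted(set(sequence_ids) - matched)


# Invented identifiers and E-values for illustration.
female_ids = ["f1", "f2", "f3"]
hits = [("f1", 1e-40), ("f2", 1e-10)]  # f2's hit is too weak to count
print(unique_to_data_set(female_ids, hits))
```

Because membership is decided by presence or absence of a match, the result is qualitative, as the text notes; it carries no information about transcript abundance.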

Probabilistic functional networking of protein-encoding genes, and drug target prediction

Interaction networks among C. elegans orthologues of differentially transcribed molecules were inferred using an established approach (46). The druggability of C. elegans homologues of molecules unique to a particular O. dentatum data set or common to all data sets was inferred using a published method (18). Briefly, the InterPro domains of predicted proteins were compared with those linked to known, small-molecule drugs, which follow the ‘Lipinski rule of 5’ regarding bioavailability (47,48). GO terms were mapped to Enzyme Commission (EC) numbers, and a list of enzyme-targeting drugs was compiled based on data available in the BRENDA database (49,50). The C. elegans orthologues/homologues included in this list were ranked according to the ‘severity’ of non-wild-type RNAi phenotypes (including lethality or sterility of different developmental stages; WormBase release WS200).


A semi-automated bioinformatic workflow system (Figure 1), incorporating five key bioinformatic components, was constructed and linked using customized Perl, Python and Unix shell computer scripts (listed in Supplementary File S1 and accessible via the web). This system was then assessed for the assembly, analysis and functional annotation of each or all of the four sequence data sets for O. dentatum. The specificity of the in silico subtraction step was verified using independent experimental evidence.

Figure 1.
Bioinformatic analyses of the Oesophagostomum dentatum data sets. Stars indicate analyses performed using custom-written Perl, Python and/or Unix shell computer scripts, accessible via [1] Individual ...

Assembly and detailed annotation and analyses of the O. dentatum data sets

A total of 1 826 367 sequences (244 ± 32 bases; i.e. mean length ± standard deviation) were determined for L3, L4 as well as adult female and male of O. dentatum. Following the clipping of adapter sequences, only sequences of >100 bases (n = 1 800 874; 98.6%) were included in further analyses. The numbers of contigs assembled for each of the four data sets are listed in Table 1. The assembly of the sequences of all four data sets yielded 36 233 contigs (516 ± 316 bases in length) and 452 528 singletons (Table 1); sequences (n = 115) with similarity (cut-off: <1E-15) to potential host molecules were excluded. The L3 data set had the largest number of sequence clusters with orthologues/homologues in C. elegans (n = 32 904; Table 1) and in organisms other than nematodes (n = 14 731; Table 1), whereas the L4 data set included the largest number of clusters with orthologues/homologues in other parasitic nematodes (n = 38 634; Table 1).
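The length filter and the retained fraction reported above can be checked with a short sketch. The FASTA parser and the toy records are illustrative assumptions, not the study's code; only the two read counts are taken from the text.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier up to first whitespace
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def filter_by_length(records, min_length=100):
    """Keep only sequences strictly longer than min_length bases."""
    return {name: seq for name, seq in records.items() if len(seq) > min_length}


# Toy input: one read of exactly 100 bases (dropped) and one of 101 (kept).
toy = ">read_a\n" + "A" * 100 + "\n>read_b\n" + "C" * 101 + "\n"
print(sorted(filter_by_length(read_fasta(toy))))

# Retained fraction using the counts reported in the text.
total_reads, retained_reads = 1_826_367, 1_800_874
print(f"{100 * retained_reads / total_reads:.1f}%")  # 98.6%
```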

Table 1.
Summary of the nucleotide sequence data for the adult female, adult male, and third (L3) and fourth (L4) larval stages of Oesophagostomum dentatum prior to and following in silico subtraction as well as detailed bioinformatic annotation and analyses

Of the four assembled data sets, the L3 set included the largest number of sequence clusters with predicted open reading frames (ORFs; n = 57 818; Table 1), of which 27 297 (47.2%) could be annotated functionally using InterPro terms and 12 763 (22.1%) could be assigned GO terms, including 19 705 ‘biological process’, 10 926 ‘cellular component’ and 34 904 ‘molecular function’. The numbers of peptides inferred from sequence clusters in the adult female, adult male and/or L4 data sets, which could be assigned InterPro and/or GO terms, are given in Table 1. In total, 85 395 peptides were predicted for all sequences from all four data sets, representing 17.5% of clusters (Table 1); 56 940 (66.7%) of them could be mapped to known proteins defined by 31 982 different domains, the most represented being ‘SCP-like extracellular’ (IPR014044; 1.2% of the peptides mapping to a conserved protein motif), ‘NAD(P)-binding’ (IPR016040; 1.1%) and ‘proteinase inhibitor I2, Kunitz metazoa’ (IPR002223; 1%) (Table 2). GO annotation allowed 56 940 (66.7%) inferred proteins to be assigned to 19 346 ‘biological process’, 11 007 ‘cellular component’ and 35 182 ‘molecular function’ terms (Table 1). The predominant terms were ‘metabolic process’ (GO:0008152; 10.9%), ‘proteolysis’ (GO:0006508; 7%) and ‘translation’ (GO:0006412; 5.4%) for ‘biological process’; ‘intracellular’ (GO:0005622; 17.5%), ‘membrane’ (GO:0016020; 15.6%) and ‘nucleus’ (GO:0005634; 11.6%) for ‘cellular component’; and ‘ATP binding’ (GO:0005524; 7.5%), ‘catalytic activity’ (GO:0003824; 7%) and ‘binding’ (GO:0005488; 4.6%) for ‘molecular function’ (Table 3). Proteins inferred from the combined assembly were predicted to be involved in 262 different biological pathways, defined by 64 unique KEGG terms, of which ‘peptidases’ (12%), ‘other enzymes’ (8%) and ‘antigen processing and presentation’ (5.5%) were predominant (see Supplementary File S2). 
A display of biological pathways, defined by KEGG terms, inferred from predicted peptides and mapped to the complement of known pathways in C. elegans, is shown in Supplementary Figure S1.

Table 2.
The 20 most represented (InterPro) protein domains inferred from peptides conceptually translated from individual contigs for Oesophagostomum dentatum [combined assembly of data for adult female, adult male, and the third (L3) and fourth (L4) larval stages] ...
Table 3.
Functions predicted for proteins encoded in the transcriptome of Oesophagostomum dentatum (combined assembly), based on gene ontology (GO)

Using BLASTn algorithms, subsets of 3451, 10 344, 14 380 and 7520 nucleotide sequences were identified as being uniquely transcribed in adult female, adult male, L3 and L4, respectively (Table 1). The accuracy of the in silico subtraction process was verified using independent evidence from a previous analysis of differential transcription between adult females and males of O. dentatum using a microarray-based approach (51). This verification showed that all 220 female- and 171 male-enriched molecules characterized previously (51; GenBank accession numbers AM157797-AM158083) were contained exclusively within the female and male data sets, respectively, following in silico subtraction (data available upon request). Based on these findings, the specificity of the subtraction process, calculated using the Wilson score (52) at a confidence interval of 95%, ranged from 98% to 100%. Of the 139 parental functional domains assigned to predicted peptides unique to the adult female data set, ‘chitin-binding protein, peritrophin-A’ (IPR002557; 8.6%) and ‘basic-leucine zipper (bZIP) transcription factor’ (IPR004827; 4.8%) were highly represented. Of the 243 protein motifs identified amongst the predicted peptides that were unique to the adult male data set, ‘PapD-like’ (IPR008962; 4%) and ‘major-sperm protein’ (IPR000535; 3.7%) were most represented. For the L3 data set, 220 unique protein motifs were identified, of which ‘RmlC-like jelly roll fold’ (IPR014710; 4.5%) and ‘six-bladed beta-propeller’ (IPR011042; 2.7%) had the highest representation. In contrast, of the 249 protein motifs unique to the L4 data set, ‘peptidase M24, methionine aminopeptidase’ (IPR001714; 2.2%) and ‘FAD-binding’ (IPR016166; 1.3%) were the predominant domains (Table 2). The number of ‘biological process’, ‘cellular component’ and ‘molecular function’ terms assigned to peptides unique to each of the individually assembled data sets is given in Table 1. 
The KOBAS analysis assigned 7, 16, 18 and 23 KEGG terms to inferred peptides exclusive to the adult female, adult male, L3 and L4 data sets, respectively; of the 23 KEGG terms assigned to L4, 20 could be mapped to known pathways in C. elegans (Supplementary Figure S2).
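The 98% to 100% range quoted above for the specificity of the subtraction step is consistent with a Wilson score interval computed for 220 of 220 and 171 of 171 correctly assigned molecules. The sketch below implements the standard Wilson formula; the assumption that the interval was computed per sex on these counts is ours, since the text does not spell out the calculation.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Return (lower, upper) bounds of the Wilson score interval.

    z = 1.959964 corresponds to a two-sided 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


for label, count in (("female", 220), ("male", 171)):
    lower, upper = wilson_interval(count, count)
    print(f"{label}: {lower:.3f} to {upper:.3f}")
```

When every observation is a success, the lower bound reduces to n / (n + z²), which gives roughly 0.983 for n = 220 and 0.978 for n = 171, that is, about 98% in both cases.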

Probabilistic genetic interaction networking predicted 215 C. elegans orthologues, representing sequence clusters unique to the adult female of O. dentatum, to interact directly with a total of 1729 other genes (range: 1–277), including some (e.g. lin-12, mom-5, glp-1, ppk-1, tbx-2 and rnr-1; Supplementary Figure S3 and Supplementary File S3) that are essential to embryogenesis and reproduction. The 373 C. elegans orthologues of sequence clusters unique to the adult male of O. dentatum were predicted to interact directly with a total of 1710 other genes (range: 1–117; Supplementary File S3). Amongst them were genes involved in sperm development (i.e. ima-3) and motility (i.e. act-2) (Supplementary Figure S3 and Supplementary File S3). A total of 387 and 323 C. elegans orthologues of L3- and L4-unique molecules, respectively, were predicted to interact with 790 (range: 1–122; Supplementary File S3) and 1058 (range: 1–59; Supplementary File S3) other genes, respectively, including some involved in embryonic and/or larval viability (i.e. scc-1, tba-4, cct-3, pfd-3 and mcm-4) and larval development (i.e. let-711) (Supplementary Figure S3 and Supplementary File S3).

The 2397 predicted peptides unique to the adult female of O. dentatum had significant homology (cut-off: <1E-05) to 261 C. elegans orthologues/homologues (data not shown), of which 151 were associated with EC numbers linked to ‘druggable’ enzymes and/or InterPro domains (Table 4); of these, 92 were associated with non-wild-type RNAi phenotypes, including adult lethality (n = 3), embryonic and/or larval lethality (n = 44) and/or adult sterility (n = 65). Of the 541 C. elegans homologues of the 7117 predicted peptides unique to the adult male of O. dentatum, 375 were associated with EC numbers linked to ‘druggable’ enzymes and/or InterPro domains (Table 4). Of these, 205 were associated with the RNAi phenotypes ‘embryonic and/or larval lethality’ and 196 with ‘sterility’ (Table 4). Of the 565 unique C. elegans homologues of predicted peptides unique to the L3 of O. dentatum, 344 were associated with EC numbers linked to ‘druggable’ enzymes and/or InterPro domains (Table 4); 121 of these were linked to the RNAi phenotypes ‘embryonic and/or larval lethality’ and 165 to ‘sterility’ (Table 4). Amongst the 416 C. elegans homologues of predicted peptides unique to the L4 stage of O. dentatum, 283 could be associated with EC numbers linked to ‘druggable’ enzymes and/or InterPro domains (Table 4). Sixty-three of these homologues were associated with the RNAi phenotypes ‘embryonic and/or larval lethality’ and 72 with ‘sterility’ (Table 4). Examples of ‘druggable’ molecules unique to each of the data sets, together with examples of effective BRENDA compounds, are given in Table 4 and Supplementary Figure S4; the complete lists, together with the list of ‘druggable’ molecules common between two or among more data sets, are available from the primary author upon request.

Table 4.
Examples of C. elegans orthologues of contigs unique to each Oesophagostomum dentatum adult female, adult male and the third (L3) and fourth (L4) larval stages, following in silico subtraction, ranked according to the ‘severity’ of the ...


Technical considerations

We demonstrated the utility of an integrated bioinformatic workflow system for the analysis and annotation of large sequence data sets produced by NGS. This system is considered useful for researchers with basic expertise in computer programming but without the means for developing bioinformatic pipelines or purchasing expensive soft- or hardware packages. The system constructed here was appraised according to: (i) computational time required to perform the analyses, (ii) ease of use, (iii) compatibility with different computer operating systems, (iv) ability to focus the analyses on answering relevant biological questions and (v) general applicability.

The majority of the software incorporated in the bioinformatic workflow was derived from existing application tools (e.g. CAP3; maximum length of 50 kb) available as web-based interfaces, and originally designed for the analysis and annotation of a relatively small number of sequences. These applications were adapted here to face the challenges presented by the need to analyse large sequence data sets in a time-efficient manner. Indeed, the original sequence data sets described herein, which included a total of ~2 million sequences (244 ± 32 bases), could be analysed and annotated using a 2 CPU Linux computer with 8 processor cores, within ~2000 computing hours corresponding to ~240 man-hours (one computing hour = 1 hour of computing time on one processor core). Based on our experience, the same analyses, conducted using web-based interfaces, require several months to complete. However, an advantage of web-based software tools with extensive graphical interfaces is that no knowledge of computing and/or programming is required (29). The process of developing, trouble-shooting, maintaining and updating scripts can be involved, laborious and time-consuming. On the other hand, the use of a command line (which consists of a series of standardized commands) to execute pre-existing scripts, such as the Perl, Python and Unix shell scripts that have been written and made available here, overcomes this limitation. Furthermore, although these scripts have been written and optimized using the Linux operating system, the output files (generated in the form of text or tab-delimited files) can be readily viewed, analysed and modified in a range of different operating systems, such as Microsoft Windows and Mac OS, and are thus broadly applicable.

A key goal for scientists focusing on the analyses of large NGS data sets is to distil, from large amounts of raw data, biologically meaningful information about the organism under investigation. For example, some pathogens, such as parasitic worms, have complex life cycles and thus represent a challenging group of organisms for genomic and transcriptomic studies, because different life stages can express various sets of genes which are involved in development, reproduction, host–parasite interactions and/or disease (17,37–39). Understanding these aspects should have important implications for finding new ways of disrupting biological processes and pathways, and thus could facilitate the prediction and prioritization of new drug and/or vaccine targets. In addition, compared with the free-living nematode C. elegans, there is a paucity of knowledge on the fundamental molecular biology of parasitic worms (17,39,53). However, extensive information is available on the functions of C. elegans genes through the use of gene silencing and/or transgenesis. This knowledge, together with the results of comparative analyses of genetic data sets, revealed that parasitic nematodes usually share ~50–70% of genes with C. elegans (54,55), indicating the utility of this free-living nematode as a model to explore molecular aspects of development, survival and reproduction in some parasitic nematodes (18,38,51,56,57).

Biological interpretations from the annotated data set

The bioinformatic workflow system constructed here was utilized to explore differential transcription in O. dentatum. Several reports indicate that this nematode provides a unique model system for studying fundamental aspects of the molecular biology of gastrointestinal strongylid nematodes (58). The in silico subtraction approach identified 139 and 243 protein motifs specific to the adult female and male of O. dentatum, respectively. Most of these molecules could be linked, using KOBAS analyses and genetic interaction networking, to pathways associated with reproductive processes. For instance, a large number of female-specific molecules encoded proteins containing a ‘chitin-binding protein, peritrophin A’ domain (i.e. n = 18; Table 2). This domain was also found to be highly represented amongst the molecules enriched in the female of the pig roundworm, Ascaris suum (59). These proteins are hypothesized to have crucial roles in pathways linked to developmental and reproductive processes, based on the knowledge that the corresponding C. elegans homologues (containing one or more peritrophin-A domains) CPG-1/CEJ-1 and CPG-2 are essential for the synthesis of the eggshell as well as for early embryonic development (60). The production and maturation of oocytes has also been shown, in C. elegans, to be regulated by nematode-specific bipartite signalling molecules, the major-sperm proteins (MSPs) (61,62). Numerous sequences unique to the adult male of O. dentatum represented MSPs (n = 15; cf. Table 2), in accordance with previous studies of male-enriched data sets of other species of strongylid nematodes, including Trichostrongylus vitrinus (63), Haemonchus contortus (38), as well as the filarioid Brugia malayi (64–66), and A. suum (59). Based on the observation that MSPs from various nematodes, including C. elegans, are characterized by a significant amino acid sequence conservation (i.e. ~64%) (67), a similar role has been proposed for these proteins in processes linked to the maturation of oocytes in the uterus of female nematodes (61,62).

In addition to molecules unique to the adult female and male of O. dentatum, the predicted proteins exclusive to the larval stages of this parasite could be linked, using InterPro and/or GO classification and/or probabilistic genetic interaction networking, to biological pathways associated with larval development and/or interactions with the vertebrate host (see Table 2). For example, a large number of molecules unique to the L4 stage (n = 10) were inferred to represent proteases. In parasitic nematodes, proteases have been proposed to facilitate the survival of the parasite by mediating, for instance, tissue penetration, feeding and/or immune evasion (68–70). Indeed, O. dentatum L4s are known to evoke immunological reactions that result in the encapsulation of the larvae in nodules with aggregations of neutrophils and eosinophils (58,71). In addition, somatic extracts of and supernatants from in vitro maintenance cultures of O. dentatum L4s have been shown to induce the proliferation of porcine mononuclear cells in vitro (72). These observations suggest an active role for L4-specific proteases in the modulation of the host’s immune response, which (as proposed for other biological systems) could consist of: (i) the direct digestion of antibodies (68); (ii) cleavage of cell-surface receptors for cytokines (73) and/or (iii) direct lysis of immune cells (74). In parasitic nematodes, other molecules have been proposed to play immuno-modulatory roles during the invasion of the host, the migration through tissues as well as feeding. Amongst them, proteins containing a ‘sperm-coating protein (SCP)-like extracellular domain’ (InterPro: IPR014044), also called SCP/Tpx-1/Ag5/PR-1/Sc7 (SCP/TAPS; Pfam accession no. PF00188), were highly represented in the transcriptome of O. dentatum (see Table 2). 
Members of the SCP/TAPS protein family have been identified in various eukaryotes, including plants, arthropods, snakes, mammals as well as free-living and parasitic helminths (75). These molecules have been studied mainly in the hookworms Ancylostoma caninum and Necator americanus, and are commonly referred to as Ancylostoma secreted proteins (i.e. ASPs; 75). Due to their abundance in the excretory/secretory (ES) products from serum-activated L3s (=aL3s) of A. caninum and to the high levels of mRNAs encoding ASPs in aL3s compared with non-activated, ensheathed L3s (L3s), these molecules have been hypothesized to play a major role in the transition from the free-living to the parasitic stage of this species (39,76). Other ASP homologues have been characterized for the adult stage of hookworms, and suggested to play a role in the initiation, establishment and/or maintenance of the host-parasite relationship (39,77,78). Although a male-biased transcription of ASP homologues had been reported for O. dentatum (51), results from the present study show that the transcription of SCP/TAPS molecules occurs in all developmental stages studied herein. As the sequences analysed were generated from normalized cDNA libraries, the differences in levels of transcription of genes encoding SCP/TAPS throughout the life cycle of O. dentatum could not be inferred. Future work could involve, for instance, the application of the present bioinformatic workflow tool to the analysis of data generated (e.g. by Illumina sequencing) from non-normalized cDNA libraries of O. dentatum, which would allow quantitative rather than qualitative differences in transcription to be determined for genes encoding SCP/TAPS, to assist in the study of the biological function(s) of these molecules (75). The O. dentatum-pig model could also provide a useful means of exploring the biological role/s of these molecules in the development and reproduction of this nematode as well as its interactions with the host. 
Several features of O. dentatum, including its short life-cycle, its ability to survive and grow in culture in vitro for weeks through several moults, and the possibility of rectally transplanting worms (e.g. from in vitro culture) into the host without the need for surgical intervention (58,79), offer an opportunity to experimentally test hypotheses formulated based on the interpretation of results from bioinformatic analyses. Bioinformatically guided interpretations of NGS data sets are also increasingly playing an important role in the identification of putative drug targets (80), due to the possibility of using predictive algorithms to prioritize and select sets of molecules for experimental studies both in vitro and in vivo (81–83), potentially leading to a significant reduction in the cost associated with drug discovery and development (84). For instance, in the present study, subsets of molecules without known host (pig) homologues were identified and predicted to represent targets for intervention. Amongst them, protein kinases and phosphatases were the most abundantly represented (Table 4). Previously, in O. dentatum, a catalytic subunit of a serine/threonine protein phosphatase (PP1) was characterized (Od-mpp1); gene silencing by RNAi of the corresponding C. elegans homologue resulted in a significant reduction (30–40%) in the numbers of F2-progeny produced (56). Based on these findings, it is tempting to speculate that some pathways, involving phosphatases/kinases, represent key targets for nematocidal drugs.

Concluding remarks

Here, we demonstrated, using a large test data set derived from different stages/sexes of a parasitic worm (O. dentatum), that our bioinformatic workflow system provides a practical tool for the assembly, annotation and analysis of NGS data. The custom-written Perl, Python and Unix shell computer scripts, accessible via the web, can be readily adapted to suit the requirements of researchers conducting transcriptomic studies in their particular discipline. This workflow system is now routinely used by our research group for the analysis of data sets from a range of pathogens of major socio-economic importance and has been applied more broadly to data sets representing other organisms, including mammals. Thus, this integrated system should be a user-friendly and efficient tool for biologists involved in transcriptomic studies in any field on any organism.


Supplementary data

Supplementary Data are available at NAR Online.


Funding

The Australian Research Council; Australian Academy of Science; the Australian-American Fulbright Commission (to R.B.G.); National Human Genome Research Institute and National Institutes of Health (to M.M.).

Conflict of interest statement. None declared.



Acknowledgements

Staff at WormBase are gratefully acknowledged. The Austrian Ministry for Science and Research approved the animal experimentation (BMWF-68.205/0103-II/10b/2008) and is also acknowledged. C.C. is in receipt of an International Postgraduate Research Scholarship from the Australian Government and a fee-remission scholarship from The University of Melbourne as well as the Clunies Ross (2008) and Sue Newton (2009) awards from the School of Veterinary Science of the same university.


References

1. McKay SJ, Johnsen R, Khattra J, Asano J, Baillie DL, Chan S, Dube N, Fang L, Goszczynski B, Ha E, et al. Gene expression profiling of cells, tissues, and developmental stages of the nematode C. elegans. Cold Spring Harb. Symp. Quant. Biol. 2003;68:159–169.
2. Portman DS. Profiling C. elegans gene expression with DNA microarrays. WormBook. 2006;20:1–11.
3. Golden TR, Melov S. Gene expression changes associated with aging in C. elegans. WormBook. 2007;12:1–12.
4. Stathopoulos A, Levine M. Whole-genome expression profiles identify gene batteries in Drosophila. Dev. Cell. 2002;3:464–465.
5. Gupta V, Oliver B. Drosophila microarray platforms. Brief. Funct. Genomic Proteomic. 2003;2:97–105.
6. Vibranovski MD, Lopes HF, Karr TL, Long M. Stage-specific expression profiling of Drosophila spermatogenesis suggests that meiotic sex chromosome inactivation drives genomic relocation of testis-expressed genes. PLoS Genet. 2009;5:e1000731.
7. Mizuarai S, Irie H, Schmatz DM, Kotani H. Integrated genomic and pharmacological approaches to identify synthetic lethal genes as cancer therapeutic targets. Curr. Mol. Med. 2008;8:774–783.
8. Ren S, Liu S, Howell P, Jr, Xi Y, Enkemann SA, Ju J, Riker AI. The impact of genomics in understanding human melanoma progression and metastasis. Cancer Control. 2008;15:202–215.
9. Santos ES, Blaya M, Raez LE. Gene expression profiling and non-small-cell lung cancer: where are we now? Clin. Lung Cancer. 2009;10:168–173.
10. Greene JG. Gene expression profiles of brain dopamine neurons and relevance to neuropsychiatric disease. J. Physiol. 2006;575:411–416.
11. Mufson EJ, Counts SE, Che S, Ginsberg SD. Neuronal gene expression profiling: uncovering the molecular biology of neurodegenerative disease. Prog. Brain Res. 2006;158:197–222.
12. Tanaka F, Niwa J, Ishigaki S, Katsuno M, Waza M, Yamamoto M, Doyu M, Sobue G. Gene expression profiling toward understanding of ALS pathogenesis. Ann. NY Acad. Sci. 2006;1086:1–10.
13. Chan VL. Bacterial genomes and infectious diseases. Pediatr. Res. 2003;54:1–7.
14. Jackson RW, Giddens SR. Development and application of in vivo expression technology (IVET) for analysing microbial gene expression in complex environments. Infect. Disord. Drug Targets. 2006;6:207–240.
15. Li BW, Rush AC, Mitreva M, Yin Y, Spiro D, Ghedin E, Weil GJ. Transcriptomes and pathways associated with infectivity, survival and immunogenicity in Brugia malayi L3. BMC Genomics. 2009;10:267.
16. Ranganathan S, Menon R, Gasser RB. Advanced in silico analysis of expressed sequence tag (EST) data for parasitic nematodes of major socio-economic importance–fundamental insights toward biotechnological outcomes. Biotechnol. Adv. 2009;27:439–448.
17. Cantacessi C, Campbell BE, Young ND, Jex AR, Hall RS, Presidente PJA, Zawadzki JL, Zhong W, Aleman-Meza B, Loukas A, et al. Differences in transcription between free-living and CO2-activated third-stage larvae of Haemonchus contortus. BMC Genomics. 2010;11:266.
18. Cantacessi C, Mitreva M, Jex AR, Young ND, Campbell BE, Hall RS, Doyle MA, Ralph SA, Rabelo EM, Ranganathan S, et al. Massively parallel sequencing and analysis of the Necator americanus transcriptome. PLoS Negl. Trop. Dis. 2010;4:e684.
19. Young ND, Hall RS, Jex AR, Cantacessi C, Gasser RB. Elucidating the transcriptome of Fasciola hepatica - a key to fundamental and biotechnological discoveries for a neglected parasite. Biotechnol. Adv. 2010;28:222–231.
20. Young ND, Campbell BE, Hall RS, Jex AR, Cantacessi C, Laha T, Sohn WM, Sripa B, Loukas A, Brindley PJ, et al. Unlocking the transcriptomes of the carcinogens Clonorchis sinensis and Opisthorchis viverrini. PLoS Negl. Trop. Dis. 2010;4:e719.
21. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc. Natl Acad. Sci. USA. 1977;74:5463–5467.
22. Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA, Slocombe PM, Smith M. Nucleotide sequence of bacteriophage phi X174 DNA. Nature. 1977;265:687–695.
23. Wang AM, Doyle MV, Mark DF. Quantitation of mRNA by the polymerase chain reaction. Proc. Natl Acad. Sci. USA. 1989;86:9717–9721.
24. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet. 1996;14:457–460.
25. Clifton SW, Mitreva M. Strategies for undertaking expressed sequence tag (EST) projects. Methods Mol. Biol. 2009;533:13–32.
26. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005;21:3674–3676.
27. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37:D211–D215.
28. Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat. Methods. 2009;6:S6–S12.
29. Nagaraj SH, Deshpande N, Gasser RB, Ranganathan S. ESTExplorer: an expressed sequence tag (EST) assembly and annotation platform. Nucleic Acids Res. 2007;35:W143–W147.
30. Nagaraj SH, Gasser RB, Nisbet AJ, Ranganathan S. In silico analysis of expressed sequence tags from Trichostrongylus vitrinus (Nematoda): comparison of the automated ESTExplorer workflow platform with conventional database searches. BMC Bioinf. 2008;9:S10.
31. Huang X, Madan A. CAP3: A DNA sequence assembly program. Genome Res. 1999;9:868–877.
32. Iseli C, Jongeneel CV, Bucher P. ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1999;1:138–148.
33. Morozova O, Marra MA. Applications of next-generation sequencing technologies in functional genomics. Genomics. 2008;92:255–264.
34. Metzker ML. Sequencing technologies - the next generation. Nat. Rev. Genet. 2010;11:31–46.
35. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380.
36. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59.
37. Moser JM, Freitas T, Arasu P, Gibson G. Gene expression profiles associated with the transition to parasitism in Ancylostoma caninum larvae. Mol. Biochem. Parasitol. 2005;143:39–48.
38. Campbell BE, Nagaraj SH, Hu M, Zhong W, Sternberg PW, Ong EK, Loukas A, Ranganathan S, Beveridge I, McInnes RL, et al. Gender-enriched transcripts in Haemonchus contortus–predicted functions and genetic interactions based on comparative analyses with Caenorhabditis elegans. Int. J. Parasitol. 2008;38:65–83.
39. Datu BJ, Gasser RB, Nagaraj SH, Ong EK, O'Donoghue P, McInnes R, Ranganathan S, Loukas A. Transcriptional changes in the hookworm, Ancylostoma caninum, during the transition from a free-living to a parasitic larva. PLoS Negl. Trop. Dis. 2008;2:e130.
40. Joachim A, Ruttkowski B. Cytosolic glutathione S-transferases of Oesophagostomum dentatum. Parasitology. 2008;135:1215–1223.
41. Soderlund C, Johnson E, Bomhoff M, Descour A. PAVE: program for assembling and viewing ESTs. BMC Genomics. 2009;10:400.
42. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
43. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29.
44. Wu J, Mao X, Cai T, Luo J, Wei L. KOBAS server: a web-based platform for automated annotation and pathway identification. Nucleic Acids Res. 2006;34:W720–W724.
45. Letunic I, Yamada T, Kanehisa M, Bork P. iPath: interactive exploration of biochemical pathways and networks. Trends Biochem. Sci. 2008;33:101–103.
46. Zhong W, Sternberg PW. Genome-wide prediction of C. elegans genetic interactions. Science. 2006;311:1481–1484.
47. Lipinski C, Lombardo F, Dominy B, Feeney P. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997;23:3–25.
48. Hopkins AL, Groom CR. The druggable genome. Nat. Rev. Drug Discov. 2002;1:727–730.
49. Robertson JG. Mechanistic basis of enzyme-targeted drugs. Biochemistry. 2005;44:5561–5571.
50. Chang A, Scheer M, Grote A, Schomburg I, Schomburg D. BRENDA, AMENDA and FRENDA the enzyme information system: new content and tools in 2009. Nucleic Acids Res. 2009;37:D588–D592.
51. Cottee PA, Nisbet AJ, Abs El-Osta YG, Webster TL, Gasser RB. Construction of gender-enriched cDNA archives for adult Oesophagostomum dentatum by suppressive-subtractive hybridization and a microarray analysis of expressed sequence tags. Parasitology. 2006;132:691–708.
52. Wilson EB. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 1927;22:209–212.
53. Nikolaou S, Gasser RB. Prospects for exploring molecular developmental processes in Haemonchus contortus. Int. J. Parasitol. 2006;36:859–868.
54. Blaxter ML, De Ley P, Garey JR, Liu LX, Scheldeman P, Vierstraete A, Vanfleteren JR, Mackey LY, Dorris M, Frisse LM, et al. A molecular evolutionary framework for the phylum Nematoda. Nature. 1998;392:71–75.
55. Parkinson J, Mitreva M, Whitton C, Thomson M, Daub J, Martin J, Schmid R, Hall N, Barrell B, Waterston RH, et al. A transcriptomic analysis of the phylum Nematoda. Nat. Genet. 2004;36:1259–1267.
56. Boag PR, Ren P, Newton SE, Gasser RB. Molecular characterisation of a male-specific serine/threonine phosphatase from Oesophagostomum dentatum (Nematoda: Strongylida), and functional analysis of homologues in Caenorhabditis elegans. Int. J. Parasitol. 2003;33:313–325.
57. Hu M, Zhong W, Campbell BE, Sternberg PW, Pellegrino MW, Gasser RB. Elucidating ANTs in worms using genomic and bioinformatic tools–biotechnological prospects? Biotechnol. Adv. 2010;28:49–60.
58. Gasser RB, Cottee P, Nisbet AJ, Ruttkowski B, Ranganathan S, Joachim A. Oesophagostomum dentatum: potential as a model for genomic studies of strongylid nematodes, with biotechnological prospects. Biotechnol. Adv. 2007;25:281–293.
59. Cantacessi C, Zou FC, Hall RS, Zhong W, Jex AR, Campbell BE, Ranganathan S, Sternberg PW, Zhu XQ, Gasser RB. Bioinformatic analysis of abundant, gender-enriched transcripts of adult Ascaris suum (Nematoda) using a semi-automated workflow platform. Mol. Cell. Probes. 2009;23:205–217.
60. Olson SK, Bishop JR, Yates JR, Oegema K, Esko JD. Identification of novel chondroitin proteoglycans in Caenorhabditis elegans: embryonic cell division depends on CPG-1 and CPG-2. J. Cell. Biol. 2006;173:985–994.
61. Miller MA, Nguyen VQ, Lee MH, Kosinski M, Schedl T, Caprioli RM, Greenstein D. A sperm cytoskeletal protein that signals oocyte meiotic maturation and ovulation. Science. 2001;291:2144–2147.
62. Miller MA, Ruest PJ, Kosinski M, Hanks SK, Greenstein D. An Eph receptor sperm-sensing control mechanism for oocyte meiotic maturation in Caenorhabditis elegans. Genes Dev. 2003;17:187–200.
63. Nisbet AJ, Gasser RB. Profiling of gender-specific gene expression for Trichostrongylus vitrinus (Nematoda: Strongylida) by microarray analysis of expressed sequence tag libraries constructed by suppressive-subtractive hybridisation. Int. J. Parasitol. 2004;34:633–643.
64. Li BW, Rush AC, Tan J, Weil GJ. Quantitative analysis of gender-regulated transcripts in the filarial nematode Brugia malayi by real-time RT-PCR. Mol. Biochem. Parasitol. 2004;137:329–337.
65. Li BW, Rush AC, Crosby SD, Warren WC, Williams SA, Mitreva M, Weil GJ. Profiling of gender-regulated gene transcripts in the filarial nematode Brugia malayi by cDNA oligonucleotide array analysis. Mol. Biochem. Parasitol. 2005;143:49–57.
66. Moreno Y, Geary TG. Stage- and gender-specific proteomic analysis of Brugia malayi excretory-secretory products. PLoS Negl. Trop. Dis. 2008;2:e326.
67. Cottee PA, Nisbet AJ, Boag PR, Larsen M, Gasser RB. Characterization of major sperm protein genes and their expression in Oesophagostomum dentatum (Nematoda: Strongylida). Parasitology. 2004;129:479–490.
68. Hotez PJ, Prichard DI. Hookworm infection. Sci. Am. 1995;6:42–48.
69. Williamson AL, Brindley PJ, Knox DP, Hotez PJ, Loukas A. Digestive proteases of blood-feeding nematodes. Trends Parasitol. 2003;19:417–423.
70. Bethony JM, Loukas A, Hotez PJ, Knox DP. Vaccines against blood-feeding nematodes of humans and livestock. Parasitology. 2006;133:S63–S79.
71. Stockdale PH. Necrotic enteritis of pigs caused by infection with Oesophagostomum spp. Br. Vet. J. 1970;126:526–530.
72. Freigofas R, Leibold W, Daugschies A, Joachim A, Schuberth HJ. Products of fourth-stage larvae of Oesophagostomum dentatum induce proliferation in naïve porcine mononuclear cells. J. Vet. Med. B Infect. Dis. Vet. Public Health. 2001;48:603–611.
73. Björnberg F, Lantz M, Gullberg U. Metalloproteases and serineproteases are involved in the cleavage of the two tumour necrosis factor (TNF) receptors to soluble forms in the myeloid cell lines U-937 and THP-1. Scand. J. Immunol. 1995;42:418–424.
74. Robinson BW, Venaille TJ, Mendis AH, McAleer R. Allergens as proteases: an Aspergillus fumigatus proteinase directly induces human epithelial cell detachment. J. Allergy Clin. Immunol. 1990;86:726–731.
75. Cantacessi C, Campbell BE, Visser A, Geldhof P, Nolan MJ, Nisbet AJ, Matthews JB, Loukas A, Hofmann A, Otranto D, et al. A portrait of the “SCP/TAPS” proteins of eukaryotes – developing a framework for fundamental research and biotechnological outcomes. Biotechnol. Adv. 2009;27:376–388.
76. Hawdon JM, Jones BF, Hoffman DR, Hotez PJ. Cloning and characterization of Ancylostoma-secreted protein. A novel protein associated with the transition to parasitism by infective hookworm larvae. J. Biol. Chem. 1996;271:6672–6678.
77. Zhan B, Liu Y, Badamchian M, Williamson A, Feng J, Loukas A, Hawdon JM, Hotez PJ. Molecular characterisation of the Ancylostoma-secreted protein family from the adult stage of Ancylostoma caninum. Int. J. Parasitol. 2003;33:897–907.
78. Mulvenna J, Hamilton B, Nagaraj S, Smyth D, Loukas A, Gorman J. Proteomic analysis of the excretory/secretory component of the blood-feeding stage of the hookworm, Ancylostoma caninum. Mol. Cell Proteomics. 2009;8:109–121.
79. Joachim A, Ruttkowski B, Daugschies A. Comparative studies on the development of Oesophagostomum dentatum in vitro and in vivo. Parasitol. Res. 2001;87:37–42.
80. Krasky A, Rohwer A, Schroeder J, Selzer PM. A combined bioinformatics and chemoinformatics approach for the development of new antiparasitic drugs. Genomics. 2007;89:36–43.
81. Caffrey CR, Rohwer A, Oellien F, Marhöfer RJ, Braschi S, Oliveira G, McKerrow JH, Selzer PM. A comparative chemogenomics strategy to predict potential drug targets in the metazoan pathogen, Schistosoma mansoni. PLoS One. 2009;4:e4413.
82. Keil M, Marhofer RJ, Rohwer A, Selzer PM, Brickmann J, Korb O, Exner TE. Molecular visualization in the rational drug design process. Front. Biosci. 2009;14:2559–2583.
83. Doyle MA, Gasser RB, Woodcroft BJ, Hall RS, Ralph SA. Drug target prediction and prioritization: using orthology to predict essentiality in parasite genomes. BMC Genomics. 2010;11:222.
84. Pong SW, Shiang R. Biopharmaceutical Drug Design and Development. 2010. The use of bioinformatics and chemogenomics in drug discovery. 2nd edn., Humana Press.
