|Home | About | Journals | Submit | Contact Us | Français|
Plant parasitic nematodes are major pathogens of most crops. Molecular characterization of these species as well as the development of new techniques for control can benefit from genomic approaches. As an entrée to characterizing plant parasitic nematode genomes, we analyzed 5,700 expressed sequence tags (ESTs) from second-stage larvae (L2) of the root-knot nematode Meloidogyne incognita.
From these, 1,625 EST clusters were formed and classified by function using the Gene Ontology (GO) hierarchy and the Kyoto KEGG database. L2 larvae, which represent the infective stage of the life cycle before plant invasion, express a diverse array of ligand-binding proteins and abundant cytoskeletal proteins. L2 are structurally similar to Caenorhabditis elegans dauer larva and the presence of transcripts encoding glyoxylate pathway enzymes in the M. incognita clusters suggests that root-knot nematode larvae metabolize lipid stores while in search of a host. Homology to other species was observed in 79% of translated cluster sequences, with the C. elegans genome providing more information than any other source. In addition to identifying putative nematode-specific and Tylenchida-specific genes, sequencing revealed previously uncharacterized horizontal gene transfer candidates in Meloidogyne with high identity to rhizobacterial genes including homologs of nodL acetyltransferase and novel cellulases.
With sequencing from plant parasitic nematodes accelerating, the approaches to transcript characterization described here can be applied to more extensive datasets and also provide a foundation for more complex genome analyses.
Root-knot nematode species, including Meloidogyne incognita, are the most important of the plant parasitic nematodes, infecting almost all cultivated plants, and are responsible for billions of dollars in crop losses annually [1,2]. They are obligatory sedentary endoparasites with a 1- to 2-month life cycle. Embryos develop in a proteinaceous matrix extruded by the adult female, and hatch as second-stage larvae (L2) that move through the soil and invade the plant root. Within the root, the worm establishes a feeding site and undergoes three additional molts to become an adult. M. incognita is a mitotic parthenogenetic species. Males develop but appear to play no role in reproduction . Females swell to a pear shape and are incapable of moving once committing to a root feeding site.
The Meloidogyne L2 larvae, the infective stage where the worm is away from the host plant (also referred to as second-stage juvenile in the literature), is more accessible than the rest of the life cycle, and is an interesting stage biologically with the worm completing multiple steps required for survival. On hatching from the eggshell, L2 worms are able to locate and migrate towards a potential host plant, penetrate the root behind its tip in the zone of elongation, and migrate intercellularly through the vascular cylinder by separating cells at the middle lamella . The migration is enabled by a combination of stylet protrusion (mechanical force) and secretion of cell-wall-degrading enzymes from specialized glands [5-8]. Upon completion of migration, secretions from the nematode's glands, and potentially other cues, induce root cells to alter their development and gene expression, undergoing abnormal growth and repeated endomitotic rounds of replication to form a feeding site made up of giant cells [9,10]. The L2 feeds from the giant cells for 10-12 days, then ceases feeding and molts three times over the next two days to form the adult. L2 undergo significant change following establishment of the feeding site, including swelling of the body and a switch in gland activity from subventral to dorsal dominance .
Until recent years, molecular characterization of Meloidogyne genes has been limited [12,13], particularly because the species' obligate parasitic life cycle makes studies difficult. Both basic understanding of root-knot nematode biology and applied research toward new means of nematode control are now beginning to benefit from the rapid identification of transcribed genes in the species. The generation of expressed sequence tags (ESTs) by single-pass random sequencing of cDNA libraries is a powerful tool for rapid gene transcript identification in metazoans [14-17] including parasitic nematodes of humans and animals [18-23]. High-throughput projects on two dozen nematode species have now brought the total number of publicly available roundworm ESTs to nearly 400,000, with half the sequences coming from parasites [24-27]. As a part of these efforts, EST sequencing from plant parasitic nematodes is in progress  and pilot EST datasets from the root-knot nematode M. incognita and the cyst nematodes Globodera rostochiensis and G. pallida second-stage larvae have recently been analyzed [29,30].
Important to the characterization and understanding of these sequences is the creation and implementation of bioinformatics approaches (such as clustering, functional classification, similarity analysis) that can be applied uniformly across the ever-increasing multiple nematode datasets. We present here an analysis of 5,713 ESTs from M. incognita L2 including creation of NemaGene clusters to reduce sequence redundancy, identification of abundant transcripts, and functional classification of gene products based on assignments to InterPro domains, the Gene Ontology hierarchy, and KEGG biochemical pathways. Building on the availability of the complete genome sequence, gene homologs of the free-living nematode Caenorhabditis elegans  were identified for M. incognita clusters and correlated with known RNA interference (RNAi) phenotypes. Genes specific to plant parasitic nematodes (Tylenchida species) as well as prokaryotic-like horizontal gene transfer candidates were also examined.
As part of a larger effort to examine expressed gene sequences from parasitic nematodes, we have generated and submitted to GenBank's EST database 5,713 ESTs from a M. incognita L2 library. Sequences, which include both 5' and 3' reads, averaged 481 nucleotides, resulting in 2.82 million submitted nucleotides. Here we present a first analysis applying semi-automated bioinformatics tools to genome data from a plant parasitic nematode, thereby laying the groundwork for more extensive analyses.
To reduce data redundancy, improve base accuracy and transcript length, and determine gene representation within the library, ESTs from the M. incognita L2 library were grouped by sequence identity into contigs and clusters by a method using Phrap and BLAST. 'Contig' member ESTs appear to derive from identical transcripts while 'cluster' members may derive from the same gene yet represent different transcript splice isoforms (that is, ESTs form contigs, contigs form clusters). Beginning with 5,713 traces, automated screens and manual inspection of misassembled contigs resulted in the elimination of 52 ESTs as potential chimeric sequences. The remaining 5,661 ESTs formed 1,798 contigs and 1,625 clusters. Clusters varied in size from a single EST (723 cases) to 77 ESTs (1 case) (Figure (Figure1).1). By eliminating data redundancy during contig building, the total number of nucleotides used for further analysis was reduced from 2.82 million to 1.99 million. To a first approximation, this project generated sequence from as many as 1,625 genes, for a new gene discovery rate of 29%, with only 13% of ESTs being singletons. This number may, however, overestimate gene discovery as a single gene could be represented by multiple non-overlapping clusters. While library redundancy reduces the number of new genes discovered, 65% of clusters still have 10 or fewer EST members. Such redundancy is desirable to increase base accuracy and transcript length within contigs. Additionally, 122 clusters have multiple contig members, revealing potential splice isoforms. Contig building was successful in significantly increasing the length of assembled transcript sequences from 481 ± 108 nucleotides for submitted ESTs alone to 611 ± 174 nucleotides for multi-member contigs. The longest sequence also increased from 780 to 2,353 nucleotides. Sampling of another 5,661 ESTs from the same source is estimated to result in the discovery of only 329 new clusters, a new gene discovery rate of only 6% (ESTFreq, W. Gish, personal communication). Further sampling will therefore await library normalization. This same clustering methodology is being applied to ESTs from other nematode species .
The 25 most abundant EST clusters accounted for 18% of all ESTs generated. A high level of representation in a cDNA library generally correlates with high transcript abundance in the original biological sample , although artifacts of library construction can result in selection for or against representation of some transcripts. Transcripts abundantly represented in the library include genes encoding cytoskeleton proteins (such as myosin, actin, UNC-87, troponin T) and proteins that carry out core eukaryotic energetic and metabolic processes (for example ADP/ATP translocase, lactate dehydrogenase) (Table (Table1).1). Sixty-four ESTs had significant homology to the putative fatty-acid-binding protein Sec-2, confirming the abundant expression of this gene reported in L2 cDNA libraries from M. incognita  and the cyst nematodes G. rostochiensis and G. pallida cDNA . Sec-2 is secreted by plant-parasitic nematodes at relatively high levels . Several abundantly expressed genes are also horizontal gene transfer candidates (see below).
To categorize transcripts by putative function, we have utilized the Gene Ontology (GO) classification scheme [35,36]. GO provides a dynamic controlled vocabulary and hierarchy that unifies descriptions of biological, cellular and molecular functions across genomes. InterProScan was used to match Meloidogyne clusters to characterized protein domains (5,875 entries) in the InterPro database . Existing mappings of InterPro domains allowed placement of Meloidogyne clusters into the GO hierarchy, viewed locally with the AmiGO browser. Of 1,625 clusters, 1,280 (79%) have homologies beyond M. incognita, 693 (43%) align to InterPro domains, and 475 (29%) map to the GO hierarchy. These 475 clusters represent generally conserved genes containing domains with characterized biochemical and physiological function in other species. The actual mappings are more complicated than one-to-one: the 693 clusters with InterPro alignments match to 379 InterPro domains, and the 475 clusters with GO assignments have 764 mappings to 127 GO categories.
Gene Ontology representation of M. incognita clusters is shown for each organizing principle of GO: biological process (Table (Table2a,2a, Figure Figure2a),2a), cellular component (Table (Table2b,2b, Figure Figure2b),2b), and molecular function (Table (Table2c,2c, Figure Figure2c).2c). Table Table22 and Figure Figure22 provide a breakdown of representation by major GO categories. A complete listing of GO mappings is available as additional data with the online version of this article. While hatched L2 before plant invasion are a long-lived non-feeding dispersal stage , GO categories reveal numerous transcripts encoding metabolic enzymes, including those involved in biosynthetic pathways. Distributions of clusters by GO categories can be compared to findings from other species using the TIGR gene index [38,39] which includes information for three nematodes - the free-living C. elegans and the human filarial parasites Brugia malayi and Onchocerca volvulus. Table Table33 compares observed GO representation among nematode species. The most striking initial differences in M. incognita GO representation from the other three species were for molecular function, where 52% of Meloidogyne clusters had ligand-binding/carrier mappings versus 24-28% for the other species, and cellular component, where 15% of M. incognita clusters had extracellular mappings versus 0-2% for the other species. Meloidogyne extracellular mappings (15 clusters) were all within the category of SCP/Tpx-1/Ag5/PR-1/Sc7 extracellular proteins (InterPro domain IPR001283) and showed homology to the genes vap-1 from H. glycines and Mi-msp-1 from M. incognita [40,41], both venom allergen antigen 5 family members with homologs in numerous nematodes including hookworms and C. elegans . Categories that particularly contributed to the abundance of ligand-binding/carrier mappings for Meloidogyne included EF-hand calcium binding (22 clusters), RNA recognition motif (18 clusters), and a variety of ATP-binding domains (20 clusters). Differences in the distribution of GO mappings may be attributable to the more extensive stage representation available for the other species. Comparisons of relative expression levels for genes among different M. incognita stages will begin to be possible as EST collections from other life-cycle stages are generated and analyzed.
As an alternative method of categorizing clusters by biochemical function, clusters were assigned to metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes database (KEGG ) using enzyme commission (EC) numbers as the basis for assignment. EC numbers were assigned to 258 clusters (16% of total), of which 176 (11%) had mappings to KEGG biochemical pathways (361 total and 212 unique mappings). Out of 82 possible metabolic pathways 56 were represented (Table (Table4).4). For a complete listing of KEGG mappings see Additional data files. Pathways well represented by the M. incognita clusters include: glycolysis/gluconeogenesis (10 enzymes represented), citrate cycle (7), fatty-acid metabolism and biosynthesis (11), pyrimidine metabolism (7), lysine degradation (8), arginine and proline metabolism (8) and tryptophan metabolism (8). Lysine, arginine and tryptophan are essential amino acids in C. elegans whereas proline is not . Pathways not represented in Meloidogyne include alkaloid biosynthesis II and riboflavin (vitamin 82) metabolism. C. briggsae is incapable of synthesizing riboflavin  but C. elegans does appear to have a homolog of a riboflavin kinase (R10H10.6) and M. incognita may have at least one enzyme involved in riboflavin processing (see below).
Nematodes are believed to be unique among animals in utilizing the glyoxylate cycle to generate carbohydrates from the beta-oxidation of fatty acids (reviewed in ). The glyoxylate pathway, generally found in plants and microorganisms, is similar to the citrate cycle, but relies on two critical enzymes, malate synthase and isocitrate lyase, to bypass two decarboxylation steps. Nematodes appear to use this pathway for energy production from stored lipids during starvation or non-feeding stages [47,48] such as Meloidogyne pre-infective L2. Eight M. incognita L2 clusters map to five glyoxylate pathway enzymes. These include homologs of malate synthase (MI00879.c1, EC 18.104.22.168, BLASTX probability of 2e-31), several enzymes not shared with the citrate cycle (for example, formate tetrahydrofolate ligase, EC 22.214.171.124, 5e-38), as well as two shared with the citrate cycle (for example, malate dehydrogenase, EC 126.96.36.199,4e-29). Isocitrate lyase (EC 188.8.131.52) was not observed in this EST collection, but the first putative Tylenchida homologs of this gene have subsequently appeared from our further EST sequencing (M. hapla BM883225, and M. javanica BI324412). In C. elegans, two genes each encode unusual bifunctional enzymes containing both isocitrate lyase and malate synthase domains . Since the isocitrate lyase domain lies within the amino-terminal half of the C. elegans bifunctional enzyme and none of the Meloidogyne EST reads stretches across both domains, further sequencing of the 3' end of cDNA clones from the M. hapla or M. javanica isocitrate lyase ESTs will be necessary to determine whether the Meloidogyne genus contains a bifunctional glyoxylate enzyme homolog similar to that of C. elegans. The presence of glyoxylate pathway enzymes in Meloidogyne L2 provides experimental support for the model describing this larval stage as the functional equivalent of the C. elegans dauer larva . These ESTs and their corresponding cDNA clones will be useful reagents for the further study of the glyoxylate pathway in different stages of the Meloidogyne life cycle. For instance, energy metabolism would be expected to change markedly upon plant invasion and intracellular migration toward the feeding site, and might include a decrease in expression of transcripts specific to the glyoxylate pathway.
Figure Figure33 is a Venn diagram combining the results of BLAST searches versus three databases for the 79% (1,280/1,625) of M. incognita clusters which had matches to sequences from other species. Strikingly, in the majority of cases where homologies were found (740/1,280), matches were found in all three of the databases surveyed - C. elegans proteins, other nematode sequences, and non-nematode sequences. Gene products in this category are generally widely conserved across metazoans and many are involved in core biological processes. This category should continue to expand as additional complete genomes become available [50,51].
The 20% of contigs (353) that had no homology may contain novel or diverged amino-acid coding sequences that are specific to Meloidogyne species or even to M. incognita only. Alternatively, clusters which containing mostly 3' or 5' untranslated regions (UTRs) would lack BLASTX homology because they are non-coding or contain too short a coding sequence to result in significant homology. To examine this latter possibility contig consensus sequences with and without BLASTX homology were examined to determine their longest open reading frame (ORF). The distribution of ORF sizes indicates that clusters without homology contain two populations; one population of novel protein-coding sequences with a similar distribution of ORF sizes to that found in sequences with homology, and a second population of UTR sequences containing random or generally short ORFs (Figure (Figure4).4). The combined distribution is bimodal (relatively high left shoulder) with a mean ORF size of 140 amino acids versus a mean ORF size of 172 amino acids for sequences with homology. A further characterization of novel M. incognita genes could begin by examining those with longer ORFs as these are most likely to be real coding regions.
In contrast to these findings for M. incognita where most clusters had homology, BLAST searches with EST clusters from the filarial nematode B. malayi showed far fewer database matches with the same e-value cut-off of 10-5  - 57% versus 79%. Part of this difference is due to the use of more extensive databases in the M. incognita search. For instance, the Meloidogyne search included all dbEST sequences in the 'other nematode' set, resulting in matches for 61% of all clusters, whereas the Brugia search used only protein sequences in GenBank and saw matches in only around 12% of cases. However, even matches in C. elegans were fewer for B. malayi (50% versus 67%), where nearly identical databases were used. Brugia, Meloidogyne and Caenorhabditis represent three separate major nematode clades (III, IV and V, respectively) . Possible explanations for the discrepancy in matches are that the Brugia clusters contain a large fraction of non-coding sequences (that is, 5' and 3' UTR, unspliced introns) or have undergone more rapid molecular evolution and diversification. Alternatively, since the Brugia ESTs derive from 12 different libraries they may represent rarer transcripts than are contained in the M. incognita collection. A correlation between stage of expression and molecular conservation has been observed in C. elegans .
As expected, the C. elegans genome  was the best source of information for interpreting M. incognita sequences with 85% of all clusters with matches showing homology to a C. elegans gene product (Figure (Figure3).3). Table Table55 presents the 15 gene products with the highest level of conservation (e-240 to e-115) between M. incognita and C. elegans; these include gene products involved in cell structure (for example, actin, myosin), protein biosynthesis (for example, ribosomal proteins) and glycolysis (for example, lactate dehydrogenase, enolase). Representation of these clusters in the M. incognita L2 EST collection varied from common (77 ESTs) to rare (1 EST). None of these most conserved gene products was nematode specific. Out of all clusters 281 (17%) had homology only to nematodes, either C. elegans (80), other nematodes (53), or both (148). The most conserved of these nematode-specific proteins had a probability value of e-77. Included among the most conserved nematode-specific proteins were previously characterized nematode-specific domains including the transthyretin-like domain IPR001534  (MI00092.c1), as well as uncharacterized C. elegans hypothetical proteins (for example, MI01590.cl = TrEMBL Q19251; MI00719.cl = TrEMBL P90889).
Thirteen M. incognita clusters lacked homology to any C. elegans protein in Wormpep (v.54) yet had significant homology to regions of the C. elegans genome by TBLASTX. Such matches might reveal unpredicted protein-coding regions within the genome. Most of the clusters, including MI00112.cl, MI0000518.cl, MI01572.cl (matching to C. elegans LG V:10343341..10344858), MI01502.cl (LG X:16624802..16624921), MI00768.cl (LG III:2421909..2421700) matched regions of the genome where genes were predicted in later versions of Wormpep (WP 88, WP 73 and WP 65, respectively) indicating the usefulness of ESTs from other nematodes in predicting C. elegans coding regions. In fact, ESTs from our parasitic nematode sequencing project are being continually mapped to the C. elegans genome  and used by Wormpep curators for this purpose. We are further investigating other regions of homology such as MI00899.cl (LG 11:7443833..7443537) to determine whether modifications to current C. elegans gene-structure predictions are necessary.
Nematodes process many mRNAs by trans-splicing to SL1 and other splice leader sequences [57,58] and in C. elegans use of different splice leaders is tied to genome organization in operons . SL1 is the predominant nematode splice leader and is highly conserved across many species. Use of SL1 by transcripts is estimated at 70% in C. elegans , more than 80% in Ascaris lumbricoides , and approximately 60% in G. rostochiensis (Ling Qin, personal communication). SL1 has previously been observed in M. incognita , although genes with non trans-spliced 5' ends have also been cloned [5,6]. Only 33 of our M. incognita contigs have an SL1 sequence at their 5' end. This limited detection of SL1 is not surprising as both the poor processivity of reverse transcriptase and the positioning of the vector sequence primer near the beginning of the insert result in low representation of the initial 5' nucleotides of a transcript among EST collections. As an alternative method of determining which M. incognita genes may have an SL1 splice leader, contigs were compared by BLASTN to our recently sequenced ESTs from a M. arenaria egg library produced by PCR with an SL1 primer sequence. Of the M. incognita contigs 188 had high-level nucleotide identity (better than 1e-30) to this collection of SL1-containing Meloidogyne genes. With ESTs now available in our collection from four Meloidogyne and numerous other SL1-PCR cDNA libraries , it should be possible to address whether or not SL1-splicing of individual genes is conserved across nematode species.
The technique of RNAi, whereby the introduction of a sequence-specific double-stranded RNA leads to degradation of matching mRNAs , has allowed the systematic surveying of thousands of C. elegans genes for phenotypes following transient gene knockout [63-65]. Such information is potentially transferable to understanding which genes have crucial roles in parasitic nematodes where high-throughput RNAi is not yet possible. A list of 7,212 C. elegans RNAi experiments surveying 4,786 genes was compared to the list of all M. incognita clusters with significant homology to C. elegans proteins. Using the criterion that the C. elegans gene was the best match available for one of the Meloidogyne clusters and RNAi experimental information was available, 539 genes were revealed. A specific phenotype by RNAi was apparent for 221 (41%) of these genes, whereas 318 (59%) remained wild type (see Additional data files for the complete list of C. elegans RNAi phenotypes for genes with M. incognita homologs). By comparison, RNAi surveys of all predicted genes on a C. elegans chromosome have found a smaller percentage of genes with phenotypes: 14% for chromosome I  and 13% for chromosome III . Surveys of expressed genes reveal an intermediate level of 27% with phenotypes . Further, selecting for C. elegans genes with expressed Meloidogyne homologs led to enrichment for genes with severe phenotypes by RNAi such as embryonic lethality or sterility as compared to the overall dataset (Figure (Figure5)5) (For a complete tally of all observed phenotypes see Additional data files). A correlation between sequence conservation and severe phenotype by RNAi had previously been shown by comparison of C. elegans to genomes from the distant phyla Saccharomyces, Drosophila and human [63,64]. Here we show a similar trend following detection of homology to expressed genes in other nematode species. Applying RNAi techniques directly to parasitic nematodes is challenging owing to the organisms' generally longer and more complex life cycles, including the requirement for passage through a host organism. Progress has been made recently in assaying RNAi effects in both plant  and animal  parasitic nematodes. Further success may allow for a more high-throughput examination of phenotypes resulting from transient gene knockout in parasites.
Fifty-three M. incognita clusters showed homology to sequences from other nematode species yet lacked either C. elegans or non-nematode homologs. Twenty of these clusters showed conservation only to gene products from other Tylenchida species. MI00244.cl, for example, had homology to 47 ESTs in our collection from other Tylenchida species including root-knot nematodes M. javanica, M. hapla and M. arenaria, cyst nematodes H. glycines and G. rostochiensis, and the lesion nematode Pratylenchus penetrans with E-values from 7e-78 to 3e-05. The best homology to any C. elegans protein was an extremely weak match (E-value = 0.017) to hypothetical protein M01H9.3b. Genes in this collection may be rapidly evolving so that homologs are only detected in closely related species. Alternatively, genes may be special adaptations to plant parasitism. No annotation is available for any of these genes, but alignments with sequences from related species can define domains for further characterization.
In 1998, it was discovered that plant parasitic nematodes possess genes encoding beta-1,4-endoglucanase enzymes (cellulases) and that by far the strongest non-Tylenchida homologs for these enzymes were prokaryotic cellulases from Pseudomonas, Clostridium and other microbes. Following isolation from G. rostochiensis and H. glycines , cellulases have been identified in M. incognita , G. tabacum , H. schachtii , and P. penetrans . Additional prokaryotic-like sequences identified in plant parasitic nematodes include other cell-wall-degrading enzymes such as xylanase , pectate lyase [8,71] and polygalacturonase , and evidence is accumulating that these sequences have been acquired by horizontal gene transfer . The known Meloidogyne cellulase (MI00483.cl), potentially novel cellulases (MI00537.cl, MI01196.cl, MI01381.cl, MI01842.cl), and pectate lyase (MI00592.cl, MI00520.cl) were represented in the M. incognita EST clusters.
MI01045.cl, the seventh largest Meloidogyne EST cluster, is a new horizontal gene transfer candidate with homology to nodL acetyltransferase from Rhizobium leguminosarum (1e-53). Nod factor is responsible for the induction of nodules in nitrogen-fixing plants and nodL has an essential role in Nod factor biosynthesis . Experimental demonstration of a trans-spliced leader on the Meloidogyne nodL mRNA and the presence of introns in the gene confirm that it is not a bacterial contaminant and more extensive characterization is underway (E.H. Scholl, J.L. Thorne, J.P.M. and D.M.B., unpublished work). It is possible that root-knot nematodes have adapted a portion of Nod factor biology to the induction of feeding sites, rather than nodules, in plants.
To identify further horizontal gene transfer candidates from the M. incognita EST clusters, the subset of clusters with homology to sequences in other Tylenchida and in non-nematodes but not in non-Tylenchida nematodes were examined. In addition to those sequences already characterized, four additional clusters of interest were identified. MI00109.cl shows homology to a group of hypothetical proteins from alpha-proteobacteria: Sinorhizobium meliloti NP_386252 (3e-44); Novosphingobium aromaticivorans ZP_00095448 (3e-38); Mesorhizobium loti NP_107072 (5e-37). The finding of multiple Tylenchida genes with close homologs in rhizobacteria suggests the possibility of horizontal transfer of cassettes of genes or multiple transfer events between nitrogen-fixing soil bacteria and plant parasitic nematodes. MI01406.cl and MI00267.cl show homology to two hypothetical proteins from the Actinomycetales - Amycolatopsis mediterranei CAC42207 (5e-29) and Streptomyces lavendulae AAD32751 (2e-24). Providing some clue to function, the clusters as well as the hypothetical proteins are more distant homologs (1e-05 to 1e-08) of a putative riboflavin aldehyde-forming enzyme from Agaricus bisporus, CAB85691 (D.C. Eastwood, GenBank direct submission, 2000), an annotation based on homology (5e-05) to the characterized enzyme from Schizophyllum commune . A weak but common motif between all of the proteins is discernible.
As recently as February 2000 only 22 ESTs from plant parasitic nematodes had been deposited in dbEST. As of October 2002, that number has risen to 46,876, including 42,210 from Washington University and collaborators. Included are 32,735 sequences from Meloidogyne species (M. incognita 12,752, M. hapla 11,049, M. javanica 5,600, M. arenaria 3,334), as well as ESTs from cyst nematode species (G. rostochiensis 5,934, H. glycines 4,327, G. pallida 1,832), and the lesion nematode (P. penetrans 2,048). The majority of these sequences have been isolated from L2 and egg libraries, but sequencing from more diverse stages is now underway.
The only previous analysis of root knot nematodes ESTs  used 914 ESTs from M. incognita L2 without clustering and with non-automated assignment of genes to categories. The two datasets share some overlap, with 35% (316/914) of the previously analyzed ESTs finding matches in 16% (261/1,625) of the NemaGene clusters analyzed in this paper, many with strong homology (< 1e-40). This overlap was less than expected given the redundancy of the cDNA library analyzed here, at nearly 6,000 ESTs, and suggest that: first, libraries made by different methods are likely to result in different representation from an mRNA pool (either different genes or other portions of the same genes as a result of different 5' processivity); and second, that M. incognita L2 are likely to have a substantial number of unsampled messages awaiting generation of new libraries or library normalization. The semi-automated clustering, sequence homology searching and scripted assignment of sequences to functional categories presented here is a scalable approach to analysis that can be applied to larger datasets.
In addition to applying the approaches presented here to larger and more diverse datasets, further topics in Meloidogyne genome analysis have yet to be explored. The availability of ESTs representing different developmental stages of Meloidogyne will allow an examination of changes in gene representation between stages, and in turn an understanding of the relative importance of various metabolic processes at different stages of development. EST sequences and their corresponding clones can be further used to study relative expression level between stages and conditions using microarrays  and amplified fragment length polymorphism (AFLP) approaches . Contig sequences within clusters can also be compared directly for evidence of alternative splicing, another feature which might correlate with developmental stage. Other topics where bioinformatics analysis of available ESTs can improve current knowledge of Meloidogyne molecular biology include the identification of secreted and transmembrane proteins through secretion signal sequence detection , the creation of a more accurate codon usage/bias and amino-acid usage tables , the identification of conserved genes and pathways used in dauer/infective stages across nematode species , the definition and study of nematode-specific domains , and improved phylogenies based on sampling from multiple genes .
While ESTs do not provide information on genome organization in Meloidogyne (no genome sequence or physical map is yet available), they can shed light on the organization of the C. elegans genome. For instance, C. elegans autosomes are organized into central regions dense with predicted genes, highly expressed genes and known mutants, whereas the chromosome arms contain more repetitive sequences and have a higher meiotic recombination rate [31,80]. By using the expanding collection of ESTs from nematodes at various evolutionary distances from C. elegans, the hypothesis that genes on the autosome arms are more rapidly evolving can be tested more systematically. Mapping of ESTs from other nematode species can also detect genes contained in the C. elegans genome yet not previously recognized, and therefore missing from Wormpep, as well as recognized genes where not all exons have been correctly predicted.
In conclusion, the 5,713 ESTs analyzed here in 1,625 clusters probably represent 6-10% of the genes in the M. incognita genome. This initial study, which will be expanded as further sequences are generated, demonstrates that EST generation is an effective method for the discovery of the new genes in plant parasitic nematodes. Further, functional categorization and comparison to known sequences allows the identification of important biological processes at specific developmental stages as well as unusual sequences, such as horizontal gene transfer candidates.
To obtain M. incognita L2 larvae, a population of nematodes maintained on Rutgers tomato were harvested, eggs were isolated and hatched by standard protocols . Briefly, galled roots were removed from sandy soil, rinsed, and shaken in 15% bleach for 3 min to break roots and free egg masses. Contents were filtered with a large excess of water through a No. 200 sieve to remove root and soil fragments, and a No. 500 sieve to retain nematode eggs. Decanted eggs in small volume were applied above a 40% sucrose solution in a 50 ml conical tube and spun at 2,000 rpm for 10 min. Eggs banded at the sucrose/water interface and were removed by pipette. Following rinsing, sucrose banding was repeated. Harvested eggs were hatched over 4 days on top of a moist filter paper barrier (3 Crown Shopmaster heavy-duty wipes). Hatched larvae migrated through the paper and were collected in a water-filled petri dish below. By microscopic examination, collected worms were predominantly live moving L2, but rare dead L2 and eggs could be found.
Total RNA was isolated from collected L2 by the Trizol method (Life Technologies, Gaithersburg, MD) with a yield of 380 μg from around 1 ml of packed L2 worms. Poly(A)+ RNA was isolated from total RNA using the Promega Isolation System II (Promega, Madison, WI) following the manufacturer's instructions with a yield of 4.04 μg. The cDNA library (named Bird_Rao_Meloidogyne_incognita_J2) was constructed using the Zap Express cDNA Synthesis Kit and Gigapack III Gold Cloning Kit, 200403 (Stratagene, Cedar Creek, TX). Inserts were directionally cloned between an EcoRI site (5') and a XhoI site (3'); however, sequencing indicates that ~22% of clones are in reverse orientation. The non-directionality of the library does not interfere with either clustering or homology detection as both orientations are examined. The titer of the non-amplified phage library was 70,000 recombinants. In preparation for high-throughput sequencing the pBK-CMV phagemid was excised in bulk from the Zap Express phage using the ExAssist Interference-Resistant Helper Phage protocol 211203 (Stratagene). Resulting plasmids were replicated in the helper phage-resistant host cell XLOLR with kanamycin selection. It is expected that the majority of messages in this whole-animal library derive from the tissues that make up most of the mass of the L2 animal including hypodermis/cuticle, intestine, muscle, esophageal and rectal gland, and esophagus/pharnyx .
Clone processing and sequencing was performed as in Hiller et al.  with some modifications. Single bacterial colonies from the plasmid library were picked from agar trays into 384-well plates containing media, kanamycin, and 7% glycerol using a Q-bot robotic colony picker (Genetix, Christchurch, UK). Plates were incubated overnight at 37°C and stored at -80°C. To prepare template plasmid DNA from each sample, bacterial inoculates were transferred from 384-well storage to 96-well growth blocks containing 1 ml medium per sample and grown overnight. All subsequent sample and reagent transfers were done using a stationary 96-channel Hydra (Robbins Scientific, Sunnyvale, CA). DNA isolation was performed using a fast and inexpensive microwave-based protocol . Sequencing reactions using the T3 (5') primer employed BigDye terminator chemistry (Applied Biosystems, Foster City, CA) and the cycle sequencing reactions were performed with 96 × 4-block thermocyclers (MJ Research, Waltham, MA). Samples were loaded on ABI377 (96-lane slab gel) sequencers (Applied Biosystems).
Following gel image analysis and DNA sequence extraction, sequence data were processed in an automated pipeline to: assess EST quality; trim flanking vector sequences; mask repetitive elements; remove contaminated ESTs; identify similarities by BLAST; identify cloning artifacts; and determine which portion of the EST to submit . The resulting sequences were annotated with similarity information and sequence quality information and submitted to dbEST. Clones are named for their 96-well plate identity and position during processing (for instance ra40e04.y1). Names are mapped to stored clone location in 384-well plate format. Clones can be ordered at . From 7,818 attempted reads, 5,854 sequences (75%) passed quality and contamination filters and were submitted to dbEST . Most submissions (5,713) were made between March and June of 2000. An additional 141 ESTs originally failed as bacterial contaminants (by an overly inclusive filter) have since been submitted (September 2001), but are not included in this analysis. EST sequences are available from GenBank, EMBL and DDJB under the accession numbers AW440989-AW441125, AW570643-AW571393, AW588598-AW588988, AW589050-AW589115, AW735503-AW735730, AW782981-AW783662, AW827629-AW830045, AW870657-AW871697, and BI773381-Bl773521. Submissions total approximately 2.8 million nucleotides.
A failure rate of 25% is typical for high-throughput sequencing and resulted from poor overall trace quality (~21% of all reads), missing insert (~0.3%), small insert size (~0.06%), and E. coli contamination (~0.1%). To further exclude bacterial contamination we have closely examined cases where strong amino-acid homology to prokaryotic genes is observed (see Horizontal gene transfer candidates). Many of these genes have already been confirmed as of M. incognita origin by cloning from genomic DNA, in situ localization and the finding of homologs in other Tylenchida nematodes. In all of these cases, the high level of identity observed at the amino-acid level does not extend to nucleotide level, and GC content and codon usage is typical of other M. incognita transcripts (E.H. Scholl, J.L. Thorne, J.P.M. and D.M.B., unpublished work).
To estimate the number of 5' versus 3' reads, we examined the 4,198 ESTs with detectable homology on either sense or antisense strands at time of submission (BLASTX search versus the SWIR non-redundant protein database, Sanger Centre). Most ESTs (78%) showed translated amino-acid homology consistent with sequencing from the 5' end of the transcript, while 22% showed homology consistent with 3' end sequencing. The mean submitted read length was 481 nucleotides with a standard deviation of 108. Longest and shorted submitted reads were 49 and 780, respectively. Since our submission filter includes a quality cut-off at the distal end of the read (Phred Score < 12 [87,88]), additional sequence can sometimes be obtained by direct examination of the sequencing trace available at .
Clustering was performed by first building 'contigs' of ESTs with identical or nearly identical overlapping sequence and second, by bringing together related contigs to form 'clusters'. Contig member ESTs should all derive from identical transcripts whereas cluster members might derive from the same gene yet represent different transcript splice isoforms or transcripts from multigene families with extremely high sequence identity. The raw traces for submitted ESTs were base-called using Phred  and assembled to form contigs using Phrap (P. Green, personal communication). Although Phrap is a program intended for genome assembly, it has been applied previously to ESTs with modifications . To determine initial assembly quality, the largest contigs were inspected using the assembly viewer Consed . Misassemblies bringing unrelated ESTs together into giant contigs usually resulted from the alignment of long poly(A) tails. To eliminate these assemblies of otherwise dissimilar ESTs, Phrap parameters (forcelevel 1, minmatch 20 and minscore 100) were adjusted and Phrap was rerun.
Once acceptable assembly parameters were obtained, Phrap was run to generate a first-draft assembly. Contigs with only one member EST (singletons) were removed from consideration until the trimming and cluster building stage. All contigs with more than three member ESTs was screened for misassemblies using Consed tools and newly written scripts. Misassemblies were recognized by: regions of high quality unaligned sequence; multiple runs of poly(A) and/or poly(T) (at least 15 nucleotides with no more than a one non-A/T base); internal poly(A) and/or poly(T) runs (> 50 nucleotides from either end of a contig and ≥ 15 or more nucleotides long with no more than one non-A/T base; internal stretches of low consensus quality (> 30 nucleotides from either end of a contig and ≥ 50 nucleotides where 90% of the nucleotides had a consensus quality below Phred 20). Contigs flagged for possible misassembly were manually edited in Consed and potentially chimeric ESTs and other suspect ESTs were identified and removed from the pool of traces. Chimerism can result from multiple-insert cloning or mistracking of sequence gel lanes. The project was reassembled with Phrap and screened again as above. All contigs with more than three members were examined again in Consed to eliminate additional misassemblies not resolved by the initial screens. In total, around 450 contigs were examined manually and around 200 were edited. For each contig, a consensus sequence of all EST members was generated. Contigs (now including singleton EST contigs) were then trimmed to high quality and any internal consensus position with a calculated quality value below 12 was changed to an N (unknown base).
Following the creation of contigs by Phrap, the contig consensus sequences were compared using WU-BLASTN (G = 2 E = 1 v = 100 F = F) [92,93] and grouped on the basis of similarity to form clusters of related contigs. Contigs with overlaps of 100 bases or more with nucleotide-nucleotide identities of 93% or more were clustered together. For further analysis, new assemblies based on clusters were not formed; rather, each cluster retained all the consensus sequences of its contig members. NemaGene Meloidogyne incognita v 2.0 represents our second complete attempt at generating clusters for this species and is used as the basis for all subsequent analysis in this manuscript. Scripts have been written to allow the addition of new data while retaining the original contig and cluster naming scheme. Additional NemaGene versions of M. incognita will be built as additional ESTs become available for the species. A comparison of the NemaGene clustering approach to other EST clustering methods will be considered in a separate manuscript. NemaGene Meloidogyne incognita v 2.0 is available for searching at  and FTP at .
Following clustering, comparative analyses were performed using WU-BLASTX and WU-TBLASTX [92,93] with 1,798 contig consensus sequences (themselves grouped into 1,625 cluster groups) as queries versus multiple databases including SWIR v.21 (5/19/2000) non-redundant protein database and Wormpep v.54 C. elegans protein database (Wellcome Trust Sanger Institute, unpublished work), C. elegans mitochondrial protein sequences, and six internally constructed databases using intersections of data from the GenBank nucleotide database and dbEST . These include: nemnoele (all nucleotide data from the phylum Nematoda with C. elegans removed); nemnoelenomi (nemnoele with M. incognita removed); nemnoelenomel (nemnoele with all Meloidogyne species removed); nemnoelenotyl (nemnoele with all Tylenchida species removed); yestylnomel (all Tylenchida species except Meloidogyne); mj (only M. javanica sequences). An additional database, nrnonem, is an amino-acid database of all non-nematode proteins derived from SWIR v.21. WU-BLASTX (translated nucleotide query versus protein database) parameters were S = 100 M = PAM120 V = 0 W = 4 T = 17. WU-TBLASTX (translated nucleotide query versus translated nucleotide database, each in all six reading frames) parameters were Q = 10 R = 2 gapw = 10. Homologies were reported for e-value scores of 1e-5 and better. By creating intersections of various database search results, contigs/clusters could be organized by their distribution of homologies (for example, clusters which have M. javanica matches but not C. elegans matches). Data analysis was performed in a Unix environment using Perl and Bourne shell scripts. The program ESTFreq (W. Gish, personal communication) was used to estimate novel sequences expected from a second sampling and the program Translate (S. Eddy, personal communication) was used to translate nucleotide consensus sequences for ORF analysis.
To assign putative functions to clusters, the integrated protein domain recognition program InterProScan [97,98] was run locally to search translated contig consensus sequences versus all InterPro protein domains (as of 2 April 2002) . The Prosite, Prints, Pfam, ProDom, and Smart search components of InterProScan were used with default parameters. The GO categorization scheme (go_200205-assocdb.sql) of classification by biological process, cellular component, and molecular function was used to classify clusters based on the existing mappings of InterPro domains to the GO hierarchy . Mappings were stored in a local MySQL database and displayed using the AmiGO browser (16 May 2002)  (M. incognita mappings at ).
As an alternative means of assigning function to clusters, clusters were also assigned to metabolic pathways using KEGG [102,103]. Assignments were made by requiring that the highest-scoring BLAST match in SWIR v.21 have an assigned enzyme commission (EC) number . EC number mappings to KEGG pathways were then used to putatively assign clusters into biochemical pathways. Nonspecific pathway mappings (for example, kinases, EC 2.7.1.-) were eliminated, as were misleading pathway assignments (for example, plant carbon fixation, KEGG 2.3, where the assigned protein had only a peripheral 'feed-in' role in the pathway). Assignments were not made to KEGG regulatory pathways as proteins in these pathways lack EC numbers.
To identify cases where M. incognita and C. elegans share homologous genes which have been surveyed in C. elegans for knockout phenotype using RNAi, a list of all 7,212 available C. elegans RNAi experiments (5 May 2002) from WormBase  was compared to the list of all M. incognita clusters with significant homology matches to the C. elegans Wormpep v.54 protein database. Redundant RNAi experiments were removed to consolidate the WormBase list to 6,107 and experiments performed on the same gene with different phenotypic outcomes were consolidated later. For any given M. incognita cluster, only the best C. elegans matches, ranked by BLAST score, were considered.
To insure that sequences generated originate from M. incognita and are not contaminants, multiple steps purifying material and cross-checking sequence origin have been incorporated into the project: the starting material is purified and freed of plant material; poly (A) selection during library production is highly selective for eukaryotic transcripts, though it is possible for AT-rich prokaryotic transcripts to be cloned; analyzed sequences have been filtered for prokaryotic homology resulting in the removal of eight E. coli contaminants (0.14%), a typical background for cDNA cloning; 96% of the clusters with detectable homology have nematode homologs (1,227/1,280), 17% have only nematode homology, and in the vast majority of cases, higher conservation is seen to a nematode sequence than any non-nematode sequence; additional confirmation of nematode origin comes from the presence of an SL1 trans-spliced leader sequence on some genes; all sequences with strong amino-acid homology to prokaryotic genes were closely examined and in no cases were the high levels of identity maintained at the nucleotide level (as would be the case with a contaminating sequence). While it can be stated with confidence that the vast majority of the sequence analyzed originates from M. incognita, we cannot rule out the possibility that the collection may include a very small number of contaminating sequences.
The following files are available as additional data: a complete listing of gene ontology mappings for M. incognita clusters organized into (a) biological process, (b) cellular component, (c) molecular function (Additional data file 1); complete KEGG biochemical pathway mappings for M. incognita clusters (Additional data file 2); a complete list of C. elegans RNAi phenotypes for genes with M. incognita homologs (Additional data file 3); classification by RNAi phenotype of C. elegans genes with M. incognita homologs (Additional data file 4).
A complete listing of gene ontology mappings for M. incognita clusters
Complete KEGG biochemical pathway mappings for M. incognita
A complete list of C. elegans RNAi phenotypes for genes with M. incognita homologs
A classification by RNAi phenotype of C. elegans genes with M. incognita homologs
Meloidogyne EST sequencing at Washington University was supported by NIH-NIAID research grant AI 46593 to R.W. and NSF Plant Genome award 0077503 to D.B. and S.C. J.M. was supported by a Helen Hay Whitney/Merck Postdoctoral Fellowship. We thank John Spieth, Jeremy Williams and Ian Korf for helpful databases and scripts, Jennifer Schaff for technical support and The Institute for Genomic Research for use of TIGR gene index information. J.M., A.K. and B.C. are employees and equity holders of Divergence Inc. This research was not funded by Divergence Inc.