|Home | About | Journals | Submit | Contact Us | Français|
An accumulation of expressed sequence tag (EST) data in the public domain and the availability of bioinformatic programs have made EST gene expression profiling a common practice. However, the utility and validity of using EST databases (e.g., dbEST) has been criticized, particularly for quantitative assessment of gene expression. Problems with EST sequencing errors, library construction, EST annotation, and multiple paralogs make generation of specific and sensitive qualitative and quantitative expression profiles a concern. In addition, most EST-derived expression data exists in previously assembled databases. The Virtual Northern Blot (VNB) (http://tlab.bu.edu/vnb.html) allows generation, evaluation, and optimization of expression profiles in real time, which is especially important for alternatively spliced, novel, or poorly characterized genes. Representative gene families with variable nucleotide sequence identity, tissue specificity, and levels of expression (bcl-xl, aldoA, and cyp2d9) are used to assess the quality of VNB’s output. The profiles generated by VNB are more sensitive and specific than those constructed with ESTs listed in preindexed databases at UCSI and NCBI. Moreover, quantitative expression profiles produced by VNB are comparable to quantization obtained from Northern blots and qPCR. The VNB pipeline generates real-time gene expression profiles for single-gene queries that are both qualitatively and quantitatively reliable.
Gene expression analysis, pathway profiling, gene regulatory networks, and modeling of biological processes are key for “post-genome project” studies. Various high-throughput methods of expression profiling are commonly employed, such as microarrays, serial analysis of gene expression (SAGE), and quantitative reverse transcription PCR (qPCR); some being more costly and labor intensive than other methods (4). Newer expression profiling technologies include genome-scale in situ hybridization databases (38) (e.g., www.eurexpress.org) and fully sequenced EST libraries using massively parallel DNA sequencing technologies (63). Moreover, there exists a vast array of primary experimental data in the public domain in the form of microarray data, SAGE, and the expressed sequence tags database (dbEST), which can be freely used by investigators for gene expression profiling. Many public microarray databases now provide tools to survey individual gene expression among normal and disease tissues. These include, but are not limited to, Stanford Microarray Database (SMD), Gene Expression Omnibus (GEO), Oncomine, Genesapiens, and Gene Expression Atlas (8,16,32,53,61). A new tool, called the Virtual Northern Blot (VNB), is described herein that maximizes the usefulness of dbEST as a resource in a unique fashion for effective gene expression profiling in real time, something not available in any of these other tools and databases.
ESTs are single-pass sequenced cDNAs representing expressed genes from a specific cell population or tissue (2). They are on average 200–700 nucleotides (nt) derived from partial sequencing of randomly primed or oligo-dT primed cDNA clones from libraries of different tissues. Some libraries have been manipulated (sometimes called normalization) such that rare transcripts might be more highly represented, while other libraries have not been manipulated and thus the proportion of particular cDNA clones should accurately represent the same proportion in the mRNA population in that tissue.
The dbEST is a public domain archival database of cDNA sequence files (10). Since its inception in 1994, dbEST has grown exponentially and this growth is expected to continue. Although a powerful resource for sequence analysis, and especially for identification of novel genes (42), the utility and validity of dbEST for quantitative expression profiling have been criticized. Such criticism stemmed from early high error rates in sequence determination (>3%), poor annotation, partial sequence reads, and large-scale contaminations (3,21,39,52). Despite these issues, numerous EST mining algorithms (31,42) have successfully taken advantage of this tremendous resource (>61 million sequences by May 2009). In addition, methods for systematic validation (60) have shown that some of the early concerns are less problematic as older ESTs have been diluted with higher quality data and better annotation. Expression profiling using dbEST is a common method for exploring the transcriptome (11,51), characterizing novel gene expression (7), and identifying novel pathways in tissues (20). The easy availability of these data has fostered continued improvement and innovation (34,43,69) that underscores the value of this resource.
Gleaning reliable expression information from the archival dbEST database begins with proper identification of ESTs derived from the gene(s) of interest, often by sequence alignment. Common sequence alignment tools (MegaBLAST, BLAT, d2, CAP3, PHRAP) (14,18,28,30,65,70) have been used to cluster ESTs and then assign each cluster to a gene, thus building a gene-indexed database from which expression profiles could be gleaned. This processed data is made available through web-based tools, or pipelines. Two of the most frequently accessed EST analysis pipelines used to display gene-associated EST information are the Genome Browser at UCSC (29), which uses BLAT, and UniGene (41) at NCBI, which uses MegaBLAST. These pipelines are easily accessed and UniGene is among the most commonly used sources for retrieving gene expression data. However, while these tools have enabled the widespread use of EST data, the assembly of these databases is prone to errors from significant sequence error rates, alternative splicing, and lack of genome coverage (11,51). All these issues are especially critical for novel genes and those with very high sequence similarity. In addition, compiling expression profiles from a gene cluster may prove quantitatively inaccurate due to various cDNA construction methods employed (27). Furthermore, such processed data is by its nature not current and pipelines for generating gene expression profiles in real time are not readily available. For those investigators wanting precise, sensitive, and up-to-date gene expression data for a single gene or gene family, there are few tools available for accessing dbEST. In addition, the use of any of these tools for quantitative analysis has not been clearly demonstrated. VNB was specifically designed to address these needs.
VNB is an application that can generate accurate quantitative and qualitative expression patterns for any human or mouse gene, which is available via a web interface (http://tlab.bu.edu/vnb.html). The algorithm is analogous to a classical Northern blot; the program is optimized for single-gene queries for difficult genes (e.g., genes with high sequence identity among paralogs or novel and poorly characterized genes). Validation of VNB, using gene families of varying sequence similarity, function, and expression profiles, demonstrates that this tool is more sensitive and specific than commonly employed algorithms. More importantly, quantitative gene expression information derived using VNB is validated by Northern blots and qPCR.
The VNB algorithm is outlined in the flowchart shown in Figure 1. The algorithm is analogous to performing a Northern blot; first gene-specific “probe” sequences are identified for subsequent in silico “hybridization” against EST libraries of mRNA-derived sequences. The most important part of the algorithm, like in a Northern blot, is selecting the gene-specific probe(s). Using a multiple alignment of the gene of interest to its paralogs (generated by one of two methods, see below), a routine called AutoProbe finds sequence intervals, or probes, that are specific to the gene of interest. These probes can be further verified by ProbeChecker. Using gene-specific probes generated from AutoProbe and ProbeChecker, dbEST is searched using BLAST (5) for ESTs that exactly match at least one of the probes. Each EST is mapped to an EST library by using annotation from Entrez (9). The library identifiers are then used to search library construction information in the Cancer Genome Anatomy Project (CGAP) database (35). The construction information is then used to generate a qualitative and a quantitative expression profile for the query gene. A qualitative profile is generated by counting all ESTs that match the gene-specific probes for each tissue. Another profile, called the quantitative profile, is generated by counting only those ESTs that are annotated as “nonnormalized” by CGAP. Moreover, the quantitative level of expression is calculated as a percentage of the total number of ESTs from “nonnormalized” libraries for each tissue. A depiction of the user interface and a sample output are shown in Figure 2.
The accurate annotation of library information at CGAP is critical to the accuracy of expression profiles generated by VNB. The CGAP site allows for the retrieval of tissue or cell origin, cDNA library construction information, normal or cancerous tissue types, and the number of ESTs in each library (http://cgap.nci.nih.gov/Tissues/LibraryFinder). Currently, CGAP only catalogs ESTs libraries from human and mouse. Both qualitative and quantitative profiles from VNB use the tissues annotated on the CGAP website. A more detailed description of the methodology used by the software can be found in Supplementary Materials (available at http://www.bu.edu/aldolase/lab/software.html).
In order to generate a set of gene-specific probes, AutoProbe needs an alignment of the input gene to all of its paralogs. The user has the option of automatically generating an alignment by using BLAST to query the RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/) (48) with the gene of interest, or directly uploading a custom alignment generated with ClustalW (http://www.ebi.ac.uk/clustalw/) (62). This second option allows the user to define the set of paralogs or alternatively spliced transcripts. This “custom” ClustalW alignment is useful for profiling novel or poorly characterized genes, which do not have some (or any) of their paralogs in the RefSeq database.
Autoprobe uses the multiple alignment of the input gene to its paralogs, generated by either of the two methods described above, to find gene-specific sequence intervals. AutoProbe generates multiple small and overlapping ‘probes” that span the length of the mRNA. The overlapping probes help circumvent issues with sequencing errors, regions of high similarity in paralogs and/or alternative splicing. By overlapping, while one probe may not “hybridize” 100% to one EST, another probe would hybridize by sliding past the error, include or exclude an alternative spliced region, and/or include a sequence more specific to the gene of interest and not a closely matched paralog. AutoProbe breaks the input sequence up into short regions, called windows, and selects the most gene-specific probe from each window, using a dissimilarity matrix compiled from the multiple sequence alignment. Such a matrix and the calculation are exemplified in Figure 3. The two parameters of probe length and window size can be adjusted by the user to improve the profile for specificity and/or sensitivity. Finally, a routine called ProbeChecker ensures specificity of the profiles by discarding probes that exactly match sequences to any of the paralogs in the alignment.
A program called BackBLAST was used to determine whether a set of ESTs was truly derived from the gene of interest. This program queries each EST against the RefSeq database. As outlined in Figure 4, the ESTs whose best match in RefSeq is the input gene are considered true positives, while the others arc considered false positives. However, because BackBLAST must query each EST against the RefSeq database, its runtime is prohibitively large. Thus, it was only used for validation purposes and is not part of the implementation available online.
The bulk of VNB’s runtime is spent on querying dbEST with the set of gene-specific probes. When the number and length of the probes increase (by using a longer input sequence, smaller window size, or greater probe length), the runtime, which is typically 3–30 min, increases considerably. VNB is completely automated and it interfaces with all of the tools/resources that it uses online (there is nothing installed locally) (see Supplementary Materials at http://www.bu.edu/aldolase/lab/software.html). As a consequence, VNB uses the latest versions of BLAST, RefSeq, and dbEST available at NCBI, as well as the latest Entrez, CGAP annotation, and version of ClustalW (36).
Blots of total RNA from 14 mouse tissues (brain, heart, lung, liver, spleen, kidney, stomach, small intestine, skeletal muscle, skin, thymus, testis, uterus, placenta) were purchased from Seegene, Inc. (#1006-1-1302). Probes for the aldolase A gene (aldoA) (387 bp from 3′-untranslated region) and the glyceralde-hyde-3-phosphate dehydrogenase gene (gapdh) (357 bp from 3′-untranslated region) were generated from PCR amplification of cDNA clones pFL (40) and EST (IMAGE ID 3513620), respectively. Forward (aldA_1221) and reverse primers (aldA_1587) for aldoA were as follows: 5′-CTTGACTTTCTCCTATGGTCG-3′ and 5′-CCCTTAAATAGTTGTTTATTG-3′, respectively. Forward and reverse primers for gapdh were 5′-CTACACTGAGGACATGGTTGTCTCATGTGACTT-3′ and 5′-CAGCGAACTTTATTGATGGTATTCAAGAGAGT-3′, respectively. Radio-labeling was done using a modified version of the procedure of Feinberg and Vogelstein (19). Briefly, a solution of 100–150 ng of probe DNA was denatured by boiling for 5 min. Reagents were added and the solution was incubated at 37°C for 2 h. The final concentrations in 20 μl were 125 mM HEPES, pH 6.6, 125 mM Tris-HCl, pH 8, 20 mM β-mercaptoethanol, 12.5 mM MgCl2, 50 mM each of dCTP, dGTP, and dTTP, 40 μg/ml BSA, 13.5 A/ml random octomers, 1 unit DNA polymerase I (Klenow fragment), and 60 μCi of [α-32P]dATP (1 mM) (3000 mCi/mmol). The reaction was stopped by fivefold dilution in 10 mM Tris-HCl, pH 8, 1 mM EDTA and the labeled DNA was purified from unincorporated dATP by gel filtration with BioGel P-6 DG (58). Specific radioactivity ranged from 107 to 109 cpm/μg. Prehybridization and hybridization reactions were at 42°C in 5× SSC, 10× Denhardts, 20 mM phosphate, pH 7, 7% SDS for 1 and 16 h, respectively. Washing of the blot was performed at 65°C in 0.5× SSC, 0.5% SDS. A phosphor-imager was used for quantitative tabulation of the hybridization results. The mRNA from aldoA and gapdh coelectrophoresed in these blots and the same blots were used for each probe after washing and checking that signals were at background before re-hybridization.
Total RNA from eight mouse tissues (liver, spleen, skeletal muscle, kidney, brain, testis, lung, and salivary gland) was purchased from BD Biosciences-Clontech (Mouse—Multiple Tissue Total RNA Panel, #636644). Fragments from mRNA encoding aldolase A or GAPDH were amplified from these tissues following cDNA synthesis by reverse transcriptase (Superscript II) using oligo-dT priming of RNA (0.14 μg/μl) as described by the manufacturer (Invitrogen). PCR, using the same primers used for Northern blot probes described above, was used to confirm cDNA synthesis and the specificity of the PCR reactions. Real-time PCR was performed in 384-well plates using an ABI 7300 instrument and PCR cocktail containing SYBR green from ABI according to the manufacturer’s procedures. Amplification of four 10-fold serial dilutions of the cDNA reactions was recorded by fluorescence changes during the denaturation cycles of PCR (15 s at 95°C, 30 s at 55°C, 70 s at 72°C for 40 cycles). The cycle threshold was recorded and plotted as a function of the dilution to generate a straight line with a slope that was related to the doubling efficiency (10−1/slope). The efficiency raised to the value of the intercept of the line at no dilution is a measure of the relative amount of cDNA for each gene in the tissue samples.
The experimentally determined expression levels were taken directly from quoted values in the cited reports or measured from relative intensities measured on a densitometer from zymograms or Northern blots displayed in the figures.
For testing the effectiveness of VNB, queries were used that belonged to three conserved superfamilies. The first, aldolase A, which belongs to a superfamily of aldolases (15), has two closely related isozyme genes. The closest isozyme, aldolase C, shares 85% overall sequence identity with the aldoA cDNA (55). Second, the cyp family is represented by cyp2d-9 (steroid 16-α-hydroxylase gene), which shares many regions of very high sequence identity with its paralogs (57). The closest gene family member (cyp2d-13) is 93% identical in cDNA sequence. The other paralogs include cyp2d-3, cyp2b (−9, −10, −13, −19, −21), cyp2c (−29, −38, −39, −44), cyp2g-1, and cyp2j (−3, −5, −6, −9, −11, −13). The cyp2 family is expressed at low levels and in only limited numbers of tissues (17,24). Lastly, BCL-xL, an antiapoptotic protein, belongs to a large family of proteins involved in apoptosis and shares a moderate sequence identity among its members including BCL2-associated athanogene protein (BAG) and BCL2-associated X protein (BAX). The closest family member of the gene family, bcl2, has 46% overall identity to the bcl-xl cDNA (1), but shares as much as 73% identity in their BH domains. The bcl2 family is ubiquitously expressed at moderate to low levels (23). These three genes were carefully selected to represent a cross section of genes in the genome based on their span in degree of sequence similarity, expression levels, and tissue distribution. For sequence similarity, these gene families have overall sequence identity ranging from 46% to 93%. They range from low levels to high levels of expression, and are expressed in a tissue-specific fashion; for example, cyp2d has a limited tissue distribution (17,24), whereas bcl-xl and aldoA are expressed ubiquitously.
As discussed above, small window sizes generate larger numbers of probes and lengthen the run time of the program. Using smaller window sizes, sensitivity would increase and specificity would decrease. On the other hand, larger window sizes limit the number of probes, speeding up the program, and increasing specificity while curtailing sensitivity. To test this, window sizes (8–24 nucleotides) and probe sizes (16–30 nucleotides) were tested using the mouse dbEST database and cDNAs from the three members of each super-family described above.
Sensitivity was defined as TP/(TP + FN), where true positives (TP) were the total ESTs found that were derived from the gene of interest, and false negatives (FN) were the number of ESTs that the algorithm should have found. The sum (TP + FN) comprised the set of “true” entries in dbEST, which was generated by first combining the currently available lists of ESTs for each gene from UniGene, the UCSC Genome Browser, and VNB. This combined list then was separated into true positives (TPs) and false positives (FPs) by use of the BackBLAST algorithm (Fig. 4). This list of true positives was set by definition as the standard for the comparison of specificity and sensitivity. This standard set of TPs was critical for both the optimization of VNB parameters and the comparison of VNB output to those of Genome Browser at UCSC and UniGene (see below).
Specificity was defined as TN/(TN + FP), where true negatives (TN) were the number of ESTs in dbEST that were not derived from the gene of interest; in other words, the bulk of dbEST. False positives (FP) were the number of ESTs found by an algorithm that were not derived from the gene of interest [i.e., all ESTs found that were not in the standard set of TPs (TP + FN) defined using BackBLAST]. The dbEST for mouse included 4,334,000 EST sequences at the time of the study.
In general, the maximum number of ESTs specific for each query (aldoA, cyp2d-9, and bcl-xl) was obtained with the smaller window and probe sizes (Fig. 5A, C, E, respectively). As expected, this simply meant the larger number of smaller probes increased the chance of finding ESTs in the database (high sensitivity). For specificity (Fig. 5B, D, F), the number of false positives increased as the window size decreased [reflected in a smaller specificity value; TN/(TN + FP)], but only for probe sizes ≤20. In general, increasing specificity correlated with increasing probe length, while increasing sensitivity was associated with decreasing probe length. Remarkably, the smallest probe and window size that yielded the highest specificity, while retaining maximal sensitivity, was the same for all three gene families: a probe length of 20 nt and a window size of 8 nt.
The sensitivity and specificity of all “gene-specific” ESTs derived from VNB, UniGene (41) at NCBI, and the Genome Browser at UCSC (29) were compared. For the purpose of this comparison, the TP + FN set was used (defined by BackBLAST, described above). Each of the three EST collections was then evaluated for sensitivity and specificity toward a member of each of the gene families, aldo, cyp, and bcl-2 (Fig. 6). The optimal window size and probe size (8 and 20) were used for generating the ESTs from VNB. In terms of sensitivity, VNB was superior to the other methods for aldoA and bcl-xL, with virtually all ESTs identified by VNB (Fig. 6A). For example, 98.7% of all ESTs were identified for aldoA and bcl-xL using VNB compared to an average of 87.2% of the ESTs for UniGene and UCSC Genome Browser. For cyp2d-9, the sensitivity was relatively low regardless of the method, with UniGene being slightly more sensitive than the other methods. The high degree of similarity among the cyp family members was likely responsible for this low sensitivity, with no method able to identify all the ESTs determined for cyp2d-9. This was interesting and indicated that each program can identify cyp2d-9 ESTs that the others cannot.
The three collections were compared for specificity (Fig. 6B). This parameter, which reflects the false-positive rate, is often more important than sensitivity for many expression-profiling purposes where false positives can be critically misleading. Here VNB was as effective, or more effective, than either UniGene or the UCSC Genome Browser for all three gene families. For example, for cyp2d-9, the difference in specificity between results from VNB and UCSC Genome Browser corresponded to six more FP in the UCSC list; and the difference for bcl-xL was over 50 FP. In summary, VNB was the most specific program for all three genes, and it was more sensitive than UniGene or UCSC Genome Browser for all but cyp2d-9.
Qualitative and quantitative expression profiles were constructed for aldoA, actin, and gapdh using VNB with a window size of 8 and a probe size of 20. The qualitative profiles were compared to literature values. The quantitative profiles were compared to two different experimental assays.
The expression of different isoforms in specific tissues represents a valid test of qualitative expression. For example, α-actin has two isoforms; while only the skeletal form is found in skeletal muscle, both skeletal and smooth-muscle isoforms are found in the heart (25). Moreover, determination of the relative amounts of each isoform in tissues where more than one form is present represents a semiquantitative analysis. This kind of analysis was performed using VNB for mouse skeletal and smooth-muscle α-actin. The numbers of skeletal and smooth-muscle α-actin ESTs were normalized to the total number of α-actin ESTs in each tissue and compared with published experimental values (25) (Fig. 7A). The VNB-determined expression profile for actin easily reflected the experimental expression pattern. There was no smooth-muscle α-actin expression identified in skeletal muscle, while in heart both the skeletal and heart isoforms were found at the same ratios as the experimental pattern.
This analysis was expanded to the family of the aldolase isozymes. Each aldolase isozyme is selectively expressed in different tissues; aldolase A in the muscle, aldolase B in the liver, and aldolase C in the brain (46). To determine whether VNB recapitulated these experimental results, the expression profile of the mouse aldolase isozymes (A, B, and C) was determined for each isozyme in several well-characterized tissues: muscle, heart, adult brain, fetal brain, liver, and kidney. The positive ESTs for each isozyme were tallied and normalized to the total aldolase ESTs in each tissue (Fig. 7B). The VNB pattern reflected the published experimentally determined expression pattern (37) with the exception of muscle where VNB data suggest that a minor amount of aldolases B and C was expressed. Similar results were obtained for actin using data from UniGene except that false positives for smooth-muscle actin were identified in skeletal muscle and slightly higher than experimental values for smooth-muscle were identified in heart. In addition, UniGene failed to detect the experimentally determined low levels of aldolase C in heart, as well as the VNB-determined low levels in skeletal muscle and kidney (data shown in Supplementary Materials at http://www.bu.edu/aldolase/lab/software.html).
The qualitative/semi-quantitative VNB data shown in Figure 7 matched published experimental data and prompted a direct test of whether VNB could quantitatively determine gene expression (e.g., express as a percentage of all transcripts in a cell/tissue). First, the effect of excluding transcript-altered libraries was assessed. Expression profiles exclusively from unmanipulated libraries and exclusively from manipulated libraries were compared. Expression profiles for aldoA from unmanipulated libraries (defined in CGAP as “nonnormalized”) and from manipulated libraries (defined here as all libraries minus nonnormalized libraries) were compared (Fig. 8) using library information from 27 tissues extracted from the CGAP database (35). Absolute expression levels (as opposed to relative expression levels) of mouse aldoA for each tissue were calculated by VNB by dividing the number of aldoA ESTs by the total number of ESTs from libraries of each category. For most tissues, there were clear differences in aldoA expression levels derived from unmanipulated libraries versus those that had been manipulated. The expression level for aldoA was underestimated in most cases when manipulated libraries were used (Fig. 8, gray bars). This likely reflects the removal of many of the redundant aldoA cDNAs in these manipulated libraries. However, there were four “tissues” (liver, embryonic tissues, head & neck, and eye) where expression levels were comparable, and three “tissues” (pancreas, bone marrow, and thyroid) that had significantly higher (<30% difference) levels in the unmanipulated libraries. In both cases, this may be due to ESTs that were from predominantly “nonnormalized” libraries and/or from libraries in which the manipulation was not effective. In summary, the ability of VNB to use the library annotation information at CGAP and distinguish expression profiles derived from nonnormalized or normalized libraries indicated the potential for quantitative gene expression profiles from dbEST.
Most quantitative gene expression assays are not absolute (e.g., five transcripts per cell or ~0.005%), but generally are calculated relative to an assumed invariant endogenous control. Typically, ubiquitous “housekeeping” genes such as those for GAPDH or a ribosomal protein are used for reference. To compare VNB’s quantitative profiles with such experimental approaches, the ratio of aldoA to gapdh expression was calculated using only “quantitative” mouse libraries (defined in CGAP as “nonnormalized”). Experimental determination of the expression levels of mouse aldoA and gapdh were then determined using Northern blots and qPCR methods. The three methods were compared as shown in Figure 9. Of the seven tissues common to all three experimental techniques (liver, kidney, brain, spleen, testis, skeletal muscle, and lung), skeletal muscle and lung tissues could not be compared because the ESTs found were not from quantitative libraries. The ratio of aldoA/gapdh EST hits in quantitative libraries for liver, kidney, brain, spleen, and testis were 10:94, 58:92, 193: 230, 6:9, and 7:3, respectively. Notably, for both spleen and testis, quantitative EST libraries were small, containing few ESTs, so the calculated errors were relatively large. Each of the three techniques, VNB and the two experimental methods, reflected the same relative expression levels within experimental error and validated the value of both VNB (using the annotation from CGAP) and dbEST for quantitative gene expression profiling.
Characterizing transcriptomes using EST libraries is well established (6,12,43,44,54,56). Web-based EST analysis tools (pipelines), such as UniGene, CGAP, TIGR indexes, Bodymap, ECgene, Tissue-Info, ASePCR, and STACK (33–35,47,49,50,59,66), have made EST expression profiling simple and widespread. These pipelines, and others with more specialized purposes [GeneNest (26), Exquest (13), and Gene2EST (22)], depend on EST assembly programs [d2 cluster (49), CLOBB (45), Phrap (18), BLAT (30), CAP3 (28), MegaBLAST (70), etc.]. These assembly programs rely on many pair-wise comparisons that result in clusters of related ESTs. However, the number of pair-wise comparisons needed for a large database precludes regenerating such EST clusters/assemblies whenever new information is added. VNB, on the other hand, does not use clustering programs and thus can use the most recent publicly available data to generate expression profiles for each query in real time. Computing expression profiles in real time is particularly useful for profiling novel or poorly characterized genes. In addition, unlike a preindexed database, which only offers a single profile per gene, the user can generate multiple profiles of varying specificity and sensitivity for the same gene.
In addition, preindexed databases often contain inaccuracies due to the misidentification or failure to distinguish among paralogs or alternative transcripts. As much as 1.5–3% of ESTs are incorrectly incorporated into the wrong cluster (21,65) and this inaccuracy is even greater (30%) for ESTs derived from 5′-UTRs (39), leading to many false positives and false negatives in each cluster. One of the major problems causing inaccurate assembly is the high sequence error rate. Although not as significant as in early ESTs, these cDNAs remain in the archival database and they are still being generated (47). VNB minimizes inaccuracies in the produced profile by requiring exact matches between the selected probes and the ESTs (reducing false positives), while using many overlapping probes along the entire input sequence (reducing false negatives). The effectiveness of overlapping probes is exemplified by loss of sensitivity (increasing number of false negatives) when probe length decreases relative to window size (see Fig. 5).
Another advantage of VNB is the ability to use a custom alignment generated by ClustalW from a manually selected set of paralogs. An alignment generated using the RefSeq database may not include a complete set of known paralogs. Many alternatively spliced, poorly characterized, or novel mRNAs are not found in RefSeq. When many of the paralogs are not present in RefSeq, the known paralogs can be aligned manually with ClustalW. This ClustalW alignment can then be used as input to VNB to generate a better profile. The ability to input a custom alignment was critical to generating VNB-derived gene expression profiles for triose kinase, which was not available in UniGene or RefSeq at the time (20).
VNB has the advantage of balancing sensitivity and specificity by use of optimizable parameters. While the optimal window and probe size parameters were nearly the same for aldo, bel2, and cyp2 gene families, despite differences in overall sequence conservation, this may not be the case for all queries. For example, smaller window and probe sizes might be more effective for an expression profile for very rare transcripts. VNB identified more P450-specific ESTs with smaller probe and window sizes (see Fig. 5) (although the false positives increased as well). Using longer probes reduced false positives, as VNB identified no false positives with probe lengths greater than 25, although sensitivity was affected. The ability to set the window size and probe length parameters separately allows the user to change the sensitivity and specificity of the produced profile. However, the choice of parameters also affects performance. Increasing the number and length of the probes, by decreasing window size and increasing probe length, respectively, will increase the runtime.
The accurate annotation of library information at CGAP is critical to the accuracy of expression profiles generated by VNB. CGAP lists EST library information, which includes its tissue of origin (normal, cancerous, or cell line), its construction information (whether it has been “normalized”), and the number of ESTs that it contains. To generate the quantitative expression profile, VNB only uses ESTs from libraries annotated as “nonnormalized,” whereas all ESTs are used to generate the qualitative profile. CGAP correctly identified the “nonnormalized” libraries, as demonstrated by VNB’s quantitative expression profiles match experimental results (see Fig. 9). In addition, whether or not profiles are from normalized libraries clearly influences the expression profile. Much higher expression levels were seen in profiles derived from “nonnormalized” EST libraries, perhaps because many ESTs were discarded during normalization (see Fig. 8).
This ability to attain quantitative expression values brings into question how many ESTs are required to obtain reliable expression data. Profiles from VNB show accurate expression levels for tissues that have as few as 10 EST “hits.” For example, Figure 9 shows that reliable expression was obtained for both aldoA and gapdh in liver, kidney, and brain. Moreover, accurate profiles are generated even for small libraries or those with low expression levels (e.g., as few as 3–9 EST hits from spleen and testis), even though these expression levels are less statistically significant.
Other algorithms have been developed for gene expression profiling using dbEST. DigiNorthern uses BLAST and appears to perform a validation test similar to the BackBLAST routine described here, which certainly improved specificity (64). GEPIS simply uses BLAST to search the entire input sequence against dbEST (68). They validate their results by comparison with qPCR data, and only use nonmanipulated EST libraries in generating the profile. Clearly, this would have lower specificity and sensitivity than VNB. Tissuelnfo uses MegaBLAST and is focused on finding expression patterns in given tissues, rather than given genes (59). It acknowledges the problems with annotation and attempts to resolve this, but in the process much of the poorly annotated data is lost. In addition, the use of default 100 bp queries with 95% identity is the likely reason for the lower success rate this algorithm has with genes of high identity with different tissue specificities (59). Another powerful program, ExQuest does pair-wise comparison among ESTs using MegaBLAST (13). However, the queries are restricted to sequences in RefSeq and it defaults to the use of an 88% identity as a cutoff when comparing two ESTs, although this can be modified when needed. The claim that profiles from normalized libraries are no different from those from unmanipulated libraries is in contrast to the report here (see Fig 8). The GBA server is another tool that does in silico gene expression profiling (67). It identifies coexpressed genes by utilizing UniGene EST clusters to generate and statistically compare expression profiles.
All of the above-mentioned tools generate gene expression profiles using dbEST. However, most do not compare the specificity or sensitivity of the algorithm to any standards, nor do they compare the output to direct experimental measurements, although some do compare results to literature values. Moreover, most of these tools do not address issues of construction of EST libraries in generating their profiles and none have shown that quantitative profiles can result from the data in dbEST. VNB addressed most of these issues.
The algorithm (VNB) introduced here is an attractive alternative for generating gene expression profiles from dbEST. It uses a completely different approach from EST clustering used by most other pipelines. Compared to clustering programs, VNB has increased sensitivity and specificity and generates profiles in real time using the latest data in dbEST. Moreover, one unique aspect is its quantitative accuracy, which has been validated experimentally. The program has adjustable parameters, thus allowing accurate mining of EST data for both qualitative and quantitative output by optimizing the tradeoff between sensitivity and specificity. In summary, VNB generates quantitative expression profiles in real time from single-gene queries, and may be especially useful for studying novel or poorly characterized genes that may not be available in preconstructed gene indices and/or may require expert scientific input.
The authors thank Sharona Washington and Artiom Grusdev for their technical assistance and Minita Holloway and Dr. David Waxman for helpful discussions on the cyp gene family. The authors also thank Peter Haverty and Martin Frith for early assistance with the algorithm. Support is gratefully acknowledged from the National Institutes of Health (DK43521 to D.R.T., DK065089 to D.R.T., and R25-GM62463 supported K.V. and D.C. as part of a Summer Research Experiences for Undergraduates program); National Science Foundation IGERT (DGE-9870710 supported V.A.F.); BU Technology Transfer Award (to D.R.T. and V.A.F.).