Gene expression analysis, pathway profiling, gene regulatory networks, and modeling of biological processes are key for “post-genome project” studies. Various high-throughput methods of expression profiling are commonly employed, such as microarrays, serial analysis of gene expression (SAGE), and quantitative reverse transcription PCR (qPCR), some of which are more costly and labor intensive than others (4). Newer expression profiling technologies include genome-scale in situ hybridization databases (38) (e.g., www.eurexpress.org) and fully sequenced expressed sequence tag (EST) libraries generated with massively parallel DNA sequencing technologies (63). Moreover, a vast array of primary experimental data exists in the public domain in the form of microarray data, SAGE data, and the expressed sequence tags database (dbEST), all of which can be freely used by investigators for gene expression profiling. Many public microarray databases now provide tools to survey the expression of individual genes in normal and diseased tissues. These include, but are not limited to, the Stanford Microarray Database (SMD), Gene Expression Omnibus (GEO), Oncomine, Genesapiens, and Gene Expression Atlas (8, 16, 32, 53, 61). A new tool, called the Virtual Northern Blot (VNB), is described herein that maximizes the usefulness of dbEST as a resource for effective gene expression profiling in real time, a capability not available in any of these other tools and databases.
ESTs are single-pass sequenced cDNAs representing expressed genes from a specific cell population or tissue (2). They are on average 200–700 nucleotides (nt) long and are derived from partial sequencing of randomly primed or oligo-dT primed cDNA clones from libraries of different tissues. Some libraries have been manipulated (a process sometimes called normalization) so that rare transcripts are more highly represented, while other libraries have not been manipulated, and thus the proportion of a particular cDNA clone in the library should accurately reflect the proportion of the corresponding mRNA in that tissue.
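The proportionality argument above can be illustrated with a minimal sketch, which is not the VNB algorithm itself. It assumes that ESTs have already been assigned to the gene of interest and counted per library, and it pools only non-normalized libraries by tissue. The function name, the library identifiers, and all counts are hypothetical and chosen for illustration only.

```python
from collections import Counter


def ests_per_million(gene_hits, library_sizes, library_tissue,
                     normalized_libraries=frozenset()):
    """Estimate relative expression per tissue as gene ESTs per million ESTs.

    gene_hits:            {library_id: number of ESTs assigned to the gene}
    library_sizes:        {library_id: total number of ESTs in the library}
    library_tissue:       {library_id: tissue name}
    normalized_libraries: library ids to skip, because normalization
                          distorts clone proportions (hypothetical input)
    """
    hits = Counter()
    totals = Counter()
    for lib, size in library_sizes.items():
        if lib in normalized_libraries:
            continue  # proportions in these libraries are not quantitative
        tissue = library_tissue[lib]
        totals[tissue] += size
        hits[tissue] += gene_hits.get(lib, 0)
    # Scale the pooled proportion to ESTs per million for each tissue.
    return {t: 1e6 * hits[t] / totals[t] for t in totals if totals[t] > 0}


# Hypothetical example data (not taken from dbEST).
library_sizes = {"liver_1": 20000, "liver_2": 30000,
                 "brain_1": 50000, "brain_norm": 40000}
library_tissue = {"liver_1": "liver", "liver_2": "liver",
                  "brain_1": "brain", "brain_norm": "brain"}
gene_hits = {"liver_1": 10, "liver_2": 15, "brain_1": 5, "brain_norm": 40}
normalized = {"brain_norm"}

profile = ests_per_million(gene_hits, library_sizes, library_tissue, normalized)
print(profile)  # {'liver': 500.0, 'brain': 100.0}
```

In this toy example the normalized brain library contains many ESTs for the gene, but it is excluded because its clone proportions no longer reflect the mRNA population. The remaining libraries give a fivefold higher estimate in liver than in brain.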
The dbEST is a public domain archival database of cDNA sequence files (10). Since its inception in 1994, dbEST has grown exponentially and this growth is expected to continue. Although it is a powerful resource for sequence analysis, and especially for the identification of novel genes (42), the utility and validity of dbEST for quantitative expression profiling have been criticized. Such criticism stemmed from early high error rates in sequence determination (>3%), poor annotation, partial sequence reads, and large-scale contamination (3, 21, 39, 52). Despite these issues, numerous EST mining algorithms (31, 42) have successfully taken advantage of this tremendous resource (>61 million sequences by May 2009). In addition, methods for systematic validation (60) have shown that some of the early concerns have become less problematic as older ESTs have been diluted with higher quality data and better annotation. Expression profiling using dbEST is a common method for exploring the transcriptome (11, 51), characterizing novel gene expression (7), and identifying novel pathways in tissues (20). The easy availability of these data has fostered continued improvement and innovation (34, 43, 69), which underscores the value of this resource.
Gleaning reliable expression information from the archival dbEST database begins with proper identification of ESTs derived from the gene(s) of interest, often by sequence alignment. Common sequence alignment tools (MegaBLAST, BLAT, d2, CAP3, PHRAP) (14, 18, 28, 30, 65, 70) have been used to cluster ESTs and then assign each cluster to a gene, thus building a gene-indexed database from which expression profiles can be gleaned. These processed data are made available through web-based tools, or pipelines. Two of the most frequently accessed EST analysis pipelines for displaying gene-associated EST information are the Genome Browser at UCSC (29), which uses BLAT, and UniGene (41) at NCBI, which uses MegaBLAST. These pipelines are easily accessed, and UniGene is among the most commonly used sources for retrieving gene expression data. However, while these tools have enabled the widespread use of EST data, the assembly of these databases is prone to errors arising from high sequence error rates, alternative splicing, and lack of genome coverage (11, 51). These issues are especially critical for novel genes and for genes with very high sequence similarity. In addition, compiling expression profiles from a gene cluster may prove quantitatively inaccurate because of the various cDNA library construction methods employed (27). Furthermore, such processed data are by nature not current, and pipelines for generating gene expression profiles in real time are not readily available. For investigators who want precise, sensitive, and up-to-date gene expression data for a single gene or gene family, few tools are available for accessing dbEST. In addition, the use of any of these tools for quantitative analysis has not been clearly demonstrated. VNB was specifically designed to address these needs.
VNB is an application, available via a web interface (http://tlab.bu.edu/vnb.html), that can generate accurate quantitative and qualitative expression patterns for any human or mouse gene. The algorithm is analogous to a classical Northern blot, and the program is optimized for single-gene queries on difficult genes (e.g., genes with high sequence identity among paralogs, or novel and poorly characterized genes). Validation of VNB, using gene families of varying sequence similarity, function, and expression profiles, demonstrates that this tool is more sensitive and specific than commonly employed algorithms. More importantly, quantitative gene expression information derived using VNB is validated by Northern blots and qPCR.