DNA microarrays have become an increasingly common tool for massively parallel measurements of gene expression (1,2), DNA copy number (3,4), protein–DNA interaction (5,6), and other genomic investigations (7,8). A single experiment may involve dozens of microarrays, each containing tens of thousands of spots for which dozens of measurements are recorded, resulting in millions of pieces of information. The point is quickly reached at which a full-featured database is necessary to efficiently deal with the information produced in microarray experimentation.
The Stanford Microarray Database (SMD; http://genome-www.stanford.edu/microarray/), a research database for Stanford-affiliated and other registered users, makes freely available the data for over 3500 two-colour, spotted DNA microarrays. The number of publicly accessible arrays is increasing by about 1000 per year. These public data include experiments on twelve distinct organisms, including Homo sapiens, Caenorhabditis elegans, Arabidopsis thaliana, Saccharomyces cerevisiae, Drosophila melanogaster and Escherichia coli. SMD provides online tools for browsing and selecting experimental data, assessing data quality, filtering by individual spot characteristics and by expression pattern and analyzing data via hierarchical clustering or self-organizing maps, as well as extensive help and tutorials on how to use these tools (http://genome-www.stanford.edu/microarray/helpindex.html). SMD's software is open source and has been installed at a dozen institutions worldwide, and its schema and table definitions are browsable on the web (http://genome-www.stanford.edu/microarray/doc/db_specifications.html). Here, we discuss some of our newer tools for data access and quality assessment, which, like most of our tools, are available for use with our public data.
Data selection by array: publications
Data in SMD are always made available to the public upon publication in a peer-reviewed forum, or earlier at the discretion of the experimenter, because public access to all published data is crucial for re-analysis and further exploration. SMD currently makes available more public microarray data than any other microarray database or repository in the world.
All data contained within SMD for a published experiment set are available through the publication interface (http://genome-www.stanford.edu/microarray/publication.html). For each publication, links are provided to the full text journal article (when available), to the NCBI PubMed record, to any supplemental website and to the data within SMD. Data from individual microarrays from the publication are organized into appropriate experimental sets, and the raw data are fully available for either download via FTP or online analysis using all of SMD's tools. This facilitates both re-examination for quality assessment and independent analysis of the data.
Data selection by gene: expression history
Users may survey the behaviour of a gene or clone of interest across all microarrays in which it appears using the Expression History tool (Fig. ). This facilitates a ‘gene-centric’ rather than ‘array-centric’ approach to microarray results. Users from the general research community may use this tool, but the results displayed are restricted to the arrays whose data have been made public. To enter the Expression History tool, a user can first use the ‘Name Details’ search, which allows wild-card searches for gene name, clone ID and other systematic identifiers or descriptions. The resulting list of matches provides links to Expression History, which is an interactive, graphical display showing the frequency distribution of measured ratio values for a single gene across all arrays to which the user has access. Microarrays may be selected across user-determined ranges of ratio values by clicking on the image; those experiments may then be examined in detail, downloaded, or clustered. The full data set used to generate the distribution may also be downloaded.
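The kind of summary that Expression History displays can be illustrated with a short sketch. This is not SMD's code: the function names, the binning on a log2 scale, the bin width and the example values are all assumptions made for illustration. It bins one gene's ratio across arrays into a frequency distribution and then picks out the arrays whose ratio falls within a chosen range, much as a user would by clicking on the image.

```python
from collections import Counter
from math import floor, log2


def ratio_histogram(ratios_by_array, bin_width=0.5):
    """Bin one gene's ratios across arrays on a log2 scale.

    ratios_by_array maps a (hypothetical) array ID to the measured ratio,
    or None when the spot was missing or flagged. Returns a dict mapping
    the lower edge of each log2 bin to the number of arrays in that bin.
    """
    counts = Counter()
    for ratio in ratios_by_array.values():
        if ratio is None or ratio <= 0:
            continue  # a log ratio is undefined for missing or non-positive values
        lower_edge = floor(log2(ratio) / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


def arrays_in_range(ratios_by_array, low, high):
    """Return the IDs of arrays whose ratio lies in [low, high]."""
    return sorted(
        array_id
        for array_id, ratio in ratios_by_array.items()
        if ratio is not None and low <= ratio <= high
    )


# Illustrative values only; these are not measurements from SMD.
example = {"A1": 1.0, "A2": 2.0, "A3": 4.0, "A4": 0.5, "A5": None, "A6": 2.1}
histogram = ratio_histogram(example)
induced = arrays_in_range(example, 2.0, 4.0)
```

Binning on a log scale treats two-fold induction and two-fold repression symmetrically about zero, which is why it is assumed here; a linear scale would compress all repressed values into the interval between 0 and 1.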
Data visualization: array color
Microarray data are often difficult to assess for quality. To complement various quality assessment tools included in microarray image analysis suites such as GenePix (Axon Instruments, Union City, California), SMD provides a variety of utilities. The Array Color tool provides a simplified view of the ratio data for a given microarray, allowing the user to quickly examine the microarray for evidence of global effects such as systematic biases corresponding to spatial location or printing plate of origin (Fig. ). This tool can be used to assess the quality of arrays that SMD researchers have made public.
In addition to the graphical display, the Array Color tool provides two simple one-way analysis of variance (ANOVA) calculations (9), measuring the dependence of per-spot ratio on printing plate of spot origin, and on sector (printing pin) as a proxy for spatial location. ANOVA is a common statistical method for assessing the significance and strength of the relationships of results to categorical variables. Arrayed DNA is typically printed by a rectangular array of pins, which make sequential ‘dips’ into a series of 384-well microtiter plates; each sector, printed by an individual pin, contains spots from all plates. Given a typical, randomized layout of spots, no relationship would ordinarily be expected between ratio values and spatial location or printing plate of origin. Dependence upon printing plate may indicate problems with the PCR or other DNA-generation steps employed. Dependence upon sector usually indicates problems with the hybridization process, such as evaporation of the hybridization solution.
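As a minimal sketch of the one-way ANOVA described above, the following computes the F statistic for per-spot values grouped by a categorical label such as sector or printing plate. It is a textbook calculation, not SMD's implementation; the function name and the example data are hypothetical, and whether the ratios are log-transformed first is left to the caller.

```python
from collections import defaultdict


def one_way_anova_f(values, labels):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    values: per-spot measurements (for example, log ratios).
    labels: the categorical group of each spot (sector or printing plate).
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n = sum(len(members) for members in groups.values())
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more spots than groups")

    grand_mean = sum(sum(members) for members in groups.values()) / n
    ss_between = 0.0  # variation of the group means around the grand mean
    ss_within = 0.0   # variation of spots around their own group mean
    for members in groups.values():
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)

    df_between, df_within = k - 1, n - k
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative values only: nine spots printed by three hypothetical pins.
log_ratios = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
sectors = ["pin1"] * 3 + ["pin2"] * 3 + ["pin3"] * 3
f_stat, df_between, df_within = one_way_anova_f(log_ratios, sectors)
```

A large F relative to the F distribution with the returned degrees of freedom suggests that the ratio depends on the grouping. The sketch stops at the F statistic because the Python standard library has no F distribution; a statistics package such as SciPy (`scipy.stats.f_oneway`) also returns the corresponding p-value.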