Our initial implementation of SGQT, presented here, is for time series data. An entry screen defines the data subset selections available for the user to search (Fig. ). The specific projects fitting the search criteria are then presented, and selecting one project leads to a list of all profiles associated with it. In the example described here, a 54-profile, 27 time point muscle regeneration series was selected, with two different muscles profiled at each time point on U74A microarrays containing ~12 000 probe sets (2). The user is asked to select the profiles to be studied (‘select all’ is the option used here to query all 54 profiles). A web browser-style search query is then invoked, and entry of any text or probe set queries genome databases for all genes and probe sets matching the query. For example, entry of ‘myosin’ will identify myosin heavy chains, light chains, binding proteins, etc. The user then selects the desired gene from the pull-down result menu. A query of ‘myogenin’ returns only a single probe set which, when selected (‘submit’), triggers the database query tool. The tool dynamically extracts data for the myogenin probe set from the .cel files of the 54-profile data set (12 000 probe sets/profile), including signal (normalized hybridization intensity) and absent/present calls (Affymetrix MAS 5.0 determinations). The tool then aligns all data into a time series and graphs the replicates for each time point (Fig. ); it also calculates the average of the replicates, graphs the average, and draws a line through the averages for all time points (Fig. ). Finally, the tool calculates the average signal for each time point and the fold-change relative to time 0 (based upon array-normalized intensities) (Fig. ).
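The per-time-point averaging and fold-change calculation described above can be illustrated with a minimal sketch. This is not the tool's actual implementation: the function name `summarize_time_series`, the input layout and the example signal values are hypothetical, and fold-change is assumed here to be the simple ratio of the average signal at each time point to the average signal at time 0.

```python
from collections import defaultdict
from statistics import mean


def summarize_time_series(measurements, baseline_time=0):
    """Average replicate signals per time point and compute fold-change.

    `measurements` is an iterable of (time_point, signal) pairs, where
    replicates share the same time_point. Fold-change is assumed to be
    average signal divided by the average signal at `baseline_time`.
    Returns a list of (time_point, average_signal, fold_change) tuples
    ordered by time point.
    """
    by_time = defaultdict(list)
    for time_point, signal in measurements:
        by_time[time_point].append(signal)

    if baseline_time not in by_time:
        raise ValueError("no measurements at the baseline time point")

    averages = {t: mean(signals) for t, signals in sorted(by_time.items())}
    baseline = averages[baseline_time]
    if baseline == 0:
        raise ValueError("baseline average is zero; fold-change undefined")

    return [(t, avg, avg / baseline) for t, avg in averages.items()]


# Hypothetical signals: two replicates at each of three time points.
example = [(0, 100.0), (0, 300.0), (12, 400.0), (12, 800.0), (24, 150.0), (24, 250.0)]
for time_point, average, fold_change in summarize_time_series(example):
    print(time_point, average, fold_change)
```

A ratio is only one convention; some packages report decreases as negative fold-changes, and the text does not state which convention the tool uses.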
Initial database query for the time series query tool.
Figure 2 Graphic output of the time series query for myogenin in muscle regeneration. Muscle degeneration/regeneration was induced with intramuscular injection of cardiotoxin, and two different muscles profiled at the indicated time points following injection.
The resulting on-line graph has mouse-overs containing data associated with each data point (time point, signal, present/absent call), and for the arithmetic average (time point, average signal, fold-change relative to time 0) (Fig. ). The mouse-over shown in Figure is for the arithmetic average of replicates, with the pop-up window indicating the fold-change from time 0. Clicking on any data point links to a series of databases (Unigene, GenBank, LocusLink, Affymetrix) containing information on the gene of interest, and gives access to the download for the original data set (.cel, .dat or .txt files). The tool also writes a dynamically generated spreadsheet containing all the information in the graph, which appears as a link above the graph. This spreadsheet can be downloaded and analyzed using any desired graphics or statistical package. It should be emphasized that all visualizations and spreadsheets are dynamic queries of the web Oracle database. The dynamic search and output of the 54-profile murine regeneration series shown here is typically completed in approximately 15 s.
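As an illustration of the kind of downloadable spreadsheet described above, the following sketch writes the per-data-point values to comma-separated text. The column names, the function name `write_series_csv` and the example rows are hypothetical; the text does not specify the file format or layout that the tool produces.

```python
import csv
import io


def write_series_csv(rows, out):
    """Write one line per data point to the text stream `out`.

    Each row is a dict with the keys listed in `columns` (an assumed
    layout, mirroring the values shown in the graph mouse-overs).
    """
    columns = ["time_point", "replicate", "signal", "detection_call"]
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row[name] for name in columns])


# Hypothetical data points for a single probe set.
rows = [
    {"time_point": 0, "replicate": 1, "signal": 150.5, "detection_call": "A"},
    {"time_point": 0, "replicate": 2, "signal": 162.0, "detection_call": "A"},
    {"time_point": 12, "replicate": 1, "signal": 980.2, "detection_call": "P"},
]
buffer = io.StringIO()
write_series_csv(rows, buffer)
print(buffer.getvalue())
```

Plain delimited text of this kind can be opened by any spreadsheet, graphics or statistical package, which is the property the text emphasizes.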
The five time series currently implemented for the tool are a murine in vivo 27 time point muscle regeneration series (54 U74A profiles) (2), an 8 time point murine lung calorie restriction time series (18 U74A profiles) (4) (D.Massaro and L.B.Clerch, unpublished data), a 5 time point rat spinal cord damage series (18 U34A profiles) (5), and two 17 time point methylprednisone bolus time series in rat (47 profiles in liver and 51 profiles in muscle) (6). It is important to note that many experimental variables, such as diurnal variation in gene expression, should be considered when interpreting time series data; for example, in the Massaro and Clerch calorie restriction studies, non-restricted and calorie-restricted mice were killed at the same time. We will continue to add time series to the tool, and plan to add a collection of time series and non-time series data comparisons and visualizations to the PEPR resource.
To our knowledge, the time series query tool described here is the first expression profile data analysis tool that requires no prior knowledge of microarray data format or data interpretation. The tool is useful because of the quality control and replicates available for each time point, and because these data can be simply visualized, interpreted and downloaded. Future implementations of our data warehouse will allow input of externally generated data that conform to minimum experimental design criteria and our QC/SOP benchmarks (see http://microarray.cnmcresearch.org/pgaoutline-qcofsamples.asp) via a web interface with automated QC/SOP checks. As PEPR is built upon a standardized platform of Affymetrix-only data adhering to QC/SOP, all internally and externally generated data within PEPR should be intrinsically comparable. A new implementation of PEPR, including many projects that can be queried by the SGQT tool described here, is expected in late 2003. The updated PEPR will also include a choice of probe set algorithm for data display (MAS 5.0, dCHIP, RMA and ProbeProfiler).