Microarray time series gene expression experiments are widely used to study a range of biological processes such as the cell cycle [1], development [2], and immune response [3]. Based on an analysis of the Gene Expression Omnibus [4], approximately a third of all microarray studies involve time series experiments with three or more time points, and of these time series experiments over 80% contain no more than eight time points (Figure ). In many cases experimental costs prevent the collection of data from more time points. In some studies, particularly clinical studies, the availability of biological material can limit the number of time points collected. Thus, even if the price of microarray experiments were to go down, short time series expression experiments would remain prevalent.
In this paper we introduce the Short Time-series Expression Miner (STEM), the first software application designed specifically for the analysis of short time series gene expression datasets (3–8 time points). Data from short time series gene expression experiments pose unique challenges. In these experiments thousands of genes are profiled simultaneously while only a few time points are sampled. In such cases many genes will have the same expression pattern just by random chance. Furthermore, as with any time series experiment, there are usually few, if any, full time series repeats from which to gain statistical power. STEM uses a method of analysis that takes advantage of the large number of genes and the small number of time points to identify statistically significant temporal expression profiles and the genes associated with these profiles [5]. STEM also supports Gene Ontology (GO) [6] enrichment analyses for sets of genes having the same temporal expression pattern, providing the means for an efficient and statistically rigorous biological interpretation of significant temporal expression patterns. The integration of STEM with GO is bidirectional: STEM can also determine and visualize the behavior of genes belonging to a given GO category, identifying which temporal expression profiles are enriched for genes in that category. Finally, STEM supports the comparison of temporal responses of genes across experimental conditions.
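The observation that many genes share an expression pattern purely by chance can be illustrated with a small simulation. The sketch below is not STEM's method; it assumes independent Gaussian noise for every gene and summarizes each series only by the direction of change between successive time points. The names `direction_pattern` and `simulate`, and the parameter values, are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def direction_pattern(series):
    """Summarize a series as the sign (+1, 0, -1) of each successive change."""
    return tuple((b > a) - (b < a) for a, b in zip(series, series[1:]))


def simulate(num_genes=5000, num_points=5, seed=0):
    """Count direction patterns among genes whose values are pure noise."""
    rng = random.Random(seed)
    patterns = Counter()
    for _ in range(num_genes):
        # Assumption: no real signal, only independent Gaussian noise.
        series = [rng.gauss(0.0, 1.0) for _ in range(num_points)]
        patterns[direction_pattern(series)] += 1
    return patterns


patterns = simulate()
print(len(patterns), "distinct patterns among", sum(patterns.values()), "genes")
print("largest groups:", patterns.most_common(3))
```

With five time points there are only four successive changes, so thousands of noise-only genes are forced into a small number of direction patterns, and every pattern attracts a large group of genes. A clustering method that does not test for significance would report such groups as if they were meaningful.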
The novel clustering algorithm that STEM implements for short time series expression data is briefly reviewed in the Implementation section. For a detailed discussion of the clustering algorithm, including experimental results on simulated data and a comparison with the k-means clustering algorithm on real biological data using GO, we refer the reader to [5]. The main focus of this paper is on STEM's integration with GO, its support for comparing data sets across experimental conditions, its visualization capabilities, and a comparison with related software.
To date, researchers analyzing short time series expression data have relied mainly on two types of software. The first is general gene expression analysis software implementing methods that do not take advantage of the sequential information in time series data. The second is gene expression time series analysis software implementing methods primarily designed for longer time series. General methods for gene expression analysis that are frequently applied to time series expression data include popular clustering methods such as hierarchical clustering [7], k-means clustering [8], and self-organizing maps [9]. These standard clustering methods ignore the temporal dependency among successive time points. Specifically, if we were to randomly permute the order of time points, the results of these methods would not change. Two software packages for clustering time series gene expression data that do take advantage of the temporal dependency of time points are the Graphical Query Language (GQL) [10] and the Cluster Analysis of Gene Expression Dynamics (CAGED) [11]. GQL implements a clustering algorithm based on a mixture of hidden Markov models. CAGED implements a clustering algorithm based on autoregressive equations. Unlike STEM, these methods generally require the estimation of many parameters and are thus less appropriate for short time series data. Also unlike STEM, both standard clustering methods and previously suggested temporal analysis methods do not differentiate between real and random patterns. This is a particular problem for short time series expression data since, as mentioned above, many genes may have the same expression pattern by random chance. A detailed comparison of STEM with the software implementing methods of analysis primarily designed for longer time series appears in the Discussion section of this paper.
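The claim that standard clustering methods are unaffected by permuting the order of time points can be checked numerically for the Euclidean distance, on which such methods are commonly based. The following is a minimal sketch on randomly generated profiles, not an excerpt from any of the packages discussed; the helper names `euclidean`, `pairwise_distances`, and `permute_columns` are hypothetical.

```python
import random


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def pairwise_distances(profiles):
    """Full matrix of distances between all pairs of profiles."""
    n = len(profiles)
    return [[euclidean(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


def permute_columns(profiles, order):
    """Apply the same reordering of time points to every profile."""
    return [[row[k] for k in order] for row in profiles]


rng = random.Random(1)
# Assumption: 20 synthetic genes measured at 6 time points.
profiles = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(20)]
order = list(range(6))
rng.shuffle(order)

original = pairwise_distances(profiles)
shuffled = pairwise_distances(permute_columns(profiles, order))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(original, shuffled)
               for a, b in zip(row_a, row_b))
print("largest change in any pairwise distance:", max_diff)
```

Because every pairwise distance is preserved up to floating-point rounding, any procedure that depends on the data only through these distances produces the same grouping before and after the time points are shuffled, which is why it cannot exploit temporal order.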
STEM is freely available for download at [12] for non-commercial research purposes. A comprehensive and detailed manual is also available at [12] and as Additional file 1 to this paper.