In recent years, microarrays have become indispensable tools in molecular biology because of their ability to quantitatively measure the expression of thousands of genes at the same time (Brown and Botstein, 1999). Their prominence and utility have grown with the increased computational power available to researchers and with the development of the field of bioinformatics. The analysis of gene expression data from microarray experiments is a dynamic field; diversity in experimental design and in statistical methods has produced a wide range of computational algorithms for making sense of extremely high-dimensional data. As microarray technology has become more common and cost-effective (Bryant et al., 2004), experiments have grown larger: sampling is done at higher frequencies and over longer periods of time, and researchers are including more repeated measurements to enhance the reliability of their results. These trends call for innovations in the computational domain so that the added information from more thorough experiments can be fully exploited. To meet this challenge, existing algorithms need to be modified and new approaches must be developed.
The primary goal in the analysis of gene expression data is to separate biologically relevant signals from the biological and experimental noise inherent in microarray experiments. Multiple biological replicates are necessary to produce reproducible, statistically significant results (Churchill, 2002). Although averaging multiple measurements greatly improves accuracy relative to using a single measurement (Lee et al., 2000), all information about the variance among the replicates is lost in the averaging. With this in mind, many methods have been proposed for analyzing gene expression data, typically by assigning each gene a score and setting a cutoff at an acceptable error rate (Androulakis et al., 2007; Storey and Tibshirani, 2003; Tusher, 2001). However, these methods do not account for the fact that genes do not act as independent features; rather, their behaviors are often highly correlated (Storey et al., 2007; Wolfe et al., 2005). Incorporating this knowledge into the analysis may lead to more biologically relevant insights, which is why clustering methods are commonly applied to gene expression data.
Traditional clustering methods, such as k-means or hierarchical clustering (Eisen et al., 1998), can be used on gene expression data. For time course data, however, they are not ideal because they ignore the sequential nature of the data collection (Ernst et al., 2005). For this reason, there is interest in methods that search for temporal patterns in the data. This type of analysis is particularly well suited to time course gene expression data because it searches for groups of genes with similar dynamics over time, which are likely to be biologically relevant.
Analysis at the level of expression patterns rather than individual genes can be accomplished by assigning genes to a large yet finite number of categories according to each gene's trajectory; this process is called discretization, and the result is a symbolic representation of each gene. A symbolic representation is desirable because it allows the discovery of patterns in the data (genes with the same or similar symbolic representations) instead of limiting the analysis to differential expression in individual genes. Transforming each point in the time series into a discrete symbol also makes the statistical analysis more straightforward. Other advantages of symbolic representations include noise reduction and computational efficiency.
The discretization of time series data has been thoroughly discussed in the literature, with applications in virtually all fields of science and engineering (Daw et al., 2003). The procedure generally consists of setting a number of cutoffs and assigning a different symbol to the values falling in each partition. The symbols can then be ordered in time, yielding a sequence of symbols. Alternatively, one symbol can represent more than one time point, which further discretizes the data in time.
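The cutoff-based procedure described above can be sketched in a few lines of Python. This is only an illustration: the function names `discretize` and `compress`, the cutoff values, and the example log-ratios are assumptions made for the sketch and do not come from any of the cited methods.

```python
from bisect import bisect_right


def discretize(series, cutoffs, symbols):
    """Map each value to the symbol of the partition it falls in.

    `cutoffs` must be sorted ascending; k cutoffs define k + 1 partitions.
    """
    if len(symbols) != len(cutoffs) + 1:
        raise ValueError("need exactly one more symbol than cutoffs")
    if list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be sorted in ascending order")
    # bisect_right returns the index of the partition containing the value.
    return "".join(symbols[bisect_right(cutoffs, v)] for v in series)


def compress(series, window):
    """Average non-overlapping windows so one symbol can cover several time points."""
    chunks = [series[i:i + window] for i in range(0, len(series), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


# Hypothetical log-ratio time series for one gene (illustrative values only).
log_ratios = [-1.2, -0.3, 0.1, 0.8, 1.5, 0.4]
cutoffs = [-0.5, 0.5]  # assumed cutoffs, giving three partitions

word = discretize(log_ratios, cutoffs, "abc")                # one symbol per time point
short = discretize(compress(log_ratios, 2), cutoffs, "abc")  # one symbol per two time points
print(word, short)
```

With these illustrative values the six time points become the word `abbccb`, and the temporally compressed version becomes `abc`.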
A popular example of a symbolic representation is the Symbolic Aggregate approXimation (SAX) (Lin et al., 2007). SAX has been applied to gene expression data through SLINGSHOTS (Yang et al., 2007), which selects informative motifs from gene expression data based on the symbolic representation. However, because of the preprocessing steps required before the symbolic transformation, SLINGSHOTS does not take into account the magnitude of change in gene expression or the variance across multiple replicates. These limitations are important because gene expression data is notoriously noisy, particularly at low expression values.
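To make the magnitude limitation concrete, the following is a minimal sketch of a SAX-style transform as it is commonly described: z-normalize the series, average it over equal-width segments, and map each segment mean to a symbol using equiprobable breakpoints of the standard normal distribution. It assumes that z-normalization is the preprocessing step in question, and it is not the SLINGSHOTS implementation; the function name `sax` and the example series are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abc"):
    """SAX-style word: z-normalize, piecewise-aggregate, then symbolize."""
    if len(series) % n_segments:
        raise ValueError("this sketch assumes the length divides evenly into segments")
    mu, sigma = mean(series), pstdev(series)
    # A flat series has no shape; map it to zeros instead of dividing by zero.
    z = [0.0] * len(series) if sigma == 0 else [(v - mu) / sigma for v in series]
    width = len(series) // n_segments
    segment_means = [mean(z[i:i + width]) for i in range(0, len(z), width)]
    k = len(alphabet)
    # Breakpoints that split the standard normal into k equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    return "".join(alphabet[bisect_right(breakpoints, m)] for m in segment_means)


# Two hypothetical profiles with the same shape but a tenfold difference in magnitude.
small = [0.0, 0.1, 0.2, 0.3, 0.2, 0.1, 0.0, 0.0]
large = [10 * v for v in small]

print(sax(small, 4), sax(large, 4))
```

Both profiles map to the same word, because z-normalization discards the scale of the change. A single averaged profile is also the only input, so replicate variance never enters the representation.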
When studying time course gene expression data, it is natural to desire a symbolic representation that has only three possible symbols, reflecting the most intuitive possible responses of genes to stimuli: upregulation, downregulation, and no regulation. Unlike SLINGSHOTS, Trajectory Clustering (Phang et al., 2003) transforms time series microarray data into a symbolic representation that takes into account multiple replicates and the magnitude of gene expression changes. However, it can only be applied to short time series (five or fewer time points) before the number of clusters grows exponentially and becomes unmanageable.
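The sketch below illustrates what a three-symbol representation that uses replicates and magnitude might look like, and why the number of possible patterns grows so quickly. It is not the published Trajectory Clustering algorithm: the Welch-style statistic, the thresholds `min_change` and `t_cutoff`, the one-symbol-per-transition assumption, and all data values are hypothetical choices made only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def step_symbol(prev_reps, next_reps, min_change=0.5, t_cutoff=2.0):
    """Return 'U', 'D', or 'N' for one transition between two time points.

    Each argument holds the replicate measurements (at least two) at one time point.
    """
    diff = mean(next_reps) - mean(prev_reps)
    # Standard error of the difference in means (Welch-style, unequal variances).
    se = sqrt(variance(prev_reps) / len(prev_reps) + variance(next_reps) / len(next_reps))
    if abs(diff) < min_change:
        return "N"  # change is too small in magnitude
    if se > 0 and abs(diff) / se < t_cutoff:
        return "N"  # change is not distinguishable from replicate noise
    return "U" if diff > 0 else "D"


def trajectory(replicates_by_time, **thresholds):
    """Concatenate one symbol per pair of consecutive time points."""
    pairs = zip(replicates_by_time, replicates_by_time[1:])
    return "".join(step_symbol(a, b, **thresholds) for a, b in pairs)


def n_patterns(n_timepoints):
    """Number of possible three-symbol words, assuming one symbol per transition."""
    return 3 ** (n_timepoints - 1)


# Hypothetical replicate measurements: one inner list per time point.
gene = [[0.0, 0.1, -0.1], [1.0, 1.2, 0.9], [1.1, 0.2, 2.0], [-0.5, -0.4, -0.6]]
noisy = [[0.0, 1.5, -1.5], [1.0, 2.6, -0.5]]  # similar mean shift, much larger spread

print(trajectory(gene), trajectory(noisy))
print([n_patterns(t) for t in (5, 11, 18)])
```

Under these assumptions the first gene is labelled `UND`, while the noisy gene is labelled `N` even though its mean shift is similar, which is the kind of distinction that averaging replicates would hide. The pattern count rises from 81 for five time points to 59,049 for 11 and 129,140,163 for 18, which shows why an exhaustive set of clusters quickly becomes unmanageable for longer series.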
Symbolic discretization is similar to the idea behind the Short Time-series Expression Miner (STEM) (Ernst and Bar-Joseph, 2006), which groups genes into a predefined set of clusters and selects the informative clusters based on their p-values. However, STEM was designed to function only on short time series (approximately eight or fewer time points), whereas all of the datasets considered in this article contain between 11 and 18 time points.
Bayesian approaches have also been proposed for significance testing in gene expression data (Angelini et al., 2007). In the context of clustering, Bayesian clustering of curves approaches gene expression data with the goal of searching for patterns in the data rather than for individual genes, similar to symbolic methods. Heard et al. (2006) apply this technique to gene expression data to find underlying patterns. However, other than discarding large clusters that do not vary greatly, the authors do not extend their method to determine quantitatively which clusters are most significant, which is important when studying biological data. Furthermore, the discrete nature of symbolic representations is appealing because of the high levels of noise inherent in gene expression data (Androulakis et al., 2007).
To overcome these issues with existing methods, this article proposes a new symbolic representation that takes advantage of all of the available experimental data, rather than averaging measurements together; this symbolic representation is presented in conjunction with a procedure to identify statistically significant patterns in the data. This new symbolic representation is simple, intuitive, and effective. The method's ability to discover biologically relevant signals is illustrated by running it on three different datasets, all of which are from time course experiments measuring gene expression in the rat liver. Two datasets concerning corticosteroid treatment are considered: one follows the response to an acute dosage (Jin et al., 2003), whereas the other contains the response to a constant drug infusion (Almon et al., 2007). The third dataset explores the normal circadian rhythm in rats that are exposed only to regular light/dark cycling (Almon et al., 2008).