Global gene expression analysis using microarrays has become an essential tool to study genome-wide transcriptional responses. Although this high-throughput technology produces a huge volume of useful data, enabling researchers to study the response of thousands of genes simultaneously, it faces many potential sources of uncertainties (e.g. technical noise, experimental treatments, biological sampling) [
1,
2]. As such, a number of statistical methods have demonstrated that the information contained in replicates is a valuable asset in order to assign proper confidence levels [
3-
6]. Rocke et al. [
7] proposed a model accounting for measurement error to model gene expression profiles which has been used often in conjunction with variance-stabilizing transformation [
8-
11] and model-based clustering [
12,
13]. Consequently, researchers are designing more experiments with repeated measurements per gene per chip even though it is significantly more costly and time consuming. However, properly incorporating the replicate information remains a challenge.
A typical step in analyzing gene microarray data involves filtering for differential expression [
14]. A number of methods have been proposed in this direction demonstrating the extensive insight gained in utilizing the information from replicates for determining the change of gene expression values e.g. t-test [
15-
17], ANOVA [
18,
19], SAM [
20], EDGE [
21]. An equally important part of the analysis is clustering which has been proven to be a powerful tool to rationalize transcriptional responses, identify possible functional relationships among them, and further elucidate important transcription factors as well as relevant biological pathways [
13]. However, most clustering methods do not take into account the variability of gene expression profiles in the form of replicates. Variability is usually lumped into a mean effect and expression profiles are clustered based on average values of independently repeated measurements for each gene, thus missing, potentially, useful information [
12].
Given that replicates can provide important insights into the nature of inherent variability among gene expression profiles [
3], recent approaches have attempted to incorporate repeated measurements. There are two primary ways to handle replicated data: (i) indirectly integrate the error information among replicates into a pairwise similarity metric between two expression profiles to produce a more robust distance metric, and (ii) directly integrate the replicate information into clustering models. The former offers a relative advantage since clustering methods that take the distance metric as input can be utilized without any modification e.g. standard deviation-weighted correlation coefficient [
22], shrinkage correlation coefficient [
23]. Meanwhile, various models have been proposed for (ii) including those whose design centers around a specific statistical model (e.g. Bayesian mixture model [
24,
25], linear mixed model [
12], random-effects model [
13]) and those that are more general (e.g. CORE [
26], trajectory clustering [
27], mass distributed clustering [
28]). Although such approaches produce more promising results, they are limited in that only a small number of computational methods can explore this information while many others requiring expression profiles as the input cannot.
In the present study we address a somewhat different question, namely whether we can integrate the error information into the time-series expression profiles so that we can utilize a variety of computational models [
29-
31] that take the expression profiles as the required input without any modification while taking into account the advantage of using replicated data (especially for clustering methods e.g. mclust [
32], som [
33], micro-clustering [
34], consensus clustering [
35], etc.).
The most straightforward approach to estimate time-series gene expression profiles is by computing the average expression levels over all replicates for each gene at each time-point (or condition in general). Of course, this approach does not properly take into account the variability in repeated measurements [
23,
36]. Therefore, in an attempt to estimate more robust expression profiles that integrate the error information from replicates, so-called 'true' expression profiles, we explore the error model [
22] to estimate the 'true' mean expression value of a gene across all time-points and the concept of 'relative difference' driven by the theory of t-statistic [
16,
20] to compute the difference between the 'true' mean expression value across all time-points and the mean expression value at each time-point. Those relative differences are then used to infer the 'true' expression profile of the gene. Alternatively, we also explore the capability of using spline to find 'smoothing' expression profiles that take into account all repeated measurements [
37].
We next demonstrate the effects of using the 'true' expression profiles in conjunction with clustering algorithms through synthetic and real time-series expression data. Following the convention of previous studies [
23,
36], we generated synthetic microarray data with known structure of classes and used the adjusted Rand index [
36,
38] to evaluate the performance of clustering via three popular clustering methods: hierarchical clustering, partitional clustering (kmeans [
39], pam [
40]) and model-based clustering (mclust [
32]). Finally, we extend our earlier work that proposed a micro-clustering approach designed specifically for clustering time-series expression data [
34,
41]. The approach involves two main steps: (1) a fine grained clustering step to identify an extensive list of putative clusters based on a symbolic transformation, and (2) a selection step aiming at the determination of which clusters are significant as representative of the underlying response. Additionally, we also propose a heuristic to automatically select the parameter values for the clustering method. For the fine grained clustering step, the basic formalism of
Symbolic
Aggregate appro
Ximation of time-series (SAX) has been adopted and modified [
42,
43]. Each 'true' expression profile is transformed to a corresponding sequence of symbols and then hashed to a particular motif value. As a result, all expression profiles with similar expression patterns will have identical symbolic representations and thus will be assigned to the same cluster.
However, the fine grained clustering step produces a large number of putative clusters while many of them are not significant enough to be considered as a representative expression pattern. Therefore, we propose a selection step based on the hypothesis that significant expression patterns will more likely consist of a large number of individuals compared to random data, given a threshold (p-value). As a result, only those clusters with a large-enough (based on the corresponding p-value) sizes are reported for subsequent investigations. Furthermore, due to the symbolic transformation heuristic of SAX the approach may produce several clusters with similar expression patterns and thus we also provide a heuristic to merge such clusters based on a criterion of maximizing the total homogeneity and separation of selected clusters.
Our results on synthetic data demonstrate that the clustering performance using 'true' expression profiles is superior to that when using average expression profiles and also to other methods with integrated error information. The output of this process can be used as input to a variety of other clustering methods without any modification while taking into account the information content in replicated data. Finally, we derive 'true' profiles for three real (rat liver) time-series datasets (acute/chronic corticosteroid administration [
30,
44] and circadian [
45]) and the explore the extended version of micro-clustering to select significant patterns of transcriptional response. Computational results are further validated predicated upon literature evidence.