The preliminary step, i.e., the fine-grained clustering operation, divides the temporal expression data into a large number of clusters within which the similarity between the different expression profiles is expected to be very high. As such, any clustering algorithm could in principle accomplish this first task. However, we have elected to explore the basic principles of the HOT SAX representation [22], which transforms the time series into an appropriate sequence of symbols. Thus every signal is represented as a finite sequence of symbols, which is subsequently hashed to a unique scalar identifier. Time-courses that “hash” to the same value are assumed to belong to the same cluster.

In order to emphasize the role of the shape of the responses, the data is first z-score normalized such that all of the expression profiles are of the same magnitude:

Z_{i}(t) = ( Y_{i}(t) − μ_{i} ) / σ_{i}    (1)

where μ_{i} and σ_{i} denote the mean and standard deviation of the expression profile Y_{i}(t) over time. This converts each of the signals to have a mean of 0 and a standard deviation of 1. Subsequently, the expression value at each time point is assigned a symbol based on an appropriate equiprobable partitioning of the normal distribution. A critical property relates to the number of equiprobable domains the Gaussian is divided into [22]. If necessary, the time series is also piecewise averaged in order to reduce the size of the signal, in which case the time horizon t = 1…T is divided into w segments. Therefore, the original expression signal undergoes two transformations: it is first z-scored and subsequently transformed into a sequence of symbols:

Z_{i}(t) → S_{i} = s_{i,1} s_{i,2} … s_{i,w},  s_{i,j} ∈ AB

The set AB defines the so-called “alphabet”, i.e., the set of symbols whose cardinality equals the number of equiprobable partitions of the Gaussian curve; in what follows AB also denotes this cardinality.
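The discretization described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and default alphabet are our own choices, not part of the published method.

```python
from statistics import NormalDist, mean, pstdev
from bisect import bisect_right

def sax_symbols(y, alphabet="abcd", w=None):
    """Z-score a profile, optionally piecewise-average it into w segments,
    and map each value to an equiprobable symbol of the alphabet."""
    mu, sd = mean(y), pstdev(y)
    z = [(v - mu) / sd for v in y]            # mean 0, standard deviation 1
    if w is not None:                         # piecewise aggregate approximation
        seg = len(z) // w
        z = [mean(z[j * seg:(j + 1) * seg]) for j in range(w)]
    ab = len(alphabet)
    # breakpoints dividing N(0,1) into ab equiprobable regions
    cuts = [NormalDist().inv_cdf(k / ab) for k in range(1, ab)]
    return "".join(alphabet[bisect_right(cuts, v)] for v in z)
```

For a monotone ramp of eight points, `sax_symbols([0,1,2,3,4,5,6,7], "abcd")` yields `"aabbccdd"`: each quarter of the z-scored signal falls into one equiprobable region.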

After a gene expression profile has been converted into a sequence of symbols, the sequence is converted into an integer through the use of an appropriate hashing function. Following the formalism of [22], we evaluate the hash value based on:

H_{i} = 1 + Σ_{j=1…w} ( ord[ s_{i,j} ] − 1 ) · AB^{j−1}    (2)

The ord[•] (ordinal) operation maps the partition in which a given time point is found into an integer: the symbol “a” maps to 1, “b” to 2, etc. Thus each gene expression profile Y_{i}(t) is hashed to a unique identifier, H_{i}, and each such identifier corresponds to a cluster. By definition there are a finite number of hash values, which corresponds to the maximum possible number of clusters in the data. From (2) it is evident that the number of possible values for H_{i}, and therefore of clusters, is determined by the length of the symbol sequence and the number of possible symbols into which the gene expression profile has been broken up. The use of the equiprobable partitioning for the discretization step is important because signals will be assigned to the different clusters with equal probability, provided that the signals were randomly generated via an N(0,1) distribution.
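A positional encoding of this kind can be sketched as follows. The exact base ordering is our assumption; any bijection from symbol sequences to integers induces the same clustering.

```python
def sax_hash(symbols, alphabet="abcd"):
    """Map a symbol sequence to a unique integer in [0, AB**w),
    where AB is the alphabet size and w the sequence length."""
    ab = len(alphabet)
    h = 0
    for s in symbols:                  # positional (base-AB) encoding
        h = h * ab + alphabet.index(s)
    return h

def cluster_by_hash(symbol_sequences):
    """Group genes whose profiles hash to the same identifier."""
    clusters = {}
    for gene, seq in symbol_sequences.items():
        clusters.setdefault(sax_hash(seq), []).append(gene)
    return clusters
```

With AB symbols and sequences of length w there are exactly AB^{w} possible hash values, which bounds the number of clusters.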

Because of the underlying equiprobable partitioning, randomly generated expression profiles will be assigned to the different hash values with equal probability, and the population of a given cluster can therefore be modeled via a Poisson process. In the case where the number of possible hash values is approximately the same as the number of genes to be hashed, this Poisson distribution can be approximated by an exponential distribution [23]. Thus, in the case of a null synthetic dataset comprised of randomly generated expression profiles, the probability that a given cluster has a population of N can be approximated via an exponential distribution.

Evaluation of Significant Perturbations To evaluate whether a significant perturbation exists within the data, the hash-based clustering is run on the experimental data and a distribution of cluster membership is obtained. This is compared to a synthetic null dataset generated from random data with the same number of genes and time points as the experimental dataset. A standard permutation analysis is performed for estimating the statistical significance of the result [24]. We then determine the probability that a random trial will have a lower R^{2} correlation to the exponential distribution than the real data. From these random trials, the p-value is calculated as p-value = Prob(R^{2} in random < R^{2} in real), establishing the confidence that the data is non-random. We expect that if there is no significant stimulation within the experiment the p-value should be quite high, and if there is a significant stimulation within the system the p-value should be quite low, i.e., statistically significant, suggesting that the system indeed deviates reliably from the hypothetical exponential distribution.
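The permutation test can be sketched as below; the `r_squared` helper and the way trials are aggregated are illustrative assumptions consistent with the description above.

```python
def r_squared(observed, predicted):
    """Coefficient of determination of the observed cluster-size
    frequencies against the fitted exponential model."""
    m = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - m) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def permutation_pvalue(r2_real, r2_random):
    """p-value = Prob(R^2 of a random trial < R^2 of the real data);
    a low value indicates reliable deviation from the exponential null."""
    return sum(r2 < r2_real for r2 in r2_random) / len(r2_random)
```

For example, if the real data fits the exponential with R^{2} = 0.4 while random trials typically fit with R^{2} around 0.9, few trials fall below the real value and the p-value is small.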

The behavior of HOT SAX on randomly generated data thereby allows us to select the parameter AB in a systematic manner. In a real dataset, it is hypothesized that significant coordination will occur and, therefore, the performance of the hashing operation should show deviations from the theoretical exponential distribution. Thus, the HOT SAX algorithm is run on a given dataset while AB is varied parametrically, and the AB which corresponds to the lowest correlation to the null response is chosen as the optimal AB.

Selection of Characteristic Transcriptional Responses The majority of approaches for analyzing time-course gene expression data are based on the fundamental premise that over-populated motifs are indicative of significant events, and thus search for them as the main priority [9]. Our leading hypothesis extends this powerful concept and aims at the selection of a sub-set of characteristic responses, i.e., a sub-set of hash values and the clusters (motifs) they correspond to, based on the premise that hidden within the maze of recorded responses is a small(er) family of probe sets that exhibit clear deviations as the monitoring of the system progresses in time. The assumption is that only the portion of the genome that is affected by the specific perturbation ought to comprise that critical ensemble of intrinsic responses. Thus the key issues now become: (i) how to assess that a non-trivial change has taken place, and (ii) how to identify the critical sub-set of responses that make up that non-trivial change.

In order to address these two issues we introduce a term which we denote the “transcriptional state”. The transcriptional state is a metric which quantifies the deviation from a control. The control state can be arbitrarily defined, since we are interested in deviations and not absolute values. We assume that the “control” state corresponds to time t = 0, i.e., right before the systemic perturbation, if any. The baseline state is defined as the distribution of expression values of a set of genes at the control state. To quantify the deviation from this baseline state, the difference between the distribution at any future time t and the control state is evaluated. To do so, we make use of the Kolmogorov-Smirnov (KS) statistic [25]. The KS statistic represents a simple method for quantifying the difference between two distributions and was selected over other methods, such as the Shapiro-Wilk test [26], due to its ability to handle arbitrary binned distributions. It must be emphasized that the selection of a specific test is not binding and any effective measure for comparing two distributions can be employed. What is important, however, is the ability to compare arbitrary distributions, because there is no guarantee that the distribution of expression values of a set of genes will conform to a given named distribution.

While it has been shown that, if one takes a large enough set of genes, the distribution of expression values is expected to follow the log-normal distribution [27], the distribution of expression values corresponding to the genes that exhibit the largest sensitivity to the experimental perturbation will not. In order to evaluate the KS metric, the Cumulative Distribution Function (CDF) of expression levels at a given time point is determined and compared with the corresponding CDF of the control state. For a temporal experiment this needs to be repeated over time. Since the metric is defined over binned distributions, a number of bins, B, also needs to be specified. For calculating the CDF, we have assumed that the number of bins B equals the number of genes selected under a given iteration. The deviation at time t is then:

Δ(t) = max_{b=1…B} | CDF_{t}(b) − CDF_{0}(b) |
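A binned, KS-style deviation between two expression distributions might be computed as follows; the common binning scheme is an illustrative assumption.

```python
def ks_deviation(expr_t, expr_0, bins):
    """KS-style statistic: maximum absolute difference between the binned
    CDFs of expression values at time t and at the t = 0 control."""
    def cdf(values, edges):
        n = len(values)
        return [sum(v <= e for v in values) / n for e in edges]
    lo = min(min(expr_t), min(expr_0))
    hi = max(max(expr_t), max(expr_0))
    # right edges of `bins` equal-width bins spanning both samples
    edges = [lo + (hi - lo) * (b + 1) / bins for b in range(bins)]
    return max(abs(a - c) for a, c in zip(cdf(expr_t, edges), cdf(expr_0, edges)))
```

Identical samples give a deviation of 0, while fully separated samples give the maximum value of 1.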

The sequence Δ(t) is defined as the transcriptional state of the system, as it quantifies the level of transcriptional deviation from a control state. In order to evaluate a time-independent metric of the difference between two distributions, any norm can be used on Δ(t). We opt for the simplest evaluation, using the L_{1}-norm, which quantifies the deviation over all time points during the duration of the experimental protocol. Therefore, the scalar quantifying the difference between the two distributions is defined as:

Δ = ‖Δ(t)‖_{1} = Σ_{t=1…T} | Δ(t) |

Δ allows one to evaluate by how much the distribution of expression values deviated, over time, from the control state, which as previously mentioned is considered to be the state at t = 0.

Having defined a metric quantifying the deviation of the current state from a control, the selection step of the algorithm identifies a subset of motifs composed of genes whose transcriptional state is responsible for the maximum deviation from the control state.

Two interesting properties of the transcriptional state are worth exploring further. The first relates to the changing character of the transcriptional dynamics, i.e., the deviation from the control, as more and more clusters are added. Based on the hypothesis that the totality of the transcriptome hides the informative components of the response, one should expect the deviation from the control to be minimal if the entire data set is used. Therefore, by assessing Δ after a certain subset of clusters has been selected, we can measure whether the incorporation of additional clusters adds no information, through the inclusion of clusters highly correlated with existing ones, or whether the additional clusters make the deviation from the baseline more evident. In the case of the null dataset, we expect no significant coordination between the clusters. Thus, the expected result is that the maximum Δ would occur at a single cluster, and that Δ would be an essentially decreasing function as more clusters were added. This would mean that there are no significant anti-correlative patterns within the data, and that as more patterns are considered, no new information is added. In a dataset exhibiting a coherent reaction to a stimulus, the expectation is that Δ would reach a maximum at some intermediate value, once a population of patterns which represents the underlying dynamics has been isolated. The second characteristic is related to the dynamic progression, Δ(t), over time, as it represents the temporal deviations; we hypothesize that it acts as a surrogate of the intrinsic dynamics of the system.
Thus, Δ(t) summarizes the response of the system, and can be potentially useful for deriving a cellular dynamics response model.

However, the identification of an informative subset of motifs represents a difficult combinatorial problem. Given that the number of possible motifs is AB^{T}, where T is the number of piecewise-averaged time points, the number of combinations that need to be evaluated is computationally intractable. To compensate for the combinatorial nature of the problem, we propose two different methods for carrying out the selection of informative motifs. The first method is the use of a greedy algorithm [28], in which motifs are added in the order of their population and the transcriptional state is evaluated each time an additional motif is added. The set of motifs which yields the greatest value of the transcriptional state is selected as the optimal set. The justification for utilizing a greedy selection algorithm is that genes belonging to more highly populated motifs have been hypothesized to be more important than genes belonging to more sparsely populated motifs [29].

The advantage of utilizing a greedy selection lies in the fact that the combinatorial nature of the problem is ignored at the cost of finding a sub-optimal though possibly “good enough” solution.
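The greedy selection could be sketched as below. Here `delta_of`, a caller-supplied function that scores a candidate motif set by its transcriptional state, is a hypothetical interface, not part of the published method.

```python
def greedy_motif_selection(clusters, delta_of):
    """Add motifs in decreasing order of population; keep the prefix
    that maximizes the transcriptional state Delta."""
    ordered = sorted(clusters, key=lambda h: len(clusters[h]), reverse=True)
    best, best_delta, chosen = [], float("-inf"), []
    for h in ordered:
        chosen.append(h)
        d = delta_of(chosen)           # score the candidate motif set
        if d > best_delta:
            best, best_delta = list(chosen), d
    return best, best_delta
```

Only as many evaluations of `delta_of` are needed as there are clusters, which is what makes the greedy approach tractable.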

An alternative method for selecting an optimal subset of motifs is to limit the set of motifs that will be considered. Rather than considering a superset of AB^{T} different motifs, we limit our evaluation to over-populated motifs, so that the number of combinations that must be evaluated decreases to a more tractable number. To perform this reduction, we define an over-populated motif as a motif whose population is greater than would be expected if the HOT SAX hashing algorithm were performed upon a randomly generated dataset comprising the same number of probe sets and time points as the dataset being evaluated. After the initial set of motifs has been filtered, we generate all possible combinations of motifs and evaluate the value of their transcriptional state; as in the greedy selection, the set of motifs yielding the maximum transcriptional state is identified as the informative subset. The advantage of this method is that, unlike the greedy algorithm, we can be sure that the selected set of motifs is indeed optimal within the reduced superset. However, while the filtering step eliminates a large number of combinations, it still requires the evaluation of a large number of possible combinations and thus is computationally expensive.
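Under the same hypothetical `delta_of` scoring interface, the exhaustive search over the pre-filtered motifs is a straightforward enumeration:

```python
from itertools import combinations

def best_motif_subset(overpopulated, delta_of):
    """Score every non-empty combination of pre-filtered, over-populated
    motifs and return the subset with the maximum transcriptional state."""
    best, best_delta = None, float("-inf")
    for r in range(1, len(overpopulated) + 1):
        for subset in combinations(overpopulated, r):
            d = delta_of(subset)
            if d > best_delta:
                best, best_delta = subset, d
    return best, best_delta
```

With k over-populated motifs this still evaluates 2^k − 1 subsets, which is why the filtering step is essential for keeping k small.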