The detection of periodically behaving gene expression time series has been an area of enormous interest lately. Since more and more microarray [2] data is becoming available, including time series, the periodicity detection methods from other branches of science are being modified for use in gene expression studies. Periodicity detection methods can be broadly divided into generic and more specific detection rules. The generic approaches use the available statistical theory to seek strong periodic components at all the available frequencies [3] and use exact tests to yield significance values with multiple testing correction. The more specific methods try to find periodic phenomena at specific frequencies, e.g. the assumed cell cycle frequency (see e.g. [8]).
Some of the most severe problems in processing gene expression time series data include short time series length, noise of unknown distribution, outliers (i.e. points that are clearly inconsistent with most of the other points in the data), non-uniform sampling used in performing the experiments, and other non-linearities involved in the measurement technologies themselves. Outliers can be thought of as low-probability values from a mixture model in which, with high probability, the noise in the signal follows a (Gaussian) distribution and, with low probability, another distribution whose variance is much higher than that of the first one. In earlier work we presented a robust modification [5] of Fisher's g-test for finding hidden periodicities in time series data. The method performs well both under the Gaussian noise assumption and when outliers and other non-linearities are present. However, non-uniform sampling, other than that resulting from missing values, was not considered, and the aim of this paper is to evaluate different robust methods for periodicity detection that can handle non-uniform sampling. Non-uniform sampling in periodicity detection has previously been considered in [7]. The authors use the so-called Lomb-Scargle periodogram, which is not limited by non-uniform sampling, to find the spectral estimate for a time series and then test whether the maximum value of the spectral estimate is significantly higher than the other values. While the method is mathematically sound and is based on an exact test, it is non-robust, as is the basic Fisher's test. The same issue applies to most of the other previously published methods. Exceptions to this are [5] and [15]; in the latter, the authors use Bayesian detection and show that the method can handle data corrupted with uniform and Laplacian noise as well as Gaussian noise.
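To make the Lomb-Scargle idea concrete, the following is a minimal sketch of the classical Lomb-Scargle periodogram for non-uniformly sampled data, followed by a Fisher-style ratio of the largest spectral value to the sum of all values. It is an illustration under stated assumptions rather than the implementation used in [7]: the function names, the sampling times, the frequency grid and the test signal are all hypothetical, and the series is assumed to be non-constant.

```python
import math

def lomb_scargle(times, values, freqs):
    """Classical Lomb-Scargle power at each frequency (cycles per time unit)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered) / (n - 1)  # assumes a non-constant series
    power = []
    for f in freqs:
        omega = 2.0 * math.pi * f
        # The offset tau makes the shifted sine and cosine terms orthogonal
        # on the given (possibly non-uniform) sampling times.
        tau = math.atan2(sum(math.sin(2.0 * omega * t) for t in times),
                         sum(math.cos(2.0 * omega * t) for t in times)) / (2.0 * omega)
        cc = ss = yc = ys = 0.0
        for t, y in zip(times, centered):
            c = math.cos(omega * (t - tau))
            s = math.sin(omega * (t - tau))
            cc += c * c
            ss += s * s
            yc += y * c
            ys += y * s
        p = 0.0
        if cc > 1e-12:  # guard against a degenerate cosine term
            p += yc * yc / cc
        if ss > 1e-12:  # guard against a degenerate sine term
            p += ys * ys / ss
        power.append(p / (2.0 * variance))
    return power

def g_statistic(power):
    """Ratio of the largest spectral value to the sum of all spectral values."""
    return max(power) / sum(power)

# Hypothetical non-uniform sampling times and a noise-free sinusoid at 0.1 cycles per unit.
times = [0.0, 2.1, 5.3, 7.2, 10.4, 12.5, 15.9, 17.0, 20.6, 22.8, 25.1, 28.7,
         30.2, 33.9, 35.4, 38.8, 41.3, 43.0, 46.7, 48.1, 51.6, 54.2, 56.5, 59.9]
values = [math.sin(2.0 * math.pi * 0.1 * t) for t in times]
freqs = [0.04 + 0.02 * k for k in range(9)]  # 0.04, 0.06, ..., 0.20

power = lomb_scargle(times, values, freqs)
peak_index = max(range(len(freqs)), key=lambda i: power[i])
g = g_statistic(power)
```

Because every term is an ordinary (unweighted) sum of squares, a single gross outlier can dominate the estimate, which is the non-robustness discussed above.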
The matter of choosing the sampling scenario in a cost-effective way is discussed in [16], where the authors present an active learning based online algorithm for choosing the sampling strategy.
In [17] the authors have developed a periodicity detection method in which they fit orthogonal periodic polynomials to non-uniformly sampled data. If the periodicity of interest is not sinusoidal (e.g. narrow pulse signals), the method improves on the performance of the Lomb-Scargle periodogram, but it reduces to the Lomb-Scargle periodogram in the case of a sinusoidal model, which is the case of interest for us. An approach based on similar ideas is presented in [18], where the authors use least squares fitting of wavelets, which is especially suitable for searching for periodicity in non-uniformly sampled data with non-sinusoidal cyclic components.
In [19] the authors use the Lomb-Scargle periodogram for periodicity detection and show that it performs better than the combination of interpolation to uniform sampling and the ordinary periodogram. They point out that the low-pass effect involved in interpolation is a major problem. In [20] the authors use a complicated neural network approach for periodicity detection in non-uniformly sampled time series but first interpolate to uniform sampling, which, according to [19], causes problems at the high-frequency end of the spectrum. In [21] the authors actually make use of non-uniform sampling in digital alias-free signal processing applications. However, their approach is based on the idea of being able to choose the sampling intervals, which is not the usual case in biological studies. A model similar to the one in this paper is presented in [22], where the authors aim to estimate a wide spectral range of frequencies of a non-uniformly sampled signal. Their approach is, however, aimed more at real-time applications and longer signals than those usually present in microarray studies. Some of these methods can be thought of as improvements over the standard periodogram, but they are non-robust when it comes to heavy-tailed distributions and/or mixture models, where the presence of (low-probability) outliers can cause large residuals in the estimators and thus bias the results. The Bayesian approach [15] is clearly distinct from the aforementioned approaches and presents an opportunity to make use of prior knowledge, such as the frequency of the oscillation. It is shown in [15] that the Bayesian detector performs better than methods that assume a strict frequency of periodicity [9] when the frequency is not exactly known a priori. Other Bayesian periodicity detectors are presented in [23].
In this paper we follow the general direction of Fisher's g-test together with multiple testing correction for the detection of periodic time series in multiple time series data. Several modifications are needed to take into account non-uniform sampling and unknown noise characteristics. We use several different robust regression based methods [27] to find the spectral estimate of a time series instead of using the basic non-robust periodogram. By using regression we can readily take non-uniform sampling into account.
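The idea of a regression based spectral estimate can be sketched as follows: at each candidate frequency, a sinusoid plus a constant is fitted to the observations at their actual sampling times, and the squared amplitude of the fit serves as the spectral value. The sketch below uses Huber M-estimation via iteratively reweighted least squares as one example of a robust regression method. The choice of Huber weights, the tuning constant 1.345, the median-based scale estimate, the definition of the spectral value as a^2 + b^2 and all names and data are assumptions made for illustration; they are not necessarily the estimators evaluated in this paper.

```python
import math

def solve_linear(A, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def weighted_fit(X, y, w):
    """Weighted least squares via the normal equations."""
    p = len(X[0])
    A = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X)) for k in range(p)]
         for j in range(p)]
    b = [sum(wi * xi[j] * yi for wi, xi, yi in zip(w, X, y)) for j in range(p)]
    return solve_linear(A, b)

def huber_sinusoid_fit(times, values, freq, c=1.345, iters=30):
    """Fit a*cos + b*sin + const with Huber weights; iters=1 gives ordinary least squares."""
    omega = 2.0 * math.pi * freq
    X = [[math.cos(omega * t), math.sin(omega * t), 1.0] for t in times]
    w = [1.0] * len(values)
    for _ in range(iters):
        beta = weighted_fit(X, values, w)
        resid = [yi - sum(bj * xj for bj, xj in zip(beta, xi))
                 for xi, yi in zip(X, values)]
        # Robust scale from the (upper) median absolute residual.
        mad = sorted(abs(r) for r in resid)[len(resid) // 2]
        scale = max(mad / 0.6745, 1e-12)
        # Huber weights: full weight for small residuals, reduced weight for large ones.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in resid]
    return beta

def robust_spectrum(times, values, freqs):
    """Squared fitted amplitude a^2 + b^2 at each candidate frequency."""
    fits = [huber_sinusoid_fit(times, values, f) for f in freqs]
    return [a * a + b * b for a, b, _ in fits]

# Hypothetical data: unit-amplitude sinusoid at 0.1, small perturbation, one gross outlier.
times = [0.0, 2.1, 5.3, 7.2, 10.4, 12.5, 15.9, 17.0, 20.6, 22.8, 25.1, 28.7,
         30.2, 33.9, 35.4, 38.8, 41.3, 43.0, 46.7, 48.1, 51.6, 54.2, 56.5, 59.9]
values = [math.sin(2.0 * math.pi * 0.1 * t) + 0.05 * math.sin(37.0 * i)
          for i, t in enumerate(times)]
values[5] += 8.0  # outlier at t = 12.5
freqs = [0.04 + 0.02 * k for k in range(9)]

robust_power = robust_spectrum(times, values, freqs)
peak_index = max(range(len(freqs)), key=lambda i: robust_power[i])
a_ols, b_ols, _ = huber_sinusoid_fit(times, values, 0.1, iters=1)
ols_power_at_truth = a_ols * a_ols + b_ols * b_ols
```

In this toy example the true squared amplitude is 1. The ordinary least squares value at the true frequency is pulled away from it by the single outlier, whereas the Huber fit down-weights that point. Since the design matrix is built from the actual sampling times, no interpolation to a uniform grid is needed.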
After finding the spectral estimate we propose to replace the periodogram in the g-test with the robust spectral estimate. Since no analytical results for such modifications exist, we resort to permutation tests in finding the p-values. We also note that the test can be modified to yield a test for one specific chosen frequency if an a priori hypothesis is made about the frequency of the periodicity of interest.
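A permutation test of this kind can be sketched as follows. The observed g-statistic is compared with the g-statistics obtained after randomly permuting the measured values over the fixed sampling times, which destroys any temporal structure while preserving the marginal distribution of the values. In this sketch a plain non-uniform Fourier sum stands in for the robust spectral estimate, and the function names, the number of permutations, the seed and the data are illustrative assumptions. Passing an `index` gives the variant that tests one chosen frequency only.

```python
import cmath
import math
import random

def periodogram(times, values, freqs):
    """Stand-in spectral estimate: squared magnitude of the non-uniform Fourier sum."""
    mean = sum(values) / len(values)
    return [abs(sum((y - mean) * cmath.exp(-2j * math.pi * f * t)
                    for t, y in zip(times, values))) ** 2 / len(values)
            for f in freqs]

def g_statistic(spectrum, index=None):
    """Largest spectral value (or the value at a chosen index) divided by the total."""
    peak = max(spectrum) if index is None else spectrum[index]
    return peak / sum(spectrum)

def permutation_p_value(times, values, freqs, estimator=periodogram,
                        index=None, n_perm=199, seed=0):
    """Fraction of permuted series whose g-statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = g_statistic(estimator(times, values, freqs), index)
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute values over the fixed sampling times
        if g_statistic(estimator(times, shuffled, freqs), index) >= observed:
            exceed += 1
    # Counting the observed series itself keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

# Hypothetical strongly periodic series on non-uniform sampling times.
times = [0.0, 2.1, 5.3, 7.2, 10.4, 12.5, 15.9, 17.0, 20.6, 22.8, 25.1, 28.7,
         30.2, 33.9, 35.4, 38.8, 41.3, 43.0, 46.7, 48.1, 51.6, 54.2, 56.5, 59.9]
values = [math.sin(2.0 * math.pi * 0.1 * t) + 0.05 * math.sin(37.0 * i)
          for i, t in enumerate(times)]
freqs = [0.04 + 0.02 * k for k in range(9)]

p_value = permutation_p_value(times, values, freqs)
p_value_single = permutation_p_value(times, values, freqs, index=3)  # frequency 0.1 only
```

The resolution of the p-value is limited by the number of permutations, so in a multiple testing setting with many genes the number of permutations would have to be considerably larger than in this sketch.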
To compare the performance of the different regression methods and some of the previously introduced novel methods [5] in this framework, we use simulations and show the receiver operating characteristic (ROC) figures under several noise and signal configurations and non-uniform sampling. The computational complexity of the different methods is also briefly considered.
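As an illustration of what such a simulation setup might look like, the sketch below draws non-uniform sampling times and adds noise from a two-component mixture in which most values come from a Gaussian distribution and a small fraction comes from a Gaussian with a much larger standard deviation. All parameter values (series length, time span, frequency, outlier probability and scale) and function names are hypothetical and are not the configurations used in our experiments.

```python
import math
import random

def contaminated_noise(rng, sigma=1.0, outlier_prob=0.1, outlier_scale=10.0):
    """Gaussian noise that is replaced, with low probability, by a high-variance draw."""
    if rng.random() < outlier_prob:
        return rng.gauss(0.0, sigma * outlier_scale)
    return rng.gauss(0.0, sigma)

def simulate_series(rng, n=20, span=48.0, freq=1.0 / 24.0, amplitude=1.0,
                    periodic=True, **noise_kwargs):
    """Return (times, values) for one simulated series with non-uniform sampling."""
    times = sorted(rng.uniform(0.0, span) for _ in range(n))
    values = []
    for t in times:
        signal = amplitude * math.sin(2.0 * math.pi * freq * t) if periodic else 0.0
        values.append(signal + contaminated_noise(rng, **noise_kwargs))
    return times, values

# One periodic and one non-periodic series, as would be needed for an ROC comparison.
rng = random.Random(0)
periodic_times, periodic_values = simulate_series(rng, periodic=True)
null_times, null_values = simulate_series(rng, periodic=False)
```

Scoring many such periodic and non-periodic series with a detector and sweeping the decision threshold yields the true and false positive rates that make up an ROC curve.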
As an application we choose one of the best performing methods and apply it to microarray data measured from the mussel Mytilus californianus. Our experiments indicate that there is no statistically significant connection between the circadian rhythm and cell cycle regulated genes that could be expected to show periodic expression.