4.1. Comparing FLLat to single-sample methods
Here, we present some simulations to demonstrate the advantages that a true multi-sample approach has over single-sample approaches. We simulated 3 different data sets, each consisting of S = 20 samples and L = 1000 probes. For the first 2 data sets, samples were generated in a manner similar to that employed by Olshen and others (2004). We used the model yls = μls + ϵls, l = 1,…,L, s = 1,…,S, where μls is the mean and ϵls~N(0,σ2) with σ determined by the signal-to-noise ratio (SNR), as described below. The mean is given by μls = ∑m = 1MscmI{lm ≤ l ≤ lm + km}, where Ms is the number of segments generated for sample s and cm, lm, and km are the height, starting position, and length, respectively, of each segment. For the first data set, the samples were designed to share no segments (regions of CNV). Therefore, separately for each sample, we chose the value of Ms from {1,2,3,4,5}, then chose cm from {±1,±2,±3,±4,±5}, lm from {1,…,L − 100}, and km from {5,10,20,50,100}. For the second data set, the samples were designed to share segments. We set the number of shared segments to 5 and generated starting positions and lengths for each shared segment, as above. In order to determine the number of samples in which each shared segment appeared, we selected a proportion from (0.25,0.75) for each shared segment and randomly selected the corresponding number of samples that would then share the segment. For each sample containing shared segments, heights for each shared segment were chosen as above. Finally, each sample was also populated with unshared segments, chosen as above, with the additional requirement that no sample contained more than 5 total segments. The third data set was generated according to model (2.1) with J = 5 and ϵls~N(0,σ2) with σ again determined by the SNR. The features were generated using the model βlj = ∑m = 1MjcmI{lm ≤ l ≤ lm + km}, l = 1,…,L, j = 1,…,J. The value of Mj was chosen from {1,2,3}, whereas cm, lm, and km were chosen as above. The weights θjs were generated by creating a matrix of N(0,1) variables and normalizing the rows to satisfy (2.6).
We applied FLLat, cghFLasso (
Tibshirani and Wang, 2008), quantsmooth (Eilers and de Menezes, 2005), CBS (
Venkatraman and Olshen, 2007) and PMD (
L1, FL) (Witten
and others, 2009, a multi-sample method) to each of the 3 data sets for SNRs of 0.1, 0.5, and 1. The SNR was defined to be the mean magnitude of the aberrations divided by
σ. Here, an aberration is any probe with nonzero signal, where the signal is given by the
μls for data sets 1 and 2 and
BΘ for data set 3. For each method, we generally used the default settings but also followed any recommendations and attempted to apply any tuning procedures, as outlined in the software documentation. In particular, for FLLat, we used the PVE plots in to choose
J and used criterion (3.3) to select the optimal tuning parameters; for PMD (
L1, FL), we set the number of factors to
J (to be comparable with FLLat) and used the provided cross-validation function when estimating each factor; for quantsmooth, as the provided cross-validation function produced an error on the simulated data, we used the default value for the smoothing parameter. A comparison of the computation times for each method can be found in Section S.2 of the
supplementary material available at
Biostatistics online.
For each method, we generated receiver operating characteristic (ROC) curves, shown in , by comparing the true signal to the estimated signal. For each sample, any probe that had an estimated signal greater in magnitude than a fixed threshold was declared an aberration. The ROC curves were produced by varying the threshold and calculating the true positive and false positive rates for each value of the threshold. The true positive rate was defined to be the proportion of true aberrations that were declared to be aberrations by a given method. Similarly, the false positive rate was defined to be the proportion of true nonaberrations that were declared to be aberrations.
When comparing ROC curves, curves that lie further above the 45о line indicate better performance. For the first data set, where the samples share no segments, although FLLat is not likely to have much advantage over single-sample methods, we see that it still performs comparably to the other methods. For the second data set, where samples now shared segments, FLLat performed very well, and this is especially evident at the lowest SNR. Finally, for the third data set, where the samples are generated from model (2.1), FLLat significantly outperformed all the other methods. Overall, these simulations show that FLLat performs very favorably compared to single-sample methods when samples truly share similarities. Further, FLLat performs particularly well in situations with low SNRs, which is often the case with aCGH data. With regard to the PVE plots in , we see that for the third data set, where we know the true number of features (J = 5), the PVE accurately determines the correct number of features.
4.2. Estimating the FDR
For a given threshold
T, letting

denote the fitted values produced by FLLat, we can declare probe location
l for sample
s to be an aberration if

. Given these declared aberrations, we would like to have a measure of the proportion that are falsely called. Using a similar approach to
Tibshirani and Wang (2008), identifying aberrations can be thought of as a multiple-testing problem, where we are testing the following hypothesis for each probe location within each sample:
The false discovery rate (FDR,
Benjamini and Hochberg, 1995), for threshold
T, is then defined to be
The numerator within the expectation is the number of declared aberrations that are not true aberrations and the denominator is the total number of declared aberrations. We propose a permutation-based method for estimating the FDR.
To estimate (4.1), we need some information regarding the null distribution of the data. If a reference data set were available, we could apply FLLat to this reference data set and use the resulting declared aberrations to estimate the numerator of (4.1). When an appropriate reference data set is not available, which is often the case in real experiments, an alternative is to approximate the null distribution of the data. Under the null hypothesis, the log intensity ratio for probe location
l in sample
s should be distributed as random noise, with no correlation between neighboring locations. One way to approximate this null distribution is to permute the probe locations within each sample. This has the effect of destroying any linear structure along the chromosome. Suppose the data were permuted in this fashion
K times, and let

denote the fitted values produced by applying FLLat to the
kth permuted data set. Then the FDR can be estimated by
where
π0 is the proportion of true null hypotheses.
We calculated both the true FDR and also the permutation-based estimate for each of the 3 data sets described in Section 4.1 at the same 3 SNRs of 0.1, 0.5, and 1. Plots of the true FDR and the average estimated FDR (calculated from 100 realizations of each simulated data set) for varying threshold values are displayed in . The estimated FDR was based on K = 20 permuted data sets and the true value of π0 = ∑s = 1S∑l = 1LI(∑j = 1Jβljθjs = 0)/S×L (i.e. the proportion of total probe locations having zero true signal) was used in (4.2). The true value of π0 is generally unknown and can either be estimated from the data, which can be a difficult task or set to the upper bound of 1, which results in conservative estimates of the FDR. We see from that in each plot, the estimated FDR approximates the true FDR fairly well for smaller values of the FDR, although there are some deviations at the larger values.