As any researcher knows, the analysis of real data sets always poses multiple unexpected challenges. Our goal in this section is simply to exemplify the use of the model we presented, illustrating its potential. In particular, we want to investigate the benefit of taking into account linkage disequilibrium and the advantages of the joint analysis of multiple samples.
We have currently implemented all the algorithms described in sections 3 and 4 in R, and most of them in much faster C++. We are in the process of automating file-handling tasks and plan to release stand-alone code in the near future.
The performance of any HMM algorithm designed to study copy number variation depends crucially on the selected parameter values. Their estimates are only as good as the training set they are based on. Unfortunately, a large curated data set, including a number of experimentally validated copy number variants, is not available for Illumina data. Both to develop our model and to explore the effectiveness of our algorithm, we have relied on a data set obtained from a genome-wide association study of sporadic amyotrophic lateral sclerosis that comprises 472 patients and 450 matched controls of Dutch origin [20]. Genotyping was carried out with Illumina Hap300 Bead Chips. We used this large genotype collection to define five data sets, for the purposes detailed below. We will complete fine-tuning of parameter values and algorithm implementation once a larger data set becomes available.
A. Curated Set of Copy Number Variants
We compiled a collection of 255 curated copy number variants. A simple script was run through the ALS genome data to identify putative CNVs on the basis of consecutive homozygous genotypes and LogR values outside a defined window. The results from 70 samples were visually scored by two raters to confirm putative copy number variations on the basis of LogR and BAF values [1]. We further matched these putative copy number variations with copy number polymorphisms documented in the online Database of Genomic Variants (http://projects.tcag.ca/variation/).
We considered as true copy number variations the subset of those visually scored that we were able to match with an existing variant in the database, either validated with a PCR study, or independently documented by two different high throughput assays. This led us to a total of 10 homozygous deletions, 105 hemizygous deletions and 140 duplications. Ten of these variants were confirmed with experimental methods. We extracted BAF and LogR information for these copy number variable sites, with a context of 1000 SNPs proximal and 1000 SNPs distal to the CNV (with the exception of the CNVs located at the telomeres).
We use this data set to estimate the parameters of the emission probabilities and for model development and comparisons. This is the most ‘interesting’ of our data sets, as it contains both deletions and duplications in their naturally occurring form and distributed across the entire genome.
B. Signal from the Entire Genome of One Individual
To assess the computational burden of our algorithm with a measure that might be relevant for researchers in the area, we analyze the data for the entire genome of one individual.
C. Simulated Set of Deletions Using Chromosome X Data
We reclustered male X chromosome data against female-only samples to obtain LogR and BAF signals corresponding to autosomal hemizygous deletions. A set of 270 diploid segments, each 1500 SNPs long, was obtained by random selection of standard female X chromosome data. In each of these segments we simulated deletions, with positions and lengths generated using the described Markov model with deletion rate δ = 0.0035 and η = 7 × 10−6. The BAF and LogR signals for the deleted regions were obtained from the corresponding SNPs in the reclustered male data.
We use this data set to provide a minimal illustration of how the parameters estimated on the basis of A can be successfully used to analyze an independent data set, and to benchmark the performance of our algorithm against other software.
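The placement of simulated deletions along a segment can be sketched as a two-state Markov chain over SNP coordinates. The helper name and the exact form of the distance-dependent exit probability below are assumptions for illustration, not our production code:

```python
import random

def simulate_deletion_states(positions, delta=0.0035, eta=7e-6, seed=0):
    """Simulate per-SNP copy-number states (2 = diploid, 1 = deleted)
    along a chromosome. A diploid SNP enters a deletion with probability
    delta; the chance of exiting a deletion grows with the base-pair
    distance to the next SNP via eta (illustrative parameterization)."""
    rng = random.Random(seed)
    states = [2]
    for i in range(1, len(positions)):
        d = positions[i] - positions[i - 1]
        if states[-1] == 2:
            states.append(1 if rng.random() < delta else 2)
        else:
            # probability of returning to the diploid state over distance d
            p_exit = 1.0 - (1.0 - eta) ** d
            states.append(2 if rng.random() < p_exit else 1)
    return states
```

With δ = 0.0035 and η = 7 × 10−6 as above, a 1500-SNP segment typically contains a handful of deletions whose lengths are governed by the distance-dependent exit rate.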
D. Simulated Deletion of a Tumor Suppressor Gene
To generate a challenging data set that mimics the setting of a tumor suppressor gene deletion, we again used true diploid and hemizygous signals obtained from X chromosome data. Haploid segments comprising the same 1000 SNPs were obtained from 45 different females. A tumor suppressor gene is simulated in the central position with a deletion rate of 0.4, a background deletion rate δ = 0.002, and η = 7 × 10−6. The signal for the deleted SNPs is a weighted average of the respective male and female signals. We use three different weighted averages of hemizygous and haploid signals to mimic the noise levels that can characterize real data sets. In the case of tumor data in particular, since it is often impossible to perfectly separate cancer and normal cells, this mixture model captures the true inhomogeneity of the sample.
We use this simulation to investigate the potential advantage of joint multisample analysis over single sample reconstruction.
E. Common Deletion
We collected signal from 3000 SNPs on chromosome 8 for the entire sample of 922 individuals. This region comprises a known hemizygous deletion of approximately 160 kb, deposited by 11 independent sources in the Database of Genomic Variants (chr8:137585971–138040412).
This data set provides us with the possibility to illustrate our multisample method on real data.
Finally, the entire ALS data set (with the exclusion of the samples in data sets A and B) was used to estimate two-marker haplotype frequencies across the genome, which we use to model LD in our analysis.
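Assuming phased haplotypes for simplicity (real genotype data would require an EM step to resolve phase, which we omit here), two-marker haplotype frequencies for adjacent marker pairs can be tabulated as:

```python
from collections import Counter

def two_marker_haplotype_freqs(haplotypes):
    """Estimate two-marker haplotype frequencies for adjacent marker
    pairs from a list of phased haplotypes (each a sequence of 0/1
    alleles). Returns, for each adjacent pair, a dict mapping the
    observed two-allele haplotype to its sample frequency."""
    n_markers = len(haplotypes[0])
    freqs = []
    for j in range(n_markers - 1):
        counts = Counter((h[j], h[j + 1]) for h in haplotypes)
        total = sum(counts.values())
        freqs.append({pair: c / total for pair, c in counts.items()})
    return freqs
```

These pairwise frequencies are what the LD-aware transition/emission structure of the HMM consumes: stretches of strong LD produce runs of homozygous genotypes that, without this correction, resemble deletions.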
Before presenting the results of our data analysis, we need to clarify that, for this purpose, we have actually extended the model described in section 3 in a number of ways. For ease of exposition, we deliberately omitted these details from the previous sections, but they are important for a successful data analysis. The model we described so far allows only for three possible copy number states. Our training data, however, contain a small number of homozygous deletions. To accommodate this possibility, we have extended the HMM to include a zero copy number state. Indeed, for the purpose of more general data analysis, one may want to consider other copy number states – this can be done quite easily, as in [2], without substantial modifications of our algorithm. Another modification of the algorithm described in the previous sections has to do with the emission probabilities for y: while for ease of exposition we considered a simple Gaussian distribution, here we use a mixture distribution as in [2] to model outliers.
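As a sketch of such a mixture emission, the state-specific Gaussian can be mixed with a small-weight, much wider component that absorbs outliers. The component weight and width below are illustrative assumptions, not the fitted values:

```python
import math

def logr_emission(y, mu, sigma, eps=0.01, outlier_sd=3.0):
    """Emission density for a LogR observation y under a given
    copy-number state: a (1 - eps)-weighted Gaussian around the
    state-specific mean mu, plus an eps-weighted, much wider Gaussian
    that absorbs outlying observations."""
    def gauss(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return (1.0 - eps) * gauss(y, mu, sigma) + eps * gauss(y, mu, outlier_sd)
```

The wide component keeps a single aberrant LogR value from dominating the likelihood and forcing a spurious state change, which is the practical purpose of the mixture.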
For the purpose of data analysis, we fixed the parameters of the transition matrix using information from the literature. We focused on the results reported in [14] with regard to deletions and duplications scored with Affymetrix genotypes. We obtained the distribution of the lengths of deletions and duplications, as well as of the percentage of SNPs in deleted or duplicated regions, for 270 HapMap individuals. These results are presented in figure 7. The median lengths of deletions and duplications are 48,734 and 131,571 base pairs, respectively. The median proportions of SNPs in a deletion or duplication are 0.00027 and 0.00049. Using the method of moments, we fixed δ = 0.001 and γ = 0.0005, set the marginal probability of the zero copy number state to 0.00005, and fixed η = 7 × 10−6, with distance measured in the number of base pairs separating two loci.
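To illustrate the kind of quantile matching involved, assume event lengths are roughly geometric in base pairs with per-base exit rate η; the rate implied by an observed median length can then be recovered in closed form. This is a sketch of the idea, and the actual fitting may differ in detail:

```python
import math

def eta_from_median_length(median_bp):
    """Per-base-pair exit rate eta of a geometric length distribution
    whose median matches an observed median event length:
    median = ln(1/2) / ln(1 - eta)  =>  eta = 1 - 2 ** (-1 / median)."""
    return 1.0 - 2.0 ** (-1.0 / median_bp)
```

For the observed median deletion length of 48,734 bp this gives roughly 1.4 × 10−5, the same order of magnitude as the value η = 7 × 10−6 adopted above.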
Fig. 7 The top row contains histograms of the lengths in base pairs of all duplications and deletions detected in HapMap individuals in [14]. The bottom row presents histograms of the proportion of all SNPs that appeared to be in a duplicated or deleted region.
A. Curated Set of Copy Number Variants
We use this data set to estimate the emission parameters for our HMM model. We consider three different approaches in the analysis.
(1) To make optimal use of the information available, we perform maximum likelihood on data set A assuming full knowledge of the copy number states. These parameter values will be used in the rest of the paper, unless otherwise specified, and, in particular, in the analysis of data sets B–E. Using these parameter values to analyze A can artificially increase the sensitivity and specificity of our algorithm, in comparison with others trained elsewhere. However, it is perfectly legitimate to compare the performance of different versions of our algorithm on data A, since they are all trained on the basis of the same information. We use this approach, then, to develop our data analysis strategy and to illustrate the advantage of accounting for linkage disequilibrium.
(2) To assess the impact of full knowledge of copy number states in the previous analysis, and to test the algorithms developed in section 3.3, we estimate all parameters using the MM algorithm and ignoring information on copy number status.
(3) Finally, to provide the reader with a benchmark of the ‘difficulty’ of data set A, we analyze it with QuantiSNP and PennCNV.
Note that in a large-scale data analysis, one would want to tune the emission parameters using EM for each sample analyzed, as done in [24]. We will take this approach in analyzing data set B. We now report the results of these analyses.
(1) Once the parameters of the HMM are estimated with maximum likelihood, we use the Viterbi algorithm to reconstruct copy number states in the training set, with the performance described in table . For comparison purposes, we also analyze the same data set with a version of the HMM that does not account for linkage disequilibrium. This leads to very similar values, except for a substantial increase in the number of false deletions. This is exactly the effect we expected: higher linkage disequilibrium leads to stretches of homozygous markers, which can be mistaken for deletions as soon as the LogR signal is lower than expected. When the HMM is modified to take linkage disequilibrium into account, some of these spurious reconstructed deletions are eliminated.
Reconstructed copy number states, per SNP, on data set A, both accounting for LD and not
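The state reconstruction itself is the textbook log-space Viterbi recursion. A generic sketch follows (not our exact implementation; state indexing and emission models are simplified for illustration):

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Log-space Viterbi decoding. obs_loglik[t][s] is the emission
    log-likelihood of observation t under state s; log_trans[r][s] and
    log_init[s] are log transition and initial probabilities. Returns
    the most probable state path."""
    T, S = len(obs_loglik), len(log_init)
    v = [[0.0] * S for _ in range(T)]     # best log-score ending in state s at t
    ptr = [[0] * S for _ in range(T)]     # backpointers
    for s in range(S):
        v[0][s] = log_init[s] + obs_loglik[0][s]
    for t in range(1, T):
        for s in range(S):
            best, arg = max((v[t - 1][r] + log_trans[r][s], r) for r in range(S))
            v[t][s] = best + obs_loglik[t][s]
            ptr[t][s] = arg
    path = [max(range(S), key=lambda s: v[T - 1][s])]
    for t in range(T - 1, 0, -1):
        path.append(ptr[t][path[-1]])
    return path[::-1]
```

With a sticky transition matrix (high self-transition probabilities), isolated noisy observations do not flip the state, which is why per-SNP errors mostly appear at segment boundaries.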
We report the values in table purely for comparison purposes. On the one hand, because this is our original training set, the sensitivity and specificity we achieve here cannot necessarily be extended to a generic sample. On the other hand, there is a series of obvious modifications to the analysis strategies that increase performance. For example, from table it is evident that, even when linkage disequilibrium is taken into account, we reconstruct some false positive deletions. This excess of deletions is actually clustered in a subset of our samples that shows larger variability in the values of y, an example of which is given in figure . In the last portion of the analyzed region, the average value of y is lower: when this coincides with a series of homozygous markers, the HMM can interpret it as a deletion. Reconstructed values can be substantially improved if we estimate individual-specific values for the emission parameters, and if we allow the mean of LogR in the diploid state to vary smoothly across different genomic regions. This is achieved, for example, by fitting the data with a smooth Lowess curve for each individual and then analyzing the residuals. Table illustrates these results on the training set. A further increase in specificity can be obtained by basing copy number calls not simply on the Viterbi reconstruction, but on the probability of the various states, conditionally on the observed data at each SNP. By calling as deletions or duplications only those locations with conditional probability higher than a given threshold, one can directly control false positive rates. We give these details to illustrate how the performance of an HMM algorithm depends on a number of implementation choices, besides the specific probabilistic model used. This is why it is important to use exactly the same implementation to evaluate the effect of accounting for LD, as we did in table .
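As a minimal sketch of the detrending step, a windowed running mean can stand in for the per-individual Lowess fit; the residuals, rather than the raw LogR values, would then be passed to the HMM:

```python
def detrend_logr(y, window=101):
    """Remove a smooth trend from a LogR track by subtracting a running
    mean (a crude stand-in for a Lowess fit) and return the residuals.
    The window size is an illustrative choice."""
    n = len(y)
    half = window // 2
    residuals = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend = sum(y[lo:hi]) / (hi - lo)
        residuals.append(y[i] - trend)
    return residuals
```

Any trend much wider than the window is absorbed into the local mean, so only deviations at the scale of a CNV survive in the residuals.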
We have not fully optimized our algorithm, partly because data set A is of rather limited size and partly because this is beyond the scope of the present work. We are in the process of acquiring a much larger training data set that we plan to use for the release of our software.
One sample in the training set with a clear decreasing trend in LogR values, which results in increased deletion calls.
Reconstructed copy number states, per SNP, on data set A
(2) We used the MM algorithm described in section 3.3 to estimate the values of all the parameters of the model using data set A, ignoring information on copy number status. Since the data set contains a very small number of homozygous deletions, however, we fixed the emission parameters of this state to their MLE values, obtained in (1). The reconstructed copy number states, based on the emission parameters so estimated, are presented in table . Note that one expects this algorithm to perform significantly worse not only than what is reported in table , but also than algorithms trained on other data sets with full knowledge of copy number states. Despite this, the results summarized in table are still reasonable, and it is still possible to appreciate an advantage in accounting for linkage disequilibrium.
Reconstructed copy number states, per SNP, on data set A, on the basis of parameters estimated ignoring information on CNV
(3) For benchmarking purposes, we also carried out an analysis of data set A with QuantiSNP and PennCNV. While possibly at a disadvantage, having been trained on different data sets, these algorithms have been thoroughly optimized in parameter selection, data quality control, and calling procedures. We want to emphasize that it is not our intention to conduct a full-fledged comparison between these two algorithms and the one we present here. All of these algorithms are very close to each other and are expected to perform similarly (PennCNV can also handle family data, but this is not applicable in our context). Table reports the results of these analyses. The performances are substantially identical and comparable to those we report in table , underscoring the similarity of all these algorithms. Differences in achieved sensitivity and specificity are likely due to implementation details and calling strategies, rather than to underlying probability assumptions. Since our emphasis is on modifications of the likelihood model, to account for linkage disequilibrium and to allow for joint multiple sample analysis, it is more meaningful to use the same algorithm and explore the effect of these modifications one at a time: this is the spirit of the comparison presented in tables and . The likelihood assumptions underlying QuantiSNP and PennCNV are substantially equivalent to the ones we make in the single sample analysis that does not take LD into account. We hope that our results may convince the curators of these programs to implement some of the modifications we suggest.
Analysis of data set A with QuantiSNP and PennCNV
Reconstructed loss rates for the tumor suppressor gene, as in simulation D
B. Signal from the Entire Genome of One Individual
We analyze the entire genome sequence of one individual, keeping the parameters of the HMM fixed at the values estimated on data set A, with the exception of the emission parameters for the diploid state, which are estimated using the entire data for this individual. We require that the local posterior probability of a copy number other than 2 be greater than 0.99 in order to identify a copy number variant. According to this criterion, 20 regions appear to harbor putative copy number variations. Figure gives details on the nature of one of these identified variations. The running time for this analysis was 30 s on a laptop with a single 2.8 GHz processor, using our C++ implementation.
Fig. 9 Data relative to the short arm of chromosome 2 for the studied individual. On the x-axis, we report the positions of queried SNPs in base pairs. The plots, from top to bottom, display the posterior probability of a copy number aberration and the reconstructed copy number.
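The 0.99 posterior calling rule can be sketched as grouping consecutive above-threshold SNPs into candidate regions; the region-merging details below are an assumption for illustration:

```python
def call_cnv_regions(posterior_not2, threshold=0.99):
    """Group consecutive SNPs whose posterior probability of copy number
    different from 2 exceeds the threshold into candidate CNV regions,
    returned as (start, end) index pairs (inclusive)."""
    regions, start = [], None
    for i, p in enumerate(posterior_not2):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(posterior_not2) - 1))
    return regions
```

Raising the threshold trades sensitivity for specificity, which is how the false positive rate is controlled directly at calling time.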
C. Simulated Set of Deletions Using Chromosome X Data
To validate the applicability of the parameters estimated on A to other data, we consider simulation C. This is also a reasonable data set on which to compare the performance of our algorithm with that of PennCNV and QuantiSNP. In particular, since we have argued that these should be comparable to our algorithm when we do not account for LD, we present the results of these analyses to validate our claims. As we have remarked, there are variations in performance due to the calling strategies adopted. For our algorithm we present two results: the Viterbi reconstruction, and a setting where SNPs are called only when the posterior probability of one copy number is larger than 0.99. For QuantiSNP we report calls with a Bayes factor larger than 10, as recommended in the manual. Results of our algorithm, together with those of PennCNV and QuantiSNP, are presented in table . Clearly, the performance of our algorithm remains acceptable on data outside the training set and is indeed remarkably similar to that of the published software.
Reconstructed copy number states, per SNP, on data set C for two versions of our algorithm not accounting for LD (top rows) and for QuantiSNP and PennCNV (bottom rows)
D. Simulated Deletion of a Tumor Suppressor Gene
Since our multisample analysis method is in part motivated by the study of cancer cells, we explore its effectiveness in this context, relying on the simulation described in D. We consider three contamination levels for the genotype data, corresponding to a proportion of 0.2, 0.5, and 0.8 of truly hemizygous signal used to generate the deletion signal. For each of these noise levels, we generate 13 data sets according to the parameters described, and we analyze them using the single sample analysis and the joint multisample method. For each of the data sets, we then calculate the proportion of individuals identified as carrying a deletion at the tumor suppressor gene. Results are in table : clearly the multisample analysis enables us to correctly reconstruct a larger fraction of the deletions, for all noise levels. False positive rates of the two methods are entirely comparable (data not shown).
E. Common Deletion
To illustrate the applicability of our multisample method on actual data, we consider data set E. Analyzing first one sample at a time, we identify 54 individuals carrying the deletion. The average Viterbi path across all 922 individuals in this region is presented in figure . The polymorphic deletion, which spans 17 SNPs, is clearly identified. Subsequently, we analyze this region, estimating location-specific loss and duplication rates for SNPs in the neighborhood. Because multisample analysis of this kind is more sensitive, we want to compare the frequency of individuals with a deletion according to the single sample Viterbi analysis with the location-specific estimate of the deletion rate. In this case, the numbers turn out to be equivalent, suggesting that we had already captured all deletions with our original scan.
Average of Viterbi paths across 922 individuals in 3000 SNPs surrounding a common deletion, comprising 16–17 SNPs.
Conducting the multisample analysis, however, we realized one strength of the single sample analysis: its greater robustness to smooth fluctuations of the LogR signal. We have already mentioned these (fig. ), but it is in the analysis of data set E that their importance becomes more apparent. Consider the situation presented in figure . The LogR signal, in the top panel, clearly fluctuates around a trend, captured here with a spline regression. These variations in LogR carry little information on copy number and are particularly challenging for algorithms that ignore the BAF signal. To illustrate this, we analyzed this sample with the fused lasso method, as implemented in [19]: not surprisingly, its reconstruction is much noisier than the HMM one (fig. ).
Fig. 11 Smooth trends in LogR signal. The top display provides LogR values for the 3000 analyzed SNPs in one individual. The middle plot presents the result of the cghFlasso algorithm applied to the LogR values. The bottom display reports the reconstructed copy number.
To date, we do not have a clear understanding of the experimental effects behind this trend. Other groups have reported associations with the probes' GC content or DNA concentrations [17]. In our data set, we identify samples without an apparent trend, and two clusters that share two different smooth patterns. Multisample analysis is particularly sensitive to these, as it tends to interpret them as weak signal. In order to avoid spurious results, we first pre-process the raw LogR values to estimate a smooth trend and remove it. We then run single sample Viterbi and multisample analyses on the residuals. The single sample Viterbi paths are not substantially different from the ones obtained on the raw data, but the multisample results are much less noisy. Figure illustrates these results. The top panel shows the frequency of deletion according to the Viterbi paths. The middle panel shows the estimated frequency of deletions using multisample analysis for the entire sample, cases, and controls. The bottom panel shows the value of the log-likelihood ratio statistic for comparing the hypothesis of the same frequency of deletion in cases and controls versus different frequencies. Clearly, the overall sample frequency of deletion is well captured by the single sample Viterbi analysis. There are no differences between cases and controls; while the deletion is longer in some of the controls, it is hard to assess the relevance of this finding.
Fig. 12 Multisample analysis of 60 SNPs in the proximity of the studied deletion. The top display contains the proportion of individuals whose Viterbi path was equal to 1 among the 922 analyzed. The middle plot presents the location-specific probability of deletion.
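The case/control comparison can be sketched as a binomial log-likelihood ratio statistic. Our statistic is computed from the HMM-based frequency estimates, so the plain binomial form below is a simplification for illustration:

```python
import math

def lrt_deletion_freq(k_case, n_case, k_ctrl, n_ctrl):
    """Log-likelihood ratio statistic comparing a shared deletion
    frequency in cases and controls (null) against separate frequencies
    (alternative), using binomial likelihoods on deletion carrier counts."""
    def loglik(k, n, p):
        if p in (0.0, 1.0):
            return 0.0 if k in (0, n) else float("-inf")
        return k * math.log(p) + (n - k) * math.log(1.0 - p)
    p0 = (k_case + k_ctrl) / (n_case + n_ctrl)          # pooled frequency
    p1, p2 = k_case / n_case, k_ctrl / n_ctrl           # group-specific MLEs
    return 2.0 * (loglik(k_case, n_case, p1) + loglik(k_ctrl, n_ctrl, p2)
                  - loglik(k_case, n_case, p0) - loglik(k_ctrl, n_ctrl, p0))
```

Under the null of equal frequencies, the statistic is asymptotically chi-squared with one degree of freedom, so values near zero, as observed here, indicate no case/control difference.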