In this paper, we propose a multilevel model that provides absolute estimates of allele-specific copy number at polymorphic loci and total copy number at nonpolymorphic loci. Our observation that batch effects and copy number changes are often indistinguishable in their effects on the data has led to an estimation procedure that defines batch-specific prediction regions for rare, nonbiallelic genotypes from more commonly observed biallelic genotypes. In particular, biallelic genotype calls of samples in the experimental data set provide naive estimates of allele-specific copy number that are used to derive robust-to-outlier parameters for the background and slope in a linear model fit on the intensity scale. Conditional on the naive estimates of copy number, the log A and log B intensities are correlated due to cross-hybridization of the probes for these alleles. As locus-specific estimates of the covariance are often based on a small number of observations, shrinking these estimates toward typical values provides additional robustness to unusually small or large variance estimates. The copy number estimates obtained from this approach are robust to batch effects and robust to a large proportion of individuals having a copy number alteration. Copy number estimates plotted versus physical position can be used to assess issues such as cell contamination.
Our procedure for locus-level estimation of copy number reduces the impact of batch effects in several ways. First, we quantile normalize the raw intensities for each sample to a target reference distribution. Quantile-normalizing samples to a target distribution ensures that the distribution of intensities is the same across batches and can reduce the occurrence of in silico batch effects incurred by how the samples were grouped by the software. Secondly, we do not rely on external data for training the model or a reference set for estimating ratios. In our experience, external data sets are unlikely to provide useful extrapolations due to (i) batch effects and (ii) latent biological differences between the external samples and the test samples. Third, we fit a multilevel model to the summarized intensities that allows locus-specific parameters for background and signal to depend on batch. Finally, we provide an option to iteratively estimate model parameters and compute copy number such that the bias of the estimated copy number in regions of the genome that are commonly altered is relatively robust to the assumption that the median copy number is 2. This approach does not require prior knowledge of the locations of regions that are commonly altered.
The R package crlmm may not be appropriate for some data sets. For the study of germline traits, our model is most useful when 25 or more samples have been processed together in a batch. Batches with fewer than 25 samples tend to have a large number of SNPs with unobserved biallelic genotypes. For such SNPs, the additional uncertainty from the imputation of the within-genotype moments further reduces the resolution for detecting copy number alterations. For the study of somatic cell diseases such as cancer, copy number alterations in a substantial fraction of the genome are not uncommon. The ability to accurately estimate the absolute copy number in cancer samples will depend to a large degree on an appropriate experimental design, in particular, whether normal controls were processed alongside the cancer samples throughout the experiment. In such a setting, we have demonstrated through simulation that copy number estimates for the test samples can be computed from model parameters that were estimated from only the normal controls with relatively small bias.
Of the models previously proposed in the literature, the models of Wang and others (2008) and Korn and others (2008) are the most similar to ours as both develop bivariate normal prediction regions for altered copy number. The Wang model provides allele-specific estimates of copy number that accounts for the correlation of A and B allele intensities. However, the Wang model is designed for an earlier version of the Affymetrix platform that contained only SNP probes and relies on training data to estimate model parameters. In addition, the Wang model does not address batch effects or explore shrinkage for improving variance estimates. The Korn model is similar to the Wang model with a few important differences. First, Korn and others recommend fitting their software by plate to address batch effects. In our experience, batch effects persist in by plate analyses with Birdsuite. Secondly, Birdsuite does not provide locus-specific estimates of copy number. Rather, Birdsuite houses separate algorithms for calling rare and common copy number variants that each borrow strength from neighboring loci to identify regions of copy number gain or loss. As locus-level estimates are not available, options to explore alternative smoothing algorithms, such as segmentation for samples that are mosaic in copy number, are not available. By contrast, crlmm advocates an approach in which the decision to explore segmentation or HMMs can be evaluated from visualizations of the locus-level summaries. In particular, mixtures of cell populations that give rise to noninteger copy number can be assessed by plotting the locus-level summaries.
Our model can be extended in several ways to improve the prediction regions for biallelic and nonbiallelic genotypes. First, our model assumes that the average intensity increases linearly with allelic dosage. The linearity assumption appears reasonable for many SNPs and can be used to help discriminate between outliers and sparsely populated non-biallelic genotypes. However, the linearity is not apparent for many SNPs and departures from linearity become more pronounced for allelic copy numbers greater than 2. Approaches that allow departures from linearity are a future direction of this work. Secondly, improvements to the univariate prediction regions at nonpolymorphic loci are needed. Again, approaches that relax the assumption of linearity are needed. Thirdly, we currently model batch as a fixed effect. A compromise between a fixed-effect and random-effect model that borrows strength across batch is likely to be very effective, particularly for small batches and SNPs with low minor allele frequencies. Finally, adjusting for sequence characteristics such as GC content and fragment length can be helpful for reducing the variance associated with the probe effect. We will explore methods that adjust for these factors along with batch effects in the future.
Our results provide a strong indication that a model-based approach for estimation of absolute, allele-specific copy number can be effective in large studies with pronounced batch effects, and that borrowing strength across loci can be useful for estimating the variance. Estimates of copy number and the corresponding uncertainty will be useful for downstream assessments of copy number–phenotype association.