Genotyping of common copy number polymorphisms
Previous approaches to copy number analysis7-10
involve searching a single individual’s genome for regions in which evidence of copy number deviation exceeds a genome-wide significance threshold—an approach that does not make use of prior knowledge. Yet the variation at more than 90% of the loci at which any two individuals differ in copy number across a region >10 kb in size seems due to a limited universe of common CNPs6
. At such loci, a copy number variant unambiguously exists and segregates at an appreciable frequency, and the problem can be redefined not as a problem of ab initio
discovery, but rather of accurate measurement (genotyping) of each individual’s integer copy number level5
The first step in our methodology, Canary (copy number analysis routine), determines the copy number of each individual at each predefined CNP locus. A high-resolution map of common CNPs is needed to define these loci; we used the map of McCarroll et al.6
, but improved maps can be substituted as they emerge. Although an individual probe inside a given CNP may not provide enough information to give an accurate integer measurement of copy number (a copy number ‘genotype’)11
(), multiple probes that interrogate the same CNP segment typically show highly correlated and reproducible patterns of intensity6
(). The measurements for the probes in the same CNP (or, in some cases, for a predefined high-performing subset of those probes; ) are combined into a single summarized intensity measurement, resulting in one summarized measurement per sample (). The summarized measurements for a batch of samples are then clustered into discrete copy number classes using a one-dimensional Gaussian mixture model (GMM), where the expected location of copy number clusters is informed by the results of previous experiments (). The resulting clusters are used to assign a CNP genotype to each sample at each CNP, as well as a score reflecting the confidence of each assignment (Supplementary Methods
Figure 2 Schematic of how a CNP is processed through Canary illustrated with data from chromosome 4. (a) Histogram of raw data from a single copy number probe. (b) Cross-correlation matrix of neighboring probe intensity profiles across 88 HapMap samples (a depicts (more ...)
Validation of the CNP genotypes from such an approach is important, but currently hampered by the lack of a gold-standard set of reference genotypes (such as HapMap12
has provided for SNPs). The vast majority of CNPs have not been previously genotyped with accuracy demonstrated in a set of reference samples. We created one such reference dataset (on the basis of consistency across two independent studies of 263 HapMap samples) that has few mendelian inconsistencies, conforms to Hardy-Weinberg equilibrium and shows strong concordance to fosmid end-sequenced samples6
. However, this specific dataset is inappropriate for validation of Canary, as it would be statistically overfit and therefore inflate measures of performance.
Instead, we assessed the quality of Canary-derived CNP genotypes by examining (i) inheritance in 91 independent parent-offspring trios and (ii) reproducibility across many laboratories. For the 1,177 diallelic CNPs tested (consisting of only a simple deletion or duplication, but not both), genotypes in 91 trios showed a mendelian inconsistency rate of approximately 0.005 per trio per CNP (). Copy number genotypes for 96 multiallelic CNPs6
were assessed for inheritance using Fisher’s h
, which was distributed closely around 1.0, with only one CNP generating a P
value <0.01. Canary genotypes were reproducible across the same HapMap samples run across seven independent labs, achieving an average sample call rate of 96.1% and a sample concordance with our reference dataset of 98.0%. Concordance with 783 independent copy number genotypes obtained by quantitative PCR (in 27 CNPs and 29 samples) averaged 97.6% across the seven labs. This is less complete and accurate than that for SNP genotypes, suggesting that further refinements are needed to either the algorithms or the underlying array data. Nonetheless, this performance across >1,000 CNPs far exceeds that of the small numbers (<100) of CNPs that have been genotyped in any previous study of appreciable sample size.
Genotyping performance of Birdsuite judged by mendelian inheritance (MI) patterns
Genotyping of SNPs
We next turn to SNP genotyping (). For any given genomic segment containing a SNP, samples with two copies of the locus per diploid genome are expected to have one of the canonical SNP genotypes of AA, AB or BB. For most autosomal SNPs we expect all samples to have two copies, but for those overlapping a CNP other possibilities may be observed13
(). For such SNPs, we use the information from Canary (above) to restrict initial SNP genotyping to those samples whose CNP genotype (integer copy number) is two (). Such a partitioning allows for the model of diploid SNP clustering not to be misled by samples that have fewer or extra copies (as might happen if one clustered the raw SNP data shown in ).
Figure 3 Schematic of how a single SNP is processed through Birdsuite and evaluation of mendelian inconsistencies. (a) Raw data for the SNP, plotting allele-A intensity versus allele-B intensity. Unlike most SNPs on the array, this SNP does not form three discrete (more ...)
SNP genotyping is done using Birdseed (see Supplementary Methods
and the BirdSuite web site), a specialized two-dimensional GMM, where the two dimensions are summarized probe intensities for each of the two alleles (A and B). Like Canary, Birdseed utilizes prior models representing the expected allele intensity information for each genotype class, built from previous data for samples of known genotype at each SNP (in this case, 270 HapMap samples and genotypes). Briefly, the algorithm utilizes expectation-maximization14
to determine the location of the AA, AB and BB clusters for each SNP ( and Supplementary Fig. 1
online). These clusters are used to assign a genotype (AA, AB or BB) to each sample along with a score reflecting the confidence of each call. Special procedures are used on the X, Y and mitochondrial chromosomes. Birdseed performance was validated on HapMap samples run on the Affymetrix SNP 6.0 array6
across seven different labs, not including any experiments used to generate the models or develop the algorithm. At the default confidence threshold, call rate on the HapMap samples was 99.47%, and these confident genotype calls were 99.74% concordant with HapMap genotypes, approaching the estimated error rate of HapMap itself. Like previous algorithms, Birdseed is not without bias: minor-allele homozygotes are less well genotyped when the minor allele frequency is low, affecting both call rate and concordance of this class of genotypes. However, Birdseed compares favorably to the BRLMM algorithm1
(comparison done on Affymetrix 500K data, as BRLMM does not work on the Affymetrix SNP 6.0 array; see Supplementary Fig. 2
online). Birdseed as a stand-alone program has been used to genotype over 50,000 samples at the Broad Institute with an average call rate >99% (S.B.G., unpublished observations).
Discovery and genotyping of rare CNVs
Although the first two steps in the framework focus on accurate typing of known, common polymorphisms, it is also possible using the same platforms to identify rare and de novo
copy number variants for which there is no prior knowledge. Such problems of ab initio
discovery are fundamentally more difficult because of the need to distinguish a relatively small number of real CNVs at unknown sites from the statistical fluctuations that arise in any genome-scale dataset. The heterogeneity of probe performances on array platforms further complicates this problem: different probes show different intrinsic measurement variance across samples (a fact seldom modeled by CNV discovery algorithms); furthermore, different SNP probe sets show different quantitative responses to having 0, 1 or 2 copies of each allele (Supplementary Fig. 1
). We therefore sought to model the empirical properties of each probe in order to maximize the power to detect rare CNVs. As in most other algorithms10
, we search for consistent evidence for copy number variation across multiple neighboring probes to reduce the effect of normal statistical fluctuations.
We consider first the task of accurately estimating copy number at a single location in the genome. Having previously run Canary and Birdseed aids in this task, in that they define copy number and allele-specific properties of each probe, as well as noise properties specific to each sample. The locations and variances of the Birdseed posterior clusters (corresponding to AA, AB and BB genotypes in intensity space) together represent an accurate estimate of the emission probability (the probability density function of intensity measurements given a particular underlying state) for each probe on the array in response to a sample with two copies at that locus. The expected intensity profile of ‘copy-variable’ genotypes (A-null, AAB and so on) can then be imputed from the locations of the Birdseed two-copy clusters (). Combining these profiles across genotypes of equivalent copy number models the emission probability of a probe given a sample with 0, 1, 3 or 4 copies of a locus. For the copy number probes on the array, we model only a single cluster per locus representing two copies, which represents the emission probability of normal samples; alternative copy number emission probabilities are imputed analogous to the method for imputing additional SNP clusters (Supplementary Methods).
The assumption that copy number is an integer allows for predefined, strongly modeled states to increase sensitivity. For both SNP probe sets and copy number probes, these emission probabilities allow us to estimate the relative likelihood of each possible copy number level (0, 1, 2, 3 or 4) in a way that is informed by the specific performance of each probe. We note that although this considerably improves performance when assessing germline copy number, it may make Birdseye less suitable for applications where average copy number at a locus is noninteger (such as detection of mosaic copy number changes in heterogeneous tumor DNA).
Even after an empirically modeled interpretation of intensity measurements, the estimate of copy number from single probes can be noisy. The next step is therefore to integrate information across neighboring probes to find strong, consistent evidence for altered copy number states. Birdseye, an HMM-based algorithm, utilizes dynamic programming to perform this search quickly and efficiently across each chromosome15
(). Each segment of discrete copy number is assigned a lod score indicating the relative probability of the variant versus normal copy number in the region (which can be used in downstream analyses to prioritize discovered CNVs on the basis of confidence). Because the approach uses probe-specific variances, noisier probes are inherently downweighted with respect to more responsive probes, reducing the number of false positives one would find by assuming all probes are equal (Supplementary Methods
Figure 4 Discovery of unknown or de novo copy number variation using Birdseye. (a) Raw data from a copy number probe, with one sample (arrow) colored green (top left). Raw data from a neighboring SNP, with the same sample (arrow) colored green (top right). Although (more ...)
To assess the sensitivity and specificity of Birdseye (and lacking a gold-standard dataset), we simulated CNVs across a range of sizes via an in silico gender-mixing experiment. Intensity measurements from consecutive X-chromosome probes from a female sample were replaced with the intensities of the corresponding probes from a male sample, in order to create virtual samples with deletions at known locations (Methods).
Using this simulation framework, iterated thousands of times over dozens of independent female and male samples, we find that for deletions spanning 3, 5 and 10 probes (corresponding to mean sizes of 5 kb, 8 kb and 17 kb), Birdseye identified with lod of 2 or greater 10%, 51% and 97.5% of the events, respectively; as expected, mean reported lod score also increased with deletion size (Supplementary Table 1
online). Breakpoints were typically determined to within a single probe of the simulated CNV, and fewer than one false positive is expected per genome at a lod of 2 or greater (). We note that because the simulation removes local autocorrelation of noise, this may overestimate performance on actual data; higher lod cutoffs may be appropriate in different datasets. Nonetheless, this simulation indicates that the combination of the Affymetrix SNP 6.0 array and the Birdsuite seems highly sensitive for deletions of 10 kb or larger.
Because the mutation rate to create de novo
CNVs is exceedingly low6,16
, when observed in individuals with a sporadic disease phenotype they are particularly good candidates to be causal factors5
. However, in searching for de novo
events it is critical not only to evaluate the evidence in favor of a CNV in a proband (or a tumor), but also to accurately estimate the likelihood against the presence of the CNV in the parents (or normal tissue). Birdseye is designed to address this need: in addition to reporting the evidence in favor of a CNV, it can be used to reassess a discovered region in other samples (such as the proband’s parents) using a framework that does not involve a stringent genome-wide discovery threshold (). Thus one can filter a list of CNVs on the basis of strong evidence against variation in parents, as opposed to simply failing to achieve genome wide-significant evidence in favor of variation (which is frequently a false negative).
Combining copy number and SNP allele information
The CNV events identified by Birdseye, together with the Canary genotypes for common CNPs, yield an assessment of copy number for each sample across its genome; the SNP genotypes from Birdseed describe sequence variation for samples at SNPs with the expected two copies of each locus. The fourth component of Birdsuite, Fawkes (‘fast analysis with kopy-number et SNPs’), merges these results to yield an integrated picture of the genetic variation in each sample. Fawkes utilizes the imputed locations (in A/B intensity space) of copy-variable clusters to assign an allele-specific copy number genotype (such as AAB, ABBB, A or B) at each SNP ( and Supplementary Methods
). Notably, the genotype assignment for a sample is constrained to the set of clusters corresponding to its integer copy number as determined by Canary and Birdseye; allelic copy number is thus informed by measurements not only from the SNP probe, but also from nearby SNP and copy number probes. This approach differs fundamentally from earlier attempts to estimate allele-specific copy number from intensity data at individual SNPs13,17
We evaluated the genotypes provided by Fawkes of autosomal SNPs across a set of 790 ancestrally diverse samples. Across these samples (comprising 689,451,960 total genotypes) Fawkes changed 717,301 SNP genotypes in 267,070 unique SNPs as compared to running Birdseed alone. A small fraction of SNPs (5,600) had a copy-variable call in at least 1% of unrelated individuals. In the 91 parent-offspring trios available in this dataset, Birdsuite genotypes showed a lower rate of mendelian inconsistencies versus those from Birdseed alone (). In fact, every family analyzed showed a lower rate of mendelian inheritance errors, with an average decrease of 5% and a maximum decrease of 45% (). This indicates that copy number variation is an infrequent but not insubstantial source of apparent mendelian inconsistency in all samples, and a major contributor in select samples (potentially owing to cell line artifacts that affect whole chromosomes). As expected, the rate of mendelian inconsistency was particularly reduced across the 11,256 SNPs that lie within common CNPs6
, with an average reduction of 33% ().
The removal of errors due to copy number variation, whether the CNVs originated in the germ line, somatically, or during cell culture, results in higher-quality data and increases the number of SNPs that pass typical filters applied to whole-genome association studies.
Comparison to other algorithms
We carried out a preliminary analysis to test the ability of Birdsuite as a whole to call CNVs. CGH- or sequence-confirmed CNVs discovered using fosmid end sequencing on eight HapMap samples were used as a reference dataset18
. Combining Canary calls surpassing the default confidence threshold with Birdseye calls surpassing a lod cutoff of 5, we recovered 56% of the reference CNVs that overlap at least 2 probes on the array, and 94% of those that overlap at least 20 probes. We compared these results to those from two commercial copy number analysis platforms, Nexus and Partek, using the same set of CEL files as input and default thresholds. At these settings, Nexus recovers 26% and 73% of CNVs overlapping at least 2 probes and at least 20 probes, respectively, and Partek recovers 4% and 12%. Relaxing the Nexus parameters to allow significantly more total CNV calls per genome than Birdsuite calls boosts Nexus sensitivity to 36% and 74%, still well below that of Birdsuite (Supplementary Note
Association testing in regions of altered copy number
The methods above, in conjunction with a map of common polymorphisms (both single nucleotide and copy number) and hybrid arrays for detection6
, allow characterization of the genetic variation in each sample with high accuracy and in a more comprehensive manner than previously possible. In addition to providing discrete calls, this framework provides a confidence of each call, which serves as a good guideline as to data and genotype quality. (In the downstream analyses that follow, we use these confidences only as a threshold for inclusion or exclusion of data; we note that such filtering has the potential to introduce bias, and methods that incorporate the uncertainty may perform better.)
The utility of genotypes from the Birdsuite, however, is not realizable unless analysis tools can accept and evaluate CNVs and noncanonical SNP genotypes, and test them for association with phenotype in a statistically robust manner. Specifically, in addition to performing the typical SNP test of association, one needs to assess the potential relationship of phenotypes to total copy number and allele-specific copy number (for example, AAB versus ABB). For example, a locus may be haploinsufficient when the remaining copy carries a low-expression allele, but not linked to the phenotype if the remaining copy is a high-expression allele. It is also important to assess association of phenotype with a collection of individually rare CNVs that overlie a common locus.
We have developed and implemented one such initial approach to test for such associations. For sites showing both allelic and copy number variation, we regress the phenotype (either a quantitative trait or disease status) on both the sum and the difference of the number of copies of each allele. A significant regression coefficient for the sum represents an association with overall copy number, whereas a significant coefficient for the difference represents an association with variation tagged by a SNP. If there is either no CNV or no SNP at a site, we fit a reduced model by removing the appropriate term (the sum or the difference, respectively), effectively giving a standard one-dimensional regression of phenotype.
Simulations indicate that this model is often considerably more powerful than the traditional approach that does not explicitly represent noncanonical genotypes (Supplementary Table 2
online). This joint approach is particularly powerful under certain scenarios, such as if a duplicated form of a ‘low-activity’ allele shows normal activity comparable with the wild-type allele (for example, when A and BB have similar phenotypes, different from B). In addition, the joint model can often disentangle SNP and CNP effects; for example, it can indicate whether a duplication or the allele that happens to be duplicated (or both) affects phenotype. The new release of the whole-genome association toolset PLINK19
now directly accepts Birdsuite output to perform these tests.
As has been observed in SNP genotyping20
, results of association analyses using Canary CNP genotypes can be sensitive to differential bias that can arise when data from individual plates or batches are incorrectly clustered. It is prudent to search for any individual plates or batches for which the Canary clustering may be incorrect; this can be accomplished by visual inspection of the underlying intensity data, or by automated identification of individual plates for which the frequency of the observed genotype classes is statistically unusual given the distribution on the other plates.
For rare or de novo
CNVs, one might expect a potentially stronger effect on phenotype—but in a smaller number of samples, as is the case with the 16p11.2 deletion recently discovered (using Birdseye) that seems to explain 1% of idiopathic autism21
, or one of numerous deletions associated with schizophrenia22
. In such a scenario, aggregation of events in cases at a particular locus, coupled with a lack of events in controls, can indicate that the region affects the assayed phenotype. The new release of PLINK includes a set of tools for manipulating, summarizing and analyzing rare and de novo
CNVs output from Birdseye (see URLs section in Methods).