Structural variants in the human genome, such as CNVs or balanced inversions, represent a major form of variation with widespread functional consequences. Numerous surveys mapping CNVs at varying levels of resolution have created a comprehensive CNV inventory, with the latest survey reporting 1,098 CNVs on average between two individuals, spanning nearly 0.8% of the genome. Collectively, the list of reported CNVs presently involves 8,410 loci (Database of Genomic Variants, DGV) when applying the frequently used operational definition for CNVs, i.e., gains and losses of segments 1 kb or larger in size.
Recent studies have associated CNVs with various phenotypes, both benign and disease-related, including cancer, HIV-1/AIDS susceptibility, autoimmunity, and complex disorders (and references therein). Yet, while different conceptual approaches for CNV-discovery have been developed, most CNV analysis approaches presently do not distinguish CNVs based on the copy-number of the underlying DNA segment, i.e., its copy-number genotype, a distinction that is crucial for leveraging CNV assignments in studies focusing on genome evolution and genotype-phenotype associations. For example, copy-number genotypes enable distinguishing bi-allelic loci (i.e., loci at which, in addition to the reference allele, either a single duplication or a single deletion allele is observed) from multi-allelic loci (i.e., loci with more than one variant allele, such as a deletion and a duplication, or multiple duplications). Furthermore, in bi-allelic loci copy-number genotypes allow discriminating heterozygous from homozygous CNVs. Such information is crucial in association studies, where the failure to assign locus copy-numbers or to discriminate heterozygotes from homozygotes limits statistical power. Recently, improvements in microarray technology have led to advances in CNV analysis by facilitating the ascertainment of copy-number genotypes in genomic regions amenable to hybridization, using high-resolution array comparative genomic hybridization (array-CGH) or state-of-the-art SNP/CNV hybrid array platforms. While microarrays have the advantage of enabling CNV ascertainment at high throughput and low cost, their resolution can be limited in CNV-rich regions involving segmental duplications (SDs). This might be because of probe cross-hybridization, which may reduce the number of effective oligonucleotide probes that can be designed for these regions. Indeed, commercial microarray-based approaches for copy-number genotyping are restricted to genomic loci for which probes are available at sufficient densities, while custom array designs may compensate for low probe densities but remain limited to regions for which effective probes can be designed.
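The distinction between bi-allelic and multi-allelic loci, and between heterozygous and homozygous CNVs, can be illustrated with a minimal sketch. It assumes autosomal loci with a diploid reference copy-number of 2 and takes the set of integer copy-number genotypes observed across samples as input. The function names and the classification rule are illustrative assumptions, not a published method. The rule is also a simplification: a copy-number of 2 at a multi-allelic locus could, for instance, arise from one deletion allele paired with one duplication allele.

```python
REFERENCE_COPY_NUMBER = 2  # assumed diploid reference state (autosomal locus)


def classify_locus(copy_numbers):
    """Classify a locus from integer copy-number genotypes observed across samples.

    Heuristic (an assumption of this sketch): a single deletion allele can only
    produce diploid copy-numbers 0, 1, or 2; a single duplication allele can only
    produce 2, 3, or 4. Any other combination implies more than one variant allele.
    """
    observed = set(copy_numbers)
    if not observed:
        raise ValueError("need at least one copy-number genotype")
    if observed == {REFERENCE_COPY_NUMBER}:
        return "non-variable"
    if observed <= {0, 1, 2}:
        return "bi-allelic deletion"
    if observed <= {2, 3, 4}:
        return "bi-allelic duplication"
    return "multi-allelic"


def zygosity(copy_number, locus_class):
    """Translate a copy-number genotype at a bi-allelic locus into zygosity."""
    if locus_class == "bi-allelic deletion":
        table = {2: "homozygous reference", 1: "heterozygous", 0: "homozygous variant"}
    elif locus_class == "bi-allelic duplication":
        table = {2: "homozygous reference", 3: "heterozygous", 4: "homozygous variant"}
    else:
        raise ValueError("zygosity is only defined here for bi-allelic loci")
    return table[copy_number]
```

For example, a locus with observed genotypes 2, 1, and 0 would be treated as a bi-allelic deletion, whereas observing both 1 and 3 at the same locus would mark it as multi-allelic.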
Recent breakthroughs in ‘Next Generation Sequencing’ (NGS) technologies have stimulated the development of computational approaches that enable the discovery of CNVs with excellent quantitative and spatial resolution. In this regard, several studies have demonstrated that the sequencing depth-of-coverage of NGS reads can be employed for CNV-discovery. For example, Xie et al. and Chiang et al. described CNV-discovery approaches conceptually related to array-CGH analysis, whereby the read-depth in genomic intervals is compared between pairs of samples to detect CNVs as relative changes in studies involving case/reference setups (e.g., cancer tissue vs. healthy tissue). Furthermore, Alkan et al. recently reported an elegant read-count based approach for mapping locus copy-number differences in large (≥20 kb) SDs using high-coverage (6 to 20-fold coverage) NGS data, by equating averaged and rounded read-counts in individual samples with integer locus copy-numbers. However, with recent advances enabling the sequencing of hundreds of genomes in studies focused on population genetics or genotype-phenotype correlations, a statistical framework for copy-number genotyping will soon become a prerequisite for the probabilistic ascertainment of CNV sets in NGS-based association studies. To be useful for genome-wide association studies, an NGS-based copy-number genotyping approach needs to provide absolute locus copy-number estimates in a sample-specific manner and needs to determine a confidence value for each copy-number genotype (to maximize statistical power). Furthermore, it should enable accurate ascertainment of a wide range of CNVs, including rare and common ones, and including those in the 1–20 kb size-range, a highly abundant CNV size-class. Lastly, the ability to utilize low-coverage (i.e., ≤4× coverage) NGS datasets, such as the ones generated by the ‘1000 Genomes Project’ (1000GP; see http://1000genomes.org), will be a crucial asset for such a copy-number genotyping approach, given that sample number and sequencing coverage will continue to trade off against each other in future association studies.
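To make the contrast between a rounded read-depth estimate and a probabilistic copy-number genotype concrete, the following is a minimal sketch. It is not the CopySeq model, nor a reimplementation of any of the cited approaches. It assumes that the read-count in a locus is Poisson-distributed with a mean proportional to the copy-number, and that the prior over copy-numbers is flat. The function names and the parameters `reads_per_copy`, `max_copy_number`, and `background_rate` are hypothetical.

```python
import math


def rounded_copy_number(locus_depth, diploid_depth):
    """Point estimate: scale depth so the diploid baseline maps to 2, then round."""
    return round(2.0 * locus_depth / diploid_depth)


def copy_number_posterior(read_count, reads_per_copy,
                          max_copy_number=6, background_rate=0.01):
    """Posterior over copy-numbers 0..max_copy_number (Poisson model, flat prior).

    reads_per_copy: expected reads in this locus per DNA copy (assumed known,
    e.g. estimated from the sample's genome-wide coverage).
    background_rate: small floor on the Poisson mean so that copy-number 0
    can tolerate occasional mis-mapped reads (an assumption of this sketch).
    """
    log_likelihoods = []
    for copy_number in range(max_copy_number + 1):
        rate = max(copy_number * reads_per_copy, background_rate)
        log_likelihoods.append(
            read_count * math.log(rate) - rate - math.lgamma(read_count + 1)
        )
    # Normalise in a numerically stable way (subtract the peak before exponentiating).
    peak = max(log_likelihoods)
    weights = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(weights)
    return [weight / total for weight in weights]


def call_genotype(read_count, reads_per_copy, **kwargs):
    """Return the most probable copy-number and its posterior probability."""
    posterior = copy_number_posterior(read_count, reads_per_copy, **kwargs)
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior[best]
```

Under these assumptions, 40 reads at 20 expected reads per copy yield copy-number 2 with a high posterior probability, whereas 4 reads at 2 expected reads per copy yield the same call with a much lower one. This illustrates why a confidence value per genotype matters at low coverage, where a rounded estimate alone would hide the uncertainty.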
Here, we present CopySeq, a statistical framework for copy-number genotype inference from low-coverage genomes, which is available at http://embl.de/~korbel/copyseq/. As a benchmark, we used CopySeq to genotype a set of CNVs previously analyzed with microarrays and obtained excellent genotyping concordances for CNVs across a wide size-range. In addition, as a proof-of-principle we used the approach to infer copy-number genotypes in the largest human gene family, in which many genes and pseudogenes are embedded within SDs: we analyzed the >800 olfactory receptor (OR) genes and pseudogenes in the human genome. OR genes form one of the most genetically variable and rapidly evolving protein-coding gene families and display a strong enrichment for CNVs compared to most other gene families. Thus, the OR gene family represents an appealing model for assessing copy-number genotype ascertainment using low-coverage sequencing and for studying the effect of CNVs on protein-coding loci. Owing to the comparative nature of earlier studies, CNVs in ORs have thus far mostly been reported as gains or losses relative to an arbitrarily chosen reference sample, and for most ORs no absolute locus copy-number assignments have been reported. Thus, the full nature and extent of copy-number variation in ORs has remained unknown. Notably, it is presently unclear to what degree single deletions or duplications (i.e., bi-allelic CNVs) or multiple recurrent CNV-formation events (i.e., multi-allelic CNVs) affect particular OR loci, information that is crucial for functional analyses, as multiple alleles can reduce signals in association studies. Our analysis of ORs using CopySeq revealed a widespread diversity in integer locus copy-numbers in human OR loci in the 150 individuals assessed. We report that OR loci segregate into bi-allelic CNVs, multi-allelic CNVs, and non-variable loci, with notable population differences at some OR loci. In addition, our analysis enabled us to address and further dissect genomic biases that may influence the extent of CNVs affecting ORs, including functional (genes vs. pseudogenes), DNA sequence context (non-repetitive vs. repetitive DNA), and evolutionary (‘young’ ORs) biases.