We have carried out a high-resolution analysis of CNVs affecting OR gene and pseudogene loci, and identified many OR-related CNVs for which copy-number variability has not been reported previously. In contrast to a previous low-resolution study 
we observe that CNVs are enriched among evolutionarily young ORs as well as pseudogenes. Our results differ from those published by Nozawa et al., probably because of the substantial increase in resolution (nearly 2 orders of magnitude, from 50–100 kb to <1 kb), which enabled us to focus on single loci rather than clusters, and thus to distinguish unaffected OR genes (non-CNVs) from adjacently located affected OR pseudogenes (CNVs). Our analysis suggests that both formation bias and evolutionary constraints have likely shaped the distribution of CNVs in the human OR repertoire. In practice, the two biases are difficult to distinguish. For instance, pseudogenes and other repeats are known to be enriched in vertebrate gene deserts 
, presumably owing to both mutational and selective biases. Also, we defined OR pseudogenes based on the absence of an intact open reading frame in the human reference genome. As a result, some loci classified as intact genes may in reality be non-functional, owing to missense mutations 
or mutations affecting non-coding regulatory elements. However, both of these confounding factors presumably affect only a minority of loci and are thus unlikely to influence our conclusion that CNVs are depleted among OR genes relative to OR pseudogenes.
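The depletion of CNVs among OR genes relative to OR pseudogenes discussed above is the kind of contrast that can be checked with a 2x2 contingency test. The following is a minimal sketch of a one-sided Fisher's exact test, not the statistical procedure used in this study; the function name and the locus counts are hypothetical and chosen only for illustration.

```python
from math import comb


def fisher_exact_less(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: genes, pseudogenes. Columns: CNV-affected, unaffected.
    Returns P(X <= a) under the hypergeometric null, i.e. the probability
    of observing a or fewer CNV-affected genes if CNV status were
    independent of gene/pseudogene status.
    """
    genes = a + b            # row total for genes
    affected = a + c         # column total for CNV-affected loci
    n = a + b + c + d        # all loci
    # Smallest feasible value of the top-left cell given the margins.
    lowest = max(0, affected - (n - genes))
    total = comb(n, affected)
    tail = sum(
        comb(genes, k) * comb(n - genes, affected - k)
        for k in range(lowest, a + 1)
    )
    return tail / total


# Hypothetical counts (NOT taken from this study):
# 20 of 100 genes and 40 of 100 pseudogenes affected by CNVs.
p_value = fisher_exact_less(20, 80, 40, 60)
print(f"one-sided p-value for depletion among genes: {p_value:.4g}")
```

A small p-value would indicate fewer CNV-affected genes than expected under independence. It would not by itself separate formation bias from selective constraint, which is the difficulty noted above.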
Furthermore, our data showed a bias for CNV-enriched OR loci to be located between tandemly oriented segmental duplications, which are known to induce NAHR 
. Besides NAHR, other formation processes such as non-homologous end-joining (NHEJ; 
) are likely to play a role in the genesis of CNVs affecting OR loci. In the future, determining the relative contribution of such mutational processes will require the identification of the breakpoint junction sequences of numerous CNVs. In addition to regional biases caused by different mutational processes involved in CNV formation, large CNVs may sometimes span both genes and adjacently located unprocessed pseudogenes. Unprocessed pseudogenes often arise through tandem duplication of OR loci followed by inactivation of the newly generated, neutrally evolving, paralogous loci. This may lead to biases in the frequency at which pseudogenes vary in copy number, depending on selective constraints acting on adjacent functional loci. Finally, such a bias may occur even in the absence of large CNVs, as CNVs affecting pseudogenes in the proximity of genes may be detrimental due to long-range regulatory effects (e.g., through interfering with non-coding regulatory elements). Furthermore, the enrichment of CNVs among ORs located in close proximity to telomeres/centromeres may also reflect CNV formation biases or selective biases. In this regard, human subtelomeric regions are enriched for segmental duplications, and NHEJ and NAHR presumably operate efficiently in those regions 
. At least for some cases, we were already able to present sequence-based evidence for the likely involvement of NAHR in CNV formation. In particular, we demonstrated that CNVs causing a fusion of tandemly oriented OR genes were presumably formed through NAHR (Figure S9
). Such events exemplify a potential mechanism for accelerated functional diversification of ORs, in which paralogs are created with a new function or regulation pattern from the outset, rather than through the process of sequential duplication and divergence. Consequently, it may be hypothesized that large OR subfamilies came into existence through frequent and/or large duplication events, implying that genes from large OR subfamilies would tend to reside in CNV loci. However, our data showed no obvious correlation between OR subfamily size and averaged variances of R
). This may partly be due to a confounding factor, namely the reduced sensitivity of microarrays in detecting CNVs within regions sharing very high sequence similarity (see additional discussion below); such regions are enriched in the largest OR subfamilies, and our microarrays may have failed to detect CNVs in these.
As discussed above, our high-resolution data helped to clarify that CNVs do not randomly affect genes and pseudogenes, and that for OR genes purifying selection may operate on top of formation biases. In evolutionary terms, CNVs, which are variants en route
to fixation, have good potential to influence the OR repertoire size. Here, we have presented evidence for an abundance of polymorphic gene loss events affecting the most copy-number variable group of ORs, i.e. a group classified here as evolutionarily “young”. This may point to one possible underlying mechanism for the well-documented diminution of the human OR repertoire, which is reflected in its considerably smaller size (851 ORs) compared with dog and chimpanzee (~1000 ORs) and with rat and mouse (~1400 ORs) (
and references therein). It should be noted, however, that although these ORs were herein classified as “young” for simplicity, they do not necessarily have to represent recent gene duplicates. In particular, due to the orthology assignments, ORs that underwent deletion or duplication in the chimpanzee genome are also classified as “young” in our study. In contrast, the more “ancient” ORs potentially provide a more stable backbone of the olfactory subgenome, which is less affected by CNVs and also appears to have an overall positive balance between gains and losses. This slight enrichment for gains may imply stronger evolutionary constraints acting on these ORs, as losses are thought to be more detrimental than gains, and genes under purifying selection are more biased away from deletions than from duplications.
The identification of 9 deletion alleles, encompassing 15 OR gene loci and present at appreciable frequency, is significant for studies of olfactory function. Previously, functional OR gene inactivation alleles, involving SNPs leading to in-frame stop codons or substitution of conserved amino acids in an otherwise unmodified OR locus sequence, have been reported 
. Such alleles were subsequently linked to individual human responses to specific odorants, using both in-vitro
and association study approaches 
. However, large deletions have not so far been reported among the variants used for genetic association studies. The present identification of a number of unexpectedly frequent deletion alleles (with deletion allele frequencies of up to 0.6), some of which encompass several genes from the same OR subfamily, thus provides additional strong candidates for genetic association studies of human olfaction. To this end, we have recently initiated a CNV genotyping experiment using qPCR for the herein reported deletion I
against a Caucasian cohort 
of 94 subjects, phenotyped for olfactory acuity towards eight odorants (unpublished). The results were inconclusive, probably due to the low number of samples and odorants involved. Future association studies will require larger sample sizes and, ideally, a priori in-vitro
assessment of ligand specificity for the affected ORs.
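To illustrate why sample size matters for such association studies, the sketch below computes the genotype counts expected in a cohort of 94 subjects for a deletion allele. It assumes Hardy-Weinberg equilibrium, which was not tested here, and it uses the highest deletion allele frequency mentioned above (0.6) purely as an illustrative value; the frequency of the specific deletion that was genotyped is not implied. The function name is hypothetical.

```python
def hwe_expected_counts(deletion_freq, n_subjects):
    """Expected genotype counts for a biallelic deletion polymorphism,
    assuming Hardy-Weinberg equilibrium (an assumption of this sketch)."""
    if not 0.0 <= deletion_freq <= 1.0:
        raise ValueError("deletion_freq must lie between 0 and 1")
    q = deletion_freq        # frequency of the deletion allele
    p = 1.0 - q              # frequency of the non-deleted allele
    return {
        "non-deleted/non-deleted": p * p * n_subjects,
        "non-deleted/deleted": 2 * p * q * n_subjects,
        "deleted/deleted": q * q * n_subjects,
    }


# Illustrative values: allele frequency 0.6, cohort of 94 subjects.
counts = hwe_expected_counts(0.6, 94)
for genotype, expected in counts.items():
    print(f"{genotype}: {expected:.1f} subjects expected")
```

Under these assumptions, roughly 15 subjects would carry two intact copies, 45 would be heterozygous and 34 would be homozygous for the deletion. For rarer deletion alleles the homozygous class shrinks quickly, which is one reason larger cohorts are needed.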
Our study also has certain technical limitations. First, while microarrays represent the most cost-effective method for studying CNVs at large scale and high resolution, cross-hybridization limits their specificity and sensitivity in repetitive genomic regions. Cross-hybridization results in averaging of the signal over several loci, and is thus more likely to lead to false negative than to false positive CNV calls when using a stringent cutoff for scoring the arrays. A second potentially confounding factor is the inter-individual sequence variability, i.e. SNPs and small indels, that may affect probe hybridization. Yet, different probes on our arrays are generally
1 bp apart from each other, there are dozens of probes for each locus, and the signal is analyzed over all probes mapping to an OR locus. Thus, inter-individual sequence variability in specific probes, typically at the level of 1 SNP per kilobase, is unlikely to considerably affect our CNV calls. Furthermore, our qPCR results indicate that the false-positive rate in our microarray experiments is relatively low, in contrast to a considerable false-negative rate, which was expected given the stringent cutoff applied for scoring the arrays. Third, the comparative nature of our analysis may lead to an overestimation of the frequencies of some CNVs if the reference sample carries a rare allele. In such cases, a majority of the remaining samples are expected to show only one type of change (gain or loss). Importantly, this would not change the CNV status of the locus (i.e. whether the OR is considered to be copy-number variable or not), and thus does not affect the main conclusions of this study. We nevertheless specifically addressed this issue by calling CNVs independently of the reference individual (see Text S1
), an analysis that did not considerably affect our overall CNV counts and did not change the conclusions of our study. Fourth, a considerable fraction of CNVs may represent recurrent, rather than common variants emerging from single mutational events (Table S3
). Coherently distinguishing recurrent from common CNVs will be a challenging task that requires breakpoint-resolution data, which are currently available for only a few CNVs affecting ORs. Finally, a large portion of the CNVs (62%) reported in DGV to intersect with OR loci were not observed in our study. This is likely attributable in part to the relatively low number of samples we analyzed and to false-negative calls in our study, but also to the fact that the size ranges of most CNVs listed in DGV have been overestimated (in this regard, note the excellent recent survey published in 
). Furthermore, a parallel survey of CNVs affecting functional OR loci was published while the present paper was under review 
. In that study, the authors report a statistical analysis of a subset of CNVs listed in DGV, as well as an experimental validation of CNVs recorded in DGV that affect a set of 37 OR loci. In agreement with our data, they failed to validate 16 out of the 37 CNVs tested, despite using 50 samples of diverse ancestry. Altogether, these results support the view that CNV sizes were overestimated in previous low-resolution surveys (
) and stress the relevance of systematic follow-up studies focusing on CNV subsets.
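The first limitation above, signal averaging caused by cross-hybridization, can be made concrete with a simple dilution model. The sketch below assumes that a probe set hybridizes equally to several paralogous loci and that only one of them differs in copy number between test and reference. This is a simplification and not the scoring procedure used for our arrays, and the function name and cutoff value are hypothetical.

```python
from math import log2


def expected_log2_ratio(test_copies, n_loci, ref_copies=2):
    """Expected log2(test/reference) signal when probe intensity is
    averaged over n_loci equally contributing paralogous loci.

    Only one locus differs between the samples (test_copies versus
    ref_copies); the remaining n_loci - 1 loci carry ref_copies in both.
    """
    if n_loci < 1:
        raise ValueError("n_loci must be at least 1")
    test_total = test_copies + ref_copies * (n_loci - 1)
    ref_total = ref_copies * n_loci
    if test_total == 0:
        # Homozygous deletion of a unique locus: no signal in the test sample.
        return float("-inf")
    return log2(test_total / ref_total)


# Hypothetical stringent cutoff for calling a loss (illustrative only).
HYPOTHETICAL_CUTOFF = -0.5

# Heterozygous deletion (1 copy instead of 2) at one of n paralogous loci.
for n in (1, 2, 4):
    ratio = expected_log2_ratio(test_copies=1, n_loci=n)
    called = ratio <= HYPOTHETICAL_CUTOFF
    print(f"{n} cross-hybridizing loci: log2 ratio {ratio:.3f}, called: {called}")
```

In this model the expected signal of a heterozygous deletion shrinks towards zero as more loci contribute to the same probes, so a stringent cutoff misses the event rather than calling a spurious one. This is consistent with the expectation of false-negative rather than false-positive calls in repetitive regions.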
In conclusion, our results emphasize the importance of carrying out genome-wide CNV surveys at high resolution. This is especially important if one aims to identify events relevant to association studies, which requires delineating the nature of each CNV event (i.e. deletion/duplication or complex, common or recurrent), exact CNV boundaries, and CNV population frequencies. Our study thus both provides insights into the evolution of the largest human gene family and suggests specific targets for subsequent association studies.