A central strategy in the genetic study of human diseases is to identify genomic DNA variations related to clinical phenotypes. Human genomic variation exists in many forms, including single nucleotide polymorphisms (SNPs), simple repeat elements, microsatellites and structural variations such as copy number variations (CNVs) (1). A CNV is defined as a chromosomal segment, at least 1 kb in length, whose copy number varies in comparison with a reference genome (2). A significant fraction of CNVs are likely to have functional consequences, due to gene dosage alteration, disruption of genes, positional effects or the uncovering of deleterious alleles (3). Thus, comprehensive identification and cataloging of CNVs will greatly benefit the genetic and functional analysis of human genome variation.
Multiple techniques have been developed to detect deletions or duplications in the human genome and other mammalian genomes (5), and many of them depend on analyzing patterns of signal intensities across the genome. Traditionally, large chromosome rearrangements have been detected by array-comparative genomic hybridization (CGH) techniques that analyze the fluorescence signal intensities of clones (6–9). Another comparable platform for CNV detection is whole genome oligonucleotide arrays. Since the design of these arrays does not depend on SNPs, this technology can achieve complete genome coverage with higher precision for boundary inference of CNVs. With the recent rise in popularity of genome-wide association studies, high-density SNP genotyping arrays have become commonly used for CNV detection and analysis. With such arrays, signal intensity is measured for each allele of a given SNP, and analysis of signal intensities across all SNPs in the genome is used to infer CNVs (10). More recently, to improve the coverage of SNP arrays for CNV analysis, manufacturers of SNP genotyping arrays, such as Affymetrix and Illumina, have incorporated nonpolymorphic (NP) markers into their arrays, especially in known CNV regions.
Although ‘losses’ and ‘gains’ have traditionally been used to describe the major classes of CNVs, CNVs in a diploid genome are in fact chromosome-specific events. That is, a CNV can occur on either of the two homologous chromosomes, and a region can even be deleted on one chromosome but duplicated on the other. Knowing the chromosome-specific copy number is important to the development of linkage and association tests for CNVs. However, the commonly used CNV detection techniques mentioned above all depend on signal intensity measures, and are therefore unable to infer the copy number on each homologous chromosome. The efficient utilization of family information can potentially help circumvent this issue. Furthermore, since most CNVs follow Mendelian inheritance (8), the use of family information can improve the sensitivity and specificity of CNV detection (12). In fact, family-based designs are now commonly used in genome-wide association studies, making it highly desirable to develop methods to infer chromosome-specific copy numbers. For example, in a recent CNV study on autism spectrum disorders, 751 families were genotyped using the Affymetrix genome-wide 5.0 Human SNP arrays (13); in our ongoing study, 943 autism families were genotyped using the Illumina HumanHap550 SNP arrays (14). Other family-based genome-wide association studies include the Framingham heart study (15), a multiple sclerosis study (16) and type I diabetes studies (17).
To use family information in the analysis of CNVs, Kosta et al. developed an approach to infer chromosome-specific copy numbers for nuclear families after the total copy numbers are obtained from quantitative PCR. In our previous CNV analysis (12), we incorporated family information in a two-step procedure: family members were first analyzed independently to generate CNV calls, and these calls were then combined and post-validated using the family relationships. Although this approach has been shown to significantly increase the sensitivity and specificity of CNV detection, the family information is not optimally used. Moreover, if the CNV boundary is inferred incorrectly in the first step, it cannot be corrected in the second step. More recently, Marioni et al. discussed similar issues extensively for array CGH data, and proposed that copy numbers can be inferred on each chromosome, using HapMap family data as examples.
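To make concrete how family relationships constrain chromosome-specific copy numbers once total copy numbers are known, the following is a minimal illustrative sketch in Python. It is not the algorithm of any of the cited studies. The function names and the cap of two copies per chromosome (`MAX_CHROM_COPIES`) are assumptions introduced here for illustration, and de novo events are ignored.

```python
from itertools import product

# Assumption for illustration: each homologous chromosome carries 0, 1 or 2 copies.
MAX_CHROM_COPIES = 2


def chromosome_configs(total):
    """Return all unordered (low, high) per-chromosome copy numbers summing to `total`."""
    return [(a, total - a)
            for a in range(0, total // 2 + 1)
            if total - a <= MAX_CHROM_COPIES]


def consistent_trio_configs(father_total, mother_total, child_total):
    """Enumerate parental configurations compatible with the child's total copy number.

    The child is assumed to receive exactly one chromosome from each parent
    (strict Mendelian inheritance, no de novo events).
    Each result is (father_config, mother_config, possible_transmissions),
    where a transmission is (copies_from_father, copies_from_mother).
    """
    results = []
    for f_cfg, m_cfg in product(chromosome_configs(father_total),
                                chromosome_configs(mother_total)):
        transmissions = {(f, m) for f in f_cfg for m in m_cfg
                         if f + m == child_total}
        if transmissions:
            results.append((f_cfg, m_cfg, sorted(transmissions)))
    return results


if __name__ == "__main__":
    # Both parents have a total of two copies, yet the child has only one.
    for row in consistent_trio_configs(2, 2, 1):
        print(row)
```

Under these assumptions, a child with one copy born to two parents who each have two copies is Mendelian-consistent only if one parent carries a deletion on one chromosome and a duplication on the other, which is exactly the kind of configuration that total copy number alone cannot reveal.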
Efficient utilization of family information in CNV detection requires incorporation of the family relationships when modeling the joint probability distribution of signal intensities for family members. Similar to traditional multipoint linkage analysis with families, such a modeling procedure requires consideration of two levels of dependency—the dependency of signal intensities both between adjacent markers for each family member and at the same marker between family members. The first level of dependency can be modeled by a hidden Markov chain, in which the degree of dependency is determined by transition probabilities of the hidden copy number states, whereas the second level of dependency is determined by Mendelian inheritance. However, unlike the analysis of SNPs or microsatellites, family-based CNV studies are limited by the technical platforms, which can only give intensity estimates of the total copy number of a diploid genome. The analysis of CNVs in families is further complicated by the occurrence of de novo events, which occur as germline, somatic or cell line-induced chromosome aberrations in offspring that were not inherited from either parent.
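The first level of dependency, between adjacent markers within one individual, can be sketched with a small hidden Markov model decoded by the Viterbi algorithm. This is a simplified single-sample illustration, not the joint trio model described in this paper. The state space, the Gaussian emission means and standard deviation, and the transition probability below are all assumed values chosen for the example.

```python
import math

# All numeric values below are assumptions made for illustration only.
STATES = [1, 2, 3]                     # hidden total copy-number states
MEAN = {1: -0.6, 2: 0.0, 3: 0.4}       # assumed mean signal intensity per state
SD = 0.2                               # assumed common standard deviation
STAY = 0.99                            # assumed probability of keeping the same state


def log_emission(x, state):
    """Log density of a Gaussian emission for intensity x given the hidden state."""
    z = (x - MEAN[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log transition probability between copy-number states at adjacent markers."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(intensities):
    """Return the most likely sequence of hidden copy-number states."""
    # Uniform initial distribution over states.
    score = {s: math.log(1 / len(STATES)) + log_emission(intensities[0], s)
             for s in STATES}
    back = []
    for x in intensities[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES,
                            key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best_prev]
                              + log_transition(best_prev, cur)
                              + log_emission(x, cur))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    signal = [0.0, 0.05, -0.6, -0.65, -0.55, -0.6, 0.02, 0.0]
    print(viterbi(signal))
```

The high probability of staying in the same state is what encodes the dependency between adjacent markers: a single noisy intensity is not enough to trigger a state change, whereas a run of consistently low intensities is decoded as a deletion. A joint model for a trio would additionally couple the hidden states of the three family members through Mendelian inheritance.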
To address these complications, we describe a unified statistical framework developed to jointly model the signal intensities for a parents–offspring trio. We demonstrate that our model is computationally feasible and can be used to analyze trios in a more efficient manner than existing methods, which do not consider family relationships or use family relationships separately (12). By computer simulations and analysis of experimentally validated CNVs on real data, we demonstrate its superior performance in increasing call rates and in identifying the exact boundaries of CNVs. In addition, by analyzing a set of families genotyped using both the Illumina and Affymetrix SNP arrays, we further show the applicability of our method on different technical platforms and in detecting both inherited and de novo CNVs. Although CNV detection only concerns the total copy number, our model gives probabilistic estimates of chromosome-specific copy numbers, which can be used for the future development of linkage and association tests that require chromosome-specific copy number information.