In recent years there has been growing interest in the role of copy number variations (CNVs) in genetic diseases. Although technologies and statistical methods for detecting CNVs from array data have developed rapidly, the inherent data quality limitations of most hybridization techniques remain a challenging problem in CNV association studies.
To help address these data quality issues in the context of family-based association studies, we introduce a statistical framework for intensity-based array data that takes family information into account for copy-number assignment. The method adapts traditional methods for modeling SNP genotype data that assume a Gaussian mixture model: CNV calling is performed for all family members simultaneously, leveraging within-family data to reduce CNV calls that are incompatible with Mendelian inheritance while still allowing de novo CNVs. Applying this method to simulation studies and a genome-wide association study in asthma, we find that our approach significantly improves CNV calling accuracy and reduces Mendelian inconsistency rates and false positive genotype calls. The results were validated using qPCR experiments.
In conclusion, we have demonstrated that the use of family information can improve the quality of CNV calling and may yield more powerful association tests for CNVs.
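The idea of calling all family members at once can be sketched for a single marker and a father-mother-child trio. This is a minimal illustration, not the authors' implementation: it assumes a deletion-only model with copy numbers 0, 1 and 2, and the cluster means, standard deviation, founder prior and de novo probability below are hypothetical values chosen only to make the example concrete.

```python
import math
from itertools import product

STATES = (0, 1, 2)                    # deletion-only copy-number states (assumption)
MEANS = {0: -2.0, 1: -0.6, 2: 0.0}    # hypothetical Gaussian cluster means
SD = 0.2                              # hypothetical shared cluster standard deviation
PRIOR = {0: 0.01, 1: 0.09, 2: 0.90}   # hypothetical prior for parents
DE_NOVO = 1e-3                        # hypothetical probability of a de novo event


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution, used as the mixture component."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def transmit(cn):
    """P(copies on the transmitted chromosome | parental copy number)."""
    return {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}, 2: {1: 1.0}}[cn]


def mendelian_prob(father, mother, child):
    """P(child copy number | parents), with a small de novo component."""
    p = 0.0
    for a, pa in transmit(father).items():
        for b, pb in transmit(mother).items():
            if a + b == child:
                p += pa * pb
    # Spread the de novo mass evenly so incompatible calls remain possible.
    return (1.0 - DE_NOVO) * p + DE_NOVO / len(STATES)


def call_single(x):
    """Call one sample on its own intensity, ignoring relatives."""
    return max(STATES, key=lambda s: PRIOR[s] * gaussian_pdf(x, MEANS[s], SD))


def call_trio(x_father, x_mother, x_child):
    """Return the most probable (father, mother, child) copy numbers jointly."""
    best, best_score = None, -1.0
    for f, m, c in product(STATES, repeat=3):
        score = (PRIOR[f] * gaussian_pdf(x_father, MEANS[f], SD)
                 * PRIOR[m] * gaussian_pdf(x_mother, MEANS[m], SD)
                 * mendelian_prob(f, m, c) * gaussian_pdf(x_child, MEANS[c], SD))
        if score > best_score:
            best, best_score = (f, m, c), score
    return best
```

With these illustrative parameters, a child intensity of -0.5 is called as one copy when analysed alone, but as two copies when both parents sit squarely in the two-copy cluster. A child intensity of -2.0 is still called as a de novo homozygous deletion, because the intensity evidence outweighs the small de novo prior.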
Copy Number Variation (CNV) is increasingly implicated in disease pathogenesis. CNVs are often identified by statistical models applied to data from single nucleotide polymorphism (SNP) panels. Family information for samples provides additional information for CNV inference. Two modes of PennCNV (the Joint-call and Posterior-call), which are among the most well-developed family-based CNV calling methods, use a “Joint-model” as a main component. This models all family members’ CNV states together with Mendelian inheritance. Methods based on the Joint-model are used to infer CNV calls of cases and controls in a pedigree, which may be compared to each other to test an association. Although benefits from the Joint-model have been shown elsewhere, equality of call rates in parents and offspring has not been evaluated previously. This can affect downstream analyses in studies that compare CNV rates in cases versus controls in pedigrees. In this paper, we show that the Joint-model can introduce different CNV call rates among family members in the absence of a true difference. First, we show analytically that the Joint-model may introduce differential CNV calls because of asymmetry in the model. We demonstrate these differential call rates using single-marker simulations. We show that call rates using the two modes of PennCNV also differ between parents and offspring in one multi-marker simulated dataset and two real datasets. Our results advise caution in the use of Joint-model calls in CNV association studies with family-based datasets.
Schizophrenia; Calling Algorithm; Family-Based Study; CNV burden
A major concern for all copy number variation (CNV) detection algorithms is their reliability and repeatability. However, it is difficult to evaluate the reliability of CNV calling strategies due to the lack of gold standard data that would tell us which CNVs are real. We propose that if CNVs are called in duplicate samples, or inherited from parent to child, then these can be considered validated CNVs. We used two large family-based Genome-Wide Association Study (GWAS) datasets from the GENEVA consortium to look at concordance rates of CNV calls between duplicate samples, parent-child pairs, and unrelated pairs. Our goal was to make recommendations for ways to filter and use CNV calls in GWAS datasets that do not include family data. We used PennCNV as our primary CNV-calling algorithm, and tested CNV calls using different datasets and marker sets, and with various filters on CNVs and samples. Using the Illumina core HumanHap550 SNP (single nucleotide polymorphism) set, we saw duplicate concordance rates of approximately 55% and parent-child transmission rates of approximately 28% in our datasets. GC model adjustment and sample quality filtering had little effect on these reliability measures. Stratification on CNV size and DNA sample type did have some effect. Overall, our results show that it is probably not possible to find a CNV calling strategy (including filtering and algorithm) that will give us a set of “reliable” CNV calls using current chip technologies. But if we understand the error process, we can still use CNV calls appropriately in genetic association studies.
evaluation; CNV calling strategies; family-based GWAS
The detection of copy number variants (CNVs) and the results of CNV-disease association studies rely on how CNVs are defined, and because array-based technologies can only infer CNVs, CNV-calling algorithms can produce vastly different findings. Several authors have noted the large-scale variability between CNV-detection methods, as well as the substantial false positive and false negative rates associated with those methods. In this study, we use variations of four common algorithms for CNV detection (PennCNV, QuantiSNP, HMMSeg, and cnvPartition) and two definitions of overlap (any overlap and an overlap of at least 40% of the smaller CNV) to illustrate the effects of varying algorithms and definitions of overlap on CNV discovery.
Methodology and Principal Findings
We used a 56 K Illumina genotyping array enriched for CNV regions to generate hybridization intensities and allele frequencies for 48 Caucasian schizophrenia cases and 48 age-, ethnicity-, and gender-matched control subjects. No algorithm found a difference in CNV burden between the two groups. However, the total number of CNVs called ranged from 102 to 3,765 across algorithms. The mean CNV size ranged from 46 kb to 787 kb, and the average number of CNVs per subject ranged from 1 to 39. The number of novel CNVs not previously reported in normal subjects ranged from 0 to 212.
Conclusions and Significance
Motivated by the availability of multiple publicly available genome-wide SNP arrays, investigators are conducting numerous analyses to identify putative additional CNVs in complex genetic disorders. However, the number of CNVs identified in array-based studies, and whether these CNVs are novel or valid, will depend on the algorithm(s) used. Thus, given the variety of methods used, there will be many false positives and false negatives. Both guidelines for the identification of CNVs inferred from high-density arrays and the establishment of a gold standard for validation of CNVs are needed.
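The two definitions of overlap compared in this study (any overlap, and an overlap of at least 40% of the smaller CNV) can be written as a short sketch. The interval representation as half-open `(start, end)` pairs on one chromosome and the function names are assumptions made for illustration.

```python
def overlap_bp(a, b):
    """Number of shared base pairs between two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def any_overlap(a, b):
    """First definition: the two CNVs share at least one base pair."""
    return overlap_bp(a, b) > 0


def overlap_of_smaller(a, b, min_fraction=0.40):
    """Second definition: the shared span covers at least min_fraction of the smaller CNV."""
    smaller = min(a[1] - a[0], b[1] - b[0])
    return smaller > 0 and overlap_bp(a, b) / smaller >= min_fraction
```

For example, intervals `(100, 200)` and `(180, 400)` share 20 bp, so they overlap under the first definition but not under the second, because 20 bp is only 20% of the smaller, 100 bp CNV.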
In addition to single-nucleotide polymorphisms (SNPs), copy number variation (CNV) is a major component of human genetic diversity. Among many whole-genome analysis platforms, SNP arrays have been commonly used for genomewide CNV discovery. Recently, a number of algorithms for defining CNVs from SNP genotyping data have been developed; however, due to the fundamental limitation of SNP genotyping data for the measurement of signal intensity, there are still concerns regarding the possibility of false discovery or low sensitivity for detecting CNVs. In this study, we aimed to verify the effect of combining multiple CNV calling algorithms and set up the most reliable pipeline for CNV calling with Affymetrix Genomewide SNP 5.0 data. For this purpose, we selected the 3 most commonly used algorithms for CNV segmentation from SNP genotyping data: PennCNV, QuantiSNP, and BirdSuite. After defining the CNV loci using the 3 different algorithms, we assessed how many of them overlapped with each other, and we also validated the CNVs by genomic quantitative PCR. Through this analysis, we proposed that for reliable CNV-based genomewide association study using SNP array data, CNV calls must be performed with at least 3 different algorithms and that the CNVs consistently called from more than 2 algorithms must be used for association analysis, because they are more reliable than the CNVs called from a single algorithm. Our results will be helpful for setting up CNV analysis protocols for Affymetrix Genomewide SNP 5.0 genotyping data.
CNV defining algorithm; DNA copy number variations; SNP array
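The consensus rule described in this abstract can be sketched as a filter that keeps a call only when enough algorithms report an overlapping CNV of the same type. This is an illustrative sketch, not the published pipeline: the call format `(chrom, start, end, cnv_type)`, the function names and the example coordinates are hypothetical. The support threshold is left as a parameter, since the wording of the rule could be read as requiring either two or all three algorithms.

```python
def overlaps(a, b):
    """True if two calls share a chromosome, a CNV type and at least one base pair."""
    return a[0] == b[0] and a[3] == b[3] and min(a[2], b[2]) > max(a[1], b[1])


def consensus_calls(calls_by_algorithm, min_support=2):
    """Keep calls that are supported by at least min_support algorithms.

    calls_by_algorithm maps an algorithm name to a list of
    (chrom, start, end, cnv_type) tuples for one sample.
    Returns (algorithm, call, support) triples.
    """
    kept = []
    for name, calls in calls_by_algorithm.items():
        for call in calls:
            # The calling algorithm counts once; each other algorithm counts
            # once if any of its calls overlaps this one.
            support = 1 + sum(
                any(overlaps(call, other) for other in other_calls)
                for other_name, other_calls in calls_by_algorithm.items()
                if other_name != name
            )
            if support >= min_support:
                kept.append((name, call, support))
    return kept


# Hypothetical calls for one sample.
example = {
    "PennCNV": [("chr1", 100, 500, "del"), ("chr2", 1000, 2000, "dup")],
    "QuantiSNP": [("chr1", 300, 700, "del")],
    "Birdsuite": [("chr1", 600, 900, "del"), ("chr3", 10, 50, "del")],
}
```

In this example the chr2 duplication and the chr3 deletion are each reported by a single algorithm and are dropped at `min_support=2`, while the three chr1 deletions are kept. Raising the threshold to 3 keeps only the QuantiSNP call, which overlaps calls from both other algorithms.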
Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing availability of high-resolution genome surveillance platforms provides opportunity for rapidly assessing research and clinical samples for CNV content, as well as for determining the potential pathogenicity of identified variants. However, few informatics tools for accurate and efficient CNV detection and assessment currently exist.
We developed a suite of software tools and resources (CNV Workshop) for automated, genome-wide CNV detection from a variety of SNP array platforms. CNV Workshop includes three major components: detection, annotation, and presentation of structural variants from genome array data. CNV detection utilizes a robust and genotype-specific extension of the Circular Binary Segmentation algorithm, and the use of additional detection algorithms is supported. Predicted CNVs are captured in a MySQL database that supports cohort-based projects and incorporates a secure user authentication layer and user/admin roles. To assist with determination of pathogenicity, detected CNVs are also annotated automatically for gene content, known disease loci, and gene-based literature references. Results are easily queried, sorted, filtered, and visualized via a web-based presentation layer that includes a GBrowse-based graphical representation of CNV content and relevant public data, integration with the UCSC Genome Browser, and tabular displays of genomic attributes for each CNV.
To our knowledge, CNV Workshop represents the first cohesive and convenient platform for detection, annotation, and assessment of the biological and clinical significance of structural variants. CNV Workshop has been successfully utilized for assessment of genomic variation in healthy individuals and disease cohorts and is an ideal platform for coordinating multiple associated projects.
Availability and Implementation
Available on the web at: http://sourceforge.net/projects/cnv
A number of copy number variation (CNV) calling algorithms exist; however, comprehensive software tools for CNV association studies are lacking. We describe ParseCNV, unique software that takes CNV calls and creates probe-based statistics for CNV occurrence in both case–control design and in family based studies addressing both de novo and inheritance events, which are then summarized based on CNV regions (CNVRs). CNVRs are defined in a dynamic manner to allow for a complex CNV overlap while maintaining precise association region. Using this approach, we avoid failure to converge and non-monotonic curve fitting weaknesses of programs, such as CNVtools and CNVassoc, and although Plink is easy to use, it only provides combined CNV state probe-based statistics, not state-specific CNVRs. Existing CNV association methods do not provide any quality tracking information to filter confident associations, a key issue which is fully addressed by ParseCNV. In addition, uncertainty in CNV calls underlying CNV associations is evaluated to verify significant results, including CNV overlap profiles, genomic context, number of probes supporting the CNV and single-probe intensities. When optimal quality control parameters are followed using ParseCNV, 90% of CNVs validate by polymerase chain reaction, an often problematic stage because of inadequate significant association review. ParseCNV is freely available at http://parsecnv.sourceforge.net.
Motivation: Copy number variations (CNVs) are increasingly recognized as a substantial source of individual genetic variation, and hence there is a growing interest in investigating the evolutionary history of CNVs as well as their impact on complex disease susceptibility. CNV/SNP haplotypes are critical for this research, but although many methods have been proposed for inferring integer copy number, few have been designed for inferring CNV haplotypic phase and none of these are applicable at genome-wide scale. Here, we present a method for inferring missing CNV genotypes, predicting CNV allelic configuration and inferring CNV haplotypic phase from SNP/CNV genotype data. Our method, implemented in the software polyHap v2.0, is based on a hidden Markov model, which models the joint haplotype structure between CNVs and SNPs. Thus, the haplotypic phases of CNVs and SNPs are inferred simultaneously. A sampling algorithm is employed to obtain a measure of confidence/credibility of each estimate.
Results: We generated diploid phase-known CNV–SNP genotype datasets by pairing male X chromosome CNV–SNP haplotypes. We show that polyHap provides accurate estimates of missing CNV genotypes, allelic configuration and CNV haplotypic phase on these datasets. We applied our method to a non-simulated dataset—a region on Chromosome 2 encompassing a short deletion. The results confirm that polyHap's accuracy extends to real-life datasets.
Availability: Our method is implemented in version 2.0 of the polyHap software package and can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin
Supplementary information: Supplementary data are available at Bioinformatics online.
Single nucleotide polymorphism (SNP) and copy number variation (CNV) are both widespread characteristics of the human genome, but are often called separately on common genotyping platforms. To capture integrated SNP and CNV information, methods have been developed for calling allele-specific copy numbers, or so-called copy number polymorphism (CNP), using limited inter-marker correlation. In this paper, we propose a haplotype-based maximum likelihood method to call CNP, which takes advantage of the valuable multi-locus linkage disequilibrium (LD) information in the population. We also developed a computationally efficient algorithm to estimate haplotype frequencies and optimize individual CNP calls iteratively, even in the presence of missing data. Through simulations, we demonstrated that our model is more sensitive and accurate in detecting various CNV regions, compared with commonly used CNV calling methods including PennCNV, another hidden Markov model (HMM) using CNP, a scan statistic, segCNV, and cnvHap. Our method often performs better in regions with higher LD, in longer CNV regions, and in common CNVs than in the opposite cases. We applied our method to the genotypes of 90 HapMap CEU samples and 23 patients with acute lung injury (ALI). For each ALI patient the genotyping was performed twice. The CNPs from our method show good consistency and accuracy comparable to others.
CNV; CNP; GWAS; haplotype; joint SNP and CNV calling; integrated SNP and CNV
Accurate and complete measurement of single nucleotide (SNP) and copy number (CNV) variants, both common and rare, will be required to understand the role of genetic variation in disease. We present Birdsuite, a four-stage analytical framework instantiated in software for deriving integrated and mutually consistent copy number and SNP genotypes. The method sequentially assigns copy number across regions of common copy number polymorphisms (CNPs), calls genotypes of SNPs, identifies rare CNVs via a hidden Markov model (HMM), and generates an integrated sequence and copy number genotype at every locus (for example, including genotypes such as A-null, AAB and BBB in addition to AA, AB and BB calls). Such genotypes more accurately depict the underlying sequence of each individual, reducing the rate of apparent mendelian inconsistencies. The Birdsuite software is applied here to data from the Affymetrix SNP 6.0 array. Additionally, we describe a method, implemented in PLINK, to utilize these combined SNP and CNV genotypes for association testing with a phenotype.
Determination of copy number variants (CNVs) inferred in genome wide single nucleotide polymorphism arrays has shown increasing utility in genetic variant disease associations. Several CNV detection methods are available, but differences in CNV call thresholds and characteristics exist. We evaluated the relative performance of seven methods: circular binary segmentation, CNVFinder, cnvPartition, gain and loss of DNA, Nexus algorithms, PennCNV and QuantiSNP. Tested data included real and simulated Illumina HumHap 550 data from the Singapore cohort study of the risk factors for Myopia (SCORM) and simulated data from Affymetrix 6.0 and platform-independent distributions. The normalized singleton ratio (NSR) is proposed as a metric for parameter optimization before enacting full analysis. We used 10 SCORM samples for optimizing parameter settings for each method and then evaluated method performance at optimal parameters using 100 SCORM samples. The statistical power, false positive rates, and receiver operating characteristic (ROC) curve residuals were evaluated by simulation studies. Optimal parameters, as determined by NSR and ROC curve residuals, were consistent across datasets. QuantiSNP outperformed other methods based on ROC curve residuals over most datasets. Nexus Rank and SNPRank have low specificity and high power. Nexus Rank calls oversized CNVs. PennCNV detects one of the fewest numbers of CNVs.
Motivation: Estimating the frequency distribution of copy number variants (CNVs) is an important aspect of the effort to characterize this new type of genetic variation. Currently, most studies report a strong skew toward low-frequency CNVs. In this article, our goal is to investigate the frequencies of CNVs. We employ a two-step procedure for the CNV frequency estimation process. We use family information a posteriori to select only the most reliable CNV regions, i.e. those showing high rates of Mendelian transmission.
Results: Our results suggest that the current skew toward low-frequency CNVs may not be representative of the true frequency distribution, but may be due, among other reasons, to the non-negligible false negative rates that characterize CNV detection methods. Moreover, false positives are also likely, as low-frequency CNVs are hard to detect with small sample sizes and technologies that are not ideally suited for their detection. Without appropriate validation methods, such as incorporation of biologically relevant information (for example, in our case, the transmission of heritable CNVs from parents to offspring), it is difficult to assess the validity of specific CNVs, and even harder to obtain reliable frequency estimates.
Availability: Software implementing the methods described in this article is available for download at the following address: http://www.isites.harvard.edu/icb/icb.do?keyword=k36162
Supplementary information: Supplementary data are available at Bioinformatics online.
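The a posteriori use of family information described in this abstract can be illustrated with a small sketch. It takes one plausible reading of "rate of Mendelian transmission": among children who carry a CNV in a region, the fraction with at least one carrier parent. That definition, the trio encoding as booleans, the function names and the 0.9 cutoff are all assumptions for illustration and are not taken from the software linked above.

```python
def transmission_rate(trios):
    """Fraction of carrier children with at least one carrier parent.

    trios is a list of (father_carrier, mother_carrier, child_carrier) booleans.
    Returns None when no child carries the CNV.
    """
    carrier_children = [t for t in trios if t[2]]
    if not carrier_children:
        return None
    inherited = sum(1 for father, mother, _ in carrier_children if father or mother)
    return inherited / len(carrier_children)


def reliable_regions(region_trios, min_rate=0.9):
    """Names of CNV regions whose transmission rate reaches min_rate (hypothetical cutoff)."""
    selected = []
    for region, trios in region_trios.items():
        rate = transmission_rate(trios)
        if rate is not None and rate >= min_rate:
            selected.append(region)
    return selected


# Hypothetical carrier status for two regions.
example_regions = {
    "regionA": [(True, False, True), (False, False, False), (False, True, True)],
    "regionB": [(False, False, True), (False, False, True), (True, False, True)],
}
```

Here every carrier child in `regionA` has a carrier parent, so the region is retained, whereas in `regionB` only one of three carrier children does, which suggests that many of its calls are false positives or missed parental calls.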
Copy number variation (CNV) is pervasive in the human genome and can play a causal role in genetic diseases. The functional impact of CNV cannot be fully captured through linkage disequilibrium with SNPs. These observations motivate the development of statistical methods for performing direct CNV association studies. We show through simulation that current tests for CNV association are prone to false-positive associations in the presence of differential errors between cases and controls, especially if quantitative CNV measurements are noisy. We present a statistical framework for performing case-control CNV association studies that applies likelihood ratio testing of quantitative CNV measurements in cases and controls. We show that our methods are robust to differential errors and noisy data and can achieve maximal theoretical power. We illustrate the power of these methods for testing for association with binary and quantitative traits, and have made this software available as the R package CNVtools.
Genome-wide association studies (GWAS) based on single nucleotide polymorphisms (SNPs) revolutionized our perception of the genetic regulation of complex traits and diseases. Copy number variations (CNVs) promise to shed additional light on the genetic basis of monogenic as well as complex diseases and phenotypes. Indeed, the number of detected associations between CNVs and certain phenotypes is constantly increasing. However, while several software packages support the determination of CNVs from SNP chip data, the downstream statistical inference of CNV-phenotype associations is still subject to complicated and inefficient in-house solutions, thus strongly limiting the performance of GWAS based on CNVs.
CONAN is a freely available client-server software solution which provides an intuitive graphical user interface for categorizing, analyzing and associating CNVs with phenotypes. Moreover, CONAN assists the evaluation process by visualizing detected associations via Manhattan plots in order to enable a rapid identification of genome-wide significant CNV regions. Various file formats including the information on CNVs in population samples are supported as input data.
CONAN facilitates the performance of GWAS based on CNVs and the visual analysis of calculated results. CONAN provides a rapid, valid and straightforward software solution to identify genetic variation underlying the 'missing' heritability for complex traits that remains unexplained by recent GWAS. The freely available software can be downloaded at http://genepi-conan.i-med.ac.at.
Copy number variations (CNVs) have been reported to be causal suspects in a variety of
psychopathologic traits. We investigate whether de novo and/or inherited CNVs
contribute to the risk for Attention Problems (APs) in children. Based on longitudinal
phenotyping, 50 concordant and discordant monozygotic (MZ) twin pairs were selected from a
sample of ∼3200 MZ pairs. Two types of de novo CNVs were investigated: (1)
CNVs shared by both MZ twins, but not inherited (pre-twinning de novo CNVs),
which were detected by comparing copy number (CN) calls between parents and twins and (2)
CNVs not shared by co-twins (post-twinning de novo CNVs), which were investigated
by comparing the CN calls within MZ pairs. The association between the overall CNV burden
and AP was also investigated for CNVs genome-wide, CNVs within genes and CNVs outside of
genes. Two de novo CNVs were identified and validated using quantitative PCR: a
pre-twinning de novo duplication in a concordant-unaffected twin pair and a
post-twinning deletion in the higher scoring twin from a concordant-affected pair. For the
overall CNV burden analyses, affected individuals had significantly larger CNVs that
overlapped with genes than unaffected individuals (P=0.008). This study
suggests that the presence of larger CNVs may increase the risk for AP, because they are
more likely to affect genes, and confirms that MZ twins are not always genetically identical.
copy number variation; twin; Attention Problem; ADHD
Copy number variation (CNV) is a major source of structural variants and has been commonly identified in mammalian genome. It is associated with gene expression and may present a major genetic component of phenotypic diversity. Unlike many other mammalian genomes where CNVs have been well annotated, studies of porcine CNV in diverse breeds are still limited.
Here we used the Porcine SNP60 BeadChip and the PennCNV algorithm to identify 1,315 putative CNVs belonging to 565 CNV regions (CNVRs) in 1,693 pigs from 18 diverse populations. In total, 538 of the 683 CNVs identified in a White Duroc × Erhualian F2 population fit Mendelian transmission, and 6 of 7 randomly selected CNVRs were confirmed by quantitative real-time PCR. CNVRs were non-randomly distributed in the pig genome. Several CNV hotspots were found on pig chromosomes 6, 11, 13, 14 and 17. CNV numbers differ greatly among pig populations. The Duroc pigs had the largest number of CNVs per individual. Among 1,765 transcripts located within the CNVRs, 634 genes have been reported to be copy number variable genes in the human genome. By integrating QTL mapping, CNVRs and the description of phenotypes in knockout mice, we identified 7 copy number variable genes as candidate genes for phenotypes related to carcass length, backfat thickness, abdominal fat weight, scapula length, intermuscle fat content of the longissimus muscle, body weight at 240 days, glycolytic potential of the longissimus muscle, mean corpuscular hemoglobin, mean corpuscular volume and humerus diameter.
We revealed the distribution of an unprecedented 565 CNVRs in the pig genome and investigated copy number variable genes as possible candidate genes for phenotypic traits. These findings give novel insights into porcine CNVs and provide resources to facilitate the identification of trait-related CNVs.
Copy number variation; Copy number variable gene; Complex trait; QTL; Pig
Array-based detection of copy number variations (CNVs) is widely used for identifying disease-specific genetic variations. However, the accuracy of CNV detection is not sufficient and results differ depending on the detection programs used and their parameters. In this study, we evaluated five widely used CNV detection programs, Birdsuite (mainly consisting of the Birdseye and Canary modules), Birdseye (part of Birdsuite), PennCNV, CGHseg, and DNAcopy from the viewpoint of performance on the Affymetrix platform using HapMap data and other experimental data. Furthermore, we identified CNVs of 180 healthy Japanese individuals using parameters that showed the best performance in the HapMap data and investigated their characteristics.
The results indicate that Hidden Markov model-based programs PennCNV and Birdseye (part of Birdsuite), or Birdsuite show better detection performance than other programs when the high reproducibility rates of the same individuals and the low Mendelian inconsistencies are considered. Furthermore, when rates of overlap with other experimental results were taken into account, Birdsuite showed the best performance from the view point of sensitivity but was expected to include many false negatives and some false positives. The results of 180 healthy Japanese demonstrate that the ratio containing repeat sequences, not only segmental repeats but also long interspersed nuclear element (LINE) sequences both in the start and end regions of the CNVs, is higher in CNVs that are commonly detected among multiple individuals than that in randomly selected regions, and the conservation score based on primates is lower in these regions than in randomly selected regions. Similar tendencies were observed in HapMap data and other experimental data.
Our results suggest that not only segmental repeats but also interspersed repeats, especially LINE sequences, are deeply involved in CNVs, particularly in common CNV formations.
The detected CNVs are stored in the CNV repository database newly constructed by the "Japanese integrated database project" for sharing data among researchers. http://gwas.lifesciencedb.jp/cgi-bin/cnvdb/cnv_top.cgi
SNP genotyping arrays have been developed to characterize single-nucleotide polymorphisms (SNPs) and DNA copy number variations (CNVs). Nonparametric and model-based statistical algorithms have been developed to detect CNVs from SNP data using the marker intensities. However, these algorithms lack specificity to detect small CNVs owing to the high false positive rate when calling CNVs based on the intensity values. Therefore, the resulting association tests lack power even if the CNVs affecting disease risk are common. An alternative procedure called PennCNV uses information from both the marker intensities as well as the genotypes and therefore has increased sensitivity.
By using the hidden Markov model (HMM) implemented in PennCNV to derive the probabilities of different copy number states, which we subsequently used in a logistic regression model, we developed a new genome-wide algorithm to detect CNV associations with diseases. We compared this new method with an association test applied to the most probable copy number state for each individual, which PennCNV provides after an initial HMM analysis followed by the Viterbi algorithm, a step that discards information about copy number probabilities. In one of our simulation studies, we showed that for large CNVs (number of SNPs ≥ 10), the association tests based on PennCNV calls gave more significant results, but the new algorithm retained high power. For small CNVs (number of SNPs < 10), the logistic algorithm provided smaller average p-values (e.g., p = 7.54e-17 when relative risk RR = 3.0) in all the scenarios and could capture signals that PennCNV did not (e.g., p = 0.020 when RR = 3.0). From a second set of simulations, we showed that the new algorithm is more powerful in detecting disease associations with small CNVs (number of SNPs ranging from 3 to 5) under different penetrance models (e.g., when RR = 3.0, for relatively weak signals, power = 0.8030 compared with 0.2879 obtained from the association tests based on PennCNV calls). The new method was implemented in the software GWCNV. It is freely available at http://gwcnv.sourceforge.net, distributed under a GPL license.
We conclude that the new algorithm is more sensitive and can be more powerful in detecting CNV associations with diseases than the existing HMM algorithm, especially when the CNV association signal is weak and a limited number of SNPs are located in the CNV.
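The contrast between hard calls and state probabilities can be made concrete with a small sketch. It is not the GWCNV code: it assumes a posterior vector indexed by integer copy number 0 to 4, which is a simplification and not necessarily the state space PennCNV uses, and the function names are hypothetical.

```python
def most_probable_state(posterior):
    """Hard call: the copy number with the highest posterior probability."""
    return max(range(len(posterior)), key=lambda cn: posterior[cn])


def expected_copy_number(posterior):
    """Soft summary: the probability-weighted mean copy number."""
    return sum(cn * p for cn, p in enumerate(posterior))


# Hypothetical posterior at one marker: weak evidence for a one-copy deletion.
weak_deletion = [0.0, 0.45, 0.55, 0.0, 0.0]
```

For `weak_deletion`, the hard call is the normal two-copy state, so the 45% support for a deletion is lost, while the expected copy number of 1.55 retains it. A probability-based predictor of this kind is the sort of quantity that could enter a logistic regression in place of the discrete call, which is consistent with the gain in power reported for small CNVs with weak signals.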
Copy number variations (CNVs) are universal genetic variations, and their association with disease has been increasingly recognized. We designed high-density microarrays for CNVs, and detected 3000–4000 CNVs (4–6% of the genomic sequence) per population that included CNVs previously missed because of smaller sizes and residing in segmental duplications. The patterns of CNVs across individuals were surprisingly simple at the kilo-base scale, suggesting the applicability of a simple genetic analysis for these genetic loci. We utilized the probabilistic theory to determine integer copy numbers of CNVs and employed a recently developed phasing tool to estimate the population frequencies of integer copy number alleles and CNV–SNP haplotypes. The results showed a tendency toward a lower frequency of CNV alleles and that most of our CNVs were explained only by zero-, one- and two-copy alleles. Using the estimated population frequencies, we found several CNV regions with exceptionally high population differentiation. Investigation of CNV–SNP linkage disequilibrium (LD) for 500–900 bi- and multi-allelic CNVs per population revealed that previous conflicting reports on bi-allelic LD were unexpectedly consistent and explained by an LD increase correlated with deletion-allele frequencies. Typically, the bi-allelic LD was lower than SNP–SNP LD, whereas the multi-allelic LD was somewhat stronger than the bi-allelic LD. After further investigation of tag SNPs for CNVs, we conclude that the customary tagging strategy for disease association studies can be applicable for common deletion CNVs, but direct interrogation is needed for other types of CNVs.
One of the primary objectives in cancer research is to identify causal genomic alterations, such as somatic copy number variation (CNV) and somatic mutations, during tumor development. Many valuable studies lack genomic data to detect CNV; therefore, methods that are able to infer CNVs from gene expression data would help maximize the value of these studies.
We developed a framework for identifying recurrent regions of CNV and distinguishing the cancer driver genes from the passenger genes in these regions. By inferring CNV regions across many datasets, we were able to identify 109 recurrent amplified/deleted CNV regions. Many of these regions are enriched for genes involved in important processes associated with tumorigenesis and cancer progression. Genes in these recurrent CNV regions were then examined in the context of gene regulatory networks to prioritize putative cancer driver genes. The cancer driver genes uncovered by the framework include not only well-known oncogenes but also a number of novel cancer susceptibility genes validated via siRNA experiments.
To our knowledge, this is the first effort to systematically identify and validate drivers for expression-based CNV regions in breast cancer. The framework, in which wavelet analysis of expression-based copy number alteration is coupled with gene regulatory network analysis, provides a blueprint for leveraging genomic data to identify key regulatory components and gene targets. This integrative approach can be applied to many other large-scale gene expression studies and to other novel types of cancer data, such as next-generation sequencing-based expression (RNA-Seq) and CNV data.
breast cancer; copy number variation; gene regulatory networks; oncogenes
Recent studies of mammalian genomes have uncovered the vast extent of copy number variations (CNVs) that contribute to phenotypic diversity. Compared with a SNP, a CNV can cover a wider chromosome region, which may incur substantial sequence changes and induce more significant effects on phenotypes. CNVs are becoming a promising alternative genetic marker for genetic analyses. Here we report, for the first time, an account of CNV regions in the cattle genome in the Chinese Holstein population. Illumina BovineSNP50 BeadChips were used to screen 2047 Holstein individuals. Three different programs (PennCNV, cnvPartition and GADA) were used to detect potential CNVs. After a strict CNV calling pipeline, a total of 99 CNV regions were identified in the cattle genome. These CNV regions cover 23.24 Mb in total with an average size of 151.69 Kb. Of these CNV regions, 52 have frequencies above 1%, and 51 completely or partially overlap with 138 cattle genes, which are significantly enriched for specific biological functions, such as signaling pathways, sensory perception response and cellular processes. The results provide valuable information for constructing a more comprehensive CNV map of the cattle genome and offer an important resource for investigating genome structure and the genomic variation underlying traits of interest in cattle.
African Americans are a genetically diverse population with a high burden of many common heritable diseases. However, our understanding of genetic variation in African Americans is limited because of a lack of published population-based genetic studies. We report the distribution of copy-number variation (CNV) in African Americans collected as part of the Hypertension Genetic Epidemiology Network (HyperGEN) using the Affymetrix 6.0 array and the CNV calling algorithms Birdsuite and PennCNV. We present population estimates of CNV from 446 unrelated African-American subjects randomly selected from the 451 families collected within HyperGEN. Although the majority of CNVs discovered were individually rare, we found the frequency of CNVs to be collectively high. We identified a total of 11 070 CNVs greater than 10 kb that passed quality control criteria and were called by both algorithms, giving an average of 24.8 CNVs per person covering 2214 kb (median). We identified 1541 unique copy-number variable regions, 309 of which did not overlap with the Database of Genomic Variants. These results provide further insight into the distribution of CNV in African Americans.
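The idea of keeping only CNVs greater than 10 kb that are called by both algorithms can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the abstract does not say how agreement between Birdsuite and PennCNV was defined, so the sketch assumes that two calls agree when they share a segment on the same chromosome, and it applies the length threshold to that shared segment. The `Cnv` type and `consensus_calls` function are hypothetical names.

```python
from typing import List, NamedTuple


class Cnv(NamedTuple):
    chrom: str
    start: int  # inclusive start, in base pairs
    end: int    # exclusive end, in base pairs


def consensus_calls(calls_a: List[Cnv], calls_b: List[Cnv],
                    min_len: int = 10_000) -> List[Cnv]:
    """Return segments shared by both call sets that are longer than min_len.

    Assumption for this sketch: calls "agree" where they overlap on the
    same chromosome, and the size filter applies to the shared segment.
    """
    shared = []
    for a in calls_a:
        for b in calls_b:
            if a.chrom != b.chrom:
                continue
            start, end = max(a.start, b.start), min(a.end, b.end)
            if end - start > min_len:  # also rejects non-overlapping pairs
                shared.append(Cnv(a.chrom, start, end))
    return sorted(shared)
```

The quadratic loop keeps the sketch short; for genome-wide call sets a sorted sweep or an interval tree would be the more practical choice.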
DNA copy-number variation; African American; calling algorithm; Birdsuite; PennCNV; HyperGEN
Motivation: Exome sequencing has proven to be an effective tool to discover the genetic basis of Mendelian disorders. It is well established that copy number variants (CNVs) contribute to the etiology of these disorders. However, calling CNVs from exome sequence data is challenging. A typical read depth strategy consists of using another sample (or a combination of samples) as a reference to control for the variability at the capture and sequencing steps. However, technical variability between samples complicates the analysis and can create spurious CNV calls.
Results: Here, we introduce ExomeDepth, a new CNV calling algorithm designed to control for this technical variability. ExomeDepth uses a robust model for the read count data and uses this model to build an optimized reference set in order to maximize the power to detect CNVs. As a result, ExomeDepth is effective across a wider range of exome datasets than the previously existing tools, even for small (e.g. one to two exons) and heterozygous deletions. We used this new approach to analyse exome data from 24 patients with primary immunodeficiencies. Depending on data quality and the exact target region, we find between 170 and 250 exonic CNV calls per sample. Our analysis identified two novel causative deletions in the genes GATA2 and DOCK8.
Availability: The code used in this analysis has been implemented in an R package called ExomeDepth, which is available from the Comprehensive R Archive Network (CRAN).
Supplementary data are available at Bioinformatics online.
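The read-depth strategy described in the abstract above, comparing a test sample against a reference built from other samples, can be illustrated with a deliberately simple sketch. This is not ExomeDepth's statistical model or its reference-selection procedure; it assumes a plain Pearson correlation to rank candidate reference samples and a plain depth ratio per exon. The names `pearson`, `depth_ratios` and `n_ref` are hypothetical, and all counts are assumed to be positive and non-constant.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def depth_ratios(test: List[float], candidates: Dict[str, List[float]],
                 n_ref: int = 2) -> Tuple[List[str], List[float]]:
    """Pick the n_ref best-correlated samples and return per-exon depth ratios.

    A ratio near 1 suggests two copies; a ratio near 0.5 is what a
    heterozygous deletion would look like in this simplified picture.
    """
    ranked = sorted(candidates,
                    key=lambda name: pearson(test, candidates[name]),
                    reverse=True)
    chosen = ranked[:n_ref]
    # Pool the chosen references exon by exon.
    pooled = [sum(candidates[name][i] for name in chosen)
              for i in range(len(test))]
    # Rescale so that total depth matches between test and pooled reference.
    scale = sum(pooled) / sum(test)
    ratios = [t * scale / p for t, p in zip(test, pooled)]
    return chosen, ratios
```

Selecting well-correlated references is one way to limit the technical variability between samples that the abstract identifies as a source of spurious calls.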
Genome-wide association studies (GWAS) using Copy Number Variation (CNV) are becoming a central focus of genetic research. CNVs have successfully provided target genome regions for some disease conditions where simple genetic variation (i.e., SNPs) has previously failed to provide a clear association.
Here we present a new R package that integrates: (i) data import from the most common formats of Affymetrix, Illumina and aCGH arrays; (ii) a fast and accurate segmentation algorithm to call CNVs based on Genome Alteration Detection Analysis (GADA); and (iii) functions for displaying and exporting the copy number calls, identifying recurrent CNVs, performing multivariate analysis of population structure, and carrying out association studies. Using a large dataset containing 270 HapMap individuals (Affymetrix Human SNP Array 6.0 Sample Dataset), we demonstrate a flexible pipeline implemented with the package. It requires less than one minute per sample (3 million probe arrays) on a single-core computer and supports flexible parallelization for very large datasets. Case-control data were generated from the HapMap dataset to demonstrate a GWAS analysis.
The package provides the tools for creating a complete integrated pipeline from data normalization to statistical association. It can efficiently handle a massive volume of data consisting of millions of genetic markers and hundreds or thousands of samples with very accurate results.
Using microarray and sequencing platforms, a large number of copy number variations (CNVs) have been identified in humans. In practice, because the human genome is diploid, these platforms are limited to, or more accurate for, detecting total copy numbers rather than the chromosome-specific copy numbers on each of the two homologous chromosomes. Nevertheless, the analysis of linkage disequilibrium (LD) between CNVs and SNPs indicates that distinct copy numbers often sit on their own background haplotypes.
We propose new computational models for inferring chromosome-specific copy numbers by distinguishing the background haplotypes of each copy number. The formulated problems are shown to be NP-hard, and approximation/heuristic algorithms are developed. Simulation indicates that our method is accurate and outperforms the existing approach. When the program was tested on 60 parent-offspring trios, the inferred chromosome-specific copy numbers were highly consistent with the law of Mendelian inheritance. The distributions of copy numbers at the chromosomal level are provided for 270 individuals in three HapMap panels.
The estimation of chromosome-specific copy numbers using microarray or sequencing platforms is often confounded by a number of factors. This study shows that integrating background haplotypes can improve the accuracy of copy number estimation at the chromosome level, especially for CNVs in strong LD with nearby SNPs.
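The Mendelian consistency check on trios mentioned above can be sketched for chromosome-specific copy numbers as follows. This is a minimal illustration, not the authors' program: it assumes each individual is described by an unordered pair of per-chromosome copy numbers, that no de novo events occur, and that a child is consistent when one of its chromosomes matches a paternal chromosome and the other matches a maternal one. The function name `mendelian_consistent` is hypothetical.

```python
from typing import Tuple

# Copy numbers carried by the two homologous chromosomes of one individual.
Pair = Tuple[int, int]


def mendelian_consistent(father: Pair, mother: Pair, child: Pair) -> bool:
    """Return True if the child's pair can be inherited without de novo change.

    One child chromosome must carry a copy number present in the father,
    and the other must carry a copy number present in the mother.
    """
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)
```

For example, with a father carrying (1, 1) and a mother carrying (0, 2), a child inferred as (1, 1) fails the check even though its total copy number of 2 matches both parents' totals, which is the kind of distinction that total copy numbers alone cannot make.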