Genome copy number changes (copy number variations, or CNVs) include inherited, de novo, and somatically acquired deviations from a diploid state within a particular chromosome segment. CNVs likely contribute substantially to inherited and/or acquired risk for a variety of human diseases, including cancer and neuropsychiatric disorders [1, 2]. In addition, CNVs are widely distributed in the genomes of apparently healthy individuals and thus constitute significant amounts of population-based genomic variation [3-8]. New genotyping technologies such as SNP-based arrays provide high-resolution coverage of entire genomes as well as an opportunity for rapidly determining CNV content in sample collections of interest [4, 6, 7, 9-11]. Accordingly, numerous recent studies have described constellations of structural variants in various healthy and disease cohorts [1, 2, 12, 13]. However, interpretation of the exact extent, character, distribution, and effect of these CNVs has been limited by the emerging nature of computational methods for accurate detection, and further challenged by the difficulty in assessing the biological importance of particular CNVs in context with other genomic features and study findings.
Detection of CNVs from high-density SNP arrays requires genotypes that yield high-quality intensity and, optimally, allelic ratio data for each locus surveyed. A number of algorithms have been used to detect CNVs from such genotyping data sets. Software from array vendors such as Illumina and Affymetrix provides basic CNV calls along with graphical interfaces that allow visual inspection of a region of interest. However, these tools generally lack the ability to quickly manage, annotate, and assess CNVs from a sizable number of samples. Moreover, visual inspection becomes challenging when interpreting small or complex rearrangements, or CNVs predicted from genome array data of marginal quality. A number of third-party commercial and open-source tools, including QuantiSNP [14] and PennCNV [15], employ Hidden Markov Models (HMMs) [16] to predict CNVs, and these approaches have been developed and adopted for a number of recent genome-wide studies of structural variation. Equally promising are segmentation algorithms such as GLAD [17] and Circular Binary Segmentation (CBS) [18], which have been successfully applied to the analysis of data from array-based comparative genomic hybridization (aCGH) platforms. These segmentation approaches are particularly attractive because they have been shown to outperform certain HMM-based approaches for aCGH data [19, 20]. Regardless of the approach, these algorithms typically overcall CNV events [12, 15, 21, 22], thus requiring post-prediction methods that consider data quality metrics to distinguish true events from false positives. Currently, researchers interested in analyzing genotypes for CNV content for the first time, or in setting up production systems for high-throughput analysis and interpretation, are challenged by the considerable variety and limited scope of most existing methods and tools. This is especially true in the use of SNP arrays for clinical diagnostic applications, where reliability and performance are of critical importance.
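To make the segment-then-filter idea concrete, the following is a minimal sketch in Python. It is plain recursive binary segmentation with a fixed gain threshold, followed by a simple marker-count and mean-shift filter; it is not CBS or GLAD (CBS, for instance, tests circular arcs and assesses change-points by permutation), nor the modified algorithm used in CNV Workshop. The function names, the thresholds (`min_gain`, `min_size`, `min_abs_mean`, `min_markers`), and the noise-free toy log-ratio values are all assumptions chosen for illustration.

```python
def best_split(values, min_size):
    """Return (index, gain) of the split that most reduces the sum of squares.

    The gain of splitting n values at index i equals
    (mean_left - mean_right)^2 * i * (n - i) / n.
    """
    n = len(values)
    total = sum(values)
    best_idx, best_gain = None, 0.0
    left_sum = sum(values[:min_size - 1])
    for i in range(min_size, n - min_size + 1):
        left_sum += values[i - 1]
        diff = left_sum / i - (total - left_sum) / (n - i)
        gain = diff * diff * (i * (n - i) / n)
        if gain > best_gain:
            best_idx, best_gain = i, gain
    return best_idx, best_gain


def segment(values, start=0, end=None, min_gain=0.04, min_size=2):
    """Recursively split values[start:end]; return half-open (start, end) segments."""
    if end is None:
        end = len(values)
    idx, gain = best_split(values[start:end], min_size)
    if idx is None or gain < min_gain:
        return [(start, end)]
    mid = start + idx
    return (segment(values, start, mid, min_gain, min_size)
            + segment(values, mid, end, min_gain, min_size))


def call_cnvs(values, segments, min_abs_mean=0.2, min_markers=5):
    """Post-prediction filter: keep segments with enough markers and a clear mean shift."""
    calls = []
    for start, end in segments:
        seg_mean = sum(values[start:end]) / (end - start)
        if end - start >= min_markers and abs(seg_mean) >= min_abs_mean:
            calls.append((start, end, "loss" if seg_mean < 0 else "gain"))
    return calls


# Hypothetical, noise-free log-ratio profile: a 5-marker loss and a 2-marker blip.
log_ratios = [0.0] * 10 + [-0.6] * 5 + [0.0] * 10 + [0.5] * 2 + [0.0] * 8
segments = segment(log_ratios)
calls = call_cnvs(log_ratios, segments)
print(segments)
print(calls)
```

In this toy profile the segmentation step reports both the five-marker loss and the two-marker gain as separate segments, while the filter retains only the loss. This mirrors, in a very reduced form, why overcalled segments need a quality-aware filtering step; real data would also require noise-aware thresholds and allelic ratio information.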
At the same time, assessing the importance of particular CNVs in context with other genomic features and study findings is a complex task even without robust quality assessment of CNV predictions, especially given limited current knowledge of the distributions of CNVs across the genome and in populations. Contextual genomic and phenotypic annotations need to be considered, while projects involving sizable cohorts also require an infrastructure for managing, accessing, batch-processing, and visualizing annotated CNV predictions.
To address these challenges, we describe the integrated platform CNV Workshop. This package incorporates a modified segmentation algorithm that we have previously applied successfully for detecting pathogenic CNVs in large-scale research and clinical projects [12, 13]. CNV Workshop includes a database layer, role-based security and authentication schemes suitable for clinical diagnostic environments, a web-based presentation layer providing textual and graphical visualization of CNV predictions, and integration of CNV content with known genomic and biomedical annotations for rapidly determining the significance of a particular CNV. These components are modular yet seamlessly integrated, and together they provide an effective platform for high-throughput identification of copy number variation; discovery of inherited, de novo, and somatically acquired pathogenic variants; and clinical diagnostics.