In this study we present a novel computational method to objectively identify clinically relevant CNVs using an NBTree classifier and 13 diverse genomic features. This is the first description of such a method applied to CNVs, and it can significantly improve interpretation of this important class of genomic variation. Our classification method has been validated on a set of 1,203 CNVs detected in 584 patients with mental retardation (MR), achieving a high accuracy (94%), with a sensitivity of 88% and a specificity of 94%.
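For readers less familiar with the three performance measures quoted above, the following minimal Python sketch shows how each is derived from a confusion matrix. The class name, function names and counts are illustrative assumptions; the counts are made up and are not the counts from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of classifier calls against the known CNV labels."""
    tp: int  # MR-associated CNVs classified as MR-associated
    fn: int  # MR-associated CNVs classified as benign
    tn: int  # benign CNVs classified as benign
    fp: int  # benign CNVs classified as MR-associated


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of truly MR-associated CNVs that the classifier detects."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of truly benign CNVs that the classifier calls benign."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all CNVs that are classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


# Made-up counts, for illustration only (not data from this study).
example = ConfusionCounts(tp=8, fn=2, tn=85, fp=5)
print(f"sensitivity={sensitivity(example):.2f}")
print(f"specificity={specificity(example):.2f}")
print(f"accuracy={accuracy(example):.2f}")
```

Because benign CNVs typically outnumber MR-associated CNVs in such a set, accuracy is dominated by specificity, which is why sensitivity is worth reporting separately.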
Several other computational methods have been developed previously to predict whether disruption or disturbance of genomic elements has pathogenic consequences. Often these methods focus on identifying disease genes or on predicting whether mutations or splicing events are pathogenic. Such methods make use of protein structure and stability measures, and phylogenetic or sequence conservation data, and often cross-validate their predictions using OMIM (Online Mendelian Inheritance in Man) data. These approaches may be less applicable to larger structural variants such as CNVs because they predict the effect of a single change on a single disease gene, rather than a large change involving many genes. Our approach differs in that we directly predict the causal CNV from genome-wide copy number scans on the basis of the distinguishing features of benign and disease-causing CNVs. In addition, OMIM does not provide a suitable source for validating the performance of a classification method for CNVs, as dosage-sensitive genes are largely underrepresented in this database (<5% of the entries describe haploinsufficient genes) and a precise mapping of CNVs in OMIM is lacking. In contrast to OMIM, the Decipher database list of known syndromes (https://decipher.sanger.ac.uk) provides a suitable list of CNVs for external validation of the classifier, with high-resolution mapping of their genomic locations. Our classification method correctly identified all the CNVs listed in this database as causing MR-associated syndromes.
The classifier incorporated specific knowledge about CNVs via 13 diverse structural and functional genomic features (including a number of different transposable element types). The proximity of these elements to CNVs has been reported previously, and it has been hypothesized that they mediate the formation of recurrent CNVs. We confirm previous results that benign CNVs are enriched in both LINE and segmental duplication elements and show that both the LINE density and the segmental duplication density substantially contribute to the classifier's accuracy (Table S2). Previous studies have also reported that CNV gains are enriched in many of the same features as CNV losses. Our feature contribution results support this finding: when the CNV type was removed from the classifier, only a 3.7% decrease in accuracy was observed, and 7 other features made a greater contribution to the classifier's accuracy. In addition to these transposable elements, we included functional genomic elements, which have recently been shown to assist in distinguishing benign from MR-associated CNVs. The significant enrichment of MGI mouse nervous system phenotypes in MR loss CNVs has previously been reported. We show that the MGI mouse knock-out phenotype feature is effective in distinguishing benign from MR-associated CNVs: 80% of all MR-associated CNVs contain one or more genes whose unique orthologue's disruption in mouse reveals a nervous system phenotype, whereas benign CNVs only rarely contain such genes (Table S2).
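The feature contributions discussed above are expressed as the loss in accuracy when a feature is removed from the classifier. A minimal Python sketch of such a leave-one-feature-out comparison is given below. It is not the GECCO implementation: the function names, the feature names and the toy scoring function are hypothetical, and in practice `evaluate` would retrain and re-test the classifier on the given feature subset.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple


def ablation_drops(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Accuracy lost when each feature is removed on its own.

    `evaluate` is assumed to return the accuracy obtained when the
    classifier is trained and tested on the given feature subset.
    """
    full = frozenset(features)
    baseline = evaluate(full)
    return {name: baseline - evaluate(full - {name}) for name in full}


def rank_by_drop(drops: Dict[str, float]) -> List[Tuple[str, float]]:
    """Features ordered from largest to smallest accuracy loss."""
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


# Hypothetical stand-in for a real train/test cycle: each feature adds a
# made-up number of percentage points to a baseline of 50.
TOY_POINTS = {"feature_a": 10, "feature_b": 8, "feature_c": 4}


def toy_evaluate(subset: FrozenSet[str]) -> float:
    return 50 + sum(TOY_POINTS[name] for name in subset)


drops = ablation_drops(TOY_POINTS, toy_evaluate)
ranking = rank_by_drop(drops)
for name, drop in ranking:
    print(f"{name}: -{drop} percentage points")
```

With correlated features, removing one feature at a time can understate its importance, because a correlated feature partly compensates for the one removed.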
Despite the MGI mouse phenotype dataset being incomplete, this feature contributes greatly to the classifier's accuracy (5%). To date, gene knockout experiments with recorded ontology-based phenotype information have been performed for approximately 5,000 of the possible 15,287 genes with mouse 1:1 orthologues. Furthermore, the MGI phenotype data are included in the classifier as a binary feature (labelled ‘true’ when a CNV contains one or more genes exhibiting a nervous system phenotype; MP:0003631). Because the MGI phenotype dataset is incomplete, our approach is conservative with respect to missing values: CNVs overlapping genes whose disruption does not result in a nervous system phenotype are weighted equally to CNVs overlapping genes whose disruption phenotypes are currently unknown. Thus, we expect that increased coverage by the MGI mouse knock-out dataset will significantly improve the accuracy of the classifier. In addition, further genomic features such as CpG islands or conserved non-coding regions can now be tested for their potential to improve the accuracy of this approach. Nevertheless, as the densities of many genomic features are strongly correlated, it is likely that the addition of further features to the classifier will not result in a substantial improvement in predictive power.
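The conservative handling of missing knock-out data described above can be illustrated with a short Python sketch. The function name, gene names and annotation table are hypothetical, and the sketch assumes that each gene's annotations have already been propagated up to the top-level nervous system term, so no ontology traversal is shown.

```python
from typing import Iterable, Mapping, Set

NERVOUS_SYSTEM_TERM = "MP:0003631"  # nervous system phenotype


def has_nervous_system_phenotype(
    cnv_genes: Iterable[str],
    phenotypes_by_gene: Mapping[str, Set[str]],
) -> bool:
    """Binary feature: True if any gene in the CNV carries the term.

    Genes absent from `phenotypes_by_gene` (no knock-out data yet) are
    treated exactly like genes whose knock-out shows no nervous system
    phenotype: neither can set the feature to True.
    """
    return any(
        NERVOUS_SYSTEM_TERM in phenotypes_by_gene.get(gene, set())
        for gene in cnv_genes
    )


# Hypothetical annotation table; gene names and the second term are placeholders.
toy_annotations = {
    "GENE_A": {NERVOUS_SYSTEM_TERM},
    "GENE_B": {"MP:placeholder"},
}

print(has_nervous_system_phenotype(["GENE_A", "GENE_X"], toy_annotations))
print(has_nervous_system_phenotype(["GENE_B"], toy_annotations))
print(has_nervous_system_phenotype(["GENE_X"], toy_annotations))
```

Under this encoding, a value of ‘false’ means "no evidence so far" rather than "evidence of absence", which is why broader knock-out coverage would be expected to sharpen the feature.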
Most of the CNVs we used to train the classifier were identified on low-resolution (BAC–based) microarray platforms. In contrast, the replication set contained CNVs collected solely from Affymetrix 250k SNP microarrays. Despite the different microarray technologies used, only a negligible decrease in classification accuracy (−1.7%) was observed between the training and the replication set. This indicates that the classifier is platform-independent and will not require retraining when used on data generated from comparable microarray platforms.
MR-associated CNVs discovered thus far are, in general, larger than benign CNVs. Previously developed CNV risk assessments for identifying disease-associated CNVs use a length greater than 3Mb as a distinguishing criterion. Closer inspection of the MR-associated CNVs from our validation study indeed revealed a larger mean length (6.8Mb) compared to the benign CNVs (474kb). Despite this large mean size, 25% of the MR-associated CNVs in the validation set were smaller than 1.1Mb. We separately tested the accuracy of the classifier on CNVs smaller than 1.1Mb, which revealed a decrease in sensitivity (−18%) but still a high accuracy (93%). As might be expected, small MR-associated CNVs showed a decrease in the number of MGI knock-out genes displaying a nervous system phenotype, but their SINE and gene densities are comparable to those of larger MR-associated CNVs (Table S2). Importantly, the classifier was still able to correctly classify 9 of the 13 small MR-associated CNVs, demonstrating the advantage of the classifier in comparison to conventional interpretation methods, which often are unable to clearly identify clinically relevant CNVs unless specific information about their genomic content is known.
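The size-restricted evaluation above amounts to recomputing sensitivity on the subset of CNVs below a length cut-off; 9 correct calls out of 13 corresponds to a sensitivity of roughly 69%. A minimal Python sketch of that calculation follows. The record type, field names and toy calls are assumptions made for illustration and are not data from this study.

```python
from typing import NamedTuple, Optional, Sequence


class CnvCall(NamedTuple):
    length_bp: int          # CNV length in base pairs
    is_mr_associated: bool  # known label
    predicted_mr: bool      # classifier output


def sensitivity_below(
    calls: Sequence[CnvCall], max_length_bp: int
) -> Optional[float]:
    """Sensitivity restricted to MR-associated CNVs shorter than the cut-off.

    Returns None when no MR-associated CNV falls below the cut-off.
    """
    positives = [
        c for c in calls
        if c.is_mr_associated and c.length_bp < max_length_bp
    ]
    if not positives:
        return None
    detected = sum(1 for c in positives if c.predicted_mr)
    return detected / len(positives)


# Made-up calls, for illustration only.
toy_calls = [
    CnvCall(300_000, True, True),
    CnvCall(500_000, True, True),
    CnvCall(900_000, True, False),
    CnvCall(1_000_000, True, True),
    CnvCall(5_000_000, True, True),   # above the cut-off, ignored
    CnvCall(400_000, False, False),   # benign, ignored for sensitivity
]

small_sensitivity = sensitivity_below(toy_calls, max_length_bp=1_100_000)
print(small_sensitivity)
```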
Although current clinical interpretation of CNVs focuses on large, rare and de novo CNVs, an increasing number of reported genomic loci show variable inheritance and penetrance. Our replication study contained a number of such CNVs, including CNVs at 1q21.1 and 15q13.3 which, in addition, show variation in genomic size and content. Three rare inherited CNVs encompassing the 1q21.1 critical region were all classified as associated with MR, even though their genomic breakpoints differed. Two rare de novo CNVs in the 15q13.3 region were classified differently, one as benign and one as pathogenic. In addition, three inherited CNVs at this locus were all classified as benign. Interestingly, the distal breakpoint for all five CNVs was identical, whereas the proximal breakpoint of the four CNVs classified as benign was extended by an additional 150kb. This difference in classification is explained by the fact that the 150kb region showed a higher repeat element count and density due to repetitive elements surrounding the 15q13.3 critical region (Table S2). This particular example highlights the current challenge in clinical interpretation of CNVs, which relies on the availability of large control datasets. We do not claim that our classification method replaces the need for such datasets. Our method does show, however, that 27 out of 41 (66%) rare inherited CNVs identified in patients contain genomic features similar to previously recognized MR-associated CNVs, a significant proportion when compared to the remainder of the genome. This provides independent support for the clinical relevance of this group of CNVs and shows that the interpretation of CNVs should not be limited to rare de novo CNVs with a fully penetrant dominant effect. Furthermore, in the set of 53 rare CNVs with unknown inheritance, 46 CNVs were classified as being MR-associated, the vast majority with high confidence. These rare CNVs with unknown inheritance demonstrate strong similarities to rare de novo CNVs in that they have a low segmental duplication density, a high SINE density, often contain genes whose mouse knockouts result in nervous system phenotypes, and have similar gene expression values and similar synonymous substitution rates. This suggests that these rare CNVs with unknown inheritance are indeed similar in pathoetiology to rare de novo CNVs and thus can be considered strong candidates for being causal CNVs. The ability of the classifier to identify such CNVs of unknown inheritance should be of great benefit to the diagnostic communities.
This CNV classifier may also be informative for disorders other than mental retardation. This is of particular relevance because CNVs have recently been associated with other neurodevelopmental disorders such as autism and schizophrenia, but screening for causal CNVs in these diseases has yet to be implemented in most clinics. Interestingly, many of the CNVs associated with autism and schizophrenia, as well as mental retardation, contain genes whose proteins are involved in neurotransmission or in synapse formation and maintenance. This supports the existence of shared biological pathways that are disrupted in each of these neurodevelopmental disorders. Our CNV classifier trained on MR CNVs may therefore already have predictive power for CNVs in other neurological disorders. It is likely, however, that this predictive power can be further optimized by retraining the classifier using disease-specific CNVs. In addition, the KEGG and MGI features selected for the MR patient cohort can easily be reconfigured for pathways and phenotypes that are more relevant to these other disease cohorts. For this reason we have made the Java source code of the CNV classifier, called GECCO, freely available (see Materials and Methods).
In conclusion, we have developed a novel objective method to identify disease-associated CNVs that overcomes several limitations of current CNV interpretation methodology. Our NBTree classifier is able to distinguish between MR-associated CNVs and benign CNVs with high accuracy without the use of data from large control cohorts or parental samples. Our results indicate that computational classification methods can be used for objectively prioritizing CNVs in clinical research and diagnostics. The tool for classifying CNVs, called GECCO (Genomic Classification of CNVs Objectively), as well as the Java source code, are readily available online. The benefits of such methods will increase with advancements in microarray technology, which already identifies many thousands of such structural variants per individual, and in whole genome resequencing technology. Establishing objective criteria and methods for interpretation of these genomic variants will be crucial for implementation of these technologies in a clinical setting.