All the analyses were performed on three data sets: smoothed CBS, GLAD and unsmoothed CBS. In the training set they contained 1448, 1624 and 2037 candidate segments, respectively, of which 904 (62%), 744 (46%) and 1448 (71%) were considered true CNVs. The test set contained 1683, 1738 and 2686 candidates, with 939 (56%), 846 (49%) and 1727 (64%) true CNVs, respectively. The overlap between smoothed CBS and GLAD was substantial: 638 (510) segments in the training set and 761 (674) in the test set were true CNVs (true CNAs) under both methods. The training set contained more patients than the test set but fewer candidate segments. This can be explained by the fact that the training set had slightly noisier arrays (higher MAD of residuals), and therefore less power to detect smaller segments.
To test the association of predictors with true CNV status we pooled the training and test sets. The results within these sets separately were very similar and are not presented. Table contains both ANOVA p-values and regression β coefficients.
Univariate results by logistic regression, training and test sets combined.
The smoothed CBS and GLAD had very similar rankings of significant predictors and their effects. As expected, Database score had the most significant p-value, followed by Matching breakpoint in other patients - percent in CBS or Overlap with other patients - percent in GLAD, Length, and Percent of Normal. All other predictors were also significant. Unsurprisingly, overlap with many variants from the DGV was a strong positive predictor of being a CNV. Segments that were shorter, matched candidates from many other patients, or overlapped with both gain and loss candidates in other patients were also more likely to be CNVs. Having other patients with overlapping candidate losses only was also a positive predictor. As seen from the direction of the main effects in Table , CNVs tended to have larger absolute values of segment means; were often surrounded by Normal segments; were located on chromosomes with fewer gains and losses, or with other candidates with high Database score; or were located close to a telomere, centromere or segmental duplication. We also saw several clusters of small CNAs right next to each other, so the presence of other candidate segments within 500 kb was predictive of CNA.
In unsmoothed CBS the strongest predictor was Overlap with other patients, followed by the Database score. One possible explanation for this difference is that small CNVs are underrepresented in the DGV but are likely to appear in the unsmoothed arrays of other patients in the cohort. The other notable difference from smoothed segmentation is that closeness to a centromere, telomere or segmental duplication was not significant, possibly because longer CNVs tend to be located there. In fact, the interaction term between length and closeness to a centromere (or segmental duplication) was significant in logistic regression for both smoothed and unsmoothed CBS. As demonstrated by the interaction effect, longer segments at these locations were even more likely to be CNVs.
Note that these associations are not causal, and the mechanisms by which CNVs occur and fixate in the population are still to be elucidated.
We fitted prediction models using smoothed CBS, GLAD and unsmoothed CBS with 4 different sets of predictors. Accuracy was defined as the percentage of correctly classified candidate segments among all CNV candidates in all tumor samples of a validation set, and it was evaluated on 3 validation sets. The full set of predictors contained all the variables described in Table except Database score II and Height, which were nearly equivalent to the already included variables Database score and Relative height. We will first discuss the results based on smoothed CBS. The fitted CART model selected only five predictors, as shown in Figure . The first split was made on the Database score: if the probes in the candidate segment were included in the DGV at least 2.45 times on average, the candidate was predicted to be a CNV. Otherwise, only segments with the following characteristics were predicted to be CNVs: 1) segments shorter than 30 kb; or 2) segments longer than 30 kb that had matching candidate segments in 37% or more of the other patients. The prediction accuracy of this model, estimated on the smoothed CBS test set, was 86%, as shown in Table . Table gives the numbers of candidate segments that were correctly and falsely classified. There were 793 true CNVs predicted to be CNVs, and 654 correctly identified true CNAs. Interestingly, the number of CNAs falsely identified as CNVs (182) was much higher than the number of missed CNVs (54). We believe CNVs were easier to identify because the DGV contains extensive information about them.
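For readers who find a predicate easier to follow than prose, the fitted tree for smoothed CBS can be sketched as a short function. The function and argument names are our own illustration; only the thresholds (2.45 average DGV hits, 30 kb, 37%) come from the tree described above.

```python
def predict_cnv_smoothed_cbs(db_score, length_kb, match_pct):
    """Sketch of the fitted CART rule for smoothed CBS (names illustrative).

    db_score:  average number of times the segment's probes appear in the DGV
    length_kb: segment length in kilobases
    match_pct: percent of other patients with a matching candidate segment
    """
    if db_score >= 2.45:        # first split: Database score
        return True             # predicted CNV
    if length_kb < 30:          # short segments are predicted CNVs
        return True
    return match_pct >= 37      # longer segments need matching candidates
```

For example, `predict_cnv_smoothed_cbs(1.0, 100, 10)` returns `False`: a segment rarely seen in the DGV, longer than 30 kb, and matched in few other patients is predicted to be a CNA.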
Prediction rates: A - test set, B - CGH against self-reference (all CNAs), C - normal tissue (all CNVs).
Counts from the accuracy table of the test set.
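As a check on the arithmetic, the reported 86% accuracy can be recovered directly from the four confusion counts quoted above (variable names are ours):

```python
# Confusion counts for the smoothed CBS test set, as reported in the text:
cnv_as_cnv = 793  # true CNVs predicted to be CNVs
cna_as_cna = 654  # true CNAs predicted to be CNAs
cna_as_cnv = 182  # CNAs falsely identified as CNVs
cnv_as_cna = 54   # missed CNVs

total = cnv_as_cnv + cna_as_cna + cna_as_cnv + cnv_as_cna
accuracy = (cnv_as_cnv + cna_as_cna) / total
print(f"{total} candidates, accuracy = {accuracy:.0%}")
# 1683 candidates, accuracy = 86%
```

The total of 1683 matches the number of candidate segments in the smoothed CBS test set.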
The two other validation sets were the set of 39 tumors hybridized against self-reference, so that all of its 1780 candidates were considered true CNAs, and the set of 8 normal tissue arrays, so that all of its 257 candidates were considered true CNVs. As seen in Table , 79% and 90% of these segments, respectively, were identified correctly. As in the test set, the rate of missed CNVs was lower.
The RF model with the same set of predictors increased the accuracy by 1% on the test set, and by 3-5% on the 'all CNAs' and 'all CNVs' sets. Since the best model was a combination of many trees, it is difficult to display; however, the relative importance of each variable, measured by the Gini index, is shown in Table . The more influential variables have higher indices. The ranking of the variables was roughly consistent with the univariate results: the top predictors were Database score, Length, Matching breakpoint in other patients - percent, Overlap with other patients - percent, Percent of Normal and Relative height.
Relative importance of variables in random forest models as measured by Gini index (higher is more important).
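The Gini importance of a variable accumulates, over all splits made on that variable across all trees of the forest, the decrease in Gini impurity achieved by the split. A minimal sketch of the underlying quantities, with made-up toy data in which Database score separates CNVs from CNAs perfectly:

```python
def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_cnv = sum(labels) / n            # labels: 1 = CNV, 0 = CNA
    return 1.0 - p_cnv ** 2 - (1.0 - p_cnv) ** 2

def gini_decrease(labels, values, threshold):
    """Weighted impurity decrease from splitting `values` at `threshold`."""
    left = [l for l, v in zip(labels, values) if v < threshold]
    right = [l for l, v in zip(labels, values) if v >= threshold]
    n = len(labels)
    return (gini(labels)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# Toy example (illustrative data, not from the study):
scores = [0.1, 0.4, 0.8, 2.6, 3.0, 5.2]
labels = [0, 0, 0, 1, 1, 1]
print(round(gini_decrease(labels, scores, 2.45), 3))
# 0.5: a perfect split removes all of the node's impurity
```

A random forest implementation sums such decreases, weighted by node size, for every split on a given variable; variables that repeatedly produce large, clean splits receive high importance.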
We also fitted a CART model with a single variable, Database score. Its only split was the same as the first split of the full model: segments seen in the DGV at least 2.45 times on average were predicted to be CNVs. The prediction accuracy of this model was 85%, 88% and 81% on the test, 'all CNAs' and 'all CNVs' sets respectively. Therefore, using all the proposed predictors in addition to the Database score increased the accuracy on the test set by 2%. As can be seen from Table , the sensitivity of the full RF model is 82%, slightly lower than the 84% sensitivity of the Database score only model. The specificity of the full RF model, however, is much higher: 94% compared to 85%.
The fitted CART tree was different for GLAD, although the first split was still made on the Database score. As seen in panel (b) of Figure , segments were predicted to be CNVs if 1) they were included in the DGV at least 3 times on average and had a relative absolute mean greater than 1.5; or 2) they were included in the DGV fewer than 3 times on average and overlapped with candidates in at least 38% of the other patients of the cohort. In the RF model for GLAD, the 6 variables with the highest Gini indices (Table ) were the same as in smoothed CBS, though with a slightly different ranking. In spite of these differences, the prediction characteristics of models based on GLAD and smoothed CBS were similar. The RF with all predictors correctly identified 86% of candidates in the test set, 91% of 1861 true CNAs in the 'all CNAs' dataset, and 87% of 247 true CNVs in the 'all CNVs' set, and the rates of false CNVs and false CNAs were more balanced. The model with only Database score had the same first split, at a Database score of 3, and its accuracy on the test set was only 3% lower than that of the full model. For GLAD, the sensitivity was 82% and 78% for the full RF and Database score only models respectively, while the specificity was 90% and 87%.
Since many of the predictors were highly correlated, there could be many classification trees with similar prediction accuracy, so the difference in models between GLAD and smoothed CBS might be a result of random variation rather than of fundamental segmentation differences. In fact, when we applied the full RF developed on the GLAD-segmented training set to the smoothed CBS test set, the prediction accuracy was 87%, the same as that of the model developed on smoothed CBS. Similarly, the RF developed on the smoothed CBS training set achieved 83% accuracy when assessed on the GLAD test set, just 3% lower than the RF developed on the GLAD training set.
Prediction modeling based on unsmoothed CBS had several important differences. The variable Matching breakpoint in other patients - percent served as the first split in the classification tree. A segment was predicted to be a CNV if it 1) had candidates with matching breakpoints in more than 1.2% of other patients and was either shorter than 396 kb, or longer than 396 kb and included in the DGV on average at least 4.5 times; or 2) had candidates with matching breakpoints in less than 1.2% of other patients and was either shorter than 22 kb, or shorter than 77 kb and included in the DGV on average at least 1.3 times, or longer than 77 kb and included in the DGV on average at least 3.1 times. Note that in our training set the first split is equivalent to having at least one other tumor with a breakpoint exactly matching the breakpoint of a candidate. The RF had the same 6 variables with the highest Gini indices as the two other segmentation methods, and it correctly predicted 84% of segments in the test set, as well as 77% of 1785 CNAs and 92% of 464 CNVs in the two other validation sets. Unlike in smoothed CBS and GLAD, the classification tree that included only Database score achieved only 72% accuracy on the test set, 12% lower than the full model. The sensitivity was 83% and 69% for the full RF and Database score only models respectively, while the specificity was 86% and 91%. This model had a much higher false CNV rate: 66% of all CNAs in the test set and 66% of the CNAs in the 'all CNAs' validation set were falsely identified as CNVs (see Tables , ). We speculate that unsmoothed CBS contained smaller intervals that rarely appeared in the DGV, so the Database score was less informative about them. As a result the model had lower prediction rates on the validation sets.
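The nested conditions of the unsmoothed CBS tree are easier to read as code. The sketch below is our own rendering: names are illustrative, the split thresholds are those reported above, and we assume (as is conventional for CART) that a "included on average k times" condition means at least k.

```python
def predict_cnv_unsmoothed_cbs(match_bp_pct, length_kb, db_score):
    """Sketch of the fitted CART rule for unsmoothed CBS (names illustrative).

    match_bp_pct: percent of other patients with a matching breakpoint
    length_kb:    segment length in kilobases
    db_score:     average number of times the segment's probes appear in the DGV
    """
    if match_bp_pct > 1.2:   # equivalently: at least one exactly matching breakpoint
        return length_kb < 396 or db_score >= 4.5
    if length_kb < 22:       # very short segments are predicted CNVs
        return True
    if length_kb < 77:       # intermediate lengths need modest DGV support
        return db_score >= 1.3
    return db_score >= 3.1   # long segments need stronger DGV support
```

For instance, `predict_cnv_unsmoothed_cbs(0, 50, 0.5)` returns `False`: a 50 kb segment with no matching breakpoints and little DGV support is predicted to be a CNA.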
While the DGV provided the strongest univariate information, we investigated whether it was absolutely necessary for predicting CNVs by fitting RF models that excluded Database score and Database score of other candidates. We saw only a modest drop in prediction accuracy of 0-2%. The most important variables suggested by the Gini index (Matching breakpoint in other patients - percent, Overlap with other patients - percent, Length, Relative height, and Percent of Normal) were the same across all three segmentation methods.
Since Overlap with other patients and Overlap with CNAs are only informative when there are multiple patients in the cohort, we also considered models without these variables, which could therefore be applied to single arrays. They are presented in the last row of Tables and . There was a 1-2% loss of accuracy compared to the full models.
The accuracy measure in Table represents a per-study error rate, where all potential CNVs in all tumors are pooled together. Note that CNVs that appear in many patients will have a higher chance of being correctly classified than rare CNVs. Thus, per-variant accuracy might be lower on average than per-study accuracy. We do not estimate the per-variant error rate since it is not clear how to classify observed segments into distinct variants. Another accuracy metric we considered was per-tumor accuracy. The median per-tumor accuracy rates are shown in Additional file 1, Table S3, and are very similar to the per-study accuracy rates.
Since, as mentioned, the training set had slightly noisier arrays, we also examined whether switching the training and validation sets would lead to different results. The accuracy rates were in fact similar to those shown in Table . For example, the accuracy of the full RF model and of the model with Database score only, developed on the former test set and tested on the former training set, was 86% and 81% respectively for smoothed CBS, 88% and 84% for GLAD, and 88% and 78% for unsmoothed CBS. Thus, the full model retained a similar advantage over the Database score only model.
While the basic CART model with just Database score works fairly well, we tested it against a basic ad-hoc rule often used in the literature: a candidate is classified as a CNV if it overlaps variants reported in the DGV by at least two different studies (manuscripts), and as a CNA otherwise. On the combined training and test set (no model development is involved), this ad-hoc rule correctly identified 65%, 62% and 60% of candidates in smoothed CBS, GLAD and unsmoothed CBS respectively, substantially fewer than the Database score only CART model.
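The ad-hoc rule amounts to counting distinct DGV studies overlapping a candidate; a minimal sketch (the function name and input representation are our own):

```python
def adhoc_cnv_call(overlapping_dgv_studies):
    """Ad-hoc rule from the literature: classify a candidate as a CNV if it
    overlaps DGV variants reported by at least two distinct studies.

    overlapping_dgv_studies: identifiers (illustrative) of the studies whose
    reported variants overlap the candidate segment.
    """
    return len(set(overlapping_dgv_studies)) >= 2

print(adhoc_cnv_call(["study_A", "study_B"]))  # True  -> classified as CNV
print(adhoc_cnv_call(["study_A"]))             # False -> classified as CNA
```

Unlike the CART split on Database score, this rule ignores how often the segment's probes are seen in the DGV, which may explain its lower accuracy.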
Validation using ovarian dataset
We expect the models to be valid across all cancers as long as few CNVs are associated with cancer. To verify this we obtained 38 pairs of matched normal and ovarian cancer samples from the TCGA website. They were segmented using smoothed CBS only, the method we predominantly use in practice; since the arrays were already smoothed during the normalization process, unsmoothed segmentation was not performed. There were 2623 candidate CNVs in the tumor samples, of which 485 were called true CNVs by the same algorithm that was used for the glioblastoma data. The models developed on smoothed CBS and GLAD segmentation had prediction accuracies of 86% and 87%, respectively, when they contained the DGV information only, and of 89% and 90% when all predictors were utilized. As in the GBM data, the error rates of the full smoothed CBS model were unbalanced: 230 CNAs predicted as CNVs and 70 missed true CNVs, with 1908 and 415 correctly identified CNAs and CNVs. The errors were well balanced for the GLAD-derived model; the respective counts were 134, 137, 2004 and 348. Thus, a model derived from GLAD-segmented glioblastoma data offers about 90% prediction accuracy, 3% higher than the DGV only model, as well as balanced error rates, even in ovarian cancer data.
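The ovarian counts above can likewise be checked by pooling them into accuracies (the dictionary layout is our own):

```python
# Confusion counts on the ovarian (TCGA) tumor samples, as reported above:
# (correct CNAs, correct CNVs, CNAs called CNVs, missed CNVs)
models = {
    "smoothed CBS": (1908, 415, 230, 70),
    "GLAD":         (2004, 348, 134, 137),
}
for name, (tn, tp, fp, fn) in models.items():
    total = tn + tp + fp + fn
    print(f"{name}: n={total}, accuracy={(tn + tp) / total:.0%}")
# smoothed CBS: n=2623, accuracy=89%
# GLAD: n=2623, accuracy=90%
```

Both models cover all 2623 candidates, and the computed accuracies reproduce the 89% and 90% quoted in the text.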