|Home | About | Journals | Submit | Contact Us | Français|
High-throughput genotyping and sequencing techniques are rapidly and inexpensively providing large amounts of human genetic variation data. Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability and have been implicated in several human diseases, including cancer. Amino acid mutations resulting from non-synonymous SNPs in coding regions may generate protein functional changes that affect cell proliferation. In this study, we developed a machine learning approach to predict cancer-causing missense variants. We present a Support Vector Machine (SVM) classifier trained on a set of 3163 cancer-causing variants and an equal number of neutral polymorphisms. The method achieve 93% overall accuracy, a correlation coefficient of 0.86, and area under ROC curve of 0.98. When compared with other previously developed algorithms such as SIFT and CHASM our method results in higher prediction accuracy and correlation coefficient in identifying cancer-causing variants.
Single Nucleotide Polymorphisms (SNPs) are a specific class of genomic variation responsible for about 90% of human variability . In particular the SNPs occurring in coding regions may have higher impact affecting the function of the transcribed proteins . More efficient sequencing and genotyping techniques are detecting a large amount of human genetic variation data . Different international consortiums are collecting information about variations in human genome. The HapMap consortium  is characterizing common variation and linkage disequilibrium patterns that can be related to common diseases [5,6]. The Human Variation Project  has been funded to collect, curate, and make accessible information on genetic variations affecting human health. International institutions are collaborating in the 1000 Genomes Project (http://www.1000genomes.org/) to produce the most complete catalog of genetic variations in human population . In 2005, the Wellcome Trust Case Control Consortium (WTCCC) has been established to understand the relationship between human genome sequence variation and disease. Using high-throughput technologies, WTCCC collaborators have genotyped about 14,000 patients for seven common diseases performing one of the largest Genome-Wide Association Study (GWAS) . This effort results in an increasing number of SNPs data stored in databases available online. Currently, the dbSNP database at the NCBI  collects about 20 million of validated human SNPs. The manually curated SwissVar database  reports the possible pathologic effect of about 61,000 missense SNPs and the public version of the HGMD database  includes more than 74,000 mutations causing or associated with human inherited disease, plus disease-associated/functional polymorphisms. It is evident that there is a need of computational methods to analyze and identify functionally important variants and describe their molecular effects. During the last decade several bioinformatics methods has been developed to predict the effect of a particular class of SNPs resulting in Single Amino acids Polymorphisms (SAPs) [12–14]. In general, computational methods for the prediction of the impact of SAPs use empirical rules [15,16], Hidden Markov Models , Neural Networks[18,19], Decision Tree [20,21], Random Forest [22–26] and Support Vector Machines [27–31], algorithms relying on amino acid sequence, structure and evolutionary information. The amino acid sequence provides information about the physico-chemical properties of the mutated residues such as hydrophobicity, charge, polarity, bulkiness etc. Structural information describes the structural environment of the mutation and has been successfully used to predict the protein stability change upon mutation [32,33]. The most important source of information for the characterization of the effect of SAPs is the evolutionary information. The main hypothesis presumes that important amino acids will be conserved in the protein family, and so changes at well-conserved positions tend to be predicted as deleterious. Recently, a second generation of algorithms that includes also knowledge-based information [24,25,28] has shown better performances with respect to older predictors. The first developed methods SIFT  and PolyPhen  use different representation of evolutionary information. For each mutated site, SIFT scores the normalized probabilities for all possible substitutions using a multiple sequence alignment between homolog proteins and PolyPhen evaluates the impact of SAPs calculating different sequence-based features and a Position Specific Independent Counts (PSIC) matrix from a multiple sequence alignment. Protein family HMM models implemented in PANTHER  have been used to predict deleterious mutations and recently, protein three-dimensional structure features have been shown to improve the performance of SAPs prediction algorithms [22,27,31]. Machine learning-based methods such as PhD-SNP  and SNAP  have shown better results with respect to traditional methods. The new class of predictors relying on knowledge-based information results in overall accuracy higher than 80%. SNPs&GO  includes a new function annotation score calculated using GO terms and MutPred  evaluates the impact of a given variant considering the output of several machine learning approaches. A selected list of web available tools for the detection of deleterious missense variants is reported in Supplementary Table 1.
Although available methods are producing valuable results in the detection of disease-related mutations they do not provide any information about the associated pathology. Only MutPred  is the first attempt of algorithm able to provide information about the disease mechanism.
To address this problem, we propose a new class of disease-specific predictors trained on a subset of SAPs related to specific disease classes. One of the highest causes of mortality and morbidity in the developed countries is cancer. Although several advances have been made in cancer therapy [35,36], the disease mechanism is still largely unclear. Unlike Mendelian disease where the pathology is principally related to one gene, cancer is a complex disease that often involves several genes. Although it is difficult to dissect the contribution of each gene, individual variants could be indicators of disease risk . To address this problem, two machine learning-based methods have been proposed to predict cancer-causing mutation [23,24]. CanPredict  combines SIFT output, a PFAM  and a functional-based scores  to predict cancer-causing mutations and CHASM  takes in input several sequence and profile features to discriminate between passenger and driver variants. These methods are addressing two different aspects of the problem: CanPredict discriminates deleterious mutations occurring in cancer genes from neutral variants from dbSNP database and CHASM detects driver SAPs in cancer-related proteins. To reduce possible over-estimation of the performances , we tested our method considering all the driver cancer variants of the same protein either in training or testing set. Our disease-specific machine learning-based predictor, which has been extensively tested on a large set of manually annotated from different sources, results in good level of accuracy when compared with previously implemented methods.
In this work we use as synonymous the words single amino acid polymorphism (SAP), missense variant and SNP although the term variant is more general and includes also missense SNP with allele frequency lower than 0.01. We distinguished three classes of variants: cancer-causing, neutral polymorphisms and other disease-related SAPs. We refer to cancer-causing SAPs as the driver variants identified to play a functional role in oncogenic cell transformation and used to test and train CHASM algorithm . The missense SNPs without any evidences of association to disease in SwissVar and recently selected as negative cases  are indicated as neutral variants or polymorphisms. We also used Synthetic passengers SAPs generated by CHASM as neutral polymorphisms. A set of variants associated to pathologies not related to the MeSH term “neoplasms” are referred as other disease-related variants. In the binary classification problem addressed in this paper, all the variants are classified in Disease and Neutral. The driver cancer variants belong to the class Disease (D). Passenger, neutral and other disease-related variants, that are not associated with the insurgence of cancer, are classified as Neutral (N).
The selection of a representative set of variants for the training and testing of SAPs prediction methods is a key issue. The performances of the algorithms are strongly dependent on the selected set of neutral and disease-related polymorphisms . For this study, we collected SAPs data from different sources. Cancer-causing variants are selected from breast, colorectal, pancreatic tumor resequencing studies [43–45] and COSMIC database  that are provided with CHASM package. Neutral variants are from Swiss-Prot database  or generated by CHASM. Other disease-related variants are non “neoplasms” disease-related variants annotated in SwissVar database .
In particular the neutral polymorphisms and other disease-related variants from SwissVar have been selected according to a recently described procedure . We built three main datasets to train and test the ability of our method to detect cancer-causing variants. The CNO dataset (Cancer and Neutral missense variants only) with a total number of 6326 variants is composed by 3163 cancer-causing variants and an equal number of neutral polymorphisms. The 3163 cancer-causing mutations from 74 proteins in CNO dataset have been selected from the set of driver cancer mutation used to train CHASM algorithm . The 3163 neutral polymorphisms included in the CNO dataset have been randomly selected from the subset of neutral SAPs in SwissVar database with allele frequency higher than 0.01 and chromosome sample count higher than 49 from the dbSNP database  build 131. The performance of our method has been evaluated on the subsets of the CNO dataset with primary histology annotated in the COSMIC database as Carcinoma, Hematopoietic Neoplasm, Lymphoid Neoplasm, Glioma and Malignant Melanoma. The Carcinoma, Hematopoietic, Lymphoid, Glioma and Melanoma subsets are composed respectively by 1899, 461, 441, 384 and 257 driver cancer variants and an equal number of neutral polymorphisms. To test the performance of our predictor in the discrimination between cancer and other disease-causing variants, we build the CND dataset (Cancer, Neutral and other Disease-related missense variants) substituting 50% of neutral polymorphisms with same number of randomly selected from disease-related variants in SwissVar not associated to the MeSH term “neoplasms”.
We have also tested our method in the discrimination between driver and passenger cancer variants building the Synthetic dataset composed by the 3163 driver mutations included in the previous two datasets and an equal number of passenger variants generated by CHASM algorithm. The composition of the three datasets and subsets used in this work is summarized in Table 1.
The proposed task is to predict whether a given missense variant is a neutral or involved in the insurgence of cancer. The task is treated as a binary classification problem for the protein variants. The Support Vector Machine (SVM) classifies SAPs in cancer-causing (desired output set to 0) and neutral polymorphism (desired output set to 1). The SVM output is a number between 0 and 1 and the decision threshold has been set to 0.5. The input features of our algorithm (SPF-Cancer) include: the amino acid mutation, its local sequence environment, sequence-profile derived features, the output of PANTHER algorithm  and a cancer-specific functional-based log-odd score calculated considering the GO slim ontology. The final input vector consists of 51 values:
Two other predictors have been developed considering subset of features: mutation site specific method (SeqProf) with input features composed by the 45 elements vector corresponding to Seq and Prof data and protein specific method (F-Cancer) with 2 elements vector features encoding for the cancer-specific functional score (LGO). A third predictor (SPF-All) has been developed calculating a generic functional log-odd score on the whole set of SwissVar SAPs including all type of diseases.
The input vector portion relative to sequence information consists of 40 values: the first 20 (the 20 residue types) explicitly define the mutation by setting to –1 the element corresponding to the wild type residue and to 1 the newly introduced residue (all the remaining elements are kept equal to 0). The last 20 input values encode for the mutation sequence environment (again the 20 elements represent the 20 residue types). Each input is provided as the number of the encoded residue type, to be found inside a window centered at the residue that undergoes mutation and that symmetrically spans the sequence to the left (N-terminus) and to the right (C-terminus) with a length of 19 residues .
We derive for each mutation: the frequency of the wild type, the frequency of the mutated residue, the number of totally and locally aligned sequences and a Conservation Index (CI)  for the position at hand: the more a residue is functionally important the more is conserved over evolution. The Conservation Index is calculated as:
where fa(i) is the relative frequency of residue a at mutated position i and fa is the overall frequency of the same residue in all the alignmed positions. The sequence profile is computed from the output of the BLAST program , running on the uniref90 database (release 13.3 April 2008) (E-value threshold = 10−9, number of runs = 1).
The 4 elements vector from PANTHER  output is composed by the probability of deleterious mutation, the frequencies of the wild-type and new residues in the PANTHER family alignment and the number of independent counts. In case that PANTHER does not return any output the probability of deleterious mutation is set to 0.5 and the remaining value has been set to 0.
The Gene Ontology log-odds score (LGO) is computed to derive information related to the correlation among a given SAPs effect (cancer-causing and neutral) and the protein function. The annotation data are relative to the Gene Ontology  Database version 1.37 and are retrieved at the web resource hosted at the European Bioinformatics Institute (EBI). The version of gene ontology classification we used (Dec 2009) contains 30,304 Gene Ontology (GO) terms. To reduce the number of terms and have more general functional terms we consider the GO slim annotation. The GO slim is a simplified version of the GO ontology containing a subset of the terms in the whole GO. They give a broad overview of the ontology content without the detail of the specific terms. In this work we used the generic GO slim ontology (release Sep. 2009) that consists of 132 different GO terms. The generic GO slim file has been downloaded from the Gene Ontology web site (http://www.geneontology.org/GO_slims/goslim_generic.obo). To calculate the LGO, first we derived the GO terms (from all the three branches: molecular function, biological process and cellular components, when available) for all our proteins in the dataset (CNO). For each annotated term the appropriate ontology tree was used to retrieve all the parent terms with the GO-TermFinder-0.8 tool (http://search.cpan.org/dist/GO-TermFinder/)  and counting a GO term only once. When all the GO terms for each protein have been collected, we mapped them on the generic GO slim terms using the map2slim.pl script downloaded from the Gene Ontology web site. The LGO is finally calculated as the log-odds score associated to each protein:
where fGO is the frequency of occurrence of a given GO slim term for the cancer-causing (D) and neutral mutations (N) adding one pseudocount to each class. The LGO scores are evaluated considering fGO values computed over the training sets without including in the GO slim term counts of the corresponding test set. This strategy avoids overfitting in the cross-validation procedure.
The LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) has been used for the SVM implementation . The selected SVM kernel is a Radial Basis Function (RBF) kernel K(xi,xj) = exp(−γxi − xj2) and γ and C parameters are optimized performing a grid like search. After input rescaling the values of the best parameters are C = 8 and γ = 0.03125.
The results obtained with our SVM methods are evaluated using a cross-validation procedure on the CNO dataset. The reported data for the classification task performed by the SVM methods are obtained adopting a 20-fold cross-validation procedure in such a way that the ratio of the disease-related to the neutral polymorphism mutations is similar to the original distribution of the whole set. Furthermore, all the proteins in the CNO datasets are clustered according to their sequence similarity with the blastclust program in the BLAST suite by adopting the default value of length coverage equal to 0.9 and the percentage similarity threshold equal to 30%. We kept the mutations detected on the same protein cluster s in the same training set to prevent an overestimation of the results. In the comparison with CHASM and SIFT, the methods are tested using a similar strategy used in the CHASM paper . The whole Synthetic dataset is divided in two similar subsets composed same number of drivers and passenger cancer variants. The accuracy measures are calculated using a 2-fold cross validation procedure. In this paper, the efficiency of the predictors is scored using the following statistical indexes.
The overall accuracy is:
where CP is the total number of correctly predicted mutations and T is the total number of mutations.
The Matthews correlation coefficient C is defined as:
where D is the normalization factor:
for each class s (D and N, stand for cancer-causing and neutral polymorphism, respectively); p(s) and n(s) are the total number of correct predictions and correctly rejected assignments, respectively, and u(s) and o(s) are the numbers of false negative and false positive for the class s.
The coverage S (sensitivity) for each discriminated class s is evaluated as:
where p(s) and u(s) are the same as in Eq. (5).
The probability of correct predictions P (or positive predictive values) is computed as:
where p(s) and o(s) are the same as in Eq. (5) (ranging from 0 to 1).
Finally, it is very important to assign a reliability score to each prediction. For each output O(D), this is obtained by computing:
Other standard scoring measures, such as the area under the ROC curve (AUC) and the true positive rate (TPR = Q(s)) at 10% of False Positive Rate (FPR = 1-P(s)) are also computed .
We evaluated our method for predicting cancer-causing missense variants (SPF-Cancer) using a 20-fold cross-validation procedure on the CNO dataset. The SPF-Cancer predictor reaches 93% of overall accuracy, 0.86 correlation coefficient and area under the ROC curve 0.98 (see Table 2). When 10% of false positive are accepted the true positive rate is 0.94 (see Fig. 1 panel A). If predictions with reliability index (RI) higher than 4 are selected, the method results in ~96% accuracy and 0.92 correlation coefficient on 91% of the datasets (see Fig. 1 panel D). We also evaluated the accuracy of our algorithm on the subsets of variants associated to different histology description in COSMIC database. In comparison with the results on CNO dataset, our predictor shows similar performances on the Carcinoma, Lymphoid and Glioma subsets. Contrarily, SPF-Cancer results in 2% higher accuracy and 0.04 higher correlation coefficient on the Melanoma subset and 3% lower accuracy and 0.06 lower correlation coefficient on the Hematopoietic subset with respect to CNO dataset (see Table 2).
The ability of SPF-Cancer in the classification of cancer-causing missense variants, has been tested using the CND dataset that includes 25% of variants from other diseases. In Table 2, we show that the accuracy and the AUC of SPF-Cancer on CND dataset are only 3% lower with respect to those on the CND dataset.
To score the improvement of accuracy resulting from the combination of protein sequence, evolutionary and functional information, the SPF-Cancer method has been compared with simpler SVM-based approaches including either protein sequence and profile information (SeqProf) or only functional information (F-Cancer). On CNO dataset SeqProf and F-Cancer methods result in 64% and 92% overall accuracies and 0.28 and 0.85 correlation coefficients respectively (see Table 3). Thus, SFP-Cancer that includes all the input features results in 1% more accurate predictions and 0.02 higher correlation coefficient with respect to F-Cancer. More interestingly, the SeqProf and F-Cancer results can be used as a filter to select high reliable predictions. In ~62% of the variants in CNO dataset, for which the predictions of SeqProf and F-Cancer methods agree (Consensus), the overall accuracy of SPF-Cancer reaches 96% of accuracy, 0.92 correlation coefficient and 0.99 AUC (see Fig. 1 panel E). On the remaining subset of variants (~38%) where the predictors disagree (notConsensus), SPF-Cancer results only in 88% overall accuracy and 0.76 correlation coefficient (see Fig. 1 panel F). To explain the different level of accuracy between Consensus and notConsensus subset we plot the distributions of the CI values for cancer-causing and neutral variant (see Fig. 2 panel A) and calculated distances (d) between the cumulative distribution for the Kolmogorov–Smirnov (KS) test. The resulting distances are 0.21, 0.44 and −0.22 for the CNO, Consensus and notConsensus datasets respectively. We observed similar trend plotting the distributions of the LGO-scores (see Fig. 2 panel B). In this case, the distances associated to the KS test are 0.87, 0.92 and 0.78 respectively. In Table 4 we reported the summary of the comparison between the CI and LGO distributions.
We compared the performance of SPF-Cancer with those obtained by SIFT, CHASM and a similar SVM-based predictor with generic GO slim-based score (SPF-All) calculated using whole set of disease-related variants (see Table 5). On the Synthetic dataset, SIFT and CHASM result in 68% and 80% overall accuracies and 0.22 and 0.60 correlation coefficients respectively. Thus, SPF-Cancer shows more than 10% higher accuracy and correlation coefficient with respect to CHASM. SPF-Cancer also results in 2% higher overall accuracy and 0.06 higher correlation coefficient when compared with SPF-All. To estimate the significance of the differences between the four predictors, we calculated the χ2 obtained comparing the confusion matrix SPF-Cancer with those of SPF-All, CHASM and SIFT. The associated probabilities to observe this differences by chance are 3.4×10−5, 8.6×10−82 and 0 respectively for SPF-All, CHASM and SIFT.
The GO score used in this work, has been calculated using GO slim terms. To better understand the ability of the method to correctly classify cancer-causing mutations score, we compare the values of cancer-specific and generic LGO scores. In particular the comparison between the LGO values calculated on the dataset driver cancer variants and on the dataset including all disease-related variants has been used to detect GO terms associated to cancer. Although the LGO scores are dependent on the training set, their relative values obtained in comparison with generic LGO scores provide an estimation of the GO terms’ occurrences. Thus, a positive difference between the cancer-specific and generic LGO scores indicates an enrichment of the relative GO terms in the cancer specific dataset while negative difference corresponds to GO terms more abundant in the dataset including all disease-related variants. In Fig. 3 the scatter plot of the generic LGO score versus the cancer-specific LGO score for each GO slim term. The interesting GO functions are those corresponding to the points far from the diagonal. The points with negative generic LGOs and positive cancer-specific LGOs are those with GO slim functions related to cancer. The points with cancer-specific LGOs close to zero and higher generic LGOs are those with GO slim functions generally associated to the all the pathologies in SwissVar dataset. For example, in our study we observed that Growth (GO:0040007) and Kinase Activity (GO:0016301) GO slim terms have stronger association to cancer showing respectively cancer-specific LGOs 4.02 and 3.30 and generic LGOs 2.63 and 1.78. Other interesting GO slim terms associated to all the diseases are the Transporter Activity (GO:0005215) and Oxygen Binding (GO:0019825) which have respectively cancer LGOs −7.77 and −4.09 and generic LGOs 1.20 and 2.99. There are also GO slim terms that have similar values for cancer and generic diseases LGO scores. Two examples are the Carbohydrate Metabolic Process (GO:0005975) that has similarly related cancer and all the diseases in our dataset resulting in LGO scores respectively 2.55 and 2.23, and the Calcium Ion Binding (GO:0005509) that is not related to cancer and slightly associated to all the diseases showing LGO scores −0.01 and 0.56 respectively.
In general cancer-specific prediction methods have been trained either to discriminate between passenger and driver cancer-causing SAPs in a known cancer-related protein or to detect cancer-causing using a negative set of neutral SAPs in proteins with different functions. SPF-Cancer method has been tested on both tasks. We built the CNO dataset selecting all the cancer-causing variants used to train and test CHASM method and an equal number of randomly selected neutral polymorphism from a curated set of variants recently used to test the performances of predictive algorithms . The results obtained on this dataset should be considered as upper bound performances since we selected only neutral variants with allele frequencies higher than 0.01 for which their annotation is expected to be more accurate. To compare our methods against previously developed algorithms we use the Synthetic dataset for which neutral missense variants are generated by CHASM algorithm.
The SPF-Cancer predictor tested in cross-validation on CNO dataset, resulting in 93% overall accuracy and 0.86 correlation coefficient. With respect to the whole CNO dataset, our algorithm shows better performance in the detection of variants annotated as Malignant Melanoma and lower performances on Hematopoietic Neoplasm variants. When compared against CHASM on the Synthetic dataset, SPF-Cancer shows about 10% better accuracy and 0.2 better correlation coefficient. The development of cancer-specific predictor is justified by the improvement of 2% in overall accuracy and 0.06 in correlation coefficient resulting from the cancer-specific LGO scores. Differences between cancer-specific and generic predictors are higher when other disease-related variants are included in the dataset (data not shown). Although SPF-Cancer shows 3% lower accuracy on the Synthetic dataset with respect to CNO dataset, this difference can be due to the unknown annotation of the passenger variants generated by CHASM.
SPF-Cancer is also able to discriminate between cancer-causing variant and other disease-related mutations, while reaching 90% of accuracy on the CND dataset where 50% of the neutral polymorphisms are replaced with variants related to non “neoplasm” diseases. The improvement of the performances resulting from the combination of site-dependent sequence and profile features and functional information can be quantified in 1% higher accuracy and 0.02 higher correlation coefficient with respect to the GO score-based method. In addition, using two different methods it is possible to select a subset of highly accurate predictions. In 62% of the mutations where the sequence and profile-based (SeqProf) and GO score-based (F-Cancer) predictions agree, SPF-Cancer results in 3% better accuracy and 0.06 better correlation with respect to the performance on the whole CNO dataset. On the subset of variants where predictions are in disagreement (NotConsensus) the low performances are justified by the reverse trend in the distributions of the Conservation Index for cancer-causing and neutral variants (d = −0.22). Finally, the comparison between cancer-specific and generic LGO score values allows the estimation of the functional enrichment in cancer-related proteins. For example we observed enrichment of GO terms Growth and Kinase Activity in cancer-related proteins and Transporter Activity and Oxygen Binding in the whole set of disease-related proteins.
In conclusion, we present a new machine learning-based algorithm (SPF-Cancer) to predict cancer-causing variants. The SPF-Cancer method that has been extensively tested on a large set of variants is a valid alternative to previously developed algorithms. Considering that cancer is a complex disease that can involve multiple genes, SPF-Cancer reaches a good level of accuracy also when compared with previously developed algorithms such as SIFT and CHASM. The comparison between SPF-Cancer and SPF-All method indicates that cancer-specific LGO term score improves the prediction accuracy. The calculation of cancer-specific LGO values allows to rank with higher scores those proteins annotated with GO term functions involved in the development of cancer. This suggests new strategies for the development of the next generation of disease-specific algorithms able to discriminate between the genetic variants related to a specific disease and other class of pathologies. Finally, scoring the deleterious effect of missense variants using sequence profile-based and functional-based methods allows to select higher confident predictions where both methods predictions agree. For this subset of high quality predictions (62%), the SPF-Cancer method results in 96% overall accuracy and 0.92 correlation coefficient.
The authors thank anonymous reviewers for their suggestions that allowed to improve the quality of this work. EC acknowledges support from the Marie Curie International Outgoing Fellowship program (PIOF-GA-2009-237225). RBA would like to acknowledge the following funding sources: NIHLM05652 and GM61374.
Appendix A. Supplementary data Supplementary data to this article can be found online at doi:10.1016/j.ygeno.2011.06.010.