Although several features for MS/MS quality assessment have been previously published, their importance and synergism are still poorly understood. Furthermore, strongly correlating variables may bias the performance of the prediction method. In order to identify a set of features to use for classification, we first analyzed all 17 features for their correlations and distributions. The correlations of the features are shown in . As expected, several of the features were highly correlated, such as the number of peaks (N) with the number of low peaks (n) and the maximum intensity (Imax) with the mean intensity (Iavg). The Ibal feature also has a strong negative correlation with 10 features, because the intensity balance decreases as the overall intensity of the spectrum increases.
The distributions of the features were also examined: 15 of the 17 features do not follow a normal distribution, the exceptions being NnoID and Ibal (data not shown). Thus, the Fisher criterion score that has been used to identify the most important features [32] is not valid here, as it strongly depends on the normality assumption.
To characterize the impact of the spectral features on classification comprehensively while keeping the computation tractable, we retained only one feature from each feature pair with >95% correlation. This resulted in 13 features for further analysis. All correlation values are provided in Supplementary Material (Table S1).
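The pruning step described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes the Pearson coefficient, applies the 95% threshold to the absolute value of the correlation, and keeps the first feature of each correlated pair in input order. The text does not specify any of these choices, and the toy values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def prune_correlated(features, threshold=0.95):
    """Greedily keep a feature only if |r| <= threshold against every kept feature.

    `features` maps a feature name to its list of values over all spectra.
    Which member of a correlated pair survives depends on input order
    (an assumption of this sketch).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy values for illustration only (not data from the study).
toy = {
    "N": [1, 2, 3, 4, 5],
    "n": [2, 4, 6, 8, 10.1],  # almost perfectly correlated with "N"
    "S": [5, 1, 4, 2, 3],
}
print(prune_correlated(toy))
```

With these toy values, "n" is dropped because it is nearly collinear with "N", while "S" is retained.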
To find the best classification algorithm, we tested five different classifiers with all 13 features. Performance of the classification algorithms was first measured with cross-validation (see Supplementary Material for details), followed by an analysis of the independent validation dataset of phosphoserine and -threonine data. The algorithms were compared using the receiver operating characteristic (ROC). In an ROC plot, sensitivity is plotted as a function of (1 - specificity), which corresponds to the fraction of true positives vs. the fraction of false positives [33]. In our case study, the area under the ROC curve (AUC) was used as the primary metric to quantify the quality of each classifier.
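As an illustration of the ROC and AUC definitions above, the following sketch computes ROC points and the AUC from classifier scores. It is not the tooling used in the study. The function names are hypothetical, labels are assumed to be encoded as 1 ("correct") and 0 ("incorrect"), and the AUC is computed through its rank interpretation: the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per distinct score threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # An instance is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores for illustration only (not data from the study).
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking
print(roc_auc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0]))  # ranking no better than chance
```

The pairwise form is quadratic in the number of spectra, which is acceptable for a sketch; a sort-based rank computation would be preferable for large datasets.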
The positive predictive value (PPV) was used as the secondary metric to quantify the quality of each classifier. PPV is defined as the fraction of true positive assignments (“correct” instances that are classified as “correct”) over all positive assignments (all “correct” classifications), i.e., true positives/(true positives + false positives). The PPV states how well a classifier is able to minimize the number of false positives while maximizing the number of true positives. This matters in automated validation, where all instances classified as “correct” should be accepted automatically with as few false positives as possible.
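The PPV definition above translates directly into code. The sketch below assumes that predictions and reference assignments are given as parallel lists of string labels; the function name and the "incorrect" label are illustrative and do not come from the study.

```python
def positive_predictive_value(predicted, actual, positive="correct"):
    """PPV = true positives / (true positives + false positives)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    if tp + fp == 0:
        raise ValueError("no instance was classified as positive")
    return tp / (tp + fp)


# Toy labels for illustration only (not data from the study).
predicted = ["correct", "correct", "correct", "incorrect"]
actual = ["correct", "correct", "incorrect", "correct"]
print(positive_predictive_value(predicted, actual))
```

In this toy case three instances are accepted as "correct" and two of them truly are, so the PPV is 2/3. The rejected true positive in the last position does not affect the PPV, which is why sensitivity is reported separately through the ROC analysis.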
Table 1 contains the AUC and PPV values for each algorithm for cross-validation and the independent validation set, a large dataset with phosphoserine and -threonine proteins. The ROC curves for cross-validation can be seen in . Four of the five algorithms show good classification accuracies. For reference, the ROC curve for the Mascot identifications is also plotted. Notably, all classifiers except decision tree outperform the Mascot identification alone. The high PPV and AUC values in the independent validation show that the features tested are relevant for all kinds of phosphorylation types (phosphotyrosine, phosphoserine and phosphothreonine).
| Table 1. The comparison of descriptive statistics for classifiers using all thirteen of the features. |
The random forest algorithm was used to calculate the importance of each feature as described in Section 2.3.3. This analysis determined the Mascot score as the most important feature, followed by the percentage of unidentified peaks, as shown in Table 2. The random forest variable importance calculation is inherently univariate, i.e., the decrease of classification accuracy is calculated using one quality feature at a time. This may produce spurious results if subsets of the features are strongly correlated, as they are here ().
| Table 2. The features used in the study in order of their significance in classifying phosphorylated MS/MS data according to random forest classifier variable importance test. |
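The one-feature-at-a-time character of this importance measure can be illustrated with a generic permutation-style sketch. It is not the random forest procedure of Section 2.3.3, whose details are not reproduced here: the classifier is a hypothetical threshold rule, the data are invented, and importance is taken as the mean drop in accuracy after shuffling a single feature column.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the classifier output matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop per feature when only that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with the shuffled value in position j only.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data for illustration only: feature 0 determines the label, feature 1 is ignored.
rows = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
labels = [1 if r[0] >= 0.5 else 0 for r in rows]


def toy_classifier(row):
    """Hypothetical classifier that looks at feature 0 only."""
    return 1 if row[0] >= 0.5 else 0


print(permutation_importance(toy_classifier, rows, labels))
```

Because each feature is perturbed in isolation, two strongly correlated features can mask each other: a classifier that can fall back on the unshuffled partner loses little accuracy, and both features may appear less important than they jointly are.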
To overcome this univariate approach and identify the optimal set of features that gives the highest AUC and PPV values in the training and validation datasets, we trained each of the five classifiers with all 8191 ($2^{13}-1$) feature combinations. The feature combinations that resulted in the highest AUC values are listed in Table 3 along with the best AUC and PPV values. Searching through all feature combinations improved the AUC for all algorithms, with the largest improvement seen for the decision tree classifier. All optimal quality feature sets include the Mascot score (S) and the percentage of unassigned peaks (NnoID). Four of the five optimal sets also contain the maximum observed m/z value (mzmax).
| Table 3. The descriptive statistics for best feature set for each classifier using cross-validation (CV) and independent validation set (IV) and features used in best classifiers. |
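The exhaustive search over feature combinations can be sketched as below. The enumeration reproduces the count of 8191 non-empty subsets of 13 features. The scoring function is a stand-in: in the study each subset would be scored by training a classifier and measuring its AUC, which is not reproduced here, and the placeholder names `f0` to `f12` as well as `toy_score` are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(features):
    """Yield every non-empty combination of the given features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def best_subset(features, score):
    """Return the subset with the highest score (the first one found on ties)."""
    return max(nonempty_subsets(features), key=score)


# 13 placeholder feature names give 2**13 - 1 non-empty subsets.
placeholder = [f"f{i}" for i in range(13)]
print(sum(1 for _ in nonempty_subsets(placeholder)))  # 8191


def toy_score(subset):
    """Stand-in for a cross-validated AUC: rewards S and NnoID, penalizes set size."""
    return len({"S", "NnoID"} & set(subset)) - 0.01 * len(subset)


print(best_subset(["S", "NnoID", "mzmax"], toy_score))
```

The search cost grows as 2^n in the number of features, so it remains practical here only because the correlation filter had already reduced the set to 13 features.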
The results in Table 3 demonstrate that the ANN, logistic regression, random forest and naïve Bayes classifiers performed remarkably well. In the cross-validation analysis, the random forest classifier achieved the best overall performance, with an AUC of 97.8% and a PPV of 96.5%, though the results with logistic regression, ANN and naïve Bayes are practically equally good. The validation with an independent pS/pT dataset shows that naïve Bayes achieved the best result in terms of AUC (92.8%) and PPV (91.3%), followed by ANN, logistic regression and random forest. These results demonstrate that the machine learning approach is able to validate both pY and pS/pT data.
In the cross-validation, the naïve Bayes classifier reduced the number of false positives compared to the original Mascot identifications by 76% (from 271 to 57) while retaining 97% of the true positives (1968 out of 2038). Similar results were obtained with the independent validation set: a 51% reduction of false positives (from 63 to 26) and a 94% retention of true positives (273 out of 290). The differences between the prediction methods are not significant, which indicates that the features used are informative for classifying the MS/MS spectra.
The classifiers are able to process multiply phosphorylated spectra. In the independent validation dataset, there were 52 multiply phosphorylated spectra, of which Mascot assigned 21 incorrectly and 31 correctly (40% false positives). PhoMSVal with the naïve Bayes method reduced the number of false positives to 4 (18% false positives) while retaining 18 (58%) of the true positives.
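The percentages quoted for the multiply phosphorylated spectra can be checked with a few lines of arithmetic. The counts are taken from the paragraph above. Reading "false positives" as the share of accepted assignments that are incorrect is our interpretation, and it is consistent with the reported values.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


mascot_fp, mascot_tp = 21, 31  # Mascot: incorrect and correct assignments (52 spectra)
nb_fp, nb_tp = 4, 18           # after naive Bayes filtering

print(percent(mascot_fp, mascot_fp + mascot_tp))  # 40: false positives among Mascot assignments
print(percent(nb_fp, nb_fp + nb_tp))              # 18: false positives among accepted assignments
print(percent(nb_tp, mascot_tp))                  # 58: true positives retained
```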
The majority of the currently available peptide identification methods, such as Ascore, PhosphoScore and DeBunker, are designed for SEQUEST data and cannot be directly used with Mascot data. However, to mimic DeBunker we used the SVM classifier in Weka with eight features and parameters delineated in [23]. The classification accuracy for the cross-validation was 86.1% (AUC 69.4% and PPV 92.9%) and for the independent dataset 79.6% (AUC 58.4% and PPV 84.9%).