Once the random databases were constructed, each search engine analyzed the ten thousand spectra, searching for peptide hits in the random protein databases. Since the random databases contain no true positives, every hit found is a false positive. Because each search engine reports either a quality score or an E-value for each hit, we then obtained the average number of false positives with E-value smaller than (or quality score greater than) a given cutoff.
For search methods reporting E-values, it is well known that the smaller the E-value, the better the identification confidence. For quality scores the trend is the opposite: the higher the quality score, the better the identification confidence. For the sake of uniformity, we introduce a database-size-dependent effective variable x_db, which is simply the E-value for search methods reporting E-values, and a monotonically decreasing function of the quality score, e^(-quality score), for search methods that do not return E-values; see Table .
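As a small illustration of this bookkeeping, the Python sketch below maps either kind of report onto the effective variable; the function name and the example inputs are our own and not part of any search engine's interface.

```python
import math

def effective_variable(value, reports_evalue):
    """Map a search engine's report onto the effective variable x_db:
    the E-value itself for E-value-reporting methods, otherwise the
    monotonically decreasing transform e^(-quality score)."""
    return value if reports_evalue else math.exp(-value)

# Hypothetical inputs: an E-value of 0.05 and a quality score of 3.5
x_from_evalue = effective_variable(0.05, reports_evalue=True)   # 0.05
x_from_score = effective_variable(3.5, reports_evalue=False)    # e^-3.5 ≈ 0.0302
```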
Scaling of the database size and functional transformation to yield calibrated E-values.
For all search methods tested except Mascot, we ran each spectrum against all eight random databases of three different sizes. In general, there was little difference in the number of false positives found with a given cutoff when using random databases of the same size. We also observed an interesting trend: the variation within a given size decreases as the database size increases. Owing to limitations in local availability and speed, it was not feasible to run Mascot on all eight random databases. For Mascot, we therefore ran the 10,000 spectra against a single 1 giga-residue random database instead of all three of them; the smaller databases were all used. Since the set-to-set variation decreases with random database size, we believe that having only one run with a 1 giga-residue random database does not greatly increase the uncertainty in the statistical accuracy of Mascot after calibration.
A larger difference may appear, however, among results obtained from searching random databases of different sizes. Except for X!Tandem and RAId_DbS, all search methods require a size normalization of their effective variable x_db in order to collapse the curves in the FP(x_db) plot, where FP(x_db) represents the average number of false positives with effective variable less than or equal to x_db. Consequently, the first step we took was to normalize the effective variable to a fixed size of 1 giga residues. Panel (a) of Fig. displays the FP(x_db) curves for search results using random databases of different sizes. After a size normalization to 1 giga residues, x_1gr = x_db × (1gr/db_size)^α, the data obtained from random databases of various sizes collapse well (panel (b)). Furthermore, the aggregated curves form a bundle whose slope is quite parallel to the theoretical curve on the log-log plot, indicating that only a simple rescaling is needed to bring the aggregate onto the theoretical curve. It may not always be this easy for every search engine. The basic idea of the calibration, however, is straightforward: (1) the scale needed to transform the effective variable from a given database size to the 1 giga-residue size is assumed to be of the form (1gr/db_size)^α, i.e., x_1gr = x_db × (1gr/db_size)^α with α determined by best fitting; (2) the relationship between the logarithm of the scaled variable x_1gr and the average number of false positives is approximated by linear segments.
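A minimal sketch of this two-step recipe, assuming one has arrays of effective-variable values and the corresponding average false-positive counts from the random-database searches, might look as follows; for brevity a single linear segment is fitted, and the function names are our own.

```python
import numpy as np

def scale_to_1gr(x_db, db_size, alpha):
    """Step (1): rescale the effective variable to its 1 giga-residue
    value, x_1gr = x_db * (1e9 / db_size)**alpha, with alpha obtained
    by best fitting the collapse of the curves."""
    return x_db * (1e9 / db_size) ** alpha

def fit_calibration(x_1gr, avg_fp):
    """Step (2): approximate log(FP) versus log(x_1gr) by a straight
    line (piecewise-linear fits generalize this) and return the
    transform from x_1gr to a calibrated E-value."""
    slope, intercept = np.polyfit(np.log(x_1gr), np.log(avg_fp), deg=1)
    return lambda x: np.exp(intercept) * x ** slope
```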
Figure 1. An example of E-value calibration with database size dependence: average cumulative number of false positives versus E-values for Mascot.
The size normalizations and the transformation functions that yield the correct E-value for the effective variables of the various methods are documented in Table . The scaling assumption made for the effective variable is heuristic and is not guaranteed to hold for all search methods yet to come. When this assumption breaks down, however, all one needs is to consider a more general function g(x_db, 1gr/db_size) relating x_1gr to x_db; of course, the condition g(x, 1gr/db_size = 1) = x must then be imposed. As for the linear-segment approximation between the logarithm of the scaled variable x_1gr and the average number of false positives, it can be replaced by a tabulation approach. That is, based on the data obtained from searching the random databases, one may construct a table converting the variable x_1gr to the standardized E-value, i.e., FP(x_1gr).
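The tabulation alternative amounts to a lookup with interpolation. The sketch below, our own construction, interpolates such a table in log-log space; it assumes the tabulated x_1gr values are sorted in increasing order.

```python
import numpy as np

def tabulated_calibration(x_table, fp_table):
    """Return a function converting x_1gr to the standardized E-value
    FP(x_1gr) by log-log interpolation of a table built from the
    random-database searches (x_table must be increasing)."""
    log_x, log_fp = np.log(x_table), np.log(fp_table)
    def calibrated_evalue(x_1gr):
        return np.exp(np.interp(np.log(x_1gr), log_x, log_fp))
    return calibrated_evalue
```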
In the first seven panels of Fig. , we show the aggregate of results obtained from up to eight random databases of three different sizes. Note that, except for X!Tandem and RAId_DbS, all figures are obtained after rescaling the effective variable to its 1 giga-residue value. As shown in this figure, the aggregates for Mascot, X!Tandem, and RAId_DbS are reasonably parallel to the theoretical curve. In fact, the RAId_DbS aggregate is not only quite parallel to the theoretical curve but also very close to it. As a consequence, RAId_DbS needs no additional transformation to retrieve correct E-values for the peptide hits found. For SEQUEST, ProbID, and OMSSA, the aggregate is fitted by a straight line on the log-log plot. For InsPecT, we plot along the ordinate FP^2 instead of FP, and we approximate the aggregate by two straight lines depending on whether the effective variable is larger or smaller than a threshold of 0.148. The [..]^(1/2) expression in the last column of Table reflects the fact that we need to take a square root before returning to FP.
The last panel of Fig. deserves further elaboration. In this panel, each method is represented by a single curve, obtained in two steps. After database size normalization, the curves of average normalized false positives versus effective variable for a given method become close to one another; one first averages over those curves to obtain a single curve, from which an approximate functional transformation from the effective variable to the calibrated E-value is constructed (see Table ). One then goes through the results from the ten thousand spectra and transforms their quality scores or original E-values into calibrated E-values. Plotting the calibrated E-value along the abscissa and the average number of false positives along the ordinate yields a single curve per method in this panel. After our calibration, statistical significance is thus unified under a single universal standard.
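Schematically, and using the fact that every hit from a random-database search is a false positive, the per-method curve of this panel could be assembled as in the sketch below; `to_x_db` and `calibrate` stand for the method-specific score transform and the fitted calibration function, and all names are our own.

```python
import numpy as np

def unified_curve(raw_reports, to_x_db, calibrate, n_spectra):
    """Convert each hit's raw report (score or E-value) into a
    calibrated E-value and accumulate the average number of false
    positives with calibrated E-value below each observed cutoff."""
    e_cal = np.sort([calibrate(to_x_db(r)) for r in raw_reports])
    avg_fp = np.arange(1, e_cal.size + 1) / n_spectra
    return e_cal, avg_fp  # abscissa and ordinate of the panel's curve
```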
Average cumulative number of false positives versus the effective variable normalized to 1 giga residues.
As a specific example of how one may transform a method-specific quality score into a calibrated E-value, we use SEQUEST. For a SEQUEST hit with X-correlation value 3.5 obtained by searching a database of 100 mega residues (i.e., db_size = 10^8 amino acids), the procedure is easily carried out using Table . First, from the third column we find x_db = e^(-3.5) ≈ 0.03. Then, from the fourth column, x_1gr = x_db × (10^9/10^8)^(-0.176) ≈ 0.03 × (2/3) ≈ 0.02. Finally, from the last column,

E = f(x_1gr) = e^(10.59) × (x_1gr)^(4.11) ≈ 0.00413.

That is, for a hit with X-correlation value 3.5 found in a database of 100 mega residues, we end up with a calibrated E-value of approximately 0.00413.
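The arithmetic is easy to replay in a few lines; the constants below come from the worked example, while the rounding note is ours.

```python
import math

# SEQUEST worked example: X-correlation 3.5, db_size = 1e8 residues
x_db = math.exp(-3.5)                    # ≈ 0.0302
x_1gr = x_db * (1e9 / 1e8) ** -0.176     # ≈ 0.0302 * 0.667 ≈ 0.0201
E = math.exp(10.59) * x_1gr ** 4.11      # ≈ 0.0043; the text's ≈ 0.00413
                                         # follows from the rounded x_1gr ≈ 0.02
print(f"calibrated E-value ≈ {E:.5f}")
```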
Finally, we also tested the accuracy of the calibrated E-values in a different context, namely from the perspective of true positives. Basically, when E is small, the E-value of a significant peptide hit may be interpreted as the probability of that peptide being a false positive. Using a more careful argument (see Appendix for details), one finds that

TP(E ≤ E_c) / [TP(E ≤ E_c) + FP(E ≤ E_c)] ≈ 1 / (1 + E_c),     (1)
where TP(E ≤ E_c) represents the number of true positives with E-value less than E_c and FP(E ≤ E_c) represents the number of false positives with E-value less than E_c. For each search engine, each of the ten thousand spectra was used to search for candidate peptides in NCBI's nr database. Each candidate peptide was then classified as a true positive if it is a partial segment of one of the seven standard proteins used, and as a false positive otherwise. We then binned the true positives and false positives according to the logarithm of their (calibrated) E-values. The ratio of the cumulative number of true positives to the sum of true and false positives is expected to follow the theoretical line (1) in Fig. .
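Given per-hit calibrated E-values and true/false labels from the seven-protein criterion, the cumulative ratio can be computed as sketched below; the comparison curve 1/(1 + E_c) follows our reading of Eq. (1) above, and the function name is our own.

```python
import numpy as np

def tp_fraction(e_values, is_tp, cutoffs):
    """Cumulative TP / (TP + FP) at each E-value cutoff E_c, to be
    compared against the theoretical curve 1 / (1 + E_c)."""
    e = np.asarray(e_values)
    tp = np.asarray(is_tp, dtype=bool)
    ratios = []
    for ec in cutoffs:
        below = e <= ec
        n = below.sum()
        ratios.append(tp[below].sum() / n if n else np.nan)
    return np.array(ratios)
```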
Average ratio of true positives to the sum of true and false positives versus E-values.
In the left panel of Fig. , the results for the four search engines that report E-values are displayed; the abscissa is the original E-value reported by each search engine prior to calibration. The curves from different search engines do not agree with one another and, most importantly, many of them are not in agreement with the theoretical curve. In the right panel of Fig. , the abscissa is the calibrated E-value. Except for ProbID, the curves from different search engines aggregate well and are much closer to the theoretical line. This indicates that it is indeed possible to calibrate the E-values of various search engines, and that the calibration, once done, approximates well, through (1), the probability that a candidate peptide is a true positive.
A way to make a real protein database act like a random database is to perform a cluster-removal procedure [23]. Basically, one uses the target proteins as queries and removes from the database any protein whose alignment with a target protein has a low E-value; the target proteins themselves, however, are excluded from removal. This procedure is designed to eliminate multiple counting of hits, which arises because real proteins come in families of various sizes. In Fig. we display again the results of TP/(TP + FP) with cluster removal invoked. Compared with the right panel of Fig. , the agreement between the theory and the experimental data has improved; in particular, ProbID is now much closer to the rest of the search methods.
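A sketch of the keep/remove logic is given below; `alignment_evalue` is a hypothetical wrapper around a sequence aligner, and the cutoff value is an assumed placeholder, since the cutoff used in the text is not recoverable here.

```python
def remove_clusters(db_proteins, target_ids, alignment_evalue, cutoff=1e-3):
    """Keep a database protein unless it aligns to some target protein
    with an E-value below `cutoff`; the targets themselves are always
    kept. Both `alignment_evalue` and `cutoff` are illustrative."""
    kept = []
    for p in db_proteins:
        if p.id in target_ids:
            kept.append(p)  # target proteins are excluded from removal
        elif all(alignment_evalue(p, t) > cutoff for t in target_ids):
            kept.append(p)  # no strong similarity to any target
    return kept
```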
Average ratio of true positives to the sum of true and false positives versus E-values with cluster removal employed.