The basic algorithms used a distance threshold of 0; here we tested five additional distance thresholds of 1, 2, 3, 4, and 5. Results from the evaluation of the cellular function filter are shown in . The sensitivity increased linearly with node distance. The DR for verified minimotifs was highest at a distance threshold of one, with a 4.6-fold preference for verified minimotifs, and still showed a 3.4-fold preference at a threshold of two nodes. Using a distance of one or two, rather than 0, significantly increased the sensitivity over the basic filter.
Evaluation of the cellular function filter algorithm.
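To make the role of the distance threshold concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes that node distance is the number of edges on the shortest path between two GO terms in an undirected view of the ontology graph, and that a source/target pair is retained when any pair of their annotated terms lies within the threshold. The graph `TOY_GRAPH`, its term names, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical toy ontology: a chain A - B - C - D, stored as undirected adjacency lists.
TOY_GRAPH = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

def node_distance(graph, start, goal):
    """Return the number of edges on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def passes_filter(graph, source_terms, target_terms, threshold):
    """ASSUMPTION: retain the pair if any two annotated terms are within `threshold` edges."""
    for s in source_terms:
        for t in target_terms:
            d = node_distance(graph, s, t)
            if d is not None and d <= threshold:
                return True
    return False
```

Under these assumptions, a threshold of 0 reproduces the basic filter (the proteins must share a term), and each increment admits pairs whose terms are one node further apart, which is why sensitivity rises as the threshold grows.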
To test the statistical significance of the filters we used ROC curves and p-values, employing the programs of the R project for this purpose. In the case of the cellular function filter, we used the distance as the underlying parameter for plotting the ROC curve (). The area under the ROC curve is 0.7 and the p-value is 0.12. Note that the p-value indicates the probability of obtaining the same sensitivity and selectivity results using a random predictor or filter.
ROC curves for minimotif filters.
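As an illustration of how a ROC curve can be built with distance as the underlying parameter, the sketch below sweeps a distance threshold over hypothetical positive and negative examples and integrates the resulting points with the trapezoid rule. It is written in Python rather than R, the input distances are invented for the example, and it does not reproduce the p-value calculation.

```python
def roc_points(pos_distances, neg_distances, thresholds):
    """One (false positive rate, true positive rate) point per distance threshold.

    An example is counted as retained when its distance is <= the threshold.
    The end points (0, 0) and (1, 1) are added so the curve spans the unit square.
    """
    points = [(0.0, 0.0)]
    for t in sorted(thresholds):
        tpr = sum(d <= t for d in pos_distances) / len(pos_distances)
        fpr = sum(d <= t for d in neg_distances) / len(neg_distances)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the sorted points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical distances for verified (positive) and random (negative) pairs.
example_pos = [0, 0, 1, 2]
example_neg = [1, 3, 4, 5]
example_auc = auc_trapezoid(roc_points(example_pos, example_neg, range(6)))
```

An area of 0.5 corresponds to a random filter, so values such as 0.7 or 0.8 indicate that verified pairs tend to be retained at smaller distances than negative pairs.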
Results from the evaluation of the molecular function filter are shown in . Again, sensitivity increased significantly with distances of one or two without a major compromise in the DR. The molecular function algorithm is more sensitive, but less selective, than the cellular function filter. For the molecular function filter we again plotted the ROC curve with distance as the underlying parameter. shows this ROC curve. The area under the ROC curve is 0.8 and the p-value is 0.03.
Evaluation of the molecular function filter algorithm.
Both filters have value in reducing false positives in the test datasets, and the stringency of predictions can be controlled by selecting distances between 0 and 3; the performance of the algorithms degrades at distance values above 3. These results indicate that the filters differentiate verified data from negative data with good confidence, and strongly suggest that, when predicting novel minimotifs, these filters would help to decrease the number of false-positive predictions.
A comparison with the frequency score filter
We wanted to compare the performance of the new filters with one of the existing MnM filters, namely the frequency score filter. To begin with, we plotted the ROC curve for the frequency score filter. This ROC curve is shown in . The area under this curve is 0.7 and the p-value is 0.08, values similar to those of the molecular and cellular function filters.
shows a comparison of the new filters with the frequency score filter on various aspects. Consistent with the ROC curves this table shows that the molecular function filter is somewhat stronger than MnM Frequency score filter in discriminating true positives from false positives. The cellular function filter is similar to the MnM frequency score filter in performance.
Statistics for comparison of functional filters to the Frequency Score filter.
Note: The above results indicate that the cellular function filter has a poorer p-value than the frequency score and molecular function filters. As a result, one has to exercise caution when employing the cellular function filter. Both filters could be of value in clustering the motifs predicted by MnM.
A combination of molecular function and frequency score filters
A novel contribution of this paper is the conclusion that a combination of several filters can yield a better predictability than the individual filters. In particular, we have devised two combination filters. The first combination filter employs the molecular function and the frequency score filters. Note that these two filters are based on two different principles. The frequency score is based on the number of occurrences of the predicted motif whereas the molecular function filter is based on whether the source and target proteins share a common molecular function. Our tests of the combined filter indicate that the combined filter has a better p-value than the two individual filters.
We employed an either-or combination of the molecular function filter and the frequency score filter. Because the two filters focus on different aspects, we expected them to complement each other, so that the combined filter may outperform either one alone. Given a motif of some source protein, together with its associated target protein, the combined filter examines whether the source and target proteins are retained by the molecular function filter, and whether the motif and source are retained by the frequency score filter. If either filter retains them, the combined filter retains them.
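The either-or rule can be sketched in a few lines of Python. This is an illustrative sketch only: the record layout (`mf_distance`, `freq_score`), the helper names, and in particular the direction of the frequency score comparison (retain when the score is at or below the threshold) are assumptions made for the example and are not stated in this section.

```python
def make_mf_filter(max_distance):
    """Molecular function predicate: retain when the GO distance is within the threshold."""
    def retains(triple):
        d = triple["mf_distance"]  # hypothetical field; None means no distance available
        return d is not None and d <= max_distance
    return retains

def make_fs_filter(threshold):
    """Frequency score predicate.

    ASSUMPTION: a triple is retained when its score is <= threshold.
    The source text does not state the direction of this comparison.
    """
    def retains(triple):
        return triple["freq_score"] <= threshold  # hypothetical field
    return retains

def either_or(*filters):
    """Combined filter: retain a triple if at least one component filter retains it."""
    def combined(triple):
        return any(f(triple) for f in filters)
    return combined

# Example: distance threshold 1 combined with frequency score threshold 0.03.
combined_filter = either_or(make_mf_filter(1), make_fs_filter(0.03))
```

Because the combination is a logical OR, it can only retain more triples than either component at the same settings; any gain in discrimination therefore has to come from the two filters retaining different true positives.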
This idea was tested on the same positive and negative datasets. The positive data already consist of experimentally verified entries, each comprising a motif, its source protein and the associated target protein. For the negative datasets, which are 20,000 random protein pairs, we submitted one protein of each pair to Minimotif Miner (MnM) as the source query protein and used its predicted motifs to form triples of motif, source protein and target protein. In total there are 463,062 such triples, of which an unknown molecular function can be found for both the source and target in the GO dataset. Three thresholds (0.02, 0.03, 0.04) were then chosen for the frequency score filter, together with three distances (0, 1, 2) for the molecular function filter, and the nine combinations of these thresholds and distances were used as the threshold parameters of the combined filter. The prediction of the combined filter is shown in . To form a smooth curve, very small noise, no more than 1.463283e−10, was added to the sensitivity and selectivity values. The ROC curve is shown in ; its area under the curve (AUC) is 0.89 and its p-value is 0.002, as shown in .
Evaluation of the molecular function – frequency score combined filter.
ROC curve for the combined filters.
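The parameter grid for a combined filter, and the small noise added before plotting, can be illustrated as follows. This is a hedged Python sketch under stated assumptions: triples are represented as hypothetical `(distance, frequency_score)` tuples, a triple is assumed to be retained when its distance is within the distance threshold or its score is at or below the frequency threshold, and the noise is drawn uniformly within a bound of 1e-10 chosen only for illustration.

```python
import itertools
import random

FREQ_THRESHOLDS = (0.02, 0.03, 0.04)
MF_DISTANCES = (0, 1, 2)

def parameter_grid(thresholds=FREQ_THRESHOLDS, distances=MF_DISTANCES):
    """All (frequency threshold, distance threshold) pairs used as operating points."""
    return list(itertools.product(thresholds, distances))

def retained_fraction(triples, freq_threshold, max_distance):
    """Fraction of (distance, score) triples kept by the either-or rule.

    ASSUMPTION: keep when distance <= max_distance OR score <= freq_threshold.
    Applied to positives this gives sensitivity; applied to negatives, the
    false positive rate.
    """
    kept = sum(1 for dist, score in triples
               if dist <= max_distance or score <= freq_threshold)
    return kept / len(triples)

def jitter(value, max_noise=1e-10, rng=random):
    """Add uniform noise bounded by max_noise, e.g. to break ties between points."""
    return value + rng.uniform(-max_noise, max_noise)
```

With three thresholds and three distances the grid has nine operating points; passing five distances, for example `parameter_grid(distances=(0, 1, 2, 3, 4))`, yields fifteen.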
A combination of cellular function and frequency score filters
The second combination filter employs the cellular function filter and the frequency score filter in the same way. Because the cellular function filter is more stringent, five distances (0, 1, 2, 3, 4) were used, together with the same three thresholds (0.02, 0.03, 0.04) for the frequency score filter. As a result, fifteen threshold parameters were formed for this combination of cellular function and frequency score filters. To smooth the ROC curve, very small noise, no more than 6.743894e−11, was again added. The prediction of this combination is shown in and the ROC curve is shown in , for which the AUC is 0.87 and the p-value is 0.0002, as shown in . Note that even though the frequency score filter and the cellular function filter are not highly predictive on their own, their combination performs markedly better.
Evaluation of the cellular function – frequency score combined filter.
Implementation of cellular and molecular function filters
We have implemented these new filters alongside the other filters on the MnM 2 website (). Users can vary the stringency by choosing different thresholds. We have added the results of this analysis and a description to help users interpret the results they should expect for different distance thresholds. We have also designed the implementation so that this filter can be used in combination with other MnM filters. We expect that using it in combination with other MnM filters will increase the specificity, but reduce the sensitivity, of identifying true minimotifs. We anticipate that some users will want to look for new functions of proteins and exclude minimotif predictions that are related to the known functions. Therefore, we have added a GUI checkbox that allows users to see only the minimotifs that were excluded by the filter.
Image of the filter selector on the MnM website.
We wanted to examine how many predicted minimotifs were filtered by the algorithms. We ran the filter on P53, Cyclin A, and MSH2, each of which has different molecular and cellular functions (22 more proteins were tested and are shown in Supporting Information S1). Statistics for predictions from this analysis are shown in . The basic cellular function filter eliminated 90–95 percent of the target predictions, retaining only those with similar cellular functions, as expected. The molecular function filter was less stringent, eliminating 27–48 percent of the minimotif predictions. Altering the GO term distance threshold also had the anticipated effect of titrating the stringency of predictions.
Analysis of novel queries with the cellular and molecular function filters.