Peer ranking algorithm
The peer ranking algorithm used for monitoring simulated peers in this study was based on PageRank [10]. Its notation and formulation were specified and implemented as follows:
A data set of initial rankings for each peer, each of the form cr(P, T) = R
A data set of records of successful or failed interactions, each of the form im(T, I, Pi, CP)
P is the identifier of a peer
T is either "pos" (denoting the positive ranking) or "neg" (denoting the negative ranking)
R is the numerical magnitude of the (positive or negative) ranking; Rp denotes the positive ranking and Rn the negative ranking
I is the identifier of the interaction model
Pi is the initiating peer for an interaction
CP is the set of peers that subscribe to the interaction initiated by Pi
The algorithm for calculating the current rank of a peer was then as follows (where i is the empirically chosen number of iterations used to obtain stable ranking values):
For i iterations:
  For each peer, P:
    Calculate rank(P) = (Rp, Rn)
    Assign cr(P, pos) = Rp
    Assign cr(P, neg) = Rn
rank(P) calculates the current rank for peer P, where d is an empirically chosen damping value used to tune the ranking system (a frequently used value in page ranking is 0.85):
s(T, P) gives the list of peers supporting peer P. If T is "pos" then this is the set of positive support or if T is "neg" this is the set of negative support. Note that this is a list (which may contain duplicates) rather than a set because we are interested in the number of times each peer is supported.
a(P, T, Ps) is true when peer P is supported by peer Ps either positively (if T is "pos") or negatively (if T is "neg"). Note that this may (intentionally) generate the same instance of Ps more than once. This allows us to count the number of times the same peer is supported.
r(S, T) is the sum of the current ranks for all the peers in peer set, S, each peer's rank being divided by the number of peers it supports (to apportion the influence of the rank evenly across those supported peers). If T is "pos" then this is the sum of ranks from positive associations or if T is "neg" this is the sum of ranks from negative associations. In other words, r(S, T) is the sum for all P in peer set S of cr(P, T)/L, where L is the number of peers supported by P.
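The iteration and the definitions of s, a and r above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the study. Three points are assumed rather than taken from the text: the update takes the standard PageRank form (1 - d) + d * r(s(T, P), T); the initiating peer Pi of a record im(T, I, Pi, CP) is treated as supporting every peer in CP; and L is counted as the number of support instances of type T given by a peer. The function names and the example peers are hypothetical.

```python
# Illustrative sketch only. Assumptions (not stated in the text):
#  - update rule is the standard PageRank form (1 - d) + d * r(s(T, P), T)
#  - in a record (T, I, Pi, CP), the initiator Pi supports each peer in CP
#  - L counts the support instances of type T given by a peer

def supporters(interactions, T, P):
    """s(T, P): list (duplicates kept) of peers that support P with type T."""
    return [pi for (t, _i, pi, cp) in interactions
            if t == T
            for supported in cp if supported == P]

def support_count(interactions, T, Ps):
    """L: how many times peer Ps gives support of type T."""
    return sum(len(cp) for (t, _i, pi, cp) in interactions
               if t == T and pi == Ps)

def r(S, T, cr, interactions):
    """Sum over the peers in S of cr(P, T) / L."""
    return sum(cr[(ps, T)] / support_count(interactions, T, ps) for ps in S)

def rank(P, cr, interactions, d=0.85):
    """Return (Rp, Rn) for peer P under the assumed update rule."""
    rp = (1 - d) + d * r(supporters(interactions, "pos", P), "pos", cr, interactions)
    rn = (1 - d) + d * r(supporters(interactions, "neg", P), "neg", cr, interactions)
    return rp, rn

def run_ranking(peers, initial_cr, interactions, iterations=20, d=0.85):
    """Repeat the per-peer update, assigning cr(P, pos) and cr(P, neg) in place."""
    cr = dict(initial_cr)
    for _ in range(iterations):
        for P in peers:
            rp, rn = rank(P, cr, interactions, d)
            cr[(P, "pos")] = rp
            cr[(P, "neg")] = rn
    return cr

# Hypothetical data: seekers "a" and "b" interact with a data source "db".
peers = ["a", "b", "db"]
initial_cr = {(p, t): 1.0 for p in peers for t in ("pos", "neg")}
interactions = [
    ("pos", "im1", "a", ["db"]),
    ("pos", "im1", "b", ["db"]),
    ("neg", "im1", "a", ["db"]),
]
final_cr = run_ranking(peers, initial_cr, interactions)
```

In this toy data set the source "db" receives two positive supports and one negative support, so its positive rank ends up above its negative rank.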
Experiments on simulated peers
The experiments with simulated peers used a basic interaction model for data sharing, as specified in Figure . The interaction model described two roles: a data source, which offers a data set to a data seeker when it receives a request for data from that seeker, and a data seeker, which requests data for a query and caches the data received (also testing that the data set is of acceptable overall quality). In the interaction model as depicted in Figure , X is the identifier for the data source; Y is the identifier for the data seeker; have_data(Q, D) generates the best available set of data, D, known to X for query Q; need_data(Q) generates a data query, Q, for Y; cache_data(Q, D) merges the data set D with the data cached by Y for query Q; and acceptable(T, D) succeeds when the data set D has a mean quality level that exceeds the threshold T (note that T is set to a specific value).
An interaction model for competitive data sharing.
This interaction model had no representation of the provenance of the data but there was a point in the interaction, when the data seeker sent its request to a data source, when an appropriate data source needed to be chosen. At this point the data seeker could, if it so chose, use peer rank information to select the highest ranking peer. The rank of a data source peer in this interaction, in turn, depended on how frequently the data seekers with which it interacted found the quality of the data it supplied to be acceptable (otherwise the acceptable constraint in the data seeker role failed and consequently the interaction overall failed). This gave a feedback loop from supply to ranking via the peers that requested data.
To make our simulation as simple as possible we represented data quality as a number ranging from 0 (lowest quality) to 1 (top quality). Instead of storing actual data elements, the peers in our simulation stored these quality values (each representing the quality of a data item). In the OpenKnowledge framework any peer can coordinate interactions, so control over service orchestration can range from a highly centralised system to a pure peer-to-peer arrangement. In the simulation, however, a single peer represented the central database and (to keep it even simpler) it generated a set of 10 data elements with quality chosen randomly between 0 and 1, so the data elements that peers obtained from it tended to have a normal distribution with a mean of 0.5. We assumed that the quality acceptable to peers was higher than this mean and set it at 0.8 for all peers. Peers seeking data from the database therefore obtained a mixture of acceptable (mean quality greater than 0.8) and unacceptable results, so the database built up a negative ranking as well as a positive one.
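The quality representation and the acceptable(T, D) constraint described above can be illustrated with a small Python sketch. It is a simplified illustration, not the simulation code of the study: the function names mirror the constraints of the interaction model, while the uniform random draw, the dictionary-based cache, and the seed are assumptions made for the example.

```python
import random
from statistics import mean

def have_data(rng, n=10):
    """Database peer: a set of n quality values, each drawn from [0, 1).
    Assumes a uniform draw; the text only says 'chosen randomly'."""
    return [rng.random() for _ in range(n)]

def acceptable(threshold, data):
    """acceptable(T, D): succeeds when the mean quality of D exceeds T."""
    return mean(data) > threshold

def cache_data(cache, query, data):
    """cache_data(Q, D): merge D into the data cached for query Q.
    The cache is a hypothetical dict mapping a query to a list of values."""
    cache.setdefault(query, []).extend(data)
    return cache

rng = random.Random(0)          # fixed seed so the sketch is repeatable
database = have_data(rng)       # 10 quality values held by the database peer

# Two hand-made data sets show both outcomes of the 0.8 threshold.
good = [0.9, 0.85, 0.95]
poor = [0.5, 0.9]
cache = cache_data({}, "q1", good)
```

Here acceptable(0.8, good) succeeds and acceptable(0.8, poor) fails, which correspond to a positive and a negative record for the supplying peer.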
Experiments on MS/MS protein identification
Roles and LCC specifications
Using the OpenKnowledge peer-to-peer interaction infrastructure and LCC, we built an experimental environment to access and manipulate multiple web-enabled services or local programs implementing various MS/MS sequencing techniques. These services were configured in parallel pathways (Figure ) for alternative execution.
Peer ranking experiment on MS/MS protein identification.
As depicted in Figure , a peer spectra_input, the interaction initiator, uploaded the mass spectrum and randomly selected an interaction model to execute. If the "de novo + similarity search" interaction model was selected, the peers assigned to denovo_approach and similarity_search were invoked, and the analysis result was passed to the peer output_interface for evaluation. Similarly, if the PFF analysis interaction model was executed, the peers responsible for pff_approach were invoked, and the analysis result was likewise passed to output_interface for evaluation. Finally, the evaluation result (success or failure of the interaction model) was passed back to spectra_input.
According to the PFF interaction pathway as specified in Figure , peer SI uploaded a mass spectrum (Spec) from the data source, received the evaluation result (Val), and terminated the interaction. Peer PFF received the mass spectrum from SI, performed PFF analysis of the spectrum, and forwarded the analysis result Res to peer OI, which evaluated the result of the PFF analysis and then passed the evaluation result Val (0 or 1) to peer SI.
Interaction pathway for peptide fragment fingerprinting (PFF).
Similarly, in the de novo interaction pathway shown in Figure , peer SI uploaded a mass spectrum (Spec) from the data source, received the evaluation result (Val), and terminated the interaction. Peer NOVO received the mass spectrum from SI, performed de novo analysis of the spectrum, and forwarded the de novo analysis result Denovo to peer SS for similarity database searching. Peer OI received the similarity search result Res from SS, evaluated it, and then forwarded the evaluation result Val (0 or 1) to peer SI.
Interaction pathway for de novo sequencing and database searching.
Figure gave the LCC specification for the PFF interaction pathway and Figure listed the LCC code for the de novo sequencing + database searching pathway. Two roles, spectra_input and output_interface, were specified in both interaction pathways with the same arguments and interactions. Peers subscribed to the role spectra_input were responsible for uploading MS/MS spectra and selecting preferred routes for the spectrum interpretation. Peers subscribed to the role output_interface were in charge of reformatting, filtering and displaying the final result yielded by the executed interaction model. The other roles specified in the two LCC interaction models were pff_approach, denovo_approach, and similarity_search, which performed PFF, de novo sequencing, and database searching, respectively.
OpenKnowledge Components (OKCs) were developed to access and manipulate web servers and/or local programs for MS/MS identification: the algorithms OMSSA and MASCOT subscribed to the role pff_approach, PepNovo Win32 Executable [11] and Lutefisk XPv1.0 [12] performed the role denovo_approach, and MS-BLAST [13] subscribed to the role similarity_search. In this experiment, the parameters for MASCOT were set with database NCBI nr and enzyme trypsin, and all other parameters were left at their default settings. Similarly, OMSSA was run with database nr, enzyme trypsin, maximum missed cleavages set to "2", minimum charge to start using multiply charged products set to "2", all the optional species selected, and all other parameters at their defaults for ion trap spectrometers. Both Lutefisk and PepNovo were run with their default parameters for doubly charged tryptic peptides on ion trap MS. MS-BLAST was run with default parameters and database nrdb95.
OKCs for system roles spectra_input were implemented with GUIs for human users to upload the MS spectra and display MS/MS identification results. The filtering criteria for the results were taken from the literature [5] on different MS/MS identification algorithms, as listed in Table , to achieve a false discovery rate (FDR) of less than 0.1.
The scores and threshold values for MASCOT, OMSSA, and MS-BLAST.
The experiment was based on a benchmark dataset of doubly charged tryptic peptides obtained from low-energy ion trap LC/MS/MS runs [11]. Each round went through only one of the possible routes (Figure ), with one peer performing each of the roles specified in the associated LCC code (Figures and ).
A single round started with uploading the MS spectra data of peptides to the spectra_input peer. In addition to spectrum uploading, the OKC developed for this peer allowed the peer to select the analysis approach for MS/MS protein identification, which was either (1) the PFF approach or (2) the de novo sequencing and database similarity searching approach. When neither approach was preferred (as in the present study), the system randomly selected one. In each route, one of the peers subscribed to each required role (pff_approach, denovo_approach, or similarity_search) was randomly selected by the system. The result of protein identification was sent to the peer performing the role output_interface for reformatting and filtering according to the criteria shown in Table . The failure or success message of the execution was finally sent to the peer spectra_input at the end of the interaction model. The peer ranking for monitoring the performance of the peers was simply based on counting the numbers of failures and successes.
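The round structure described above (random route, random peer per role, success or failure counted per peer) can be sketched as follows. This is a minimal illustration, not the OpenKnowledge implementation: the route labels, the function names, and the evaluate callback that stands in for the output_interface filtering are hypothetical, and crediting every peer on the route with the outcome of the round is an assumption.

```python
import random
from collections import Counter

# Peers subscribed to each analysis role, as named in the text.
SUBSCRIBERS = {
    "pff_approach": ["OMSSA", "MASCOT"],
    "denovo_approach": ["PepNovo", "Lutefisk"],
    "similarity_search": ["MS-BLAST"],
}
# The two alternative routes; the route labels are hypothetical.
ROUTES = {
    "pff": ["pff_approach"],
    "denovo+similarity": ["denovo_approach", "similarity_search"],
}

def run_round(rng, evaluate):
    """Pick a route at random, pick one peer per role, and evaluate the result.
    evaluate(route, chosen) is a hypothetical stand-in for output_interface
    and must return 1 (success) or 0 (failure)."""
    route = rng.choice(sorted(ROUTES))
    chosen = [rng.choice(SUBSCRIBERS[role]) for role in ROUTES[route]]
    return route, chosen, evaluate(route, chosen)

def run_experiment(rounds, evaluate, seed=0):
    """Count successes and failures per peer over a number of rounds.
    Assumes every peer on the route shares the outcome of the round."""
    rng = random.Random(seed)
    successes, failures = Counter(), Counter()
    for _ in range(rounds):
        _route, chosen, val = run_round(rng, evaluate)
        tally = successes if val == 1 else failures
        for peer in chosen:
            tally[peer] += 1
    return successes, failures

# Usage with a dummy evaluator that always reports success.
successes, failures = run_experiment(50, lambda route, chosen: 1)
```

In a real run the evaluator would apply the score thresholds listed in the table; the dummy evaluator here only exercises the counting logic.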