Proteomics studies the quantitative changes occurring in a proteome, with applications in disease diagnostics, therapy, and drug development. It examines proteins at different levels, including their sequences, structures and functions, and it is considered the next step, after genomics, in the study of biological systems. It is much more complicated than genomics, mostly because, while the genome of an organism is rather constant, the proteome differs from cell to cell and changes constantly through its biochemical interactions with the genome and the environment.
Proteins are large linear chains of amino acids (residues). The sequence of amino acids in a protein is directly translated from the information encoded in the genome. However, a proteome is more complex than a genome. One organism shows radically different protein expression in different parts of its body, at different stages of its life cycle, and under different environmental conditions (e.g., in humans there are about 20,500 identified genes but an estimated more than 500,000 proteins derived from these genes [1]). This is mainly caused by alternative splicing of mRNA and by residues in a protein being chemically altered through post-translational modification (PTM), either as part of the maturation processes a protein undergoes before taking part in the cell's functions, or as part of control mechanisms. This discrepancy implies that protein diversity cannot be fully characterized by gene expression analysis. Thus, proteomics is necessary for a better characterization of cells and tissues, and for manufacturing improved drugs and medicines.
Protein Identification in Proteomics
One important and challenging task in proteomics is the identification of proteins, that is, the recognition of a sequenced protein if it is already known, or its discovery if it is not. For this, protein sequences are stored in public databases (such as nrNCBI, UniProt, or GenPept). However, these sequences are mostly produced by the direct translation of gene sequences. This means that neither proteins carrying post-translational modifications (PTMs) nor proteins from organisms whose genomes have not been sequenced will find exact matches in such databases.
A key experimental technique for the identification of proteins is mass spectrometry (MS). Mass spectra provide very detailed fingerprints of the proteins contained in a given sample. In the so-called shotgun approach, MS is often combined with cutting-edge separation technologies to allow large-scale analysis of proteomes. For this, proteins are extracted from cells and tissues, enzymatically digested, and the resulting peptides (shorter amino acid chains) are separated by multidimensional liquid chromatography. As the peptides are separated, they are injected online into the mass spectrometer, where they are ionized and fragmented, and the masses of the fragments are monitored to produce a specific sequence fingerprint.
Identification of the huge amount of spectra produced by current state-of-the-art high-throughput analysis is one of the major tasks for proteomics laboratories. Two popular bioinformatics techniques are mainly involved in this effort. The first takes advantage of public genome-translated databases (GTDBs), which can be accessed through data-mining software (search engines) that directly relates mass spectra to database sequences. Most of these search engines (Mascot, X! Tandem, SEQUEST, OMSSA) are available both as stand-alone programs that consult a local copy of a GTDB and as web services connected to online GTDBs. Their limitations, once again, lie in their capability to identify missing PTMs or unsequenced genomes. The latter case is addressed by applying de novo interpretation algorithms, which yield a sequence for a given mass spectrum and thus avoid any database search. These algorithms, however, are not a complete solution to the problem because of intrinsic technical limitations. Once a protein has been sequenced de novo, one can look for similar proteins in a GTDB using a matching algorithm such as BLAST [2] or FASTA [3]; alternatively, one can use an algorithm such as OMSSA [4] to match spectra directly to sequences in a GTDB.
Mass spectra identification is usually carried out by mixing and combining these two techniques. However, among other factors, the following issues complicate this task: the number of possible PTMs can multiply the amount of results to be analysed; poor quality and noise in mass spectra increase the uncertainty of interpretation; and database errors in sequence annotations can lead to misidentifications. Consequently, we get a huge amount of apparently useless data (for instance, non-matching mass spectra or low-scoring de novo interpreted sequences), which most of the time is simply discarded. As a result, this data is seldom accessible to other groups involved in the identification of the same or homologous proteins. Our conviction is that we can benefit from this kind of data by making it available as searchable repositories for other laboratories. If data coming from different laboratories could be compared, new matches would eventually be discovered, and these matches would help to discriminate between truly useless data and potentially good data. We envision many advantages with this new methodology: other laboratories could provide the missing information for an incomplete spectrum or sequence, making a protein identification process succeed; moreover, matches could help to recognize new proteins or to identify PTMs.
P2P Networks for Proteomics
We propose a new scenario in which the information to be searched is no longer centralised in a few repositories, but in which information gathered from experiments in peer proteomics laboratories can be searched by fellow researchers. To avoid centralising all data into a single repository --with all the problems that such centralisation would entail-- it is better to maintain the information locally at each of the proteomics laboratories. This decentralised data storage, in turn, requires a decentralised search mechanism. Peer-to-peer (P2P) technologies fit these needs.
A P2P network provides methods for accessing distributed resources at minimal maintenance cost, as well as scalable techniques for searching through large amounts of resources scattered across the network. Furthermore, joining or leaving the network is a simple task. These properties make P2P technology an ideal candidate for implementing a distributed search mechanism across a network of proteomics labs. Other distributed storage systems, such as distributed databases or federated storage services, have been developed with efficiency in mind, at the price of very high maintenance and joining costs.
A proteomics laboratory acting as a peer in a P2P network would be able to share all or part of its data repository --e.g., mass spectra and de novo interpreted sequences-- so that other peers can benefit from it. In addition, in order to find matches among data coming from different peers, the interacting peers of such a P2P network would also need to validate and cross-check the consistency of the information obtained from fellow peers.
In this article, we describe an approach that implements such a P2P network on top of the OpenKnowledge (OK) system [5], which was developed in the scope of the European OpenKnowledge project.
The OpenKnowledge System
The OpenKnowledge (OK) system is a fully distributed system that uses P2P technologies to share peer-interaction protocols and service components across the network. For this, a kernel module --the OK kernel-- needs to be installed on each machine that is to be connected to the system. We shall generically call the shared protocols and service components OpenKnowledge Components (OKCs). These components are executed and coordinated using a common set of tools; in the Methods section below we show how the tools of the OK system are used to implement the proteomics P2P application. The OK system consists of three main services, which can be executed by any computer running the OK kernel:
• a discovery service, consisting of a distributed hash table (DHT), in which peer-interaction protocols and other OKCs are stored so that they can be located and downloaded by users;
• a coordination service, which manages the peer interactions between OKCs; and
• an execution service, which executes the offered services by means of the OK kernel on the local machine.
The workflow for implementing a new application on top of the OK platform is as follows. First, a specification defining the interaction protocol that links the different services has to be written. This specification is published to the discovery service so that other users can find it and execute OKCs capable of playing the roles it specifies. A developer, not necessarily the one who originally specified the protocol, then develops the OKCs that are to play the roles defined in the protocol specification. Some of these OKCs may be shared across the network by publishing them to the discovery service, so that others can also execute them on their local machines. At this point the application is said to be implemented.
After the application is implemented, it can be executed on top of the OK system. For this purpose, users wanting to interact as specified in the given peer-interaction protocol, by playing one of its roles, subscribe the appropriate OKCs to it. The discovery service is in charge of managing these subscriptions; when it has gathered enough of them to satisfy all the necessary roles in the protocol, it sends this information to a designated peer acting as protocol coordinator, which then manages the peer interaction by asking each of the components to provide its services as required by the interaction protocol.
The Lightweight Coordination Calculus
For the case at hand, the developer has to specify a protocol of the peer interaction defining the roles each participating peer has to play, the sorts of messages sent amongst them, and the particular constraints to be solved by the OKCs enacting these roles. Several modelling languages, such as those reviewed in [8], could have been chosen. Our aim, however, was to use the most easily applied formal language for this engineering task that we could conceive, and one for which an executable peer-to-peer environment already exists; we thus chose the Lightweight Coordination Calculus (LCC).
LCC is the executable interaction modelling language underlying the OK system. It is used to constrain interactions between distributed components and is neutral to the infrastructure used for message passing between components, although for the purposes of this paper we assume components are peers in some form of peer-to-peer network.
For example, Figure shows the specification in LCC of the protocol for sequenced MS spectra sharing that we will describe in detail later in the Methods section. It is based on a simple query-answering protocol between one inquirer and many repliers.
LCC specification of the protocol for sequenced MS spectra sharing.
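In outline, such a query-answering protocol can be written in LCC as follows. This is only a sketch: apart from the researcher and omicslab roles and the findHit constraint, which the protocol actually uses (see below), all names here (getQuery, selectLab, storeHit, hit, noHit) are illustrative, and for readability only one replier is shown.

    a(researcher, R) ::
        query(Q) => a(omicslab, L) <- getQuery(Q) and selectLab(L) then
        a(researcher(Q), R)

    a(researcher(Q), R) ::
        storeHit(Q, Hit) <- hit(Hit) <= a(omicslab, L)
        or
        noHit <= a(omicslab, L)

    a(omicslab, L) ::
        query(Q) <= a(researcher, R) then
        ( hit(Hit) => a(researcher, R) <- findHit(Q, Hit)
          or
          noHit => a(omicslab, L) => a(researcher, R) )

Reading the sketch top to bottom: the researcher solves its local constraints to formulate a query and pick a laboratory, sends the query, and then adopts a subrole in which it either stores an incoming hit or receives a noHit message closing the exchange; the laboratory, in turn, answers with a hit whenever its local findHit constraint can be solved, and with noHit otherwise.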
An LCC specification describes, in the style of a process calculus, a protocol for interaction between peers to achieve a collaborative task. The nature of this task is described through definitions of roles, with each role defined as a separate LCC clause; the set of these clauses forms the LCC interaction model. An interaction model provides a context for each message sent between peers by describing the current state of the interaction (not of the peer) at the time of message passing. Coordination is achieved between peers by communicating this state along with the appropriate messages. Since roles are independently defined within an interaction model, it is possible to distribute the computation to peers performing roles independently, with synchronisation occurring only through message passing. Should the application demand it, however, LCC can also be used in a more centralised, server-based style.
Figure shows the main definitions of LCC's syntax. A detailed discussion of LCC, its semantics, and the mechanisms used to deploy it lies outside the scope of this paper; for these, the reader is referred to [9]. Here we explain enough of LCC to demonstrate how interactions are represented.
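In outline, and omitting features not needed in this paper, the syntax defined in [9] is approximately the following, where an interaction model is a set of clauses, each clause defines a role, and Term is a structured term as in logic programming:

    Model   := Clause, ...
    Clause  := Role :: Def
    Role    := a(Type, Id)
    Def     := Role | Message | Def then Def | Def or Def
    Message := M => Role | M => Role <- C |
               M <= Role | C <- M <= Role
    C       := Term | C and C | C or C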
An interaction model in LCC is a set of clauses, each of which defines how a role in the interaction must be performed. Roles are described in the head of each clause by the type of role (and its parameters) and an identifier for the individual peer undertaking that role. Clauses may require subroles to be undertaken as part of the completion of a role. The performance of a role is defined using combinations of the sequence operator ('then') and the choice operator ('or') to connect messages and changes of role. Messages are terms, and are either outgoing to another peer in a given role ('=>') or incoming from another peer in a given role ('<='). Message input/output or change of role can be governed by a constraint to be solved before (when at the right of '<-') or after (when at the left of '<-') message passing or role change. Constraints are defined using the normal logical operators for conjunction, disjunction, and negation. If a constraint fails, the interaction may proceed along alternative paths (e.g., those specified with the operator 'or'). Notice that there is no commitment to the system of logic through which constraints are solved --on the contrary, we expect different peers to operate different constraint solvers.
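For instance, the following two fragments (with invented message and constraint names) show the two readings of '<-': in the first, the constraint formulate(X) must be solved before ask(X) is sent to the peer playing the oracle role; in the second, the constraint record(A) is solved only after reply(A) has been received from it.

    ask(X) => a(oracle, O) <- formulate(X)
    record(A) <- reply(A) <= a(oracle, O)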
A protocol like the one in Figure is generic in the sense that it yields different interactions depending on how the variables in its clauses (those starting with a capital letter) are bound at run time, which in turn depends on the choices made by peers when satisfying the constraints within these clauses.
To complete the application, we also need an implementation of the OKCs enacting each of the roles. For the protocol specified in Figure , this means two OKCs: one enacting the researcher role as specified in the first two clauses, and another enacting the omicslab role as specified in the third clause. Each OKC must therefore be able to solve the constraints occurring in its respective role specification.
For instance, for the omicslab role, the relevant OKC must be able to solve the constraint findHit(...). Therefore, its implementation must provide at least a findHit method. This method should search the local database for data that matches a given query. Obviously, such an implementation will be tightly coupled to the local machinery: the file format used for storing the information and the type of storage system from which it has to be retrieved. This is an obstacle to the portability of OKCs across different laboratories. Consequently, it is advisable that each laboratory develops its own OKC for playing the omicslab role, adjusted to its own system requirements. Standard OKCs for the most common formats and mass spectrometers could nevertheless be made publicly available for download and sharing: there is no restriction in the OK system preventing locally produced OKCs from being published and downloaded by other users.
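As an illustration, a minimal sketch of such an implementation is given below in Java. The class and method names and the storage schema are hypothetical (the interface between an OKC and the OK kernel is not detailed in this section); the essential point is that the component exposes a findHit method mirroring the findHit(...) constraint of the omicslab role, and returns no result when the local repository contains no match, so that the interaction can proceed along the alternative (noHit) path of the protocol.

    // Hypothetical OKC sketch: class name, method signature, and storage
    // schema are illustrative only; a real OKC would follow the OK kernel's
    // component conventions and the laboratory's own storage system.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class OmicsLabOKC {

        // Location of this laboratory's local repository of mass spectra
        // and de novo interpreted sequences (here, a relational database).
        private final String jdbcUrl;

        public OmicsLabOKC(String jdbcUrl) {
            this.jdbcUrl = jdbcUrl;
        }

        // Solves the findHit(Query, Hit) constraint of the omicslab role:
        // searches the local repository for an entry matching the query.
        // Returns null when there is no match, so that the interaction can
        // take the alternative (noHit) path of the protocol.
        public String findHit(String query) throws Exception {
            try (Connection c = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = c.prepareStatement(
                         "SELECT hit FROM local_matches WHERE query_key = ?")) {
                ps.setString(1, query);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString("hit") : null;
                }
            }
        }
    }

A laboratory storing its data in flat files rather than a database would implement the same method against its own storage; the method body varies from peer to peer, but the constraint it solves does not.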