The design of the Agile Protein Interaction DataAnalyzer (APID) aims to be as simple and light as possible, keeping only the minimal information needed to provide correct and easy access to all included data sets. This design follows the software engineering methodology known as ‘agile’ (13), which embraces software development using lightweight and adaptable methods. Agile methods call for evolutionary design and assume that changes will occur throughout the whole life cycle of a product. Changes are kept controlled and easy to implement, and the attitude of the designer is to enable them. APID has been designed following this strategy in order to achieve a useful and active integration of the protein–protein interaction source databases included.
All the work has been developed in the Java programming language (http://java.sun.com/), and a J2EE architecture has been used to build the web interface and the graphical applet tool described below. To parse the source data we used SAX and DOM Java programs to extract the information from the XML files, and JDBC programs to insert the processed data into the server. After parsing, we still found problems in unifying all the source data. The main obstacle was the multiple, heterogeneous protein identifiers given by the different sources, which often cause false disjunction and incoherence in the data. To solve this we used the protein sequence as the most unique and biologically meaningful ‘protein code’, which allowed a good unification using the BLAST2 algorithm (14) to find in UniProt each protein given by the source databases. Once a protein was recognized on the basis of sequence alignment, we linked it to a unique UniProt code. Besides the unique protein code, obtaining coherent and uniform data also required coherence about the experimental method or methods that validate any given interaction. Identifying the method also makes it possible to find the consensus or agreement between the different databases for any given interaction. In this way, we have obtained a protocol able to store and unify protein interaction databases in a clear, uniform structure, maintaining the integrity of the data and correcting some failures found in the original files.
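The parsing and unification steps can be illustrated with a minimal Java sketch. It is not the APID code: the XML element names (`interactor`, `sequence`), the class `SequenceUnifier` and the accession `ACC_0001` are hypothetical, and an exact sequence lookup in an in-memory table stands in for the BLAST2 search against UniProt. The sketch only shows how two different source identifiers collapse onto one accession once the sequence is used as the key.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: element names, class name and accessions are assumptions.
final class SequenceUnifier {

    /** SAX pass collecting source identifier -> normalized sequence. */
    static Map<String, String> parseInteractors(String xml) {
        final Map<String, String> sequencesById = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private String currentId;   // id of the <interactor> being read, if any
            private StringBuilder text; // non-null only while inside <sequence>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("interactor".equals(qName)) {
                    currentId = atts.getValue("id");
                } else if ("sequence".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("sequence".equals(qName)) {
                    if (currentId != null) {
                        // Trim and upper-case so formatting differences do not split a protein.
                        sequencesById.put(currentId, text.toString().trim().toUpperCase());
                    }
                    text = null;
                } else if ("interactor".equals(qName)) {
                    currentId = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse source XML", e);
        }
        return sequencesById;
    }

    /** Exact-match lookup (a stand-in for BLAST2): source identifier -> accession. */
    static Map<String, String> resolve(Map<String, String> sequencesById,
                                       Map<String, String> accessionBySequence) {
        Map<String, String> accessionById = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : sequencesById.entrySet()) {
            String accession = accessionBySequence.get(entry.getValue());
            if (accession != null) { // unresolved proteins are left out
                accessionById.put(entry.getKey(), accession);
            }
        }
        return accessionById;
    }

    public static void main(String[] args) {
        // Made-up input: two source identifiers carrying the same sequence.
        String xml = "<entries>"
                + "<interactor id=\"dip:D1\"><sequence>MKTAYIAK</sequence></interactor>"
                + "<interactor id=\"mint:M7\"><sequence>mktayiak</sequence></interactor>"
                + "</entries>";
        Map<String, String> reference = new HashMap<>();
        reference.put("MKTAYIAK", "ACC_0001"); // placeholder accession
        System.out.println(resolve(parseInteractors(xml), reference));
    }
}
```

In this example both made-up identifiers resolve to the same placeholder accession, which is the behaviour that removes the false disjunction described above. A real pipeline would replace the exact lookup with an alignment search and a similarity threshold.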
Following the described strategy, the data unification has been done based on three key reference identifiers (IDs): (i) UniProt ID (i.e. UniProt accession number), to allow a specific identification of each protein and a direct link to its sequence and to the rest of the curated protein information included in UniProt (15); (ii) PSI-MI ID, to unify the experimental methods used in different publications to a common terminology developed by PSI-MI (16) (i.e. to a controlled vocabulary with standard identifiers); (iii) PubMed ID (PMID), to link each interaction validated by a given experimental method to a specific PubMed literature reference, and also to assign experimental method identifiers to the PubMed publications that describe each method. These main key identifiers constitute a simple information core that makes APID an agile tool to access and search through the interactomes.
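As a rough illustration of how the three identifiers can act as an information core, the following Java sketch stores one evidence record per reported interaction and groups the records by protein pair and method. The class and field names (`EvidenceIndex`, `Evidence`, `sourceDb`) are hypothetical and do not come from the APID schema; accessions, method identifiers and PMIDs in the example are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: names and example values are assumptions.
final class EvidenceIndex {

    /** One report of an interaction: protein pair, PSI-MI method, PMID, source database. */
    static final class Evidence {
        final String proteinA;
        final String proteinB;
        final String psiMiId;
        final String pubmedId;
        final String sourceDb;

        Evidence(String p1, String p2, String psiMiId, String pubmedId, String sourceDb) {
            // Order the pair so that A-B and B-A are treated as the same interaction.
            boolean ordered = p1.compareTo(p2) <= 0;
            this.proteinA = ordered ? p1 : p2;
            this.proteinB = ordered ? p2 : p1;
            this.psiMiId = psiMiId;
            this.pubmedId = pubmedId;
            this.sourceDb = sourceDb;
        }
    }

    /** Key "accessionA|accessionB|method" -> set of source databases reporting it. */
    static Map<String, Set<String>> sourcesByPairAndMethod(List<Evidence> records) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Evidence e : records) {
            String key = e.proteinA + "|" + e.proteinB + "|" + e.psiMiId;
            index.computeIfAbsent(key, k -> new TreeSet<>()).add(e.sourceDb);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Evidence> records = new ArrayList<>();
        // Made-up records; none of these values is taken from a real database.
        records.add(new Evidence("ACC_A", "ACC_B", "MI:0018", "PMID_1", "DIP"));
        records.add(new Evidence("ACC_B", "ACC_A", "MI:0018", "PMID_1", "MINT"));
        records.add(new Evidence("ACC_A", "ACC_B", "MI:0019", "PMID_2", "DIP"));
        for (Map.Entry<String, Set<String>> entry : sourcesByPairAndMethod(records).entrySet()) {
            System.out.println(entry.getKey() + " reported by " + entry.getValue());
        }
    }
}
```

The size of each set gives a simple measure of agreement: an interaction and method reported by several source databases has more consensus than one reported by a single source.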
At present, APID integrates data coming from five main source databases: BIND (10), DIP (11), HPRD (Human Protein Reference Database) (17), IntAct (Database system and analysis tools for protein interaction data) (12) and MINT (Molecular Interactions Database) (18). The data included in APID from these source databases correspond only to protein–protein interactions (i.e. not interactions of proteins with other ligands such as DNA), and each interaction has to be experimentally validated and accompanied by a PubMed reference. In addition, as indicated above, each protein has to be identified by its sequence and its UniProt code. In all cases, the web tool includes for each interaction links to the original files of the source databases and to the PubMed references that validated the interaction. Finally, each protein includes links to the corresponding UniProt file and to other related databases [such as InterPro, Pfam, Gene Ontology (GO), Ensembl and NCBI Gene].
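The inclusion criteria stated above can be summarized as a simple predicate. The Java sketch below is only an assumption about how such a check could look: the class `InclusionFilter`, the `Candidate` fields and the type label `"protein"` are hypothetical, and the check on experimental validation is reduced here to the presence of a PubMed reference.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names and type labels are assumptions.
final class InclusionFilter {

    /** A candidate interaction as read from a source database. */
    static final class Candidate {
        final String typeA;    // interactor type, e.g. "protein" or "dna" (hypothetical labels)
        final String typeB;
        final String uniprotA; // null when the sequence could not be matched in UniProt
        final String uniprotB;
        final String pubmedId; // null when the source gives no PubMed reference

        Candidate(String typeA, String typeB, String uniprotA, String uniprotB, String pubmedId) {
            this.typeA = typeA;
            this.typeB = typeB;
            this.uniprotA = uniprotA;
            this.uniprotB = uniprotB;
            this.pubmedId = pubmedId;
        }
    }

    /** True only for protein-protein pairs with both accessions and a PubMed reference. */
    static boolean isIncluded(Candidate c) {
        boolean proteinProtein = "protein".equals(c.typeA) && "protein".equals(c.typeB);
        boolean identified = c.uniprotA != null && c.uniprotB != null;
        boolean referenced = c.pubmedId != null && !c.pubmedId.isEmpty();
        return proteinProtein && identified && referenced;
    }

    /** Keep the candidates that satisfy all criteria, preserving input order. */
    static List<Candidate> filter(List<Candidate> candidates) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (isIncluded(c)) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```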