|Home | About | Journals | Submit | Contact Us | Français|
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Existing scientific literature is a rich source of biological information such as disease markers. Integration of this information with data analysis may help researchers to identify possible controversies and to form useful hypotheses for further validations. In the context of proteomics studies, individualized proteomics era may be approached through consideration of amino acid substitutions/modifications as well as information from disease studies. Integration of such information with peptide searches facilitates speedy, dynamic information retrieval that may significantly benefit clinical laboratory studies.
We have integrated from various sources annotated single amino acid polymorphisms, post-translational modifications, and their documented disease associations (if they exist) into one enhanced database per organism. We have also augmented our peptide identification software RAId_DbS to take into account this information while analyzing a tandem mass spectrum. In principle, one may choose to respect or ignore the correlation of amino acid polymorphisms/modifications within each protein. The former leads to targeted searches and avoids scoring of unnecessary polymorphism/modification combinations; the latter explores possible polymorphisms in a controlled fashion. To facilitate new discoveries, RAId_DbS also allows users to conduct searches permitting novel polymorphisms as well as to search a knowledge database created by the users.
We have finished constructing enhanced databases for 17 organisms. The web link to RAId_DbS and the enhanced databases is http://www.ncbi.nlm.nih.gov/CBBResearch/qmbp/RAId_DbS/index.html. The relevant databases and binaries of RAId_DbS for Linux, Windows, and Mac OS X are available for download from the same web page.
Scientific literature, documenting different studies and conclusions, is among the most important sources of knowledge and biological information. It has been noted [1,2] that it is in the scientific community's best interest to be able to have such information consolidated and organized in an easy-to-use format so that researchers can integrate and/or interrogate the existing knowledge during biological data analysis. Such a knowledge integration may help researchers in identifying conflicting results , forming new hypotheses, and performing experimental validations. In the scope of proteomics studies to which we now turn, information related to single amino acid polymorphisms (SAPs) and post-translational modifications (PTMs) is among the most important.
Like single nucleotide polymorphisms (SNPs) that occur roughly every 300 base pairs , SAPs also differentiate individuals from one another. It is well known that SNPs may result in SAPs that are not yet annotated and thus not present in the standard protein databases. To enable identification of peptides containing this type of SAPs, Edwards  had come up with a compression scheme to reduce the size of the expressed sequence tag (EST) database to allow searches within the compactified database. In addition to resulting from nonsynonymous SNPs, however, SAPs may also occur due to post-transcriptional regulations such as mRNA editing . SAPs together with PTMs are often used as disease markers [7-10]. Integration of this annotated, disease-related knowledge with data analysis facilitates speedy, dynamic information retrieval that may significantly benefit clinical laboratory studies.
To incorporate existing knowledge and information within peptide searches, we start by constructing a human protein database where information about annotated SNPs, SAPs, PTMs, and their disease associations (if any) are integrated. Consequently, the database part of our work may be considered an advancement of references  and . The former extended the human protein database to include SAPs but without PTMs and without integration of disease information, while the latter allows for protein-specific annotated PTMs but without SAPs and without integration of disease information. We have also modified our peptide identification software RAId_DbS  to take into account the integrated information of annotated SAPs/PTMs and diseases while performing peptide searches. It is perceivable that the disease marker within a protein might be manifested as specific combinations of SAPs/PTMs, which we term information correlation. As explained in the caption of Figure Figure1,1, our database construction can easily accommodate correlations of this type. To further facilitate new discoveries, RAId_DbS allows users to conduct searches permitting novel SAPs.
However, it is worth pointing out that allowing annotated SAPs/PTMs (or novel SAPs) during the search, one is dealing with a larger search space than before and thus should anticipate an increase (decrease) in false positives (retrieval efficiency). Therefore, we recommend the users to turn on annotated SAPs/PTMs and novel SAPs only if the regular searches returns no significant hit. Specifically, we recommend the users to perform regular searches first. For spectra that do not receive any significant hit from regular searches, one may turn on annotated SAPs/PTMs and then search again. Finally, for spectra that receive no significant hit from both regular searches and searches with annotated SAPs/PTMs, one may turn on the novel SAPs together with annotated SAPs/PTMs and then search again.
We have built a web-based application taking query spectrum online as well as prepared standalone downloadable executables that can be installed locally on users' own machines. An important feature of the standalone version is the flexibility for users to add SAP and/or PTM information to various proteins they are interested in and even to create a user-specific database that contains new protein sequences. In the next section, we describe our database construction to illustrate how we accommodate the SAPs, PTMs, and their disease associations. We then provide a brief introduction to our software RAId_DbS and elaborate on its augmentation. In the result section, we use a few examples to show the structure of our database. The optimal use of our enhanced database in information retrieval is sketched in the discussion section.
In the following discussion, we use human database construction as an example to illustrate how we enhance the protein databases. Similar procedures are employed to construct enhanced databases of other organisms, see Table Table11 for a summary. We extracted 34,197 human protein sequences with a total of 16,814,674 amino acids from the flat file (last updated 09/05/2006) ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/protein/protein.gbk.gz. Included in this file are proteins and their associated annotations generated respectively through the Reference Sequence and the Genome Annotation projects of the National Center for Biotechnology Information (NCBI). Each protein sequence is accompanied by a list of annotated SAPs and PTMs. Out of the 34,197 proteins, we found 29,979 unique proteins with a total of 15,324,913 amino acids. To avoid having multiple copies of identical or almost identical proteins in the database without losing information, we first perform an information-preserved clustering on the 34,197 sequences.
This process starts with an all-against-all BLAST  among the 34,197 sequences. Two sequences with identical lengths and aligned gaplessly with less than 2% mismatches are clustered, and each sequence is called a qualified hit of the other. Any other sequence that satisfies this condition with a member of an existing cluster is assigned to that existing cluster. All the annotations in the same cluster are then merged. We find it possible for every given cluster to choose a consensus sequence that will make all other members its polymorphous forms. Hence, we only retain one protein sequence for each of the 29,272 clusters finally obtained. The total number of amino acids associated with these 29,272 consensus proteins is 15,001,326. Although we only retain one sequence (the consensus sequence) per cluster, the information of other member sequences is still kept. For example, when a member sequence and the consensus sequence disagree at two sites, the presence of the member sequence is documented by introducing two cluster-induced SAPs at the two sites of the consensus sequence. The originally annotated SAPs and PTMs of the member sequence are also merged into those of the consensus sequence. Figure Figure11 and its caption illustrate how this process is done iteratively. In our processed definition file (see Table Table22 for an example), each SAP or PTM is documented with its origin. SAPs arising from clustering are easily distinguished from annotated SAPs. For member sequences that are identical to the consensus sequence, the accession numbers of those member sequences are also recorded with their SAPs/PTMs annotations merged into the consensus sequence. When a user selects not to have annotated SAPs, RAId_DbS still allows for cluster-induced SAPs resulting in an effective search of the original databases but with minimum redundancy. The strategy employed by RAId_DbS to search for SAPs and PTMs will be briefly described in the "RAId_DbS and its Augmentation" section below.
The consensus protein in a given cluster is then used as a query to BLAST against the NCBI's nr database to retrieve its RefSeq accession number and its corresponding Swiss-Prot http://ca.expasy.org/sprot/ accession number, if it exists, from the best qualified hit. It is possible for a cluster to have more than one accession number. This happens when there is a tie in the qualified best hits and when a protein sequence in nr actually is documented with more than one accession number.
To minimize inclusion of less confident annotations, we only keep the SAPs and PTMs that are consistently documented in more than one source. For example, for proteins with Swiss-Prot accession numbers, we only keep the SAPs and PTMs that are present both in Swiss-Prot  and GenBank . For proteins without Swiss-Prot accession numbers, the retentions of SAPs and PTMs are described below. The PTM annotations are kept only if they are present in the gzipped document HPRD_FLAT_FILES_090107.tar.gz of the Human Protein Reference Database http://www.hprd.org. The SAP annotations are kept only if they are in agreement with the master table, SNP_mRNA_pos.bcp.gz (last updated 01/10/2007), of dbSNP: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data. Even though in many cases the primary information sources might be identical, the curation protocols may differ resulting in inconsistent annotations that are removed by our filtering strategy.
The web interface of RAId_DbS is shown in Figure Figure2.2. An example of the reported peptide list resulting from a search is shown in Figure Figure3.3. Note that the Goodness-of-fit of the score model is reported. Basically, it represents how well the score model, be theoretically derived or assumed, agrees with the cumulated score histogram. Also reported is a quantity called the model P-value, which documents the likelihood for the correlation strength between the model score distribution and the score histogram to come out of random matching. Readers are referred to reference  for details.
Here we briefly explain the underpinning of RAId_DbS statistics and RAId_DbS's strategy to deal with searches in different effective database sizes. The latter strategy can be generalized to handle effective database size expansion due to inclusion of SAPs and PTMs. Taking into account the finite sample effect and the skewness of peak intensity distribution, the form of asymptotic score statistics (P-values) of RAId_DbS  is derived theoretically. Since the skewness varies per spectrum, the theoretically determined parameters for our derived distribution are spectrum-specific. For most spectra considered, our theoretical distributions (used to compute P-values) agree well with the score histograms accumulated. The final E-value for each peptide hit in a search, however, is obtained by multiplying the peptide's P-value by the number of peptides of its category. As a specific example, when trypsin is used as the digesting enzyme, RAId_DbS allows for incorrect N-terminal cleavages. RAId_DbS has internal counters, Cc and Cinc, totaling respectively the number of scored peptides with correct and incorrect N-terminal cleavage. In general, Cinc Cc. When calculating the E-value of a peptide with correct N-terminal cleavage, RAId_DbS multiplies the peptide's P-value by Cc. However, the E-value of a peptide with incorrect N-terminal cleavage will be obtained by multiplying the peptide's P-value by Cc + Cinc . In line with the Bonferroni correction that is rooted in the Bonferroni inequality , our approach avoids overstating the significance of a hit from a larger effective database (the pool of peptides regardless of whether the N-terminal cleavage is correct) versus a hit from a smaller effective database (the pool of peptides with correct N-terminal cleavage only).
The same statistical approach is used in the augmented RAId_DbS. Different counters are set up to record the numbers of scored peptides in different categories. As a specific example, when novel SAPs are allowed, RAId_DbS creates a new counter, Cnovel_sap, to total the number of scored peptides with a novel SAP. This is in general a much larger number than other counters. When we calculate the E-value associated with a peptide hit that contains a novel SAP, we multiply the peptide's P-value by the sum of existing counters with Cnovel_sap included. However, in the same search, for a peptide without novel SAP, its E-value is obtained by multiplying the peptide's P-value by the sum of existing counters excluding Cnovel_sap. The same approach is applied to PTMs and other annotations.
Below we briefly sketch how RAId_DbS deals with the presence of annotated SAPs, PTMs as well as novel SAPs. In our database format, annotated SAPs and PTMs are inserted right after the site of variation, see Figures Figures11 and and4.4. From this point on, we will call sites containing annotated SAPs/PTMs variable sites and sites without annotated SAPs/PTMs unvaried sites. When searching the database for peptides with parent ion mass 1500 Da, for example, RAId_DbS sums the masses of amino acids within each possible peptide to see if the total mass is within 3 Da (the default parent ion mass error range of RAId_DbS) of 1500 Da. At this stage, a variable site has, instead of a fixed mass, several possible masses depending on the number of annotated SAPs/PTMs at that site. A peptide that covers some variable sites therefore has several masses, each corresponding to a specific arrangement of SAPs/PTMs. If some of these masses happen to be within 3 Da of 1500 Da, RAId_DbS will score this peptide with corresponding annotated SAPs/PTMs that give rise to the proper masses. If none of these masses are within the allowed molecular mass range, that peptide will not be scored. It is worth pointing out that the default mass error tolerance (3Da) may be changed by the user on the web page while submitting a query spectrum to search.
Note that our approach is computationally efficient in terms of mass selection. For example, if a peptide contains a variable site, one first sum the amino acid masses of unvaried sites to obtain munv. One then checks whether the masses associated with the variable site adding to munv will fall in the desirable mass range. This approach is particularly powerful when there is more than one variable site in the peptide considered. As demonstrated in Figure Figure11 and its caption, the combinatorics associated with two variable sites result in only a longer list of possible masses to be added to munv. This should be contrasted with methods that incorporate SAPs via appending polymorphous peptides to the end of the primary sequence. In the latter approach, the program needs to do the mass sum multiple times, repeating the mass sum of unvaried sites, and thus may slow down the searches.
Despite RAId_DbS's strategic advantage, introduction of SAPs/PTMs does increase the complexity of the algorithm. Therefore, during the searches RAId_DbS only considers for each candidate peptide to have up to two annotated SAPs and up to five annotated PTMs. To facilitate discovery, RAId_DbS also permits novel SAPs, but limited to one novel SAP per not-yet-annotated peptide, meaning peptides that do not contain any annotated SAPs/PTMs. This is because the introduction of novel SAP largely expands the search space, and if one allows novel SAPs within peptides already documented with SAPs/PTMs, the search space expansion will be even larger and may render the search intractable. Currently, the novel SAP search is expedited via a pre-computed list of amino acid mass difference. As an example, assume that one is searching for a peptide with parent ion mass 1500 Da, and a not-yet-annotated candidate peptide has mass 1477 Da, 23 Da smaller than the target mass. It happens that 23 Da is also the mass difference between Tryptophan and Tyrosine, and if the candidate peptide contains a Tyrosine, RAId_DbS will replace that Tyrosine with a Tryptophan and score the new peptide. If the candidate peptide contains two Tyrosines, RAId_DbS will replace one Tyrosine at a time with a Tryptophan and score both of the new peptides. It is evident that the complexity grows fast if one were to allow for two novel SAPs or more per peptide.
In this section, we first report the status of our ongoing construction of and real examples of enhanced organismal databases. Comparison to related approaches will also be provided, followed by a few example studies.
As summarized in Table Table1,1, we have finished constructing databases for 17 organisms. Note that disease information is included only in the human database. Within the enhanced human database, we have 123,464 SAPs and 81,984 PTMs. Of those SAPs and PTMs, 15,787 have disease associations. In each enhanced organismal database, the consensus sequences (after information-preserved sequence clustering) are fused into a single string separated by the "[" character. This long string is stored in a file with a suffix ".seq" or simply called the sequence file. The sequence identifiers and other annotations are relegated into a file with a suffix ".def" or called a processed definition file. The processed definition file is only used in the final reporting stage of the search. A typical protein sequence in an enhanced sequence file carries with it annotated SAPs and PTMs in a simple format. In Figure Figure44 we show two consensus sequences containing SAPs/PTMs. The entries, associated with these two consensus sequences, in the processed definition file are shown in Table Table22.
The format of our sequence file minimizes redundancy in searches. For example, if a single site contains two SAPs, construction method proposed by Schandorff et. al.  will demand two almost identical partial sequences, each may be several tens of amino acids in length, be appended after the primary sequence, while in our case it only takes up a few additional bytes. The compactness of our database becomes obvious when incorporating the information of two nearby sites, each containing several annotated SAPs and PTMs, into the database. In our construction, we only need a few additional bytes. But in other approaches, it may introduce an appreciable expansion due to including/excluding and pairing of different variations at both sites along with the flanking peptides, see Figure Figure55 for an illustration. Another key difference between our method and other database methods is that we do not need to limit the number of enzymatic miscleavages.
When needed and when using the standalone version of RAId_DbS, users may create their own databases with user-specific knowledge input. The user will provide both a FASTA file containing sequences to be included and a separate information file documenting the modifications and annotation associated with variable sites of each sequence. The format is illustrated in Figure Figure6.6. Through our Perl script UserDB.pl, the flat information file -containing the protein accession numbers, detailed SAP/PTM information, and disease associations- is processed together with the FASTA file provided by the user to generate the user-specific ".seq" file and ".def" file which are suitable for searches using RAId_DbS. If one wishes to add additional SAPs or PTMs, one simply updates both the FASTA sequence file as well as the flat information file and rerun the Perl script. When reporting a hit with annotated SAPs or PTMs, RAId_DbS automatically reports the corresponding detailed information and disease association if it exists.
Using a tandem mass (MS2) spectrum taken from the profile dataset described in reference , we illustrate in Table Table33 two search results in the human protein database with the annotated SAPs and PTMs turned off (a) and on (b) respectively. In case (a), the best hit is a false positive with E-value about 0:11 implying that one probably ends up declaring no significant peptide hit for this spectrum. In case (b), however, the best hit is a true positive (a peptide from human transferin with an annotated SAP) with E-value about 4.0 × 10-7. This example shows that if properly used, allowing SAPs/PTMs may increase the peptide identification rate. That is, it may be fruitful to turn on the SAPs/PTMs when a regular search returns no significant hit. Turning on SAPs/PTMs without first searching with SAPs/PTMs turned off, however, may cause a loss of sensitivity due to the increase of search space.
It is commonly believed that when searching a large database, sensitivity is severely lost. This is particularly true if the E-value for every hit is obtained by multiplying the peptide's P-value by the same number (e.g. the largest effective database size) regardless of the category that peptide belongs to. As we have explained earlier, RAId_DbS does not do that. It uses a method equivalent to Bonferroni correction. We use E-values to rank peptide hits and each peptide's E-value is obtained by multiplying its P-value by the corresponding size of the effective database that the peptide belongs to. Consequently, to be equally significant, peptide hits falling in a category that has a large effective database size need to have smaller P-values than those of peptide hits falling in a category that has a small effective database size.
Nevertheless, even with such a strategy one can never guarantee to bypass the sensitivity loss problem associated with searching a large database.
Although we have shown  as a preliminary study that no appreciable loss of sensitivity is found using the 54 training spectra of PEAKS , the number of spectra there is too small to ensure our observation to be statistically robust. We therefore set out to use spectra from a larger collection of human proteins , henceforth referred to as the Aurum dataset, to test the severity of sensitivity loss when expanding the search space via turning on SAPs/PTMs and novel SAPs. The Aurum data  is a set of MS/MS spectra generated in an ABI 4700 MALDI TOF/TOF instrument. The sample is a mixture of 246 human proteins that were individually purified and tryptically digested. This data was developed to be a standard reference for the purpose of testing or training new algorithms.
In Figure Figure7,7, we show the Receiver Operating Characteristic (ROC) curves when analyzing the Aurum dataset  which contains 9977 spectra from a selection of human proteins. In each of the two panels of Figure Figure7,7, there are three ROC curves corresponding to regular searches without SAPs/PTMs, searches allowing annotated SAPs/PTMs, and searches allowing both annotated SAPs/PTMs as well as novel SAPs. Although there seems to be no performance degradation judging from the sensitivity versus 1-specificity plot (panel (A)), we do see mild degradation in terms of the cumulative number of true positives found using a fixed number of false positives as the threshold (panel (B)). Compared to regular searches, turning on SAPs/PTMs and novel SAPs results in a larger number of false hits which pulls the ROC curve towards the left end upon normalization. This may partially explain why turning on SAPs/PTMs and novel SAPs does not introduce an appreciable loss in sensitivity.
From analyzing the Aurum dataset using different search spaces, we confirmed that the number of true positives found at a false positive threshold may decrease if the search is done in a larger search space, i.e., with novel SAPs and/or annotated SAPs/PTMs unconditionally enabled. This indicates that it is not productive to search with the annotated SAPs/PTMs and novel SAPs enabled all the time. We recommend the user to turn on these features conditionally. For example, if a spectrum does not receive any significant hit from a regular search, one may then allow the annotated SAPs/PTMs. If the search still returns no significant hit, one may then turn on novel SAPs in the search. It is in this context that one may increase the number of peptides identified.
It has not escaped our attention that the ROC curves shown in panel (B) of Figure Figure77 do not rise steeply as one typically sees. This, however, may be caused by the presence of contaminants during protein purification that introduced peptides not belonging to the target proteins. Since our main purpose is to study the relative sensitivity degradation upon enlarging the search space, we do not delve into the investigations of peptide hits with low E-values but not subsequences of target proteins.
The processed definition files associated with our enhanced databases contain consolidated information in a tab delimited format, allowing easy information extraction by others who are interested in utilizing this information in different contexts. While the information contained in our enhanced databases are helpful in terms of forming hypotheses and narrowing down the scope of investigation, it should be used with caution because scientific literature, consisting of coherent information, also contains conflicting information. Therefore the reported disease association should not be used as a diagnosis report but only be used as a reference for further investigation. In particular, from clinical application point of view, patients and clinical scientists may benefit from such information as it suggests possibilities of diseases that may otherwise be overlooked.
It is our plan to continue construction of enhanced databases for additional organisms. Although little disease information exists for most organisms other than human, we will include it in our databases when more information becomes available. For example, the NCBI's Online Mendelian Inheritance in Animals OMIA, http://www.ncbi.nlm.nih.gov/ contains information of genes, inherited disorders and traits in animal species (other than human and mouse). We plan to integrate this information into our organismal databases in the near future. Under our database format, it is also possible to incorporate other information such as protein fusions, 3D protein structures, drug-binding/active sites, cross-linking sites, and isoforms. We are currently assessing which additional information to include next. It is also worth pointing out that our database compactification via clustering has an advantage in reducing search time.
Without collapsing identical and almost identical proteins, one is bound to score identical peptides multiple times. Our compactification strategy minimizes redundant searches of this sort. This reduction of redundancy will become important when exploring unrestricted PTM searches.
All authors contributed to the design of the research and analysis of the results, GA and AO carried out the research, YKY wrote the paper. All authors read and approved the final manuscript.
We thank the administrative group of the NIH biowulf clusters, where all the computational tasks were carried out. We also thank Dr. David Landsman and Dr. Lewis Geer for helpful comments. This work was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.