A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a widely popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
Summary: Iterative similarity searches with PSI-BLAST position-specific score matrices (PSSMs) find many more homologs than single searches, but PSSMs can be contaminated when homologous alignments are extended into unrelated protein domains—homologous over-extension (HOE). PSI-Search combines an optimal Smith–Waterman local alignment sequence search, using SSEARCH, with the PSI-BLAST profile construction strategy. An optional sequence boundary-masking procedure, which prevents alignments from being extended after they are initially included, can reduce HOE errors in the PSSM profile. Preventing HOE improves selectivity for both PSI-BLAST and PSI-Search, but PSI-Search has ~4-fold better selectivity than PSI-BLAST and similar sensitivity at 50% and 60% family coverage. PSI-Search is also produces 2- for 4-fold fewer false-positives than JackHMMER, but is ~5% less sensitive.
Availability and implementation: PSI-Search is available from the authors as a standalone implementation written in Perl for Linux-compatible platforms. It is also available through a web interface (www.ebi.ac.uk/Tools/sss/psisearch) and SOAP and REST Web Services (www.ebi.ac.uk/Tools/webservices).
BLAST is a commonly-used software package for comparing a query sequence to a database of known sequences; in this study, we focus on protein sequences. Position-specific-iterated BLAST (PSI-BLAST) iteratively searches a protein sequence database, using the matches in round i to construct a position-specific score matrix (PSSM) for searching the database in round i + 1. Biegert and Söding developed Context-sensitive BLAST (CS-BLAST), which combines information from searching the sequence database with information derived from a library of short protein profiles to achieve better homology detection than PSI-BLAST, which builds its PSSMs from scratch.
We describe a new method, called domain enhanced lookup time accelerated BLAST (DELTA-BLAST), which searches a database of pre-constructed PSSMs before searching a protein-sequence database, to yield better homology detection. For its PSSMs, DELTA-BLAST employs a subset of NCBI’s Conserved Domain Database (CDD). On a test set derived from ASTRAL, with one round of searching, DELTA-BLAST achieves a ROC5000 of 0.270 vs. 0.116 for CS-BLAST. The performance advantage diminishes in iterated searches, but DELTA-BLAST continues to achieve better ROC scores than CS-BLAST.
DELTA-BLAST is a useful program for the detection of remote protein homologs. It is available under the “Protein BLAST” link at http://blast.ncbi.nlm.nih.gov.
This article was reviewed by Arcady Mushegian, Nick V. Grishin, and Frank Eisenhaber.
A SIB-BLAST web server (http://sib-blast.osc.edu) has been established for investigators to use the SimpleIsBeautiful (SIB) algorithm for sequence-based homology detection. SIB was developed to overcome the model corruption frequently observed in the later iterations of PSI-BLAST searches. The algorithm compares resultant hits from the second iteration to the final iteration of a PSI-BLAST search, calculates the figure of merit for each ‘overlapped’ hit and re-ranks the hits according to their figure of merit. By validating hits generated from the last profile against hits from the first profile when the model is least corrupted, the true and false positives are better delineated, which in turn, improves the accuracy of iterative PSI-BLAST searches. Notably, this improvement to PSI-BLAST comes at minimal computational cost as SIB-BLAST utilizes existing results already produced in a PSI-BLAST search.
Sequence alignments may be the most fundamental computational resource for molecular biology. The best methods that identify sequence relatedness through profile–profile comparisons are much slower and more complex than sequence–sequence and sequence–profile comparisons such as, respectively, BLAST and PSI-BLAST. Families of related genes and gene products (proteins) can be represented by consensus sequences that list the nucleic/amino acid most frequent at each sequence position in that family. Here, we propose a novel approach for consensus-sequence-based comparisons. This approach improved searches and alignments as a standard add-on to PSI-BLAST without any changes of code. Improvements were particularly significant for more difficult tasks such as the identification of distant structural relations between proteins and their corresponding alignments. Despite the fact that the improvements were higher for more divergent relations, they were consistent even at high accuracy/low error rates for non-trivially related proteins. The improvements were very easy to achieve; no parameter used by PSI-BLAST was altered and no single line of code changed. Furthermore, the consensus sequence add-on required relatively little additional CPU time. We discuss how advanced users of PSI-BLAST can immediately benefit from using consensus sequences on their local computers. We have also made the method available through the Internet (http://www.rostlab.org/services/consensus/).
We have characterized a novel type of PSI-BLAST error, homologous over-extension (HOE), using embedded PFAM domain queries on searches against a reference library containing Pfam-annotated UniProt sequences and random synthetic sequences. PSI-BLAST makes two types of errors: alignments to non-homologous regions and HOE alignments that begin in a homologous region, but extend beyond the homology into neighboring sequence regions. When the neighboring sequence region contains a non-homologous domain, PSI-BLAST can incorporate the unrelated sequence into its position specific scoring matrix, which then finds non-homologous proteins with significant expectation values. HOE accounts for the largest fraction of the initial false positive (FP) errors, and the largest fraction of FPs at iteration 5. In searches against complete protein sequences, 5–9% of alignments at iteration 5 are non-homologous. HOE frequently begins in a partial protein domain; when partial domains are removed from the library, HOE errors decrease from 16 to 3% of weighted coverage (hard queries; 35–5% for sampled queries) and no-error searches increase from 2 to 58% weighed coverage (hard; 16–78% sampled). When HOE is reduced by not extending previously found sequences, PSI-BLAST specificity improves 4–8-fold, with little loss in sensitivity.
Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's; retrieval accuracy is now employed by default.
Sequence similarity/database searching is a cornerstone of molecular biology. PairsDB is a database intended to make exploring protein sequences and their similarity relationships quick and easy. Behind PairsDB is a comprehensive collection of protein sequences and BLAST and PSI-BLAST alignments between them. Instead of running BLAST or PSI-BLAST individually on each request, results are retrieved instantaneously from a database of pre-computed alignments. Filtering options allow you to find a set of sequences satisfying a set of criteria—for example, all human proteins with solved structure and without transmembrane segments. PairsDB is continually updated and covers all sequences in Uniprot. The data is stored in a MySQL relational database. Data files will be made available for download at ftp://nic.funet.fi/pub/sci/molbio. PairsDB can also be accessed interactively at http://pairsdb.csc.fi. PairsDB data is a valuable platform to build various downstream automated analysis pipelines. For example, the graph of all-against-all similarity relationships is the starting point for clustering protein families, delineating domains, improving alignment accuracy by consistency measures, and defining orthologous genes. Moreover, query-anchored stacked sequence alignments, profiles and consensus sequences are useful in studies of sequence conservation patterns for clues about possible functional sites.
HHpred is a fast server for remote protein homology detection and structure prediction and is the first to implement pairwise comparison of profile hidden Markov models (HMMs). It allows to search a wide choice of databases, such as the PDB, SCOP, Pfam, SMART, COGs and CDD. It accepts a single query sequence or a multiple alignment as input. Within only a few minutes it returns the search results in a user-friendly format similar to that of PSI-BLAST. Search options include local or global alignment and scoring secondary structure similarity. HHpred can produce pairwise query-template alignments, multiple alignments of the query with a set of templates selected from the search results, as well as 3D structural models that are calculated by the MODELLER software from these alignments. A detailed help facility is available. As a demonstration, we analyze the sequence of SpoVT, a transcriptional regulator from Bacillus subtilis. HHpred can be accessed at .
The aim of this study was to use comparative modeling to predict the three-dimensional structure of the CHAPK protein (cysteine, histidine-dependent amidohydrolase/peptidase domain of the LysK endolysin, derived from bacteriophage K). Iterative PSI-BLAST searches against the Protein Data Bank (PDB) and nonredundant (nr) databases were used to populate a multiple alignment for analysis using the T-Coffee Expresso server. A consensus Maximum Parsimony phylogenetic tree with a bootstrap analysis setting of 1,000 replicates was constructed using MEGA4. Structural templates relevant to our target (CHAPK) were identified, processed in Expresso and used to generate a 3D model in the alignment mode of SWISS-MODEL. These templates were also processed in the I-TASSER web server. A Staphylococcus saprophyticus CHAP domain protein, 2K3A, was identified as the structural template in both servers. The I-TASSER server generated the CHAPK model with the best bond geometries when analyzed using PROCHECK and the most logical organization of the structure. The predicted 3D model indicates that CHAPK has a papain-like fold. Circular dichroism spectropolarimetry also indicated that CHAPK has an αβ fold, which is consistent with the model presented. The putative active site maintained a highly conserved Cys54-His117-Glu134 charge relay and an oxyanion hole residue Asn136. The residue triplet, Cys-His-Glu, is known to be a viable proteolytic triad in which we predict the Cys residue is used in a nucleophilic attack on peptide bonds at a specific site in the pentaglycine cross bridge of staphylococcal cell wall peptidoglycan. Use of comparative modeling has allowed approximation of the 3D structure of CHAPK giving information on the structure and an insight into the binding and active site of the catalytic domain. This may facilitate its development as an alternative antibacterial agent.
bacteriophage; CHAP; endolysin; in silico; peptidase; staphylococcus
SCANPS performs iterative profile searching similar to PSI-BLAST but with full dynamic programing on each cycle and on-the-fly estimation of significance. This combination gives good sensitivity and selectivity that outperforms PSI-BLAST in domain-searching benchmarks. Although computationally expensive, SCANPS exploits onchip parallelism (MMX and SSE2 instructions on Intel chips) as well as MPI parallelism to give acceptable turnround times even for large databases. A web server developed to run SCANPS searches is now available at http://www.compbio.dundee.ac.uk/www-scanps. The server interface allows a range of different protein sequence databases to be searched including the SCOP database of protein domains. The server provides the user with regularly updated versions of the main protein sequence databases and is backed up by significant computing resources which ensure that searches are performed rapidly. For SCOP searches, the results may be viewed in a new tree-based representation that reflects the structure of the SCOP hierarchy; this aids the user in placing each hit in the context of its SCOP classification and understanding its relationship to other domains in SCOP.
The 3D-partner is a web tool to predict interacting partners and binding models of a query protein sequence through structure complexes and a new scoring function. 3D-partner first utilizes IMPALA to identify homologous structures (templates) of a query from a heterodimer profile library. The interacting-partner sequence profiles of these templates are then used to search interacting candidates of the query from protein sequence databases (e.g. SwissProt) by PSI-BLAST. We developed a new scoring function, which includes the contact-residue interacting score (e.g. the steric, hydrogen bonds, and electrostatic interactions) and the template consensus score (e.g. couple-conserved residue and the template similarity scores), to evaluate how well the interfaces between the query and interacting candidates. Based on this scoring function, 3D-partner provides the statistic significance, the binding models (e.g. hydrogen bonds and conserved amino acids) and functional annotations of interacting partners. The correlation between experimental energies and predicted binding affinities of our scoring function is 0.91 on 275 mutated residues from the ASEdb. The average precision of the server is 0.72 on 563 queries and the execution time of this server for a query is ∼15 s on average. These results suggest that the 3D-partner server can be useful in protein-protein interaction predictions and binding model visualizations. The server is available online at: http://3D-partner.life.nctu.edu.tw.
Motivation: The exponential growth of protein sequence databases has
increasingly made the fundamental question of searching for homologs a computational
bottleneck. The amount of unique data, however, is not growing nearly as fast; we can
exploit this fact to greatly accelerate homology search. Acceleration of programs in the
popular PSI/DELTA-BLAST family of tools will not only speed-up homology search directly
but also the huge collection of other current programs that primarily interact with large
protein databases via precisely these tools.
Results: We introduce a suite of homology search tools, powered by
compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than
and comparably accurate with all known state-of-the-art tools, including HHblits,
DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows
direct substitution into existing analysis pipelines. The key idea is that we introduce a
local similarity-based compression scheme that allows us to operate directly on the
compressed data. Importantly, CaBLASTP’s runtime scales almost linearly in the
amount of unique data, as opposed to current BLASTP variants, which scale linearly in the
size of the full protein database being searched. Our compressive algorithms will speed-up
many tasks, such as protein structure prediction and orthology mapping, which rely heavily
on homology search.
Availability: CaBLASTP is available under the GNU Public License at
PSI-BLAST is an iterative program to search a database for proteins
with distant similarity to a query sequence. We investigated over
a dozen modifications to the methods used in PSI-BLAST, with the goal
of improving accuracy in finding true positive matches. To evaluate
performance we used a set of 103 queries for which the true positives
in yeast had been annotated by human experts, and a popular measure
of retrieval accuracy (ROC) that can be normalized to take on values
between 0 (worst) and 1 (best). The modifications we consider novel
improve the ROC score from 0.758 ± 0.005
to 0.895 ± 0.003. This does not include
the benefits from four modifications we included in the ‘baseline’ version,
even though they were not implemented in PSI-BLAST version 2.0.
The improvement in accuracy was confirmed on a small second test
set. This test involved analyzing three protein families with curated
lists of true positives from the non-redundant protein database.
The modification that accounts for the majority of the improvement
is the use, for each database sequence, of a position-specific scoring
system tuned to that sequence’s amino acid composition.
The use of composition-based statistics is particularly beneficial
for large-scale automated applications of PSI-BLAST.
Motivation: Homologous protein families share highly conserved sequence and structure regions that are frequent targets for comparative analysis of related proteins and families. Many protein families, such as the curated domain families in the Conserved Domain Database (CDD), exhibit similar structural cores. To improve accuracy in aligning such protein families, we propose a profile–profile method CORAL that aligns individual core regions as gap-free units.
Results: CORAL computes optimal local alignment of two profiles with heuristics to preserve continuity within core regions. We benchmarked its performance on curated domains in CDD, which have pre-defined core regions, against COMPASS, HHalign and PSI-BLAST, using structure superpositions and comprehensive curator-optimized alignments as standards of truth. CORAL improves alignment accuracy on core regions over general profile methods, returning a balanced score of 0.57 for over 80% of all domain families in CDD, compared with the highest balanced score of 0.45 from other methods. Further, CORAL provides E-values to aid in detecting homologous protein families and, by respecting block boundaries, produces alignments with improved ‘readability’ that facilitate manual refinement.
Availability: CORAL will be included in future versions of the NCBI Cn3D/CDTree software, which can be downloaded at http://www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml.
Supplementary information: Supplementary data are available at Bioinformatics online.
All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding.
We present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database (“NR”), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query.
FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast.
COMPASS is a method for homology detection and local alignment construction based on the comparison of multiple sequence alignments (MSAs). The method derives numerical profiles from given MSAs, constructs local profile-profile alignments and analytically estimates E-values for the detected similarities. Until now, COMPASS was only available for download and local installation. Here, we present a new web server featuring the latest version of COMPASS, which provides (i) increased sensitivity and selectivity of homology detection; (ii) longer, more complete alignments; and (iii) faster computational speed. After submission of the query MSA or single sequence, the server performs searches versus a user-specified database. The server includes detailed and intuitive control of the search parameters. A flexible output format, structured similarly to BLAST and PSI-BLAST, provides an easy way to read and analyze the detected profile similarities. Brief help sections are available for all input parameters and output options, along with detailed documentation. To illustrate the value of this tool for protein structure-functional prediction, we present two examples of detecting distant homologs for uncharacterized protein families. Available at http://prodata.swmed.edu/compass
The MPI Bioinformatics Toolkit is an interactive web service which offers access to a great variety of public and in-house bioinformatics tools. They are grouped into different sections that support sequence searches, multiple alignment, secondary and tertiary structure prediction and classification. Several public tools are offered in customized versions that extend their functionality. For example, PSI-BLAST can be run against regularly updated standard databases, customized user databases or selectable sets of genomes. Another tool, Quick2D, integrates the results of various secondary structure, transmembrane and disorder prediction programs into one view. The Toolkit provides a friendly and intuitive user interface with an online help facility. As a key feature, various tools are interconnected so that the results of one tool can be forwarded to other tools. One could run PSI-BLAST, parse out a multiple alignment of selected hits and send the results to a cluster analysis tool. The Toolkit framework and the tools developed in-house will be packaged and freely available under the GNU Lesser General Public Licence (LGPL). The Toolkit can be accessed at .
Sequence similarity searching is a powerful tool to help develop hypotheses in the quest to assign functional, structural and evolutionary information to DNA and protein sequences. As sequence databases continue to grow exponentially, it becomes increasingly important to repeat searches at frequent intervals, and similarity searches retrieve larger and larger sets of results. New and potentially significant results may be buried in a long list of previously obtained sequence hits from past searches.
ReHAB (Recent Hits Acquired from BLAST) is a tool for finding new protein hits in repeated PSI-BLAST searches. ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. Results are presented in an easily comprehended table, or in a BLAST-like report, using colors to highlight the new hits. ReHAB is designed to handle large numbers of query sequences, such as whole genomes or sets of genomes. Advanced computer skills are not needed to use ReHAB; the graphics interface is simple to use and was designed with the bench biologist in mind.
This software greatly simplifies the problem of evaluating the output of large numbers of protein database searches.
The Blocks Database World Wide Web (http://www.blocks.fhcrc.org ) and Email (email@example.com) servers provide tools for the detection and analysis of protein homology based on alignment blocks representing conserved regions of proteins. During the past year, searching has been augmented by supplementation of the Blocks Database with blocks from the Prints Database, for a total of 4754 blocks from 1163 families. Blocks from both the Blocks and Prints Databases and blocks that are constructed from sequences submitted to Block Maker can be used for blocks-versus-blocks searching of these databases with LAMA, and for viewing logos and bootstrap trees. Sensitive searches of up-to-date protein sequence databanks are carried out via direct links to the MAST server using position-specific scoring matrices and to the BLAST and PSI-BLAST servers using consensus-embedded sequence queries. Utilizing the trypsin family to evaluate performance, we illustrate the superiority of blocks-based tools over expert pairwise searching or Hidden Markov Models.
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between proteins sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-blast, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
computational biology; protein homology; amino acid substitution matrix; protein structure
Annotating the tremendous amount of sequence information being generated requires accurate automated methods for recognizing homology. Although sequence similarity is only one of many indicators of evolutionary homology, it is often the only one used. Here we find that supplementing sequence similarity with information from biomedical literature is successful in increasing the accuracy of homology search results. We modified the PSI-BLAST algorithm to use literature similarity in each iteration of its database search. The modified algorithm is evaluated and compared to standard PSI-BLAST in searching for homologous proteins. The performance of the modified algorithm achieved 32% recall with 95% precision, while the original one achieved 33% recall with 84% precision; the literature similarity requirement preserved the sensitive characteristic of the PSI-BLAST algorithm while improving the precision.
Protein sequence alignments have become indispensable for virtually any evolutionary, structural or functional study involving proteins. Modern sequence search and comparison methods combined with rapidly increasing sequence data often can reliably match even distantly related proteins that share little sequence similarity. However, even highly significant matches generally may have incorrectly aligned regions. Therefore when exact residue correspondence is used to transfer biological information from one aligned sequence to another, it is critical to know which alignment regions are reliable and which may contain alignment errors.
PSI-BLAST-ISS is a standalone Unix-based tool designed to delineate reliable regions of sequence alignments as well as to suggest potential variants in unreliable regions. The region-specific reliability is assessed by producing multiple sequence alignments in different sequence contexts followed by the analysis of the consistency of alignment variants. The PSI-BLAST-ISS output enables the user to simultaneously analyze alignment reliability between query and multiple homologous sequences. In addition, PSI-BLAST-ISS can be used to detect distantly related homologous proteins. The software is freely available at: .
PSI-BLAST-ISS is an effective reliability assessment tool that can be useful in applications such as comparative modelling or analysis of individual sequence regions. It favorably compares with the existing similar software both in the performance and functional features.
Biologists regularly search DNA or protein databases for sequences that share an evolutionary or functional relationship with a given query sequence. Traditional search methods, such as BLAST and PSI-BLAST, focus on detecting statistically significant pairwise sequence alignments and often miss more subtle sequence similarity. Recent work in the machine learning community has shown that exploiting the global structure of the network defined by these pairwise similarities can help detect more remote relationships than a purely local measure.
We review RankProp, a ranking algorithm that exploits the global network structure of similarity relationships among proteins in a database by performing a diffusion operation on a protein similarity network with weighted edges. The original RankProp algorithm is unsupervised. Here, we describe a semi-supervised version of the algorithm that uses labeled examples. Three possible ways of incorporating label information are considered: (i) as a validation set for model selection, (ii) to learn a new network, by choosing which transfer function to use for a given query, and (iii) to estimate edge weights, which measure the probability of inferring structural similarity.
Benchmarked on a human-curated database of protein structures, the original RankProp algorithm provides significant improvement over local network search algorithms such as PSI-BLAST. Furthermore, we show here that labeled data can be used to learn a network without any need for estimating parameters of the transfer function, and that diffusion on this learned network produces better results than the original RankProp algorithm with a fixed network.
In order to gain maximal information from a network, labeled and unlabeled data should be used to extract both local and global structure.