The Fold and Function Assignment System (FFAS) server [Jaroszewski et al. (2005) FFAS03: a server for profile–profile sequence alignments. Nucleic Acids Research, 33, W284–W288] implements the algorithm for protein profile–profile alignment introduced originally in [Rychlewski et al. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science: a Publication of the Protein Society, 9, 232–241]. Here, we present updates, changes and novel functionality added to the server since 2005 and discuss its new applications. The sequence database used to calculate sequence profiles was enriched by adding sets of publicly available metagenomic sequences. The profile of a user’s protein can now be compared with ~20 additional profile databases, including several complete proteomes, human proteins involved in genetic diseases and a database of microbial virulence factors. A newly developed interface uses a system of tabs, allowing the user to navigate multiple results pages, and also includes novel functionality, such as a dotplot graph viewer, modeling tools, an improved 3D alignment viewer and links to the database of structural similarities. The FFAS server was also optimized for speed: running times were reduced by an order of magnitude. The FFAS server, http://ffas.godziklab.org, has no log-in requirement, albeit there is an option to register and store results in individual, password-protected directories. Source code and Linux executables for the FFAS program are available for download from the FFAS server.
The original publication about the Fold and Function Assignment System (FFAS) server (1) introduced the server and suggested optimal strategies for using it for challenging cases of remote homology and protein structure prediction. The FFAS algorithm was described in 2000 (2), and subsequent improvements were described in 2005 (1). Here we review tools and data added to the server and discuss several new applications of FFAS.
Methods for detecting remote homology are most often used to predict protein structures. Three-dimensional (3D) models of protein structures allow identification of functionally relevant residues and, thus, enable applications such as planning of mutagenesis experiments or computational docking of ligand molecules. Alignments between the protein of interest and proteins with known structures make it possible to identify structural domains in multidomain proteins (3), helping design constructs for X-ray crystallography and identify surface residues that may be modified to increase the likelihood of crystallization by the method of surface entropy reduction (SER) (4).
However, detection of remote homology may be a very valuable source of information, even if it does not link the protein of interest to any known structure (5). For instance, the homology between the protein of interest and a functionally annotated protein or protein family often provides a hypothesis about a protein’s function and helps in the planning of experiments. This application of FFAS is becoming more relevant with the rapid growth of protein sequence databases fueled by continued improvements of DNA sequencing techniques, which are increasingly used to probe novel, previously never studied regions of the protein universe (6–8). Recent analyses suggest that despite their novelty, these regions are dominated by very divergent members of known protein families rather than completely new ones (9).
FFAS is regularly assessed in CASP (10) competitions and continually benchmarked in the LIVEBENCH (11) experiment. In the last available LIVEBENCH evaluation, FFAS ranked between second and fourth among all sequence-based methods (see http://meta.bioinfo.pl/results.pl?comp_name=livebench-2009.2). In addition, FFAS is continuously tested on pairs of proteins of the same fold but from different superfamilies [based on the SCOP (12) database]. The current version of the FFAS algorithm was optimized in 2003 using SCOP v.1.65 and retested in 2009 on representatives of superfamilies that were added to the PDB later and, thus, not used in any training set. The results of this test confirm that FFAS detects more than twice as many cases of extremely remote homology as PSI-Blast (13) (14% and 5% of pairs, respectively). Detailed results of this benchmark are included in the server’s documentation, available online.
The sensitivity of profile–profile comparison is now widely recognized, and many Web servers implementing such algorithms are available, including HHPRED (14), COMPASS (15), COMA (16), PHYRE (17), GenThreader (18), FORTE (19) and webPRC (20). A comprehensive review and comparison of these servers and methods is beyond the scope of this publication. Based on our experience, the strengths of FFAS in comparison to other servers include: speed, the large number of profile databases available for searches, password-protected lists of users’ results, the option of processing multiple sequences (from registered accounts), lists of precalculated results, dotplot analysis of local similarities in two profiles, and, last but not least, the longevity and stability of the server, which has been in continuous use for over 10 years now.
The original FFAS server was designed to answer a specific question: ‘Is my protein homologous (and thus structurally similar) to any protein with an already known structure?’ We found that many users are interested in related, but more general, questions, such as: ‘Does organism A contain a (putative) member of protein family B?’ or ‘What percentage of proteins in organism A have detectable homology to known structures or annotated families?’ To make answering such questions possible, we added databases of profiles for complete proteomes to the FFAS server (Table 1). In addition to direct searches of profile databases with the FFAS algorithm, a user may search the precalculated FFAS results of comparisons between these proteomes and selected databases of profiles such as PDB (21), SCOP (12), Pfam (22) and COG (23).
The FFAS server returns a single, local–local alignment for each pair of compared sequences, represented by their profiles. Dotplot graphs allow a visual inspection of the entire landscape of similarity between the two proteins being compared, allowing a user to identify regions of similarity not included in the reported alignment, such as repeats and domains that are present in more than one copy. Dotplots also make it possible to assess the relative reliability (stability) of different sections of the alignment. An element (M, N) of the similarity matrix used in dynamic programming is the profile–profile similarity score of position M in the first sequence and position N in the second sequence. Visualization of this matrix as an M by N heat map with a color scale ranging from blue (the highest similarity between M and N) to red (the lowest similarity) is available on the ‘align 2 sequences and dot plot’ tab of the FFAS server.
The interface allows modification of the averaging window used in preparation of dotplot graphs. The averaging radius of 0 corresponds to the visualization of the original profile–profile similarity matrix used to calculate the FFAS alignment; using non-zero values often enhances regions of local similarity. An optimal alignment returned by FFAS can also be displayed on the graph as a series of diagonal lines. This feature can be used to determine whether there are any regions of similarity between two proteins that are not included in the standard alignment [See example in Figure 1A. The presence of regions of high similarity (diagonal blue lines) not overlapping with actual alignments (series of green lines) often indicates the presence of a sequence repeat or duplicated domain].
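The effect of the averaging radius can be illustrated with a minimal sketch. It assumes that averaging is done along diagonals of the similarity matrix, which is a common choice for dotplots but is not confirmed by the text as the method used on the server; the function name `smooth_dotplot` is hypothetical.

```python
def smooth_dotplot(scores, radius):
    """Average each cell with its neighbours on the same diagonal.

    scores: list of rows (M x N) of profile-profile similarity scores.
    radius: averaging radius; 0 reproduces the original matrix.
    Assumption: diagonal averaging; the server may smooth differently.
    """
    m = len(scores)
    n = len(scores[0]) if m else 0
    smoothed = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            total, count = 0.0, 0
            # Walk along the diagonal through (i, j), staying inside the matrix.
            for k in range(-radius, radius + 1):
                a, b = i + k, j + k
                if 0 <= a < m and 0 <= b < n:
                    total += scores[a][b]
                    count += 1
            smoothed[i][j] = total / count
    return smoothed
```

With a radius of 0 each cell is averaged only with itself, so the original matrix is returned unchanged. Larger radii reinforce runs of high scores along a diagonal and damp isolated high-scoring cells, which is why non-zero values tend to make locally similar regions stand out.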
The FFAS server provides links to the ProtMod modeling server, which allows building 3D protein models with the SCWRL (30) algorithm. The modeling job on the ProtMod server can be launched via model links, displayed next to the alignments with templates from the PDB and the SCOP databases. Clicking on such a link sends the alignment between the query and the modeling template to the ProtMod server. On the ProtMod input page, a user can select the model type and the modeling program that will be used. Two model types are available: all-atom models, in which all sidechains of a modeling template are replaced according to the FFAS alignment, and ‘mixed models’ with truncated residue sidechains. ‘Mixed’ models are intended to be used in phasing of X-ray crystallography data by molecular replacement (MR), especially in cases in which a modeling template is only remotely homologous to the protein of interest (query) (31).
In FFAS searches against the SCOP database, a user can easily check the consistency of structural predictions by comparing SCOP classification codes of predicted homologs. Usually, all SCOP domains aligned with a specific region of a query protein belong to the same fold. If this is not the case (SCOP domains aligned with a specific query region belong to two or more different folds), it often indicates possible problems with the prediction. However, some SCOP folds share partial structural similarity and, thus, the fact that they both appear on the list of FFAS hits for the same protein does not necessarily indicate inconsistencies in the prediction. We addressed this issue by providing the results of the FATCAT structural alignment program (32), which are displayed next to the alignments with templates from the SCOP database (see example in Figure 1B).
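The consistency check described above can be sketched in a few lines. The sketch assumes hits are available as (query start, query end, SCOP code) tuples, which is a hypothetical representation and not the server's output format; it relies only on the fact that the first two fields of a SCOP classification code (for example, `c.37` in `c.37.1.8`) identify the class and fold.

```python
def scop_fold(sccs):
    """Return the class.fold prefix of a SCOP code such as 'c.37.1.8'."""
    return ".".join(sccs.split(".")[:2])


def folds_in_region(hits, start, end):
    """Collect the folds of all hits overlapping query positions start..end.

    hits: iterable of (query_start, query_end, sccs) tuples (hypothetical
    format). More than one fold in the result suggests the prediction for
    this region deserves a closer look.
    """
    return {
        scop_fold(sccs)
        for q_start, q_end, sccs in hits
        if q_start <= end and q_end >= start
    }
```

A region for which `folds_in_region` returns a single fold is consistent in the sense used above. Two or more folds are a prompt to inspect the structural similarity between the templates rather than proof of an error, because some folds are partially similar.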
The alignment viewer available via ali links displayed by individual hits on the FFAS results page (Figure 1C) allows quick visualization of a query–template alignment and ‘projects’ the alignment onto the template structure if the structure is available (for comparisons to the PDB and SCOP database) using a Jmol (33) viewer plug-in. The pairwise alignment viewer was expanded to allow quick identification of pairs of aligned residues in the alignment and in the 3D structure. By clicking on any of the residues in the 3D structural view or on the alignment, a user can highlight residues in the alignment and, at the same time, label these residues in the 3D view (Figure 1C).
The increase in the number and size of databases of profiles used by the FFAS server made it necessary to increase the program’s execution speed. This was achieved by several technical improvements: introduction of a binary format of profile databases (speeding up loading of the databases), parallelization and optimization of the FFAS program using options provided by the Intel(R) Fortran Compiler, and installation of the FFAS server on a dedicated 12-node Linux cluster using dual quad-core CPUs per node. The combined effect of these updates (with the largest impact from parallelization enabled by a new generation of multi-core CPUs) was a reduction of execution times by an order of magnitude, despite significant increase of both the size and the number of the annotation databases. The source code of all programs included in the FFAS suite and accompanying Perl scripts and Linux executables are now available for download from the FFAS server (‘Download’ tab).
Adding more searchable databases and tools to the server required a significant reorganization of the FFAS server’s interface, which is now displayed in a ‘tab’ view. Server output shows a ‘master–slave’ alignment of sequences represented in a database of profiles with the query sequence. (In a master–slave format, gaps in the query sequence are omitted.) Individual query–template alignments can be displayed by clicking ali links on the results page. The ProtMod modeling tool is available via model links. A user can also display FFAS results for each template profile by clicking follow links. The follow feature often allows detection of very remote similarities by finding a protein or protein domain that is similar to both the query and the template. However, one has to make sure that the same region of an ‘intermediate’ protein domain is aligned to both proteins.
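The caveat about the ‘intermediate’ protein can be checked with simple interval arithmetic. The sketch below is an illustration under assumptions: regions are given as inclusive (start, end) residue ranges on the intermediate protein, and the function name and any cutoff applied to its result are hypothetical rather than part of FFAS.

```python
def overlap_fraction(region_a, region_b):
    """Fraction of the shorter region that is shared by both regions.

    region_a, region_b: inclusive (start, end) ranges on the intermediate
    protein, e.g. the part aligned to the query and the part aligned to
    the template. Returns a value between 0.0 and 1.0.
    """
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    shared = max(0, end - start + 1)
    shorter = min(region_a[1] - region_a[0] + 1,
                  region_b[1] - region_b[0] + 1)
    return shared / shorter
```

A value near 1.0 means that the query and the template align to essentially the same part of the intermediate, which supports a transitive link. A value of 0.0 means they align to different regions, for example to different domains of a multidomain intermediate, and the inferred similarity should not be trusted.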
Novel modeling and alignment analysis tools are intended to help in protein structure prediction, which remains the most popular application of the FFAS server. It is noteworthy that structural predictions are increasingly used to aid experimental structure determination. At the same time, adding full proteomes of several organisms as searchable profile databases should help in another, increasingly frequent application of FFAS, i.e. using remote homology to link newly sequenced proteins to better annotated proteins or protein families.
Dividing proteins into structural domains is a relatively straightforward task if it is possible to align them with homologous proteins of known structures (which are often already parsed into domains in resources such as SCOP). However, this task becomes increasingly difficult when homology is very weak. In such cases, remote homology prediction tools such as FFAS are often the only source of complete alignments with known structures that allow determination of domain boundaries.
For prokaryotic proteins without detectable similarity to any known structures or annotated domains, it is often possible to propose putative domain boundaries based on conserved blocks in multiple sequence alignments of homologous sequences. For eukaryotic proteins, this is usually much more challenging because of the presence of multiple domains and long regions of structural disorder and low complexity that regularly surround structural domains. These factors frequently cause ‘profile contamination’ (34,35) that can diminish or bias a sequence conservation ‘signal’ from a structural domain. Besides remote homology detection algorithms, sequence profiles are used in local structure prediction methods such as programs for predicting secondary structure and structural disorder. As a result, ‘profile contamination’ not only interferes with remote homology detection and makes it impossible to notice conserved blocks corresponding to structural domains, but also introduces noise into secondary structure and disorder predictions. This problem can be alleviated by dividing the sequence of a protein of interest into overlapping fragments and submitting them separately to profile-based prediction servers, such as FFAS, or secondary structure services. In our experience, it is useful to try at least two different sets of such fragments of different lengths (for instance, 500 and 300 amino acids). If any such fragment corresponds to a structural domain, it should be possible to predict its secondary structure and sometimes even detect homology to known protein structures or annotated protein families, which is often impossible when a full protein sequence is used. In the current implementation, we applied this procedure to proteomes stored on the FFAS server, where all proteins longer than a specific threshold are divided into shorter overlapping fragments (Table 1).
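The fragment-splitting procedure can be sketched as follows. This is an illustration only, not the code used on the FFAS server: the function name is hypothetical, and the amount of overlap is an assumption, since the text gives only example fragment lengths.

```python
def split_into_fragments(sequence, length, overlap):
    """Split a protein sequence into overlapping fragments.

    Returns a list of (start, end, fragment) tuples with 1-based,
    inclusive coordinates. The overlap value is an assumption; the
    text specifies only example lengths such as 500 and 300.
    """
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    if not sequence:
        return []
    step = length - overlap
    fragments = []
    start = 0
    while True:
        end = min(start + length, len(sequence))
        fragments.append((start + 1, end, sequence[start:end]))
        if end == len(sequence):
            break  # the last fragment reaches the C-terminus
        start += step
    return fragments


# Two fragment sets of different lengths, as suggested in the text.
protein = "M" * 1200  # placeholder sequence for illustration
long_set = split_into_fragments(protein, length=500, overlap=250)
short_set = split_into_fragments(protein, length=300, overlap=150)
```

Each fragment would then be submitted separately. Keeping the original coordinates with every fragment makes it straightforward to map hits and predicted secondary structure back onto the full-length protein.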
Dotplot graphs described in the previous section allow detection of internal repeats in protein sequences and alternative variants of alignments between two proteins. Profile–profile dotplot graphs are expected to be more sensitive than traditional sequence–sequence graphs. However, as is the case with all profile-based methods, they may be prone to profile contamination. Because of this, dotplot analysis of repeats should be done in parallel with a full analysis of a protein and splitting a protein sequence into (predicted) structural domains. Then, detection of internal repeats should be performed again for individual domains to see whether results remain consistent.
Protein crystallization remains the main bottleneck in structure determination by X-ray crystallography, and remote homology detection by servers such as FFAS can address at least two aspects of this problem. Our participation in a structural genomics center gives us a unique opportunity to test these applications of FFAS on real-life examples, but we would like to note that other accurate alignment methods can also be used for these purposes.
Protein crystallization often depends on the design of a proper crystallization construct (36), that is, a fragment of a protein sequence that corresponds to one or more structural domains. While prokaryotic proteins can routinely be crystallized in full length, eukaryotic proteins usually require nontrivial construct design. The problem of construct design is directly related to the problem of detecting structural domains described above. Alignment with a known structure is a potential source of information about optimal construct boundaries, especially if a protein region is aligned with a complete protein structure or a complete domain. It is important to note that protein sequences longer than 500 amino acids should be split into putative domains before submitting them to FFAS. Thus, construct design with FFAS is often an iterative process in which approximate domain boundaries are improved in subsequent searches. FFAS predictions are extensively used to design protein constructs at the Joint Center for Structural Genomics, and the first structures based on these constructs have already been solved.
It is known that sidechains involved in contacts between different protein molecules in the crystal have a significant impact on a protein’s ability to crystallize, and by performing site-directed mutagenesis of these residues, one can significantly improve the likelihood of crystallization (37). The candidate residues for such mutations can be proposed by the SER method (4). The application of SER is greatly facilitated if it is known which high-entropy sidechains are exposed to the solvent. Information about solvent exposure can be derived from 3D models of proteins, and by detecting remote homology to known structures, FFAS may reduce the number of mutations that need to be tested.
Solving the phase problem remains a bottleneck in X-ray crystallography of proteins. The MR method addresses this problem by calculating phase information from a predicted 3D model. The success of MR strongly depends on the accuracy of this model. By finding modeling templates for proteins without close similarity to known structures, FFAS extends the applicability of MR. For instance, over 70 protein structures have been solved at the Joint Center for Structural Genomics using models based on FFAS alignments, including 17 with <30% sequence identity to their modeling templates (31). Strategies for MR phasing with FFAS models have been described in detail by our group previously (31,38).
The maintenance and development of the FFAS server is funded by the National Institutes of Health (grant GM087218). Funding for open access charge: National Institutes of Health.
Conflict of interest statement. None declared.
The authors would like to thank all members of Godzik’s Lab and the JCSG team for useful comments and extensive testing of the server.