Novel modeling and alignment analysis tools are intended to help in protein structure prediction, which remains the most popular application of the FFAS server. It is noteworthy that structural predictions are increasingly used to aid experimental structure determination. At the same time, adding full proteomes of several organisms as searchable profile databases should help in another, increasingly frequent application of FFAS, i.e. using remote homology to link newly sequenced proteins to better annotated proteins or protein families.
Discovery of new domains in eukaryotic proteins
Dividing proteins into structural domains is relatively straightforward when they can be aligned with homologous proteins of known structure (which are often already parsed into domains in resources such as SCOP). However, the task becomes increasingly difficult as homology becomes weaker. When homology is very weak, remote homology prediction tools such as FFAS are often the only source of complete alignments with known structures that allow determination of domain boundaries.
For prokaryotic proteins without detectable similarity to any known structures or annotated domains, it is often possible to propose putative domain boundaries based on conserved blocks in a multiple sequence alignment of homologous sequences. For eukaryotic proteins, this is usually much more challenging because of the presence of multiple domains and of long regions of structural disorder and low complexity that regularly surround structural domains. These factors frequently cause ‘profile contamination’ (34, 35) that can diminish or bias the sequence conservation ‘signal’ from a structural domain. Besides remote homology detection algorithms, sequence profiles are used in local structure prediction methods such as programs for predicting secondary structure and structural disorder. As a result, ‘profile contamination’ not only interferes with remote homology detection and obscures the conserved blocks corresponding to structural domains, but also introduces noise into secondary structure and disorder predictions. This problem can be alleviated by dividing the sequence of a protein of interest into overlapping fragments and submitting them separately to profile-based prediction servers, such as FFAS, or to secondary structure prediction services. In our experience, it is useful to try at least two sets of such fragments of different lengths (for instance, 500 and 300 amino acids). If a fragment corresponds to a structural domain, it should be possible to predict its secondary structure and sometimes even to detect homology to known protein structures or annotated protein families, which is often impossible when the full protein sequence is used. In the current implementation, we applied this procedure to the proteomes stored on the FFAS server, where all proteins longer than a specific threshold are divided into shorter overlapping fragments.
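The fragmentation step described above can be sketched as follows. This is a minimal illustration only, not the procedure implemented on the FFAS server: the function name `split_into_fragments`, the choice of overlap (half the fragment length) and the handling of the final fragment are assumptions made for the example, since the text specifies only the fragment lengths.

```python
from typing import List, Tuple


def split_into_fragments(sequence: str, length: int, overlap: int) -> List[Tuple[int, str]]:
    """Split a sequence into overlapping fragments of a fixed length.

    Returns (start, fragment) pairs with 0-based start positions.
    The overlap value is an assumption; the source text does not specify one.
    """
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("require length > 0 and 0 <= overlap < length")
    # Short sequences are returned whole, as a single fragment.
    if len(sequence) <= length:
        return [(0, sequence)]

    step = length - overlap
    fragments = []
    start = 0
    while start + length < len(sequence):
        fragments.append((start, sequence[start:start + length]))
        start += step
    # Anchor the last fragment at the C-terminus so that it keeps the full length.
    fragments.append((len(sequence) - length, sequence[-length:]))
    return fragments


if __name__ == "__main__":
    protein = "A" * 1200  # placeholder sequence, for illustration only
    # Two fragment sets of different lengths, as suggested in the text.
    for frag_len in (500, 300):
        frags = split_into_fragments(protein, frag_len, frag_len // 2)
        print(frag_len, [start for start, _ in frags])
```

Each fragment would then be submitted separately to a profile-based server. Anchoring the last fragment at the C-terminus is one possible design choice; it avoids a short trailing fragment at the cost of a larger overlap with the preceding one.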
Detection of internal repeats and alternative alignment variants
Dotplot graphs described in the previous section allow detection of internal repeats in protein sequences and of alternative variants of alignments between two proteins. Profile–profile dotplot graphs are expected to be more sensitive than traditional sequence–sequence graphs. However, as is the case with all profile-based methods, they may be prone to profile contamination. For this reason, dotplot analysis of repeats should accompany a full analysis of the protein, including splitting its sequence into (predicted) structural domains. Detection of internal repeats should then be performed again for the individual domains to check whether the results remain consistent.
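To illustrate how a dotplot reveals internal repeats, the sketch below compares a sequence with itself using a sliding window. It is a deliberately simplified sequence–sequence analogue, not the profile–profile dotplot computed by FFAS; the function names, the window size and the identity threshold are all assumptions chosen for the example.

```python
from collections import Counter
from typing import List, Tuple


def self_dotplot(seq: str, window: int = 5, min_matches: int = 5) -> List[Tuple[int, int]]:
    """Return off-diagonal dots (i, j), i < j, of a self-comparison.

    A dot is placed when the windows starting at i and j share at least
    `min_matches` identical positions (a simple identity score; FFAS would
    compare profile columns instead).
    """
    n_windows = len(seq) - window + 1
    dots = []
    for i in range(n_windows):
        for j in range(i + 1, n_windows):
            matches = sum(1 for k in range(window) if seq[i + k] == seq[j + k])
            if matches >= min_matches:
                dots.append((i, j))
    return dots


def repeat_offsets(dots: List[Tuple[int, int]]) -> Counter:
    """Count dots per diagonal; a well-populated diagonal suggests a repeat."""
    return Counter(j - i for i, j in dots)


if __name__ == "__main__":
    # Toy sequence: a 10-residue unit repeated after a 4-residue linker.
    toy = "ACDEFGHIKL" + "WWWW" + "ACDEFGHIKL"
    print(repeat_offsets(self_dotplot(toy)).most_common(1))
```

Running the same comparison on the full sequence and again on each predicted domain gives a simple way to check whether a repeat signal remains consistent.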
Aiding protein crystallography
Protein crystallization remains the main bottleneck in structure determination by X-ray crystallography, and remote homology detection by servers such as FFAS can address at least two aspects of this problem. Our participation in a structural genomics center gives us a unique opportunity to test these applications of FFAS on real-life examples, but we would like to note that other accurate alignment methods can also be used for these purposes.
Construct design. Protein crystallization often depends on the design of a proper crystallization construct (36), that is, a fragment of a protein sequence that corresponds to one or more structural domains. While prokaryotic proteins can routinely be crystallized in full length, eukaryotic proteins usually require nontrivial construct design. The problem of construct design is directly related to the problem of detecting structural domains described above. Alignment with a known structure is a potential source of information about optimal construct boundaries, especially if a protein region is aligned with a complete protein structure or a complete domain. It is important to note that protein sequences longer than 500 amino acids should be split into putative domains before they are submitted to FFAS. Thus, construct design with FFAS is often an iterative process in which approximate domain boundaries are improved in subsequent searches. FFAS predictions are extensively used to design protein constructs at the Joint Center for Structural Genomics, and the first structures based on these constructs have already been solved.
Prediction of exposed residues for surface engineering. It is known that sidechains involved in contacts between different protein molecules in the crystal have a significant impact on a protein's ability to crystallize, and that site-directed mutagenesis of these residues can significantly improve the likelihood of crystallization (37). Candidate residues for such mutations can be proposed by the SER method (4). The application of SER is greatly facilitated if it is known which high-entropy sidechains are exposed to the solvent. Information about solvent exposure can be derived from 3D models of proteins. By detecting remote homology to known structures, FFAS makes such models available for more proteins and may thus reduce the number of mutations that need to be tested.
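The filtering idea in this paragraph can be sketched as follows, assuming that per-residue relative solvent accessibility has already been computed from a 3D model by an external tool. The function name `ser_candidates`, the residue set (Lys, Glu and Gln, the large flexible residues commonly targeted in SER) and the exposure cutoff of 0.4 are assumptions made for illustration and are not taken from the FFAS server.

```python
from typing import List, Sequence, Tuple


def ser_candidates(
    sequence: str,
    rel_accessibility: Sequence[float],
    high_entropy: str = "KEQ",    # assumed set of high-entropy residues
    min_exposure: float = 0.4,    # assumed cutoff on relative accessibility
) -> List[Tuple[int, str]]:
    """Return (1-based position, residue) for exposed high-entropy residues."""
    if len(sequence) != len(rel_accessibility):
        raise ValueError("sequence and accessibility must have the same length")
    return [
        (pos, aa)
        for pos, (aa, acc) in enumerate(zip(sequence, rel_accessibility), start=1)
        if aa in high_entropy and acc >= min_exposure
    ]


if __name__ == "__main__":
    # Toy input; accessibility values are invented for illustration only.
    seq = "MKEQLKA"
    acc = [0.1, 0.8, 0.2, 0.5, 0.0, 0.9, 0.3]
    print(ser_candidates(seq, acc))
```

In this toy input the buried glutamate at position 3 is filtered out, which shows how model-derived exposure can shorten the list of mutations to test.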
Modeling for MR. Solving the phase problem remains a bottleneck in X-ray crystallography of proteins. The MR method addresses this problem by calculating phase information from a predicted 3D model, and its success strongly depends on the accuracy of this model. By finding modeling templates for proteins without close similarity to known structures, FFAS extends the applicability of MR. For instance, over 70 protein structures have been solved at the Joint Center for Structural Genomics using models based on FFAS alignments, including 17 with <30% sequence identity to their modeling templates (31). Strategies for MR phasing with FFAS models have been described in detail by our group previously (31, 38).