Detecting highly conserved positions in protein and nucleic acid sequences and structures is informative, since such positions often indicate structural and/or functional importance. ConSurf (http://consurf.tau.ac.il) and ConSeq (http://conseq.tau.ac.il) are two well-established web servers for calculating the evolutionary conservation of amino acid positions in proteins using empirical Bayesian inference, starting from protein structure and sequence, respectively. Here, we present the new version of the ConSurf web server, which combines the two independent servers and provides an easier and more intuitive step-by-step interface while offering the user more flexibility during the process. In addition, the new version of ConSurf calculates the evolutionary rates for nucleic acid sequences. The new version is freely available at: http://consurf.tau.ac.il/.
The degree to which an amino (or nucleic) acid position is evolutionarily conserved is strongly dependent on its structural and functional importance. Thus, conservation analysis of positions among members of the same family can often reveal the importance of each position for the structure or function of the protein (or nucleic acid). ConSurf (1,2) and ConSeq (3) are web servers for calculating the evolutionary rate of each position of the protein and for identifying structurally and functionally important regions within proteins. The degree of conservation of each position is the inverse of the site’s evolutionary rate; rapidly evolving positions are variable while slowly evolving positions are conserved. In ConSurf, the evolutionary rate is estimated based on the evolutionary relatedness between the protein and its homologues, taking into account the similarity between amino acids as reflected in the substitution matrix (4,5). One of the advantages of ConSurf in comparison to other methods is the accurate computation of the evolutionary rate by using either an empirical Bayesian method or a maximum likelihood (ML) method (5). The differences between the two methods are explained in detail in reference (4). The strength of these methods is that they explicitly account for the stochastic process underlying the evolution of the analyzed sequences, and that they rely on the phylogeny of the sequences. Thus, they can correctly discriminate between conservation due to short evolutionary time and genuine sequence conservation. In addition, the Bayesian-based method provides reliability estimates for the site-specific conservation scores.
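The inverse relationship between evolutionary rate and conservation can be illustrated with a minimal sketch. It assumes that per-site rates have already been estimated by some method. The rates below are hypothetical values, and the function names `standardize` and `most_conserved` are illustrative only; they are not part of ConSurf or Rate4Site, and the standardization shown is a convenience for comparison rather than the servers' own scoring.

```python
from statistics import mean, pstdev


def standardize(rates):
    """Return z-scores (mean 0, standard deviation 1) for per-site rates.

    Low (negative) values mark slowly evolving, i.e. conserved, sites.
    """
    mu = mean(rates)
    sigma = pstdev(rates)
    if sigma == 0:
        raise ValueError("all sites have the same rate; nothing to rank")
    return [(r - mu) / sigma for r in rates]


def most_conserved(rates, k):
    """Return 0-based indices of the k slowest-evolving sites, slowest first."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    return order[:k]


# Hypothetical per-site evolutionary rates (assumption, not real output).
example_rates = [0.1, 2.3, 0.4, 1.7, 0.05, 0.9]
scores = standardize(example_rates)
top_two = most_conserved(example_rates, 2)
```

In this toy input, the sites with the lowest rates receive the lowest standardized scores and are therefore reported as the most conserved.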
A short description of the methodology is provided below. More detailed description is available at http://consurf.tau.ac.il/, under ‘OVERVIEW’, ‘QUICK HELP’ and ‘FAQ’.
A flowchart of the ConSurf web server is shown in Figure 1 and detailed below.
If a protein 3D structure is provided:
For all cases, ConSurf creates the following outputs:
For proteins in which the 3D structure was not provided by the user, an up-to-date version of the Protein Data Bank (13) is searched for relevant homologues. If a structure of at least one homologous protein is available, the user may map the conservation scores on the structure. This option should ease the procedure for the non-expert users, who may be unfamiliar with the 3D structure homologue. This option can also be useful for analyzing proteins that share the same sequence but differ in their 3D structure (for example, two structures solved in different conformations or with different ligands).
As an example, we provide the main output of a ConSurf run for the N-terminal region of the GAL4 transcription factor in yeast (PDB ID: 3COQ, chains A and B) in complex with its DNA recognition site (Figure 2). The analysis revealed, as expected, that the functional regions of this protein are highly conserved. For example, all the cysteines that form the Zn(2)-C6 DNA binding domain (CYS11, CYS14, CYS21, CYS28, CYS31, CYS38; 14) were assigned the highest conservation scores. Likewise, PRO26, which is known to be central for DNA binding (15), is also highly conserved according to our analysis. In addition, other amino acid residues that are in contact with the DNA (i.e. GLN9, LYS17, LYS18, LYS20, ARG15, LYS23; 16) are relatively conserved.
ConSurf was also applied to nucleic acid sequences from yeast, which are the known binding sites of GAL4 and their adjacent neighborhood (Figure 2). As anticipated, the analysis revealed that the consensus pattern CGG-N11-CCG, typical of GAL4 binding sites, is highly conserved. An extended full ConSurf analysis of this example is available in the ‘GALLERY’ section on the ConSurf web site.
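To make the CGG-N11-CCG consensus concrete, the following sketch scans a DNA string for the pattern with a regular expression. It is an illustration only and not part of the ConSurf pipeline; the function name `find_gal4_sites` and the example sequence are made up for this purpose.

```python
import re

# CGG, then any 11 bases, then CCG. The lookahead lets overlapping
# candidate sites be reported as well.
GAL4_SITE = re.compile(r"(?=(CGG[ACGT]{11}CCG))")


def find_gal4_sites(seq):
    """Return (0-based start, matched site) for every CGG-N11-CCG occurrence."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in GAL4_SITE.finditer(seq)]


# Made-up sequence containing one 17-bp site starting at index 3.
example = "TTA" + "CGG" + "AGGACTGTCCT" + "CCG" + "AAT"
sites = find_gal4_sites(example)
```

Because the reverse complement of CGG-N11-CCG is again CGG-N11-CCG, scanning one strand is sufficient to find sites on both.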
Despite increasing interest in the non-coding fraction of transcriptomes, the number, the level of conservation, and functions, if any, of many non-protein-coding transcripts remain to be discovered. However, it has already been shown that many of the non-coding sequences are connected to regulatory processes. The new version of ConSurf offers estimations of the evolutionary rate for each position of nucleic acid sequences in the same manner used for amino acid residues. For that purpose, four evolutionary models were implemented in the Rate4Site program: (i) the Jukes and Cantor 69 model (JC69), which assumes equal base frequencies and equal substitution rates (17); (ii) the Tamura 92 model, which uses only one parameter that captures variation in G-C content (18); (iii) the HKY85 model, which distinguishes between transitions and transversions and allows unequal base frequencies (19); and (iv) the General Time Reversible (GTR) model, which is the most general time-reversible model. The GTR parameters consist of an equilibrium base frequency vector, giving the frequency at which each base occurs at each site, and the rate matrix (20). When enough data (i.e. sequences) are available, the GTR model is superior to the simpler Tamura 92 model. However, the Tamura 92 model is recommended in cases in which the data are not sufficient for reliable estimation of the model parameters, and thus it is the default option for analyzing nucleic acid sequences in ConSurf.
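The difference between the simplest and a richer nucleotide model can be sketched in a few lines. The code below builds an unscaled HKY85 rate matrix from base frequencies and a transition/transversion ratio, and gives the closed-form JC69 substitution probabilities. It follows the textbook definitions of these models and is not taken from the Rate4Site source; the function names, the parameter name `kappa` and the example frequencies are assumptions made for illustration.

```python
import math

BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)


def hky85_rate_matrix(freqs, kappa):
    """Unscaled HKY85 rate matrix as a dict of dicts.

    freqs: base -> equilibrium frequency (should sum to 1).
    kappa: transition/transversion rate ratio.
    Off-diagonal q[a][b] is freqs[b], times kappa for transitions;
    each diagonal entry makes its row sum to zero.
    """
    q = {}
    for a in BASES:
        row = {}
        for b in BASES:
            if a != b:
                row[b] = freqs[b] * (kappa if is_transition(a, b) else 1.0)
        row[a] = -sum(row.values())
        q[a] = row
    return q


def jc69_prob(same, d):
    """JC69 probability of observing the same (or one specific different)
    base after a branch of length d expected substitutions per site."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


# Hypothetical GC-poor base frequencies and ratio (assumptions).
example_freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q = hky85_rate_matrix(example_freqs, kappa=2.0)
```

Setting `kappa` to 1 with equal base frequencies makes all off-diagonal rates identical, which is the JC69 assumption of equal base frequencies and equal substitution rates.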
The LG substitution matrix, which incorporates variability of evolutionary rates across sites in the matrix estimation, was shown to outperform other substitution matrices for proteins (21). The LG matrix was added to Rate4Site and is offered in the new version of ConSurf in addition to the previous substitution models: JTT (22), Dayhoff (23), WAG (24), mtREV (25) and cpREV (26).
The accuracy of conservation scores is directly influenced by the amount and quality of sequence data available in the MSA and by the relatedness between the homologous sequences themselves and the sequence of interest. For example, using homologous sequences with different functions might blur the signal. One of the important changes in the new version of ConSurf is the addition of a clear and intuitive interface that helps control which of the sequences are included in the analysis. These improvements include:
In this new version of ConSurf, we put great emphasis on the user interface. ConSurf now presents an easier and more intuitive step-by-step interface, while still offering the user great flexibility during the process as described above. Each step is accompanied by built-in detailed help.
The new version of the ConSurf web server runs on a Linux cluster of 2.6 GHz AMD Opteron processors, equipped with 4 GB RAM per quad-core node. The server runs with up-to-date versions of the supported MSA programs and regularly updated databases. Running time depends on the dataset size (number and length of sequences) and the server load. The ConSurf server is implemented in PHP and Perl using the support of BioPerl modules (38). Rate4Site is implemented in C++ (4). For proteins with an available 3D structure, the conservation scores are projected on the structure and visualized using version 1.44 of FirstGlance in Jmol.
ConSurf and ConSeq have an established reputation in the identification of functional regions in proteins using evolutionary information. In addition, these methods are a focal point that facilitates the development of more useful tools in our group and in other groups. For example, they are the basis for the development of the PatchFinder tool for the automatic detection of clusters of highly conserved amino acids (39), and the detection of DNA-binding proteins (40). Along with the massive growth of sequence and structure databases we believe that this new version of the ConSurf server will be highly useful to a growing number of molecular biology researchers and allow them to perform complex analyses using sophisticated algorithms accurately, easily and comprehensively.
BLOOMNET ERA-PG; Israeli Science Foundation (878/09 to T.P.). Funding for open access charge: BLOOMNET ERA-PG.
Conflict of interest statement. None declared.
The authors are grateful to Nimrod Rubinstein, Adi Doron-Faigenboim, Eyal Privman, Itay Mayrose, Fabian Glaser, Maya Schushan, Guy Nimrod, Ofir Goldenberg, Yana Gofman, Uri Zonens, Gilad Wainreb and Matan Kalman for technical help, useful comments and helpful discussions.