|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: QSD RBH. Performed the experiments: QSD CHW SML RBH. Analyzed the data: QSD CHW SML RBH. Wrote the paper: QSD RBH.
It has been widely recognized that the mutations at specific directions are caused by the functional constraints in protein family and the directional mutations at certain positions control the evolutionary direction of the protein family. The mutations at different positions, even distantly separated, are mutually coupled and form an evolutionary network. Finding the controlling mutative positions and the mutative network among residues are firstly important for protein rational design and enzyme engineering.
A computational approach, namely amino acid position conservation-mutation correlation analysis (CMCA), is developed to predict mutually mutative positions and find the evolutionary network in protein family. The amino acid position mutative function, which is the foundational equation of CMCA measuring the mutation of a residue at a position, is derived from the MSA (multiple structure alignment) database of protein evolutionary family. Then the position conservation correlation matrix and position mutation correlation matrix is constructed from the amino acid position mutative equation. Unlike traditional SCA (statistical coupling analysis) approach, which is based on the statistical analysis of position conservations, the CMCA focuses on the correlation analysis of position mutations.
As an example the CMCA approach is used to study the PDZ domain of protein family, and the results well illustrate the distantly allosteric mechanism in PDZ protein family, and find the functional mutative network among residues. We expect that the CMCA approach may find applications in protein engineering study, and suggest new strategy to improve bioactivities and physicochemical properties of enzymes.
Coevolution is a well known phenomenon in biological world. However, coevolution in families of proteins and genes is still an open topic . Conservation and mutation are two opposite aspects of functional evolution of protein family. It is commonly accepted that the evolution of a protein family is the result of large-scale random mutagenesis, with selection constraints imposed by their biological functions. In the studies of statistical analysis for protein evolutionary family the following two basic hypotheses were recognized widely, which were derived from the empirical observation of sequence evolution . (i) The lack of evolutionary constraint at one position should cause the distribution of observed amino acids at that position in the MSA (multiple structure alignment) to approach their mean abundance in all proteins, and deviances from the mean values should quantitatively represent conservation. (ii) The functional coupling of two positions, even if distantly positioned in the structure, should mutually constrain evolution at the two positions, and these should be represented in the statistical coupling of the underlying amino acid distributions , , .
The protein functions are not only determined by the interactions between local residues, but also depend on nonlocal, long-range communication between amino acids . For example, information transmission between distant functional surfaces on signaling proteins , the distributed dynamics of amino acids involved in enzyme catalysis , , , and allosteric regulation in various proteins ,  all represent manifestations of nonlocal interactions between residues. To the extent that these features contribute to defining biological properties of protein lineages, it is expected that the underlying mechanisms represent a long-range mutative network, consisting of local and nonlocal residues. Understanding the fundamental basis of long-range communication represents a major challenge in structural biology, which is significantly important for enzyme engineering and rational protein design.
Over the last ten years, a considerable methodological effort has been made to detect coevolution in protein and gene families at molecular level. In the method developed by Dunn et al. ,  information theory was used to reduce the random noise in the identification of coevolving positions. Dutheil and Galtier reviewed the literatures about molecular coevolution between or within residues in gene and protein families , . A successful approach, namely statistical coupling analysis (SCA), was developed by Ranganathan's group , , , , which focused on the conservations between coupling positions (sectors). However, in the enzyme engineering  what we are more interested is that how the amino acid mutations at certain positions modify (or improve) the biological functions (including bioactivity, thermostability, pH tolerance, and other properties) of enzymes. In the rational protein design we want to know the dominative positions for functional evolution and the mutually mutative network among positions in three dimensional structures of proteins.
In this study a computational approach, namely amino acid position conservation-mutation correlation analysis (CMCA), is developed to predict mutually mutative positions and find the evolutionary network in protein family. Unlike traditional SCA (statistical coupling analysis) approach , , , , which is based on the statistical analysis of position conservations, the CMCA focuses on the correlations of position mutations in a protein evolutionary family. We expect that the CMCA approach may find applications to rational protein design and enzyme engineering to improve bioactivities and physicochemical properties of enzymes.
In this study the PDZ domain family is selected as a model system to demonstrate the CMCA approach. The PDZ domain is a common structural domain found in the signaling proteins  of bacteria, yeast, plants, viruses , animals , , and human , . PDZ domains consist of 90–100 amino acid modules that adopt a six-stranded β sandwich configuration with two flanking α helices (Fig. 1). Target C-terminal ligands bind in a surface groove formed between the β2 strand and α2 helix and make a number of interactions that determine both general and sequence specific recognition , , . Both the overall three-dimensional structure and most details of ligand recognition are highly conserved in the PDZ family despite considerable sequence divergence . PDZ domains well represent protein binding motifs for which four high-resolution structures of distantly related members exist , , . These domains help anchor transmembrane proteins to the cytoskeleton and hold together signaling complexes , .
In this study we use a multiple structure aligned PDZ database  consisting of 240 PDZ proteins. After sequence alignment there are 129 positions in the PDZ database, and after deletion of the unnecessary gaps, 27 positions are deleted and the reduced database contains 102 positions.
Following the procedure described in Method section, we first perform the conservation position correlation analysis to the PDZ protein family. The position conservation correlation matrix R(con)L×L of PDZ is graphically shown in Fig. 2. The matrix R(con)L×L is symmetric to the diagonal line. For a clear view the matrix elements r(con)i,j less than 0.5 are filtered. The matrix elements on the diagonal line, whose values are 1 (r(con)i,i=1), are not shown, because the diagonal elements are self correlation coefficients, having no statistical meaning. Fig. 2 A is the relief map of position conservation correlation matrix R(con)L×L of PDZ database. The red bands indicate the region with correlation coefficients from 0.80 to 0.85. Fig. 2 B is the contour map of position conservation correlation matrix R(con)L×L of PDZ database. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange.
In Fig. 2 A most places are high peaks and in Fig. 2 B most areas are in red color, meaning that in many sequence positions the residues are highly conserved, consistent to the high conservation of PDZ family. It is difficult to dig out detailed information from the position conservation correlation matrix R(con)L×L of PDZ database, because too many conservative positions complicate the analysis.
Then we perform the mutation position correlation analysis to the PDZ protein family. The position mutation correlation matrix R(mut)L×L of PDZ is graphically shown in Fig. 3.
The matrix R(mut)L×L is symmetric to the diagonal line. For a clear view the matrix elements r(mut)i,j less than 0.5 are filtered, and the elements on the diagonal line (r(mut)i,i=1) are not shown. Fig. 3 A is the relief map of position mutation correlation matrix R(mut)L×L of PDZ database. The red bands indicate the region with correlation coefficients from 0.80 to 0.85. Fig. 3 B is the contour map of position mutation correlation matrix R(mut)L×L. The map is colored in the same manner as in Fig. 2.
After careful observation and comparison we find partially complementary relationship between the two correlation matrices R(con)L×L and R(mut)L×L: the peaks and valleys are located in alternate places in Fig. 2 A and Fig. 3 A. And in Fig. 3 B there are less areas in red color than in Fig. 2 B. Only 12 separated red regions (R1 to R12) with higher correlation coefficients (ri,j>0.80) are found in Fig. 3 B. It is easier to find useful information from the position mutation correlation matrix R(mut) L×L than that from the position conservation correlation matrix R(con) L×L.
Several studies have highlighted the presence of interaction networks within single-domain proteins, which are crucial for allostery, stability, and folding , , , , . Based on the statistical coupling analysis (SCA)  PDZ domains were proposed to contain energetically coupled positions between residues located in the binding site and elsewhere, forming a long-range interaction network.
Table 1 lists 24 position pairs with higher correlation coefficients (r(mut)i,j>0.80) in the mutation correlation matrix R(mut)L×L, which distribute in 12 red regions in Fig. 3 B. In Table 1 the 24 position pairs are numbered according to the PDZ database . The corresponding position numbers in the PDZ protein 1BE9  are also listed in Table 1, which is a well investigated PDZ protein. Total 30 different positions are in the 24 position pairs. Among the 30 positions, 3 positions are gaps in the protein 1BE9. The correlations of four position pairs are shown in Fig. 4, which possess higher mutative correlation coefficients (R71−25=0.8388, R46−14=0.8522, R83−14=0.892, and R61−56=0.9742). The position 14 is highly correlated to both position 46 and position 83. Therefore, the three positions form a mutually mutative group (14, 46, 83) in the PDZ family.
The structure alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9) and results of position mutation correlation analysis are shown Fig. 1. The 27 residues, shown in stick render in Fig. 1 A, are located at the positions having higher mutative correlation coefficients (see Table 1). The surface of α2-β2 groove of 1BE9 and the peptide ligand is shown in Fig. 1 B. Blue is for hydrophilic surface and green is for hydrophobic surface. Fig. 1 C shows the residues of PDZ proteins 1BE9 and 2QKT at the positions controlling the ligand affinity. The size of Tyr79 and Leu81 of 2QKT (blue) are much larger than the corresponding residues Ala76 and Ala78 of 1BE9 (green). The PDZ protein 2QKT has a disulfide bond between Cys37 and Cys78 shown in Fig. 1 D, which makes it different from other PDZ proteins , . The position 37 and 78 are easily mutative positions based on CMCA results in Table 1. Fig. 1 E shows the sequence alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9). The residues, at the positions having higher position mutation correlation coefficients (see Table 1), are indicated by green frames.
The 27 mutative positions with higher mutation correlation coefficients distribute in two α-helices, all 6 β-strands, and some loops, including the easily mutative positions and some very conservative positions . In Fig. 1 E among the 27 positions there are two larger sectors, each of them consists of four adjacent positions: 36–39 (in β3) and 41–44. Some interesting findings are summarized as follows.
The groove between α2 helix and β2 strand is the binding location for ligand peptide ,  and the residues at these positions are highly conservative. However, mutations at these highly conservative positions may have more important significance to biological functions. Four easily mutative positions are found in the α2-β2 groove: Asn26 and Ile27 in β2, Ala76 and Ala78 in α2 (1BE9 numbering), which determine the ligand binding affinity and control the peptide shape and specificity. In Fig. 1 C the small residues Ala76 and Ala78 (in green) of 1BE9 are replaced by Tyr79 and Leu81 (in blue) of 2QKT. The size of Tyr79 and Leu81 of 2QKT are much larger than the Ala76 and Ala78 of 1BE9. Therefore 1BE9 and 2QKT must have very different preferences of peptide ligand.
The biological relevance of long-range allosteric effects in PDZ domains has attracted considerable attention , , . The PDZ domain of the cell polarity protein Par6 was shown to be allosterically regulated by its adjacent Crib domain in response to binding of CdC42 . Structural analysis showed the β1-α1 interface of the Par6 PDZ domain to be in direct contact with the Crib domain and responsible for transmission to the structurally distinct peptide binding pocket . The results of CMCA fully support above observations. Two easily mutative positions (Ala47 and Asp48, in 1BE9 numbering) are found in α1 helix, and two positions Pro11 and Ile16 (in 1BE9 numbering) are found in β1 strand. These easily mutative positions are connected through peptide ligand and define an allosteric mechanism for regulating binding affinity at the α2-β2 groove through molecular interactions at a distant surface site on the α1 helix .
In the alignment of four PDZ proteins in Fig. 1 E the 2QKT  is an INAD PDZ  domain and belongs to type 5 PDZ. The INAD PDZ domain (PDZ5) exists in a redox-dependent equilibrium ,  between two conformations—a reduced form that is similar to the structure of other PDZ domains, and an oxidized form. In INAD PDZ an intramolecular disulfide bond covalently links a pair of buried cysteine residues located underneath the floor of the ligand-binding pocket , . In 2QKT the disulfide bond is formed between Cys37 in β3 and Cys78 in α2 (in Fig. 1 E numbering). The positions of Cys37 and Cys78 are corresponding to the positions of residues Ile36 and Ala75 of 1BE9, respectively. The position 36 (in 1BE9 numbering) is an easily mutative position according to results of CMCA calculations (see Table 1), and the position 75 (in 1BE9 numbering) is adjacent to the easily mutative position 76 (see Table 1) falling into the mutative region R7 (see Fig. 3 B). The strong intramolecular disulfide bond connects the β3 strand with the α2 helix, suggesting that this interaction may be responsible for the equilibrium between the reduced conformation and the oxidized conformation in INAD PDZ5.
The functional coupling of two positions, even if distantly positioned in the structure, could mutually constrain evolution at the two positions, and these should be represented in the statistical coupling of the underlying amino acid distributions , , . In some cases the functional coupling is not limited only between two positions, but could be among several distant positions, which form a mutually evolutionary network.
Long-range allosteric effects that cause the preference change of ligand peptide in the PDZ binding groove happen in several distant positions. The results of CMCA study reveal the mutually multi coupling positions in the PDZ family. Table 1 lists the couple pairs of easily mutative positions. Actually, these positions can be reorganized into three groups according to the mutual couple pairs (in 1BE9 numbering): (30, 44, 51, 56, 64), (16, 41, 78), and (60, 81, 86). In the group 2 the three positions are at distantly separated β1 (position 16), β3 (position 41), and α2 (position 78) that may form a mutually mutative network and may affect the binding sites in α2-β2 groove. These findings could provide an explanation to distant allosteric interaction network in PDZ proteins.
Conservation and mutation are two opposite aspects of functional evolution of protein family. In the studies for protein evolutionary family the conversation statistical analysis can provide useful information, and several successful tools are developed based on the position conversations, such as SCA (statistical coupling analysis) ,  and MI (mutual information) , . In this study we prove that correlation analysis based on position mutations of amino acids also can reveal very useful information for study of functional evolution of protein family. The position mutations are equally important to the position conservations for study of functional evolution of protein family.
In the conservation-based statistical methods the “phylogenetic relationship” in a protein family causes the “coherent correlation” of all positions , which is raised by sharing common ancestry. Great efforts have been made to solve the “coherent correlation” problem , . In this study the foundational equation is the amino acid position mutative function (Eq.7), based on which the amino acid position mutation matrix TM×L is constructed, and the CMCA approach is developed. Unlike the conservation-based methods, the “coherent correlation” problem may be avoided in the mutation-based method. The theoretical implications of Eq.7 and CMCA approach are summarized as follows. (i) The actual mutations at specific directions are caused by the functional constraints in the protein evolutionary family. The directional mutations at some key positions control the functional evolution of the protein family. (ii) The functional coupling of two or more positions, even if distantly positioned in the structure, mutually constrains mutations at these positions, which form a communicative and evolutionary network in the protein family.
The computation results of CMCA application to the PDZ protein database show that generalizing the principle of mutations to account for correlations between positions reveals a novel structural organization for PDZ proteins that is distinct from traditional structural descriptions and the SCA approach. The CMCA approach and the SCA approach describe the distant allostery and mutative network in protein evolutionary family from different aspects (mutations and conservations), therefore both methods can provide useful information complementally.
Because the conservation-mutation correlation analysis is based on the correlation analysis of amino acid mutations, the CMCA approach may find applications in rational protein design and enzyme engineering by means of artificial residue mutations, and provide suggestion to improve the bioactivities and physicochemical properties of enzymes.
Evolutionarily related proteins have similar sequences and naturally occurring homologous proteins have similar protein structures. It has been shown that three-dimensional protein structure is evolutionarily more conserved than expected due to sequence conservation , . However, the evolution of protein family mainly depends on the mutations happening on the key positions in the 3D structures. Statistical analysis for a protein evolutionary family starts from a multiple 3D structural alignment (MSA) of a homologous protein group.
In this study, the multiple structure alignment procedure is used. Chains that possess coordinates for all their alpha carbons can be realigned taking into account their structure. From an initial estimate of the alignment, a new similarity matrix is generated using the relative alpha carbon coordinates that result from a multi-body superposition. This matrix is used to realign just these alpha carbon populated chains. This procedure is then repeated until the Root Mean Square Distance (RMSD) of the superposition fails to improve. The multiple structural alignment of a protein family has to reveal the structural features: all key functional residues are aligned in same sequence columns, and all key secondary structures (α-helices, β-sheets, and loops) are positioned in the same sectors.
After multiple sequence alignment the protein family is represented by a three dimensional primer data matrix X(0)N×M×L. N is the number of protein structures in the database, M is the types of amino acids (M=21, including 20 natural amino acids and the gap, which are inserted during the multiple alignment), and L is the length of amino acid sequences (including gaps). The database matrix X(0)N×M×L is a binary matrix, in which the element x(0)i,k,l of sequence i at position l is 1 when the amino acid is ak, otherwise, it is 0,
From the primer data matrix X(0)N×M×L the primer amino acid position frequency matrix F(0)M×L is constructed as follows,
The value of f(0)k,l is an integer in region [0,N], equals the times of amino acid ak appearing at position l in all N protein sequences. The higher value of f(0)k,l means the higher frequency of amino acid ak at position l. In this study the gaps are treated as a special amino acid type numbered by 0, and the 20 natural amino acids are numbered from 1 to 20. The summation of f(0)k,l from k=0 to M is N. The F(0)M×L is integer frequency matrix of amino acids. It can be transformed to decimal frequency after dividing by N.
The position correlation analysis is complicated by the presence of alignment gaps, commonly called indels, indicating the structural region present in some proteins but not in others. The gaps (space positions) in the primer data matrix X(0)N×M×L may interfere with the results of statistical analysis badly. Before performing the correlation analysis we have to reduce the unnecessary gaps. To do so, the total amino acid frequencies of 20 natural amino acids at each position l are computed as follows.
In Eq.3 the index k for amino acid types is from 1 to M=20, not including the gap. If the total amino acid position frequencies of 20 natural amino acids q(0)l is less than 20%, the position l is deleted from the primer sequences. Because it means that at the position l more than 80% ‘amino acids’ are gaps, and this position is less important for the biological function of the protein family. After unnecessary gaps are deleted, we get the reduced data matrix XN×M×L and amino acid position frequency matrix FM×L, in which the sequence length L is smaller than in the primer data matrix. For simplicity, we still use L for the reduced sequence length.
The position conservation correlation matrix can be derived from the reduced position frequency matrix FM×L. Because the conservation is directly correlated to the amino acid position frequency, the higher frequency fk,l of amino acid k at position l, the more conservation of amino acid k at this position. For position conservation correlation analysis the position frequency covariance matrix C(con)L×L is constructed firstly from the reduced position frequency matrix FM×L,
where and are the average frequencies at position i and j, respectively,
Hereby we get the position conservation correlation matrix R(con)L×L from the position covariance matrix C(con)L×L as follows.
where the superscript ‘con’ indicates the ‘conservation’, and r(con)i,j is the position conservation correlation coefficient between position i and j.
Before computing the amino acid position mutation correlation matrix, we have to build the amino acid position mutative equation, which measures the mutation of amino acid k at position l in protein family. For this purpose the amino acid types nl (including gap) at each position l in the protein family is very useful, which describes the diversification of amino acids at position l. The larger value of nl, the more mutations at position l. The amino acid position mutation matrix TM×L is constructed as follows.
The value tk,l is the measurement of mutation of amino acid k at position l in protein family, which is directly proportional to the amino acid types nl at position l and inversely proportional to the integer frequency fk,l of amino acid k at position l. The term N/nl in denominator is the average frequency at position l. The ‘1’ in denominator is added to avoid the infinite value when the frequency of amino acid k at position l equals to the average frequency fk,l=N/nl. The ‘1’ in numerator makes the value of tk,l is 0 for all amino acid types when nl is 1 (no mutation at position l).
Fig. 5 shows the curve shapes of amino acid position mutative equation (Eq.7). When the frequency of amino acid k takes the average value at position l (fk,l=N/nl) all curves have the maximum values, and when the amino acid type at position l has the largest value (nl=20), the mutative factor gets the largest value.
Following the same procedure described in sector 2.4, we can construct the position mutation covariance matrix C(mut)L×L from amino acid position mutation matrix TM×L,
where and are the average mutations at position i and j, respectively,
Hereby we get the position mutation correlation matrix R(mut)L×L from the position mutation covariance matrix C(mut)L×L as follows.
where r(mut)i,j is the position mutation correlation coefficient between position i and j. The computational procedure is graphically illustrated in Fig. 6, the flowchart of conservation-mutation correlation analysis (CMCA).
The authors thank the research group of Prof. Rama Ranganathan (University of Texas Southwestern Medical Center) for the PDZ database.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work is financially supported by the National Program on Key Basic Research Project (973 Program) of China under the project 2009CB724703, by the Guangxi key laboratory of Biorefinery under the project 07-109-001A, and by the National Science Foundation of China (NSFC) under the project 30970562. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.