Proteins. Author manuscript; available in PMC 2011 May 1.

PMCID: PMC2841228

NIHMSID: NIHMS161378

Shalom R Rackovsky: ude.mssm@yksvokcar.molahs

Using information-theoretic concepts, we examine the role of the reference state, a crucial component of empirical potential functions, in protein fold recognition. We derive an information-based connection between the probability distribution functions of the reference state and those that characterize the decoy set used in threading. In examining commonly used contact reference states, we find that the quasi-chemical (QC) approximation is informatically superior to other variant models designed to include characteristics of real protein chains, such as finite length and variable amino acid composition from protein to protein. We observe that in these variant models, the total divergence, the operative function that quantifies discrimination, decreases along with threading performance. We find that encoding nativeness in the reference state model, in any amount, does not significantly improve threading performance. A promising avenue for the development of better potentials is suggested by our information-theoretic analysis of the action of contact potentials on *individual* protein sequences. Our results show that contact potentials perform better when the compositional properties of the data set used to derive the score function probabilities are similar to the properties of the sequence of interest. The results also suggest deriving contact potentials only from sequences of similar composition, thereby tailoring the contact potential to a specific test sequence.

The prediction of protein structure requires conformational-energy-based score functions that can correctly pick the native conformation out of a large number of incorrect folds. In order to properly evaluate the nativeness of the interactions in a given conformation, its conformational energy is measured relative to a so-called reference state, a hypothetical “random” state where those interactions are absent. A common empirical approach is to construct this energy using the Boltzmann formalism,^{1}^{; }^{2} quantifying it as a log-odds ratio of two probabilities: the probability of finding the query sequence in a given conformation under native conditions, and the probability of its occurrence in the reference state. The former, the so-called “observed” probability, is usually estimated from a statistical survey of experimentally solved protein conformations. The estimation of the latter, the “expected” or reference probability, has proven to be a difficult task, because this state is inaccessible by direct experimental observation. Computational modelling of the hypothetical “random” state is not straightforward either. This uncertainty has led to the development of a number of reference state models, giving rise to the variety of empirical energy functions found in the literature.^{3}

Empirical energy or score functions have, in recent years, performed increasingly well under stringent computational assessment. This is because such functions, however modelled, are statistical in nature.^{4} They can be taken as a quantitative summary of the sequence-dependent structural information found in native folds. In previous work, we have applied concepts of information theory to quantify such structural information,^{5}^{; }^{6} and have formulated information-based methods to make statistical potentials more effective in structure prediction.^{7}^{; }^{8} In particular, we have demonstrated that the way these sequence-dependent probabilities are defined affects the amount of information that can be extracted from empirical data. Consequently, we have developed methods to optimize descriptions of sequence and conformation to maximize performance in structure prediction. In the present work, we use the same information-theoretic tools to explore the reference state problem. The advantage of an information-based approach is that it allows us to bypass complex biophysical considerations, and examine directly the statistical and informatic properties of score functions.

We use our information-based methodology to examine the effect of the choice of reference state model on the effectiveness of potentials involving contacts between side chains of residues in the protein chain. Contact potentials are used widely because of their respectable performance in fold recognition, relative simplicity, and undemanding parameterization.^{9}^{–}^{11} In reality, one can choose any reference state from which to measure energies or scores. Though its precise meaning is open to interpretation, the concept of “expected” probability can provide initial guidance. Early models of contact energy^{12} assumed that the expected probability of contact between any two amino acids in a folded protein should be proportional to their mole fractions. This model, the so-called quasichemical approximation, has proven to be effective in parameterizing contact energy, despite the fact that it neglects correlations that arise from the connectivity of the chain. Improvements to the reference state to account for chain connectivity and other properties of folded proteins have been made,^{10} but it has been shown that many of these improved models can be easily reduced to the simpler quasichemical reference state,^{13} and provide only modest performance improvement in fold recognition. In the meantime, other contact energy reference states have been advanced in the literature as alternatives to the quasichemical approximation, using different models for the “expected” probability.^{10}^{; }^{14}^{–}^{17}

Despite growing empirical evidence that variants of the quasichemical approximation work equally well, there is still no clear consensus on how to derive the best-performing reference state. (We should note that there is also a parallel set of investigations for reference states for distance-dependent energy functions,^{18}^{–}^{20} which we hope to address in future work.) In this work, we revisit the contact reference state problem using information-theoretic tools we have developed in previous work. We have found that a key determinant of the correct discrimination of native folds amidst an ensemble of incorrect or “decoy” conformations is the total divergence, an information-theoretic entity that quantifies the distance between the score of the correct structure and the mean score of the decoys.^{8} Here, we demonstrate how the definition of “expected” probability affects the total divergence of contact potentials, and evaluate the impact of the definitions on their effectiveness in actual threading. In the course of our investigation, we discover a connection between the properties of the data set from which a potential is derived, and the properties of the particular query sequence on which the potential will be used. In effect, we formulate a basis for query-specific contact potentials, which have been shown to improve performance. Our goal is to understand how the choice of reference state affects the quantity of information that can be extracted from empirical data, in order to maximize data use in structure prediction efforts.

We begin by outlining the information-theoretic tools that will be used in the analysis. The information-theoretic divergence

$$D(X||Y)=\sum _{\begin{array}{l}\text{all}\\ \text{states}\end{array}}p(x)\phantom{\rule{0.38889em}{0ex}}ln\frac{p(x)}{p(y)}$$

(1)

is used routinely to measure the distance between discrete probability distributions describing random variables *X* and *Y*. Strictly speaking, this is not a true distance, because it is not symmetric and does not satisfy the triangle inequality. A related measure is the total divergence

$$\begin{array}{l}J(X,Y)=D(X||Y)+D(Y||X)\\ =\sum p(x)\phantom{\rule{0.38889em}{0ex}}log\frac{p(x)}{p(y)}+\sum p(y)\phantom{\rule{0.38889em}{0ex}}log\frac{p(y)}{p(x)}\end{array}$$

(2)

which, unlike *D*, is symmetric. A useful property of divergence is that

$$D(X||Y)\ge 0$$

(3)

with equality iff *p*(*x*) = *p*(*y*) across all states.^{21} Eq. (3) can be used to derive another inequality, as follows:

$$\begin{array}{l}\sum p(x)\phantom{\rule{0.38889em}{0ex}}log\frac{p(x)}{p({x}^{\prime})}\ge 0\\ \sum p(x)\left[log\frac{p(x)}{p(y)}-log\frac{p({x}^{\prime})}{p(y)}\right]\ge 0,\text{and}\phantom{\rule{0.16667em}{0ex}}\text{therefore}\\ \sum p(x)\phantom{\rule{0.38889em}{0ex}}log\frac{p(x)}{p(y)}\ge \sum p(x)\phantom{\rule{0.38889em}{0ex}}log\frac{p({x}^{\prime})}{p(y)}\end{array}$$

(4)

with equality iff {*p*(*x*)} = {*p*(*x*′)} for all states. This inequality will prove useful at a number of points in this work. One immediate implication concerns the estimation of probabilities, which is critical in building empirical potentials. The inequality indicates diminished divergence if the true underlying probability distribution *p*(*x*) is poorly approximated by *p*(*x*′), i.e. {*p*(*x*)} ≠ {*p*(*x*′)}. Therefore, since the total divergence *J* of a potential is indicative of performance,^{7}^{; }^{8} accurate approximation of probabilities from empirical data is critical.
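As an illustration of Eqs. (1) to (4), the following minimal Python sketch evaluates the divergence, the total divergence, and the inequality of Eq. (4) for small discrete distributions. The three distributions and all function names are hypothetical, chosen only for illustration; they are not taken from the data set used in this work.

```python
import math

def divergence(p, q):
    """D(X||Y) = sum_x p(x) ln(p(x)/q(x)), Eq. (1); states with p(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_divergence(p, q):
    """J(X,Y) = D(X||Y) + D(Y||X), Eq. (2); symmetric by construction."""
    return divergence(p, q) + divergence(q, p)

def cross_term(p, p_approx, q):
    """Right-hand side of Eq. (4): sum_x p(x) ln(p'(x)/q(x))."""
    return sum(pi * math.log(ai / qi)
               for pi, ai, qi in zip(p, p_approx, q) if pi > 0)

# Hypothetical three-state distributions (illustrative values only).
p = [0.5, 0.3, 0.2]          # "true" distribution p(x)
q = [0.1, 0.3, 0.6]          # second distribution p(y)
p_approx = [0.4, 0.4, 0.2]   # imperfect estimate p(x') of p(x)

print("D(p||q) =", divergence(p, q))
print("D(q||p) =", divergence(q, p))        # differs from D(p||q): D is not symmetric
print("J(p,q)  =", total_divergence(p, q))
print("Eq. (4) holds:", divergence(p, q) >= cross_term(p, p_approx, q))
```

The last line checks that replacing *p*(*x*) by an imperfect estimate inside the logarithm cannot increase the sum, which is the content of Eq. (4).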

The informatic quantities described above have been used to model sequence-structure alignment or threading. We extend previous results^{7}^{; }^{8} here to examine the importance of the reference state to the effectiveness of the resulting score function. Typically, the score of an alignment of a query sequence *s* and a test conformation *c* is an additive potential, built from empirical data:

$${E}_{q}(c|s)={\sum}_{i}^{m}{e}_{q}({c}_{i}|s)$$

(5a)

where *e _{q}*(*c _{i}*|*s*) is the log-odds score of the *i*th interaction of type *q*:

$${e}_{q}({c}_{i}|s)=\log\frac{{p}^{\text{obs}}({c}_{i}|s,q)}{{p}^{\text{exp}}({c}_{i}|s)}$$

(5b)

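The numerator *p*^{obs}(*c _{i}*|*s*, *q*) is the observed probability, estimated from native structures, and the denominator *p*^{exp}(*c _{i}*|*s*) is the expected probability in the reference state. Normalizing by *n _{x}* gives the per-interaction score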

$${E}_{q}^{\prime}(c|s)=\frac{1}{{n}_{x}}{E}_{q}(c|s)=\frac{1}{{n}_{x}}{\sum}_{i}^{m}{e}_{q}({c}_{i}|s)$$

(6)

Various interactions *q* have been quantitated by this scoring scheme.^{4}^{; }^{6}^{; }^{8}^{; }^{22}^{–}^{26} If the interaction *q* is significant in protein stability, the score function ${E}_{q}^{\prime}(c|s)$ can be effective in evaluating the fitness of any given sequence-structure alignment.

The gapless threading procedure, a good model for structure prediction, involves comparison of the score of the native (correct) conformation with the spectrum of scores given by an extensive ensemble of decoy (incorrect) structures.^{27} Correct detection occurs when the score of the native conformation is highest (or the corresponding energy is lowest). Expected behavior of a threading potential can be evaluated by repeated use of the potential in a battery of fold recognition tests. In previous work, we have shown that one quantity used to evaluate discrimination success is related to well-known information-theoretic properties of the scoring function.^{7}^{; }^{8} This quantity is the gap between native and decoy scores. For a typical sequence *s*, this gap is

$${J}_{q}(c,s)={E}_{q}^{\prime}({c}^{N}|s)-\frac{1}{{n}_{d}}{\sum}_{j}{E}_{q}^{\prime}({c}^{j}|s)$$

(7)

where *c ^{N}* refers to the native conformation, and the summation runs through the ensemble of *n _{d}* decoy conformations *c ^{j}*. Averaged over *n _{s}* sequences, the gap becomes

$${J}_{q}(C,S)=\frac{1}{{n}_{s}}{\sum}_{k}\left[{E}_{q}^{\prime}({c}^{N}|{s}_{k})-\frac{1}{{n}_{d}}{\sum}_{j}{E}_{q}^{\prime}({c}^{j}|{s}_{k})\right]=\frac{1}{{n}_{s}}{\sum}_{k}{E}_{q}^{\prime}({c}^{N}|{s}_{k})-\frac{1}{{n}_{s}}{\sum}_{k}\frac{1}{{n}_{d}}{\sum}_{j}{E}_{q}^{\prime}({c}^{j}|{s}_{k})$$

(8a)

or, in terms of the basic scoring function *e _{q}*(*c _{i}*|*s*),

$${J}_{q}(C,S)=\frac{1}{{n}_{s}}{\sum}_{k}\frac{1}{{n}_{x}}{\sum}_{i}{e}_{q}({c}_{i}^{N}|{s}_{k})-\frac{1}{{n}_{s}}{\sum}_{k}\frac{1}{{n}_{d}}{\sum}_{j}\frac{1}{{n}_{x}}{\sum}_{i}{e}_{q}({c}_{i}^{j}|{s}_{k})$$

(8b)

The first term of the right hand side represents the average per-interaction score given by a sequence in its native conformation, which we have shown in previous work to be equal to mutual information between sequence and conformation.^{7}^{;}^{8} The second term is the expected per-interaction score given by a given sequence mounted onto a typical decoy conformation. To simplify the equation further, we recognize that in repeated threading, the total numbers of sequences *n _{s}* and decoy conformations

$${J}_{q}(C,S)={\sum}_{k}{\sum}_{i}\frac{1}{{n}_{s}{n}_{x}}{e}_{q}({c}_{i}^{N}|{s}_{k})-{\sum}_{k}{\sum}_{i}{\sum}_{j}\frac{1}{{n}_{s}{n}_{d}{n}_{x}}{e}_{q}({c}_{i}^{j}|{s}_{k})$$

(9)

The summations above run through each instance of (*c _{i}*,

Another way to express this equation is to count the instances of each *unique* (*c*, *s*) alignment, and then recast it as summations through all unique pairs. For instance, if the (*c _{r}*,

$${J}_{q}(C,S)={\sum}_{r,t}\frac{{n}^{N}({c}_{r},{s}_{t})}{{n}_{s}{n}_{x}}{e}_{q}({c}_{r}|{s}_{t})-{\sum}_{r,t}\frac{{n}^{D}({c}_{r},{s}_{t})}{{n}_{s}{n}_{d}{n}_{x}}{e}_{q}({c}_{r}|{s}_{t})$$

(10a)

where the summation runs through all unique sequence-conformation pairs. The frequency ratios can be represented by more familiar notation:

$${J}_{q}(C,S)={\sum}_{r,t}{p}^{N}({c}_{r},{s}_{t})\,{e}_{q}({c}_{r}|{s}_{t})-{\sum}_{r,t}{p}^{D}({c}_{r},{s}_{t})\,{e}_{q}({c}_{r}|{s}_{t})$$

(10b)

while, using the score function (Eq. (5b)), we have:

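$${J}_{q}(C,S)={\sum}_{r,t}{p}^{N}({c}_{r},{s}_{t})\log\frac{{p}^{\text{obs}}({c}_{r}|{s}_{t},q)}{{p}^{\text{exp}}({c}_{r}|{s}_{t})}-{\sum}_{r,t}{p}^{D}({c}_{r},{s}_{t})\log\frac{{p}^{\text{obs}}({c}_{r}|{s}_{t},q)}{{p}^{\text{exp}}({c}_{r}|{s}_{t})}$$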

(10c)

This equation can be converted into a more familiar information-theoretic formulation, by multiplying the numerator and denominator of the score function by *p*(*s _{k}*), and reversing the sign of the decoy term:

$${J}_{q}(C,S)={\sum}_{r,t}{p}^{N}({c}_{r},{s}_{t})\log\frac{{p}^{\text{obs}}({c}_{r},{s}_{t}|q)}{{p}^{\text{exp}}({c}_{r},{s}_{t})}+{\sum}_{r,t}{p}^{D}({c}_{r},{s}_{t})\log\frac{{p}^{\text{exp}}({c}_{r},{s}_{t})}{{p}^{\text{obs}}({c}_{r},{s}_{t}|q)}$$

(11)

In summary, the expected gap between the native score and the mean score of decoy conformations, represented by *J _{q}* (

Comparing to Eq. (2), it is easy to recognize that *J _{q}* (

$${p}^{\text{obs}}({c}_{r},{s}_{t}|q)={p}^{N}({c}_{r},{s}_{t})$$

(12)

$$\text{and}\phantom{\rule{0.16667em}{0ex}}{p}^{exp}({c}_{r},{s}_{t})={p}^{D}({c}_{r},{s}_{t})$$

(13)

The quantities on the left hand side are components of the score function *e _{q}* (

We now examine issues relating to the two pairs of probability functions more closely. The first, expressed in Eq. (12), is an equality widely accepted in computational biology, but it is valid only under a strict condition: the probabilities observed in empirical data can be assumed identical to the expected probabilities characterizing the native state only when the data set is sufficiently representative of the diversity of protein sequences and structures.

The second condition (Eq. (13)), defining the nature of the reference state, is of primary interest here. We shall explore the consequences of a choice of reference state and gauge the performance of the resulting score functions in threading. The other important characteristic of score functions, the variance of scores, is also examined.
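The role of the two conditions in Eqs. (12) and (13) can be checked numerically. The sketch below, written under simplifying assumptions, evaluates the score gap of Eq. (10b) with a log-odds score and compares it with the total divergence of Eq. (2). The three-state distributions and all function names are hypothetical and serve only as an illustration.

```python
import math

def total_divergence(p, q):
    """J = D(p||q) + D(q||p), as in Eq. (2); all probabilities assumed > 0."""
    return sum(a * math.log(a / b) + b * math.log(b / a) for a, b in zip(p, q))

def score_gap(p_native, p_decoy, p_obs, p_exp):
    """Mean native score minus mean decoy score, as in Eq. (10b),
    using the log-odds score e = ln(p_obs / p_exp)."""
    scores = [math.log(o / x) for o, x in zip(p_obs, p_exp)]
    native_mean = sum(pn * e for pn, e in zip(p_native, scores))
    decoy_mean = sum(pd * e for pd, e in zip(p_decoy, scores))
    return native_mean - decoy_mean

# Hypothetical distributions over three (conformation, sequence) states.
p_native = [0.5, 0.3, 0.2]   # stands in for p^N
p_decoy = [0.2, 0.3, 0.5]    # stands in for p^D
p_other = [0.3, 0.3, 0.4]    # an alternative reference state, p_exp != p^D

# With p_obs = p^N and p_exp = p^D, the gap coincides with the total divergence.
matched = score_gap(p_native, p_decoy, p_native, p_decoy)
print(matched, total_divergence(p_native, p_decoy))

# With a different reference state, the gap is no longer a total divergence.
print(score_gap(p_native, p_decoy, p_native, p_other))
```

The first printed pair agrees to numerical precision, which is the identification made by comparing Eq. (11) with Eq. (2).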

From Eq. (4), it can be seen that the two equalities maximize the native and decoy terms individually:

$${\sum}_{r,t}{p}^{N}({c}_{r},{s}_{t})log\frac{{p}^{N}({c}_{r},{s}_{t})}{{p}^{D}({c}_{r},{s}_{t})}\ge {\sum}_{r,t}{p}^{N}({c}_{r},{s}_{t})log\frac{{p}^{\prime}({c}_{r},{s}_{t})}{{p}^{D}({c}_{r},{s}_{t})}$$

(14)

for the native term, and

$${\sum}_{r,t}{p}^{D}({c}_{r},{s}_{t})log\frac{{p}^{D}({c}_{r},{s}_{t})}{{p}^{N}({c}_{r},{s}_{t})}\ge {\sum}_{r,t}{p}^{D}({c}_{r},{s}_{t})log\frac{{p}^{\prime}({c}_{r},{s}_{t})}{{p}^{N}({c}_{r},{s}_{t})}$$

(15)

for the decoy term. The information-based optimization implemented previously^{6}^{–}^{8} employs the strategy of maximizing the native term. In those studies, we found that factors that increase mutual information (the left hand side of the inequality in Eq. (14)) also increase *J _{q}* (

The prescription given by Eqs. (14) and (15) when applied to the score function, however, does not guarantee maximization of *J _{q}* (

We explore issues regarding the reference state concretely by using pairwise contact potentials. Details of the comprehensive threading procedure can be found elsewhere.^{8} Briefly, using a set of representative X-ray structures of protein chains, we model the gapless threading exercise by designing an all-against-all test. This procedure involves finding the score-rank of the native conformation of *every* sequence chain in the data set with respect to the ensemble of incorrect conformations provided by the same data set. In recent work,^{8} we demonstrated that measurements from comprehensive threading tests correspond to the components of the total divergence equation *J _{q}*(*C*, *S*).

We rewrite the equations derived above in terms of the contact potential. The score function is

$${e}_{c}(c|ab)=\log\frac{{p}^{\text{obs}}(c|ab)}{{p}^{\text{exp}}(c|ab)}$$

(16)

while the total divergence, or mean score gap, is:

$${J}_{q}(C,S)={\sum}_{ab}{p}^{N}(c|ab)\log\frac{{p}^{\text{obs}}(c|ab)}{{p}^{\text{exp}}(c|ab)}+{\sum}_{ab}{p}^{D}(c|ab)\log\frac{{p}^{\text{exp}}(c|ab)}{{p}^{\text{obs}}(c|ab)}$$

(17)

Lastly, the maximization condition for the decoy term is:

$${p}^{\text{exp}}(c|ab)={p}^{D}(c|ab)$$

(18)

In these equations, *p*(*c*|*ab*) refers to the probability of contact between amino acid pair *ab*.

To define the contact potential, amino acid pairs are represented by their beta-carbons (alpha carbon for glycine). Contact occurs between two side chains if their representative beta (or alpha) carbon atoms are within 9.5Å. All-against-all threading was implemented with 150-mer sequences, mounted onto all continuous 150-mer conformations in the database. With a data set of high-resolution X-ray structures of 1036 proteins, made up of 210,995 residues, a total of 58,034 150-mer sequences were aligned with each one of the same number of conformations, and their scores tallied.
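The contact definition above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name, the toy coordinates, and the `min_separation` parameter are assumptions, and the treatment of sequence-adjacent residues is not specified in the text, so every pair *i* < *j* is considered by default.

```python
import math
from collections import Counter

CONTACT_CUTOFF = 9.5  # Angstrom, the contact distance quoted in the text

def contact_counts(sequence, coords, cutoff=CONTACT_CUTOFF, min_separation=1):
    """Count contacts per unordered amino acid pair.

    sequence: string of one-letter residue codes.
    coords:   one (x, y, z) tuple per residue, taken at the beta-carbon
              (alpha-carbon for glycine).
    min_separation: hypothetical parameter; smallest |i - j| that is counted.
    """
    counts = Counter()
    n = len(sequence)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pair = tuple(sorted((sequence[i], sequence[j])))
                counts[pair] += 1
    return counts

# Toy example with made-up coordinates along a line.
toy_sequence = "AGAL"
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0), (24.0, 0.0, 0.0)]
print(dict(contact_counts(toy_sequence, toy_coords)))
```

Normalizing such counts over all pairs, accumulated across the native structures in a data set, gives an estimate of the observed contact probabilities.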

As an initial exercise, we have designed a simple experiment to track the behavior of *J _{q}* (

$${p}_{\text{QC}}^{\text{exp}}(c|ab)=k{\chi}_{a}\,{\chi}_{b}=k\left(\frac{{\sum}_{i}{n}_{i}(a)}{{\sum}_{i}{N}_{i}}\right)\,\left(\frac{{\sum}_{i}{n}_{i}(b)}{{\sum}_{i}{N}_{i}}\right)$$

(19)

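where *n _{i}*(*a*) is the number of residues of type *a* in protein chain *i*, *N _{i}* is the length of chain *i*, *k* = 2 if *a* ≠ *b* and 1 otherwise, and the summation covers all protein chains in the data set.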

The observed probability distribution component, *p*^{obs} (*c*|*ab*), is derived from frequency counts of native contacts in the data set. The present simulation entails random perturbations of *p*^{exp} (*c*|*ab*) from the initial quasi-chemical reference, while keeping *p*^{obs} (*c*|*ab*) constant, to create entirely new score functions. Perturbations of varying degrees are made to the probability distribution, in order to explore a wide range of reference states relative to QC. The effectiveness of each newly generated score function is evaluated using the all-against-all threading.
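To make the procedure concrete, the following Python sketch builds the quasi-chemical reference state of Eq. (19) from chain compositions and then perturbs it at random. The perturbation scheme shown here (a random multiplicative factor followed by renormalization) is an assumption made for illustration only, since the scheme actually used is not detailed in this section; the toy compositions and all names are likewise hypothetical.

```python
import random
from itertools import combinations_with_replacement

def qc_reference(compositions):
    """Quasi-chemical expected contact probabilities, Eq. (19).

    compositions: one dict {amino acid: residue count} per protein chain.
    Returns {(a, b): k * chi_a * chi_b} with k = 2 for a != b and 1 for a == b.
    """
    total = sum(sum(c.values()) for c in compositions)
    residues = sorted({a for c in compositions for a in c})
    chi = {a: sum(c.get(a, 0) for c in compositions) / total for a in residues}
    return {(a, b): (1 if a == b else 2) * chi[a] * chi[b]
            for a, b in combinations_with_replacement(residues, 2)}

def perturb(reference, scale, rng):
    """Hypothetical perturbation: scale each probability by a random factor in
    [1 - scale, 1 + scale] (0 <= scale < 1), then renormalize to sum to 1."""
    raw = {pair: p * (1.0 + scale * rng.uniform(-1.0, 1.0))
           for pair, p in reference.items()}
    norm = sum(raw.values())
    return {pair: value / norm for pair, value in raw.items()}

# Toy data set of two chains over a two-letter alphabet (illustrative only).
toy_chains = [{"A": 3, "G": 1}, {"A": 1, "G": 3}]
qc = qc_reference(toy_chains)
perturbed = perturb(qc, scale=0.3, rng=random.Random(0))
print(qc)
print(perturbed)
```

Because the factor *k* counts each unlike pair twice, the QC probabilities sum to one over all unordered pairs, so the original and perturbed distributions can be compared directly with the divergence measures defined earlier.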

We generated 400 unique reference state distributions, yielding changes in the mean gap score *J _{q}* (

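$${J}_{q}({p}_{\text{QC}}^{\text{exp}},{p}_{\delta}^{\text{exp}})={\sum}_{ab}{p}_{\text{QC}}^{\text{exp}}(c|ab)\log\frac{{p}_{\text{QC}}^{\text{exp}}(c|ab)}{{p}_{\delta}^{\text{exp}}(c|ab)}+{\sum}_{ab}{p}_{\delta}^{\text{exp}}(c|ab)\log\frac{{p}_{\delta}^{\text{exp}}(c|ab)}{{p}_{\text{QC}}^{\text{exp}}(c|ab)}$$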

(20)

where ${p}_{\delta}^{\text{exp}}(c|ab)$ is the perturbed probability. We find that a modest number of the reference states (5%) actually yield a larger *J _{q}*(*C*, *S*) than the quasi-chemical reference state.

Information-theoretic properties of 400 randomly generated reference states. These reference state distributions (δ) were generated by perturbation of the quasi-chemical reference state (QC). The distance between any reference state δ **...**

Effectiveness of a score function can be measured by the relative rank of the native conformation in relation to the decoy ensemble. A rank *r*(*s*) of the native score of 1 signifies that the native score is the best over-all. The mean percentile rank,

$$r=\frac{1}{{n}_{L-\text{mer}}}\sum _{\begin{array}{l}\text{all}\\ \text{seqs}\end{array}}r(s)$$

(21)

computed from all-against-all threading, is the most stringent gauge of performance. In this set of 400 perturbed reference states, all except 20 have a higher *r* than the quasi-chemical reference state (which corresponds to poorer discrimination of the native conformation). The dependence of *r* on the proximity of the reference state to the quasi-chemical approximation is demonstrated in Figure 1D.
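The rank statistic of Eq. (21) can be sketched as follows for a small all-against-all score matrix. The matrix values are made up, the function names are hypothetical, and the convention that the native conformation of sequence *s* sits at column *s* is an assumption of this sketch. Ties are resolved in favor of the native conformation, and the conversion from rank to percentile is omitted because it is not specified here.

```python
def native_rank(scores, native_index):
    """Rank of the native score among all conformations; 1 means the native
    score is the highest. Only strictly higher decoy scores worsen the rank."""
    native = scores[native_index]
    return 1 + sum(1 for k, s in enumerate(scores)
                   if k != native_index and s > native)

def mean_rank(score_matrix):
    """Average native rank over all sequences, in the spirit of Eq. (21).

    score_matrix[s][c] is the score of sequence s mounted on conformation c;
    the native conformation of sequence s is assumed to be c == s.
    """
    ranks = [native_rank(row, s) for s, row in enumerate(score_matrix)]
    return sum(ranks) / len(ranks)

# Toy 3x3 score matrix (illustrative values only).
toy_scores = [
    [3.0, 1.0, 2.0],   # native (column 0) is best: rank 1
    [5.0, 4.0, 6.0],   # native (column 1) is worst: rank 3
    [0.0, 1.0, 2.0],   # native (column 2) is best: rank 1
]
print(mean_rank(toy_scores))
```

A lower mean rank corresponds to better discrimination of the native conformation.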

From this exercise, we learn the following: (1) QC appears to lie in the neighborhood of the locally optimal reference state. Indeed, it has been demonstrated^{13} that, in the situation of gapless threading of sequences through a diverse ensemble of conformations that preserve their native contacts, the mole-fraction product adequately approximates the probability that two amino acids are in contact. This is consistent with Figure 1C, which demonstrates (via Eq. (15)) that QC optimizes the decoy term, thereby confirming that ${p}_{\text{QC}}^{\text{exp}}(c|ab)\approx {p}^{D}(c|ab)$. (2) The farther a reference state is from the quasi-chemical approximation, the lower the value of the decoy term. Moreover, only 18 of the 400 randomly perturbed distributions produce a marginally higher score gap *J _{q}*(*C*, *S*) than QC.

The previous section pointed to the effectiveness of the quasi-chemical approximation in modelling the reference state. While there may be randomly perturbed reference states that can outperform QC, they do so only by a small margin. More importantly, such reference states may prove impractical and undesirable because they do not arise from well-defined models or exact prescriptions. In this section, we confine our analysis to the space around probability distributions that arise from conceptually modelable states.

Apart from QC, a number of reference states have been advanced in the literature. Such models attempt to take into account relevant structural properties of natural proteins that QC does not. In particular, QC assumes that residues are not linked in chains of finite length, whose composition can differ significantly from the over-all composition of the “amino acid gas” (i.e., the composition of the universe of protein structures).^{3} More sophisticated reference state models take real-world characteristics of native protein conformations into account, in order to better estimate the “expected” probabilities of contact.^{10}

There are a number of reference state models that attempt to consider the biases in amino acid composition within individual sequence chains of finite length.^{10} The first model we consider takes the expected probability of finding the pair *ab* in contact to be proportional to the number of times they exist together in the same sequence. The probability distribution can be derived from a set of sequences by the following formulation:

$${p}_{\text{QC1}}^{\text{exp}}(c|ab)=\begin{cases}\dfrac{{\sum}_{i}2{n}_{i}(a){n}_{i}(b)}{{\sum}_{i}{N}_{i}({N}_{i}-1)}, & a\ne b\\[1ex] \dfrac{{\sum}_{i}{n}_{i}(a)\,({n}_{i}(a)-1)}{{\sum}_{i}{N}_{i}({N}_{i}-1)}, & a=b\end{cases}$$

(22)

where *N _{i}* is the sequence length of sequence

$$\frac{{\sum}_{i}2{n}_{i}(a)\,{n}_{i}(b)}{{\sum}_{i}{N}_{i}({N}_{i}-1)}=\frac{{n}_{\mathit{prot}}\cdot 2\,n(a)\,n(b)}{{n}_{\mathit{prot}}\cdot {N}_{i}\,({N}_{i}-1)}$$

(23)

where *n _{prot}* is the number of chains in the data set. Upon applying the limit

The second reference state (QC2) considers more specific structural properties of folded proteins. In this model, the contact probability is estimated as the mean of the expected probability of *ab* contact for each chain in the data set. This is calculated as follows:

$${p}_{\text{QC2}}^{\text{exp}}(c|ab)=\frac{1}{{\sum}_{i}{\sum}_{x}{\sum}_{y\ge x}{n}_{i}^{c}(xy)}{\sum}_{i}{f}_{i}(ab){\sum}_{x}{\sum}_{y\ge x}{n}_{i}^{c}(xy)$$

(24a)

where ${n}_{i}^{c}(ab)$ is the number of contacts between amino acids *a* and *b* in protein chain *i*, and

$${f}_{i}(ab)=\{\begin{array}{ll}\frac{2{n}_{i}(a)\phantom{\rule{0.16667em}{0ex}}{n}_{i}(b)}{{N}_{i}\phantom{\rule{0.16667em}{0ex}}({N}_{i}-1)},\hfill & a\ne b\hfill \\ \frac{{n}_{i}(a)\phantom{\rule{0.16667em}{0ex}}({n}_{i}(a)-1)}{{N}_{i}\phantom{\rule{0.16667em}{0ex}}({N}_{i}-1)},\hfill & a=b\hfill \end{array}$$

(24b)

Unlike QC1, QC2 recognizes that different folds occur in the data set, implying a variation in the number of contacts from chain to chain. The model, which has been referred to as the partial-composition corrected reference state,^{10} proportionally partitions the number of contacts observed in *each* protein chain among the residue pairs within that chain, after which a weighted sum across all the chains in the data set is derived, to give the over-all expected probability. QC1, on the other hand, simply collates the proportion of expected pairings in the entire data set, without regard to fold detail. Mathematically, QC2 reduces to QC1 if the total number of contacts for each chain *i* is made proportional only to sequence length. This is equivalent to setting ${\sum}_{x}{\sum}_{y\ge x}{n}_{i}^{c}(xy)=k{N}_{i}({N}_{i}-1)$, at constant *k*, thus transforming Eq. (24a) into Eq. (22).
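The relationship between the two variants can be verified with a short Python sketch of Eqs. (22) and (24). The two toy chains, their contact totals, and the function names are hypothetical; the sketch only illustrates that QC2 coincides with QC1 when the number of contacts per chain is proportional to *N _{i}*(*N _{i}* − 1), and differs otherwise.

```python
def chain_length(comp):
    """N_i: total number of residues in a chain composition dict."""
    return sum(comp.values())

def pair_count(comp, a, b):
    """Ordered residue pairs (a, b) within one chain:
    2 n(a) n(b) for a != b, n(a)(n(a) - 1) for a == b."""
    na, nb = comp.get(a, 0), comp.get(b, 0)
    return 2 * na * nb if a != b else na * (na - 1)

def qc1(compositions, a, b):
    """Eq. (22): pooled pair counts over pooled N_i (N_i - 1)."""
    numerator = sum(pair_count(c, a, b) for c in compositions)
    denominator = sum(chain_length(c) * (chain_length(c) - 1) for c in compositions)
    return numerator / denominator

def f_i(comp, a, b):
    """Eq. (24b): expected fraction of ab pairs within a single chain."""
    n = chain_length(comp)
    return pair_count(comp, a, b) / (n * (n - 1))

def qc2(compositions, contacts_per_chain, a, b):
    """Eq. (24a): mean of f_i(ab), weighted by each chain's total contacts."""
    weighted = sum(f_i(c, a, b) * m for c, m in zip(compositions, contacts_per_chain))
    return weighted / sum(contacts_per_chain)

# Two toy chains of lengths 4 and 6 (illustrative only).
toy_chains = [{"A": 3, "G": 1}, {"A": 2, "G": 4}]

print(qc1(toy_chains, "A", "G"))
print(qc2(toy_chains, [12, 30], "A", "G"))   # contacts = N_i (N_i - 1): equals QC1
print(qc2(toy_chains, [10, 50], "A", "G"))   # fold-dependent contacts: differs
```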

Results from the comprehensive all-against-all threading using QC and the two variant models QC1 and QC2 are summarized in Table I. Two sequence lengths (*L* = 150, 200) were used, along with two contact distances (*d _{c}* = 9.5, 12.5Å), in order to survey a range of threading conditions. The difference in reference state models is clearly reflected by the decoy divergence term

Table I. Information-Theoretic Quantities and Threading Performance of Contact Potentials with Varying Reference States

$${\sum}_{r,t}{p}^{D}({c}_{r},{s}_{t})log\frac{{p}^{x}({c}_{r},{s}_{t})}{{p}^{N}({c}_{r},{s}_{t})}>{\sum}_{r,t}{p}^{D}({c}_{r},{s}_{t})log\frac{{p}^{y}({c}_{r},{s}_{t})}{{p}^{N}({c}_{r},{s}_{t})}$$

(25)

Though divergence is not a distance in the strict sense, this relationship is useful in understanding the way reference state models act with respect to the amount of information incorporated in them. Any detail that can bring the expected probabilities of pairwise contact closer to what is actually seen in a typical threading exercise should increase the decoy divergence *D*.

Closer inspection of the data from all four threading sets in Table I, however, reveals that an increase in *D* does not necessarily mean a marked improvement in performance. The three models seem to perform similarly, with QC exhibiting slightly higher *r* than its two variants. The mutual information *I* is highest for QC, which more than offsets any drop in *D* to make its *J* maximal among the three models. These results suggest that the quasi-chemical approximation does at least as well as any of the more sophisticated models, and may even actually outperform them. These issues will be explored further in a later section.

The most accurate reference state model for a given data set can be computed directly from empirical pairing frequencies generated by the specific threading procedure. In this “data-based” model (DB),^{13} the expected probability of contact for any given pair is derived directly by aligning a series of sequences with a range of decoy conformations, and tallying every occurrence of the contact. In the context of a set of query protein sequences, the best estimate should be achieved by counting the total pairwise contact frequencies when *all* sequences are mounted onto *all* conformations. In effect, the empirical *p ^{D}*(

Scatter plots of the 190 score elements (one for each amino acid pair) of the contact potentials derived using the reference states examined in this work. (A) Comparison between the score elements given by the quasi-chemical reference state (QC) and the **...**

Results of comprehensive threading for this model are summarized in Table I. In accordance with Eq. (15), the decoy divergence *D* is highest for DB, continuing the trend established by QC1 and QC2. However, the decrease in mutual information *I* is more dramatic than the improvement in *D*, lowering the total divergence *J*, and resulting in a decreased average performance, as measured by <*r*>.

We examine the relationships among DB, QC, and its two variants QC1 and QC2 more closely. If DB indeed embeds many characteristics of native-like chains in the model, then its probabilities ${p}_{\text{DB}}^{\text{exp}}(c|ab)$ should be closer to the true native probabilities *p*^{obs}(*c*|*ab*) than ${p}_{\text{QC}}^{\text{exp}}(c|ab)$. This can be confirmed by a simple calculation. For each *ab* pair, relative distances among the three probabilities can be compared by:

$$\mathrm{\Delta}=\left|{p}_{\text{QC}}^{\text{exp}}(c|ab)-{p}^{\text{obs}}(c|ab)\right|-\left|{p}_{\text{DB}}^{\text{exp}}(c|ab)-{p}^{\text{obs}}(c|ab)\right|$$

(26)

A positive Δ indicates that ${p}_{\text{DB}}^{\text{exp}}(c|ab)$ is closer to *p*^{obs}(*c*|*ab*) than ${p}_{\text{QC}}^{\text{exp}}(c|ab)$, while a negative value indicates the opposite. Figure 3 shows that more than 86% of the unique amino acid pair probabilities that make up DB have positive Δ values, demonstrating that DB indeed exhibits more native-like character than QC. Likewise, comparisons between QC and QC1, and between QC and QC2, yield the expected ordering, namely that QC < QC1 < QC2 < DB in terms of proximity to *p*^{obs}(*c*|*ab*).
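The comparison of Eq. (26) amounts to a difference of absolute deviations, taken pair by pair. The following Python sketch illustrates the calculation; the three toy distributions are made-up numbers rather than values from this work, and the function names are hypothetical.

```python
def delta(p_qc, p_db, p_obs):
    """Eq. (26) for one amino acid pair: positive when the DB probability is
    closer to the observed probability than the QC probability is."""
    return abs(p_qc - p_obs) - abs(p_db - p_obs)

def fraction_positive(qc, db, obs):
    """Fraction of amino acid pairs with a strictly positive delta."""
    positives = sum(1 for pair in obs if delta(qc[pair], db[pair], obs[pair]) > 0)
    return positives / len(obs)

# Hypothetical contact probabilities for a two-letter alphabet.
toy_qc = {("A", "A"): 0.20, ("A", "G"): 0.50, ("G", "G"): 0.30}
toy_db = {("A", "A"): 0.25, ("A", "G"): 0.45, ("G", "G"): 0.30}
toy_obs = {("A", "A"): 0.30, ("A", "G"): 0.40, ("G", "G"): 0.30}

print(fraction_positive(toy_qc, toy_db, toy_obs))
```

In this toy case two of the three pairs have a positive Δ, and the third is tied at zero and is not counted as positive.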

In constructing models that best embody the idea of “expected” probabilities, we seek ways to encode more native-like characteristics in the reference state. The limit of this exercise is the point where the reference state model approaches the observed contact probabilities, or *p*^{exp}(*c*|*ab*) = *p*^{obs}(*c*|*ab*). At this limit, Eq. (17) yields a value of zero for the three informatic quantities *I*, *D*, and *J*. Models that encode varying amounts of nativeness, including those examined here thus far, are informatically located between this extreme and QC. In order to examine the characteristics of such models, we built 100 evenly spaced reference state models from a weighted sum of ${p}_{\text{QC}}^{\text{exp}}(c|ab)$ and *p*^{obs}(*c*|*ab*):

$${p}_{n}^{\text{exp}}(c|ab)=\frac{1}{100}\left[n\,{p}_{\text{QC}}^{\text{exp}}(c|ab)+(100-n)\,{p}^{\text{obs}}(c|ab)\right]$$

(27)

where *n* = {0, 1, 2, …, 100}, and subjected each to the same all-against-all threading. We note that this group of models is but a small subset of the reference states that occur in this region. Random perturbations of any of the models, similar to the procedure in Section 3.1, reveal that the reference states generated by Eq. (27) are locally optimal models at the particular level of “nativeness” (i.e., distance from *p*^{obs}(*c*|*ab*)) (results not shown). Therefore, consideration of the 100 models here should serve to evaluate locally optimal models in this region.

The “complete information” limit, described above, occurs at *n* = 0, while QC is generated at *n* = 100. The informatic quantities that result from this set of models are plotted in Figure 4, spanning the range from the state labelled “A” (at *n* = 0) to state “B” (at *n* = 100). For the right half of the figure, another 100 models were generated in a similar fashion, this time forming a gradient from ${p}_{\text{QC}}^{\text{exp}}(c|ab)$ to the uniformly distributed reference state:

$${p}_{\text{U}}^{\text{exp}}(c|ab)=\begin{cases}\dfrac{1}{200} & \text{for}\ a\ne b\\[1ex] \dfrac{1}{400} & \text{for}\ a=b\end{cases}$$

(28)

The information map, which explores the range of reference states, from the state encoding total knowledge of the contact probabilities (“A”) to the state encoding no knowledge (“C”). These reference states are generated **...**

This reference state, marked as state “C” in the figure, assumes equal probabilities of finding any pair of amino acids. This is the extreme case of “ignorance”, in which even the most basic information, the uneven composition of the sequence universe, is not taken into account. This is obviously neither a practical nor an acceptable model, and is included here only to serve as a limit.
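As a quick arithmetic check, the values in Eq. (28) form a normalized distribution over the 210 unordered pairs of the 20 amino acids: 190 unlike pairs at 1/200 and 20 like pairs at 1/400. The sketch below verifies this with exact fractions; the names `AMINO_ACIDS` and `uniform_reference` are illustrative only.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def uniform_reference():
    """Eq. (28): 1/200 for unlike pairs, 1/400 for like pairs."""
    return {
        (a, b): Fraction(1, 400) if a == b else Fraction(1, 200)
        for a, b in combinations_with_replacement(AMINO_ACIDS, 2)
    }


p_u = uniform_reference()
n_unlike = sum(1 for a, b in p_u if a != b)  # number of a != b pairs
n_like = sum(1 for a, b in p_u if a == b)    # number of a == b pairs
total = sum(p_u.values())                    # 190/200 + 20/400
```

The factor of two between the two cases mirrors the factor *k* used for unlike pairs in the quasi-chemical expressions: an unordered *ab* pair stands for both the *ab* and *ba* orderings.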

The four reference states studied thus far are included in Figure 4, with the location of QC indicated by the dashed vertical line. First, we observe that the score gap *J _{q}* (

The right side of the plot illustrates what happens when less and less prior knowledge is used to construct the reference state probabilities. Any attempt to increase “information” (as measured by the native term), by lowering the prior knowledge level, is offset by a proportional decrease of the decoy term, to produce a nearly constant *J _{q}* (

We examine a concrete example of a reference state that incorporates significantly more native characteristics (i.e., a model located in the left side of the information map in Figure 4). This reference state utilizes the quasi-chemical approximation not on amino acid composition (QC) but on the *contact* mole fraction. That is, the expected probability of contact between *a* and *b* is assumed to be proportional to the product of their individual contact mole fractions.

$$p_{\text{CQC}}^{\text{exp}}(c\mid ab)=k\,\chi_{a}^{c}\chi_{b}^{c}=k\left(\frac{\sum_{i}n_{i}^{c}(a)}{\sum_{i}N_{i}^{c}}\right)\left(\frac{\sum_{i}n_{i}^{c}(b)}{\sum_{i}N_{i}^{c}}\right)$$

(29)

where
${n}_{i}^{c}(a)$ is the number of contacts of *a* in protein chain *i*, *k*= 2 if *a* ≠ *b* and 1 otherwise,
${N}_{i}^{c}$ is the total number of contacts in protein *i*, and the summation covers all protein chains in the data set. This reference state, which we shall call CQC (as a reminder that this is the quasi-chemical approximation applied to the contact mole fraction), is analogous to the GKS scale.^{13}^{; }^{29}
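A minimal sketch of how a CQC-type reference state could be tabulated from per-chain contact counts is given below. It is an illustration under stated assumptions, not the code used in this work: the function names and the toy counts are hypothetical, and the contact mole fractions are normalized by the sum of the per-residue contact counts so that they add up to one. Whether that matches the normalization by the total contact count in Eq. (29) depends on how contacts are counted, which is an assumption here.

```python
from itertools import combinations_with_replacement


def contact_mole_fractions(chains):
    """Pool per-residue contact counts over all chains and normalize.

    chains: list of dicts, one per chain, mapping residue type to the
    number of contacts made by that residue type in the chain.
    """
    totals = {}
    for chain in chains:
        for residue, count in chain.items():
            totals[residue] = totals.get(residue, 0) + count
    grand_total = sum(totals.values())
    return {residue: count / grand_total for residue, count in totals.items()}


def cqc_reference(chains):
    """Expected contact probability k * chi_a^c * chi_b^c, with k = 2 if a != b."""
    chi = contact_mole_fractions(chains)
    return {
        (a, b): (1 if a == b else 2) * chi[a] * chi[b]
        for a, b in combinations_with_replacement(sorted(chi), 2)
    }


# Toy data set of two chains and three residue types (hypothetical counts).
chains = [{"A": 4, "L": 10, "K": 2}, {"A": 6, "L": 12, "K": 6}]
p_cqc = cqc_reference(chains)
```

With this normalization the pair probabilities sum to one, since the sum over unordered pairs of *k* times the product of two mole fractions equals the square of the sum of the mole fractions.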

While superficially similar, CQC and QC (Eq. (19)) differ significantly in the use of information. The latter uses the amino acid mole fraction *χ _{a}*, while the former uses the contact mole fraction
${\chi}_{a}^{c}$. The difference arises from the fact that
${\chi}_{a}^{c}$ is dependent not only on the mole fraction of

Operationally, using the CQC model has the effect of disregarding the influence of hydropathy in the contact potential. Viewed in terms of information, QC-based potentials include both the information on intrinsic contact propensities contained in CQC-based potentials and the information contained in the hydropathy of individual amino acids. Thus, the CQC reference state can be said to hold significantly more native properties than QC, and therefore should be expected to occur on the left side of the information map in Figure 4.

The low correlation between CQC and the QC variants (Table II) and the plot comparing their *e _{q}* (

$$h(a)=\frac{1}{\sum_{x}p^{\text{obs}}(c\mid ax)}$$

(30)

Values for the *h*(*a*) index can be found in Table III, along with three representative hydrophobicity/hydropathy indices taken from the literature^{30}^{–}^{32} for comparison. The strong correlations among them, summarized in Table IV, confirm that the quantity *h*(*a*) is, indeed, both a kind of data-derived hydrophobicity index, as well as a measure of the amount of information incorporated in score functions that use variants of QC but not those that use CQC.

Comprehensive threading results under the CQC model are summarized in Table I. Because hydropathy is no longer incorporated in the resulting contact potential, the informatic quantities *I*, *D*, and *J* are significantly lower than those of the QC models. Indeed, these numbers show that CQC occurs on the left side of the information map in Figure 4. Consequently, its performance, as measured by *r*, is diminished. These observations are expected. Score functions that use QC variants have been designed with the explicit purpose of summarizing all sequence-dependent pairwise contact information, including transfer energy or hydropathy information. CQC, on the other hand, concerns itself only with actual pairwise propensities, without regard to location in the folded protein. This potential embodies only specific residue-residue interactions and excludes any effects arising from the collapse of the chain and the competitive interactions with respect to the aqueous solvent. Since the former set of score functions includes hydropathy as a factor, they ought to perform significantly better in threading.

This does not mean that QC should always be chosen over CQC in constructing folding potentials. In building total potentials representing all the significant interactions which determine native conformation, surface/solvent effects can be included explicitly to account for the significant preference of hydrophobic residues to be buried in the protein interior. A separate term for this factor, in addition to a QC-type contact potential, may overcount burial effects in the total potential. Weighting parameters are used routinely to calibrate total potentials and limit such overcounting. But perhaps using CQC as the reference state for the contact potential term will ensure the independence of the contact and burial terms, minimizing the need for empirical weighting factors in the total potential expression. Further study is needed to uncover redundancies and untangle the elements which should be included in a total potential function.

The information-theoretic dissection of the score-gap function above, as it relates to the *average* threading behavior across a range of test sequences and decoy conformations, can also guide us in the interpretation of the threading behavior of *individual* proteins.

The total divergence *J _{q}* (

$${J}_{q}(C,S)=\frac{1}{{n}_{\mathit{prot}}}{\sum}_{i}{J}_{q}(C,{s}_{i})$$

(31a)

where

$$J_{q}(C,s_{i})=\sum_{\substack{\text{all pairs}\\ \text{in prot } i}}\left[\frac{n_{i}^{N}(c\mid ab)}{n_{i}^{N}(c)}\log\frac{p^{\text{obs}}(c\mid ab)}{p^{\text{exp}}(c\mid ab)}+\frac{n_{i}^{D}(c\mid ab)}{n_{i}^{D}(c)}\log\frac{p^{\text{exp}}(c\mid ab)}{p^{\text{obs}}(c\mid ab)}\right]$$

(31b)

The same strategy used to study the optimization of *J _{q}* (

$$\frac{n_{i}^{N}(c\mid ab)}{n_{i}^{N}(c)}=p^{\text{obs}}(c\mid ab)$$

(32)

while for the decoy term:

$$\frac{n_{i}^{D}(c\mid ab)}{n_{i}^{D}(c)}=p^{\text{exp}}(c\mid ab)$$

(33)

The quantities on the left hand side are properties of the query sequence *s _{i}*, while the quantities on the right hand side are components of the score function. The latter is derived typically from the statistics of a diverse pool of sequences and conformations, independent of the characteristics of the query sequence.

We begin by examining the decoy term equality first. The fraction on the left hand side is the proportion of contacts assigned to the *ab* pair in threading sequence *i* through the ensemble of decoy conformations. Using the quasi-chemical approximation, this fraction is proportional to the product of the mole fractions of *a* and *b* in protein *i*. On the other hand, also guided by the quasi-chemical approximation, the right hand side is proportional to the product of the mole fractions of *a* and *b* in the universe (*U*) of sequences. Thus, Eq. (33) can be approximated by the following:

$${\chi}_{i}(a){\chi}_{i}(b)={\chi}_{U}(a){\chi}_{U}(b)$$

(34)

for all *ab* residue pairs that exist in protein *i*. This system of 190 equations and 38 unknowns has one non-trivial solution:

$${\chi}_{i}(x)={\chi}_{U}(x)$$

(35)

for all amino acids x. Thus, *the decoy term is optimized if the amino acid composition of the data set used to construct the score function approximates that of the query protein i*. Enforcing this restraint may also provide the conditions to satisfy the first equality, Eq. (32). One way to allow for the possibility that the native distribution of contacts of a query sequence is similar to that found in the data set would be for both to have similar compositions. Conceptually, one can think of the native conformation as a coalescence of the most desirable (highest scoring) pairs in a competitive environment of a multitude of possible pairings. A different composition will alter the competitive environment, and therefore also the “winning” distribution of native contacts. In other words, the mutual information or “energy” of a particular amino acid pairing is a measure of its contact propensity in native conformations *relative* to the energies of the other contacting *and* non-contacting pairs. As a thought experiment, one may imagine observing exclusive binary contacts between an *ab* pair and a *cd* pair in a native conformation. Taking out the *a* residue from the sequence may affect the likelihood of the *cd* pairing, if alternative pairings *bc* or *bd* are more energetically stabilizing. Real proteins are certainly more complex than the binary contact situation just described, but one can hypothesize a similar high-order competition occurring in the collapse and rearrangement of residues in a folded chain.

We return to the comprehensive threading data to examine the validity of these propositions. First, we establish a definition of distance between the composition of a query sequence and that of the data set used to derive the score function. Again, the natural choice is the total divergence equation (Eq. (2)), which defines the distance between the compositions of sequence *i* and of the universe of sequences, *X _{i}* and

$$J({X}_{i},{X}_{U})={\sum}_{x}{\chi}_{i}(x)log\frac{{\chi}_{i}(x)}{{\chi}_{U}(x)}+{\sum}_{x}{\chi}_{U}(x)log\frac{{\chi}_{U}(x)}{{\chi}_{i}(x)}$$

(36)
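The compositional divergence of Eq. (36) is straightforward to compute once the two compositions are available. The sketch below is illustrative only: the helper names are hypothetical, and the pseudocount added to every residue type is an assumption of this example, introduced to keep the logarithms finite when a residue type is absent from a sequence. It also assumes that the sequence contains only letters from the supplied alphabet.

```python
import math
from collections import Counter


def composition(sequence, alphabet, pseudocount=1.0):
    """Mole fractions of each residue type, smoothed by a pseudocount (an assumption)."""
    counts = Counter(sequence)
    total = len(sequence) + pseudocount * len(alphabet)
    return {x: (counts[x] + pseudocount) / total for x in alphabet}


def total_divergence(chi_i, chi_u):
    """Eq. (36): sum of the two directed divergences between compositions."""
    forward = sum(chi_i[x] * math.log(chi_i[x] / chi_u[x]) for x in chi_i)
    backward = sum(chi_u[x] * math.log(chi_u[x] / chi_i[x]) for x in chi_i)
    return forward + backward


alphabet = "ACDEFGHIKLMNPQRSTVWY"
chi_query = composition("MKLLLAAAGGKE", alphabet)        # toy query sequence
chi_universe = composition(alphabet * 5, alphabet)        # toy flat "universe"
j_value = total_divergence(chi_query, chi_universe)
```

The divergence is symmetric in its two arguments and vanishes only when the two compositions coincide, which is what makes it usable as a distance-like measure here.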

The value of the compositional divergence *J* ( *X _{i}*,

The action of the score function on individual protein chains. Each point in these two plots represents one protein chain. (A) The quantity *J* ( *X*_{i}, *X*_{U}) is the distance (measured by divergence) between the amino acid mole fraction composition of protein **...**

These measurements lend support to an alternative approach to constructing potentials. There have been a number of studies that point to the influence of the data set on the performance of score functions in protein folding.^{33}^{–}^{36} If compositional divergence is partly responsible for the database dependence of the effectiveness of potential functions, a viable strategy would be to tailor potentials with respect to the composition of the query sequence. The idea is to select only chains in the structural database with low compositional divergence relative to the query sequence, in constructing the potential to be used to fold or thread that sequence. The notion of “sequence-specific potentials” has been advanced by Skolnick and co-workers,^{10} who have weighted the contribution, to the potential, of each occurrence in the data set by the relative similarity of its local sequence to the query sequence. In this study, we expand the concept of query-specific potentials, by suggesting that composition, a global sequence characteristic, can also be used as a parameter to tailor potentials. Amino acid composition, of course, has been shown to contain information about the overall secondary structure content of proteins.^{37}^{–}^{39} In this work, we see the outline of an informatic connection between composition and long-range interactions.
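One way such a selection step might look is sketched below. It is a hypothetical illustration, not a description of any implemented method: the function `select_similar_chains`, the `keep_fraction` parameter, and the toy three-letter compositions are all assumptions made for the example. Chains are ranked by their compositional divergence from the query, and only the most similar fraction is retained for deriving the potential.

```python
import math


def total_divergence(p, q):
    """Symmetric divergence J between two compositions with no zero entries."""
    return sum((p[x] - q[x]) * math.log(p[x] / q[x]) for x in p)


def select_similar_chains(query_comp, db_comps, keep_fraction=0.25):
    """Return the identifiers of the chains closest in composition to the query.

    keep_fraction is a hypothetical tuning parameter, not a value from the text.
    """
    ranked = sorted(
        db_comps, key=lambda cid: total_divergence(query_comp, db_comps[cid])
    )
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:n_keep]


# Toy compositions over a three-letter alphabet (hypothetical values).
query = {"A": 0.5, "L": 0.3, "K": 0.2}
database = {
    "chain1": {"A": 0.5, "L": 0.3, "K": 0.2},
    "chain2": {"A": 0.2, "L": 0.2, "K": 0.6},
    "chain3": {"A": 0.45, "L": 0.35, "K": 0.2},
    "chain4": {"A": 0.1, "L": 0.8, "K": 0.1},
}
selected = select_similar_chains(query, database, keep_fraction=0.5)
```

A fixed fraction is used here only for simplicity; a divergence threshold, or a weighting of each chain by its divergence, would be equally plausible design choices.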

We point out that we are pursuing issues relating to query-specific potentials in current work. Early results indicate that using only the most compositionally-similar part of the structural data set may, on average, be better than using the entire data set, despite the “representative” or nonredundant nature of the entire data set (Solis & Rackovsky, unpublished results).

Using basic information-theoretic concepts, we set out to understand the role of the reference state in the action of folding potentials in fold recognition. We utilize information-theoretic quantities derived previously^{7}^{; }^{8} to represent typical behavior of log-odds score functions in threading. We particularly study the total divergence, *J _{q}* (

An explicit formulation of the discrimination score results in the establishment of a connection between the probability distributions which describe the reference state and the ensemble of decoy conformations. We find that the decoy term is maximized when these two distributions are identical. The farther the reference state probabilities are from the decoy ensemble probabilities, the lower the total divergence *J _{q}* (

In the vicinity of the optimal reference state distribution, relationships are less well-defined. We examined the contact reference states derived from the quasi-chemical approximation (QC), the most commonly used model, and two variant models (QC1 and QC2). The basic quasi-chemical assumption takes the expected contact propensity to be proportional only to the mole fractions of the amino acids in the universe of natural sequences. The variants involve the inclusion of characteristics of protein chains: namely, that amino acid residues interact not in a “gas phase” but within chains of finite length, and whose amino acid distributions are variable. These variants, while increasing the decoy term of *J _{q}* (

A fourth reference state (DB) was derived directly using raw statistics, arising from the threading of a set of sequences through the ensemble of decoys. This kind of reference state realizes completely the equality between the probabilities of the decoy ensemble and the reference state, and therefore produces the highest decoy term divergence. However, as in the case of the two quasi-chemical variants, the decrease in mutual information is larger, bringing about a lower *J _{q}* (

We compared the informatic properties of the four reference states (QC, QC1, QC2, and DB). The difference among them lies in the amount of nativeness included explicitly in the model. In the attempt to approach the essential meaning of “expected” probabilities, more sophisticated models encode natural protein properties like finite length chains, uneven amino acid composition, and different conformations from chain to chain. It is conceptually appropriate to include such properties in the model. We find, however, that any amount of nativeness encoded in the reference state model does not significantly improve threading performance.

Another class of reference state models (CQC) uses the quasi-chemical approximation to partition the number of contacts according to the proclivity of the amino acids to be in contact in native folds. As a consequence, contacts made by amino acids that prefer to be in the protein interior receive lower scores than those given by QC-type models. This is because, operationally, the prior expectation of contact for those amino acids has already been built into the score function. Conversely, contacts formed by amino acids found frequently on the protein surface are scored higher. Figure 2 illustrates the difference between CQC and QC. The preference to be on the protein surface or in the interior, of course, has been universally recognized as a consequence of the relative hydropathies of amino acid side chains. Thus, encoding these sequence-specific properties into the reference state has the effect of removing useful information about hydrophobic and solvent interactions from the contact potential. Results of all-against-all threading using CQC show that, indeed, the effectiveness of the score function is significantly depressed by the absence of hydropathy information in discrimination. We confirm the fundamental nature of the “missing” information further by building an amino acid index that quantifies the informatic difference between CQC and QC. This index, found in Table III, shows remarkable correlation with a number of hydrophobicity indices derived through other means. Though CQC-based contact potentials do not perform as well as QC-based ones, CQC may prove to be a useful component of a total potential if that potential also accounts for hydropathy and solvent accessibility explicitly.

The information map in Figure 4 is a useful guide to the range of reference states possible for contact potentials. The QC-based potentials are located at the region of greatest discrimination (i.e., highest *J*), which is consistent with their superior performance in threading. Reference states that contain varying amounts of “nativeness” are located in the left part of the map, among them the CQC reference state, which includes prior information about the relative hydropathy of amino acids (and therefore their location in the folded protein relative to the protein surface). Potentials which use these reference states see a depressed level of discrimination, owing to the increased prior knowledge contained in the base reference state. Conversely, located on the right side of the map are reference states that do not completely incorporate prior knowledge of the amino acid composition of natural proteins. It can be observed from this region of the information map that discrimination (as measured by *J*) cannot be artificially increased by any attempt to “increase” information (e.g., by removing prior information about amino acid composition from the reference state).

A promising avenue for the development of better potentials is suggested by our analysis of the action of the score gap function on *individual* protein sequences. If the same optimization is done to the two terms of this expression, an interesting prescription arises: that the individual terms are optimized if the mole fractions of the decoy set and the reference state are equal. This result suggests that potentials have the capacity to perform better when the compositional properties of the data set used to derive the score function probabilities are similar to the properties of the sequence of interest. When one constructs the “expected” probabilities to form the reference state, one is really interested in the proportion of contacts that ought to be expected for the particular sequence being analyzed. Only similar sequences in the database will show the same “expected” contact behavior. Extracting parameters from similar sequences is also consistent with the real meaning of the other component of the score function: the “observed” contact probabilities. Sequences of near-identical amino acid compositions have available to them very similar possible permutations of contact (allowing for chain connectivity). Therefore, the set of contacts that finally “wins out” in the competition for the final conformation (i.e., the “observed” cases) should be a good indication of the relative “energies” of each contact pair. Our data support these propositions remarkably well. We find that the greater the divergence between the composition of the test sequence and the data set from which score parameters are derived, the lower the score gap, and the higher the mean rank *r* of the correct conformation. It may make sense, therefore, to use only sequences of similar composition to tailor the score function specifically for a test sequence.

This work was supported by the National Library of Medicine of the National Institutes of Health, through grant LM06789.

1. Hendlich M, Lackner P, Weitckus S, Floeckner H, Froschauer R, Gottsbacher K, Casari G, Sippl MJ. Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J Mol Biol. 1990;216:167–80.

2. Sippl MJ. Boltzmann’s principle, knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J Comput Aided Mol Des. 1993;7:473–501.

3. Godzik A, Kolinski A, Skolnick J. Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets. Protein Sci. 1995;4:2107–17.

4. Sippl MJ. Knowledge-based potentials for proteins. Curr Opin Struct Biol. 1995;5:229–35.

5. Solis AD, Rackovsky S. Optimized representations and maximal information in proteins. Prot Struct Funct Bioinform. 2000;38:149–64.

6. Solis AD, Rackovsky S. Optimally informative backbone structural propensities in proteins. Prot Struct Funct Bioinform. 2002;48:463–86.

7. Solis AD, Rackovsky S. Improvement of statistical potentials and threading score functions using information maximization. Prot Struct Funct Bioinform. 2006;62:892–908.

8. Solis AD, Rackovsky S. Information and discrimination in pairwise contact potentials. Prot Struct Funct Bioinform. 2008;71:1071–1087.

9. Miyazawa S, Jernigan RL. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol. 1996;256:623–44.

10. Skolnick J, Kolinski A, Ortiz A. Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins. 2000;38:3–16.

11. Zhang C, Kim SH. Environment-dependent residue contact energies for proteins. Proc Natl Acad Sci U S A. 2000;97:2550–5.

12. Miyazawa S, Jernigan RL. Estimation of Effective Interresidue Contact Energies from Protein Crystal Structures: Quasi-Chemical Approximation. Macromolecules. 1985;18:534–552.

13. Skolnick J, Jaroszewski L, Kolinski A, Godzik A. Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Protein Sci. 1997;6:676–88.

14. Berrera M, Molinari H, Fogolari F. Amino acid empirical contact energy definitions for fold recognition in the space of contact maps. BMC Bioinformatics. 2003;4:8.

15. Chen WW, Shakhnovich EI. Lessons from the design of a novel atomic potential for protein folding. Protein Sci. 2005;14:1741–52.

16. Chelli R, Gervasio FL, Procacci P, Schettino V. Inter-residue and solvent-residue interactions in proteins: a statistical study on experimental structures. Proteins. 2004;55:139–51.

17. McConkey BJ, Sobolev V, Edelman M. Quantification of protein surfaces, volumes and atom-atom contacts using a constrained Voronoi procedure. Bioinformatics. 2002;18:1365–73.

18. Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002;11:2714–26.

19. Liang S, Liu S, Zhang C, Zhou Y. A simple reference state makes a significant improvement in near-native selections from structurally refined docking decoys. Proteins. 2007;69:244–53.

20. Liu S, Zhang C, Zhou H, Zhou Y. A physical reference state unifies the structure-derived potential of mean force for protein folding and binding. Proteins. 2004;56:93–101.

21. Cover TM, Thomas JA. Elements of Information Theory. 2nd ed. New York: Wiley; 2006.

22. Sippl MJ. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol. 1990;213:859–83.

23. Jernigan RL, Bahar I. Structure-derived potentials and protein simulations. Curr Opin Struct Biol. 1996;6:195–209.

24. Lazaridis T, Karplus M. Effective energy functions for protein structure prediction. Curr Opin Struct Biol. 2000;10:139–45.

25. Russ WP, Ranganathan R. Knowledge-based potential functions in protein design. Curr Opin Struct Biol. 2002;12:447–52.

26. Skolnick J. In quest of an empirical potential for protein structure prediction. Curr Opin Struct Biol. 2006;16:166–71.

27. Godzik A. Fold recognition methods. Methods Biochem Anal. 2003;44:525–46.

28. Mirny LA, Shakhnovich EI. How to derive a protein folding potential? A new approach to an old problem. J Mol Biol. 1996;264:1164–79.

29. Godzik A, Kolinski A, Skolnick J. Topology fingerprint approach to the inverse protein folding problem. J Mol Biol. 1992;227:227–38.

30. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157:105–32.

31. Kidera A, Konishi Y, Oka M, Ooi T, Scheraga HA. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Prot Chem. 1985;4:23–55.

32. Casari G, Sippl MJ. Structure-derived hydrophobic potential. Hydrophobic potential derived from X-ray structures of globular proteins is able to identify native folds. J Mol Biol. 1992;224:725–32.

33. Dehouck Y, Gilis D, Rooman M. Database-derived potentials dependent on protein size for in silico folding and design. Biophys J. 2004;87:171–81.

34. Furuichi E, Koehl P. Influence of protein structure databases on the predictive power of statistical pair potentials. Proteins. 1998;31:139–49.

35. Rykunov D, Fiser A. Effects of amino acid composition, finite size of proteins, and sparse statistics on distance-dependent statistical pair potentials. Proteins. 2007;67:559–68.

36. Zhang C, Liu S, Zhou H, Zhou Y. The dependence of all-atom statistical potentials on structural training database. Biophys J. 2004;86:3349–58.

37. Nakashima H, Nishikawa K, Ooi T. The folding type of a protein is relevant to the amino acid composition. J Biochem. 1986;99:153–62.

38. Eisenhaber F, Frommel C, Argos P. Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins. 1996;25:169–79.

39. Eisenhaber F, Imperiale F, Argos P, Frommel C. Prediction of secondary structural content of proteins from their amino acid composition alone. I. New analytic vector decomposition methods. Proteins. 1996;25:157–68.
