We report a C‐atom‐based scoring function, named OPUS‐CSF, for ranking protein structural models. Rather than using the traditional Boltzmann formula, we built a scoring function (CSF score) based on the native distributions (derived from the entire PDB) of coordinate components of mainchain C (carbonyl) atoms on selected residues of peptide segments of 5, 7, 9, and 11 residues in length. When tested on decoy recognition, OPUS‐CSF recognized up to 257 native structures out of 278 targets in 11 commonly used decoy sets, significantly outperforming other popular all‐atom empirical potentials. The average correlation coefficient with TM‐score was also comparable with those of other potentials. OPUS‐CSF is a highly coarse‐grained scoring function that requires only partial mainchain information as input and is very fast. Thus, it is suitable for applications at an early stage of structure building.
A potential function plays a central role in predicting protein structures. Generally, there are two kinds of potential functions: physics‐based potentials and knowledge‐based potentials. Physics‐based potentials typically are the all‐atom molecular mechanics force‐fields,1, 2, 3, 4, 5 such as CHARMM1,2 and AMBER.4 They also include coarse‐grained potentials such as MARTINI,6 UNRES7, 8 and OPEP.9
The knowledge‐based potentials are derived from statistical analysis of known structures and are widely used in structural prediction.10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41 They usually perform better than the physical potentials in structural prediction. In general, knowledge‐based potentials can be constructed either at the coarse‐grained residue level17, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 or at the atomic level.32, 33, 34, 35, 36, 37, 38, 39, 40, 41 Although coarse‐grained potentials may not be rigorous, coarse graining helps to focus on essential features and excludes less important details, thus reducing computational cost.42, 43 The performance of a coarse‐grained potential depends on how the coarse‐graining scheme is designed. For example, the OPUS‐Ca potential30 uses the positions of Cα atoms as input, calculates other atomic positions as pseudo‐positions, and significantly reduces the computing cost. Other applications of coarse‐grained models using Cα positions are also reported in the literature.44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55
In this work, unlike traditional empirical potential functions that use the Boltzmann formula, we built a scoring function based on the native distributions of coordinate components of mainchain C (carbonyl) atoms on a few selected residues of small peptide segments of 5, 7, 9, and 11 residues in length. A lookup table, termed the configurational native distribution (CND) lookup table, was first generated for the native distributions of coordinate components by analyzing peptide segments in the entire Protein Data Bank (PDB). Then the scoring function, termed the CSF scoring function, was calculated for a particular test structure by comparing the information of its segments with the CND lookup table. The performance of OPUS‐CSF was tested on 11 commonly used decoy sets. The results indicated that OPUS‐CSF was able to identify significantly more native structures from their decoys than other empirical potentials. The correlation coefficients between CSF scores and TM‐scores were comparable to those of popular all‐atom empirical potentials. Most importantly, OPUS‐CSF achieved such performance despite its highly coarse‐grained nature. This indicates the advantages of OPUS‐CSF in terms of speed and applicability at the early stage of structural modeling, which is vitally important for applications such as building structural models from intermediate‐resolution data from experimental techniques like cryogenic electron microscopy (cryo‐EM).
We compared the performance of OPUS‐CSF on 11 commonly used decoy sets with that of popular all‐atom potential functions. In Table 1, we list the results of the 5‐residue segment case (OPUS‐CSF5) and the all‐segment combined case (OPUS‐CSF). In the 5‐residue segment case, OPUS‐CSF5 successfully recognized 244 out of 278 native structures from their decoys and had an average Z‐score (–3.56) nearly identical to that of GOAP (–3.57). In the combined segment case, OPUS‐CSF performed even better: it successfully recognized 257 out of 278 native structures from their decoys and had an average Z‐score (–4.12) better than that of GOAP (–3.57). It is interesting that although OPUS‐CSF is a highly coarse‐grained scoring function, its performance is significantly better than that of other all‐atom potentials.
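The two metrics reported above can be illustrated with a short sketch. The exact Z‐score definition is not spelled out in this text, so the code assumes a common convention: the native score minus the mean decoy score, divided by the population standard deviation of the decoy scores. The function names are hypothetical, and a native structure is counted as recognized when its score is strictly lower than every decoy score.

```python
from statistics import fmean, pstdev

def native_z_score(native_score, decoy_scores):
    """Z-score of the native score relative to the decoy score distribution.

    Assumed convention (not stated in the text): decoys only, population std.
    More negative values mean the native stands out more clearly.
    """
    return (native_score - fmean(decoy_scores)) / pstdev(decoy_scores)

def native_recognized(native_score, decoy_scores):
    """True when the native has the lowest score (lower score = more native-like)."""
    return native_score < min(decoy_scores)
```

Counting `native_recognized` over all targets of a decoy set would give a recognition count of the kind listed in Table 1, under the assumptions above.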
We also calculated the Pearson's correlation coefficients between CSF score and TM‐score56 in all decoy sets. The results are shown in Table 2. OPUS‐CSF has comparable average correlation coefficient with those of GOAP and OPUS‐PSP despite the fact that OPUS‐CSF is highly coarse‐grained and the other two are all‐atom potentials.
For further analysis of the method, we use the 5‐residue segment case as an example. Figure 1 shows the histogram of standard deviations of the coordinate components of mainchain C (carbonyl) atoms of the 1st and 5th residues in the CND lookup table. It is clear that the distribution peaks at a very small value, indicating that the coordinate components are clustered in a narrow distribution, that is, the configurational distributions of the 5‐residue peptide segments are narrow,57 which provides a foundation for the success of OPUS‐CSF. The narrow configurational distribution of small peptide fragments is also seen in other studies.58 In addition, the average value of the standard deviation is 1.20 Å.
It should be noted that, in the implementation of OPUS‐CSF, we assume that the smaller the CSF score, the more likely the structure is to be native. This is an approximation because even a native structure does not usually have a zero CSF score. However, the narrow distributions of standard deviations of the coordinate components of mainchain C (carbonyl) atoms (Fig. 1) suggest small scores for the native structures. Figure 2 shows a population distribution of the CSF scores for 278 native structures in 11 decoy sets (per independent coordinate component). The average value of the native CSF scores is 0.84 and the standard deviation is 0.27. Thus, in native structures, the deviations of the coordinate components from their average values are, on average, less than one standard deviation of the coordinate component distribution in the CND lookup table. The fluctuation of the native CSF scores is also very small.
Figure 3 shows how frequently sequences repeat in the CND lookup table in the 5‐residue case. In principle, the more times a sequence repeats in PDB, the better the statistics for that sequence in the CND lookup table. In the 5‐residue case, half of the sequences repeat >26 times in the distribution. The largest value on the X‐axis is 29,618, reached by a single sequence. In constructing the CND lookup table, there is always a trade‐off between sequence diversity and sequence repeating frequency in PDB.
We examined OPUS‐CSF using segments of different lengths. As the segment length increases, the ratio of the number of segments that appear more than five times to the total number of segments in PDB naturally decreases (Table 3). Likewise, if Coverage is defined as the ratio between the number of segments available in the CND lookup table and the total number of segments of a test sequence, the average coverage of the 11 decoy sets (278 targets in total) decreases as the segment length increases. A test sequence is regarded as Unknown if <20% of its segments are available in the CND lookup table, that is, if its coverage is <20%; the number of unknowns increases as the segment length increases. More details of OPUS‐CSF on different segment lengths can be found in Supplemental Information.
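The Coverage and Unknown definitions above can be written as a short sketch. The function names and the representation of the CND lookup table as a set of segment sequences are assumptions made for illustration only.

```python
def coverage(sequence, table_sequences, length):
    """Fraction of the test sequence's segments that exist in the lookup table.

    `table_sequences` is assumed to be a set of segment strings of size `length`.
    """
    n_total = len(sequence) - length + 1
    if n_total <= 0:
        return 0.0  # sequence shorter than one segment
    n_found = sum(
        1 for i in range(n_total) if sequence[i:i + length] in table_sequences
    )
    return n_found / n_total

def is_unknown(sequence, table_sequences, length, threshold=0.20):
    """A target counts as Unknown when its coverage is below 20%."""
    return coverage(sequence, table_sequences, length) < threshold
```

For instance, a 7‐residue sequence has three 5‐residue segments, so a table containing exactly one of them gives a coverage of 1/3.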
The 5‐residue case delivers the best performance in terms of decoy recognition (244 out of 278 native structures recognized; Table 4). However, the Z‐scores are better for longer‐segment cases. This is probably because the longer segments preserve more sequence homology information.
For the 5‐residue case, we also tested constructing the CND lookup table using four residues (1, 2, 4, and 5) instead of the two terminal residues (1, 5). The number of native structures recognized and the Z‐score are 226 and −3.60, whereas in the case of (1, 5) they are 244 and −3.56 (as indicated in Table 4). This is very interesting, as it indicates that using the two terminal residues (1, 5) captures a better coarse‐graining level than using more residues (1, 2, 4, and 5).
OPUS‐CSF has some obvious advantages. First, the CND lookup table is constructed directly from the entire PDB, and it contains all allowed configurational information of the native segments (at least for the ones repeated more than five times in PDB). The results seem to indicate that it is better than Boltzmann formula based methods. Second, OPUS‐CSF is very fast, especially for longer polypeptide chains. This is because the entire chain is scanned once and linearly, and only partial mainchain atom coordinates are required to calculate the CSF score for a structure. Unlike other potentials such as GOAP40 and OPUS‐PSP,34 no inter‐atomic distances need to be calculated. We want to emphasize that, in modeling protein structures, an empirical potential function or a scoring function should be fast and accurate. In the early stage of modeling, it is advantageous for the scoring function to require a minimal amount of structural information. In this regard, OPUS‐CSF seems to be a good choice.
Scanning through the polypeptide chain with a step size of one residue, we collected small peptide segments with sequence lengths of 5, 7, 9, and 11 residues and searched for their configurations in the entire PDB. In total, we downloaded 130,054 PDB structures on June 7, 2017 via ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb. The sequences that appeared fewer than five times in PDB were discarded. The number five was chosen empirically. Peptide segments with poorly resolved structures, such as those with broken bonds, were not included.
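The scanning and filtering step can be sketched as follows. The function names are hypothetical, sequences are assumed to be one‐letter strings, and the threshold is taken to mean that sequences seen fewer than five times are dropped (that is, a count of at least five is kept). Structural quality filtering is not shown.

```python
from collections import Counter

def sliding_segments(sequence, length):
    """Yield (start_index, segment) for every window, with a step size of one."""
    for start in range(len(sequence) - length + 1):
        yield start, sequence[start:start + length]

def frequent_segments(sequences, length, min_count=5):
    """Count segment sequences over all chains and keep those seen >= min_count times."""
    counts = Counter()
    for seq in sequences:
        for _, segment in sliding_segments(seq, length):
            counts[segment] += 1
    return {segment: n for segment, n in counts.items() if n >= min_count}
```

Calling `frequent_segments` once per segment length (5, 7, 9, and 11) would give the set of sequences eligible for the lookup table under these assumptions.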
Here we use 5‐residue segment case as an example to illustrate the details of the procedure. The ratio of segments that appear more than five times to all segments in PDB is 75.1%, which means we can utilize 75.1% of the information in the whole PDB using 5‐residue segments (also see Table 3 in Results and Discussion).
A local molecular coordinate system was defined for every segment using the positions of three main‐chain atoms in the middle residue. The origin was set at the Cα atom, the X‐axis was defined along the line connecting the Cα and C (carbonyl) atoms, the Y‐axis was in the Cα‐C‐O plane, parallel to the component of the C‐O vector that was perpendicular to the X‐axis, and the Z‐axis was defined correspondingly (Fig. 4).
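A minimal sketch of this frame construction is given below. It assumes Cartesian coordinates as 3‐tuples and a right‐handed system (Z = X × Y); the handedness is an assumption, since the text only says the Z‐axis is defined correspondingly. The function names are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))

def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return tuple(vi / norm for vi in v)

def local_frame(ca, c, o):
    """Return unit (x, y, z) axes from the CA, C and O atoms of the middle residue."""
    x_axis = _unit(_sub(c, ca))                      # along CA -> C
    co = _sub(o, c)                                  # C -> O vector
    along_x = _dot(co, x_axis)
    # Remove the X component of C-O so that Y lies in the CA-C-O plane, normal to X.
    y_axis = _unit(tuple(v - along_x * x for v, x in zip(co, x_axis)))
    z_axis = _cross(x_axis, y_axis)                  # assumed right-handed
    return x_axis, y_axis, z_axis

def to_local(point, ca, frame):
    """Express `point` in the local frame whose origin is the CA atom."""
    d = _sub(point, ca)
    return tuple(_dot(d, axis) for axis in frame)
```

Applying `to_local` to the carbonyl C atoms of the recorded residues yields the coordinate components used in the statistics.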
For a 5‐residue segment with a specific sequence, we saved the mainchain C (carbonyl) coordinates of the 1st and 5th residues in the local coordinate system, denoted as (x1, y1, z1) and (x5, y5, z5). Under our assumption, we treated the coordinate components as six independent variables. By scanning through the entire PDB, we generated six independent distributions of these variables, called configurational native distributions (CNDs) of 5‐residue segments. We then calculated the means and standard deviations of the distributions and stored them as the CND lookup table.
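One possible way to organize such a table is sketched below. The input format (pairs of segment sequence and a six‐component coordinate tuple), the function name, and the use of the population standard deviation are assumptions for illustration; the text does not specify them.

```python
from collections import defaultdict
from statistics import fmean, pstdev

def build_cnd_table(observations, min_count=5):
    """Build {sequence: (means, stds)} from (sequence, coords) observations.

    Each `coords` is a tuple of local coordinate components, e.g. six values
    for a 5-residue segment. Sequences seen fewer than `min_count` times are dropped.
    """
    grouped = defaultdict(list)
    for sequence, coords in observations:
        grouped[sequence].append(coords)

    table = {}
    for sequence, rows in grouped.items():
        if len(rows) < min_count:
            continue
        columns = list(zip(*rows))  # one column per independent variable
        means = tuple(fmean(col) for col in columns)
        stds = tuple(pstdev(col) for col in columns)  # assumed population std
        table[sequence] = (means, stds)
    return table
```

Each variable is summarized independently, which mirrors the independence assumption stated above.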
For a test structure, we scanned through its sequence with 5‐residue segments. For each segment, we looked up its sequence in the CND lookup table and calculated the Z‐scores of the six independent variables from the stored means and standard deviations. Finally, we summed the absolute values of the Z‐scores of all variables over all segments; this sum is the CSF score. We assume that the structure with the smallest CSF score has the largest likelihood of being the native structure.
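The scoring step can be sketched as a sum of absolute Z‐scores. The table layout ({sequence: (means, stds)}), the function name, and the handling of segments missing from the table or with zero standard deviation (both skipped here) are assumptions, not details given in the text.

```python
def csf_score(segments, table):
    """Sum |z| over all variables of all segments found in the lookup table.

    `segments` is an iterable of (sequence, coords) for the test structure.
    Returns (score, number_of_segments_used).
    """
    total = 0.0
    used = 0
    for sequence, coords in segments:
        entry = table.get(sequence)
        if entry is None:
            continue  # assumption: segments absent from the table contribute nothing
        means, stds = entry
        for value, mean, std in zip(coords, means, stds):
            if std > 0:  # assumption: skip degenerate variables
                total += abs(value - mean) / std
        used += 1
    return total, used
```

Among a native structure and its decoys, the model with the smallest returned score would be taken as the most native‐like.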
The segments of varying lengths are denoted as 5(1, 3, 5), 7(2, 4, 6), 9(1, 3, 5, 7, 9) and 11(2, 4, 6, 8, 10). In the notation 5(1, 3, 5), for example, the first number, 5, is the segment length; 1 and 5 in the parentheses are the residues whose C (carbonyl) atom positional distributions are recorded in the local coordinate system; and 3 is the residue on which the local coordinate system is defined. For 9(1, 3, 5, 7, 9) and 11(2, 4, 6, 8, 10), four atoms are used for recording mainchain C (carbonyl) positional distributions, so 12 independent variables are used in total.
The CSF score can be calculated either for one particular segment length or by combining all segment lengths. In the combined case, the final CSF score is a linear sum of the CSF scores of the different segment lengths. No weighting function is introduced for the contributions of different segment lengths.
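The segment notation and the unweighted combination can be captured in a small sketch. The dictionary layout and function names are hypothetical; the residue indices follow the notation above, with the middle listed residue as the frame center and the remaining ones as recorded residues.

```python
# length -> frame-defining residue and recorded residues (1-based positions)
SCHEMES = {
    5:  {"center": 3, "recorded": (1, 5)},
    7:  {"center": 4, "recorded": (2, 6)},
    9:  {"center": 5, "recorded": (1, 3, 7, 9)},
    11: {"center": 6, "recorded": (2, 4, 8, 10)},
}

def n_variables(length):
    """Three coordinate components per recorded carbonyl C atom."""
    return 3 * len(SCHEMES[length]["recorded"])

def combined_csf(scores_by_length):
    """Unweighted linear sum of the per-length CSF scores."""
    return sum(scores_by_length[length] for length in SCHEMES
               if length in scores_by_length)
```

With this layout, the 5‐ and 7‐residue schemes each use six variables and the 9‐ and 11‐residue schemes each use 12, matching the counts stated above.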
The 11 commonly used decoy sets we used to test OPUS‐CSF are the same as those used in GOAP,40 including the decoy sets 4state_reduced,59 fisa,58 fisa_casp3,58 hg_structal, ig_structal and ig_structal_hires (R. Samudrala, E. Huang, and M. Levitt, unpublished), I‐TASSER,39 lattice_ssfit,60, 61 lmds,62 MOULDER63 and ROSETTA.64
The scoring function is freely available to the academic community.
The authors wish to thank Robert L. Jernigan for careful reading of the manuscript and numerous comments on how to improve it. J.M. thanks support from the National Institutes of Health (R01‐GM067801, R01‐GM116280), and the Welch Foundation (Q‐1512). Q.W. thanks support from the National Institutes of Health (R01‐AI067839, R01‐GM116280), the Gillson‐Longenbaugh Foundation, and The Welch Foundation (Q‐1826).