Although the occurrence of significant cooperativity (or anti-cooperativity) is rare among the ATM/ATR, CDK1/Cyclin B and CK2 substrates examined here, a relatively small number of pairs of amino acids that act cooperatively in the bulk context were identified. Most of these do not lend themselves to facile biophysical explanation, with the possible exception of the statistically significant preference for serine over threonine as the phosphorylatable residue in the context of ATM/ATR substrates with proline or glycine in the −1 position, glycine in the +2 position or serine in the +3 position. The difference between a serine and a threonine side chain is quite modest: the threonine has an additional methyl group attached to the beta carbon. The disfavouring of threonine at the phosphorylated position in concert with specific residues at other positions among ATM and ATR substrates is likely caused by one of two biophysical effects: either a steric clash between the threonine methyl group and the residue disfavoured in tandem, or the threonine methyl disfavouring a substrate backbone configuration amenable to kinase–substrate interaction in the context of the other residue.
While no crystal structures exist of the kinases studied here in complex with substrates, a crystal structure has been solved of the kinase Cdk2/Cyclin A in complex with an optimized substrate peptide [18]. Cdk2 shares 66 per cent sequence identity with Cdk1 and has similar substrate specificity. In this structure, there are no close contacts between the amino acid side chains of the substrate peptide, perhaps partially explaining the lack of interdependence seen among Cdk1/Cyclin B substrates.
None of the datasets used in this work is perfectly suited to the task of describing the bulk substrate specificity of a kinase. Although ATM and ATR have a large number of substrates determined in the course of a single study, the two are individual kinases treated, perhaps incorrectly [19], as having identical specificity. Moreover, substrates of the kinases were identified using a cocktail of antibodies [12], and the specificities of these antibodies must be convoluted with those of ATM and ATR to generate the final list of putative substrates. Nonetheless, if ATM preferred different amino acids at individual positions than ATR, then pairs of residues independently preferred at pairs of positions by each of the two kinases would have appeared as being enriched relative to what would be expected under a position-independent model. The curation of CK2 substrates [14], no matter how expertly performed, is subject to the study biases of those who originally reported CK2 substrates in the literature. As with ATM/ATR, however, these biases would be expected to introduce, and not negate, apparent interpositional dependencies. CDK1/Cyclin B substrates were identified by in vitro phosphorylation of lysates with an engineered kinase [13], but the number of substrates identified was small with respect to what might be needed to adequately describe the frequency of pairs of amino acids at pairs of positions. Strikingly, the three data sources examined span a wide range of sizes and cover several collection methodologies. Across all three cases, the same consistent pattern of rare interpositional interaction is found.
The methodology applied here is fundamentally similar to both the statistical coupling analysis developed in the research group of Ranganathan [20] and to the mutual information method of Gfeller et al. [21]. While each of these methods is aimed at identifying interpositional correlations, we chose our method to directly examine statistical significance of enrichment or diminishment of co-occurrence of pairs of amino acids, rather than using information or entropy as an intermediate metric of statistical significance. Our interest is primarily in identifying co-occurrence rates that are poorly explained by chance, and our methods are meant to approach this goal as directly as possible.
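The idea of testing co-occurrence directly against a position-independent expectation can be sketched as follows. This is a minimal illustration, not the procedure used in this work: the function name `pair_enrichment`, the use of an exact two-sided binomial test, and the toy sequences in the usage note are all assumptions made for the example, and no correction for multiple testing is applied.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pair_enrichment(seqs, i, a, j, b):
    """Compare the observed co-occurrence of residue a at position i and
    residue b at position j with the count expected if the two positions
    were independent (hypothetical helper, for illustration only).

    seqs: aligned substrate windows of equal length (list of strings).
    Returns (observed, expected, two_sided_p_value).
    """
    n = len(seqs)
    # First-order (single-position) frequencies estimated from the same data.
    p_a = sum(s[i] == a for s in seqs) / n
    p_b = sum(s[j] == b for s in seqs) / n
    p_pair = p_a * p_b  # expectation under position independence
    observed = sum(s[i] == a and s[j] == b for s in seqs)
    expected = n * p_pair
    # Exact binomial tails; doubling the smaller tail gives a simple
    # (conservative) two-sided p-value, capped at 1.
    upper = sum(binom_pmf(k, n, p_pair) for k in range(observed, n + 1))
    lower = sum(binom_pmf(k, n, p_pair) for k in range(0, observed + 1))
    p_value = min(1.0, 2 * min(upper, lower))
    return observed, expected, p_value
```

For instance, with ten toy windows in which proline at the first position always accompanies serine at the second, `pair_enrichment(["PSQ"] * 5 + ["GTQ"] * 5, 0, "P", 1, "S")` reports 5 observed co-occurrences against 2.5 expected. In a real analysis every pair of residues at every pair of positions would be tested, so the resulting p-values would need to be corrected for multiple comparisons.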
The surprising scarcity of amino acid pairs occurring significantly more or less frequently among a kinase's substrates than would be expected if each amino acid were independently recognized by the kinase might have one of, or a combination of, several explanations. First, it may simply be true that kinases largely recognize each amino acid of their substrates independently, although this seems biophysically implausible. Second, it is possible that kinases incubated in vitro with an infinitely varied library of potential substrates would express statistically significant preferences. In vivo, however, a kinase is exposed to only a subset of the possible amino acid sequence space: it is limited to peptide sequences that are encoded by the genome, present in the same subcellular localization and structurally accessible. This convolution may obscure a kinase's pure biophysical preferences. Finally, there is probably a contribution of effect size. It seems unlikely that each substrate sub-site is recognized independently in kinases, and rather more likely that second-order effects make an energetic contribution that exists but is very modest. The smaller the size of such an effect, the larger the sample of substrates necessary to detect it.
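The relationship between effect size and required sample size can be made concrete with a standard normal-approximation power calculation for a single proportion. The sketch below is illustrative only: the function name `substrates_needed`, the default significance level and power, and the pair frequencies in the usage note are assumptions, not values taken from the datasets analysed here.

```python
from math import ceil, sqrt
from statistics import NormalDist


def substrates_needed(p_expected, fold_change, alpha=0.05, power=0.8):
    """Approximate number of substrates needed to detect that a residue pair
    occurs at fold_change times its independence expectation p_expected.

    Uses the usual normal approximation for a two-sided one-sample test of a
    proportion (hypothetical helper, for illustration only).
    """
    p0 = p_expected
    p1 = p_expected * fold_change
    if not (0 < p0 < 1 and 0 < p1 < 1) or p0 == p1:
        raise ValueError("frequencies must lie in (0, 1) and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)
```

Under these assumptions, a pair expected in 1 per cent of substrates requires far more substrates to reveal a 1.2-fold enrichment than a 2-fold enrichment: `substrates_needed(0.01, 1.2)` is more than an order of magnitude larger than `substrates_needed(0.01, 2.0)`. Correcting for the many pairs tested would raise both figures further.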
A low degree of interpositional dependence in kinase substrate specificity has interesting implications for the evolution of phosphorylation sites and of phosphorylation signalling networks. If each substrate sequence position contributes independently to the ability of a kinase to phosphorylate its substrates, then the evolutionary fitness landscape of substrates as a function of the amino acid at each position is smooth, with a single minimum. That is, there are no non-global local minima acting as traps in kinase substrate fitness space—for any non-ideal substrate, there exist one or more single mutations that would improve the fitness of the substrate for the kinase, with no concerted double or higher-multiple mutations necessary.
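The absence of traps under a strictly position-independent model can be illustrated with a toy additive energy function, in which lower energy stands for a better substrate. The weights in `EXAMPLE_WEIGHTS` and all function names are hypothetical and chosen only for the example; they do not represent measured kinase preferences.

```python
# Hypothetical per-position energies (lower is better); one dict per position.
EXAMPLE_WEIGHTS = [
    {"A": 0.0, "P": -1.0, "G": 0.5},
    {"S": -2.0, "T": -1.0},
    {"Q": -1.5, "A": 0.0},
]


def energy(seq, weights):
    """Additive (first-order) energy: each position contributes independently."""
    return sum(weights[pos][aa] for pos, aa in enumerate(seq))


def improving_single_mutations(seq, weights):
    """All single-residue substitutions that strictly lower the energy."""
    current = energy(seq, weights)
    moves = []
    for pos, column in enumerate(weights):
        for aa in column:
            if aa != seq[pos]:
                mutant = seq[:pos] + aa + seq[pos + 1:]
                if energy(mutant, weights) < current:
                    moves.append(mutant)
    return moves


def descend(seq, weights):
    """Repeatedly accept the best single mutation until none improves."""
    while True:
        moves = improving_single_mutations(seq, weights)
        if not moves:
            return seq
        seq = min(moves, key=lambda m: energy(m, weights))
```

Starting from any sequence over these toy alphabets, `descend` reaches the same end point, the sequence built from the lowest-energy residue at each position, because an additive model lets every position be improved on its own. Adding a pairwise term to `energy` would remove this guarantee, which is the situation described below for some phosphopeptide-binding domains.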
Work by other research groups demonstrates that this property of position-wise independence is not uniformly common to all components of phosphosignalling pathways. Yaffe et al. [22], in studying the phosphopeptide-binding protein 14-3-3, found that substrates adhered to one of two mutually exclusive sequence modes. Liu et al. [23] have demonstrated that SH2 domains, which recognize and bind to phosphotyrosine-containing peptides, will vary in their specificity at some substrate positions as a function of what amino acids are present at other positions. For example, the SH2 domain of Crk binds to peptides with a phosphotyrosine along with a leucine or a proline in the +3 position. Proline at the +2 position, however, is allowed only when the identity of the +3 position is leucine and not proline. Other work by Gfeller et al. [21] has shown a similar property among the SH3, PDZ and WW peptide-binding domains. Although these domains do not bind phosphopeptides in general, a subset of the WW domains (not explicitly studied by Gfeller et al.) does bind specifically to phosphoserine- and phosphothreonine-containing peptides with a proline in the +1 position.
This fundamental difference in the way that kinases and phosphopeptide-binding domains recognize their substrates (kinases in a position-independent manner, but phosphopeptide-binding domains exhibiting clear second- or higher-order preferences) may indicate that the evolution of kinase substrates is a fast or easy step in the evolution of phosphosignalling networks. Single mutations can always improve a non-optimal kinase substrate, whereas the substrates of phosphopeptide-binding domains, which operate in signalling networks to read the phosphorylation events left by kinases, may sometimes require concerted mutations in potential substrates in order to become more fit for binding. Evidence for the existence of significant numbers of functionless phosphorylations [24] is consistent with this possibility: a build-up of non-functional phosphorylations would be expected if kinase–substrate evolution were an easy step in signalling network evolution.
It is clear from the results presented here that the specificity of these kinases for the amino acid sequences proximal to the site of phosphorylation among their in vivo substrates is largely well-described by a first-order model. Adding second-order information to first-order models of kinase specificity, at least for these kinases, seems to add only a minimal benefit in terms of predicting novel substrates. In some cases, adding second-order information even reduces the quality of a first-order model, indicating that the second-order models are overfit to irrelevant interpositional correlations in the training data. In order to predict novel kinase substrates, it may instead be beneficial to integrate simple sequence models with other contextual information such as known interactions [11], subcellular localization, protein structure and distal-site recognition.
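For readers unfamiliar with the term, a first-order model of the kind discussed above can be sketched as a position-specific log-odds matrix. The construction below is a generic illustration and not necessarily the model built in this work: the uniform background frequency, the pseudocount of 1 and the function names `train_first_order` and `score` are all assumptions.

```python
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_first_order(substrates, pseudocount=1.0):
    """Build one log-odds column per position from aligned substrate windows.

    Assumes equal-length windows and a uniform background over the twenty
    standard amino acids (an assumption made for simplicity).
    """
    background = 1.0 / len(AMINO_ACIDS)
    model = []
    for pos in range(len(substrates[0])):
        # Pseudocounts keep unseen residues from receiving a log of zero.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for window in substrates:
            counts[window[pos]] += 1
        total = sum(counts.values())
        model.append({aa: log((c / total) / background) for aa, c in counts.items()})
    return model


def score(seq, model):
    """First-order score: the sum of independent per-position log-odds."""
    return sum(column[aa] for column, aa in zip(model, seq))
```

A candidate window is then scored with `score(window, train_first_order(known_substrates))`. A second-order extension would add one term per pair of positions, which multiplies the number of parameters by roughly the size of the alphabet for each position pair and is one way to see why such models overfit small substrate sets.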
In the present work, we elected to restrict our models to using only local sequence information directly, in order to isolate the effect of second-order information on improving first-order models. Sequence information also informs substrate fitness in indirect ways not examined here. Intrinsic protein disorder seems to be enriched in proximity to sites of protein phosphorylation, and consideration of protein disorder improves ab initio prediction of the location of sites of phosphorylation [25]. Likewise, the evolution and conservation of protein sequence surrounding a site of phosphorylation might give clues to which neighbouring residues are responsible for recognition, and what the relevant constraints on flanking sequence are, though aligning sequences in disordered regions of proteins is quite difficult.
It is tempting to speculate that the approximately first-order nature of substrate recognition by these kinases reflects evolutionary freedom for the development of new kinase substrates: the energy landscape for substrate fitness is smooth, with multiple-step mutations generally unnecessary to improve the fitness of any potential kinase substrate. It remains to be seen, however, whether the results reported here extend to other kinases, or to the complete repertoire of substrates for the kinases analysed here.