We used the interaction correlation method 
, described in the Methods
section, to determine the pattern of energetically coupled pathways in Thermus thermophilus
Argonaute (TtAgo), with and without a bound nucleic acid substrate. (We refer to Argonaute bound to a single guide nucleic acid strand as the binary complex
and to Argonaute bound to a guide-target nucleic acid hybrid as the ternary complex
This method computes the pairwise residue interaction energies of many conformational states of the structures at thermal equilibrium sampled from a MD trajectory. The interaction energy calculations include only nonbonded terms from the force field equation: van der Waals (i.e.
Lennard-Jones terms) and electrostatic interaction energies, which are added together to produce the total interaction energy. Because electrostatic interactions decay over a much longer distance compared to van der Waals interactions, electrostatic energies dominate the calculated quantities. As an example, for a representative ternary complex snapshot, the magnitude of the sum of all pairwise residue electrostatic interaction energies was on the order of 30,000 kcal/mol. By comparison, the sum of the corresponding van der Waals energies was approximately 3400 kcal/mol. It is possible that at relatively short ranges the two terms may be of similar magnitude and opposite signs and as such cancel each other out, masking significant interaction changes. In the case of the fully-matched TtAgo ternary complex, in a given trajectory snapshot, there were in general less than 50 residue pairs in which this situation occurred, as defined by energy magnitudes greater than 0.5 kcal/mol (in most cases less than 2 kcal/mol) and sum less than 0.1 kcal/mol. See Supporting Table S1
for example energy values from a representative snapshot. For the purposes of this study, we felt that the individual contributions of van der Waals and electrostatic energies were not essential to the conclusions, and so we chose to examine the nonbonded energies as determined using the force field equation.
Correlations among pairs of residue interaction energies were calculated and subsequently projected back on the structure to create a residue correlation matrix where each axis represents the protein sequence (see ). Each matrix cell then describes the degree of energetic coupling between two residues. A row or a column in the (symmetric, by construction) residue correlation matrix corresponds to a vector of the correlations for a specific residue with all the other residues. Importantly, the residue correlation matrix, by describing the correlations among pairs
of interactions, goes beyond the calculation of direct interaction or positional correlations among single
residues. This analysis is distinct from position correlation analysis; the position correlation matrix for the same trajectory is shown in for comparison. The residue correlation matrix may change when there is a structural modification; this is illustrated in the binding of nucleic acid to the free TtAgo to form a ternary complex, which results in a relatively small but orderly change in the residue correlation matrix. The delta between the respective residue correlation matrices for this example is shown in Supporting Figure S1
Residue correlation versus positional correlation matrices.
Notably, the qualitative features of residue correlation matrices discussed below manifested similarly at both relatively short (6–12 ns) and long (80–100 ns) timescales, suggesting good convergence at short timescales. It is important to note that correlations at timescales considerably longer than we were able to obtain are likely to be important, possibly resulting in the determination of an internal allosteric network modified from the one we present here; for example, partially connected networks or a fundamentally different architecture could emerge. Of course, the possibility of inadequacy due to sampling limitations is inherent to any study depending on MD simulation. Regardless, it is a strength of this method that we can still glean much important information from short-timescale correlations easily accessible through MD.
Interaction correlation network is centered on active site
Examining the residue correlation matrix for the TtAgo ternary complex shows that there is a widely distributed set of correlations among residues in all regions of the protein. This is shown in note the widely distributed regions of high correlation values. This represents an organized set of interaction pathways, described by energetic couplings, that spans the structure. Residues that are part of the same pathway would have high correlation values with each other. However, specific pathways do not present themselves as an obvious property of the matrix.
Previous workers had identified pathways by regarding the residue correlation matrix as a similarity matrix and using clustering methods to identify groups of residues associated by large correlation values with each other. We repeated this approach, using the Markov Cluster algorithm (MCL) 
, as previously used with a smaller protein 
, and subsequently, average-linkage divisive hierarchical clustering 
. However, with both methods, as the clustering granularity was progressively decreased, large numbers of singleton clusters were resolved. These methods did not identify a pathway (or pathways) but rather suggested that in the setting of a residue correlation matrix densely populated with high similarity values, there are many interaction correlation pathways that are not clearly distinct from one another. This is consistent with the fact that Argonaute is a large protein with a large number of residue-residue interactions, and highlights the difficulty of isolating individual internal allosteric pathways in proteins of this large size.
We next took a different approach to identifying pathway structure, employing a different similarity metric while retaining the use of average-linkage divisive hierarchical clustering. In this approach, the residue correlation matrix is treated as a matrix of observations by residue, where each “observation” for a given residue represents the degree of correlation with another particular residue. Starting with one large cluster containing all residues, each cluster is then recursively divided by choosing the cluster with the greatest internal dissimilarity at each step. In this way, a given pair of residues would only be assigned to the same cluster if they both correlated strongly to similar subsets of residues. Importantly, this pair of residues need not actually have a strong correlation with each other. With this metric, the meaning of a given cluster is that it is a “slice” across a set of correlation pathways rather than a discrete pathway. This is illustrated schematically in , where each circle represents a residue and each connecting line represents a high correlation value between two residues. Note that the red residues at the bottom of the diagram are strongly connected to the blue residues, but are not connected to each other. Hence they are grouped together. Similarly, the blue residues connect strongly to the red residues and the green residues, and as such are grouped together. Finally, the green residues connect primarily to the blue residues and are grouped together. Note that clustering in this manner does not resolve individual correlation pathways, yet the overall directionality of the network is evident. Although the specific clusters chosen are to some degree dependent on the clustering method and as such the set of correlations evident in the partitioning is not necessarily exhaustive, this clustering provides substantial insight into the overall organization governing transfer of information within the interaction correlation network.
Illustrations of residue correlation matrix clustering.
Applying this concept to partitioning the residue correlation matrix for each structure, we produced a set of cross-sections of the interaction correlation network in the Argonaute ternary complex, which revealed the overall architecture of the network. In all matched and mismatched ternary complexes, we found that there were roughly six major clusters of residues that were robustly prominent at a variety of clustering levels. These clusters were organized in layers emanating from the active site region (i.e. corresponding to the red cluster in ), in an “onion-skin” type architecture (see ). The largest cluster included the cores of all protein domains (blue cluster), showing that they are the common link in all emanating pathways. This cluster was surrounded by a distributed cluster (green cluster) consisting of residues on the protein surface, with which it directly interacts. The central clusters (red) include the active site and the target scissile bond. This architecture allows dynamical signals to propagate from the active site region (red), through the protein domain cores (blue), onward to the surface of the complex (green), and vice versa. This widely-distributed but non-uniform network of interaction pathways follows from the architecture of Argonaute because all domains, although structurally well defined, are in relatively close proximity to each other and therefore their nonbonded interaction energies tend to be well correlated. Such architecture allows the dynamics of any given region of the complex to influence all other regions of the complex. This would allow Argonaute to support allosteric interactions that would have distant endpoint effects at many regions across its surface.
Influence of mismatch depends on its sequence position
We also examined certain individual elements of the interaction correlation network. Since the guide strand seed region has been implicated in RNAi sequence selectivity, we hypothesized that in a ternary complex, the degree to which a given guide strand nucleotide can influence Argonaute protein residues would depend on its position. The rationale for such a hypothesis is based on a structural argument related to the nature of nucleic acid protein interactions. In the ternary complex, as one progresses along the nucleic acid sequence (e.g.
, from position 3 towards position 11 near the active site) the base pairs are not only translated from the anchoring site towards the active site, but they also change their orientation with respect to the protein because of the helical nature of the nucleic acid. In approximately 5 base pair steps, the orientation turns ~180°, such that if the guide strand was proximal to the protein, after 5 base pairs, the target strand is now in an equivalent, albeit translated, position. To some extent this was the origin of sequence-position-dependent changes in the free energy of introducing a mismatch, as was observed in our previous work 
To probe this coupling, we calculated the correlation factor
of each nucleotide with the protein as detailed in the Methods
section. Essentially, this is the sum total of correlation values of the nucleotide with the rest of the protein, providing a description of the degree to which interaction pathways connect each individual nucleotide with the protein. In all TtAgo ternary complexes, guide strand positions 4–6 (part of the seed region) have increased correlation with the protein regardless of the presence of a mismatch. The increased seed region correlation suggests that while structural distortion is minimal when there is a guide-target mismatch in the seed region, it has the potential to propagate a signal to the protein to a greater degree than a mismatch in the surrounding positions. We also probed the dependence of the correlation factor on distance from the nucleotide (see ). Interestingly, active site nucleotides (the scissile bond is between nucleotides 10 and 11) had a large correlation factor with respect to nearby protein residues, but this is greatly diminished with distance, indicating that perturbations in these residues have primarily local effects. By contrast, seed region nucleotides had a prominent effect in distant protein residues (30–50 Å). This is congruent with existing theories stating that the seed region is the primary determinant of target selectivity, because a seed region mismatch is better able to influence the rest of the structure than a non-seed-region mismatch. This is also congruent with the hypothesis that this binding free energy penalty is the primary mechanism through which mismatches in the seed region reduce RNAi cleavage activity, consistent with the results of experiments showing a decrease in RNAi catalytic efficiency due to seed region mismatches 
Relative correlation factor by distance in fully-matched TtAgo ternary complex.
Whole-network effects of guide-target mismatches in the nucleic acid
Previously, we found in simulations that a guide-target mismatch tends to destabilize the ternary complex relative to the binary complex, which would disfavor target binding 
. Examining the simulation trajectories of all mismatched TtAgo ternary complexes showed that in each case, significant structural distortion was limited to the mismatched base pairs themselves and their close neighborhood. In addition, for all simulations of the mismatched ternary complexes, we observed that the RMSD was less than 3 Å with respect to the fully-matched ternary complex. This suggests that the dynamical perturbations emanating from the mismatch site do not cause a large structural change even though their energetic change is quite substantial and highly significant with respect to function. We wished to determine how this energetic change affected the interaction correlation network. To do so, we compared the residue correlation matrix of the fully-matched TtAgo ternary complex with those of mismatched TtAgo ternary complexes. To determine whether gross interaction pathway reorganization took place within the protein, we designed and used a similarity metric designed to describe the degree to which the general pattern of internal correlation pathways is preserved when a mismatch is introduced. The method is described in detail in the Methods
section, and is briefly described below.
We were primarily interested in the degree to which strong correlations were preserved or lost across the two structures in a given comparison, so we excluded from consideration all but the strong correlations. In order to identify these strong correlations for a given residue correlation matrix, we plotted all correlation values, extracted from the matrix, in ascending order. This resulted in a curve with two distinct regions — a low-valued region representing small correlations and a high-valued region representing strong correlations (see Supporting Figure S2
). In all cases, the high-valued region represented the top 7.4% of values, which for a given residue correlation matrix was approximately 17350 distinct correlation values, out of (N2
234270 possible (given that the residue correlation matrix is symmetric). To compare two residue correlation matrices A
, each cell at Aij
was examined along with its corresponding cell Bij
to produce a combined similarity score (see Methods
). Each corresponding pair where both had strong correlations in the top 7.4% subset contributed a 1 and those correlations that were unmatched contributed a zero. Those correlations that were not in the top 7.4% subset were excluded. The similarity score was therefore the fraction of the corresponding pairs out of the total number of correlations preserved in the top 7.4%. The intuition underlying this method is that in the case of full preservation of the interaction correlation network, if there is a strong correlation between residues i
in a fully-matched structure, then i
should also have a strong correlation in the mismatched structure, and vice versa. Each score was compared to an expected similarity score between randomized versions of the same two matrices in order to assess for significance (see Methods
). In all cases only the Argonaute protein was included in the comparison.
and show comparisons using the similarity score between various matched and mismatched ternary complexes. Every 0.001 increment in the similarity score represents approximately 35 modified strong interactions. The maximum possible similarity score is 1, indicating a fully-preserved allosteric network architecture that would preserve the dynamics of surface residues interacting with allosteric partners. The similarity scores between matched and mismatched ternary complexes are in the range 0.9904–0.9911, which implies that there were on the order of 300 modified strong correlations in each case. Comparing the mismatched ternary complex residue correlation matrices to each other resulted in similarity scores ranging from 0.9977 to 0.9987 (see ). This indicates that the different mismatched ternary complexes are more similar to each other than they are to the fully-matched ternary complex. For all comparisons the similarity scores were substantially higher than the randomized value of approximately 0.1514, indicating that they were significant.
Residue correlation matrix similarity scores — protein subsets only.
Residue correlation matrix similarity scores — ternary complexes, protein subsets only.
These high similarity scores suggest that the architecture of the network is quite resilient to perturbations, whether they originate from a binding event, a conformational change or the introduction of a mismatch in the nucleic acid sequence. We were most interested in the ~1% of variability in the network arising from a guide-target mismatch. Since the only structural difference between matched and mismatched ternary complexes was within the mismatched base pair itself, this dissimilarity can be considered to arise from the presence of the mismatched base pair. We therefore designed an approach to identify specific residues affected by these structural perturbations.
Argonaute residues preferentially affected by guide-target mismatches
We observed that a single mismatch can modify the entire interaction correlation network, causing a subset of residues to be either coupled or decoupled from the network. Given the pattern of similarity scores, we hypothesized that this subset would have significant commonality across mismatched ternary structures. To identify these common elements, we examined each difference matrix
resulting from element-wise subtraction of the fully-matched residue correlation matrix from each mismatched residue correlation matrix. To facilitate comparisons among different matrices, each residue correlation matrix was first converted to units of standard deviations from its element mean, and only the Argonaute protein was included in these calculations. For each residue correlation matrix Mmismatched
corresponding to a given mismatched ternary structure, we calculated ΔM
. Therefore, each cell in the difference matrix ΔMij
describes the change in correlation value between residues i
due to the presence of a mismatch. For a given residue pair, a gain in correlation would suggest that new interaction pathways connected to the perturbation pass through this coupling, and a loss would suggest that such pathways were removed. In both cases, this means that the dynamics of a residue pair are modified in response to the perturbation. However, we were interested in the average magnitudes of differences rather than their signs, so we calculated the root mean square of each cell ΔMij
across all difference matrices. This resulted in an averaged difference matrix ΔMrms
To quantitate the degree of involvement of each residue in the modification of the residue correlation matrix due to the introduction of any of the tested mismatches, we summed the values in each column of ΔMrms to produce a vector where the ith element describes the amount of change in correlation value for residue i combined across other residues in the structure. The largest values in this vector corresponded to residues which had their coupling to the interaction correlation network most affected by the mismatches. Importantly, a large value could result from both increases and decreases in correlation value. To determine the most important residues by this measure, we identified the 65 elements of this vector that were more than two standard deviations away from the mean of all 685 elements in the vector, and the corresponding residues were deemed the sensor residues. These sensor residues gain or lose coupling with the rest of the structure in the presence of a seed region guide-target mismatch. The sensor residues are listed in and diagrammed in .
Sensor residues in TtAgo ternary complex.
Sensor residues were found in N-terminal, PAZ, and PIWI domains as well as the linker regions, but none were found in the Mid domain. Since the Mid-PIWI interface binds the 5′ end of the guide strand, a short distance from the seed region, this shows that the distribution of sensor residues was nonuniform. Nearly all sensor residues were charged, with a predominance of arginine and glutamic acid residues, as well as several aspartic acid and lysine residues. Since the interaction energies are dominated by electrostatics, the fact that so many sensor residues were charged is therefore not surprising. Some sensor residues formed salt bridges, which could in principle change orientation, serving as an allosteric mechanism; these are listed in . The sensor residues could be divided into three categories: two residues forming part of the catalytic triad, binding groove residues, and surface residues.
The active site residues D546 and D660, which interact with both Mg2+
ions in the active site, were found to be sensor residues. This suggests an effect of seed region mismatches directly upon the target cleavage reaction. It is likely that the mechanism of target cleavage in Argonaute is similar to that of RNase H given the structural similarities between the respective active sites — indeed, we placed the Mg2+
ions in the system by analogy with RNase H (see Methods
). In the RNase H mechanism, an Mg2+
ion is likely to catalyze the nucleophilic attack of the scissile phosphate by a water, forming an intermediate which is bound to the other Mg2+
ion. Protonation of the leaving group by an active site Asp residue completes the reaction. During the course of the reaction, the inter-Mg2+
distance is modified, suggesting that the motion of these ions is significant 
. Disrupting any of these components, whether an Asp itself or an Mg2+
complexed to the Asp residues, could potentially modify the catalytic rate. In addition, the binding groove sensor residues, described below, may be part of internal allosteric pathways terminating at D546 and D660. The identification of these residues strongly suggests a focused effect on the active site region due to seed region guide-target mismatches.
Interestingly, much of the length of the Argonaute binding groove was composed of sensor residues, demonstrating a long-range effect of seed region mismatches upon the binding groove. This included a subset of residues that were located proximal to the active site; these were K191, R192, R194, E203, D246, K248, D249, and E268; this region is illustrated in . Several binding groove sensor residues were part of the PIWI domain: R482, R574, K575, K618, D546, R548, E597, D598, D660, R661, E665, and R685, also illustrated in . Even though we employed a TtAgo guide-target structure in which both strands of the bound nucleic acid were truncated after guide strand position 12, there were a number of sensor residues in the regions of the binding groove that would normally bind the missing nucleotides. This suggests that the entire binding groove — both guide- and target-binding regions — is dynamically coupled to the seed region, and that a seed region guide-target mismatch may have an effect upon the interface between protein and nucleic acid, even in regions relatively distant from the site of the mismatch.
In order to estimate more accurately how the binding groove sensor residues were positioned relative to the “missing” nucleotides, we identified the sensor residues on a TtAgo ternary structure in which the missing nucleotides were resolved, but for which no experimental data regarding target cleavage as a function of mismatch is available 
(PDB: 3HK2). Binding groove sensor residues were found to be in close relationship to the “missing” nucleotides (see ). Notably, near the 5′ end of the target RNA, there is a strip of sensor residues which follow the putative path of a longer target RNA as it would extend into the TtAgo protein. This lends credence to the idea that the Argonaute-target interface is modified along its entire length by an internal allosteric mechanism originating at seed region guide-target mismatches.
Although the location of binding of RISC accessory proteins to Argonaute is not known, sensor residues on the surface of Argonaute may mediate the passage of the internal allosteric signal arising from the mismatched base pair to the surface of the protein, and by virtue of their physical accessibility to external molecules would represent potential external allosteric interfaces. We wished to identify these sensor residues because they may mediate the recognition of mismatches by external proteins, such as Dicer or accessory proteins in RISC. We defined a surface sensor residue as having relatively large solvent-accessible surface area (>70 Å2
), as calculated using a standard probe radius of 1.4 Å, without being located in the binding groove. (The solvent-accessible surface area values are given in and diagrammed in Supporting Figure S3
.) These residues were E76, R289, R335, R340, E517, and D521. There were relatively few regions containing such surface sensor residues. Surface sensor residues R335 and R340 are located in the PIWI domain, on the opposite side of the protein from the binding groove. E517 and D521 are located adjacent to each other in the PIWI domain. R289 is located near the 3′ end of guide strand. E76 is in the N-terminal domain.
In general, the locations of the sensor residues are congruent with our description of the “onion skin” architecture of the internal allosteric network with the active site at the center: the sensor residues near the active site are part of the central clusters, and those on the surface are part of peripheral clusters. The overall picture is of three general categories of sensor residues with apparent functional importance: residues on the surface of the protein, which can transmit a signal to external binding partners; residues in the binding groove, which can modify the interface along the length of the bound nucleic acid distant from the seed region mismatch; and residues near and in the active site, which could potentially affect target cleavage.