|Home | About | Journals | Submit | Contact Us | Français|
The observed mutation pattern in Immunoglobulin V genes is influenced by several mechanisms, including AID targeting to DNA motifs (hot spots), negative selection of B cells that accumulate mutations that prevent the expression of the Ig receptor, and positive selection of B cells that carry affinity-increasing mutations. These influences, combined with biased codon usage, produce the well-known pattern of increased replacement mutation frequency in the CDR regions, and decreased replacement frequency in the framework regions. Through the analysis of over 12,000 mutated sequences, we show that the specific location in the V gene also significantly influences mutation accumulation. While this position-specific effect is partially explained by selection, it appears independent of the CDR/FWR structure. To further explore the specific targeting of SHM, we propose a statistical formalism describing the mutation probability of a sequence through the multiplication of independent probabilities. Using this model, we show that C->G (or G->C) mutations are almost as frequent as C->T and G->A mutations, in contrast with C->A (or G->T) mutations, which are not more probable than any other mutation. The proposed statistical framework allows us to precisely quantify the effect of V gene position, substitution type, and micro-sequence specificity on the observed mutation pattern.
B cells immunoglobulin (Ig) diversity is obtained in multiple stages. The initial diversity is generated in the bone marrow during the Heavy Chain (HC) and L Chain (LC) V(D)J rearrangement , including TdT induced nucleotide addition and exonuclease-mediated deletion at the segment junctions . A second stage of diversification is obtained in the periphery following exposure to antigen through somatic hypermutation (SHM) . In SHM B cells introduce point-mutations into the variable region of their immunoglobulin (Ig) genes . The mutation rates in these genes can reach 1.e-3 mutation per base pair per division .
SHM occurs in germinal centers (GC) of secondary lymphoid organs . To initiate SHM, activation induced cytidine deaminase (AID) is required. AID initiates hypermutations on both the transcribed and non transcribed strands of Ig variable region genes by deaminating a C to form a U during transcription [7, 8]. The resulting U/G mismatches [3, 9] have several possible fates . First, the mismatch can be replicated over, producing C->T transition mutations. Second, the U can be removed by UNG, leading to C->A/C/T mutations. Finally, the mismatch repair machinery may be engaged leading to excision of neighboring bases and DNA resynthesis primarily by Polη, but also by other polymerases (ι and ζ) . The repair mechanism is error prone and leads to point mutations at all base positions in the surrounding sequence. AID targets specific locations in the Ig gene [12–14]. SHMs occurs mainly in B cell Ig genes, but have been recently described to occur at a much lower mutation rate in other genes .
SHM preferentially targets C bases in the WRCY hot-spot motif (W = A or T, R = G or A, and Y = T or C), or its reverse complementary sequence RGYW . Other variants of this hot-spot motif, such as WRCH (or DGYW) have also been proposed (REF – ROGOZIN and DIAZ). Such sequences are highly over-represented in Ig Complementary Determining Regions (CDRs), which are thus considered to have an increased mutations probability[17, 18], while FWRs have a decreased frequency [19, 20].. This hot-spot motif can at least partially explained by the observed in vitro binding preference for AID to WRC (or GYW) motifs . When analyzing sets of experimentally-observed Ig sequences, the observed increased mutation frequency at these hot-spots is due to factors beyond AID targeting. An important element affecting the observed mutation spectrum in different regions is positive and negative selection of the B cells carrying the resulting mutations. Negative selection is the death or editing (in the case of the L chain) of B cells carrying low affinity, non-functional or self-reactive receptors [2, 21]. It has been estimated that ~50% of replacement mutations in the framework regions are subject to negative selection . Positive selection is the increased expansion rate of B cells with a higher affinity receptor for the introduced antigen , most likely due to a decreased death rate  . Thus mutations adversely affecting the structure of functionality of the receptors are expected to be purged from the peripheral B cell population, while mutation enhancing antigen binding are expected to be over-represented. This will increase the observed mutation frequency in CDR regions and decrease the observed mutation frequency in structurally important framework (FWR) regions . A final element affecting the observed mutation frequency is the long term evolution of the Ig germline sequence.
We here analyze a large collection of over 12,000 mutated sequences, carrying over 80,000 mutations and propose a new statistical model to analyze the mutation probability as a function of the surrounding sequence. We show using this model that the SHM targeting pattern is different than previously reported. Most significantly, we show that beyond microsequence-specific targeting, there exists a position-specific effect on the mutation probability.
Three sources of immunoglobulin sequences were used: the IMGT database , the KABAT database  and immunoglobulin sequences published in the NCBI database. An initial quality check was applied to all sequences by checking for length and homology in the V region to a published germline V sequence. Sequences of less than 280 nucleotides, or sequence with less than 80 % homology to any published  V germline gene were removed. We then removed all sequences with insertions or deletions when compared with all similar enough V genes (above 80 %). In the immunoglobulin sequence, we only studied the V region, and ignored the CDR3 region to avoid any bias induced by junctional diversity
Mutations were computed by blasting each sequence with all IMGT germline V gene sequences in the appropriate organism. The blast was performed using the NCBI BLAST software . The highest scoring V gene was selected and all the differences between the rearranged sequence and the germline in the first 280 nucleotide were defined as mutations.
We checked whether the appearance probability of a mutation could be explained solely on the base of the DNA sequence surrounding the mutation. We defined the signature of a mutation as the DNA quadruplet surrounding it, with either 2 positions before it and 1 after it or the opposite. We then built a regression model for the mutation probability. The approximated probability was computed as a product of the probability of the preceding di-nucleotides, the probability of the mutation by itself and the probability of a following single nucleotide. A similar inverted model was also tested. For example the expected value of AG(C->A)T was approximated as P1 (AG)*P2 (C->A)*P3 (T), and the expected value of A(G->T)GT was approximated by P1 (A)*P2 (G->T)*P3 (GT). We developed an independent regression model for the forward (mutation at second position) and backward (mutation at third position) models. The log of the values was used producing log10 (P(AG(C->A)T) ≈ log10 (P1 (AG))+log10 (P2 (C->A))+log10 (P3 (T)). Each regression model (forward and backward) had 768 observation (each quadruplet) and 32 variables (16 possible nucleotide couples, 12 possible mutations, and 4 possible single nucleotide probabilities). At a second stage, we simplified the model, where the di-nucleotide was replaced by a multiplication of two independent mutations (e.g. P1 (AG)=P1a (A)P1b (G)). The simplified model had 8 less variable producing a model with 24 independent variables.
An AID hot-spot “signature” is defined as an “RGYW” motif or its reverse complement motifs “WRCY” [28, 29]. We compared the mutation probabilities of all 48 different nucleotide 4-mers that fit these AID signatures to all others 1488 (1536–48) 4-mers. As described above, in our model a signature is composed of a di-nucleotide, a mutation and another nucleotide.
The nucleotides surrounding mutations in human and mouse H and L chains were extracted by blasting each reported immunoglobulin sequence V regions against all IMGT V germline genes . We then detected all mutations and tallied the frequency of each mutation accompanied by four flanking nucleotides (two 5′ and two 3′). The classical AID targeting motif (WRCY/RGYW) suggests that the most critical positions influencing targeting are the two 5′ bases along with the first 3′ base. Thus, for each mutation we focused on these surrounding bases and generated two quadruplets to account for the possibility of targeting on either strand. For example, an observed mutation at the middle position in ACAGA would be analyzed as the di-nucleotide AC, a mutation at A followed by the nucleotide G (or on the opposite strand as C, followed by a mutation at A, then GT). We write this as AC(A->T)G or C(A->T)GT). The frequency of such quadruplets was measured separately for human and mouse H and L chains (Figure 1). Each mutation quadruplet (e.g., AC(A->T)G) was normalized by the frequency of its underlying nucleotide quadruplet (e.g., ACAG), which we denote as its base quadruplet. Normalization was applied either at the entire sequence level, or locally per position along the sequence as shall be further discussed. Our model covers 1536 possible quadruplets, 768 (4*4*12*4) with a mutation at the third position, which we denote as forward quadruplets and 768 with mutations at the second position, which we denote as reverse quadruplets. Different quadruplets have drastically different observed mutation frequencies (Figure 1 diagonal for histogram in log scale), with a distribution spanning almost 5 orders. This is consistent with the finding of a hierarchy of mutation described by Smith et al . While a clear correlation was observed between the log of the quadruplet frequencies in the four chains tested (human H,L and mouse H,L), some combination had much higher log frequencies in one chain or the other (Figure 1). Furthermore, there is a significant scatter around the diagonal. Most of the difference between the different chains results from the variability in base quadruplets with a low appearance frequency in the germline sequences. The average Pearson correlation between the mutations frequencies in the different chains studied is 0.4. When limiting the analysis to quadruplet whose base quadruplets have at least 1,000 observations (1262 possibilities out of 1536), the average correlation coefficient rises to 0.59. When further limiting the comparison to base quadruplets with over 5,000 observations in each chain (635 possibilities out of 1536), the average Pearson coefficient rises to 0.71. Limiting the selection to even more frequent quadruplets does not increase the correlation. Thus the difference between chains is not fully explained by random fluctuations in rare quadruplets. These differences can reach two orders of magnitude and no bias was observed in the variance for any particular mutation quadruplet. These differences suggest that micro-sequence specificity is not the only element affecting the observed mutation pattern. Such differences were previously described and assigned to differential targeting of AID in different chains [31, 32]. We suggest here that they represent the combination of selection and a position-specific effect.
AID initiates SHM by deaminating C, with a preference for the hot-spot WRCY/RGYW. We tested whether the observed mutation distribution in our database was consistent with this classical AID signature. We summed the mutations from human and mouse H and L chains and computed the average mutation probability. We then divided all quadruplets into AID-positive quadruplets (i.e. quadruplets that fit the AID hot-spot signature) and AID-negative quadruplets (all others), and computed their frequency distribution (Figure 2). Interestingly, we found that AID-positive quadruplets are actually composed of two populations; one group of motifs has significantly higher frequencies compared with AID-negative motifs, while the other group is no different from this background. This high-frequency subset is enriched for motifs containing C->G or T and G->A or C mutations. In contrast, C->A and G->T mutations are no more frequent than AID-negative motifs. The preference of C->T and G->A mutations is expected by the deamination function of AID  and the previously observed preference for transition mutations. However, it is noteworthy that C->G and G->C are almost as frequent. This could be explained by the codon composition of Ig germline sequences  that induce selection of mutations beyond C->T and G->A. Note that some AID-positive quadruplets with C->G or T and G->A or C mutations have a lower mutation probability than some AID-negative quadruplets, as can be observed by the right hand tail of the AID-negative distribution in Figure 2.
In order to quantitatively explain the observed mutation frequencies; we developed a regression model of the quadruplet frequency as a combination of the independent probabilities of two nucleotides preceding the mutation, a single nucleotide following the mutation and the mutation itself. For example the frequency of AC(A->T)G was modeled by the following probabilities product: P(AC)P(A->T)P(G). These probabilities were extracted using a linear regression on the log of the frequencies. The regression results show that such a model can explain over half the variability in the log of the quadruplet frequency (Figures 3 and and44).
As previously reported [17, 18], the last position in the AID hot-spot motif WRCY (or first position in RGYW) does not contribute very much to the observed mutation frequency (Figure 4 vs. vs.5).5). To test whether this was true in our dataset, we performed a quantitative comparison of a model that includes the last position to a model without it. The decrease in the residual variability of the log frequency was less than 10% (F18,736=27,p <1.e-20, (Sup mat Figure 1)). Although this was statistically significant, the effect is much less than other positions. These results suggest that the downstream nucleotide is less important to predicting the mutation probability than others, but cannot be neglected. On the other hand a model with independent probabilities for the two positions before the mutation (Sup mat Figure 1) had a significantly higher residual variance (F18,736=14,p <1.e-20). This suggests that the interaction between the two 5′ nucleotides is important for targeting SHM/AID.
It is important to note that a large part of the variability in the proposed mutation model is explained by the specific nucleotide substitution, rather than the DNA motif surrounding the nucleotide position. In other words, the variance in the frequency of different substitutions that occur at a position selected for mutation is as high as the variance in the mutation frequencies across different positions. While mutations from a Cytosine to Thymine are highly over-represented, as expected from the biology of AID, mutations from Cytosine to Guanine or Adenine are actually underrepresented. We compared the frequencies of the 12 possible mutations with the frequencies obtained by Patrick Wilson on a set of 2,000 human H chains and obtained similar results in most of the mutations for the human H chain. The regression coefficient represents the contribution of a given nucleotide or mutation type to the total frequency of the quadruplet. Positive regression coefficients imply a high frequency of quadruplets and vice versa. A comparison of the mutation frequencies from the previous section as well as from Zheng at al.  to the our regression coefficients show that a good match exists for 10 out of the 12 possible mutations from one nucleotide to another: A->G, G->A, C->T,T->C, G->C and A->C have high mutation frequencies (in previous publications) and positive regression coefficients (in the current analysis). C->A, T->A, G->T and T->G have low mutation frequencies and negative regression coefficients. Only type two mutation C->G and A->T have a intermediate frequencies with negative coefficients.
The regression model predicts many of the known AID hot- and cold-spots, with the contribution of Adenines around the mutation being much more significant than any other nucleotide or dinucleotide. To summarize, while the AID signature may be in general overrepresented, a super-signature exists, which is significantly more over-represented than any other mutation frequency and it is composed of A(G->A)[CT]T or [AT]A(C->T)T. Moreover, signatures containing a C->A or G->T mutations are not over represented at all.
The differences between the H and L chain and between the human and mouse quadruplet mutation frequencies suggests a mutation mechanism that is not directly mediated by the sequence. While selection may play a role as discussed below, we hypothesized that this effect was due at least in part to the position along the V gene. However, any position-specific effect may be confounded by micro-sequence specificity as it is well-know that AID hotspots cluster in the CDRs. To resolve this problem, a two way ANOVA model was built. We split the V region into 10 base windows, and computed the probability of each quadruplet in each window. For example, the frequency of AG(C->A)T in window 5 (where the position is determined by the nucleotide mutating) was approximated as the frequency AGCT in the same window multiplied by P1 (AG)* P2 (C- > A)* P3 (T) multiplied by the total mutation probability in this window P(mutations in window 5).
Taking the log of the two sides yields:
where F are the appropriate frequencies. A similar approach was used for all quadruplets in all windows. The difference between the log of the mutation quadruplet frequency and the nucleotide quadruplet frequency was introduced into a two way ANOVA model. The results of the ANOVA were highly significant for both the quadruplet and the position effect (p < 1.e-10). When looking at the coefficient of each position in the V gene a clear and structured order appears with a higher mutation frequency in CDRs and a lower mutation frequency in FWR regions (Figure 6). Such a location effect might be due to the negative selection of mutations in framework regions (or positive selection of mutations in the CDRs). In order to test the possible effect of selection, we repeated the same analysis including only silent mutations, and again found a significant position-specific effect (p<1.e-3) in all chains except for the mouse H chain. In the case of silent mutations there is no clear bias of high mutation frequency in the CDR regions (Sup mat Figure 2). Thus, we conclude that the observed position-specific effect is not fully due to selection. Once the position along the germline V gene is incorporated within the ANOVA framework, the regression model explains over 75% of the observed mutation frequency (F test F28,736=27, p <1.e-20). A more limited effect was recently proposed based on the distance between mutations .
We have here proposed a new statistical formalism to analyze the effect of the surrounding nucleotide sequence on the mutation probability. We have shown that a model composed of multiplication of the probability of preceding di-nucleotides, the probability of each mutation at a given nucleotide and the probability of the following single nucleotide can explain most of the observed variance in the observed mutation frequencies among a large set of sequences. This quadruplet based model explains more than half the differences in mutations rates among quadruplets. The proposed model highlights that the main element affecting the mutation probability is not the flanking region, but the mutation itself, with C ↔T and A↔G probability 10 times higher than the average mutation probability. While C->T and G->A transitions are expected from the AID targeting model (as they are the direct result of replicating over AID-induced mismatches), A->G and T->C may be the result of selection. Indeed, when only silent mutations are incorporated in the analysis, the contribution of A->G and T->C to the mutation frequency is much lower (~3.5x) compared with C->T and G->A. Within the context of the AID targeting signature, C->G substitutions are almost as probable as C->T, and similarly G->C are almost as probable as G->A. C->A and G->T, on the other hand, are no more probable than any other mutation even when surrounded by an AID targeting motif. This is true both for a model that only includes mutations at C (and G), and when developing a general model incorporating mutations at all positions (Sup mat Figure 3).
The contribution of the di-nucleotide preceding the mutation is significantly more important than the one nucleotide downstream of the mutation. Within the first di-nucleotide, the model containing the correlation between the two nucleotides is slightly better than the one not including such correlations. However, when one focuses on C mutations in the coding strand (or G mutations in the opposite strand), the correlation between the two first (or last two) nucleotides becomes significant. For example, the TG di-nucleotide is much more frequent than the TA nucleotide, while AA and AG have similar frequencies. A similar phenomenon was recently observed by Zukerman et al .
Our sequence-based model explains more than half, but not all, the differences in the mutation frequencies among quadruplets. This is clearly seen by the large differences observed between the mutation probabilities for the same quadruplet in different chains. We show that half of the remaining variability can be explained by a position effect that is not affected by the sequence distribution at each position. This effect can explain the large difference between mutation frequencies in CDR and FWR regions that is not fully explained by the local sequence context. This position effect is a combination of selection (and its differential impact on CDR and FW regions) and a pure position effect as can be observed in the continued significance of position as a predictor when only considering silent mutations.
This position effect may be affected by structural elements, such as possible different methylation profiles in different regions . Another possibility is long range effect of AID. We have here only studied quadruplet based models of mutation probability. If AID targeting affects the mutation probability more than 2 or 3 positions beyond the initially targeted Cytosine then the entire CDR region may have a higher mutation probability that is somewhat different from what we estimate here.
The current analysis was performed on a large database of sequences currently deposited in the IMGT and NCBI databases as well as sequences from the older KABAT database. These sequences were obtained in many different ways. Thus we do not expect a systemic mutation pattern to emerge from the experimental procedure and sequencing method. However such a possibility must always be taken into account in the explanation of the observed mutation pattern. Another possible caveat of the current analysis is the definition of the germline V genes. If some V gene alleles are not included in the IMGT definition then the differences between these V genes and the most similar V gene will be defined as a consistent mutation patterns. Since no consistent mutations appeared in the sequences at the same position, we assume that even if such alleles exist, their effect on the results is limited.
A model was build to explain the mutation frequency of a quadruplet as a combination of each position in the quadruplet. For example the probability of the quadruplet P(AC(G->C)T) was modeled as P(A)P(C)P(G->C)P(T). The upper row represents the coefficients (e.g. P(A),P(C),P(G->C),P(T)) for the forward quadruplets and the lower row for the reverse ones. Coefficients consistent with AID hotspot motifs are highlighted in black.
The analysis performed here was similar to the one in Fig 6, but only silent mutations were incorporated to avoid any effect of selection. The clear bias toward mutations in CDRs disappears when mutations are ignored.
A model was build to explain the mutation frequency of a quadruplet as a combination of each position in the quadruplet, with only mutations originating from a G or a C. This model was developed to test whether the model reproduces the classical WRCY/RGYW motif. Coefficients consistent with AID hotspot motifs are highlighted in black. Note that within each subplot the average is 0. Thus, the bars represent the relative contribution of each nucleotide/mutation type and not the absolute contribution.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.