I.C., J.D.E. and S.G. designed the study; P.M.S., S.N., K.K. and S.G. contributed sources of M. tuberculosis DNA and demographic information; I.C., J.C. and J.G. performed DNA sequencing and bioinformatics; I.C., P.M.S., J.D.E. and S.G. wrote the manuscript with comments from all authors.
Mycobacterium tuberculosis is an obligate human pathogen capable of persisting in individual hosts for decades. To determine whether antigenic variation and immune escape contribute to the success of M. tuberculosis, we determined and analyzed 22 genome sequences representative of the global diversity of the M. tuberculosis complex (MTBC). As expected, we found that essential genes in MTBC were more evolutionarily conserved than non-essential genes. Surprisingly however, most of 491 experimentally confirmed human T cell epitopes showed little sequence variation and exhibited a lower ratio of non-synonymous to synonymous changes than essential and non-essential genes. These findings are consistent with strong purifying selection acting on these epitopes, and imply that MTBC might benefit from recognition by human T cells.
Infection with Mycobacterium tuberculosis causes enormous worldwide morbidity and mortality; there were more cases of tuberculosis in 2007 (the last year for which data are available) than at any prior point in world history1. Among the factors that contribute to the continued growth of tuberculosis as a global health problem are the efficiency of human-to-human transmission by the aerosol route, the ability of the causal agent M. tuberculosis to persist and to progress despite development of host immune responses, and the absence of a vaccine with reliable efficacy in preventing transmission of the infection. Moreover, while attempts to control tuberculosis through improved identification and treatment of infectious cases have been successful in some settings, similar approaches in other contexts have resulted in increasing rates of resistance to available anti-tuberculosis drugs2. Therefore, new approaches to controlling tuberculosis are essential and would greatly benefit from an improved understanding of the biology of the bacteria and their interactions with their human hosts. In particular, understanding the factors that drive the evolution of M. tuberculosis and allow it to evade host defences may suggest unique opportunities to develop novel strategies against tuberculosis.
Human tuberculosis is caused by Mycobacterium tuberculosis and Mycobacterium africanum, which are members of the M. tuberculosis complex (MTBC). In addition to these human-adapted pathogens, MTBC includes various animal-adapted forms, such as Mycobacterium bovis, Mycobacterium microti, and Mycobacterium pinnipedii3. To characterize the extent and nature of the forces acting to diversify MTBC, we and others have applied several approaches to phylogenetic analysis of multiple clinical isolates from geographically diverse sources. Using single nucleotide polymorphisms (SNPs)3-6 or large sequence polymorphisms (LSPs)7-9 as genetic markers resulted in congruent groupings of human-adapted MTBC into six major lineages and consistent geographical associations for each of these lineages10. In addition, these studies found strong evidence for a clonal population structure of MTBC, without evidence of ongoing horizontal gene transfer. Analysis of SNPs in a total of 7 megabases of DNA sequence from 89 genes in 108 isolates of MTBC provided strong evidence that MTBC originated in Africa, and underwent population expansion and diversification following ancient human migrations out of Africa, followed by global spread and return to Africa of three particularly successful MTBC lineages through recent waves of travel, trade, and conquest3. Taken together, these studies have revealed that MTBC has undergone genetic diversification that corresponds to patterns of human migration, suggesting that distinct lineages have co-evolved with distinct human populations7. Moreover, they indicate that further understanding of the mechanisms and consequences of the interactions between MTBC and its human host can be obtained through comparative genomic analyses.
Host-pathogen co-evolution is characterised by reciprocal adaptive changes in interacting species11. Host immune pressure and associated parasite immune evasion are key features of this process often referred to as an ‘evolutionary arms-race’12-13. Studies in human pathogenic viruses, bacteria, and protozoa have revealed that genes encoding antigens tend to be highly variable as a consequence of diversifying selection to evade host immunity14-17. However, whether similar evolutionary mechanisms operate in MTBC, and whether the bacteria undergo antigenic variation in response to host immune pressure, is unknown.
Immunity to tuberculosis in humans, nonhuman primates, and mice depends on T lymphocytes18. Among human T lymphocyte subsets, CD4+ T cells are clearly essential for protective immunity to MTBC, as demonstrated by the observation that the incidence of active tuberculosis in people infected with HIV is inversely proportional to the number of circulating CD4+ T cells19. In addition to CD4+ T cell responses, humans infected with MTBC develop antigen-specific CD8+ T cell responses20, and MTBC antigen-specific human CD8+ T cells lyse infected cells and contribute to killing of intracellular MTBC21. Therefore, there is strong evidence that the adaptive immune system, represented by CD4+ and CD8+ T cells, is an important mechanism for host recognition and control of MTBC. Recognition of foreign antigens by T lymphocytes depends on binding of short peptide fragments (termed epitopes), derived by proteolysis of foreign proteins, to MHC (major histocompatibility complex; in humans termed HLA, human leukocyte antigen) proteins on the surface of macrophages and dendritic cells. CD4+ T cells recognize peptide epitopes bound to MHC/HLA class II; CD8+ T cells recognize peptide epitopes bound to MHC/HLA class I.
To obtain a better understanding of the effects of human T cell recognition on the diversity of MTBC, and to test the hypothesis that MTBC uses antigenic variation as one mechanism of evading elimination by human immune responses, we determined the genome sequences of 21 phylogeographically diverse strains of MTBC and used those genome sequences to analyze the diversity of 491 experimentally verified human T cell epitopes. This analysis produced the unexpected finding that the known human T cell epitopes are highly conserved relative to the rest of the MTBC genome. These results provide evidence that the relationship between MTBC and its human hosts may differ from that of a classical evolutionary arms-race, and suggest that development of new approaches to control of tuberculosis must take into account the possibility that certain human immune responses may actually benefit MTBC.
A total of 22 mycobacterial strains were included in this work. To study the sequence diversity of T cell antigens in MTBC, we used Illumina next-generation DNA sequencing to generate nearly complete genome sequences from 20 strains representative of the six main human MTBC lineages, and one strain of Mycobacterium canettii which is the closest known outgroup of MTBC3,22 (Table 1). In addition, we used the published genome sequence of the H37Rv laboratory strain of M. tuberculosis as a common reference23. For each of the 21 strains newly sequenced, a mean of 6.8 million sequence reads with a mean length of 51 base pairs were generated and mapped to the H37Rv reference genome. On average, the reads covered 98.9% of the 4.4 Mb reference genome (Table 1). The regions not covered primarily included members of the highly GC-rich and repetitive PE/PPE gene families24. A total of 32,745 SNPs were identified, corresponding to an average of 1 SNP call for every 3 kb of sequence generated. We used a total of 9,037 unique SNPs (i.e. SNPs that occurred in one or several strains) to derive a genome-wide phylogeny of 22 strains (Fig. 1, Supplementary Fig. 1). Six main lineages could be distinguished with high statistical support. These lineages were completely congruent to the strain groupings previously defined based on genomic deletion analysis and multilocus sequencing3,7,10. The perfect congruence between these different phylogenetic markers further corroborates the highly clonal population structure of MTBC and lack of ongoing horizontal gene transfer in this organism25. Because of the comprehensive nature of genome-scale data, a higher degree of phylogenetic resolution could be achieved compared to all previous studies. In this new phylogeny the brown and green lineages (also known as Mycobacterium africanum) are the most basal groups when compared to the M. canettii outgroup. M. africanum is highly restricted to West Africa for reasons that remain unclear8. 
However, the fact that the two M. africanum lineages represent the most ancestral forms of human MTBC reinforces the notion that human MTBC originated in Africa3,7.
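The sequencing figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the text; the function names are illustrative, and the depth estimate assumes that every read maps to the reference, so it should be read as an upper bound rather than a reported result.

```python
# Figures quoted in the text for the newly sequenced strains.
READS_PER_STRAIN = 6.8e6      # mean number of reads per strain
READ_LENGTH_BP = 51           # mean read length in base pairs
GENOME_SIZE_BP = 4.4e6        # approximate size of the H37Rv reference
NEW_STRAINS = 21              # strains newly sequenced in this work
TOTAL_SNP_CALLS = 32_745      # SNP calls summed over all strains


def mean_depth(reads, read_length, genome_size):
    """Mean sequencing depth, assuming all reads map (an upper bound)."""
    return reads * read_length / genome_size


def bp_per_snp_call(n_genomes, genome_size, snp_calls):
    """Average number of sequenced base pairs per SNP call."""
    return n_genomes * genome_size / snp_calls


depth = mean_depth(READS_PER_STRAIN, READ_LENGTH_BP, GENOME_SIZE_BP)
spacing = bp_per_snp_call(NEW_STRAINS, GENOME_SIZE_BP, TOTAL_SNP_CALLS)
print(f"mean depth: about {depth:.0f}x")
print(f"one SNP call per {spacing / 1000:.1f} kb")
```

Under these assumptions the depth comes out near 79-fold, and the SNP spacing near 2.8 kb, which agrees with the stated figure of roughly 1 SNP call per 3 kb.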
We used these genome sequence data and the phylogeny derived from them to compare the genetic diversity in antigens and other experimentally determined gene classes. For comparisons across different gene categories, we divided our dataset into three gene sets, including ‘essential genes’, ‘non-essential genes’, and ‘antigens’ (Supplementary Fig. 2, Supplementary Tables 1 and 2). Antigens were defined based on the presence of 491 experimentally confirmed human T cell epitopes (Supplementary Table 3), which were compiled through the Immune Epitope Database (IEDB) initiative26. The ‘essential’ gene category was defined based on genome-wide analyses of transposon insertion mutants that were defective for the ability to grow on Middlebrook 7H11 agar, or in the spleens of intravenously-infected mice, published previously27-28. We excluded from this analysis genes belonging to the PE/PPE gene family24 and those related to mobile elements as they are difficult to study using current next-generation DNA sequencing technologies (total genes excluded: 273/3,990 (6.8%) genes annotated in the H37Rv reference genome; Supplementary Table 4).
Based on evolutionary theory and findings in other bacteria29, one would expect that in contrast to non-essential genes, the essential genes in MTBC will be under stronger purifying selection and thus more evolutionarily conserved. In support of this notion, we observed that on average essential genes harboured less nucleotide diversity than non-essential genes (Fig. 2; Mann-Whitney U test p<0.002). We then compared the rates of synonymous and non-synonymous SNPs in the essential and non-essential gene categories. The synonymous and non-synonymous changes were derived by comparison to the most likely recent common ancestor of MTBC, which we inferred based on our new genome-wide phylogeny (Fig. 1, Supplementary Fig. 1). Because MTBC harbours little sequence diversity, it was necessary to analyze the distribution of synonymous and non-synonymous SNPs based on gene concatenates rather than individual genes. The two measures of distribution we used were based on the number of non-redundant SNPs across all 21 MTBC strains (dN/dS based on Measure A in Table 2 and Fig. 3), and on the individual pairwise comparisons between each strain and the inferred most likely recent common ancestor (dN/dS based on Measure B in Table 2). From these analyses, we found that the dN/dS measures for essential genes were significantly lower than for non-essential genes (Measure A in Fig. 3; Measure B in Table 2, Mann-Whitney U test p<0.0001). Taken together, these data show that in MTBC essential genes are more evolutionarily conserved than non-essential genes.
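To make the dN/dS comparison concrete, the following is a minimal sketch of a counting-based ratio for a pair of aligned, in-frame coding sequences (for example an inferred ancestor and a gene concatenate). It is not the procedure used in this study, which is described in the online Methods; it assumes the standard genetic code, counts synonymous sites in the style of Nei and Gojobori, skips codons with more than one difference, and applies no multiple-hit correction. All function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG codon ordering.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def site_counts(codon):
    """Return (synonymous, non-synonymous) site counts for one codon."""
    syn_changes = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                syn_changes += 1
    syn_sites = syn_changes / 3  # each position has three possible changes
    return syn_sites, 3.0 - syn_sites


def dn_ds(ancestor, derived):
    """Uncorrected dN/dS between two aligned, in-frame coding sequences.

    Returns None when no synonymous change is observed.
    """
    syn_sites = nonsyn_sites = 0.0
    syn_changes = nonsyn_changes = 0
    for i in range(0, len(ancestor) - 2, 3):
        anc, der = ancestor[i:i + 3], derived[i:i + 3]
        s, n = site_counts(anc)
        syn_sites += s
        nonsyn_sites += n
        n_diffs = sum(1 for p in range(3) if anc[p] != der[p])
        if n_diffs != 1:
            continue  # unchanged codon, or multi-hit codon (skipped here)
        if CODON_TABLE[anc] == CODON_TABLE[der]:
            syn_changes += 1
        else:
            nonsyn_changes += 1
    if syn_changes == 0:
        return None
    return (nonsyn_changes / nonsyn_sites) / (syn_changes / syn_sites)


# Toy sequences (hypothetical): ATG->ATA is non-synonymous, GCT->GCC is synonymous.
print(dn_ds("ATGGCTGCT", "ATAGCCGCT"))
```

Values below 1 indicate fewer non-synonymous changes per non-synonymous site than synonymous changes per synonymous site, which is the pattern expected under purifying selection.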
Because MTBC interacts with humans through antigen-specific CD4+ or CD8+ T-cells, we would expect T cell antigens to be among the most diverse genes in the genome. Particularly when invoking a co-evolutionary arms-race and associated immune evasion, we would anticipate these antigens to be under diversifying selection and to be more variable than other genes in order to escape T cell recognition. However, when we analyzed the nucleotide diversity in 78 experimentally confirmed human T cell antigens (Supplementary Table 2), we found that they were on average not more diverse than essential genes (Fig. 2, Mann-Whitney U test p=0.12). Moreover, we found that the dN/dS measures in these antigens also resembled those of essential genes (Measure A in Fig. 3; Measure B in Table 2, Mann-Whitney U test p=0.77). Thus, human T cell antigens in MTBC do not appear to be under diversifying selection. Instead, purifying selection appears to be the driving selection pressure in these genes.
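The gene-category comparisons above rely on the Mann-Whitney U test. As an illustration of how such a comparison works, the sketch below implements a two-sided test with tie-corrected average ranks and a normal approximation (no continuity correction). It is a generic textbook formulation, not the software used in this study, and the per-gene diversity values are invented purely for demonstration. The normal approximation is unreliable for very small samples.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p) where U is the smaller of the two U statistics.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2, n = len(x), len(y), len(x) + len(y)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical; no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Hypothetical per-gene nucleotide diversity values, for illustration only.
antigen_diversity = [0.0001, 0.0003, 0.0002, 0.0004, 0.0002]
essential_diversity = [0.0002, 0.0001, 0.0003, 0.0002, 0.0005]
print(mann_whitney_u(antigen_diversity, essential_diversity))
```

A large p value from such a test, as reported for antigens versus essential genes, means the two distributions cannot be distinguished; it does not by itself demonstrate that they are identical.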
T cell antigens consist of epitope regions that interact with human T cells, and non-epitope regions which are not targets of T cell recognition. Hence, we decided to study these regions separately. To this end, we generated a separate concatenate of the epitope regions and another concatenate of all corresponding non-epitope regions. Because little data is currently available in the IEDB with respect to whether these 491 epitopes are recognized by CD4+ or CD8+ T cells, we analyzed them as one class. If immune escape was driving antigen evolution to evade T cell recognition in MTBC, we would expect non-synonymous changes to accumulate in epitope regions, leading to a high dN/dS. Contrary to this expectation however, the overall dN/dS of the epitope regions was 0.53, which was still similar to essential genes and lower than non-essential genes (Table 2, Fig. 3). Moreover, when we analyzed the distribution of amino acid replacements in individual epitopes we found that the large majority (95%) of the 491 epitopes showed no amino acid change (Fig. 4). Only five epitopes, contained in esxH, pstS1, and Rv1986, harboured more than one variable position (Supplementary Table 5). The higher number of amino acid substitutions in these five epitopes may reflect ongoing immune evasion, but further investigation is needed to determine whether the observed changes are due to immune pressure, other selection pressure(s), or mere random genetic drift3. Because these five epitopes were clear outliers compared to the large majority of T cell epitopes analyzed here, we repeated our dN/dS analysis after excluding the three antigens harbouring the five outlier epitopes. Our analysis revealed that the epitope regions had the lowest dN/dS of all gene categories (Table 2, Fig. 3). 
Furthermore, when we compared the proportion of non-redundant non-synonymous changes in epitope and non-epitope regions, we found that epitopes were less likely than non-epitopes to harbour changes at non-synonymous sites (Measure A in Table 2, χ2, p<0.05), whereas no difference was observed at synonymous sites (Table 2, χ2, p=0.89).
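The per-epitope tally described above (how many epitopes show no amino acid change, and how many carry more than one variable position) can be sketched as follows. This is an illustrative outline only: it assumes ungapped protein sequences of equal length that are already aligned to the reference, and the sequences and coordinates are made-up toy values, not data from this study.

```python
def variable_positions(reference, strains, start, end):
    """Count epitope positions where any strain differs from the reference.

    Coordinates are 0-based and end-exclusive; sequences must be aligned
    and of equal length.
    """
    return sum(
        1
        for pos in range(start, end)
        if any(strain[pos] != reference[pos] for strain in strains)
    )


# Toy protein sequences (hypothetical).
reference = "ACDEFGHIKLMN"
strains = ["ACDEFGHIKLMN", "ACDEYGHIKLMN", "ACDEYGHIRLMN"]

# Hypothetical epitope coordinates within the toy protein.
epitopes = {"epitope_1": (0, 3), "epitope_2": (3, 9)}
for name, (start, end) in epitopes.items():
    print(name, variable_positions(reference, strains, start, end))
```

An epitope with a count of zero is invariant across the panel, while a count above one would correspond to the outlier pattern described for the five epitopes in esxH, pstS1, and Rv1986.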
To further corroborate our finding of hyperconservation of human T cell epitopes in MTBC, we repeated our analysis using a data set from a previous study in which 89 individual genes were sequenced in 99 human-adapted strains representative of the six major global lineages of MTBC3. Sixteen of these 89 genes belonged to the T cell antigens analyzed here, including two of the three outlier antigens, esxH and pstS1 (ref. 3). Analysis of this additional dataset of 16 antigens in 99 MTBC strains revealed an overall dN/dS for the epitope regions of 0.74. However, after excluding the two outlier antigens, the dN/dS dropped to 0.46, which was again lower than the genome-based dN/dS values for essential and non-essential genes (Fig. 3).
Taken together, our findings strongly suggest that a large proportion of the MTBC genome known to interact with human T cells is highly conserved and under as strong, or perhaps even stronger, purifying selection than essential genes.
In this study of 22 MTBC genomes, we demonstrate that, as expected, essential genes are more conserved than non-essential genes. These results are in agreement with a previous study which analyzed a single genome30. Surprisingly, however, we found that the large majority of the currently known T cell antigens are as conserved as essential genes. Furthermore, the epitope regions of these antigen genes are the most highly conserved regions we studied. This observation, that the regions of the genome that interact with the human adaptive immune system appear to be under even stronger purifying selection than essential genes, is inconsistent with a classical model of an evolutionary arms-race.
It is possible that the known human T cell epitopes that we found to be hyperconserved represent a select subset of all of the human T cell epitopes encoded in the genome, and that certain approaches to epitope identification have favoured discovery of hyperconserved epitopes in MTBC. For example, since most, if not all of the epitope discovery efforts to date have utilized proteins and/or peptide sequences of strains from one lineage (lineage 4) and T cells from humans that are likely to have been infected by strains of other lineages, the assays used may have been especially suited to identification of hyperconserved and/or cross-reactive epitopes. While further investigation using alternative approaches to epitope discovery may reveal that variable epitopes that exhibit evidence of positive selection exist in the MTBC, it is likely that the large number of epitopes that we examined will remain a significant subset of the total, and that future vaccine development efforts will need to account for the possibility that immune recognition of certain epitopes may actually provide a net benefit to the bacteria.
Lack of antigenic variation and immune evasion has been reported for a number of other human pathogens, including RNA viruses such as measles, mumps, rubella, and influenza type C31. Theoretical studies have suggested that the absence of immune escape variants in these viruses might be due to structural constraints in viral proteins or negative mutational effects leading to reduced infectivity or transmission31. While we cannot exclude the possibility that structural and functional constraints that are independent of T cell recognition contribute to hyperconservation of the regions encoding MTBC peptides recognized by human T cells, one important characteristic of the aforementioned viral pathogens is that they spread among young and immunologically naive hosts, which might eliminate the need for immune evasion31. Moreover, infection by these viruses usually results in acute disease, followed by elimination of the infection through adaptive immunity, and acquisition of lifelong immunity against re-infection. This further indicates that these viruses are specialized pathogens of immunologically naive hosts. By contrast, MTBC causes chronic and often lifelong infections, and adaptive immunity is usually unable to completely clear the infection18. Furthermore, tuberculosis patients are prone to re-infection32, and mixed infections are also increasingly recognized33. These observations suggest that the biological basis for the lack of antigenic variation in MTBC reported here differs from what has been proposed for antigenically homogeneous RNA viruses31. In addition, we determined that the fraction of hyperconserved T cell epitopes of the MTBC that are derived from essential genes is indistinguishable from the frequency of essential genes in the MTBC genome as a whole (18% versus 21%, respectively; χ2 = 0.28, p = 0.59), indicating that our results were not skewed by over-representation of T cell epitopes in essential genes.
Moreover, the T cell epitopes that we analyzed are present in genes from diverse gene ontologies, and the representation of five main gene categories (defined based on the NCBI Categories of Orthologous Groups (COG)) was no different in the T cell antigens when compared to the genome overall (χ2 with 4 degrees of freedom = 5.8, p = 0.21; Supplementary Table 6). Hence the only identifiable common property of these regions is their recognition by human T lymphocytes. These findings suggest that T lymphocyte recognition is an important factor in hyperconservation of these sequences, and that other structural or functional constraints are unlikely to fully account for the lack of sequence variation in these domains.
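The two χ2 results quoted above can be checked by converting each statistic to a p value. The sketch below uses closed-form expressions for the upper tail of the χ2 distribution; it assumes 1 degree of freedom for the essential versus non-essential comparison (the text does not state it, but a two-by-two comparison would have 1) and uses the stated 4 degrees of freedom for the COG comparison. The function name is illustrative.

```python
import math


def chi2_upper_tail(statistic, df):
    """Upper-tail probability of a chi-square statistic.

    Supports df = 1 and even df, where closed forms exist.
    """
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if df % 2 == 0:
        half = statistic / 2
        series = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * series
    raise ValueError("only df = 1 or even df are supported in this sketch")


print(f"essential-gene comparison: p = {chi2_upper_tail(0.28, 1):.3f}")
print(f"COG category comparison:   p = {chi2_upper_tail(5.8, 4):.3f}")
```

The second value comes out at about 0.21, matching the text. The first comes out at about 0.60, which is close to the reported 0.59; the small gap is what one would expect if the χ2 statistic was rounded to two decimals before being reported.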
Our data suggest that T cell epitopes in MTBC are under strong selection pressure to be maintained, perhaps because the immune responses they elicit in humans, which are essential for survival of an individual host, might partially work towards the pathogen’s benefit. One potential mechanism of benefit to MTBC from human T cell recognition is that human T cell responses are essential for MTBC to establish latent infection. This notion is supported by the fact that CD4+ T cell-deficient HIV-positive individuals progress rapidly to active disease after infection, rather than sustaining prolonged periods of latent tuberculosis34. Latent infection mediated by host T cell responses, with subsequent reactivation to active disease often occurring decades after initial infection, is a key characteristic of human tuberculosis, and might have evolved as a way for MTBC to transmit to later generations of susceptible hosts35. In addition, there is evidence that T cell responses may contribute directly to human-to-human transmission of MTBC. In particular, cavitary tuberculosis, which generates secondary cases more efficiently than other disease forms36, rarely occurs in CD4+ T cell-deficient HIV-positive individuals, and the frequency of cavitary lung lesions in HIV-infected patients with tuberculosis is directly correlated with the number of peripheral CD4+ T cells37. While the mechanisms of lung cavitation in tuberculosis are poorly understood, these observations suggest that CD4+ T cells directly or indirectly mediate tissue damage in tuberculosis, and together with our finding of epitope hyperconservation indicate that certain T cell responses may be detrimental to the host and beneficial to the pathogen. Hence our findings suggest that MTBC takes advantage of host adaptive immunity to increase its likelihood of spread, and that the benefits of enhanced transmission exceed the costs of within-host cellular immune responses to these epitopes.
In this manner, MTBC may resemble HIV, for which there is evidence that virulence has evolved, not to maximize replication of the virus within individual hosts, but to maximize the likelihood of its transmission38. Whether T cell responses to other epitopes, or specific T cell subsets (e.g. Th17 versus Th1), that benefit the host and not the bacteria can be identified will require additional studies in humans.
One limitation of this study was the exclusion of PE/PPE genes for technical reasons. Some of these genes are known to vary and to be cell-surface exposed, which has led to the hypothesis that they might be involved in antigenic variation24. However, no direct evidence for this has yet been presented. Future work will need to clarify the function and evolution of PE/PPE genes. By contrast, all the T cell antigens included in this study have been experimentally confirmed26. Furthermore, some of them are being targeted by new tuberculosis diagnostics and vaccines39. Our findings thus have important implications for the development of these new tools. On the one hand, the fact that MTBC harbours little sequence diversity in T cell antigens will facilitate the development of diagnostics that are universally applicable across geographical regions where MTBC strains differ8. On the other hand, the possibility that the immune responses induced by vaccine antigens might partially benefit the pathogen suggests current efforts in vaccine research should be broadened. Most disturbing is the suggestion that vaccine-induced immunity against these conserved epitopes may perversely increase transmission. In this respect, it is interesting to note that the currently available tuberculosis vaccine Bacille-Calmette-Guerin (BCG), which is a live vaccine based on an attenuated form of M. bovis, offers no protection against pulmonary tuberculosis in adults40. More importantly, some clinical trials of BCG have even reported an increased risk of tuberculosis in vaccinees compared to unvaccinated individuals41. Thus, in contrast to standard reverse vaccinology, in which the least variable antigens of a genome are targeted42, research into new tuberculosis vaccines should explore more variable regions of the MTBC genome.
While most of the T cell epitopes analyzed here were highly conserved, five epitopes in three antigens harboured a larger number of amino acid changes. The fact that the dN/dS measure dropped sharply after excluding these outlier antigens from the analysis further supports the notion that they are indeed outliers compared to the other antigens. One of these outlier antigens, esxH (Rv0288, also known as TB10.4), is a member of a gene family known to encode a Type VII secretion system43. Importantly, this antigen is being considered as a new vaccine antigen against tuberculosis39. Thus, even though most of the other vaccine antigens analyzed here are conserved, our finding that this particular vaccine antigen harbours a comparatively high number of amino acid substitutions across a panel of global MTBC isolates suggests that strain diversity should be considered during further development of the new vaccine candidates containing esxH (ref. 8).
We detected significant differences in dN/dS between essential, non-essential, and antigenic genes. However, the individual dN/dS values remain high when compared to most other bacteria44. Such a high dN/dS was reported previously for MTBC, and has been linked to reduced selective constraint against slightly deleterious mutations3. It was proposed that the serial transmission bottlenecks associated with patient-to-patient transmission in MTBC could lead to an increase in random genetic drift compared to the forces of natural selection. Our new data show that even though the strength of purifying selection in MTBC might be reduced overall compared to other bacteria, it clearly is still acting on and capable of differentiating between gene categories.
In summary, we show that T cell epitopes of MTBC are highly conserved, and do not reflect any ongoing evolutionary arms-race or immune-evasion. Instead, the patterns observed might be indicative of a distinct evolutionary strategy of immune-subversion developed by this highly successful pathogen. Other intracellular bacteria such as Salmonella enterica serovar Typhi exhibit a similar lack of antigenic variation45, suggesting comparable mechanisms might exist in other pathogens with a similar lifestyle.
We thank Fernando Gonzalez-Candelas, Sonia Borrell, and Douglas Young for comments on the manuscript. This project has been funded in whole or in part with Federal funds from the National Institute of Allergy and Infectious Disease, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN266200400001C. J.C. is a Howard Hughes Medical Institute Research Training Fellow. J.D.E. was supported by NIH grants AI046097 and AI051242, and S.G. by the Medical Research Council, UK, the Royal Society, the Swiss National Science Foundation, and NIH grants HHSN266200700022C and AI034238.
Methods and any associated references are available in the online version of the paper at http://www.nature.com/naturegenetics/.
Accession codes. The sequencing reads have been submitted to the NCBI Sequence Read Archive (SRA) with accession codes SRX002001-SRX002005, SRX002429, SRX003589, SRX003590, SRX005394, SRX007715, SRX007716, SRX007718-SRX007726, and SRX012272. Sequence and SNP data are also available at the Tuberculosis Database (TBDB).
COMPETING INTEREST STATEMENT
The authors declare no competing financial interests.