|Home | About | Journals | Submit | Contact Us | Français|
The advent of whole-genome sequencing has provided an unprecedented detail about the evolution and genetic significance of species-specific variations across the whole Mycobacterium tuberculosis Complex. However, little attention has been focused on understanding the functional roles of these variations in the protein coding sequences. In this work, we compare the coding sequences from 74 sequenced mycobacterial species including M. africanum, M. bovis, M. canettii, M. caprae, M. orygis, and M. tuberculosis. Results show that albeit protein variations affect all functional classes, those proteins involved in lipid and intermediary metabolism and respiration have accumulated mutations during evolution. To understand the impact of these mutations on protein functionality, we explored their implications on protein ductility/disorder, a yet unexplored feature of mycobacterial proteomes. In agreement with previous studies, we found that a Gly71Ile substitution in the PhoPR virulence system severely affects the ductility of its nearby region in M. africanum and animal-adapted species. In the same line of evidence, the SmtB transcriptional regulator shows amino acid variations specific to the Beijing lineage, which affects the flexibility of the N-terminal trans-activation domain. Furthermore, despite the fact that MTBC epitopes are evolutionary hyperconserved, we identify strain- and lineage-specific amino acid mutations affecting previously known T-cell epitopes such as EsxH and FbpA (Ag85A). Interestingly, in silico studies reveal that these variations result in differential interaction of epitopes with the main HLA haplogroups.
Tuberculosis (TB) is an ancient disease that has accompanied animals and humans for thousands of years. Globally, it is estimated that TB has killed 1 billion people in the past 200 years, which turns it as the biggest killer compared with other infectious diseases as plague, influenza, smallpox, malaria, cholera, or AIDS among others (Paulson 2013). TB is responsible for 1.8 million deaths each year (WHO 2016). The main pathogen causing TB in humans is Mycobacterium tuberculosis. Other species such as M. africanum and M. canettii are also able to infect humans with less pathogenic potential and they are phlylogeographically restricted to certain populations in West- and East-African countries, respectively (de Jong et al. 2010; Supply et al. 2013). On the other hand, animal-adapted ecotypes such as M. bovis and M. caprae infecting cattle and goats, respectively, represent an economic burden as well as a zoonotic risk for humans (Aranaz et al. 2003; Smith 2012; de la Fuente et al. 2015). Several studies have addressed the phylogenetic classification of TB-causing mycobacteria. Pioneering works based on comparative genomics demonstrated that Mycobacterium species evolved by irreversible loss of genomic regions, which cannot be compensated by horizontal gene transfer. This pattern of reductive evolution led to propose an evolutionary scenario for these species (Brosch et al. 2002). Subsequent studies based on whole-genome sequencing confirmed and refined the previous deletion-based phylogeny and proposed the existence of eight major lineages (L1–L7 and animal-adapted species), which include the human-adapted ecotypes M. tuberculosis (L1–L4 and L7), M. africanum (L5 and L6), and M. canettii and the animal-adapted ecotypes M. bovis, M. caprae, M. microti, M. pinnipedii, M. orygis and M. mungi (Comas et al. 2013). With the exception of M. canettii, considered an ancestral lineage from which the remaining species emerged, these lineages comprise the M. tuberculosis complex (MTBC).
Despite the MTBC is strictly clonal and it is characterized by less than 0.05% sequence divergence between lineages, for a 4 Mb genome, this value is readily translated into 2000 polymorphisms per genome (Brosch et al. 2002; Comas et al. 2013). Considering that MTBC species code for roughly 4000 genes, this means that half of the genes might be virtually affected by polymorphisms. Even if a number of these variations might represent silent mutations or affect non-coding sequences, it is tempting to speculate that some missense or nonsense mutations might determine lineage-specific phenotypes or even impact on the host range of human- and animal-adapted species.
It is well known that Beijing-L2 strains are associated with a massive spread of drug resistance in Europe and Asia and this phenotype appear to be related with positive selection of single nucleotide polymorphisms in selected genes (Merker et al. 2015). A comparative study of the mutation rate between some L2 and L4 strains suggest that acquisition of drug resistance in Beijing-L2 strains might be related to the higher basal mutation rate of this lineage (Ford et al. 2013). However, the specific polymorphism(s) underlying Beijing-L2 phenotypes are poorly understood. In this same line, it is still unknown why M. africanum L5 and L6 are geographically restricted to West African populations even if it has been recently sequenced the largest genome repertoire of this lineage (Winglee et al. 2016). This is also applied to M. canettii that is restricted to East African populations but the genetic determinants responsible for its phylogeographic preference start to be delineated (Supply et al. 2013; Blouin et al. 2014; Boritsch et al. 2016). Another attractive and yet unresolved question is the host-pathogen specificity of the MTBC. It is intriguing how so phylogenetically related bacteria infect a variety of hosts ranging from humans to ungulates, rodents and pinnipeds. Although the answer is hampered due to the relative scarcity of genome sequences from animal-adapted species, recent studies have started to address this issue (Bos et al. 2014; de la Fuente et al. 2015).
Previous genomic studies were predominantly focused on the human pathogen M. tuberculosis, mainly due to the scarcity of whole genome sequences from other members of the MTBC. Today, the advent of next generation sequencing has provided the first collection of genomic data from M. africanum and animal-adapted species (de la Fuente et al. 2015; Winglee et al. 2016). Thus, studies covering the whole MTBC are now possible. On the other hand, irrespective of this cumulative genomic data, studies regarding comparison of the coding sequences in the MTBC are more limited. In particular, investigations concerning the structural and functional implications of single amino acid mutations have been poorly reported. A recent study has proved the value of investigating amino acid polymorphisms to understand the evolution of virulence phenotypes in the MTBC. This work showed that a single amino acid substitution in the PhoPR virulence system from M. africanum-L6 and the animal-adapted ecotypes ablates PhoPR function. Consequently, L6 and animal-adapted strains lack acyltrehalose-derived lipids and other PhoPR-dependent phenotypes (Gonzalo-Asensio et al. 2014). Given the essential role of PhoPR in M. tuberculosis virulence (Broset et al. 2015), it is plausible to think that PhoPR polymorphisms have shaped the evolution of the MTBC. Thus, characterizing differences in protein coding sequences between mycobacterial species is expected to reveal yet unexplored virulence features. With this aim we have analyzed in this work 74 complete proteomes from six Mycobacterium species including M. africanum, M. bovis, M. canettii, M. caprae, M. orygis and M. tuberculosis (fig. 1). The effect of sequence variations on the structural and functional features of proteins will be discussed.
Complete 74 proteomes of 6 Mycobacterial species: M. tuberculosis (2 strains), M. africanum (28 strains), M. canettii NC_015848 (1 strain), M. bovis (41 strains), M. caprae (1 strain), and M. orygis (1 strain) were retrieved from NCBI GenBank (http://www.ncbi.nlm.nih.gov/). The strains analyzed include M. tuberculosis reference strain H37Rv (NC_000962), M. tuberculosis strain Beijing (NC_021054), M. orygis (112400015), M. canettii (NC_015848), M. caprae (CDHG01), and the remaining 69 strains listed in supplementary table S1, Supplementary Material online.
The determination of the core genome, which encodes 3009 proteins, from these 74 strains was carried out using the bioinformatic tool GET_HOMOLOGUES (Contreras-Moreira and Vinuesa 2013) and taking all core genes identified by both the OMCL and the COGS algorithms. The M. tuberculosis H37Rv strain was used as reference. Each set of homologous sequences were aligned and mutations were identified using custom Perl scripts, including amino acid substitutions, deletions, and insertions.
Epitopes were retrieved from (Comas et al. 2010) and following the methods described in Copin et al. (2014). Functional cluster classification was as in Tuberculist (http://tuberculist.epfl.ch). Mutations were found using custom Perl scripts.
Thirty-four non-redundant genomes from M. canettii (1), M. africanum (13), M. bovis (16), and M. tuberculosis (4) were analyzed. Translated CDS sequences of single-copy core clusters were aligned with Clustal-omega v1.2.1 (Sievers et al. 2011) and the resulting alignments translated back to codon alignments using the primers4clades suite (Contreras-Moreira et al. 2009). Each codon alignment was then passed to yn00_cds_prealigned, obtained from https://github.com/hyphaltip/subopt-kaks (Yang, 1997) to estimate the ratio of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS) of all pairs of pre-aligned sequences. Pairs of identical sequences in each cluster were not considered for dN/dS calculations.
DISOPRED v3.1 (Jones and Cozzetto 2015) predictions were performed for all protein sequences of the Mycobacterium core proteome. All input sequences, plus the reference database uniref90, were low-complexity filtered with PFILT and scanned with three iterations of BLASTPGP with an E-value cut-off of 0.001. Predictions in PhoR protein were also done using IUPred and ANCHOR (http://iupred.enzim.hu), MoRFPred (http://biomine-ws.ece.ualberta.ca/MoRFpred/index.html), and RONN (https://www.strubi.ox.ac.uk/RONN).
CD8+ T and CD4+ T cell epitope prediction was performed with NetMHCpan 3.0 (Andreatta and Nielsen 2015) and NetMHCIIpan 3.1 (Andreatta et al. 2015), respectively, because these tools were considered the best overall predictors in independent benchmarks (Chaves et al. 2012; Zhang et al. 2012; Trolle et al. 2015). Overlapping peptides comprising the amino acid residues of interest were tested for their binding affinity to all available classical HLA class I alleles (HLA-A, HLA-B and HLA-C; n = 2,915) and DR locus alleles for class II (n = 660) (supplementary table S4, Supplementary Material online). These HLA haplogroups were chosen due to higher predictive reliability (Andreatta et al. 2015) and data availability in the Immune Epitope Database (IEDB) (Vita et al. 2015). In the case of deletions, we predicted all the possible overlapping peptides that were present in the larger and smaller versions of the epitopes. Eight to 14-mers peptides were used in HLA class I predictions and nine to 25-mers in HLA class II. The cut-off for high binding affinity was IC50 < 50 nM or rank score < 0.5%. The fraction of the human population with the ability to recognize the predicted epitopes (population coverage) was estimated using the implementation of a previously described algorithm (Bui et al. 2006) at IEDB (http://tools.immuneepitope.org/tools/population/iedb_input).
Analysis of protein sequence variations was done in the group of proteins encoded by the core-genome of Mycobacterium species (M. africanum, M. bovis, M. canettii, M. caprae, M. orygis, and M. tuberculosis), which represents 3009/3645 proteins, ca. 82.5%. The Mycobacterium proteins encoded by the core-genome are distributed in nine functional classes as follows: virulence, detoxification and adaptation (80/3009 proteins, 2.6%), lipid metabolism (180/3009 proteins, 6.0%); information pathways (217/3009 proteins, 7.2%), cell wall and cell processes (593/3009 proteins, 19.7%), insertion sequences and phages (19/3009 proteins, 0.6%), proline–glutamate (PE), and proline–proline–glutamate (PPE) proteins (55/3009 proteins, 1.8%), intermediary metabolism and respiration (767/3009 proteins, 25.5%), unknown proteins (1/3009 proteins, 0.03%), regulatory proteins (158/3009 proteins, 5.2%), conserved hypothetical proteins (780/3009 proteins, 25.9%), and conserved hypothetical proteins with an ortholog in M. bovis (159/3009 proteins, 5.3%) (fig. 2A, right). This distribution is comparable to that of the complete M. tuberculosis H37Rv proteome. Analysis of polymorphisms revealed that they affect similarly all functional classes (fig. 2A, left). When the analysis is restricted to the subset of proteins with experimental evidence annotated in BioCyc (n = 323), the data reveal a similar trend, with the largest classes, lipid (19–30%) and intermediary metabolism and respiration (15%), harboring most mutations. Statistical analyses did not indicate deviations from the expected distribution of mutations along functional classes (fig. 2B;supplementary table S2, Supplementary Material online). These percentages were normalized taking into account the total proteins associated to each functional group and the sequence identity averages.
Research of the past decade and a half has provided valuable information about intrinsically disordered/ductile proteins (IDPs) and regions (IDRs) as well as their role in diseases (van der Lee et al. 2014; Yan et al. 2016). IDPs and IDRs are characterized by a number of specific features that distinguish them from those of ordered proteins and domains and make them predictable (He et al. 2009). Such particular structural and biochemical properties of disordered/ductile regions are ideal for proteins that mediate specific molecular recognition and interaction with partners or co-ordinate regulatory events (Uversky 2015). Accordingly, ductility and flexibility feature in proteins confers advantages for their functional versatility (Babu et al. 2011). But also variations in protein flexibility can modify protein function and phenotype.
This is the case of the PhoPR two-component virulence system, where the single mutation Gly71Ile found in PhoR from M. bovis and M. africanum L6 cause strong phenotypic differences (Gonzalo-Asensio et al. 2014). The Gly71Ile mutation is located within the periplasmic loop of PhoR (Ser36 to Arg155) (fig. 3A). The intrinsically disorder predictions performed in this work show that such mutation reduce the flexibility of the Thr61–Gln91 amino acid segment. Thus, subtle variations in the flexibility of the periplasmic loop can cause a dramatic effect on the protein function and phenotype. The results obtained with IUPred (Dosztányi et al. 2005) and RONN (Yang et al. 2005) are shown in figure 3B. These are similar to those obtained with DISOPRED v3.1 (Jones and Cozzetto 2015) (supplementary fig. S1, Supplementary Material online). Moreover, ANCHOR (Dosztányi et al. 2009) and MorRFPred (Miri Disfani et al. 2012) point out that Thr61–Gln91 amino acid region in M. tuberculosis PhoR could be a molecular recognition motif. Besides, the glycine substitution by the branched-amino acid isoleucine probably impairs ordered-disordered transitions in such protein region of M. bovis/M. africanum modifying possible molecular interactions (fig. 3B).
Taking into account the above results, further analyses were focused to investigate the presence of mutations in disordered/ductile regions (IDRs) of Mycobacterium proteomes in order to know their effect on structural and functional protein features. The results derived from DISOPRED v3.1 prediction indicated that disordered residues are present in all protein categories of M. tuberculosis being its content higher in antigens compared with essential and non-essential proteins (fig. 4A). Moreover, our results also show that epitopes are preferentially distributed in ordered regions in the MTBC and the higher frequency of disordered regions in antigenic proteins is restricted to non-epitope regions. This finding is agreement with previous investigations (Mitic et al. 2014; Pavlović et al. 2014). Furthermore, ca. 14% M. tuberculosis H37Rv proteins contain a long disordered segment (L > 30 aa), which is in agreement with other bacterial proteomes (fig. 4B). It is estimated that archaea and bacteria have 7–30% of such proteins (Pavlović-Lažetić et al. 2011). The amino acid composition of M. tuberculosis IDRs is enriched in methionine and proline (fig. 4C), which are characteristic of IDRs in different organisms (Dunker et al. 2008; Yruela and Contreras-Moreira 2012).
To know in detail the implications of mutations in ductile regions the attention was focused in well-annotated proteins with longer disordered segments (L > 30) because these regions are normally associated with high confidence to particular functions (Radivojac et al. 2007; Orosz and Ovádi, 2011; Yan et al. 2016). Twenty-one of these well-annotated proteins have mutations along their sequences. Among them only four proteins, Rv2215 (DlaT), Rv2358 (SmtB), Rv2495c (BkdC), and Rv3003c (IlvB1) have mutations inside IDRs (table 1). Interestingly, SmtB of M. tuberculosis Beiijing shows five amino acid substitutions in the Phe30–Pro37 motif of the N-terminal in comparison with M. tuberculosis H37Rv. The substitutions Ala31Ser, Glu32Thr, Cys33Ala, Thr35Gly, and Phe36Gly are predicted to induce a gain of flexibility in the sequence of M. tuberculosis Beijing strain compared with the H37Rv one (supplementary fig. S2, Supplementary Material online). ANCHOR (Dosztányi et al. 2009) predicted that this mutated region is more probably a molecular recognition motif (MoRF) in the Beijing strain. These variations might affect the function of the N-terminal of this protein.
SmtB is a Zn-binding transcription regulator, a member of the ArsR family (Canneva et al. 2005), which allows the organism to respond to a wide range of changes in its immediate microenvironment. The SmtB presents homology with the human p53 transcription factor. Particularly, the alignment shows similarity with the TransActivation Domain 2 (TAD2) (residues 40–60), and a proline-rich region, PR (residues 64–92) of its N-terminal (supplementary fig. S3, Supplementary Material online). It is well known that the N-terminal of p53 is involved in a multitude of interactions with a wide spectrum of partners (Xue et al. 2013). Considering these observations, the amino acid substitutions in the N-terminal of SmtB of M. tuberculosis Beijing might perturb the interaction with their partners or to increase their number. The tridimensional structure of SmtB protein is still unknown and there is not information about the possible molecular interactions with their partners. However, structural homologues could provide information about these features and help us to understand the effect of such variations. In this sense, structural models of the SmtB from both M. tuberculosis H37Rv and Beijing (E-value = 2.2e−14) were produced using the NMR tridimensional structure (pdb 2lkp) of the NmtR transcription factor of M. tuberculosis (Rv3744) as template, which is also a member of the ArsR family. The comparison of both SmtB structure models shows relative conformational variations in the N-terminal (fig. 5).
Furthermore, Rv2495c, a component of branched-chain alpha-ketoacid dehydrogenase complex (BkdC, UniProt O06159) shows mutations in Tyr103Asp and Thr107Ala in M. tuberculosis Beijing compared with M. tuberculosis H37Rv which increase the flexibility in this region as indicate IUPred/ANCHOR predictions (supplementary fig. S3, Supplementary Material online). The functional implications for this higher flexibility remain to be elucidated.
It is generally well assumed that human antigens containing T-cell epitopes in M. tuberculosis are evolutionarily hyperconseved. Accordingly, dN/dS of antigens and epitope regions are lower than dN/dS of essential and non-essential proteins (Comas et al. 2010). However, to the best of our knowledge, the hyperconservation of T-cell epitopes has not been interrogated in other species different from M. tuberculosis and BCG vaccines (Copin et al. 2014; Coscolla et al. 2015; Stucki et al. 2016). The overall ratio of nonsynonymous/synonymous substitutions (dN/dS) was calculated for the aligned CDS sequences from MTBC species listed above based on the number of nonredundant synonymous and nonsynonymous changes. Note that for these calculations a subset of 34 nonredundant genomes was analyzed, leaving out those with average peptide identity over 95.3%. The calculated dN/dS values indicated that antigens are evolutionarily conserved to the same extent than essential and nonessential genes (fig. 6A). This result allows expanding the general assumption about hyperconservation of human T-cell epitopes to the whole MTBC.
However, it should be noted that interrogating the whole MTBC implies a higher degree of antigenic variation than comparing M. tuberculosis species alone (Comas et al. 2010). In this context, it has been also demonstrated that epitope variability is higher in widely distributed L4 sublineages relative to those geographically restricted (Stucki et al. 2016). Specifically, despite of antigen conservation, we identified single mutations in 51/852 epitopes of 27 antigenic proteins encoded by the core-genome of the MTBC species analyzed. In addition, double and even higher number of mutations were found in 9/852 epitopes of seven antigenic proteins. These polymorphisms are detailed in fig. 6B and supplementary table S3, Supplementary Material online.
Some polymorphisms are specific of one specie as the case of Ala71Ser in EsxH and Ala177Val in IlvB1, which are highly represented in M. africanum, and Thr46Pro in Mce4C of M. bovis. Others are conserved in different species such as Val352Ala in PstS1 and Gln68His in Rv1733c which are represented in M. africanum, M. bovis, M. orygis, M. caprae, and M. tuberculosis Beijing. The effect of some of these polymorphisms in antigenic proteins was partially investigated.
This is the case of EsxH, a short protein of 96 amino acids in length, which is a member of the Esx family (Uniprot P9WNK3; Ilghari et al. 2011). M. tuberculosis encodes 23 Esx proteins (EsxA–W), which are generally short in length (~100 residues) and are organized in pairs within the genome. Of these, the EsxA/EsxB (ESAT-6/CFP10) pair is the best known mycobacterial virulence factor and promotes escape of M. tuberculosis from the phagosome to the cytosol (van der Wel et al. 2007; Houben et al. 2012) The EsxG/EsxH pair is co-ordinately regulated forming a small operon. Both proteins interact with each other to form a tight 1:1 complex (Lightbody et al. 2008). EsxG·EsxH complex contains a specific Zn(2+) binding site in the N-terminal.
The tridimensional structure of EsxG·EsxH complex has been resolved by NMR spectroscopy (Ilghari et al. 2011). The contact surface between EsxG and EsxH is essentially hydrophobic and the Ala71 residue is located in the intermolecular interface of EsxG·EsxH complex close to Met72, which together Met18 form the base of the cleft. The conserved substitution Ala71Ser in M. africanum probably destabilizes this hydrophobic interaction. The comparison of the native complex structure and that of the mutant Ala71Ser shows that the polar Ser71 residue perturbs the interface close to EsxG–Arg57 and EsxH–Met72 (fig. 7). The adjacent conserved Ala71/Ser71 replacement could also perturb the metal binding. Thus, besides the potential role of these epitope polymorphisms in altering binding to MHC molecules (as will be studied below), mutations in these antigens might also impact on protein functionality.
Other antigen carrying amino acid mutations is Rv0934 (Uniprot P9WGU1), which corresponds with the phosphate-binding protein PstS1 and functions in inorganic phosphate uptake (Peirs et al. 2005). The tridimensional 2.16 Å structure of phosphate-bound PstS-1 has been resolved by X-ray crystallography (Vyas et al. 2003). The residues Asn66 and Val352 are localized in the surface of the N-terminus (residues 42–128) and the C-terminus (residues 314–374) of the protein, respectively, which are exposed to the solvent (supplementary fig. S5, Supplementary Material online). The substitution Asn66Ser can modify slightly the electrostatic surface of the protein as shown in supplementary figure S5, Supplementary Material online. This polymorphism might perturb the electrostatic balance and the interaction with the outer cell wall membrane and/or the capture of ligands as phosphate in the medium. The Val352Ala polymorphism has been reported earlier in M. bovis and BGC by Liu et al. (2013). It is worth noting that our analysis indicates that this substitution is also highly represented in other species such as M. africanum, M. orygis, and M. caprae. Because the mycobacterial antigen PstS1 is a highly immunogenic and immunostimulatory component of the mycobacterial cell membrane such mutation might be an evolutionary adaptation and have caused immune evasion (Liu et al. 2013).
Additionally, the 1.5 Å resolution structure of MPT63 (Rv1926c, Uniprot P9WIP1, pdb 1lmi) shows that the substitution Val93Ile found in M. canettii is localized in the only helical secondary structure, a 310 helix of three residues at the start of β-strand 6 (supplementary fig. S6, Supplementary Material online). This position is in the vicinity of a negatively charged cavity and a positively charged channel, features probably with implications in the MPT63 function (Goulding et al. 2002).
Binding of Mycobacterium epitopes to MHC class II molecules (named HLA in humans) plays a central role to mount an efficient adaptive immunity mediated by T-cells (O’Garra et al. 2013). Accordingly, polymorphisms in epitopes might alter the binding affinities of these molecules to the HLA cleft and the subsequent immune responses (Ivanyi 2014). Because HLA genes harbor a high degree of polymorphisms, testing the binding affinity of individual epitopes is challenging. Hence, in silico algorithms are useful prediction tools to compare the binding of epitope variants across representative HLA haplotypes. Using such immunoinformatics approach, we demonstrate that some epitope polymorphisms greatly impact the number of HLAs able to bind epitopes with high affinity. Consequently, because HLA molecules have distinct frequencies across the world, the fraction of the human population that will theoretically respond to the described epitopes (population coverage) can be affected (Bui et al. 2006). In our data, population coverage across several world regions was increased for Val34Met in Rv0288 (EsxH), Leu262Arg in Rv0290 (EccD3), Ile212Thr in Rv0853c (Pdc), Ala162Thr in Rv1733c, Val93Ile in Rv1926c (MPT63), Pro194Ser in Rv3714c, and Thr145Ala in Rv3804c (FbpA/Ag85A). On the other hand, a decrease was observed for Pro9Leu and Ala71Ser in Rv0288 (EsxH), diverse mutations in Rv2819c, Leu270Val in Rv3025c (IscS) and Leu76Arg in Rv3615c (EspC) (fig. 8;supplementary table S5, Supplementary Material online). Our results also suggest that T-cell vaccine design might benefit from taking into account MTBC strain variation and global HLA haplogroup distributions. Indeed, a polymorphism in Rv3804c (FbpA/Ag85A) found in a specific M. africanum strain results in stronger interaction of this epitope variant by most HLA haplogroups and specially in central American people (fig. 8). Conversely, some epitope mutations cause little or no alteration in population coverage (supplementary table S5, Supplementary Material online). Predictions for HLA class I binding was also performed (supplementary table S5 and fig. S7, Supplementary Material online).
According to the clonal origin of MTBC species, it is observed that mutations in coding sequences are uniformly distributed across the MTBC. The M. canettii outgroup has ca. 1.7 more mutations per protein than the average and this observation can be associated to its greater genetic diversity. Irrespective of this uniform distribution, proteins involved in lipid and respiration pathways have accumulated mutations during the MTBC evolution. During infection M. tuberculosis uses lipids (fatty acids and cholesterol) from the host macrophages as the main energy source (Schnappinger et al. 2003; Rohde et al. 2012). In this context, it is tempting to hypothesize that polymorphisms in the lipid metabolism might provide selective benefits to survive in the intracellular fatty acid environment. Those strains with efficient metabolic flux through lipid metabolism routes would be associated with higher virulence and vice versa. Concerning the higher frequency of polymorphisms of proteins involved in intermediary metabolism and respiration, TB-causing bacteria should be able to shift from an aerated environment during transmission to a hypoxic environment during infection. Accordingly, we can postulate that polymorphisms in respiration pathways might influence the infectious cycle. A recent study has linked the hypoxic response with lipid metabolism in M. tuberculosis (Galagan et al. 2013) and consequently it is possible that both metabolic pathways have co-evolved in different Mycobacterium species.
The comparative analyses of protein variations carried out in this work point out that changes in flexibility/ductility of MTBC protein regions can explain phenotype differences. It is known that IDRs in proteins are relevant in molecular recognition and protein-protein interactions (Vacic et al. 2007; Yan et al. 2016). This fact makes that these regions can have an important role in bacterial pathogenesis and virulence. A previous study demonstrated that a Gly71Ile substitution severely impacts on the functionality of PhoPR. The periplasmic loop of PhoR is involved in the recognition of a yet unknown activating stimulus. Accordingly, changes in ductility of the PhoR sensor domain might impair recognition of the activating stimulus which otherwise would result in lack of PhoP transcriptional activation of virulence genes (Broset et al. 2015). It is important to remember that the Gly71Ile substitution is exclusive of M. africanum and animal-adapted Mycobacterium species. These lineages are associated with slower progression to active disease and lower pathogenicity in humans than the human-adapted pathogen M. tuberculosis (Gonzalo-Asensio et al. 2014). In line with this observation, A.K. Dunker and co-workers have reported that evolution of disordered regions can be related to pathogenic microbes in comparison with non-pathogenic ones (Mohan et al. 2008). Altogether, it is worth noting that this structural analysis provides the first relationship between such protein structural variation and reported phenotype differences.
Here, we also demonstrate that hyperconservation of T-cell epitopes can be extrapolated to the whole MTBC. Accordingly, we can assume that human-adapted lineages (M. tuberculosis L1–L4 and L7 and M. africanum L5 and L6) equally benefit from being recognized by the human immune system. This finding could be also applied to the zoonotic transmission of M. bovis from cattle to humans, because our results demonstrate hyperconservation of human epitopes in the cow pathogen M. bovis. However, because we do not know the epitope repertoire in mammalian hosts infected by animal-adapted species, we cannot confirm whether animal T-cell epitopes are also evolutionarily hyperconserved.
Interestingly, despite epitopes are hyperconserved some of them show mutations. These variations could affect the structure and function of their corresponding antigens such is the case of EsxH. A recent study has demonstrated that EsxH is required for proper iron acquisition in M. tuberculosis and deletion of esxH results in decreased virulence in mice (Tufariello et al. 2016). Because a similar mutation has been reported in the non MTBC species M. marinum and M. ulcerans, it might represent an evolutionary adaptation of these species to modulate iron availability and/or virulence to specific hosts.
It is also worth of mentioning that mutations found in epitopes are predicted to modify the binding affinity to HLA proteins. A recent study expanded the search for antigenic variations by examining 1226 epitopes in 216 diverse strains of M. tuberculosis. Seven antigenic proteins showed amino acid changes between M. tuberculosis lineages. Epitopes of these proteins were immunogenic in whole blood assays and importantly, amino acid mutations in these epitopes produced differential immune stimulation (Coscolla et al. 2015). Another recent study demonstrated differences in dN/dS of epitopes from L4 sublineages. Those sublineages considered generalists associated with worldwide distribution present higher epitope variation than geographically restricted specialist sublineages (Stucki et al. 2016).These pioneering studies have paved the way to study rare epitope variants and our findings could help to translate this knowledge to other MTBC species that are also an important cause of mortality in animals and restricted human populations. In this regard, a Leu262Arg in EccD3 of M. bovis found in this work greatly increases the population coverage of HLAs class II that theoretically respond to this epitope. It remains to be explored whether this polymorphism is associated to more efficacious zoonotic transmission of certain M. bovis strains. Different polymorphisms in Rv0288 (EsxH) from M. africanum are associated with higher (Val34Met) and lower (Pro9Leu and Ala71Ser) human population coverages of HLAs class II responding to these variants. These might be sites involved in the adaption of the different species to humans or to specific subsets of the human population.
Finally, even if immunogenicity does not necessarily correlate with protection, our results could have extraordinary implications in vaccine development. Future vaccine candidates might greatly benefit from containing epitope variants with increased potential to stimulate immunity. This could be applied to either live attenuated vaccines constructed in genetically immunodominant backgrounds or to subunit vaccines containing antigens with immunodominant epitopes. One of the most advanced vaccine candidates named MVA85A consist on the Modified Vaccinia Ankara virus expressing FbpA/Ag85A. This vaccine was designed to boost the immunity provided by BCG. Although MVA85A was safe and immunogenic it failed to confer significant protection compared with BCG (Tameris et al. 2013). Our predictions indicate that FbpA/Ag85A variant found in M. africanum binds with more affinity to most HLA haplogroups. If this FbpA/Ag85A variant proves to be differentially immunogenic comparatively to the wild type, new vaccines based on modified MVA85A could be efficacious strategies to combat TB. Currently, 4 of 8 subunit vaccines (Ad5Ag85A, ChAdOx185A, MVA85A, and TB/FLU-04L) contain Ag85A expressed from a viral vector. Other subunit vaccines use FbpB/Ag85B, a related member to the Fbp family (Fletcher and Schrager 2016). Because there is a dominance of vaccines that express or contains Fbp proteins, these strategies could greatly benefit from comparative and immunological studies to select the best antigenic variants.
Supplementary data are available at Genome Biology and Evolution online.
This work was supported by Gobierno de Aragón (DGA-GC B18 and B25), the Spanish Ministry of Science and Competitiveness (BIO2014-52580P, CSIC13-4E-2490), Instituto de Salud Carlos III (PI12/01970) and the European Commission Horizon 2020 (H2020-PHC-643381). Some of these grants were partially financed by the EU FEDER Program. This work was also supported by Fundação para a Ciência e Tecnologia, Portugal (IF/00474/2014) and cofunded by Programa Operacional Regional do Norte (ON.2—O Novo Norte), Quadro de Referência Estratégico Nacional (QREN), through the Fundo Europeu de Desenvolvimento Regional (FEDER).