Molecular mimicry of host proteins is a common strategy adopted by bacterial pathogens to interfere with and exploit host processes. Despite the availability of pathogen genomes, few studies have attempted to predict virulence-associated mimicry relationships directly from genomic sequences. Here, we analyzed the proteomes of 62 pathogenic and 66 non-pathogenic bacterial species, and screened for the top pathogen-specific or pathogen-enriched sequence similarities to human proteins. The screen identified approximately 100 potential mimicry relationships including well-characterized examples among the top-scoring hits (e.g., RalF, internalin, yopH, and others), with about 1/3 of predicted relationships supported by existing literature. Examination of homology to virulence factors, statistically enriched functions, and comparison with literature indicated that the detected mimics target key host structures (e.g., extracellular matrix, ECM) and pathways (e.g., cell adhesion, lipid metabolism, and immune signaling). The top-scoring and most widespread mimicry pattern detected among pathogens consisted of elevated sequence similarities to ECM proteins including collagens and leucine-rich repeat proteins. Unexpectedly, analysis of the pathogen counterparts of these proteins revealed that they have evolved independently in different species of bacterial pathogens from separate repeat amplifications. Thus, our analysis provides evidence for two classes of mimics: complex proteins such as enzymes that have been acquired by eukaryote-to-pathogen horizontal transfer, and simpler repeat proteins that have independently evolved to mimic the host ECM. Ultimately, computational detection of pathogen-specific and pathogen-enriched similarities to host proteins provides insights into potentially novel mimicry-mediated virulence mechanisms of pathogenic bacteria.
Molecular mimicry can be broadly defined as sequence or structural resemblance between microbial and host molecules. This has been studied extensively within the context of autoimmunity, whereby similarities between foreign and self molecules can lead to cross-reactive epitopes and ultimately autoimmune disease.1-5 However, there is a growing body of evidence that molecular mimicry of host proteins is a broader strategy adopted by bacterial pathogens to exploit and subvert host processes during infection and plays a role in a wide range of virulence pathways, including pathogen recognition and binding to human cells, evasion of the host immune response, and intracellular survival in host immune cells.6-10
Mimics within pathogens are thought to originate through two evolutionary mechanisms. Pathogen genomes can obtain host genes directly through lateral transfer (reviewed in Koonin et al.11). Such cases frequently have detectable homology between pathogen and host proteins, a complex sequence or domain composition, and limited occurrence of the mimic in one or a small number of pathogenic species. For example, Coxiella burnetii, the causative agent of human Q fever, encodes two eukaryote-like sterol reductases. These mimics may play a role in formation of the cholesterol-rich Coxiella parasitophorous vacuole,12 which serves as a barrier to sequester nutrients and ions and also facilitates pathogen survival inside the host cell. These enzymes are extremely rare in prokaryotes and are thought to have arisen in Coxiella by lateral transfer from a eukaryotic source.13
A second possible mechanism is convergent or parallel evolution of a pathogenic protein toward resemblance of a host protein.7,10,14 Here, over time, co-evolutionary forces generate pathogen proteins that resemble host proteins structurally, or resemble smaller sequence fragments of host proteins, without homology between the pathogen and host proteins.7 For example, enterohemorrhagic Escherichia coli (EHEC) secretes a type III effector (EspFU) into human host cells, which stimulates actin polymerization by interacting with host WASP proteins.15 Exploitation of host functions is achieved through subtle structural mimicry of the host WASP autoinhibitory helix, but there is no detectable sequence similarity between the two proteins. By stimulating actin polymerization, EspFU mediates attachment of EHEC to host epithelial cells, which is critical to its virulence mechanism. Another example of convergent evolution is that of the Yersinia effector protein, invasin, which has evolved to mimic the integrin-binding surface of fibronectin.16 This surface mediates high affinity binding to β1 integrins on host M cells, which induces cytoskeletal reorganization and allows the pathogen to gain entry into the host cell.
Though there are exceptions (e.g., see Graham et al.17), pathogen mimics tend to function similarly to their human counterparts. For instance, through mimicry of human guanine-exchange factors (GEFs), the Legionella effector RalF functions as a GEF in the host and recruits ADP-ribosylation factor (Arf) to manipulate host vesicular trafficking.18 Through mimicry of human tyrosine phosphatases, Yersinia YopH dephosphorylates a number of human proteins including p130Cas, which leads to inhibition of phagocytosis (reviewed in Stebbins and Galan7 and Knodler et al.8). Furthermore, internalin virulence factors are composed of leucine-rich repeats (LRRs) with binding surfaces like those of eukaryotic LRRs, and play a role in adherence to and invasion of host cells.7,19,20
To date, discovery of pathogen mimics has been done largely on a case-by-case basis, and it is possible that there exist many additional mimics that may be detectable through computational methods. In previous work, for instance, we identified sequence and structural similarities between clostridial toxins and mammalian collagens, from which we hypothesized that collagen may be an additional mimicry target of pathogenic bacteria,21 which could play a role in adhesion of pathogens to the host extracellular matrix. However, detection of sequence similarity between host and pathogenic proteins is by itself not indicative of mimicry or pathogen-specific exploitation of host functions.
Here, motivated by our previous work and the broad goal of detecting host-pathogen mimicry at a genomic scale, we performed an analysis of bacterial pathogen vs. non-pathogen proteomes and compared their similarities to the host (i.e., human) proteome. We screened for cases where the detected similarities to host proteins are pathogen-specific or are enriched in a variety of pathogenic species compared with non-pathogens, thus producing a list of candidate pathogen mimics and their human targets. It is important to note that while the analysis is based on the human proteome, host specificity of the predicted mimics toward human is not certain, and the predictions may reflect mimicry of proteins from alternative eukaryotic hosts. A similar approach has been used to screen for molecular mimicry candidates in protozoan parasites,22 yet to our knowledge a large-scale computational analysis has not been performed for human pathogenic bacteria.
Ultimately, our results provide additional evidence that collagens and extracellular matrix proteins in general are targets of mimicry by a range of pathogenic bacteria. Moreover, we report the unexpected result that such mimics have evolved independently in a range of bacterial pathogens through separate amplifications of short peptide repeats. In addition to extracellular matrix proteins, the screen predicted numerous known and potentially novel mimicry relationships that are candidates for future experimental investigation.
We developed and applied a computational pipeline (outlined in Fig. 1) to identify potential pathogen mimics of human (host) proteins using comparative proteome analysis. First, proteome sequence data was retrieved for human and 128 bacterial species including 62 bacterial pathogens of humans and 66 non-pathogens as annotated by the Comprehensive Microbial Resource (CMR).23 An all-by-all BLAST24 analysis was performed, in which all human proteins were searched against all bacterial proteomes, and hits with E-values < 1e−06 were collected and compared between the two groups (Fig. 1).
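The hit-collection step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the BLAST results are available in the standard 12-column tabular format (E-value in column 11, bitscore in column 12), and the names `collect_hits`, `species_of`, and `pathogens` are hypothetical.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-6  # threshold stated in the text (E < 1e-06)


def collect_hits(blast_lines, species_of, pathogens):
    """Group significant hits per human protein into pathogen / non-pathogen sets.

    blast_lines: iterable of tab-separated lines in 12-column tabular format
                 (assumed layout: query, subject, ..., evalue, bitscore)
    species_of:  dict mapping bacterial protein id -> species name (hypothetical)
    pathogens:   set of species names annotated as pathogens
    Returns {human_id: {"pathogen": {species: best_bitscore},
                        "nonpathogen": {species: best_bitscore}}}
    """
    hits = defaultdict(lambda: {"pathogen": {}, "nonpathogen": {}})
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= E_VALUE_CUTOFF:
            continue  # keep only hits with E < 1e-06
        species = species_of[subject]
        group = "pathogen" if species in pathogens else "nonpathogen"
        best = hits[query][group]
        # Keep the top-scoring hit per species for each human protein.
        if bitscore > best.get(species, float("-inf")):
            best[species] = bitscore
    return hits
```

Keeping only the best bitscore per species makes it straightforward to count how many pathogen and non-pathogen species contain a match for each human protein, and to compare the top score in each group.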
From the set of 33340 human proteins, 9149 (27.4%) had BLAST matches with E < 1e−06 in one or more bacterial proteomes (Fig. S1A). For the average human protein, the frequency of pathogen vs. non-pathogen genomes containing a match, and the top pathogen vs. non-pathogen alignment score (bitscore) and E-value, were highly similar (Fig. S1B–E). Moreover, a larger fraction of the human proteome was similar to proteins in the non-pathogen set (Fig. S1A), which may reflect pathogen-associated genome reduction.25
While molecular mimics may be encoded by pathogenic and non-pathogenic species, and serve different biological purposes (e.g., virulence, mimicry of immune epitopes, and survival of commensal bacteria inside host), we aimed to identify mimics that may play specific roles in pathogen virulence. Here, we use the definition of pathogen mimics as bacterial pathogen-encoded proteins that share significant similarities with host proteins for the purposes of interacting or interfering with host machinery for the pathogen’s benefit.10,22 To identify such a subset, we further processed this list of bacteria–human protein similarities to identify those specific to or enriched in pathogens and diminished or completely absent in the non-pathogen group (controls), and thus indicative of pathogen–host specificity and potential molecular mimicry. This involved applying the following criteria: that a hit is specific to or at least 2-fold enriched in pathogen species, and that a top pathogen hit has a greater alignment score than the corresponding top non-pathogen hit (bitscore difference >10, see Methods) (Fig. 1). These parameters capture a subset of potential mimics within the distributions shown in Figure S1C–E.
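A minimal sketch of these two filtering criteria is given below. Several details are assumptions made for illustration only, since the exact definitions are given in the Methods: enrichment is computed here as the ratio of the fraction of pathogen species with a hit to the fraction of non-pathogen species with a hit, and a pathogen-specific hit (no non-pathogen match at all) is treated as passing both criteria. The function name and argument layout are hypothetical.

```python
N_PATHOGENS = 62      # pathogen species in the data set (from the text)
N_NONPATHOGENS = 66   # non-pathogen species in the data set (from the text)


def is_candidate_mimic(pathogen_scores, nonpathogen_scores,
                       min_fold=2.0, min_bitscore_gap=10.0,
                       n_path=N_PATHOGENS, n_nonpath=N_NONPATHOGENS):
    """Apply the pathogen-specificity / enrichment filter to one human protein.

    pathogen_scores, nonpathogen_scores: dicts {species: best_bitscore}
    for species with a significant hit to this human protein.
    """
    if not pathogen_scores:
        return False  # no pathogen hit, nothing to call a mimic
    if not nonpathogen_scores:
        return True   # pathogen-specific hit (assumed to pass both criteria)
    # Assumed enrichment measure: ratio of species fractions with a hit.
    frac_path = len(pathogen_scores) / n_path
    frac_nonpath = len(nonpathogen_scores) / n_nonpath
    enriched = frac_path / frac_nonpath >= min_fold
    # Top pathogen hit must beat the top non-pathogen hit by > 10 bits.
    gap = max(pathogen_scores.values()) - max(nonpathogen_scores.values())
    return enriched and gap > min_bitscore_gap
```

For example, a human protein with hits in two pathogen species (top bitscore 100) and one non-pathogen species (top bitscore 50) passes under these assumptions, whereas one with a single pathogen hit and a single non-pathogen hit does not, because the species fractions are nearly equal.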
Applying these filters resulted in a final list of 355 human proteins predicted as potential targets of pathogen molecular mimicry (Table S1A). These predicted mimicry relationships occurred in a small number of bacterial species (average = 3.2 species, 2.5%) compared with that observed for non-mimics (average = 36.8 species, 28.8%), which as expected appears to contain all ubiquitous proteins in the data set (2333 found in human and a majority of bacterial species, 208 conserved across all species).
The 355 human proteins were predicted targets of 231 total pathogen proteins from 53 pathogen species (Table S1B). We also removed redundancy within this set (see Methods) in order to generate a smaller list of 95 highly unique relationships (Table S1C) for subsequent analysis (see Methods). The top 25 most unique mimicry predictions are listed in Table 1. As explored in later sections, the top predictions include extracellular matrix proteins (collagen and leucine-rich repeat proteins), as well as several virulence factors and known examples of molecular mimicry (Table 1).
Figure 2 displays a plot of the top BLAST matches to each human protein in pathogen vs. non-pathogen proteomes. Similarities to the well-known mimicry targets, guanine-exchange factors,26 are shown in blue and are clear outliers in this distribution, which serves as a positive control. Similarities to human collagens, which correspond to the top predicted mimicry relationships (Table 1), are also indicated in red, and similarly display elevated scores in the pathogen set. As an additional control, we performed the same computation using plant pathogen/non-pathogen definitions (6/16 respectively), and this signal was absent (Fig. 2, right).
The pathogen vs. non-pathogen bitscore distributions for nine of the top-scoring hits from Table 1 are shown in Figure 3. In each case, the overall bitscore distributions for both pathogen and non-pathogen proteomes are similar to the extreme value distributions that might be expected by random alignments, but also contain pathogen proteins with considerably elevated similarities to a human protein, indicative of mimicry. Some of the detected mimics occur in a few pathogens (e.g., delta sterol reductase homolog exclusive to Coxiella burnetii), while others occur more broadly throughout a range of pathogen species (e.g., putative mimics of collagens and leucine-rich repeat proteins, Fig. 3).
We then examined the list of predicted mimicry relationships in terms of known mimics, homology to virulence factors, statistically enriched functions, and comparison with literature. These analyses suggest that the predictions are enriched in mimicry-mediated virulence mechanisms of bacterial pathogens, and include both known and putative novel mimicry relationships.
Prediction #3 corresponds to detected similarity between a human leucine-rich repeat protein (gi 122937309) and a leucine-rich repeat protein from Streptococcus pyogenes (SpyM3_1561) (Table 1) as well as Listeria internalin virulence factors (e.g., lmo0801 and LMOh7858_0295 [Table S1A]). This reflects the established example of human LRR mimicry by internalin.7,19,20 Predictions #17 and #18 are detected relationships between two human guanine exchange factor (GEF) containing proteins and the pathogen proteins RalF (Legionella pneumophila) and sec7 (Rickettsia prowazekii), and thus reflect GEF mimicry.18 Prediction #31 corresponds to detected similarity between a human tyrosine phosphatase (gi 108802617) and yopH from Yersinia pseudotuberculosis, another known case of pathogen-specific molecular mimicry.7,8
Sixty-one of 95 predicted mimics (non-redundant set) were detected as homologous (BLAST E-value < 1e−06) to known virulence factors from the MvirDB,27 suggesting possible roles in mimicry-mediated virulence. These include the Legionella LepB effector (prediction #7, similar to numerous human coiled coil proteins) indicating possible functions in regulation of secretory traffic, Listeria PlcA (prediction #19, similar to human PI-PLC), and Helicobacter and human fucosyltransferase (prediction #68, FucT) (Table 1; Table S1). Helicobacter FucT is another known case of mimicry as it produces the lipopolysaccharide component Lewis X trisaccharide, which is thought to mimic host sugars to escape immune detection.28
Gene enrichment analysis was then performed using DAVID29 to identify statistically enriched functions and protein families among the full set of predicted human mimicry targets (Table 2; Table S2A). The top four enriched function terms were: “Extracellular matrix” (Benjamini P = 3.39E−42), “collagen” (P = 3.11E−29), “Extracellular matrix structural protein” (P = 2.56E−24), and “ARF guanyl-nucleotide exchange factor activity” (P = 2.11E−19) (Table S2A). Other intriguing top enriched terms included “O-acyltransferase activity” (P = 2.08E−06), “cell adhesion” (P = 2.95E−06), and “inflammation mediated by chemokine and cytokine signaling pathway” (P = 6.84E−03). Terms that were highly ranked and functionally relevant but of weaker statistical significance included “Iκ kinase/NFκB cascade” (P = 1.18E−02), “lysosome” (P = 1.38E−02), “Toll-like receptor pathway” (P = 5.15E−02), and “interaction with host” (P = 9.30E−02). Thus, the top enriched terms appear to be consistent with virulence related mechanisms of pathogenic bacteria. As a comparison, we also applied the same analysis to phytopathogens (578 mimics detected, data not shown), which identified different enrichment categories. The top two functional enrichments for predicted mimics from phytopathogens were “apoptosis” (P = 5.54E−27) and “programmed cell death” (P = 1.42E−25). Indeed, induction of apoptosis is a known virulence mechanism of plant pathogens,30 suggesting that the approach may be applicable to different host-pathogen relationships.
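The “Benjamini” P-values quoted above are corrected for multiple testing. As a rough illustration of what such a correction does, the sketch below implements the Benjamini-Hochberg step-up adjustment on a list of raw P-values. It is a generic textbook version written for this example, not DAVID's own code, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw P-value is scaled by the number of tests divided by its rank, so terms tested alongside many others need a correspondingly smaller raw P-value to remain significant after adjustment.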
We then analyzed the top 95 unique detected mimicry relationships (non-redundant set) for additional literature supporting potential roles in virulence. Thirty-three of these relationships (~35%) were found to be highly suggestive of mimicry-mediated virulence mechanisms with support from existing literature (Table 3). Known mechanisms of pathogen exploitation of host cells that are also reflected throughout the detected relationships include pathogen-specific modulation of host lipid metabolism, modulation of the host nucleotide pool and induction of apoptosis, and cell adhesion. The remaining 65 cases may represent previously uncharacterized mimicry relationships, and provide numerous targets for investigation of pathogenicity mechanisms.
Numerous detected mimicry relationships and function enrichments (Table S2) are linked to lipid metabolism. Putative mimics affecting host lipid metabolism include detected pathogen counterparts of human cholinephosphotransferase, PCTP like protein, fukutin, acyltransferases, phospholipase A2, carnitine O-palmitoyltransferase, and sterol reductase (Table 3). For example, Mycobacterium pneumonia is known to incorporate host lipids (e.g., phosphatidylcholine),68 and has been shown to modulate host sphingolipid metabolism.45 A human-like carnitine O-palmitoyltransferase (CPT) was detected in Mycobacterium pneumonia, which may play a role in these functions.44 As another example, part of the virulence mechanism of Bacillus anthracis involves escape from host macrophages, but the mechanism is largely unknown.69 A human-like phospholipase A2 was detected in Bacillus anthracis (BA_3805), which, in other pathogens, has been shown to play a role in phagosome escape as well as entry and lysis of host cells.64,65
Cells damaged by invading pathogens release nucleotides into the external environment, which act as “danger signals” and stimulate pathogen-killing immune responses,37,70 or induce apoptosis.71 Detected mimicry relationships indicative of host nucleotide pool modulation include the previously studied Legionella lpl1869 with detected similarities to the human NTPDase CD39,60 and proteins similar to human P-loop NTPases and ATPases from Vibrio and other species (Table 3). Some of these putative mimics may play roles in inhibition of apoptotic signaling. Rickettsia RC0370 appears to mimic the P-loop NTPase domain of human NACHT proteins involved in immune signaling and apoptosis. Furthermore, adenylate kinase (AK) in Pseudomonas aeruginosa has been shown to act as a virulence factor regulating external ATP-dependent macrophage cell death,67 and the analysis identified several human-like AKs in a range of pathogenic species that represent novel candidate virulence factors (Table 3; Table S1A). Another example of potential disruption of apoptotic signaling involves detected similarity between the Anaplasma protein, APH_0455, and the human protein NFBD1/MDC1, with the two proteins sharing a repetitive motif (QPSTSXDQPXT, see Data File S1 for BLAST alignments). Interestingly, Anaplasma phagocytophilum is known to prevent apoptosis in neutrophils,72 and MDC1 is known to have anti-apoptotic activity through inhibition of p53 phosphorylation.47
Before pathogens can appropriate host pathways, they must adhere to and invade host cells. This was the strongest detected mimicry function in our data set, as both the top-scoring putative mimics (Table 1) and enriched functions (Table 2) relate to the extracellular matrix and its components (specifically, collagens and leucine-rich repeat proteins). Both collagens and LRRs have been implicated in virulence and play a role in host cell adherence and invasion,20,73,74 and are the focus of the following section. Other detected pathogen proteins with possible roles in host cell adhesion include a Bacillus adhesin (BA_0871), which shares a unique repetitive TEKP motif (Data File S1) with human zonadhesin, and Treponema proteins similar to human ankyrins that may interact with the host cytoskeleton52 (Table 3).
Prediction #1 (Table 1) and function enrichment #1 and #2 (Table 2) relate to detected similarities between human collagens and collagen-like proteins in pathogenic bacteria. Figures 2 and 3 highlight collagens as clear outliers in the E-value and bitscore distributions, similar to that of the known mimicry relationship, RalF-GEF.18 Collagen mimicry was also the most abundant pathogen-specific pattern detected, occurring in 7 pathogens and 0 non-pathogens (Table 1).
The top scoring predicted collagen mimic was spr1403 (PclA) from Streptococcus pneumoniae, a protein that has previously been shown to contribute to host-cell adherence and invasion.73
Numerous detected proteins may play similar roles in pathogenesis, including: BA_3841, BCE_1581, BC_3345, CPE0955, CPR_1027, ECs1228, EF_2090, SPy1983, SpyM3_0738, Z1483, lpl2569, BC_2381, CPF_1202, BCE_3739, ECs2941, and Z2340. Another putative collagen mimic detected by the analysis (Spy1983 [SclA] from Streptococcus pyogenes) has recently been demonstrated to act as an adhesin during pathogenesis of streptococcal infection.74 These pathogen-specific collagen-like proteins (CLPs) are significantly more human-like than CLPs found in other bacteria, as shown by the bitscore distributions (Fig. 3). To investigate this further, we analyzed motif content for the predicted subset of collagen mimics found in pathogens, vs. all other CLPs including those found in non-pathogens. The human-like, pathogen-specific subset of CLPs (putative collagen mimics) were found to have significantly more tetrapeptides in common with human collagens than non-pathogen CLPs (Fig. S2). Many of these tetrapeptides contain “GP” motifs characteristic of human collagen sequences. For example, the average number of GP motifs per sequence length is 4.95% for the putative collagen mimics, 4.86% for human collagens, and only 0.26% for predicted non-mimics. Thus, the detected sequence similarities are due to similarities in peptide composition rather than sequence or statistical artifacts.
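The two composition measures used in this comparison can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the GP content is taken as the number of “GP” dipeptides divided by sequence length, and tetrapeptides are counted as overlapping 4-residue windows. The function names are hypothetical, and any sequences passed in would be supplied by the reader.

```python
def gp_fraction(seq):
    """Number of 'GP' dipeptides divided by the sequence length."""
    seq = seq.upper()
    if not seq:
        return 0.0
    count = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "GP")
    return count / len(seq)


def tetrapeptides(seq):
    """Set of all overlapping 4-residue windows in a sequence."""
    seq = seq.upper()
    return {seq[i:i + 4] for i in range(len(seq) - 3)}


def shared_tetrapeptides(seq, reference_seqs):
    """Tetrapeptides of `seq` that also occur in any reference sequence."""
    reference = set()
    for ref in reference_seqs:
        reference |= tetrapeptides(ref)
    return tetrapeptides(seq) & reference
```

Under this reading, comparing `len(shared_tetrapeptides(clp, human_collagens))` between the putative mimics and the remaining CLPs gives the kind of contrast summarized in Fig. S2, and `gp_fraction` gives the per-length GP content quoted above.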
In mammals, leucine-rich repeat proteins are a second class of proteins abundant in the extracellular matrix, where they function in cell growth, adhesion, migration, and bind with other ECM components including collagens.75 As mentioned above, prediction #3 involves detected similarity between human LRRs and internalin-related factors (e.g., lmo0801, LMOh7858_0295), which are known to function in adherence and invasion of host cells (Table 1; Table S1A). LRR-containing proteins such as NOD-like receptors and Toll-like receptors also serve as important pathogen-detection molecules, recognizing key pathogen-associated molecular patterns (PAMPs).20,76 Both Toll-like receptors (prediction #27) and NOD-like receptors (prediction #50 and #60) were detected as potential targets of LRR mimicry (Table S1C).
As the top scoring pattern overall, we analyzed the detected similarities between human extracellular matrix proteins and putative bacterial pathogen mimics. The two detected relationships are largely distinct according to their taxonomic distribution. Detected collagen mimics were associated predominantly with Firmicutes pathogens, and LRR mimics were identified in pathogens from a range of phyla also including Spirochaetes and Bacteroidetes. One species (Streptococcus pyogenes) appears to encode both types of mimicry proteins.
To investigate how the set of putative collagen mimics have evolved in pathogenic bacteria, we analyzed and compared the sequence architecture of predicted ECM mimics from different species. Despite possessing a common repetitive collagen architecture, we found that collagen-like repeats from different bacterial pathogenic species have distinct repeat architectures, indicative of separate evolutionary origins. To demonstrate this, CLPs from different pathogens were divided into their peptide repeats, which were aligned and used to create sequence logos (Fig. 4A) and phylogenetic trees (Fig. 5, left). While the collagen-like repetitive pattern (GXX)n is common to all detected CLPs, the progenitor repeat sequences are different, and the repeat lengths are also variable (Fig. 4A). Phylogenetic trees were constructed based on an alignment of pathogen collagen-like repeats as well as a top-aligning human repeat to the set of repeats within each pathogen protein (see Methods). As revealed by the sequence logos and the tree, the detected collagen mimics from Streptococcus pneumoniae (spr1403), Streptococcus pyogenes (SpyM3_0738), Clostridium perfringens (CPR_1027), Bacillus anthracis (BA_3841), and Legionella pneumophila (lpl2569) appear to have independently evolved their similarity to human collagen via separate repeat amplifications (Fig. 4A and Fig. 5, left). Moreover, different human repeats can be found that cluster specifically with each pathogen repeat class (Fig. 5, left), indicating that different pathogen repeats may not only be mimicking different human proteins, but may be derived from different host peptides.
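A sequence logo is drawn from per-position residue frequencies across a set of aligned repeats. The sketch below shows that underlying calculation for repeats that are already aligned and of equal length. It is a simplified illustration (no gap handling and no information-content scaling), and the function name is hypothetical.

```python
from collections import Counter


def position_frequencies(aligned_repeats):
    """Per-column residue frequencies for equal-length aligned repeats.

    Returns a list with one dict per alignment column: {residue: frequency}.
    """
    if not aligned_repeats:
        return []
    lengths = {len(r) for r in aligned_repeats}
    if len(lengths) != 1:
        raise ValueError("repeats must be aligned to the same length")
    n = len(aligned_repeats)
    # zip(*...) transposes the repeats into alignment columns.
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*aligned_repeats)
    ]
```

Comparing these frequency profiles between species is one simple way to see that repeats can share the glycine periodicity while differing in the residues that fill the remaining positions.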
Consistent with the idea that different pathogen CLPs have evolved independently from each other, CLPs in bacteria appear to exhibit a scattered phylogenetic distribution. For example, while CLPs were predominantly detected in pathogens from the Firmicutes phylum, they also exist in the highly pathogenic O157:H7 strain of Escherichia coli (Table S1A) as well as Legionella pneumophila but not other Gammaproteobacteria. Interestingly, while present in E. coli O157:H7, the collagen-like proteins are absent in the uropathogenic E. coli CFT073 strain, and the non-pathogenic K1 strain.
As with the putative collagen mimics, we then analyzed the repeat architecture of the detected LRR mimics, and aligned the sequence logos in the form of the general LRR sequence pattern (based on Ward et al.77) (Fig. 4B), and generated phylogenetic trees as described above (Fig. 5, right). Like the detected collagen mimics, LRR mimics from different pathogens exhibit different repeat architectures (Fig. 4B) and form unique clusters in the phylogenetic tree (Fig. 5, right), implying separate origins via independent repeat amplifications. As described in Ward et al.,77 the first residues LxxLxLxxNx of leucine-rich repeats correspond to a conserved, interior portion of the LRR structure, while the remaining sequence encodes variable residues on the other face of the LRR structure. Interestingly, variable segments from pathogenic LRRs aligned strongly with variable segments from human LRRs, which suggests this may be a variable interaction surface exploited by pathogens. For example, the Legionella protein lpl1579 possesses the motif GAKALA in this variable region, which is similar to the sequence conservation pattern of leucine-rich repeats in human NOD-like receptors (e.g., NLRC3) (Fig. 4C). Interestingly, NLRC3 was the top BLAST match to lpl1579 in the human proteome, and it has been demonstrated to have inhibitory effects on T-cell function.78 It is important to note that a BLAST search of lpl1579 across all eukaryotes identified predicted proteins from Naegleria, followed by other NLRC3 proteins from a range of mammals, and so the human NLRC3 is not necessarily the highest scoring hit, which is a general statement that also applies to other detected mimicry relationships. Further phylogenetic analysis would be needed to characterize the evolutionary origins of each detected mimic to verify possible eukaryotic host-bacteria horizontal transfer79 from human or non-human host species.
The cases described above (putative collagen and LRR mimics) appear to have evolved independently in pathogens to mimic the repetitive architecture of host proteins.
To investigate the sequence repetition of predicted mimics quantitatively, each region of the putative mimics with detected similarity to human proteins was analyzed and divided into putative sequence repeats using the RADAR repeat prediction algorithm.80 Repeat proteins were found to make up a considerable number of detected mimicry relationships, as 34/95 and 207/306 detected mimics were identified as repetitive (containing three or more predicted repeats) in the non-redundant and full set of predictions, respectively (Table S1).
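The classification rule stated above (three or more predicted repeats) can be sketched as a simple tally. The input here is a hypothetical mapping from protein identifier to the number of repeats reported by a repeat-prediction tool. Running the tool itself and parsing its output are outside the scope of this sketch.

```python
MIN_REPEATS = 3  # threshold stated in the text: three or more predicted repeats


def split_by_repetitiveness(repeat_counts, min_repeats=MIN_REPEATS):
    """Split protein ids into (repetitive, non_repetitive) lists.

    repeat_counts: dict {protein_id: number_of_predicted_repeats} (hypothetical input)
    """
    repetitive = [p for p, n in repeat_counts.items() if n >= min_repeats]
    non_repetitive = [p for p, n in repeat_counts.items() if n < min_repeats]
    return repetitive, non_repetitive


def fraction_repetitive(repeat_counts, min_repeats=MIN_REPEATS):
    """Fraction of proteins classified as repetitive (0.0 for empty input)."""
    if not repeat_counts:
        return 0.0
    repetitive, _ = split_by_repetitiveness(repeat_counts, min_repeats)
    return len(repetitive) / len(repeat_counts)
```

With the counts reported above, 34 of 95 corresponds to roughly 36% of the non-redundant set and 207 of 306 to roughly 68% of the full set.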
As revealed by gene-enrichment analyses performed on both sets separately (Table S2B and C), the repetitive class, consisting of CLPs, LRRs, and other candidates, is responsible for the enrichments related to extracellular matrix mimicry and cell adhesion (Table S2B). These terms were not significantly enriched among the non-repetitive class (Table S2C). Conversely, non-repetitive mimics were significantly enriched in terms related to enzymatic function such as “catalytic activity” and “lipid-metabolism”, but these were not enriched among the repetitive class. These results, combined with sequence (Fig. 4) and phylogenetic analyses (Fig. 5), are consistent with the idea that non-repetitive mimics with complex sequence composition are associated with enzymatic modulation of host functions and have likely been acquired in pathogens by horizontal transfer, while repetitive mimics have evolved independently in pathogens to mimic repetitive host structural proteins involved in adherence and invasion of host cells.
Another interesting example of the latter is a detected potential mimicry relationship between human protein periaxin and mycobacterial PPE family virulence factors (Rv1918c) (Table 3). Mycobacteria such as M. leprae specifically invade human Schwann cells through an interaction with the dystroglycan complex.56,58 Interestingly, human periaxin is a Schwann cell-specific protein that is critical to formation of the dystroglycan-complex. However, as with LRRs and collagens, the detected similarity is not due to homology but rather, a repetitive proline-rich composition that has independently evolved in both proteins (see Data File S1). It is possible that the repetitive proline-rich architecture of PPE proteins may facilitate an interaction with host dystroglycan-complex as is the case for periaxin, or that they may act as membrane-interacting lipoproteins.
The results of this work suggest that comparison of host–bacteria proteome similarities is sufficient to detect a subset of pathogen mimics that function in bacterial virulence. Our approach identified known examples of mimicry, virulence factors, and potential novel candidates with roles in modulation and exploitation of host functions.
According to a recent study,81 all human proteins possess motifs also present in bacterial proteins. The same study also found no observable difference in overall bacteria–human peptide similarities between pathogenic and non-pathogenic species. Our results show that while overall sequence similarity to human proteins is not significantly enriched in pathogenic vs. non-pathogenic bacteria, there are detectable pathogen-specific or pathogen-enriched similarities to host proteins in key functional pathways related to virulence. These identified pathways and components, including the extracellular matrix, lipid metabolism, and immune signaling, are known targets of exploitation by bacterial pathogens.
As discussed in previous literature,7,14 two evolutionary mechanisms are likely responsible for detected sequence mimicry between pathogen and host proteins: direct homology due to lateral transfer of the eukaryotic proteins to one or few bacteria, or similarities due to independent evolutionary processes (convergent or parallel evolution) (Fig. S3). In this work, we identified two sequence classes of detected mimics (non-repetitive and repetitive), which likely fall into these two evolutionary categories.
A new insight revealed by this work relates to the independent evolutionary processes by which pathogen mimics can originate. One documented mechanism underlying convergent evolution of host mimicry is independent origin of a binding surface or motif in a pathogen protein that displays no detectable homology with its host counterpart.7,14 Our work provides strong support for an additional mechanism, mimicry of host repetitive proteins via independently evolved peptide repeats. In this scenario, separate progenitor repeats in the pathogen genome are amplified to result in repeat proteins that share the same repetitive architecture but with different sequences for each repeat unit (Fig. 4; Fig. S3). This is similar to what has been observed for β-trefoil proteins, which also include virulence-associated subfamilies (i.e., ricin toxins) that have undergone separate repeat amplifications while maintaining the same overall structure.82
These detected similarities do not imply overall homology between the full proteins but rather are due to similarity of repetitive architecture. Repetitive host proteins such as collagens, leucine-rich repeat proteins, and adhesins represent ideal targets for this evolutionary mechanism of pathogen mimicry, while complex proteins such as enzymes are not.
Interestingly, not only do the human and pathogen counterparts of these proteins appear to have evolved independently, but repeat amplifications appear to have occurred independently in different pathogenic species. While this may be indicative of convergent evolution, it is also possible that the pathogen proteins evolved by tandem duplications of an original peptide fragment that itself was acquired from (and thus related to) a host fragment, but then evolved a novel composition through independent evolutionary processes. This is similar in some respects to the results of a recent study that identified a recurring phenomenon whereby host-derived proteins in viruses had subsequently converged toward simpler domain architectures.83
In either case, the enrichment of repetitive ECM mimics in pathogens is likely due to convergent or parallel evolutionary processes that are driven by pathogen-specific selective pressures.
These predictions provide starting points for future experimental work characterizing the biological role of predicted pathogen mimics. We have analyzed only a subset of this spectrum; future work that expands this analysis and evaluates the host-species specificity of the approach will be useful. In addition, future use of alternative classification schemes, improved motif detection techniques, and structural bioinformatics may provide added sensitivity.
For instance, the classification scheme that is the basis of our comparative approach separates human pathogenic vs. non-pathogenic bacteria. Although this classification is somewhat arbitrary, the putative mimics detected using this scheme likely play virulence-associated roles, and it was an objective of this analysis to find such pathogen-associated mimics. However, it is also important to note that other classes of mimics exist that may not be detected by our pathogen/non-pathogen comparison. These include mimics that are not directly involved in virulence but might still play a role in persistence of commensal bacteria inside the host, or have other effects such as the extensively studied role of peptide mimicry in the generation of autoimmune disease.1,2 In the case of immune epitope mimicry, small regions of sequence similarity rather than overall homology may be sufficient to elicit molecular mimicry,3,5 and such regions would not be identified using a standard homology detection approach. Our work thus complements previous work,5 which has focused on such cases of immune epitope mimicry. Finally, pathogen mimics of host proteins may have diverged beyond the point of recognizable sequence homology, but be detectable at the level of overall structural similarity.
Ultimately, to extend computational analysis of host–pathogen molecular mimicry, it may be useful to analyze mimicry with respect to the specific pathology of each bacterial species and the biological consequences it has on its host (e.g., autoimmunity, direct damage, interference with metabolism, persistence in host), as well as to conduct sequence comparisons at the level of motif fragments, perhaps taking into account protein structural information. With such improvements, it will be increasingly possible to predict novel virulence mechanisms and host-pathogen relationships from genomic data. A resource containing predicted mimicry candidates discussed in this paper is available at http://doxey.uwaterloo.ca/mimicry/.
Protein sequence data sets for human as well as 163 bacterial genomes (see Table S1 for a complete list) were retrieved from the NCBI (RefSeq human protein database build 36 [37742 proteins]) and the Comprehensive Microbial Resource23 at TIGR/JCVI (http://cmr.jcvi.org). To reduce species redundancy, only one proteome per species was kept and the rest were removed. This step removed 35 redundant proteomes, leaving 62 pathogens and 66 non-pathogens.
An all-by-all BLAST analysis was conducted using BLAST v. 2.2.16, in which each human protein was used as an individual query in a separate BLAST search of each individual organism’s protein database. Default BLAST parameters were used with “composition-based statistics” to correct for potential compositional bias. A BLAST E-value cutoff of 1e−06 was used to identify putative matches, from which a presence/absence matrix was constructed. BLAST E-values, bitscores, and top pathogen protein matches were recorded for each cell of the matrix. To remove genome/species redundancy, only one genome per species was kept (randomly assigned) and the remaining genomes were removed.
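The construction of the presence/absence matrix described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it assumes the BLAST output has already been parsed into (human protein, species, E-value, bitscore) tuples, and the function name and data layout are hypothetical. Only the E-value cutoff of 1e−06 is taken from the text.

```python
from collections import defaultdict

E_CUTOFF = 1e-6  # E-value threshold stated in the text


def build_presence_matrix(hits, e_cutoff=E_CUTOFF):
    """Return {human_protein: {species: best_bitscore}} for hits passing the cutoff.

    `hits` is an iterable of (human_protein, species, evalue, bitscore) tuples,
    a hypothetical pre-parsed form of the BLAST results. A species key being
    present for a protein means "hit detected"; the stored value is the top
    bitscore for that protein/species cell.
    """
    matrix = defaultdict(dict)
    for human_protein, species, evalue, bitscore in hits:
        if evalue >= e_cutoff:
            continue  # not significant enough to count as a putative match
        best = matrix[human_protein].get(species)
        if best is None or bitscore > best:
            matrix[human_protein][species] = bitscore
    return dict(matrix)
```

Storing the best bitscore per cell, rather than a bare 0/1 flag, keeps the information needed for later comparisons of pathogen and non-pathogen hits.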
Each human protein (i) was then scored using the fraction of pathogen species with a hit detected by BLAST (Pi) divided by the fraction of non-pathogen species containing a hit (NPi). Potential protein mimics were selected based on rarity in non-pathogens, enrichment in pathogens, and greater similarity between pathogen and human proteins. The specific criteria were: hits found in fewer than five non-pathogens, a Pi/NPi ratio greater than 2, and a top pathogen BLAST hit with a bitscore more than 10 above that of the top non-pathogen hit. A bitscore difference of 10 was chosen to result in a final list equivalent to roughly 1% of the human proteome (99th percentile).
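The three selection criteria can be expressed compactly in code. The sketch below is illustrative only: the input format (a dictionary mapping each human protein to the best bitscore per species with a hit) and the function name are hypothetical. Two details are assumptions not stated in the text: when no non-pathogen has a hit, the Pi/NPi ratio is treated as infinite and the top non-pathogen bitscore as 0.

```python
def select_mimic_candidates(matrix, pathogens, non_pathogens,
                            max_np_hits=5, min_ratio=2.0, min_bitscore_gap=10.0):
    """Return {human_protein: (ratio, bitscore_gap)} for proteins passing all criteria.

    `matrix` maps human_protein -> {species: best_bitscore} for species with a hit
    (hypothetical layout). Thresholds follow the text: fewer than five
    non-pathogen hits, Pi/NPi > 2, bitscore gap > 10.
    """
    selected = {}
    for protein, scores in matrix.items():
        p_hits = [s for s in scores if s in pathogens]
        np_hits = [s for s in scores if s in non_pathogens]
        if not p_hits or len(np_hits) >= max_np_hits:
            continue
        p_frac = len(p_hits) / len(pathogens)
        np_frac = len(np_hits) / len(non_pathogens)
        # Assumption: no non-pathogen hit gives an infinite ratio.
        ratio = float("inf") if np_frac == 0 else p_frac / np_frac
        top_p = max(scores[s] for s in p_hits)
        # Assumption: with no non-pathogen hit, the baseline bitscore is 0.
        top_np = max((scores[s] for s in np_hits), default=0.0)
        gap = top_p - top_np
        if ratio > min_ratio and gap > min_bitscore_gap:
            selected[protein] = (ratio, gap)
    return selected
```

For example, a human protein with hits in 3 of 4 pathogens and 1 of 4 non-pathogens has a ratio of 3; it is retained only if its best pathogen bitscore also exceeds the best non-pathogen bitscore by more than 10.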
The full list of detected mimicry candidates is shown in Table S1. A smaller list of the most unique relationships was also generated by including only the top human matches to each unique pathogen protein. This generated a smaller set of 95 unique mimicry relationships (Table S1C), the top 25 of which are listed in Table 1.
As a control, and test for generality, we applied the approach to a different host-pathogen relationship (plant/phytopathogen). The same approach was used as described above with Arabidopsis thaliana used as the host proteome, and the following species were defined as phytopathogens: Phytoplasma asteris, Agrobacterium tumefaciens, Ralstonia solanacearum, Pseudomonas syringae, Xanthomonas axonopodis, and Xylella fastidiosa. Xanthomonas campestris was removed due to redundancy with X. axonopodis. In total, this data set contained six phytopathogens and 16 non-pathogens.
All candidate pathogen mimics were searched against the MvirDB27 database of known virulence factors. Hits were again defined as BLAST matches with E < 1e−06.
Gene enrichment analysis for overrepresented functions was performed using DAVID.29 The following eight ontologies were used: GOTERM_BP_ALL, GOTERM_CC_ALL, GOTERM_MF_ALL, PANTHER_BP_ALL, PANTHER_MF_ALL, BIOCARTA, KEGG_PATHWAY, and PANTHER_PATHWAY. The default parameter value of EASE = 0.1 was used.
For each putative mimic, the BLAST result was parsed to extract the sequence region detected as similar to the host protein. Repeats were predicted for these regions using RADAR80 run with default parameters. Repetitive mimics were defined as those containing more than three predicted repeats, all of which had total scores > 90.
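Read literally, the repetitive-mimic definition amounts to a simple filter. The sketch below assumes that each predicted repeat in a region has already been associated with a total score parsed from the RADAR output; the parsing step is not shown, and the function name and input layout are hypothetical.

```python
def is_repetitive_mimic(repeat_scores, min_repeats=3, min_score=90.0):
    """Apply the stated criterion: more than three repeats, all with scores > 90.

    `repeat_scores` is a list with one total score per predicted repeat in the
    host-similar region (hypothetical, pre-parsed from RADAR output).
    """
    return (len(repeat_scores) > min_repeats
            and all(score > min_score for score in repeat_scores))
```

Under this reading, a region with four repeats scoring 120, 95.5, 101.2, and 99 would be classed as repetitive, whereas a region with only three repeats, or with any repeat scoring 90 or below, would not.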
For the detected collagen-like and LRR mimics, the detected repeats were aligned, adjusted where necessary to a common length, and sequence logos were generated using seqlogo with default parameters (http://weblogo.berkeley.edu).
For each set of repeats from a pathogen protein, the consensus sequence was used as a query to identify the best-aligning repeat from a corresponding human protein using SSEARCH.84 Collagen-like and leucine-rich repeats were then aligned separately along with their respective human sequences, and a phylogenetic tree was generated by parsimony using PHYLIP.85 Branch lengths were estimated by maximum likelihood using FastTree86 with the JTT model and CAT approximation with 20 rate categories.
The compseq program within the EMBOSS suite87 (version 6.3.1) was used to compute motif frequencies for human collagens, the putative collagen mimics and non-mimics.
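For readers unfamiliar with this kind of analysis, motif frequencies of the sort reported by compseq can be approximated by counting overlapping words of a fixed length. The sketch below is a stand-in and not the EMBOSS implementation; the word size of 3 is an assumption chosen for illustration (it matches the length of a collagen Gly-X-Y triplet) and is not stated in the text.

```python
from collections import Counter


def word_frequencies(sequences, word_size=3):
    """Return {word: relative frequency} over all overlapping words in `sequences`.

    `word_size` is an illustrative assumption, not a value taken from the paper.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - word_size + 1):
            counts[seq[start:start + word_size]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

Comparing the resulting frequency tables between human collagens, putative collagen mimics, and non-mimics then indicates which motifs are shared or enriched.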
No potential conflicts of interest were disclosed.
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) through grants to BJM (NSERC Discovery Grant) and ACD (NSERC PDF). We thank Trevor Charles for insightful discussions.