The bacterium Helicobacter pylori is remarkable for its ability to persist in the human stomach for decades without provoking sterilizing immunity. Since repetitive DNA can facilitate adaptive genomic flexibility via increased recombination, insertion, and deletion, we searched the genomes of two H. pylori strains for nucleotide repeats. We discovered a family of genes with extensive repetitive DNA that we have termed the H. pylori RD gene family. Each gene of this family is composed of a conserved 3′ region, a variable mid-region encoding 7 and 11 amino acid repeats, and a 5′ region containing one of two possible alleles. Analysis of five complete genome sequences and PCR genotyping of 42 H. pylori strains revealed extensive variation between strains in the number, location, and arrangement of RD genes. Furthermore, examination of multiple strains isolated from a single subject's stomach revealed intrahost variation in repeat number and composition. Despite prior evidence that the protein products of this gene family are expressed at the bacterial cell surface, enzyme-linked immunosorbent assay and immunoblot studies revealed no consistent seroreactivity to a recombinant RD protein by H. pylori-positive hosts. The pattern of repeats uncovered in the RD gene family appears to reflect slipped-strand mispairing or domain duplication, allowing for redundancy and subsequent diversity in genotype and phenotype. This novel family of hypervariable genes with conserved, repetitive, and allelic domains may represent an important locus for understanding H. pylori persistence in its natural host.
Helicobacter pylori, a gram-negative bacterium, is remarkable for its ability to persist in the human stomach for decades. Colonization with H. pylori increases risk for peptic ulcer disease and gastric adenocarcinoma (53, 70) and elicits a vigorous immune response (15). The persistence of H. pylori occurs in a niche in the human body previously considered inhospitable to microbial colonization: the acidic stomach replete with proteolytic enzymes.
H. pylori strains exhibit substantial genetic diversity, including extensive variation in the presence, arrangement, order, and identity of genes (2, 4-7, 25, 51, 74). Furthermore, analyses of multiple single-colony H. pylori isolates from separate stomach biopsy specimens of individual patients have demonstrated diversity, both within hosts (27, 65), and over time (36). The mechanisms that generate H. pylori genetic diversity may be among the factors that enable persistence in this environment (3, 28).
While the natural ability of H. pylori for transformation and recombination may explain some of the intra- and interhost genetic variation observed in this bacterium (43), point mutations and interspecies recombination alone are not sufficient for explaining the extent of the variation in H. pylori (14, 32). The initial genomic sequencing of H. pylori strains 26695 and J99 (6, 72) revealed large amounts of repetitive DNA (1, 59). DNA repeats in bacteria are associated with mechanisms of plasticity, such as phase variation (49, 67); slipped-strand mispairing (41, 46); and increased rates of recombination, deletion, and insertion (17, 60, 62). Because many of the recombination repair and mismatch repair mechanisms common in bacteria are absent or modified in H. pylori (28-30, 56, 76), this organism may be particularly susceptible to the diversifying effects of repetitive DNA. In fact, loci in the H. pylori genome containing repetitive DNA have been shown to exhibit extensive inter- and intrahost variation (9, 10, 28, 37).
We hypothesized that identification of repetitive DNA hotspots in H. pylori would allow the recognition of genes whose variation could aid in persistence. To examine this hypothesis, we conducted in silico analyses to identify open reading frames (ORFs) enriched for DNA repeats and then used a combination of sequence analyses and immunoassays to examine the patterns associated with the specific repetitive DNA observed. Our approach led to the realization that a previously identified H. pylori-specific gene family (19, 52) exhibits extensive genetic variation at multiple levels.
Genomic sequences of Helicobacter pylori strains 26695 (72), J99 (6), and HPAG1 (51) were retrieved from GenBank (12), and specific sequences from the H. pylori G27 genome were provided ahead of publication (11). In addition, 43 H. pylori strains from Asia, Europe, North America, and South America from patients with differing clinical diagnoses were studied (see Table S1 in the supplemental material). These strains, stored in the New York University Helicobacter/Campylobacter strain reference collection at −70°C, were cultured at 37°C in a 5% CO2 atmosphere on Trypticase soy agar plates with 5% sheep blood. For one of these strains, HP1 (26), nucleotide sequencing of specific genes of interest was conducted. In addition, multiple H. pylori isolates were obtained from patients in Ladakh, India, during endoscopy (63).
Identification of forward, perfect nucleotide repeats of >24 base pairs in strains 26695 and J99 was done using the computer program REPuter (39). We chose a minimum length of 25 because the probability of finding repeats of this length by chance alone in the H. pylori genome is <0.001 (59). To assess homology, Clustal X version 2.0 (40) was used to align DNA sequences and Swaap 1.0.3 (57) was used to calculate pairwise nucleotide and amino acid identities as well as synonymous- and nonsynonymous-substitution rates (Ks and Ka, respectively). Kyte-Doolittle hydropathy plots were created with FASTA at the University of Virginia (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=misc1). Amino acid sequences were examined for protein sequence motifs and predicted coiled-coil regions by using the Motifs and Coilscan programs, respectively, in SeqWeb version 3.1. For the Coilscan analysis, a 28-amino-acid sliding window was used to determine the coiled-coil probability for each residue. Phylogenetic analyses were conducted with PAUP* version 4.0b10 (71), using the distance criterion. We employed a neighbor-joining technique and estimated support for individual branches by conducting 1,000 bootstrap replicates (21).
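The repeat criterion described above (forward, perfect repeats of at least 25 bp) can be illustrated with a minimal Python sketch. This is not REPuter and does not reproduce its algorithm or its counts; it is a simple k-mer seed-and-extend substitute, and the function name `forward_repeats` and the toy sequence are hypothetical.

```python
from collections import defaultdict


def forward_repeats(seq, min_len=25):
    """Return maximal exact forward (direct) repeats of at least min_len bases.

    Each hit is (start_a, start_b, length) with start_a < start_b (0-based).
    Seeds are exact k-mers of length min_len; a seed pair is kept only if it
    cannot be extended to the left, and is then extended to the right, so
    each maximal repeat pair is reported once.
    """
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        positions[seq[i:i + min_len]].append(i)

    repeats = []
    for starts in positions.values():
        for x in range(len(starts)):
            for y in range(x + 1, len(starts)):
                a, b = starts[x], starts[y]
                # A match one base to the left means a longer repeat
                # starting earlier already covers this seed pair.
                if a > 0 and seq[a - 1] == seq[b - 1]:
                    continue
                length = min_len
                while b + length < len(seq) and seq[a + length] == seq[b + length]:
                    length += 1
                repeats.append((a, b, length))
    return sorted(repeats)


# Toy input (hypothetical): one 25-bp unit repeated after a 4-bp spacer.
unit = "ACGTTGCAAGGCTTAACCGGATCGA"
toy = unit + "TTTT" + unit + "CC"
hits = forward_repeats(toy)
```

On a real genome the quadratic loop over seed positions can become slow for highly repeated k-mers; suffix-based tools avoid that cost, which is one reason a dedicated program is preferable for whole-genome scans.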
To achieve a working classification, repeat units (all 21 or 33 bp) were identified and aligned using the Clustal algorithm within MEGA version 3.1 (38). Since the alignments were very short, distances between fragments were determined by absolute number of differences, and a tree based on the unweighted-pair group method using average linkages was constructed using MEGA version 3.1 (38) with 1,000 repetitions to determine bootstrap values.
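The distance measure named above, the absolute number of differences between aligned repeat units, can be sketched in a few lines of Python. The unit names and the 21-bp sequences below are invented for illustration and are not taken from the RD genes; the function names are likewise hypothetical.

```python
from itertools import combinations


def n_differences(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_table(units):
    """Return {(name_1, name_2): differences} for every unordered pair."""
    return {
        (m, n): n_differences(units[m], units[n])
        for m, n in combinations(sorted(units), 2)
    }


# Hypothetical 21-bp repeat units (seven codons each).
units = {
    "u1": "GAAGAAAAACAAGAATTGAAA",
    "u2": "GAAGAAAAACAAGAGTTGAAA",
    "u3": "GAAGAGAAACAAGAGTTGAAG",
}
table = distance_table(units)
```

A table like this could be passed to an average-linkage clustering routine (for example, `scipy.cluster.hierarchy.linkage` with `method="average"`, which SciPy documents as UPGMA) to obtain a tree comparable in kind, though not necessarily in detail, to the one built in MEGA.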
After 72 h of bacterial growth on Trypticase soy agar plates with 5% sheep blood, genomic DNA was prepared from each studied strain. PCRs for determining RD genotypes were conducted with 1× buffer, 0.20 mM deoxynucleoside triphosphates, 0.40 μM of each primer, 0.5 units of Taq polymerase, and 100 ng DNA in a 50-μl total volume. The thermal cycler program used an initial denaturation step at 94°C for 3 min; 30 cycles at 94°C for 30 s, 60°C for 30 s, and 72°C for 4 min; and a final extension at 72°C for 10 min. Product length was determined by agarose gel electrophoresis. The number of RD genes at each locus was determined with amplifications using primers RDL1-5 and RDL1-3, located in the conserved genes flanking RD locus 1, and primers RDL2-5 and RDL2-3, located in the conserved genes flanking RD locus 2 (see Table S2 in the supplemental material). The allelic identities of RD genes of the 5′ allelic region (FAR) were determined by sequencing or by amplification with primers placed in FAR1 (FAR1f and FAR1r) or FAR2 (FAR2f and FAR2r) (see Fig. S1 in the supplemental material).
For the strains isolated from patients in Ladakh, India, DNA fingerprinting was conducted by random amplified polymorphic DNA-PCR with 10-nucleotide primers 1254, 1281, and 1290, as described previously (5, 78).
Primers were designed to amplify the full-length JHP_0110 gene from H. pylori strain J99, with addition of a 5′ XhoI restriction site and a 3′ BamHI restriction site (see Table S2 in the supplemental material). PCRs were prepared as described above, but the thermal cycler program had an initial denaturation step at 94°C for 3 min; 30 cycles at 94°C for 1 min, 58°C for 1 min, and 72°C for 1 min; and a final extension at 72°C for 10 min.
PCR products purified with a QIAquick PCR purification kit (Qiagen, Valencia, CA) and the pET-15b vector were each digested with both XhoI and BamHI, and molecular cloning was done according to the manufacturer's guidelines (EMD Biosciences, Darmstadt, Germany). QIAprep spin miniprep kits (Qiagen, Inc., Valencia, CA) were used to isolate plasmid from XL1-Blue Escherichia coli cells. The pET-15b-JHP_0110 construct was transformed into BL21(DE3) E. coli cells for expression, and the recombinant protein was isolated from inclusion bodies in accordance with the Novagen Bugbuster protocol. The Novagen His bind kit (EMD Biosciences, Darmstadt, Germany) was used to purify the recombinant protein under denaturing conditions (6 M urea).
Sera from a previously described cohort of H. pylori-negative children (54) were used as negative controls. We studied the sera from 247 adults undergoing endoscopy at the VA New York Harbor Healthcare System. Patients were considered H. pylori positive if they showed positive results for at least two of the following assays: histology, culture, serology, and rapid urease testing (24). In brief, enzyme-linked immunosorbent assays (ELISAs) were conducted by fixing 10 ng of recombinant full-length JHP_0110 protein to each well, using carbonate buffer (pH 9.6). After being blocked, samples were added to wells in duplicate. The secondary antibody, goat anti-human immunoglobulin G (IgG)-horseradish peroxidase conjugate, was diluted 1:4,000 in phosphate-buffered saline containing 0.05% Tween 20, 0.02% thimerosal, 0.1% gamma globulin, and 1.0% albumin and incubated in wells for 1 h at 37°C. Developer containing 45% Na2HPO4, 55% citric acid, 0.16% H2O2, and 10 mg of ABTS [2,2′-azinobis(3-ethylbenzthiazolinesulfonic acid)] was added and optical density (OD) measured at 405 nm. For immunoblots, recombinant full-length JHP_0110 protein in 6 M urea was electrophoresed by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and transferred to a nitrocellulose membrane. Human sera were diluted 1:3,000 and mouse anti-His diluted 1:2,000. Goat anti-human IgG-alkaline phosphatase conjugate, diluted 1:1,000, and BCIP (5-bromo-4-chloro-3-indolylphosphate)-nitroblue tetrazolium phosphate substrate were used to visualize the conjugated protein.
An analysis of direct identical repeats of >24 bp by use of the REPuter program revealed 554 repeats in the H. pylori 26695 genome and 482 repeats in the H. pylori J99 genome (Fig. 1). As expected (10), cagY was the ORF with the most repeats in both the 26695 and the J99 genomes (HP_0527 and JHP_0476, respectively). Only six other ORFs in strain 26695 contained >20 direct identical repeats. Five of these ORFs in strain 26695 (HP_0118, HP_0119, HP_0120, HP_1187, and HP_1188) and two ORFs in strain J99 (JHP_0110 and JHP_1113) all belonged to a single gene family (DUF874) according to Pfam (22). Because of their repetitive DNA sequences, we propose calling this group of ORFs the H. pylori “RD gene family.”
We termed the location of the three adjacent ORFs in 26695 (HP_0118, HP_0119, and HP_0120) RD locus 1 and the location of the two other adjacent ORFs (HP_1187 and HP_1188) RD locus 2 (Fig. 2). In strain J99, RD locus 1 contains JHP_0110, and RD locus 2 contains JHP_1113. Strain HPAG1 has two genes at RD locus 1 and three at RD locus 2 (five in total), whereas G27 has one gene at RD locus 1 and four genes at RD locus 2 (also five in total) (Fig. 2). In the four studied H. pylori strains with completely sequenced genomes, RD locus 1 was always flanked by a hypothetical protein (HP_0117) and phosphoenolpyruvate synthase (HP_0121), and RD locus 2 was always flanked by carbonic anhydrase (HP_1186) and aspartate-semialdehyde dehydrogenase (HP_1189) (Fig. 2). For another H. pylori strain, HP1, studied using flanking and internal primers, two RD genes (one at each locus, resembling J99) were identified and their sequences analyzed. In total, the 19 RD genes identified in these five strains were the substrate for further studies.
All 19 RD genes examined had a FAR, a conserved 3′ region, and a variable mid-region (Fig. 2). The FAR of each RD gene contained one of two alleles, designated FAR1 and FAR2, which share 32.6% ± 1.1% pairwise nucleotide identity (Table 1). In contrast, the nucleotide sequences of the FARs in HP_0118, HP_0120, HP_1187, HP_1188, JHP_0110, HPAG1_0119, HPAG1_1127, G27_0110, and HP1_L1 were 93.9% ± 1.6% identical (mean ± SD) (Table 1); we termed this allele FAR1 (amino acid positions 1 to 122 in JHP_0110) (Fig. 2). Similarly, the nucleotide sequences of the FARs in HP_0119, JHP_1113, HPAG1_0118, HPAG1_1128, HPAG1_1129, G27_1130, G27_1131, G27_1132, G27_1133, and HP1_L2 were 97.2% ± 1.0% identical (Table 1); we termed this allele FAR2 (amino acid positions 1 to 110 in JHP_1113) (Fig. 2). FAR2 sequences shared significantly more (P < 0.001) identity than FAR1 sequences at both the nucleotide and amino acid levels (Table 1), but FARs from the same strain were no more identical at the nucleotide or amino acid levels than FARs from different strains (Table 1).
There was substantial homology across all strains in the 3′ region (amino acid positions 228 to 412 in JHP_0110 and 199 to 383 in JHP_1113). The 19 RD genes in strains 26695, J99, HPAG1, G27, and HP1 had 3′ regions that were 94.6% ± 1.5% identical at the nucleotide level and 90.8% ± 2.1% identical at the amino acid level (Table 1). At the nucleotide level, the identities between the 3′ regions of the genes with FAR1 (94.7% ± 1.2%) did not differ significantly (P = 0.73) from the identities between the 3′ regions of the genes with FAR2 (94.6% ± 1.9%) (Table 1). Similarly, at the amino acid level, the identities between 3′ regions of the genes with FAR1 (90.9% ± 2.5%) did not differ significantly (P = 0.95) from the identities between the 3′ regions of the genes with FAR2 (90.9% ± 3.0%) (Table 1). The nucleotide and amino acid identities of the 3′ region for intrastrain comparisons were significantly greater than those for interstrain comparisons for genes containing FAR2 and trended in that direction for FAR1.
The 19 3′ region sequences studied showed a low Ks (0.11 ± 0.04), similar to that for H. pylori housekeeping genes (2), regardless of the FAR allele of the gene and regardless of whether the RD genes were in the same strain. Ka was also quite low, regardless of type or origin of the 3′ region. Overall, on the basis of 171 pairwise comparisons, the 3′ regions of the 19 RD genes in the five studied strains (26695, J99, HPAG1, G27, and HP1) had a Ka/Ks ratio of 0.47 ± 0.20 (mean ± SD) (Table 1). The Ka/Ks ratios were found to be 0.51 ± 0.24 when the 3′ regions of RD genes with FAR1 were compared, 0.44 ± 0.18 when the 3′ regions of RD genes with FAR2 were compared, and 0.47 ± 0.20 when the 3′ regions of RD genes with different FARs were compared (Table 1). While Ka was significantly higher (P = 0.001) for the 9 FAR1 sequences (0.04 ± 0.01) than for the 10 FAR2 sequences (0.02 ± 0.01), the Ka/Ks ratios for FAR1 sequences and FAR2 sequences were not significantly different (P = 0.86) (Table 1). That the Ka/Ks ratio for all 3′ region comparisons (0.47 ± 0.20) was significantly higher (P < 0.001) than the Ka/Ks ratios for FAR1 (0.28 ± 0.13) and FAR2 (0.27 ± 0.19) comparisons is consistent with the idea that there is greater selective pressure for variation on the 3′ region than for variation on individual FARs. The intra-allele substitution rates for the FAR sequences and those for the 3′ regions were similarly low, but the 90 FAR1/FAR2 comparisons showed high substitution rates and evidence for diversifying selection (Ka/Ks = 1.30 ± 0.30) (Table 1).
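The comparison counts quoted above follow directly from the numbers of sequences involved, and a short Python sketch makes the bookkeeping explicit. The variable and function names are illustrative, and the Ka/Ks labels reflect the conventional reading of the ratio rather than any formal test used in the study.

```python
from math import comb

n_far1, n_far2 = 9, 10           # RD genes carrying each FAR allele
n_genes = n_far1 + n_far2        # 19 RD genes in the five strains

all_pairs = comb(n_genes, 2)     # every unordered pair of 3' regions
within_far1 = comb(n_far1, 2)    # FAR1 versus FAR1 comparisons
within_far2 = comb(n_far2, 2)    # FAR2 versus FAR2 comparisons
between = n_far1 * n_far2        # FAR1 versus FAR2 comparisons


def selection_label(ka_ks):
    """Conventional reading of a Ka/Ks ratio (an interpretation, not a test)."""
    if ka_ks > 1:
        return "diversifying"
    if ka_ks < 1:
        return "purifying"
    return "neutral"
```

The three subsets partition the full set of pairs, so 36 + 45 + 90 = 171, matching the 171 overall and 90 between-allele comparisons reported in the text.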
Separate phylogenetic analyses were conducted using nucleotide sequence data from the FARs and the 3′ regions of the 19 RD genes in the five studied strains (Fig. 3). The tree constructed using FAR sequences (Fig. 3A) contains two strongly supported branches, with one containing all 9 FAR1 sequences and one containing all 10 FAR2 sequences. Within these two groups, RD genes were not monophyletic on the basis of strain or RD locus. In the tree constructed from 3′ region sequences (Fig. 3B), RD genes were not monophyletic on the basis of FAR allele, RD locus, or strain. All of the strongly supported branches (bootstrap values of >70) were for sequences within the same strain, consistent with concerted evolution facilitated by gene conversion (57). All branches on both the FAR and the 3′ region nucleotide trees with bootstrap support values of >70 also appeared on neighbor-joining trees inferred from amino acid data (not shown). Thus, the two strongly conserved domains of the RD genes, the FAR and the 3′ region, show evidence of substantially different selection.
Using a PCR-based technique to determine the contents of each RD locus (see Fig. S1 in the supplemental material), we found extensive diversity in the number and arrangement of RD genes in 47 H. pylori strains from various parts of the world (see Table S1 in the supplemental material). Nevertheless, three conserved patterns were identified: (i) for each strain studied, each RD locus contained at least one RD gene; (ii) there were no empty sites; and (iii) no strain had more than five RD genes combined between the two loci. The numbers of RD genes per strain in the eight Cag-negative strains (2.38 ± 0.52) and in the 30 Cag-positive strains (2.57 ± 0.97) were nearly identical (P = 0.46). The most common genotype, present in 34.0% of strains, was a FAR1 gene at locus 1 and a FAR2 gene at locus 2 (see Fig. S2 in the supplemental material). Of the strains studied, 68.1% had two RD genes, 23.4% had three RD genes, 2.1% had four RD genes, and 6.4% had five RD genes. In total, of the 116 RD genes present in the 47 strains, 59 (50.9%) were at locus 1 and 57 (49.1%) were at locus 2. Of the genes at locus 1, 67.8% are FAR1, whereas at locus 2, 36.8% are FAR1, an allelic distribution unlikely to be due to chance alone (χ2 = 5.7; P = 0.017). Of the 17 strains with single FAR1 and FAR2 genes, 16 (94%) had FAR1 at locus 1 and FAR2 at locus 2 (χ2 = 13.2; P = 0.0003). Thus, although RD gene distribution varied greatly among strains, several strongly nonrandom organizational principles were observed.
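As a consistency check on the figures above, the reported percentages can be converted back to strain and gene counts. The following Python sketch assumes only the numbers quoted in the paragraph (47 strains, the four percentage classes, and the per-locus totals); the rounding step is an assumption about how the percentages were derived.

```python
n_strains = 47

# Reported share of strains by number of RD genes per strain.
share_by_gene_count = {2: 0.681, 3: 0.234, 4: 0.021, 5: 0.064}

# Convert each share to a whole number of strains (assumes simple rounding).
strains_by_gene_count = {
    genes: round(share * n_strains)
    for genes, share in share_by_gene_count.items()
}

total_strains = sum(strains_by_gene_count.values())
total_genes = sum(genes * n for genes, n in strains_by_gene_count.items())

# Per-locus gene totals and the implied number of FAR1 genes at each locus.
locus_genes = {1: 59, 2: 57}
far1_share = {1: 0.678, 2: 0.368}
far1_genes = {
    locus: round(far1_share[locus] * locus_genes[locus])
    for locus in locus_genes
}
```

Under these assumptions the classes correspond to 32, 11, 1, and 3 strains, which together carry 116 RD genes, in agreement with the stated total and with the 59 + 57 split between the two loci.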
When the Sawyer run algorithms were used to detect putative recombination events in the 19 FAR segments, no statistically significant indications of recombination were found within the FAR1 or the FAR2 or between them. However, the same analysis of the 3′ region revealed 6 putative recombinant fragments with global statistical support for recombination and 43 putative inner fragments with pairwise support (see Table S3 in the supplemental material). Of these 49 possible recombination events, 26 were identified between differing FAR types (FAR1 or FAR2) and 27 were identified between different strains of H. pylori (see Table S4 in the supplemental material). These data support the phylogenetic studies suggesting gene conversion.
The mid-region shows extensive variation. However, several conserved principles still emerge. Among the RD genes in the five studied strains, the mid-regions ranged in size from 28 to 166 encoded amino acids and were composed of imperfect 7- and 11-amino-acid repeats (“words”). For illustration and analysis, each word was assigned a color, with similar words (with three or fewer amino acid mismatches) designated shades of the same color (Fig. 4A), and the mid-regions of the five strains were manually aligned (Fig. 4B).
Of the 229 words identified, there were 62 unique nucleotide sequences, with a nonrandom distribution of their prevalence (Fig. 4A). In contrast to the FAR domain, no strong phylogenetic signal was seen in the mid-region (Fig. 4C). To compensate for poor bootstrap values in the tree clustering, each fragment was assigned to a group and in some cases to a subgroup (Fig. 4C). The subgroups are ≤4 bp different from one another, predominantly ≤3 bp, consistent with clusters with poor bootstrap values. Several relatively well-differentiated clusters (with bootstrap values of ≥50) were identified. Most members of a single group are ≤6 bp different from each other and are variably differentiated from other groups. A pattern of great complexity emerged (see Tables S4 and S5 in the supplemental material); the large number of near-repeats instead of exact repeats was unexpected. Group A sequences were the most prevalent, and FAR2 sequences were more likely to have repeats from group A than were the FAR1 sequences (P = 0.017 by Fisher's exact test). There were no repeats within a single gene for fragments in eight groups (B, C, E, F, H, K, L, and M). No RD genes containing FAR1 had group D repeats except sequences HP_0118 and HP_1187, and the distributions of alleles in group A clearly differed in FAR1 and FAR2 sequences. The mid-regions of HP_0118 and HP_1187 resembled FAR2 mid-regions and may represent a recombinant event. Thus, although the mid-regions have many conserved features in aggregate, the mid-region is a locus of great genetic diversity.
The heptad repeats of the mid-region display characteristics consistent with a coiled-coil section of protein (16, 47). Kyte-Doolittle hydropathy analyses demonstrate that the mid-regions of the RD genes are mostly hydrophilic (Fig. 5), which can be explained by the high frequency of amino acids with polar side chains, including glutamic acid (E), glutamine (Q), asparagine (N), and lysine (K), in the mid-region words (Fig. 4A). Coilscan analyses of the 19 RD genes in the five studied strains predicted that 91.0% of the 1,693 amino acids in mid-regions are part of a coiled-coil structure, compared to 7.5% of the 2,154 amino acids in FARs and 4.5% of the 3,465 amino acids in 3′ regions (P < 0.001 for both comparisons). Thus, despite the primary sequence diversity, there is strong conservation of secondary structure. Consistent with the FAR allelic differences, putative prokaryotic membrane lipoprotein lipid attachment sites were identified in all FAR1 sequences in strains 26695, J99, HPAG1, and G27 but not in any FAR2 sequences (Fig. 5).
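To illustrate why words rich in E, Q, N, and K score as hydrophilic, the following Python sketch computes a sliding-window Kyte-Doolittle profile. The scale values are the standard published Kyte-Doolittle hydropathy indices; the window size of 7, the function name, and the toy peptides are assumptions made for this example and do not reproduce the settings of the web tool used in the study.

```python
# Standard Kyte-Doolittle hydropathy indices (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Mean Kyte-Doolittle score for each full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical heptad built only from the polar residues named in the text.
polar_word = "EQNKEQN"
polar_profile = hydropathy_profile(polar_word)

# Hypothetical hydrophobic stretch for contrast.
hydrophobic_profile = hydropathy_profile("IIIIIII")
```

Negative window averages indicate hydrophilic stretches; a peptide made entirely of E, Q, N, and K can only produce negative values on this scale, because each of those residues has a negative index.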
To begin to examine intrahost variation, RD genotyping and mid-region sequencing were performed for four H. pylori stomach isolates (one from the antrum, two from the corpus, and one from the fundus) obtained from a patient from Ladakh, India. Many of the words in the mid-regions of the five studied strains (Fig. 4A) also were present in the mid-regions of the Ladakh isolates (Fig. 6). At locus 1, single colony 3 from the corpus had only one RD gene, whereas the three other isolates had two (Fig. 6A). The three strains (antrum-sc7, corpus-sc4, and fundus-sc1) that had two RD genes at locus 1 did not have identical mid-regions at locus 2 (Fig. 6B). Thus, all four isolates had differing RD genotypes. Analysis of the sequences of the mid-regions at locus 2 revealed an indel (Fig. 6C). The repetitive DNA flanking the indel is consistent with the pattern observed following strand slippage during DNA replication (10, 46).
Random amplified polymorphic DNA analysis was performed on the four H. pylori isolates from the patient represented in Fig. 6. Three of the isolates (7A-sc7, 7C-sc4, and 7F-sc1) showed identical patterns, indicating that they are clonally related, whereas the pattern for 7C-sc3 was completely different, indicating that this strain is different from the other three (data not shown). These results explain why the RD mid-region profile for 7C-sc3 was so different from those for the other three strains. Although the three related strains had identical mid-regions at locus 1, all three had different mid-regions at locus 2.
Multiple isolates from two additional patients also were examined (see Fig. S3 in the supplemental material). The mid-regions of the two isolates from the fundus of patient 30 differed by a single amino acid from the mid-regions of the five isolates from the corpus and fundus in the same patient (see Fig. S3A in the supplemental material). In addition to the nonsynonymous-nucleotide difference between the fundus isolates and the other isolates, the mid-regions of the isolates from the fundus also differed from the other isolates by two synonymous nucleotide changes. Eight isolates from patient 32 (two from the antrum, three from the corpus, and two from the fundus) had mid-regions identical at both the nucleotide and the amino acid levels (see Fig. S3B in the supplemental material). Thus, in an individual host, the dominant RD genotypes may be wholly, partially, or not at all conserved.
To determine whether the protein product of the RD gene JHP_0110 was recognized by H. pylori-colonized persons, ELISAs were conducted with recombinant JHP_0110 protein, using sera from H. pylori-positive adults and, as controls, from H. pylori-negative children and adults (Fig. 7). As expected, the sera from H. pylori-negative children (n = 22) showed little reactivity with the JHP_0110 protein (OD = 0.038 ± 0.017) (Fig. 7) and were used as a reference group. On the basis of this group, seropositivity was defined as >0.089 OD units (mean + 3 standard deviations). Both groups of adult subjects showed low-level reactivity, and the H. pylori-negative adults (OD = 0.138 ± 0.100; 66.6% seropositive) and the H. pylori-positive adults (OD = 0.179 ± 0.176; 63.0% seropositive) were not significantly different (P = 0.27).
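The seropositivity threshold follows from the reference-group statistics stated above (mean 0.038, SD 0.017, cutoff at the mean plus 3 SD). A minimal Python sketch of that rule is given here; the function names and the example OD readings are hypothetical and are not data from the study.

```python
def seropositivity_cutoff(reference_mean, reference_sd, n_sd=3):
    """Cutoff defined as the reference-group mean plus n_sd standard deviations."""
    return reference_mean + n_sd * reference_sd


def is_seropositive(od, cutoff):
    """A sample counts as seropositive only if its OD exceeds the cutoff."""
    return od > cutoff


# Reference group: H. pylori-negative children, values as reported in the text.
cutoff = seropositivity_cutoff(0.038, 0.017)

# Hypothetical OD readings, for illustration only.
example_ods = [0.050, 0.089, 0.120]
calls = [is_seropositive(od, cutoff) for od in example_ods]
```

With the reported values the cutoff evaluates to approximately 0.089 OD units, which matches the threshold used in the text.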
When Western blots of the recombinant JHP_0110 protein were probed with sera from 14 subjects, there was baseline reactivity for all, with no difference in band intensity between the H. pylori-positive sera and the H. pylori-negative sera. When JHP_0110 blots were exposed to the five sera most reactive under the ELISA conditions (Fig. 7), there also was little association between the IgG antibody level and the immunoblot findings (not shown). In total, these data do not support the hypothesis that the protein encoded by JHP_0110 is recognized as antigenic by H. pylori-positive subjects.
The previously recognized (19, 52) but not fully described group of genes that we now call the RD gene family was consistently observed at two well-defined genome loci in the 47 H. pylori strains that we studied. However, within these loci, we observed extensive variation at multiple levels: (i) the number of genes per strain; (ii) the identity of the FARs; (iii) the arrangement of genes at the two loci; and (iv) the composition, order, and length of the repetitive mid-regions. While we cannot be certain of the variation-generating mechanisms responsible for this diversity, the observed variation is likely related to the function of the proteins encoded by the RD gene family.
Variation in the number of RD genes across 47 strains and the phylogenetic relationships demonstrate that gene duplication in this family is common. While there were nearly equal numbers of RD genes at locus 1 and locus 2, the distribution by FAR type is nonrandom. This could reflect either functional considerations based on proximity of genes or founder effects. If the latter, it was most likely ancient, since the distribution is observed in isolates from widely dispersed parts of the world. Differing phylogenetic trees for the 5′ and 3′ regions of RD genes (Fig. 3) indicate that the opposite ends of each RD gene do not share a common evolutionary history. This finding could be explained by both inter- and intragenomic recombination, likely facilitated by the high level of identity shared by 3′ regions, FAR1s, FAR2s, and flanking genes.
The mid-regions of the RD genes exhibited the most intriguing variability, in the form of a complex “vocabulary” of heptad repeats. While the heptad amino acid “words” could be classified into 13 groups, the composition, order, and number of words varied greatly for the 19 RD genes in the five studied strains (Fig. 4). The variation in the number of repeats could provide encoded proteins with length-based functional differences, as has been observed for H. pylori FutA and FutB (50). However, it is also necessary to acknowledge the possibility that some of the repetitive DNA observed may be a result of repeated passing of cultures.
The specific mechanism behind the mid-region repeats is still unclear. The pattern observed in isolates obtained from individual hosts (Fig. 6; see also Fig. S3 in the supplemental material) is consistent with the hypothesis that repetitive DNA in the mid-region facilitates recombinatorial insertion or deletion events via slipped-strand mispairing (41, 46). However, imperfect repeats do not conform to expectations for slipped-strand repeat replication (13), nor is the evidence strong for gene conversion events in the mid-region. The presence of near-repeats suggests that many of the duplication events are very old and have not been subjected to gene conversion events occurring in concerted evolution, such as seen in rRNA genes (66) or in H. pylori babA and babB (58); their persistence suggests that each may represent a different functional allele. Thus, the observations described here may be an example of domain duplications (8, 75), which have been observed as genetic repeats from 11 to 609 nucleotides in length (42).
Domain duplications have been studied mostly for eukaryotic genomes (42, 75), and recombination (75) and selection at the protein level (35) are important in their evolution, but the mechanism for their origin is not understood. Repeats that become degenerate are more likely to be retained (23, 31, 75); our observation of words conserved between strains suggests selection for common function. The finding of the predominant words in the five studied strains among isolates from Ladakh, India, supports the notion of a common vocabulary in the H. pylori pan-genome. That the vocabulary has so many nuances suggests a repertoire capable of facilitating colonization in a wide variety of hosts in our outbred human population.
Tandem clusters of paralogous proteins often include membrane-associated proteins (33). Our finding of prokaryotic membrane lipoprotein lipid attachment sites in FAR1 sequences (Fig. 5) is consistent with the placement of these genes' protein products at the cell surface. A homologue of RD gene HP_1188 encodes a membrane-expressed protein that promotes attachment to gastric epithelial cells (64). When H. pylori cells were cocultured with AGS cells, HP_0118 expression was upregulated 2.9-fold and the encoded protein expressed at the cell surface (34). While an RD knockout showed in vitro growth characteristics indistinguishable from those of the wild type (19), its colonization efficiency was significantly impaired in a mouse model (52). In total, these experiments suggest that the RD genes encode adhesins that promote H. pylori attachment to host epithelial cells. The repetitive DNA in the RD gene family could be advantageous for colonization and persistence because domain duplication can lead to phenotypic variants and short-sequence DNA repeats facilitate adaptive characteristics, such as phase variation and antigenic variation (73).
How the extensive variability in RD genes is adaptive for H. pylori is not known. Phase variation would be unlikely since all observed repeats encoded 7 or 11 amino acids, but antigenic variation is still a possibility. Our observation that H. pylori-positive subjects are no more than minimally seroreactive to recombinant JHP_0110 suggests that this protein is not an H. pylori-specific antigen, that this protein is not reactive with human IgG under the conditions tested, that JHP_0110 is not expressed in vivo, or that mid-region hypervariability prevents the human immune system from mounting a full response. The last scenario would be similar to the observations for CagY, which also is a surface-exposed protein with extensive variation based on repetitive DNA (9, 18, 44). To fully substantiate this hypothesis, extensive testing with multiple RD antigens and various reaction conditions would have to be conducted.
That the mid-region is largely predicted to encode coiled-coil structures suggests functional possibilities; the ability of coiled-coil structures to dimerize (16, 47) increases the number of structural possibilities for the encoded RD proteins. With the most common genotype of single FAR1 and FAR2 genes, there are three possible dimers (FAR1/FAR1, FAR1/FAR2, and FAR2/FAR2). Since the RD genes within any strain are not identical and some strains have as many as five, the number of combinatorial dimer possibilities rises rapidly with gene number (15 possible homo- and heterodimers for five distinct genes), providing a large reservoir of potential variants within each strain. In addition, coiled-coil structures may serve as pH-dependent molecular switches (16); pH-dependent conformational changes in adhesion could help adaptation to the acid gradient in the gastric lumen.
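The dimer count discussed above can be enumerated directly. This Python sketch treats each RD gene product as a distinct monomer and counts unordered pairs, including homodimers; the gene labels for the five-gene case are placeholders, and the assumption that every pairing is physically possible is made only for the purpose of counting.

```python
from itertools import combinations_with_replacement


def possible_dimers(monomers):
    """List every unordered pairing of monomers, homodimers included."""
    return list(combinations_with_replacement(monomers, 2))


# Most common genotype: one FAR1 gene and one FAR2 gene.
two_gene_dimers = possible_dimers(["FAR1", "FAR2"])

# Hypothetical strain with five distinct RD genes (placeholder labels).
five_gene_dimers = possible_dimers(["rd1", "rd2", "rd3", "rd4", "rd5"])
```

For n distinct monomers the count is n(n + 1)/2, which gives the three dimers named in the text for two genes and 15 for five genes.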
Expression of RD genes appears to be regulated by the two-component ArsRS system (HP_0166-HP_0165) (19), which is required for acid resistance in H. pylori (45). At pH ≤5, HP_0119 (48) and HP_1187 and HP_1188 (77) were all strongly upregulated, and expression levels of HP_0118, HP_0120, HP_1187, HP_1188, JHP_0110, and JHP_1113 were four- to eightfold greater in the wild type than in a ΔarsS strain (55). Alternative (but not exclusive) hypotheses are that low pH upregulates RD gene expression, promoting adhesion and enabling enhanced survival in the more nearly neutral paracellular niches, or that adhesion permits H. pylori to better regulate host gastric acid production (20).
Although our study did not identify the role of the RD gene family in H. pylori, our observations of conserved and diverse structures should provide a framework for future research. Since the RD proteins provide a competitive advantage to H. pylori during colonization (52) and we now document their genetic hypervariability, it is reasonable to conclude that the variability itself may be adaptive. Future studies of RD protein antigenicity will allow determination of whether their repetitive DNA facilitates antigenic or other functional variation. Molecular studies of protein-protein interactions in the RD gene family may uncover the functional significance of the two different FARs and the mid-region's coiled-coil structures.
This research was supported in part by R01 GM63270 and by the Diane Belfer Program in Human Microbial Ecology.
We thank David A. Baltrus and Karen Guillemin for sharing unpublished data, Ernst J. Kuipers for strains and information, and Edgardo L. Sanabria-Valentin for helpful discussions.
Published ahead of print on 11 September 2009.
†Supplemental material for this article may be found at http://jb.asm.org/.