Helicobacter pylori (H. pylori) has a unique ability to survive in extremely acidic environments and to colonize the gastric mucosa. It can cause diverse gastric diseases such as peptic ulcers, chronic gastritis, mucosa-associated lymphoid tissue (MALT) lymphoma, and gastric cancer. Based on genomic research of H. pylori, over 1600 genes have been functionally identified so far. However, H. pylori possesses some genes that remain uncharacterized because: (i) the gene sequences are quite new; (ii) the function of the genes has not been characterized in any other bacterial system; and (iii) sometimes, a protein that is classified as a known protein on the basis of sequence homology shows some functional ambiguity, which raises questions about the function of the protein produced in H. pylori. Thus, many genes still need to be biologically or biochemically characterized to understand the whole picture of gene functions in the bacterium. In this regard, knowledge of the 3D structure of a protein, especially an unknown or hypothetical protein, is frequently useful for elucidating the structure-function relationship of the uncharacterized gene product. That is, a structural comparison with known proteins provides valuable information to help predict the cellular functions of hypothetical proteins. Here, we show the 3D structures of some hypothetical proteins determined by NMR spectroscopy and X-ray crystallography as part of the structural genomics of H. pylori. In addition, we show some successful approaches to elucidating the function of unknown proteins based on their structural information.
Helicobacter pylori is one of the pathogens involved in various gastric diseases such as peptic ulcers, chronic gastritis, mucosa-associated lymphoid tissue lymphoma, and gastric cancer [1–3]. Infection with H. pylori is associated with an increased risk of gastric adenocarcinoma and has attracted attention as a cofactor in the pathogenesis of this malignant condition. Moreover, the risk of developing cancer is related to the physiologic and histologic changes induced by an H. pylori infection in the stomach. Despite a general decline in the incidence of gastric cancer, it remains the fourth most common cancer and the second leading cause of cancer-related deaths worldwide. However, most H. pylori infections do not cause cancer. The sporadic distribution of the disease caused by H. pylori appears to depend on host-related factors: the host genetics controlling the inflammatory response, the age at which the H. pylori infection was acquired, poor nutrition, storage of food, and the pattern of food consumption [7–9].
In addition, bacterial factors associated with the risk of gastric cancer have also been emphasized, and molecular and cell biology approaches aimed at understanding the interaction between H. pylori and transforming epithelial cells have been carried out. Since H. pylori is a highly heterogeneous bacterial species, both genotypically and phenotypically, and is highly adapted for survival in the gastric niche, it is not easy to identify the major bacterial factors that are directly associated with etiopathogenesis [10,11]. Based on current knowledge, several virulence factors are regarded as possible bacterial factors, such as genes within the cag (cytotoxin-associated antigen) pathogenicity island, including the gene encoding the CagA protein, as well as polymorphic variation in the VacA vacuolating exotoxin and the blood group antigen-binding adhesins BabA and SabA [6,10,12]. A duodenal ulcer-promoting gene (dupA), located in the “plasticity region” of the H. pylori genome, was reported as a potential virulence marker [10,13]. Other bacterial factors such as peptidoglycan, lipopolysaccharide (LPS), γ-glutamyl transpeptidase (GGT), and the protease HtrA may be linked to pathogenicity.
Although a huge amount of biological data on H. pylori has been accumulated, enzymes or proteins of unknown function still make up more than a third of the open reading frames (ORF) of H. pylori. An unknown protein could be defined as a protein whose function has not yet been characterized, and a hypothetical protein could be defined as a protein that is supposed to exist in an organism although its existence has not been shown experimentally. Therefore, in a broad sense, hypothetical proteins could be included in unknown proteins. To completely understand the pathogenic mechanism of H. pylori, it is very important to elucidate the functions of these unknown proteins. To fill in the “missing parts list” is accordingly one of the greatest challenges for post-genomic biology, and a tremendous opportunity to discover new biological and pathogenic machinery in H. pylori.
The sequencing of the H. pylori genome started in 1997 with the H. pylori strain 26695 . It was isolated from an English patient with chronic gastritis. The chromosome of strain 26695 is circular and composed of 1.67 mega base pairs (Table 1). The average G-C content is approximately 38.9% and the genome has 1590 open reading frames (ORF) that are possibly protein-coding loci , together with the RNA coding genes (2 copies of 16S rRNA and 23S rRNA genes, 36 tRNA genes). From the following analysis of the same genome, it was suggested that a smaller number of ORFs is in the sequence of strain 26695 .
Ongoing studies have found genes that were missing in previous analyses, as in the case of SecE. A general secretion machinery is widely present in bacteria, which functions in the secretion of outer membrane proteins to extracellular environments . From the first annotation results, it was thought that strain 26695 had only a partial general secretion machinery because it lacked SecE . A new small open reading frame between nusG and rmpG (HP1203–HP1204) in the genome sequences was found using an ab initio server, GeneMark, Glimmer, and BlastX . It has a high homology and structural similarity to the SecE protein in related bacteria implying that strain 26695 has a complete general secretion machinery. In addition, small RNA genes are universally present in bacteria . The tmRNA gene (ssrA) has been found in H. pylori, encoding a functional RNA molecule and a small peptide involved in the quality control of translation . In addition, the H. pylori strain contains a sRNA gene encoding the RNA component of RnaseP and the 4.5S RNA gene which is involved in secretion [22,23].
In 2008, the adaptation of H. pylori during a rarely captured event, the progression of its host from chronic atrophic gastritis (ChAG) to gastric adenocarcinoma, was characterized, together with the impact of this adaptation on an intriguing but poorly characterized interaction between the bacterium and gastric epithelial stem cells. H. pylori HPKX_438_AG0C1 and HPKX_438_CA4C1 were isolated from a single patient who progressed from ChAG to adenocarcinoma during a population-based endoscopy study. The genomes of the ChAG-associated isolate (Kx1) and the cancer-associated isolate (Kx2) were analyzed to examine this adaptation. Microarrays have also given a comprehensive view of the genome diversity of H. pylori. That analysis was combined with information on the origin of the hspA and glmM alleles, revealing that H. pylori infection may be acquired by more diverse routes than previously expected. According to cluster analysis, isolates from family D belonged to three different strains, those from family L consisted of two strains, and those from family A were grouped into at least five strains. Strains from family D and family L differed by the presence or absence of 24 to 42 CDSs (coding sequences). In family A, one strain was difficult to define because of the small differences in gene profiles between neighboring branches.
In 2009, the complete genome sequence of H. pylori G27 was reported . The G27 strain was originally isolated from an endoscopy patient from Italy . The genome consists of a single circular chromosome with about 1.65 mega base pairs (Table 1) that is AT rich (61.6%), contains 1515 ORFs, and is similar in size and composition to the other published H. pylori genomes of strains 26695, J99, and HPAG [15,16,28]. The G27 strain contains 58 genes that are not found in 26695, J99, or HPAG, as defined by a blastp hit. The majority of these G27-specific genes are predicted to encode hypothetical proteins .
In the same year, the genome sequences of two H. pylori strains were analyzed . H. pylori strain 98-10 was isolated from a patient with gastric cancer and strain B128 was isolated from a patient with gastric ulcer disease. Strain 98-10 was most closely related to H. pylori strains of East Asian origin and strain B128 was most closely related to strains of European origin. Strain 98-10 contained multiple features characteristic of East Asian strains, including a type s1c vacA allele and a cagA allele encoding an EPIYA-D tyrosine phosphorylation motif.
Very recently, several genome sequences of different strains were reported, accelerating H. pylori genomic and proteomic research [30–38]. Strain 908 is a close relative of J99 and was isolated from an African patient living in France who suffered from duodenal ulcer disease. The B8 strain consists of about 1.67 mega base pairs and a small plasmid of about 6000 base pairs carrying nine putative genes. Interestingly, the B8 strain contains coding sequences, 293 of which are strain-specific, coding mainly for hypothetical proteins with unknown functions. Similarly, the P12 strain contains plasticity zones, which encode the type IV secretion system and have the typical properties of genomic islands. Another sequenced genome, that of the Shi470 strain, known as the Shimaa village strain, was more Asian- than European-like genome-wide, indicating Amerind ancestry. This strain contains two unique cagA virulence genes and a novel allele of gene hp0519 encoding a host tissue interaction protein. There are several H. pylori populations, such as hpAfrica1, hpEurope, hspEAsia, and hspAmerind, because this bacterium has colonized the stomach since early in human evolution and diverged with ancient human migrations [41–43]. A member of one of these populations, the hspAmerind strain V225d, was cultured from a Venezuelan Piaroa Amerindian subject and identified. The V225d strain is cag-positive, encoding a multifunctional effector protein that is injected into host cells by the cag type IV secretion system. Two strains, 2017 and 2018, are chronological subclones of strain 908 and were cultured from the antrum and corpus, respectively. Comparative genomic analysis [35,37] showed that these two strains are almost identical and descended from strain 908 [30,36]. The B45 strain was sequenced from a patient with gastric mucosa-associated lymphoid tissue (MALT) lymphoma, and an integrated prophage in this strain was induced by UV irradiation.
The Comprehensive Microbial Resource (CMR) is a free tool that allows researchers to access all of the publicly available bacterial genome sequences completed to date  (Figure 1). Currently, it provides genomic sequences of three strains of Helicobacter pylori (26695, HPAG1, J99).
As in the case of other genomic research, Structural Genomics Initiatives are mainly responsible for the determination of H. pylori protein structures. These initiatives, together with the structure determination of known proteins, have made enormous strides in the elucidation of unknown protein structures of H. pylori [15,16,24–26,28–38,45–47]. The available structural data have already led to the identification of potentially new drug targets and have been helpful in assigning functions to proteins whose functions were previously unknown [49,50].
The increase in structure determination for H. pylori has been triggered by the sequencing of the H. pylori 52 and 26695 genomes [15,25,45,47]. The genome sequences and their protein structures yielded many clues to help understand the pathogenesis of H. pylori. Approximately 14% of the structures are lyases, which represent the largest proportion of any functional class among the structures already solved (Table S1).
The sequencing of the genome led to a dramatic increase in the number of known structures for H. pylori proteins deposited in the Protein Data Bank (PDB) (Figure 2). The first H. pylori protein structure was determined in 2001 (PDB ID: 1G6O) . In the following four years, 32 more structures were reported (Figure 2). After several sub-species genome sequences of H. pylori became publicly available, the number of structures determined after 2005 increased sharply and at an increasing rate.
Usually, protein solubility is one of the main bottlenecks in structure determination . In the case of H. pylori, methods have already been developed that remedied this problem, such as the development of customized expression strategies for H. pylori proteins in Escherichia coli . The increase in determined structures is also due to the development of improved methods for high-throughput X-ray crystallography. However, the major driving force for this increase was the availability of genome-wide sequence data in the early 2000s.
As of 14 February 2012, the PDB contained 79,356 structures, of which 279 (0.35%) are structures of H. pylori proteins. Of these, 28 are proteins of unknown function, which represents about 10.0% of the determined H. pylori structures (Table 2).
A complete list of H. pylori protein structures deposited in the PDB is given in the Supporting Information Table S1. The predominant method used to determine these structures was X-ray crystallography, which accounts for 261 of the H. pylori structures currently determined (Figure 2). A further 18 were elucidated by solution-state NMR spectroscopy. Most structures are of individual proteins, although many are bound to small-molecule ligands such as substrate analogues, and only 11 protein-DNA complexes have been determined (Figure 3, Table S1).
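The proportions quoted in the two paragraphs above can be rechecked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable and function names are illustrative.

```python
# Counts quoted in the text (PDB snapshot of 14 February 2012).
total_pdb = 79_356        # all structures in the PDB
h_pylori = 279            # H. pylori structures
unknown_function = 28     # H. pylori structures of unknown function
xray, nmr = 261, 18       # H. pylori structures by method


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


share_of_pdb = percent(h_pylori, total_pdb)
share_unknown = percent(unknown_function, h_pylori)

print(f"H. pylori share of the PDB: {share_of_pdb:.2f}%")
print(f"Unknown-function share:     {share_unknown:.2f}%")
print("X-ray + NMR equals total:  ", xray + nmr == h_pylori)
```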
The most typical approach to predicting the function of an unknown protein is to use sequence similarity, that is, to find a similar protein of known function. Based on sequence similarity, a predictor transfers the known function to the query protein. Indeed, the functions of enzymes tend to be conserved if they share more than 40%–50% sequence identity. The sequence-based approach is reasonable; however, approximately 50% of the unknown proteins from a newly sequenced genome cannot be assigned a function using sequence-similarity approaches alone (Figure 1). The low efficiency of the sequence-similarity search may be partly caused by gene sequences that are quite new and genes that have not yet been characterized in other bacterial systems. To overcome the weakness of sequence-similarity searches, several so-called “similarity-free” methods have been tried. These methods use the physicochemical properties and secondary structure of proteins. They were developed in bioinformatics, and there have been successful cases of characterizing function or structure [58–60]. However, the methods need to be improved, since similarity-free methods still depend to a certain extent on similarity.
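As a minimal sketch of the identity criterion mentioned above, the following code computes percent identity from two already-aligned sequences and compares it with a 40% cutoff. The function names and the toy sequences are hypothetical, and the choice of denominator (aligned columns without gaps) is an assumption, since several definitions of sequence identity are in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def function_likely_conserved(identity: float, threshold: float = 40.0) -> bool:
    """Rule of thumb from the text: enzyme function tends to be conserved above ~40-50% identity."""
    return identity >= threshold


# Toy alignment (hypothetical sequences, not H. pylori proteins).
seq_a = "MKT-AYIAKQR"
seq_b = "MKTLAYLAKHR"
identity = percent_identity(seq_a, seq_b)
print(identity, function_likely_conserved(identity))
```

A cutoff of this kind is only a heuristic; as the text notes, about half of the unknown proteins in a new genome fall outside its reach.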
Another approach to identify function is to use 3D structures. This approach often succeeds in cases where sequence-based methods fail. This may be due to the idea that in many cases evolution retains the folding pattern long after the sequence similarity becomes undetectable. Structural similarity searches use the global fold of the protein [61–64] or detect the functionally important regions of the protein [65–69]. Since structures diverge more slowly than sequences, a sequence comparison may be less sensitive than a structure comparison . However, the structural comparison still has the limitation of false positives being reported and needs to be improved to overcome overestimation of statistical significance like sequence-similarity searches . This means that experimental confirmation is still required for exact assignment of function to an unknown protein.
Some examples of functional elucidation of unknown proteins from H. pylori are provided below. For functional estimation, we generally followed four steps: (i) structure determination; (ii) sequence homology search using PSI-BLAST; (iii) structural homology search using the web server DALI; and (iv) experimental confirmation of the function.
The high-quality NMR structure of HP0894 was reported. The HP0894 structure (PDB ID: 1Z8M) has two α-helices, two 3₁₀-helices, and four β-strands (α-α-3₁₀-β-3₁₀-β-β-β). The β-strands form a four-stranded antiparallel β-sheet (Figure 4). A BLAST conserved domain search showed that HP0894 contains the conserved domain DUF332 (Domain of Unknown Function), which is equivalent to COG 3041 in the National Center for Biotechnology Information Database of Clusters of Orthologous Groups. However, in the Pfam database, HP0894 belongs to the plasmid stabilization system protein family (PF05016). From the sequence homology search, we were able to get a hint of the function. Moreover, a search for structural homologs with a Z score higher than 3.0 using the program DALI showed that HP0894 is structurally similar to Pyrococcus horikoshii archaeal RelE (PDB code: 1WMI, Z score = 7.8, pairwise RMSD = 2.8 Å), E. coli YoeB (PDB code: 2A6Q, Z score = 8.8, RMSD = 2.9 Å), and guanyloribonuclease (PDB code: 1RGE, Z score = 3.3, pairwise RMSD = 3.4 Å). These proteins are all ribonucleases, have a similar number of residues to HP0894 (around 90), share a similar β-sheet topology with HP0894, and have a comparable location for two of their helices (Figure 4). As expected, they have no detectable sequence homology with HP0894 in PSI-BLAST searches and Blast2 (pairwise comparison) analyses. The structural homology search suggested that HP0894 may have ribonuclease activity and may belong to a toxin-antitoxin (TA) system, like RelE. Generally, in a TA system, toxin expression induces arrest of cell growth, whereas the antitoxin neutralizes the toxin by a direct protein-protein interaction. Both proteins of the toxin-antitoxin system are encoded within a single operon, with the toxin gene usually located directly downstream of the antitoxin gene. Thus, we hypothesized that: (i) HP0894 is a toxin molecule in H. pylori; (ii) there should be an antitoxin molecule that interacts with HP0894; and (iii) if such an antitoxin exists, its gene should be located near hp0894 on the chromosome. Indeed, we found that HP0895 (hypothetical protein), encoded upstream of the hp0894 gene, is an antitoxin molecule.
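The filtering step in the structural homology search can be illustrated with the three hits quoted above. This is a minimal sketch, not the DALI server's own interface: the `StructuralHit` class and `filter_hits` function are hypothetical names, and only the Z scores, RMSD values, and PDB codes come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralHit:
    name: str
    pdb_id: str
    z_score: float
    rmsd: float  # pairwise RMSD in angstroms


# Hits reported in the text for HP0894.
hits = [
    StructuralHit("P. horikoshii archaeal RelE", "1WMI", 7.8, 2.8),
    StructuralHit("E. coli YoeB", "2A6Q", 8.8, 2.9),
    StructuralHit("Guanyloribonuclease", "1RGE", 3.3, 3.4),
]


def filter_hits(candidates, min_z=3.0):
    """Keep hits with a Z score above min_z, ranked from highest to lowest Z score."""
    kept = [h for h in candidates if h.z_score > min_z]
    return sorted(kept, key=lambda h: h.z_score, reverse=True)


for hit in filter_hits(hits):
    print(f"{hit.pdb_id}  Z={hit.z_score:.1f}  RMSD={hit.rmsd:.1f} Å  {hit.name}")
```

Raising `min_z` (for example to 5.0) would drop the weakest hit, which shows how sensitive the candidate list is to the chosen threshold.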
Our experimental data showed that HP0894 and HP0895 form a stable complex as a large multimer (hexamer; (HP0895)6, (HP0894–HP0895)6), and that the inhibitory effect of HP0894 on E. coli cell growth was neutralized by HP0895. In bacteria, toxins function, or are supposed to function, by inhibiting translation through mRNA cleavage. With an RNA retardation experiment, the in vitro RNase activity of HP0894 was confirmed, and HP0895 inhibited this RNase activity. A primer extension experiment showed that HP0894-mediated mRNA cleavage occurred predominantly before adenine (A) or guanine (G) residues, and we suggested that -U:A- and -C:A- sequences are the most preferred cleavage sites. The binding mode between HP0894 and HP0895 was studied in more depth using NMR and CD spectroscopy, and we identified the binding interface of HP0894. Interestingly, HP0316 (hypothetical protein), which has 85% sequence identity with HP0895 except for 30 residues at the C-terminal tail, did not bind to HP0894, suggesting that the non-conserved C-terminal tail of HP0895 may be responsible for binding to HP0894. Indeed, with a synthesized C-terminal peptide of HP0895, the residue-specific interaction sites of HP0894 were clarified (Figure 4). These results indicate that the HP0894–HP0895 TA system, especially through negative regulation of the HP0894 toxin by the HP0895 antitoxin, may be related to the status of H. pylori infection in the human gastric mucosa and to its survival in that locus.
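To make the suggested site preference concrete, the sketch below scans an RNA string for the -U:A- and -C:A- dinucleotides, where the colon marks the scissile position. It is a plain string search under that reading of the notation, not a model of enzyme specificity; the function name and the example sequence are hypothetical.

```python
PREFERRED_SITES = ("UA", "CA")  # -U:A- and -C:A-, cleavage between the two bases


def candidate_cleavage_positions(rna: str):
    """Return (position, dinucleotide) pairs.

    position is the 0-based index of the A, so the putative cut falls
    immediately before rna[position].
    """
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    sites = []
    for i in range(len(rna) - 1):
        pair = rna[i:i + 2]
        if pair in PREFERRED_SITES:
            sites.append((i + 1, pair))
    return sites


example = "GGUACGCAUUAG"  # hypothetical sequence for illustration
print(candidate_cleavage_positions(example))
```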
Notably, HP0892 (hypothetical protein) and HP0894 share high sequence similarity (identity 53%), so HP0892 is expected to be a paralog of HP0894. Consistent with this, the structure of HP0892 is very similar to that of HP0894 (Figure 5), and, like HP0894, HP0892 is structurally similar to archaeal RelE (aRelE) (Z score = 8.1, RMSD = 2.7 Å) and the YoeB toxin of E. coli (Z score = 9.6, RMSD = 2.9 Å). Based on the above study, HP0892 was speculated to be another toxin molecule. However, no protein comparable to the HP0895 antitoxin is encoded immediately upstream or downstream of the hp0892 gene. Thus, the function of HP0892 is still questionable, which shows that structural homology alone does not always reveal the function of an unknown protein. According to gene comparison studies using DNA microarrays, the hp0892 gene is one of several H. pylori genes absent from a set of five cag pathogenicity island (PAI)-negative strains, while the hp0894 gene is not. This gene may therefore represent a marker for the identification of virulent strains or a novel virulence factor. It is thus probable that the biological role of HP0892 is different from that of HP0894, aRelE, and YoeB, despite the sequence and/or structural similarities among them.
Virulence-associated protein, a product of the vap gene in various organisms, may be insufficient in itself, but is a requisite for virulence. The vap genes are known as factors or enzyme-producing factors that regulate the expression of true virulence genes or activate virulence factors by translational modification, processing of secretions or that are required for the activity of true virulence factors. Several vap genes (vapA, B, C, D, H and I) are known to exist in various organisms [82–84] but how the products of the vap genes are related to virulence remains unclear. H. pylori strain 26695 has only one type of virulence-associated protein, VapD. Two genes in this strain (HP0315 and HP0967) belong to vapD . The exact biological role of the VapD protein has not yet been established, but several suggestions such as toxin, acid tolerance, plasmid stability, etc. have been made [86–88]. Here, we summarized the elucidation of the probable function of HP0315 with structural and biochemical studies.
The structure of HP0315 consists of nine secondary structure elements: β1 (residues 1–8), α1 (residues 10–17), α1′ (residues 21–35), β2 (residues 38–41), β3 (residues 44–47), α2 (residues 53–66), α2′ (residues 68–73), β4 (residues 75–87) and α3 (residues 88–93). The monomer has a ferredoxin-like fold. Its topology is β1-(α1-α1′)-β2-β3-(α2-α2′)-β4-α3 rather than the β-α-β-β-α-β arrangement of the ferredoxin fold. The dimer of HP0315 is butterfly-shaped (PDB code: 3UI3, Figure 6). The β4 strand and the α3 helix associate with the adjacent monomer, forming a dimerization interface. To our knowledge, this is the first structure of a VapD family member. A sequence homology search revealed that HP0315 is related to the CRISPR-associated protein Cas2, a novel family of endoribonucleases, suggesting potential ribonuclease activity for HP0315. The structure-based alignment also yielded a high DALI score for one of the Cas2 proteins, SSO1404 (PDB code: 2IVY), although the top-scoring proteins were mainly hypothetical proteins of unknown function. In addition, the relationship between VapD and Cas2 proteins was supported by a genomic analysis.
The sequence analysis yielded another interesting result: the two genes HP0315 and HP0316 exist as an operon, a functional unit of genomic DNA containing partially overlapping genes under the control of a single regulatory signal or promoter (gene coordinates: HP0315 330872–330588, HP0316 331245–330853, Figure 6). As described above, HP0316 has a sequence similarity of 88.9% with HP0895, which might suggest that the HP0315–HP0316 system is analogous to the HP0894–HP0895 system. In other words, HP0315 might act as a toxin molecule like HP0894, although no sequence or structural similarity exists between them. However, HP0315 did not bind HP0316 and did not affect cell viability in in vivo toxicity experiments. From the sequence/structure analysis and biochemical experiments, HP0315 was speculated to be a ribonuclease but not a toxin, even though the gene arrangement is similar to that of a TA system. The RNase activity of HP0315 was confirmed by primer extension and gel retardation experiments, which revealed purine-specific endoribonuclease activity.
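The partial overlap of the two genes follows directly from the coordinates quoted above. The sketch below assumes the coordinates are 1-based and inclusive, with start greater than end indicating the reverse strand; the helper names are illustrative.

```python
def span(start: int, end: int):
    """Return (low, high) regardless of strand orientation."""
    return (min(start, end), max(start, end))


def length_bp(start: int, end: int) -> int:
    """Gene length in base pairs, assuming 1-based inclusive coordinates."""
    low, high = span(start, end)
    return high - low + 1


def overlap_bp(gene_a, gene_b) -> int:
    """Number of base pairs shared by two genes (0 if they do not overlap)."""
    low_a, high_a = span(*gene_a)
    low_b, high_b = span(*gene_b)
    return max(0, min(high_a, high_b) - max(low_a, low_b) + 1)


# Coordinates as given in the text.
hp0315 = (330872, 330588)
hp0316 = (331245, 330853)

print("HP0315 length:", length_bp(*hp0315), "bp")
print("HP0316 length:", length_bp(*hp0316), "bp")
print("Overlap:", overlap_bp(hp0315, hp0316), "bp")
```

Under these assumptions the two reading frames share a short stretch at the 5′ end of hp0315, consistent with the description of partially overlapping genes in a single operon.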
In conclusion, HP0315, a member of the VapD family, is structurally similar to the Cas2 family and has a gene arrangement similar to that of a TA system; however, it does not belong to either of them and resembles an evolutionary intermediate. The exact function of HP0315 has not been determined yet. However, considering its relationship with Cas2 and TA systems, as well as its endoribonuclease activity, HP0315 may be related to cell maintenance, to a defense mechanism against invasion, or possibly to both, as is the case for Cas2 and/or TA systems.
The 3D structure of the hypothetical protein HP0062 (PDB code: 3FX7) was solved at 1.65 Å resolution. HP0062 is a small protein composed of 86 amino acids, but it exists as a dimer. The HP0062 monomer folds into a hairpin structure in which two α-helices (the N- and C-helix) are connected by a short loop (Figure 7A), and the N-helix displays a modified leucine zipper. The protomers dimerize in an antiparallel arrangement in which the N and C helices of one protomer pack against the N and C helices of the second protomer, forming a four-helix bundle. The two protomers in the asymmetric unit of the orthorhombic crystal are similar, and the topologically equivalent Cα carbons superimpose with an RMSD of 0.79 Å. The structure of HP0062 was also solved by another group, who reported that the protein is monomeric (unpublished, PDB code: 2GTS). Since our gel filtration chromatography revealed the dimeric state of HP0062, it is believed that the biologically relevant form is a dimer. The structural comparison indicated that HP0062 is similar to the coiled-coil segments of over 100 functionally unrelated proteins that are involved in various protein-protein interactions. Thus, the function of HP0062 is hard to estimate directly from the structural information. Interestingly, HP0062 shares many characteristics with the ESAT-6 family of Gram-positive bacteria: a small dimer, a helix-hairpin-helix structure, no signal peptide but a WXG motif in the hairpin bend (WRD in HP0062), and a gene cluster that includes a protein with an FtsK/SpoIIIE domain. On the other hand, HP0062 also shares characteristics with the TTS (Type Three Secretion) chaperones of Gram-negative bacteria: a small dimer, an acidic pI, an overall α-helical character, and a carboxy-terminal amphipathic α-helix. These results hint that HP0062 may function as a transport chaperone and/or an adaptor protein that facilitates interactions with host receptor proteins.
HP0495 is an 86-residue hypothetical protein with a molecular weight of 10,192.7 Da. The atomic coordinates of the final structure have been deposited in the PDB (2H9Z). HP0495 has two α-helices and four β-strands, forming a ferredoxin-like fold, β1-α1-β2-β3-α2-β4 (Figure 7B). HP0495 is a completely unknown protein, since its sequence homology is restricted to unknown proteins from several bacteria [94,95]. Ubiquitous proteins like HP0495 merit the highest priority for functional characterization because they have the greatest potential payoff in new biological knowledge. In this case, the structure of HP0495 and structural homology data may be more important and may provide a clue to the function. Unfortunately, a structural homology search using DALI indicated that HP0495 has structural homology with a variety of proteins. This is probably because the ferredoxin-like fold of HP0495 is abundant in other structures. Twenty proteins had a Z-score higher than 5.0 in the DALI analysis, including the NikR protein from Pyrococcus horikoshii (nickel responsive repressor; PDB code: 2BJ9, RMSD = 2.9 Å), LrpA from Thermus thermophilus (transcriptional regulator; PDB code: 1RIS, RMSD = 2.9 Å), S6 protein from Archaeoglobus fulgidus (ribosomal protein; PDB code: 1Y7P, RMSD = 2.9 Å), and a hypothetical YbeD protein from E. coli (unknown; PDB code: 1RWU, RMSD = 3.6 Å). The structural comparison did not give a clear result. However, the function of HP0495 seems to be related to nucleic acid interaction, since its homologues are mainly nucleic acid binding proteins and HP0495 possesses positive surface charges (Figure 7B).
HP0827 is classified as a putative single-stranded (ss)-DNA binding protein 12RNP2 precursor protein. The solution structure of HP0827 (PDB code: 2KI2) has a ferredoxin-like fold, β1-α1-β2-β3-α2-β4. The four β-strands are arranged in a right-handed twist and form an antiparallel β-sheet that packs against the two α-helices (Figure 7C). This protein contains one RRM (RNA Recognition Motif) comprising two ribonucleoprotein motifs (RNP1, Lys/Arg-Gly-Phe/Tyr-Gly/Ala-Phe/Tyr-Val/Ile/Leu-X-Phe/Tyr, and RNP2, Ile/Val/Leu-Phe/Tyr-Ile/Val/Leu-X-Asn-Leu). Since the RRM is an abundant component of protein structures, the presence of an RRM alone cannot reveal the exact function of HP0827. Indeed, a total of 6056 RRM motifs can be found in 3541 different proteins in the Pfam database. We could not elucidate the biological function of HP0827 on a structural basis, though the structure may provide information on the putative RNA binding site. Further biological studies are required in this case.
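The two consensus motifs quoted above translate directly into regular expressions over one-letter amino acid codes, with X meaning any residue. The sketch below is illustrative only: the `find_motifs` helper is a hypothetical name and the test sequence is a made-up toy, not the HP0827 sequence.

```python
import re

# RNP1: Lys/Arg-Gly-Phe/Tyr-Gly/Ala-Phe/Tyr-Val/Ile/Leu-X-Phe/Tyr
RNP1 = re.compile(r"[KR]G[FY][GA][FY][VIL].[FY]")
# RNP2: Ile/Val/Leu-Phe/Tyr-Ile/Val/Leu-X-Asn-Leu
RNP2 = re.compile(r"[IVL][FY][IVL].NL")


def find_motifs(sequence: str) -> dict:
    """Return 0-based start positions of RNP1 and RNP2 consensus matches."""
    sequence = sequence.upper()
    return {
        "RNP1": [m.start() for m in RNP1.finditer(sequence)],
        "RNP2": [m.start() for m in RNP2.finditer(sequence)],
    }


# Hypothetical toy sequence containing one copy of each motif.
toy = "MSKLFVGNLPWSAAAAKGFGFVEFAAA"
print(find_motifs(toy))
```

A match of this kind only flags a candidate RRM; as the text stresses, it does not by itself indicate the biological function of the protein.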
The HP1242 gene encodes a 76-residue conserved hypothetical protein with a molecular weight of 9111 Da. HP1242 adopts a full helical structure, which is composed of three α-helices . These correspond to residues 6–14 (αI), 18–38 (αII), and 43–75 (αIII). The overall structure of HP1242 represents a coiled-coil-like conformation (Figure 7D). Based on the sequence homology, HP1242 is classified as the DUF (Domain of Unknown Function) 465 family, which has an unknown function. These family members are found in several bacterial proteins, and also in the heavy chain of eukaryotic myosin and kinesin, which are predicted to form coiled coil structures. HP1242 has a structural homology with a variety of proteins including the rop protein (transcription regulation), arfaptin 2 fragment (signaling protein), sensory rhodopsin II fragment (membrane protein complex) and so on . This result indicates that the function of HP1242 could not be evaluated by only a structural comparison.
We also determined the solution structure of HP1423, which has 84 amino acid residues. HP1423 is a hypothetical protein as well. According to the Pfam database, HP1423 belongs to the S4 (PF01479) superfamily. The S4 domain is a small domain consisting of 60–65 amino acid residues that probably mediates binding to RNA. The structure of HP1423 is composed of five β-strands and three α-helices. The topology can be described as α1-α2-β2-β1-β3-β4-α3-β5 (Figure 7E). Notably, the region extending from α1 through β3 forms an obvious structural motif, the so-called αL motif, named for the two α-helices and the loop between β2 and β3, which forms an L-shaped meander (Figure 7E). This structural motif shows a high degree of conservation between different families within the S4 (PF01479) superfamily and may be important for interaction with RNA. The surface region of the αL motif of HP1423 has a strong concentration of positive charge, and the loop between β4 and α3 exposes another positively charged side chain, that of K67, which raises the possibility that HP1423 is an RNA binding protein (Figure 7E). The DALI result also showed that HP1423 is structurally similar to proteins that belong to the S4 superfamily. The S4 superfamily includes the Hsp15 protein (PDB code: 1DM9-B), ribosomal small subunit pseudouridine synthase A (PDB code: 1VIO-A), 30S ribosomal protein S4 (PDB code: 1FJG-D), and so on. All these homologues contain the αL motif. However, the distribution of positively charged residues on the protein surfaces was somewhat different between homologous proteins, suggesting that HP1423 may bind to RNA through the αL motif in a similar but not exactly the same manner as the S4 RNA binding proteins.
Bioinformatics tools have advanced remarkably and provide biologists with valuable information for functional elucidation. Nevertheless, predicting protein function from sequence and structure remains difficult, because homologous proteins often have different functions. In addition, a protein that is assigned to a known protein family on the basis of sequence homology often shows functional ambiguity, since the composition of its operon can be quite different from that of the known system. Furthermore, some proteins that are considered well characterized may have functions beyond their annotated one. In this regard, it is still worth investigating the cellular and biological functions of known proteins from a newly sequenced genome. Here, we present two examples of well-defined proteins whose characteristics differ from those of their homologues.
Copper metabolism by copper chaperones has been studied extensively in both eukaryotes and bacteria. In the gram-positive bacterium Enterococcus hirae, the cop operon encodes four proteins. Two are integral membrane P-type ATPases: CopA, which transports Cu(I) into cells under Cu(I)-limiting conditions, and CopB, which eliminates Cu(I) when Cu(I) levels are high [103,104]. The imported copper ions are transferred from CopA to the CopZ chaperone [105–107], and CopY, a gene repressor, is released from the cop operon promoter when CopZ delivers Cu(I) to it (Figure 8A). In the gram-negative bacterium H. pylori, copper homeostasis seems to be maintained by only two proteins, CopA and CopP (HP1073). The H. pylori cop operon (Figure 8A) is part of a novel stress-responsive operon (sro), which encodes the flagellar motor switch protein CheY, the putative methyltransferase Hsm, the cell division protein FtsH, the putative phosphatidyltransferase Ptr, the heavy metal-binding proteins CopA and CopP, and an open reading frame of unknown function. CopA is a member of the bacterial copper ion ATPase family, and CopP, which is homologous to E. hirae CopZ, is a putative copper-binding regulatory protein of 66 amino acids [104,108]. H. pylori CopA was identified as a Cu(II) export ATPase, which indicates that its biological role is closer to that of E. hirae CopB than to that of E. hirae CopA. Moreover, the CopP gene resides immediately downstream of the CopA gene, whereas the E. hirae CopZ gene resides upstream of the CopA gene. Therefore, the organization of the cop operon appears to have been modified evolutionarily in each bacterium.
Generally, CopZ proteins share a conserved βαββαβ fold with a similar metal-binding region. Interestingly, HpCopP adopts a βαββα fold that lacks the C-terminal β-strand. The overall topologies of the secondary structural elements are very similar between the CopZ proteins and HpCopP, although some variations appear in the loop regions (Figure 8). The relationship between this unusual fold and copper specificity has been evaluated. We showed that HpCopP is not suited to Cu(II) binding, since the stability of the fold decreased in the presence of Cu(II) ions, suggesting that the structure of HpCopP is optimized for the transfer of toxic Cu(I). The absence of the C-terminal β-strand may decrease the conformational stability of loop I, which includes the CXXC motif (the Cu-binding motif), and this probably contributes to disulfide bond formation between the two cysteine residues in the presence of Cu(II) ions. These findings should be helpful in evaluating the copper metabolism involving HpCopA and HpCopP in H. pylori.
The acyl carrier protein (ACP) found in bacteria is a monofunctional protein, that is, a component of the type II fatty acid biosynthesis system. All ACPs are modified by acyl carrier protein synthase (ACPS), which attaches a 4′-phosphopantetheine prosthetic group to the highly conserved Ser 36; fatty acids are then covalently attached to this prosthetic group as thioesters. Fatty acid binding has little influence on ACP conformation under physiological conditions, but it stabilizes ACP against denaturation at alkaline pH.
H. pylori ACP (HP0559) is composed of 78 amino acids with a pI value of 3.9, and its primary structure is similar to those of homologous ACPs. Like other ACPs, HpACP forms a helical bundle through hydrophobic contacts between the helices (Figure 9). However, we found that HpACP behaves unusually at neutral pH: it exists in a partially unfolded state, which is a characteristic unique to HpACP (Figure 9). In contrast, the overall helical structure of E. coli ACP was maintained at pH 7, and Vibrio harveyi ACP exhibited a random coil-like conformation at pH 7.
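A low pI such as the 3.9 quoted for HpACP means that the protein carries a substantial net negative charge at neutral pH. The sketch below illustrates how a pI can be estimated from sequence with the Henderson-Hasselbalch equation. It rests on several assumptions: the pKa values are approximate textbook-style values, `TOY_SEQUENCE` is a made-up acidic peptide and not the HP0559 sequence, and the function names are hypothetical. It is not the method used to obtain the value reported in the text.

```python
# Approximate pKa values (assumed, textbook-style; real values depend on context).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 3.1}

# Hypothetical acidic peptide used only for illustration (NOT the HP0559 sequence).
TOY_SEQUENCE = "MDEEIEKLVDEAGLDLESDLEK"


def net_charge(sequence, ph):
    """Estimate the net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 0.0
    # Basic groups carry +1 when protonated.
    positive_counts = {aa: sequence.count(aa) for aa in "KRH"}
    positive_counts["N_TERM"] = 1
    for group, count in positive_counts.items():
        charge += count / (1.0 + 10 ** (ph - PKA_POSITIVE[group]))
    # Acidic groups carry -1 when deprotonated.
    negative_counts = {aa: sequence.count(aa) for aa in "DECY"}
    negative_counts["C_TERM"] = 1
    for group, count in negative_counts.items():
        charge -= count / (1.0 + 10 ** (PKA_NEGATIVE[group] - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Find the pH of zero net charge by bisection (charge decreases with pH)."""
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


toy_pi = isoelectric_point(TOY_SEQUENCE)
print(f"estimated pI of toy peptide: {toy_pi:.2f}")
print(f"net charge at pH 7: {net_charge(TOY_SEQUENCE, 7.0):.2f}")
```

For an acidic sequence of this kind the estimated pI falls well below 7, so the net charge at neutral pH is negative, which is the situation described for HpACP.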
The pH-dependent conformational change of a protein from H. pylori is a very interesting feature, considering the low pH of the stomach environment. A few studies have shown a relationship between mutations of particular residues and pH-dependent structural stability. The mutation of Val 43 to Ile in E. coli ACP increases its stability against pH-induced expansion in electrophoretic systems and concomitantly induces more compact folding. The F50A and I54A mutants of V. harveyi ACP are incapable of adopting a native conformation and show an increased hydrodynamic radius at neutral pH. In addition, a few basic residues scattered near the N- and C-termini, for example His 75 of E. coli ACP, are necessary for ACP to maintain a native conformation at neutral pH. Through our structural analysis, we found that several hydrophilic residues (Glu 47, Asn 75, and Lys 76) play an important role in structural stability. We therefore suggest that, unlike other ACPs, the helical bundle of H. pylori ACP is maintained not only by hydrophobic interactions but also by hydrophilic interactions, and that these interactions may be weakened as the pH rises because the exchange rate of the protons attached to the side chains of Asn and Lys may increase.
Large-scale genomic sequencing has been yielding many protein sequences that cannot be annotated, and structural genomics projects are yielding many protein structures of unknown function. Unknown proteins represent up to about half of the proteins in prokaryotic genomes, and considerably more in higher plants and animals. In bacteria such as H. pylori, 30–40% of the encoded proteins typically have no clearly known function. Thus, a major challenge for genomic studies is to narrow the gap between the wealth of sequences (and/or structures) and their functional characterization, since the required experimental investigation is costly and time-consuming. Indeed, only 54% of E. coli gene products have been experimentally investigated so far. Therefore, more robust bioinformatic methods or approaches may be necessary to overcome this situation. Here, we have shown several successful examples of elucidating the function of unknown H. pylori proteins from their structural information, which supports the potential of structural comparison for functional identification. It is hoped that structural comparison can at least act as a guide to possible function, even though not every structure can reveal the actual function.
This study was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (Grant numbers 20110001207 and 2012R1A2A1A01003569). This study was also supported by a grant from the Korea Healthcare Technology R&D Project, Ministry for Health, Welfare & Family Affairs, Republic of Korea (Grant number: A092006). This research was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011-0011603).