This study reports structural modeling and molecular dynamics profiling of hypothetical proteins in the Chlamydia abortus genome database.
The hypothetical protein sequences were extracted from C. abortus LLG Genome Database for functional elucidation using in silico methods.
Fifty-one proteins with roles in defense, binding and transport of other biomolecules were identified. Forty-five proteins were found to be nonhomologous to proteins present in hosts infected by C. abortus. Of these, 31 proteins were related to virulence. Structural modeling of two proteins, WP_006344020.1 (phosphorylase) and WP_006344325.1 (chlamydial protease/proteasome-like activity factor), was accomplished. The conserved active sites necessary for the catalytic function were analyzed.
The proteins identified here are envisioned as possible targets for developing drugs to curtail chlamydial infections; however, they should be validated by molecular biological methods.
Chlamydia abortus is an important amphixenotic, nonmotile, Gram-negative, obligate intracellular pathogen [1,2]. The pathogen causes enzootic abortion, vesiculitis, orchitis and epididymitis in cattle. It also causes conjunctivitis, health pathologies and abortion in sheep and goats [4–6], yaks, pigs, cats, and stray and companion animals. The pathogen can be zoonotic, spreading infection to humans. Infected people may not have outward symptoms in the early stages, but Chlamydia can create serious health problems such as pelvic inflammatory disease in females. Incidences of psittacosis, primarily caused by C. psittaci, have been noted in women involved in chicken gutting. C. abortus has attracted increasing scientific attention due to its pathogenicity and the severe systemic infection it causes in humans as well as some animals.
The bacterium is assigned to the family Chlamydiaceae, which has been divided into two genera, Chlamydia and Chlamydophila, comprising nine species: three of Chlamydia (C. muridarum, C. suis and C. trachomatis) and six of Chlamydophila (C. caviae, C. abortus, C. felis, C. pecorum, C. pneumoniae and C. psittaci). As the genus Chlamydophila is not widely accepted by mainstream research groups, researchers have recommended reunifying the genera Chlamydia and Chlamydophila into a single genus, Chlamydia, within the family Chlamydiaceae [14,15].
Bacterial genome databases contain proteins classified as hypothetical because their functions have not been confirmed by molecular biological methods. Hypothetical proteins constitute around 20–40% of the total proteome and are of interest to structural biologists. Their functions can be predicted by domain homology searches with various confidence levels. As the wet lab methods generally used for characterizing genes and proteins are expensive and time-consuming, in silico methods have emerged as important tools to predict or identify hypothetical genes and proteins. The genome of C. abortus LLG has also been sequenced and contains several hypothetical proteins. This study reports functional analysis of unrecognized proteins from the genome data of C. abortus strain LLG. To the best of our knowledge, it is the first comprehensive description of unrecognized proteins in the species.
The C. abortus LLG genome, with reference number NZ_CM001168.1 at the NCBI Genome database, served as the data source. The sequences of unrecognized proteins were extracted from the Genome Database at NCBI for functional inferences using in silico methods. Functional signature sequences of proteins were identified by Web-based tools, namely NCBI-CDD, INTERPROSCAN and support vector machine (SVM)-Prot. For predicting the family and superfamily of the proteins, protein families (Pfam) and structural classification of proteins (SCOP)-superfamily were used. The conserved signature sequences and protein families were identified by these programs.
The theoretical isoelectric point (pI) and molecular weight were computed with the Compute pI/molecular weight tool. GRAVY CALCULATOR was used for calculating the grand average of hydropathicity. Additionally, the properties of the sequences, including aromatic and aliphatic, basic and acidic character along with the average number of polar and nonpolar amino acids, were determined using EMBOSS Pepstats. To predict transmembrane helices in the proteins with predicted signature sequences, TMHMM, HMMTOP and SOSUI-GramN were used, while subcellular localization of proteins was determined by CELLO [29,30]. SignalP 4.1 was used to determine signal peptide cleavage sites.
Once signature sequences and protein localization were identified, the functional roles of the hypothetical proteins and their gene ontology were predicted using CELLO2GO. This program annotates bacterial proteins at the cellular, biological and molecular levels. To find the role of the hypothetical proteins in various pathways, they were analyzed using the KEGG database.
Basic Local Alignment Search Tool for proteins was used to search the C. abortus hypothetical proteins against the proteins of various hosts, including Ovis aries (sheep; taxid: 9940), Capra aegagrus hircus (goat; taxid: 9925), Bos taurus (bovine; taxid: 9913), Bos grunniens (yak; taxid: 30521), Sus scrofa domesticus (pig; taxid: 9825) and Homo sapiens (human; taxid: 9606), at the NCBI database. Query sequences producing hits with an expectation value below 0.0001 were considered to share homology with host proteins and were excluded from further analysis. Nonhomologous proteins were checked for virulence factors, which are involved in the severity of infection and are envisaged as targets for developing drugs against the pathogen. The virulence factors among nonhomologous proteins were identified by VICMpred and BTXpred [38,39]. Both methods predict virulence factors from protein sequence using SVM.
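The homology filter described above can be sketched as follows. This is only an illustration, assuming the search results were saved in the standard BLAST tabular layout (`-outfmt 6`), in which column 1 is the query identifier and column 11 is the expectation value. The function name and the example identifiers are hypothetical; only the 0.0001 cutoff comes from the text.

```python
E_VALUE_CUTOFF = 1e-4  # expectation-value threshold stated in the text


def split_by_host_homology(query_ids, blast_tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Partition query proteins into host-homologous and nonhomologous sets.

    Assumes BLAST tabular output (-outfmt 6): column 1 holds the query ID
    and column 11 holds the E-value of the hit.
    """
    homologous = set()
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, e_value = fields[0], float(fields[10])
        if e_value < cutoff:
            homologous.add(query_id)
    nonhomologous = [q for q in query_ids if q not in homologous]
    return sorted(homologous), nonhomologous


# Hypothetical example data, not results from the study.
example_hits = [
    "QUERY_A\tHOST_1\t45.2\t210\t100\t5\t1\t210\t3\t212\t2e-30\t120",
    "QUERY_B\tHOST_2\t28.0\t50\t30\t2\t10\t59\t100\t149\t0.5\t25.0",
]
homologous, nonhomologous = split_by_host_homology(
    ["QUERY_A", "QUERY_B", "QUERY_C"], example_hits
)
print("homologous to host:", homologous)
print("retained as nonhomologous:", nonhomologous)
```

A query with no hit at all, or only hits at or above the cutoff, is retained as nonhomologous, which matches the exclusion rule stated in the text.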
The homology-modeling method Phyre2 was used for predicting the protein structures. The quality of a model depends on the sequence similarity between the target and the template. After prediction of the structures, the models were validated for probable errors using the structure analysis and verification server (SAVES) for molecular stereochemical quality, residue parameters, nonbonded interactions, model compatibility and macromolecular volume [41–44]. The Ramachandran plot was analyzed with PROCHECK at SAVES to determine the residues in the most favored regions.
Homology-modeled proteins were subjected to molecular dynamics (MD) simulations with the GROMACS 5.0 (GROningen MAchine for Chemical Simulation) package using the GROMOS96 53a6 force field. To generate the topology files for the proteins, the command pdb2gmx was used. The proteins were solvated and positioned in a cubic box, keeping a distance of 1.0 nm between the box edges and the protein surface. Particle-mesh Ewald (PME) electrostatics and periodic boundary conditions were applied in all directions. Na+ counter ions were added as necessary to neutralize each system. For all the systems, 50,000 steps of steepest descent energy minimization were performed to avoid high-energy interactions and steric clashes.
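The setup described above might be written down as follows. This is a minimal sketch, not the authors' actual input files: it renders the stated energy-minimization settings as a GROMACS `.mdp` fragment and lists the preparation commands as argument lists without running them. All file names are hypothetical, and any setting not stated in the text (water model, cutoffs, convergence tolerance) is deliberately left out.

```python
def render_mdp(params):
    """Render GROMACS .mdp options as aligned 'key = value' lines."""
    width = max(len(key) for key in params)
    return "".join(f"{key.ljust(width)} = {value}\n" for key, value in params.items())


# Energy-minimization settings taken from the text.
EM_PARAMS = {
    "integrator": "steep",  # steepest descent minimization
    "nsteps": 50000,        # 50,000 minimization steps
    "coulombtype": "PME",   # particle-mesh Ewald electrostatics
    "pbc": "xyz",           # periodic boundaries in all directions
}

# Preparation steps as argument lists; file names are hypothetical.
# pdb2gmx and genion ask interactive questions (water model, group to replace)
# because the text does not state those choices.
PREP_COMMANDS = [
    ["gmx", "pdb2gmx", "-f", "model.pdb", "-o", "model.gro",
     "-p", "topol.top", "-ff", "gromos53a6"],
    ["gmx", "editconf", "-f", "model.gro", "-o", "boxed.gro",
     "-c", "-d", "1.0", "-bt", "cubic"],  # 1.0 nm from protein to box edge
    ["gmx", "solvate", "-cp", "boxed.gro", "-o", "solvated.gro", "-p", "topol.top"],
    ["gmx", "grompp", "-f", "em.mdp", "-c", "solvated.gro",
     "-p", "topol.top", "-o", "ions.tpr"],
    ["gmx", "genion", "-s", "ions.tpr", "-o", "neutral.gro",
     "-p", "topol.top", "-pname", "NA", "-neutral"],  # Na+ counter ions
]

mdp_text = render_mdp(EM_PARAMS)
print(mdp_text, end="")
for command in PREP_COMMANDS:
    print(" ".join(command))
```

Keeping the commands as data rather than executing them makes the sketch safe to run on a machine without GROMACS installed.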
An MD simulation comprises equilibration and production phases. The systems were equilibrated under constant number, volume and temperature (NVT) and constant number, pressure and temperature (NPT) conditions at 300 K for 100 ps. Finally, every system was subjected to an MD production run at 1 bar pressure and 300 K temperature for 20 ns. The atom coordinates were recorded every 10 ps throughout the MD simulation. Table 1 summarizes the various bioinformatics programs used in this study.
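The run length and output interval given above fix the number of integration steps and stored frames once a time step is chosen. The sketch below works this out; the 2 fs time step is an assumption, since the text does not state it, and the dictionary keys mirror the GROMACS `.mdp` options `dt`, `nsteps` and `nstxout` only for readability.

```python
def production_schedule(length_ns, save_every_ps, timestep_fs):
    """Derive step counts for a production run (all arguments are integers)."""
    total_fs = length_ns * 1_000_000      # 1 ns = 1,000,000 fs
    save_every_fs = save_every_ps * 1_000  # 1 ps = 1,000 fs
    if total_fs % timestep_fs or save_every_fs % timestep_fs:
        raise ValueError("run length and output interval must be multiples of the time step")
    nsteps = total_fs // timestep_fs
    nstxout = save_every_fs // timestep_fs
    return {
        "dt": timestep_fs / 1000,          # time step in ps
        "nsteps": nsteps,                  # total integration steps
        "nstxout": nstxout,                # steps between stored coordinates
        "frames": nsteps // nstxout + 1,   # includes the starting frame at t = 0
    }


# 20 ns production run, coordinates every 10 ps (from the text);
# the 2 fs time step is an assumed value.
schedule = production_schedule(length_ns=20, save_every_ps=10, timestep_fs=2)
for name, value in schedule.items():
    print(name, "=", value)
```

With these inputs each trajectory holds about two thousand frames, which is the sampling on which the later stability analyses rest.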
The genome data of C. abortus LLG (NZ_CM001168.1 at the NCBI Genome database) have a total of 936 proteins, of which 198 are termed hypothetical. Of these, 51 proteins were predicted to have conserved domains. The Pfam database revealed the families of the identified conserved domains with their particular functions. A family was assigned to a total of 37 sequences. Additionally, superfamilies of 38 sequences were successfully determined. Identification of signature sequences and protein families enabled us to classify the hypothetical proteins into different categories including binding proteins, outer membrane proteins, enzymes, defense, secretory and signaling molecules. The supporting data are shown in Supplementary Tables 1, 2 & 3. Additionally, re-analysis of the whole proteome (including pre-characterized proteins) was done, and no discrepancy between the official annotation and the analysis by our methodology was observed (summarized in Supplementary Table 1).
Out of 51 hypothetical proteins, 13 were predicted to have transmembrane helices on the basis of TMHMM, HMMTOP and SOSUI-GramN. Furthermore, concerning subcellular localization, 32 proteins were predicted to be cytoplasmic, three extracellular, five inner-membrane, eight outer-membrane and three periplasmic. Using SignalP 4.1, a total of seven proteins with cleavage sites were detected, of which only one was predicted to be transmembrane. The predicted proteins and their locations are summarized in Supplementary Table 3.
The sequence analysis for prediction of the functions showed a total of 51 proteins with conserved domain, gene IDs and some specific functions (Table 2). Of these, a total of 16 proteins were found to have theoretical pI equal to or more than 7, and 35 proteins had theoretical pI less than 7.
A higher aliphatic index is regarded as a positive factor for the stability of globular proteins over a wide range of temperatures, particularly at high temperature. A very low GRAVY index is considered favorable for interaction of a protein with water molecules; the index was calculated with the GRAVY calculator as the sum of the hydropathy values of all amino acids divided by the total length of the protein. Aliphatic and aromatic properties, the average number of polar and nonpolar amino acids, and the basic and acidic nature of all protein sequences were determined as shown in Supplementary Table 2.
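The two indices discussed above are simple functions of amino acid composition and can be sketched as below. The GRAVY function follows the definition in the text using the Kyte-Doolittle hydropathy scale; the aliphatic index uses the commonly cited Ikai formula, which the text does not spell out and is assumed here. The example sequence is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: summed hydropathy divided by length."""
    sequence = sequence.upper()
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


def aliphatic_index(sequence):
    """Aliphatic index (assumed Ikai formula) from mole percentages.

    AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)),
    where X is the mole percent of each residue.
    """
    sequence = sequence.upper()

    def mole_percent(residue):
        return 100.0 * sequence.count(residue) / len(sequence)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


example = "MKVLAAGIVR"  # hypothetical sequence, not from the study
print("GRAVY:", round(gravy(example), 3))
print("Aliphatic index:", round(aliphatic_index(example), 1))
```

A negative GRAVY value indicates an overall hydrophilic sequence, which is the sense in which a low index is described above as favorable for interaction with water.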
The proteins were classified by biological process, molecular function and cellular component on the basis of Gene Ontology annotations. At the biological level, 20 processes were identified. Most of the proteins were found to be involved in small-molecule metabolic processes, cellular nitrogen compound processes and biosynthetic processes. Twenty different molecular functions were identified, of which the largest cluster was involved in ion-binding, followed by hydrolase and ligase activities. Among cellular components, the cell category had the highest number of proteins, followed by plasma membrane and intracellular components (Figure 1). In addition, the pathway analysis showed that four proteins, namely WP_006344048.1, WP_006343878.1, WP_006343995.1 and WP_035395294.1, were involved in regulation of metabolism, biosynthesis and defense mechanisms. None of these four proteins shared homology with any of the host proteins. Three proteins (WP_006343878.1, WP_006343995.1 and WP_035395294.1) were involved in virulence.
The nonhomologous proteins were inferred by a homology search of pathogen protein sequences against host proteins. Selecting proteins that share sequence similarity between the pathogen and the host is not desirable. Hence, a Basic Local Alignment Search Tool for proteins search of the C. abortus hypothetical proteins against the proteomes of the hosts, namely Ovis aries (sheep; taxid: 9940), Capra aegagrus hircus (goat; taxid: 9925), Bos taurus (bovine; taxid: 9913), Bos grunniens (yak; taxid: 30521), Sus scrofa domesticus (pig; taxid: 9825) and Homo sapiens (human; taxid: 9606), was carried out. As a result, only six hypothetical proteins of C. abortus showed homology with proteins present in host species (Table 3). These proteins were rejected, as targeting them in C. abortus may lead to cross-reactivity, auto-immune reactions and cytotoxicity in the host.
Out of 45 proteins, structural modeling of two proteins, WP_006344020.1 and WP_006344325.1, was successfully accomplished using comparative structural methods after selecting suitable templates, namely 4QAS and 3DJA, respectively. The amino acid residues involved in the enzymatic activity of WP_006344020.1 (phosphorylase) and WP_006344325.1 (chlamydial protease/proteasome-like activity factor [CPAF]) are shown as sticks (red color; Figure 2). To check the stability, compactness and structural behavior of the modeled proteins, energy minimization and MD simulation were carried out. The modeled proteins were found to be stable, as revealed by MD parameters including the radius of gyration, the solvent accessible surface area and the root-mean-square fluctuations (RMSF; Figure 3). Validation of both models by checking the stereochemical quality of the protein structures through Ramachandran plot analysis using PROCHECK showed that 93% of residues were in the allowed region in WP_006344020.1, and 95% of residues were in the allowed region in WP_006344325.1 (Supplementary Figures 1 & 2).
The modeled structures of the phosphorylase and CPAF homologs were compared with their orthologs in C. trachomatis. They were found to have similar structures, as validated by superimposition of the modeled structures and the predicted orthologs (Supplementary Figure 3).
Chlamydia is an important pathogen with potential risks for humans and livestock species including poultry and other birds [11,48]. Studies have shown that C. pecorum is an endemic intestinal pathogen in cattle, whereas other species like C. pneumoniae, C. psittaci and C. gallinacea were involved in systemic (uterine, blood and milk) infections.
Whole-genome sequence analysis of C. abortus has revealed highly variable protein families, including transmembrane head/inc and polymorphic membrane proteins and secretion systems . It is important to investigate hypothetical proteins for deciphering their functions in pathogenic microorganisms to identify suitable drug targets. Predicting the functions of hypothetical proteins through bioinformatics methods is a quicker and preliminary approach, whereas the classical wet lab approaches such as enrichment and isolation, gene cloning and understanding functions at system level are expensive and protracted [50,51]. Through in silico approaches, the genes or proteins can be predicted initially and validated afterward by molecular biological methods.
Different Chlamydia species have already been sequenced [9,16,48,52]. Structural analysis of proteins can strengthen their predicted functions. By using homology-modeling methods, we could predict the structures of two hypothetical proteins. Functional residues found in the structures were analyzed by structural alignments to validate the functions predicted during sequence analysis. Protein sequences with hits at an expectation value less than 0.0001 were considered homologous proteins. Virulence factors among the 45 nonhomologous hypothetical proteins of C. abortus were identified as potential targets for drug discovery, as also reported in earlier studies.
Identification of signature sequences and protein families has enabled the researchers to classify hypothetical proteins into different categories such as binding proteins, transmembrane, outer membrane proteins, enzymes, defense, secretory and signaling proteins. There were 14 hypothetical proteins in present study with their possible roles in nucleic acid-binding, and metal-binding processes. Some proteins were predicted for their possible role in DNA replication and recovery in case of DNA damage or genotoxicity. In addition, the proteins were also found to act as ribosomal proteins S3, transcription factors (e.g., NusA_K), and post-transcriptional modifiers of mRNA in stressed environments.
Outer membrane proteins play a vital role in the defense of Gram-negative bacteria under stressed situations. Five outer membrane proteins were predicted based on their pattern of alternating hydrophobic amino acids, similar to porins present in the outer membrane of bacteria.
A total of nine hypothetical proteins were predicted as enzymes, such as transferases, ATP synthases and GTPases. Secretory proteins (e.g., hormones, enzymes, antimicrobial peptides and toxins) are actively transported across the cell membrane. A total of five proteins were recognized as secretory proteins, one belonging to the CesT family and four to the type-III secretory system. The type-III secretion system is used by bacteria to deliver their effector proteins into host cells, which leads to modulation of host cellular functions [53,54]. Signal proteins have an important role in several mechanisms including growth and immune systems. Almost all processes, such as membrane fusion, transportation of ions, enzymatic activities and defense systems, depend on signaling mechanisms. In addition, a total of ten proteins related to cellular signaling were identified. Earlier studies have speculated that cellular signaling proteins could be targeted for developing novel therapeutic interventions against pathogenic C. abortus.
A total of four proteins (WP_006344048.1, WP_006343878.1, WP_006343995.1 and WP_035395294.1) with various functions were identified. The protein WP_006344048.1 was identified as the delta subunit of DNA polymerase III, with KEGG-id K02340 (EC: 2.7.7.7). The delta subunit, also known as HolA, is required for assembly of the processivity factor β(2) onto primed DNA in the DNA polymerase III holoenzyme-catalyzed reaction. It has a role in purine and pyrimidine metabolism. As part of the holoenzyme, it contributes to DNA-template-directed extension of the 3′-end of a DNA strand by one nucleotide at a time, and hence plays a role in DNA mismatch repair. The protein WP_006343878.1 was annotated as arabinose-5-phosphate isomerase with KEGG-id K06041 (EC: 5.3.1.13). This enzyme has a role in the synthesis of 3-deoxy-D-manno-octulosonate, which is a component of bacterial lipopolysaccharides. Lipopolysaccharide, known as endotoxin, is a constituent of the bacterial outer membrane, and its lipid A moiety is recognized as a pathogen-associated molecule by the host immune cells. The protein WP_006343995.1 was recognized as lipoate-protein ligase A with KEGG-id K03800 (EC: 6.3.1.20). This protein has a role in regulating lipoic acid metabolism and is important for the functioning of key enzymes involved in oxidative metabolism, dehydrogenases and transferases. The protein WP_035395294.1 was identified as a substrate transport protein that acts as an ABC transporter of ferric siderophores and metal ions such as Mn2+, Fe3+, Cu2+ and/or Zn2+, with KEGG-id K11707. The ligand-binding site is formed in the interface between two globular domains linked by a single helix. Such proteins may act as efflux pumps and be helpful for bacterial defense mechanisms against drugs.
It is important to develop alternative strategies to cope with pathogenic microorganisms and prevent infections in humans and animals [56–58]. Forty-five nonhomologous proteins exclusively present in C. abortus have been found in this study. Six hypothetical proteins showed homology with proteins present in various hosts. The 45 nonhomologous hypothetical proteins, when subjected to virulence factor analysis, revealed that 31 hypothetical proteins possessed virulence (Table 3). Evidently, a variety of toxins are produced by pathogens to withstand the host immune system. Furthermore, the location of a protein in the cell is important in view of its interaction with other proteins or drug molecules. For instance, cytoplasmic proteins act as promising drug targets while membrane proteins may be used as vaccine targets.
Structures of two proteins were finally predicted using comparative structural methods by selecting suitable templates. Protein WP_006344020.1 was recognized as Phosphorylase superfamily protein, having role in synthesis of quinones (menaquinone or ubiquinone), which are lipid-soluble electron carriers essential for cellular respiration in bacteria . Although certain bacteria, for example, Escherichia coli, utilize different quinones depending on oxygen availability, many Gram-negative and most Gram-positive bacteria rely on menaquinone as the sole electron-carrier system .
This protein includes the side chain of the catalytic triad residue Asp-176 and the main-chain atoms of Phe-83, Gly-85 and Tyr-139. These lie within a highly conserved region of prototypical 5′-methylthioadenosine nucleosidases, in consonance with some earlier reports. Protein WP_006344325.1 was identified as CPAF, which is responsible for degrading host molecules and plays a major role in chlamydial pathogenesis. The structure 3DJA was selected as the template for structure prediction. The highly conserved residues Ser499 and His105 in CPAF, which are used for catalytic activity, superimpose well with His97 and Ser488. Auto-inhibition of CPAF, on the other hand, is mediated by an internal inhibitory segment. Recognition of this peptide by the active site is dominated by two regions of contacts. One primarily involves hydrophobic contacts of the N-terminal rigid coil against the core domain of CPAF. Met264 fits into the hydrophobic pocket formed by Val378, Cys500, Gly525 and Phe527 in the template; similarly, in protein WP_006344325.1, Met252 fits into the pocket formed by Val367, Cys489, Gly514 and Phe516. Additionally, a pair of main-chain hydrogen bonds is formed between Met252 and Gly367. The van der Waals contacts of three bulky residues, Phe256, Trp257 and Tyr264 in the template, similar to Phe268, Trp269 and Tyr276 in CPAF, with their respective neighboring residues dominate the interactions of the other binding region.
For a protein to be biologically active, it should be stable in a unique globular conformation, or native state. Factors such as amino acid sequence, folding, host cell strain, post-translational modifications, and expression and purification conditions determine protein stability. In this study, all the modeled proteins showed consistent structural behavior during their MD analysis. There were no abrupt fluctuations in the root-mean-square deviations and radius of gyration over the time evolution of the MD trajectories. Moreover, the protein structures became more compact toward the end of the MD run. The RMSF analysis showed little variation across most of the amino acid residues. The solvent accessible area remained steady, which served as evidence that the predicted proteins would be stable in the aqueous phase. The proteins did not show unusual folding/unfolding patterns during the simulations, indicating that they were stable. Taken together, the radius of gyration, solvent accessible area and RMSF analyses validated the stability of the proteins.
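Two of the stability measures named above have compact definitions, sketched here in plain Python. This is an illustration of the quantities only, not the analysis pipeline used in the study (GROMACS provides its own tools for this, such as `gmx gyrate` and `gmx rmsf`). The sketch assumes coordinates are given as (x, y, z) tuples and that trajectory frames have already been superimposed on a reference, so that overall rotation and translation do not inflate the fluctuations. The toy coordinates are hypothetical.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of one frame."""
    if masses is None:
        masses = [1.0] * len(coords)  # equal weights if no masses are given
    total_mass = sum(masses)
    center = [sum(m * c[i] for m, c in zip(masses, coords)) / total_mass
              for i in range(3)]
    weighted_sq = sum(m * sum((c[i] - center[i]) ** 2 for i in range(3))
                      for m, c in zip(masses, coords))
    return math.sqrt(weighted_sq / total_mass)


def rmsf(frames):
    """Per-atom root-mean-square fluctuation about the time-averaged position."""
    n_frames = len(frames)
    n_atoms = len(frames[0])
    values = []
    for atom in range(n_atoms):
        mean = [sum(frame[atom][i] for frame in frames) / n_frames
                for i in range(3)]
        mean_sq_dev = sum(sum((frame[atom][i] - mean[i]) ** 2 for i in range(3))
                          for frame in frames) / n_frames
        values.append(math.sqrt(mean_sq_dev))
    return values


# Hypothetical two-atom, two-frame toy trajectory (nm).
toy_frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print("Rg per frame:", [radius_of_gyration(frame) for frame in toy_frames])
print("RMSF per atom:", rmsf(toy_frames))
```

A radius of gyration that stays flat over time indicates a consistently compact fold, while low per-residue RMSF indicates limited local flexibility; both readings correspond to the stability argument made above.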
Hypothetical protein WP_006344325.1 was identified as the Chlamydia protease CPAF, which is unique to Chlamydia, degrades host molecules and helps the bacterium adapt to the host environment. Earlier, this protein was reported as a virulence factor in Chlamydia pathogenesis. Our study corroborated this, and no homology was observed with any of the proteins in the different hosts. Some proteins were previously annotated as hypothetical in Chlamydia, for example, CT398 (cdsZ). Later, experimental work provided evidence that the protein CT398 (cdsZ) interacted with the σ(54) (RpoN)-holoenzyme and the type-III secretion export apparatus in C. trachomatis. Hypothetical protein CT263, with structural similarity to 5′-methylthioadenosine nucleosidase enzymes, supports the evidence that menaquinone synthesis in Chlamydiaceae is mediated through the futalosine pathway. A chromosomally encoded hypothetical protein, TC0668, was found to serve as an important genitourinary pathogenicity factor in C. muridarum.
Although hypothetical proteins are sometimes regarded as pseudogene products, the hypothetical protein database holds a considerable amount of information. The analysis of unannotated data of Chlamydia has provided valuable inferences. The domains and virulence factors were identified, and the tertiary structures of the proteins were validated by energy minimization and MD simulation. The in silico inferences could assist in computer-aided design of drugs or vaccines to curtail C. abortus.
In future, the clinical microbiology and infection epidemiology will depend much on rapid molecular testing of pathogens and reliable microbiological diagnostics. Clinicians and medical practitioners will have to process and interpret the data obtained from sequencing technologies that are wholly different from conventional microbiological methods.
Chlamydia is an important infectious intracellular pathogen of humans and animals. Asymptomatic or paucisymptomatic infection that remains undetected, and therefore untreated for a prolonged duration, can lead to health problems including pelvic inflammatory disease, miscarriage in pregnant women and infertility. Many aspects of chlamydiology such as epidemiology, taxonomy and evolution, biodiversity, diagnosis and treatments are of concern to clinicians and readers.
The abundance of some Chlamydia species in poultry and exotic avian species reveals the epidemiological importance of wild birds as potential reservoirs of the pathogen. Persons such as poultry breeders, veterinarians and women involved in poultry gutting should take strict preventive measures against Chlamydia infections. Chlamydial infections are manageable in humans, but endemic in animals due to nonavailability of timely and effective antichlamydial treatments. Although hypothetical proteins are not linked to documented genes, their annotation may reveal novel biomolecules and biochemical pathways. The proteins investigated in the study are conserved in various Chlamydia species. Notably, bioinformatics, computational biology and chemical engineering could lead to the invention of novel metabolites with therapeutic potential from microorganisms. The hypothetical proteins reported herein are envisioned to be novel targets for developing drugs to curtail chlamydial infections. However, the inferences should be validated by standard molecular biological methods.
The authors are thankful to the Central University of Himachal Pradesh, and the Director, CSIR-IHBT, Palampur for providing necessary facilities.
V Singh, M Kumar and F Marotta were involved in concept, mining and annotation of hypothetical proteins and their implications. G Singh, D Sharma and J Rani predicted the protein structures and analyzed the data. B Singh and G Mal were involved in consortium planning, fund raising and manuscript writing.
Financial & competing interests disclosure
The financial support from SERB-DST to perform the work is duly acknowledged. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.
This work is licensed under the Creative Commons Attribution 4.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/