In silico exoproteome prediction schema
As shown in our proposed prediction schema (Figure ), the software SurfG+ (Surface Gram positive), specially configured for Gram-positive bacteria, is responsible for most of the sub-cellular classifications, which are cytoplasmic (CYT), membrane (MEM), secreted (SEC) and potentially surface exposed (PSE) (Figure ). Figure represents the prediction schema using SurfG+ and three additional programs, TatP 1.0 [20], SecretomeP 2.0 [21] and NClassG+ [22], which are specialized in non-classical secretion prediction. SurfG+ incorporates the SignalP 3.0 predictor, responsible for the identification of classical putative secreted proteins, that is, proteins exported by the SEC pathway [23].
C. pseudotuberculosis pan genomic prediction schema. Software used, identified sub-cellular compartments and flow scheme to create the final pan genomic data sets.
Figure 2 Predicted gene quantities by sub-cellular compartment from full C. pseudotuberculosis genomes. Classification of more than 10,000 distinct genes from the five different C. pseudotuberculosis strains in the four sub-cellular categories: cytoplasmic (CYT), ...
The results obtained after running SurfG+, TatP, SecretomeP and NClassG+ gave rise to two gene data sets, labeled SEC and PSE, which correspond to the C. pseudotuberculosis ISPPE. These ISPPE data sets are composed of putative proteins present fivefold (5x), fourfold (4x), threefold (3x), twofold (2x) or onefold (1x), where fivefold means that a gene was predicted in all five strains, fourfold means that a gene was predicted in four strains, and so on. The fold of a gene was obtained from reciprocal BLAST results, as described in the methods section. Since not all predicted genes are named, it was necessary to create a pan genome identifier, here called the pan locus, to name each unique gene fold. The pan locus is unique within a pan genome and is shared by all homologous genes. For example, when a putative exported protein was found in all five strains, each gene copy received the same pan locus to facilitate further data processing and identification. It was then necessary to confirm these results by systematic manual curation of each gene using the ACT tool from the Artemis software package [24]. Once this manual curation was completed, it was possible to assess the correctness of each BLAST result and, as a consequence, to identify, for instance, that a gene formerly classified as 1x was in fact 5x, because the other four gene copies had been annotated as starting beyond the signal peptide motif. After correction of the initial methionine, and also taking homologous genes into account, a new prediction step indicated all remaining putative proteins to be exported, composing the core ISPPE. However, gene start positions incorporating a less probable signal peptide motif were also observed. In general, genes formerly predicted as Nx proved to be correct upon manual curation, as the remaining (5-N)x genes were predicted as cytoplasmic, PSE or pseudogenes. These results are particularly interesting because they compose the dispensable and unique ISPPE data sets. The genome annotation corrections resulting from these analyses were incorporated into the official annotation of the five C. pseudotuberculosis strains deposited at GenBank in August 2011. These genomes are also available in additional file 1, as EMBL files.
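The pan locus assignment described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in this work: the function names `assign_pan_loci` and `fold`, the `plc` prefix with a running number, the toy locus tags, and the grouping of genes by chains of reciprocal best hits are all assumptions made for the example.

```python
from collections import defaultdict


def assign_pan_loci(genes, reciprocal_hits, prefix="plc"):
    """Group genes linked by reciprocal best hits and label each group.

    Assumption: two genes belong to the same pan locus whenever they are
    connected, directly or indirectly, by a reciprocal best hit.
    """
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the group representative (path halving).
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in reciprocal_hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for gene in genes:
        groups[find(gene)].append(gene)

    # Number the groups in a reproducible order (by their sorted members).
    ordered = sorted(sorted(members) for members in groups.values())
    return {f"{prefix}{i:03d}": members for i, members in enumerate(ordered, start=1)}


def fold(members, strain_of):
    """Number of distinct strains represented in a pan locus (1x to 5x)."""
    return len({strain_of[gene] for gene in members})


# Toy data: hypothetical locus tags, one letter per strain.
genes = ["A_1", "B_1", "C_1", "D_1", "E_1", "A_2"]
strain_of = {gene: gene.split("_")[0] for gene in genes}
hits = [("A_1", "B_1"), ("B_1", "C_1"), ("C_1", "D_1"), ("D_1", "E_1")]

pan_loci = assign_pan_loci(genes, hits)
folds = {name: fold(members, strain_of) for name, members in pan_loci.items()}
print(folds)
```

In this toy input the first group spans all five strains (5x) and the second gene is found in a single strain (1x). Manual curation, as described in the text, would still be needed to catch copies missed because of a wrongly annotated start position.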
Classical and non-classical secreted putative proteins
Figure exhibits the in silico predicted pan secretome results for C. pseudotuberculosis, which comprise 150 genes, out of 377 from the whole ISPPE, representing 750 locus_tags in the five studied C. pseudotuberculosis strains. However, not all of these 750 locus_tags were predicted as secreted. If at least one gene copy within a specific pan locus was not predicted as secreted, it still received the same pan locus, but that pan locus was not classified as part of the predicted core secretome. There are 122 genes composing the predicted core secretome (5x), followed by 25 genes constituting the predicted dispensable secretome (4x, 3x and 2x) and just 3 genes in the predicted unique secretome (1x). These results were obtained by applying the prediction schema from Figure ; however, the predictors contributed differently, as shown in Figure .
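The rule for placing a pan locus in the core, dispensable or unique secretome can be illustrated with a short sketch. The function name `secretome_category`, the label strings and the example inputs are hypothetical; the sketch only assumes that each pan locus carries one sub-cellular label per strain and that the category depends on how many copies are predicted as secreted.

```python
def secretome_category(labels, target="SEC", n_strains=5):
    """Classify one pan locus from the labels of its gene copies.

    labels: one sub-cellular label per strain, e.g. "SEC", "PSE", "CYT",
    "MEM", "PSEUDO" or "ABSENT" (label names are illustrative).
    """
    n_target = sum(1 for label in labels if label == target)
    if n_target == n_strains:
        return "core"          # 5x: predicted in every strain
    if n_target >= 2:
        return "dispensable"   # 4x, 3x or 2x
    if n_target == 1:
        return "unique"        # 1x
    return None                # no copy carries the target label


print(secretome_category(["SEC"] * 5))
print(secretome_category(["SEC", "SEC", "SEC", "PSE", "CYT"]))
print(secretome_category(["SEC", "ABSENT", "ABSENT", "ABSENT", "ABSENT"]))
```

A single copy with a different label is enough to move a pan locus out of the core set, which is why all five locus_tags of a dispensable pan locus are counted even though not all of them are predicted as secreted.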
Predicted C. pseudotuberculosis pan secretome. Predictions for 150 genes from strains 1002, C231, I19, FRC41 and PAT10 made by SurfG+ 1.0, TatP 1.0 Server and SecretomeP 2.0 Server.
Predicted C. pseudotuberculosis pan secretome by predictor software. Predicted secreted genes coverage in the predicted pan secretome of the five bacterial strains separated by predictor software SurfG+, TatP and SecretomeP.
SurfG+ predicted 104 genes, of which 85, 18 and 1 correspond to the predicted core, dispensable and unique secretome, respectively. TatP predicted 25 genes, of which 17, 7 and 1 correspond to the predicted core, dispensable and unique secretome, respectively. Finally, SecretomeP and NClassG+ predicted 21 genes, of which 20 and 1 correspond to the predicted core and unique secretome, respectively. The largest portion comes from SurfG+, as it predicts putative proteins possibly secreted by the SEC pathway. Nevertheless, a considerable portion of genes (~31%), considering only the predicted core secretome, comes from the non-classical secretion predictors and cannot be ignored when searching for vaccine candidates.
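The per-predictor counts can be cross-checked against the pan secretome totals with a few lines of arithmetic. The numbers below are those reported in the text; the variable names are illustrative, and grouping TatP with SecretomeP and NClassG+ as "not SurfG+" is an assumption about how the ~31% figure was obtained.

```python
# Genes per predictor as reported in the text:
# (core 5x, dispensable 2x to 4x, unique 1x)
contributions = {
    "SurfG+": (85, 18, 1),
    "TatP": (17, 7, 1),
    "SecretomeP/NClassG+": (20, 0, 1),
}

per_predictor = {name: sum(counts) for name, counts in contributions.items()}
core, dispensable, unique = (sum(column) for column in zip(*contributions.values()))
pan_secretome = core + dispensable + unique

# Share of genes that did not come from SurfG+, over two denominators.
other_in_core = core - contributions["SurfG+"][0]
other_in_pan = pan_secretome - per_predictor["SurfG+"]
share_core = 100 * other_in_core / core
share_pan = 100 * other_in_pan / pan_secretome

print(per_predictor)
print(core, dispensable, unique, pan_secretome)
print(f"{share_core:.1f}% of core, {share_pan:.1f}% of pan secretome")
```

The column sums reproduce the 122, 25 and 3 genes of the core, dispensable and unique secretome, and the share of genes not contributed by SurfG+ is roughly 30 to 31% whether it is computed over the core or over the whole pan secretome.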
The dispensable and unique C. pseudotuberculosis predicted secretomes contain 58 locus_tags (~8%) that were not predicted as secreted. Putative proteins predicted as CYT, PSE and putative frame shifts (pseudogenes) account for 22, 24 and 10 locus_tags, respectively. In the dispensable and unique C. pseudotuberculosis in silico predicted secretomes, the number of genes identified as integral membrane proteins or absent from a genome is negligible. Nevertheless, the manual curation step ruled out annotation errors in these predictions, which supports the hypothesis that these differences could be due to environmental adaptations. A table containing the complete list of C. pseudotuberculosis secreted proteins is available in additional file 2.
Potentially surface exposed (PSE) putative proteins
The SurfG+ software was calibrated with the cell wall thickness of each C. pseudotuberculosis strain. Figure shows 184 genes, out of 377 from the whole ISPPE, comprising the predicted core surfaceome (5x), 34 genes composing the predicted dispensable surfaceome (4x, 3x and 2x) and just 9 genes in the predicted unique surfaceome (1x). These 227 genes account for 1,135 locus_tags in all five strains. In this set, homologous genes within a pan locus do not always share the same sub-cellular prediction. Genes predicted as MEM, CYT, SEC and putative pseudogenes account for 29, 23, 20 and 17 distinct locus_tags, respectively. Genes predicted as MEM (~3%) compose the second major group. This could be explained by the fact that membrane proteins already contain a hydrophobic extension and could be more prone to expose or hide parts of the protein from the extracellular milieu. However, the same reasoning does not explain the third major group of locus_tags with a surfaceome pan locus, which corresponds to proteins predicted as secreted. These 20 locus_tags, which were predicted as secreted but also received a surfaceome pan locus, raise a question: do they fit the SEC or the PSE label? There is no simple way to determine their sub-cellular compartment by software, since some locus_tags were predicted as PSE and received a surfaceome pan locus, while others were predicted as SEC and also received a secretome pan locus. Ten pan loci (plcppse193, plcppse194, plcppse205, plcppse218, plcppse226, plcpsec096, plcpsec097, plcpsec098, plcpsec100, plcpsec101) face this question, as some genes appear in both the predicted secretome and surfaceome.
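Pan loci whose gene copies carry conflicting SEC and PSE predictions can be flagged automatically before manual inspection. The following is a minimal sketch under stated assumptions: the function name `label_conflicts`, the pan locus names and the label assignments are hypothetical and do not reproduce the actual predictions of this work.

```python
from collections import Counter


def label_conflicts(pan_locus_labels):
    """Return the pan loci whose copies include both SEC and PSE labels."""
    conflicts = {}
    for pan_locus, labels in pan_locus_labels.items():
        counts = Counter(labels)
        if counts["SEC"] and counts["PSE"]:
            conflicts[pan_locus] = dict(counts)
    return conflicts


# Hypothetical predictions, one label per strain.
predictions = {
    "locus_a": ["PSE", "PSE", "PSE", "PSE", "PSE"],
    "locus_b": ["SEC", "SEC", "SEC", "PSE", "CYT"],
    "locus_c": ["PSE", "PSE", "MEM", "PSE", "PSE"],
}
print(label_conflicts(predictions))
```

Only the second pan locus is reported, since it is the only one mixing SEC and PSE copies. A list of this kind tells the curator which loci need a decision that the software alone cannot make.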
Predicted C. pseudotuberculosis pan surfaceome. Pan surfaceome predictions for 227 genes from strains 1002, C231, I19, FRC41 and PAT10, performed by SurfG+ 1.0.
The distribution of genes across the PSE subcategories is presented in Figure . The largest group among the 1045 genes predicted as PSE is cell wall anchored outward C-terminal (~40%) (≥ 50 AA long), followed by lipoproteins (~24%), outward N-terminal (~17%) (≥ 50 AA long) and outward loops (~11%) (≥ 100 AA long), whereas genes containing retention signals (PSE R) account for only ~8%.
Figure 6 Predicted C. pseudotuberculosis pan surfaceome by PSE subcategories. PSE categories are distributed in outward C-terminal or N-terminal portion greater than or equal to 50 AA. Outward N or C terminal greater than 100 AA are classified as L. Lipogenes identified ...
The PSE results of all strains were analyzed taking into account that a significant cell wall thickness difference was observed between strain I19 and the other strains (~34 nm versus ~24 nm). Despite this difference, only a small difference was predicted at the genome level, namely a decrease in the number of PSE genes and an increase in the number of MEM genes in C. pseudotuberculosis strain I19. A table containing the complete list of C. pseudotuberculosis PSE proteins is available in additional file 3.
Revised in vitro exoproteome results
The 104 observed genes in both TPP/LC-MSE
] and 2-DE-MALDI-TOF/TOF (Silva WM, Seyffert N, Castro TLP, Santos AV, Pacheco LGC, Santos AR, Ciprandi A, Zurita-Turk M, Dorella FA, Andrade HM, Pimenta AMC, Silva A, Miyoshi A, Azevedo V, unpublished observations) experiments were compared with the ISPPE results presented here. This comparison, explained in the methods section, brought novel insights into the in vitro exoproteome and showed that the main C. pseudotuberculosis in vitro exoproteome may contain additional genes. In Table are listed all 35 proteins of the variant in vitro exoproteome (strains 1002 and C231), which correspond to ~23% of the total. These proteins were found to be highly conserved in the five compared C. pseudotuberculosis strains and are part of the core ISPPE. Moreover, it was verified that three proteins (ADL20466, ADL20097 and ADL19973), previously classified as belonging to the variant in vitro exoproteome of strain 1002 [25], actually belong to the main in vitro exoproteome. These findings raise the possibility that more proteins of the variant in vitro exoproteome are in fact part of the main in vitro exoproteome.
Core C. pseudotuberculosis in silico predicted pan-exoproteome found in the variant in vitro exoproteome
This comparison also served as a counter-argument for some specific genes. The Cp1002_0369 gene, classified under the plcpsec100 pan locus as a pseudogene, was identified in the in vitro exoproteome experiment. Interestingly, this gene copy also fits the plcppse226 pan locus. Both pan loci belong to the previously mentioned group of genes that are difficult to classify by software into a single sub-cellular compartment, as some genes within the pan locus fit both the SEC and PSE labels. The in silico predictions indicate that there are at least three secreted proteins, in spite of the other two gene copies being predicted as PSE and CYT.
Furthermore, for the pan loci plcppse180, plcppse192, plcpsec077, plcpsec095 and plcpsec099, both gene copies were found in the main in vitro exoproteome of strains 1002 and C231, but these pan loci were not classified in the ISPPE. The plcppse180 pan locus holds a putative pseudogene (CpPAT10_0459) and is therefore not present in the in silico predicted core surfaceome. Other genes were predicted as cytoplasmic. It is possible that these genes were wrongly assembled, since there is evidence that at least two homologous genes, from strains 1002 and C231, are exported to the extracellular milieu.
Core C. pseudotuberculosis ISPPE candidates homologous to Mtb
Within the core C. pseudotuberculosis ISPPE, genes homologous to those of the previously studied Mycobacterium tuberculosis H37Rv (Mtb) were observed. In this work we present some of these homologous genes, which feature at least 90% protein alignment extension and at least 50% identity within this alignment. These cut-offs were obtained during the search for C. pseudotuberculosis homologous genes in the Mtb genome.
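The two homology cut-offs can be expressed as a simple filter. This is a hedged sketch: the function name `passes_cutoffs` and its parameters are hypothetical, and the alignment extension is assumed here to be the aligned length divided by the length of the C. pseudotuberculosis query protein, which the text does not state explicitly.

```python
def passes_cutoffs(identity_pct, aligned_length, query_length,
                   min_identity=50.0, min_extension=90.0):
    """Check one protein alignment against the identity and extension cut-offs.

    Assumption: extension = aligned length / query length, in percent.
    """
    extension_pct = 100.0 * aligned_length / query_length
    return identity_pct >= min_identity and extension_pct >= min_extension


# Hypothetical alignments: (identity %, aligned length, query length).
print(passes_cutoffs(51.0, 380, 400))   # 95% extension, 51% identity
print(passes_cutoffs(74.0, 300, 400))   # only 75% extension
print(passes_cutoffs(49.9, 400, 400))   # identity just below the cut-off
```

Requiring both conditions keeps only candidates that are similar along nearly the whole protein, rather than sharing a single conserved domain.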
The core C. pseudotuberculosis ISPPE, which accounts for ~81% of the total ISPPE, is composed of 306 genes or 1,530 distinct locus_tags, of which ~40% are predicted as SEC and ~60% as PSE proteins. Of these, 20 genes show high similarity to Mtb genes (Table ); however, not all of these Mtb genes have known functions.
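The figures quoted for the core ISPPE follow directly from the core secretome and core surfaceome counts given earlier. The short check below uses only numbers reported in the text; the variable names are illustrative.

```python
core_secretome, core_surfaceome = 122, 184   # 5x genes reported in the text
whole_isppe = 150 + 227                      # pan secretome + pan surfaceome
n_strains = 5

core_isppe = core_secretome + core_surfaceome
locus_tags = core_isppe * n_strains
core_share = 100 * core_isppe / whole_isppe
sec_share = 100 * core_secretome / core_isppe
pse_share = 100 * core_surfaceome / core_isppe

print(core_isppe, locus_tags)
print(round(core_share), round(sec_share), round(pse_share))
```

The result is 306 genes and 1,530 locus_tags, about 81% of the 377 ISPPE genes, split into roughly 40% SEC and 60% PSE.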
Core C. pseudotuberculosis in silico predicted pan-exoproteome homologous to Mtb's proteins
In this regard, we discuss here only some of the Mtb genes with experimental evidence. The plcppse174 pan locus shows 51% protein identity with Rv3915 (YP_178027.1), a gene named cwlM that was the first autolysin gene identified and cloned from Mtb. This finding offers a new drug target class that could alter the permeability of the mycobacterial cell wall and enhance the effectiveness of treatments for tuberculosis [26]. Applying the principles of in vivo expression technology (IVET), it was possible to identify upregulated genes from Mtb in an in vitro simulation of anaerobic persistence. The genes upregulated under hypoxic conditions (dissolved oxygen <1%) include Rv0050 (ponA1), a penicillin binding protein that has 52% protein identity to the plcppse165 pan locus and 90% alignment extension [27]. The plcpsec122 pan locus shows ~58% protein identity with Rv2752c (NP_217268.1), a unique bifunctional Mtb gene that has both β-lactamase and RNase activities. Both activities are lost upon deletion of the 100 AA long C-terminal tail, which contains an additional loop when compared to the RNase J of Bacillus subtilis
]. As can be observed, the plcppse080 pan locus appears twice in Table , as it is homologous to both NADH dehydrogenase gene copies of Mtb, ndh (NP_216370.1) and ndhA (NP_214906.1), with ~57% protein identity. In Mtb, energy generation is mainly performed by the type II dehydrogenases (ndh), both of which are therefore essential genes [29].
The plcpsec113 pan locus is homologous to the glmU gene (NP_215534.1), with ~59% protein identity and more than 90% alignment extension. This gene is essential in Mtb, being required for optimal bacterial growth, and has been selected as a possible drug target for structural and functional investigation [30]. GlmU is a bifunctional acetyltransferase/uridyltransferase that catalyses the formation of UDP-GlcNAc from GlcN-1-P. UDP-GlcNAc is the substrate for two important biosynthetic pathways: lipopolysaccharide and peptidoglycan synthesis. Due to its important roles, glmU had its conformational structure solved [30]. The plcpsec113 pan locus of C. pseudotuberculosis is an interesting putative drug target, since it is predicted to be secreted, is part of the core ISPPE and its conformational structure can be inferred by homology modeling using Mtb glmU.
Several genes involved in mannoglycoconjugate biosynthesis have been shown to be involved in virulence, due to their central role in the biosynthesis of major surface-associated glycoconjugates. Among these genes, the Mtb manB (Rv3264c) is defined as a GDP-mannose pyrophosphorylase (GDPMP), and disruption of its activity leads to a decrease in surface-associated mannosylated lipoglycans. For GDPMP, this decrease corresponds directly to reduced virulence in both BALB/c mice and cultured human macrophages [31]. The Mtb manB gene has 69% protein identity to the plcpsec110 pan locus and more than 90% alignment extension, making plcpsec110 a putative drug target worth considering.
Mycolic acids and multimethyl-branched fatty acids are found uniquely in the cell envelope and are essential for the survival, virulence and antibiotic resistance of Mtb. Acyl-CoA carboxylases (ACCases) commit acyl-CoAs to the biosynthesis of these unique fatty acids. Previous studies indicate that AccD5 is important for cell envelope lipid biosynthesis and that its disruption leads to pathogen death [32]. The Mtb AccD5 (NP_217797.1) had its structure determined and also shows ~74% protein identity to the plcppse045 pan locus over more than 90% alignment extension, making it also a promising candidate for further evaluation as a vaccine candidate.
Moreover, it was demonstrated that Mtb can use heme as an iron source, suggesting that Mtb contains a yet-unknown heme acquisition system [33]. We found that the C. pseudotuberculosis plcpsec076 pan locus has ~52% protein identity to the Mtb protein NP_217194.1 and more than 90% alignment extension, therefore also representing an interesting drug target for C. pseudotuberculosis.
The results presented here provide a plethora of putative vaccine candidates never seen before for C. pseudotuberculosis. However, genes predicted as MEM and CYT account for 18% and 65%, respectively, of the in silico predicted pan genome. The 227 surfaceome and 150 secretome genes presented here represent only ~16% of the C. pseudotuberculosis in silico predicted pan genome. Most of the genes remain inaccessible to current in silico prediction techniques, and it is possible that these neglected genes could also be good candidates against C. pseudotuberculosis. These findings point to the need for more elaborate and targeted software or prediction schemas capable of uncovering these large neglected portions of the genome. Using the prediction schema presented here, it was possible to include more than 2% of non-classical secreted putative proteins among the putative vaccine candidates. This low yield of vaccine candidates is due to the optional parameter selected in our prediction schema, a non-classical secretion score greater than or equal to 0.90. With the default parameters of SecretomeP and NClassG+, this yield would increase to ~6% and the final yield of putative vaccine candidates would be ~20%, using a couple of motif predictors as depicted in Figure . Current reverse vaccinology software thus allows obtaining a number of candidates close to 20% of the C. pseudotuberculosis genome. These considerations raise a question: supposing that novel software for unexplored secretion pathways becomes available, what percentage of the genome could be selected as putative vaccine candidates? Supposing that this percentage reaches 40%, how could the problem of choosing among almost one thousand putative vaccine candidates for the next vaccine production stage for C. pseudotuberculosis be solved? This dilemma could be addressed with further prediction software, such as tools that predict epitope affinity for MHC class I and II alleles [34]; however, this could be just a part of the solution. The dilemma might also be solved by means of broader vaccine projects, which would take into account particular variables for each target organism in order to minimise research efforts and the number of possible vaccine candidates [35].
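The effect of the score cut-off on the number of non-classical candidates can be illustrated with a small filter. This is only a sketch: the function name `select_non_classical`, the gene names and the scores are hypothetical, and the relaxed value of 0.5 is an illustrative placeholder, not a confirmed default of SecretomeP or NClassG+.

```python
def select_non_classical(scores, threshold=0.90):
    """Return the genes whose non-classical secretion score meets the cut-off."""
    return sorted(gene for gene, score in scores.items() if score >= threshold)


# Hypothetical scores for four genes.
scores = {"gene_1": 0.95, "gene_2": 0.90, "gene_3": 0.72, "gene_4": 0.31}

strict = select_non_classical(scores)                  # cut-off used in this schema
relaxed = select_non_classical(scores, threshold=0.5)  # illustrative lower cut-off
print(strict)
print(relaxed)
```

Lowering the cut-off enlarges the candidate list at the cost of specificity, which is the trade-off behind the difference between the ~2% and ~6% yields discussed above.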
In silico versus non-in silico
It is broadly known that in silico genome investigations can give evidence about the function and structure of a genome. It is also known that such in silico investigations can only be confirmed or refuted by non-in silico experiments. However, this reasoning is not a one-way street. Non-in silico experiments could be improved by means of more comprehensive or more specific approaches, with the objective of getting answers closer to reality for biological questions. The fact is that in silico analyses do not vary when executed over and over again, no matter how many times they are run. We know that exactly 122 genes will always be predicted as having classical exportation motifs; on the other hand, we cannot expect the same behavior from non-in silico analysis. Some real proteins may or may not be found in an in vitro or in vivo exoproteome result, due to a countless number of factors [21]. Therefore, we suggest that the core C. pseudotuberculosis ISPPE could be composed of a larger number of predicted genes, but this could only be confirmed with additional non-in silico experiments.