Optimization of sample processing for mass spectrometry
Protein identifications were derived from the lungs of infected guinea pigs. Since homogenates were made from the whole lung, all proteomic samples contained both host and bacterial proteins. Based on growth curve data from infected guinea pig lungs, 10–20 CFU seeded the lungs of each animal, and time-points earlier than 30 days were not addressed because of the difficulty of confidently identifying proteins in lung tissue containing less than 5 log10 CFU. The ratio of guinea pig to mycobacterial cells was previously determined using uninfected lung tissue spiked with decreasing numbers of bacteria, in order to establish a lower limit of detection for our mass spectrometry methods (data not shown). CFU counts were determined for each sample: day 30 samples averaged 5.77 log10 (±0.19) and day 90 samples averaged 5.89 log10 (±0.32), consistent with previous observations. Similarly, the pathological state of the lungs showed the typical progression of chronic tuberculosis, with day 30 infected lungs showing contained lesions consisting of inflammation and areas of central necrosis (). Day 90 infected lungs showed progression of disease, with multiple areas of inflammation and coalescing necrosis throughout the lung along with secondary granulomas (). In either case, the vast majority of each sample was composed of host material. Thus, methodology was developed to keep host proteins from overwhelming the analysis of Mtb proteins. Similar in vivo samples previously analyzed by microarray relied on amplification of bacterial RNA or selective analysis of transcripts to eliminate the burden of host RNA. For proteomics, we applied a similar workflow: chromatographic separation was used to enrich for bacterial products by reducing sample complexity prior to MS analysis, and a smaller custom database (rather than complex databases, e.g. NCBI or SwissProt) was constructed and interrogated for selective analysis of bacterial peptides.
Representative photomicrographs of A) day 30, B) day 90 post-infection guinea pig lungs, and C) uninfected guinea pig lungs.
Initial LC-MS/MS optimization was assessed with tryptic digests of Mtb whole cell lysate (WCL) utilizing nanospray mass spectrometry. The number of mycobacterial proteins identified correlated directly with the length of the elution segment during chromatography. A short and shallow gradient (42 min) did not sufficiently separate host and bacterial proteins. Since the host proteins were much more abundant, ion suppression hindered the identification of many bacterial proteins. In this study, multiple gradient conditions were evaluated, and it was determined that with a peptide load of 50 ng, a 90-minute linear gradient provided optimal separation (). Additionally, the number of unique protein identifications per sample increased drastically with sequential injections of the same sample (). Therefore, all samples were run in triplicate. In addition, the database used for interrogation contained the predicted proteins of the Mtb H37Rv and mouse proteomes, so that the host ion peaks would not be matched to the predicted m/z for bacterial ions. Data were then subjected to a second interrogation using a reverse database to further eliminate spectra that were not specific to mycobacteria.
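The database construction described above can be illustrated with a minimal sketch. It assumes that the "reverse database" is built by reversing each target sequence, which is the common convention but is not stated explicitly in the text. The function names, the `REV_` prefix and the toy sequences are hypothetical and are not taken from the study.

```python
from typing import Dict


def concatenate(*databases: Dict[str, str]) -> Dict[str, str]:
    """Merge several {entry name: sequence} databases into one target database."""
    merged: Dict[str, str] = {}
    for database in databases:
        for name, sequence in database.items():
            if name in merged:
                raise ValueError(f"duplicate entry name: {name}")
            merged[name] = sequence
    return merged


def reverse_decoys(proteins: Dict[str, str], prefix: str = "REV_") -> Dict[str, str]:
    """Build one decoy per protein by reversing its sequence (assumed convention)."""
    return {prefix + name: sequence[::-1] for name, sequence in proteins.items()}


def to_fasta(proteins: Dict[str, str], width: int = 60) -> str:
    """Serialize entries as FASTA text with fixed-width sequence lines."""
    lines = []
    for name, sequence in proteins.items():
        lines.append(">" + name)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Toy sequences for illustration only; these are not real Mtb or mouse proteins.
mtb_toy = {"Mtb_toy1": "MKTAYIAK", "Mtb_toy2": "MSEQLR"}
mouse_toy = {"Mouse_toy1": "MALWTRLK"}

target_db = concatenate(mtb_toy, mouse_toy)  # concatenated bacterial + host database
decoy_db = reverse_decoys(target_db)         # reversed entries for the second search
print(to_fasta(decoy_db))
```

Searching the same spectra against `target_db` and `decoy_db` gives an estimate of how often a spectrum matches by chance, which is the reasoning behind the second interrogation.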
Summary of LTQ method optimization.
The percentage of novel protein identifications tapers after sequential MS injections.
Composite analysis of in vivo Mtb proteomes
The sample set consisted of lung homogenates from six animals per time point, harvested 30 and 90 days after low dose aerosol (LDA) infection (total of 12 samples, 6 biological replicates for each time point). The concatenated Mtb-mouse database was chosen because of the poor annotation of the guinea pig genome. Since the goal of this study was to identify mycobacterial rather than host proteins, we felt that the confidence of our protein identifications would be improved by concatenation to this large, well-defined mammalian database, which includes over two hundred thousand entries. We are confident in this database because of the homology between the Mus musculus and Cavia porcellus proteins. Indeed, the most commonly found host proteins, including albumin, calmodulin, actin and superoxide dismutase, were found whether we used the mouse database or the poorly annotated guinea pig database (data not shown). To reduce the false discovery rate, two separate data filters were designed and applied prior to pooling. The first filter removed proteins identified by peptides that had low ratios of observed to theoretical MS/MS ions, guaranteeing a certain amount of protein coverage and removing bias from larger proteins. The second filter was applied at the protein level, retaining only those proteins that were present in 2 or more biological replicates. All filtered data were pooled using the Scaffold program, which added another level of stringency through the Peptide and Protein Prophet statistical analysis algorithms. Proteins and peptides were disqualified below a 90% threshold. Proteins identified by a single peptide were removed from our analysis, while those identified by only two peptides (in separate biological samples) were subject to manual validation. To summarize, 355,411 spectra were acquired from the six 30-day time-point samples. Of these, 1,598 were matched to mycobacterial peptides within 310 proteins. Because of the presence of mammalian tissue, the match ratio was low, accounting for 0.4496% of the spectra. Likewise, 287,843 spectra were acquired from the six 90-day time-point samples. Of these, 2,336 were matched to mycobacterial peptides within 323 proteins, accounting for 0.8116% of the total spectra. This multi-filtered analysis provided a final list of 545 protein identifications, ranging from 10 to 432 kDa with a pI range of 3.54 to 12.12. Of these, 222 and 235 proteins were uniquely identified in the 30 and 90-day samples, respectively, and 88 proteins were common to the two sample sets (; Supplementary Tables S1). As a negative control, six uninfected lung samples were subjected to an identical analysis. Several falsely identified proteins were removed from the final analysis because Mtb peptides identified in the uninfected samples were also found in our infected samples. From this negative control, false discovery rates (FDRs) of 9.1% and 6.9% were calculated for the 30 and 90-day analyses, respectively. FDRs calculated by the traditional reverse database method were similar: 11% and 6.8% for the 30 and 90-day samples, respectively. Since the FDRs calculated from the potential false positives identified in the uninfected samples are similar to those from the reverse database analysis, a high confidence threshold (90%) was retained for proteins identified in this analysis.
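The protein-level rules described in this section (presence in two or more biological replicates, removal of single-peptide proteins, manual validation of two-peptide proteins) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Scaffold workflow itself: the input layout, the function name `filter_proteins` and the toy identifiers are hypothetical, and the probability thresholds applied by Peptide and Protein Prophet are not modeled.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# One observation = (biological replicate, protein, peptide); hypothetical layout.
Observation = Tuple[str, str, str]


def filter_proteins(
    observations: Iterable[Observation], min_replicates: int = 2
) -> Tuple[List[str], List[str]]:
    """Return (accepted proteins, proteins flagged for manual validation)."""
    replicates = defaultdict(set)  # protein -> replicates in which it was seen
    peptides = defaultdict(set)    # protein -> distinct peptides supporting it
    for replicate, protein, peptide in observations:
        replicates[protein].add(replicate)
        peptides[protein].add(peptide)

    accepted, manual = [], []
    for protein in sorted(replicates):
        if len(replicates[protein]) < min_replicates:
            continue  # seen in too few biological replicates
        n_peptides = len(peptides[protein])
        if n_peptides < 2:
            continue  # single-peptide identifications are removed
        if n_peptides == 2:
            manual.append(protein)  # two-peptide proteins need manual validation
        else:
            accepted.append(protein)
    return accepted, manual


# Toy data for illustration only.
toy = [
    ("rep1", "A", "p1"), ("rep2", "A", "p2"), ("rep2", "A", "p3"),
    ("rep1", "B", "p4"), ("rep2", "B", "p5"),
    ("rep1", "C", "p6"), ("rep1", "C", "p7"), ("rep1", "C", "p8"),
    ("rep1", "D", "p9"), ("rep2", "D", "p9"),
]
accepted, manual = filter_proteins(toy)
print(accepted, manual)
```

In the toy data, protein C is dropped because it appears in only one replicate, and protein D is dropped because it is supported by a single peptide.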
Venn diagram depicting the breakdown of proteins identified at each time-point in this study.
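As a consistency check, the counts reported above can be recomputed directly. The short sketch below uses only the figures stated in the text; the variable names are ours.

```python
# Figures as reported in the text, keyed by day post-infection.
spectra_acquired = {30: 355_411, 90: 287_843}
spectra_matched = {30: 1_598, 90: 2_336}
unique_proteins = {30: 222, 90: 235}
shared_proteins = 88

# Percentage of acquired spectra matched to mycobacterial peptides.
match_pct = {
    day: round(100 * spectra_matched[day] / spectra_acquired[day], 4)
    for day in spectra_acquired
}

# Proteins per time point = unique to that time point + common to both.
per_timepoint = {day: unique_proteins[day] + shared_proteins for day in unique_proteins}

# Size of the union of the two sets shown in the Venn diagram.
total_proteins = sum(unique_proteins.values()) + shared_proteins

print(match_pct, per_timepoint, total_proteins)
```

The recomputed values agree with the reported match ratios (0.4496% and 0.8116%), the per-time-point totals (310 and 323 proteins) and the final list of 545 proteins.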
Based on the TubercuList Web Server (http://genolist.pasteur.fr/TubercuList) designations, all proteins were sorted by their functional category (). Two functional groups, categories 3 (cell wall and cell processes) and 7 (intermediary metabolism and respiration), together comprised nearly half of the total data, representing 22.7% and 21% of the total identifications, respectively. Interestingly, there is little overlap between the day 30 and day 90 proteins identified for these two categories, with only 13 proteins (10.3%) in category 3 and 12 proteins (10.3%) in category 7 found in both day 30 and day 90 samples ().
A closer look at category 3 shows an abundance of membrane transport proteins (). This includes many members of the metal-cation transporting ATPase family, including CtpV and CtpB (copper), CtpD (possibly cadmium), CtpE (unknown), CtpF (unknown), CtpG (unknown), CtpH (unknown) and CtpI (magnesium), illustrating that the adjustment of cation levels is critical within the host. Just upstream of CtpB (Rv0103), Rv0102 also shows homology (by BLAST analysis) to copper resistance transporters. In addition, 10 efflux pumps were identified and were found to be particularly prevalent in the early stages of infection. Conversely, proteins involved in the binding and transport of phosphate were found entirely in the 90-day data set. The presence and wide variety of pumps and transport proteins during infection lends support to the conjecture that the bacteria may adapt to the host environment by scavenging resources and altering the micronutrient levels.
Changes over the course of infection of representative pathways from categories 3 & 7.
Category 7 includes an assortment of proteins involved in catabolism. While examination of the data showed no significant difference in hexose metabolism between day 30 and day 90, significant differences were found in the later stages of metabolism, beginning with dehydrogenation of pyruvate and continuing through the TCA cycle. Of the 10 proteins identified for this pathway, only 1 was specific to the 90-day samples, and the spectra of the remaining 9 proteins were found overwhelmingly in the 30-day samples (). These data support those of others and indicate a decrease in the preferred carbon nutrients, as well as a decrease in phosphate, during chronic infection. They also support the hypothesis that mycobacteria may break down lipids, rather than carbohydrates, as a source of carbon and energy. The phospholipases C are noted virulence factors and are hypothesized to break down host phospholipids for bacterial use. Similarly, Rv0183 (a lysophospholipase), LipR (Rv3084, a lipolytic esterase) and LipY (Rv3097, a triacylglycerol lipase) were all expressed at the 90-day time point. To obtain phosphorus for energy from the host environment, Mtb may utilize two types of transport proteins, the low-affinity PitA and the two high-affinity phosphate transporters PstS1 and PstS3, as well as a phosphate binding lipoprotein. These were present during the chronic infection.
The third most abundant category consists of the acidic PE/PPE proteins, which represent 16.2% of the total protein identifications. It is important to note that the PE/PPE proteins have numerous peptides that overlap or are highly similar. Thus, any ambiguities in the assignment of a PE/PPE peptide were analyzed and validated. In this study, peptides assigned to more than one protein were retained only if the protein in question had an additional two or more unique peptides. Unlike the categories discussed above, this category contains the largest overlap, with 37 proteins (45.7%) common to the 30 and 90-day samples. However, from day 30 to day 90, there is also an increase of 7.4% with respect to total proteins represented. This is the largest increase in any of the categories. This increase is most evident in and , which show that of the ten most abundant proteins in each sample set by spectral counting, two (30-day) and five (90-day) are from the PE/PPE category. The exact role of these proteins is unknown, but in general they are thought to reside in the cell envelope and have been implicated in increasing antigenic variation. Many PE/PPE proteins, such as PE-PGRS54, have been shown to be upregulated in response to conditions such as hypoxia, exposure to H2 and during NRP. Of interest, it has been hypothesized that some PE and PPE proteins may interact with each other after co-expression from the same operon. Following this notion, two sets of proteins, PE-PGRS53/54 and PE-PGRS56/57, all highly represented in the 90-day infection set, may be products of the same operons.
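The rule applied to ambiguous PE/PPE peptides (a peptide shared by several proteins is kept for a given protein only if that protein also has two or more unique peptides) can be sketched as below. This is an illustrative reading of the rule as stated in the text; the function name, input layout and the `PPE_a`/`PPE_b` labels are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Assignment = Tuple[str, str]  # (peptide, protein); hypothetical layout


def retain_assignments(
    assignments: Iterable[Assignment], min_unique: int = 2
) -> Set[Assignment]:
    """Keep unique peptides, plus shared peptides for well-supported proteins."""
    proteins_of = defaultdict(set)  # peptide -> proteins it was assigned to
    for peptide, protein in assignments:
        proteins_of[peptide].add(protein)

    # Count, per protein, the peptides assigned to that protein alone.
    unique_count = defaultdict(int)
    for peptide, proteins in proteins_of.items():
        if len(proteins) == 1:
            unique_count[next(iter(proteins))] += 1

    kept = set()
    for peptide, proteins in proteins_of.items():
        for protein in proteins:
            is_unique = len(proteins) == 1
            if is_unique or unique_count[protein] >= min_unique:
                kept.add((peptide, protein))
    return kept


# Toy data for illustration only.
toy_assignments = [
    ("shared", "PPE_a"), ("shared", "PPE_b"),
    ("u1", "PPE_a"), ("u2", "PPE_a"),
    ("u3", "PPE_b"),
]
kept = retain_assignments(toy_assignments)
print(sorted(kept))
```

In the toy data, the shared peptide is retained for `PPE_a`, which has two unique peptides, and discarded for `PPE_b`, which has only one.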
The ten most dominant Mtb proteins within the 30-day Mtb-infected lung samples based on normalized spectral counts.
The ten most dominant Mtb proteins within the 90-day Mtb-infected lung samples based on normalized spectral counts.
Possible relationships between proteomic and microarray datasets
The proteomic analysis of Mtb-infected guinea pig lungs 30 and 90 days post-aerosol challenge provides a description of proteins present during the establishment and maintenance of infection. Prior to this study, the majority of the information known about the bacterial state during hypoxia or infection was gleaned from correlative microarray studies. It has been noted that, because of post-transcriptional events, the relationship between the amounts of mRNA and protein is not 1:1. This was very evident when comparing the 30 and 90-day proteomic data to gene expression data sets from the well-characterized in vitro models of NRP and starvation. For example, the Muttucumaru et al. data set, which summarizes the changes cells undergo in the transition from aerobic growth to the NRP1 (microaerophilic) and NRP2 (anaerobic) stages, showed a 6.8% overlap between the genes upregulated in NRP1/2 and our 30/90-day in vivo analysis (). The commonalities include narK2 (nitrate reductase), ppsB (PDIM synthesis), ctpV (copper transport), mbtB (mycobactin synthesis) and several ppe genes. Similar trends in upregulation were apparent in the Voskuil et al. microarray analysis of the stationary phase and NRP models, the Rachman et al. microarray analysis of Mtb in infected lungs and even the Cho et al. ICAT study on in vitro NRP (). While Cho's study identified representatives from important up-regulated pathways, the proteomics methodology employed in our study recognized several more members of each of these pathways, reinforcing their importance in vivo. In addition to those similarities, major fundamental differences exist between microarray data sets from in vitro studies and the proteomic data described in this study. Specifically, a large number of chaperones and detoxification proteins were identified in the in vitro models. In fact, the single most common finding in the in vitro scenarios, including both the Rachman and Cho studies, is the upregulation of hspX. It is very likely that HspX is present in vivo; however, based on our findings, it is either rapidly exported from the lung or its mass spectra are obscured in our study.
Overlap of our study with a sampling of in vitro studies.
Interestingly, the starvation model appears the least similar to the in vivo results described in this study. In fact, the results appear to be reversed: the profile of genes found to be down-regulated in response to starvation is more similar to our infection model than the profile of those found to be up-regulated. Perhaps in the lung of the host, the bacteria are not nutrient restricted at all. Since the sequencing of the Mtb genome in 1998, it has been known that Mtb contains an unusually large number of proteins involved in lipid metabolism. Many FadD and FadE proteins are present in the in vivo data set; thus, it is likely that the bacteria are able to break down host lipids and utilize them as nutrients. Contrary to our results, the Betts model shows PDIM synthesis to be decreased, as well as down-regulation of several genes, including glcB, many of which have been shown to be present in other infection models. Rv3403, a hypothetical protein found in our preliminary list of the 10 most abundant proteins at day 30 (), is shown to be down-regulated in the starvation model. However, by day 90 this protein is completely absent, which may be related to the decrease in a nutrient in the lung.
Much of the literature defining the bacterial state during NRP has been based on microarray studies, in which changes in mRNA levels were monitored in response to stresses imposed on cultured Mtb. While it is important to tease out which pathways may be upregulated by each stress the bacteria face in the host, it is equally important to establish the combined and actual effects of intra-host pressure. In this study, none of the simulated in vitro (model) environments accurately reflected the protein profile within the lung, not even the “multiple stress dormancy model”. This is not entirely surprising; in vitro cultures include a different set of variables resulting in bacterial stress: nutrients are static and can be exhausted, toxic bacterial byproducts can build up, and physical space is limiting. Likewise, even tissue culture experiments yield an incomplete picture, in that they are missing more complex immunological influences. Perhaps the most important difference is that in vitro studies focus on a clonal population, while in vivo different populations exist, a major contributing factor that has hindered the progress of treatment. To tease out which proteins vary across the various bacterial populations and host environments, primary and secondary granulomas will be compared to uninvolved tissue in future studies.
Lastly, we acknowledge that our described in vivo proteome lacks the identification of the major secreted proteins. This was unexpected, since in all previous in vitro analyses by ourselves and others, secreted/stress response proteins such as GroEL, HspX and DnaK were highly abundant. Similarly, during the development of our mass spectrometry methodology, we utilized uninfected lung tissue spiked with 10^6 gamma-irradiated (dead) Mtb, and in this analysis the secreted proteins and several chaperones remained dominant (data not shown). Likewise, microarray analysis of Mtb from macrophage culture shows transcripts of the stress-associated proteins in high abundance. Clearly, there is a difference between mock-infected lung tissue, cell culture-based analyses, and lung tissue obtained from an actual infection. Since our proteomics was performed on whole lung homogenates, it is likely that the exported proteins were not identified because these proteins are not simply secreted by the bacillus and retained at the site of infection; rather, we hypothesize that these proteins are trafficked to the draining lymph node, serving as important T-cell antigens. These secreted proteins, therefore, may not be valid drug targets because they are not directly associated with the site of infection. However, their role as potential biomarkers, diagnostic reagents, or vaccine candidates remains. In addition to phagocytosis-mediated export of secreted proteins from the lung, some proteins may traffic directly to the blood, sputum, or other bodily fluids. Indeed, members of the antigen-85 complex (Rv3804c, Rv1886c and Rv0129c) have been detected in serum and cerebrospinal fluid. Further, some secreted mycobacterial products may be shuttled away in exosomes, as is the case described for the 19 kDa lipoprotein (Rv3763) and lipoarabinomannan (LAM).
In this study, over 500 Mtb proteins present at 30 and 90 days post infection are described. This description provides a picture of the Mtb proteome in mammalian lungs. The exclusion of a protein from either sample list does not rule out its presence. These samples are highly complex, containing both host and bacterial peptides. Thus it is likely that only the most dominant Mtb proteins (where dominance reflects both quantity and capacity for mass spectrometry detection, based on the physicochemical ionization properties of the protein) are described here. One of the flaws of large-scale shotgun proteomic studies is that dominant peptides can skew the results in a manner that causes some proteins to remain undetected. Thus, and contain the proteins observed based on the abundance of unique and repetitively sequenced peptides from one protein. While this is an accepted method to glean which proteins are more dominant relative to other proteins identified in a given sample, their absolute abundance has not been validated. Most of the spectral counts mentioned in our study (the spectral counts for each protein identified can be found in Supplementary Table S3) are significantly lower (ten-fold) than those of the dominant host proteins, which are in the hundreds (data not shown). This is entirely expected and reflects the nature of the sample: the infected lung contains few bacteria in relation to host cells. Any of the proteins or pathways in question can be validated with a quantitative mass spectrometry technique such as multiple reaction monitoring (MRM); these studies are currently being developed in our laboratory to validate some of our findings.
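Ranking proteins by spectral count, as discussed above, can be illustrated with a short sketch. The normalization used in this study is not specified in the text; the sketch assumes, purely for illustration, that each protein's count is divided by the total number of spectra acquired for the sample set. The function names and the toy counts are hypothetical.

```python
from typing import Dict, List


def normalize_counts(counts: Dict[str, int], total_spectra: int) -> Dict[str, float]:
    """Divide each spectral count by the total spectra (assumed normalization)."""
    if total_spectra <= 0:
        raise ValueError("total_spectra must be positive")
    return {protein: count / total_spectra for protein, count in counts.items()}


def top_n(values: Dict[str, float], n: int = 10) -> List[str]:
    """Return the n proteins with the highest values; ties are broken by name."""
    return sorted(values, key=lambda protein: (-values[protein], protein))[:n]


# Toy counts for illustration only; these are not values from the study.
toy_counts = {"protein_a": 5, "protein_b": 9, "protein_c": 5, "protein_d": 1}
normalized = normalize_counts(toy_counts, total_spectra=1000)
ranking = top_n(normalized, n=3)
print(ranking)
```

Such a ranking reflects relative dominance within one sample set only. As noted above, it does not establish absolute abundance, which requires a quantitative technique such as MRM.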
This proteomic study on infected lung homogenates has supplied a long list of mycobacterial proteins that are present during infection. The most challenging facet of treating tuberculosis is that there are multiple populations of Mtb within the lung. Thus, it is not surprising that our data do not correlate with any one in vitro model. The Mtb-infected lung comprises a heterogeneous and dynamic population of bacteria. One additional source of heterogeneity is the contribution of organisms that have infiltrated from other infection sites, such as the spleen. It is believed that after seeding the spleen, organisms can re-enter the lung, forming secondary infection sites. Therefore, it is highly likely that these organisms are also sampled in our 90-day post-infection analysis, in addition to those retained in the lung throughout the course of infection. This aspect is beneficial when defining drug targets. We believe that several of the proteins identified in our analysis will give us clues to which pathways and biosynthetic processes of Mtb might be worth targeting during an infection. It is difficult to determine from this study which of the hundreds of proteins identified are important. The comparisons of our dataset to others' afford some conjecture (). Few studies have looked at specific factors contributing to Mtb survival in the guinea pig model of infection. One study did explore survival rates of attenuated Mtb and identified 18 mutants with reduced fitness in the guinea pig. Two of the eighteen gene products, Rv1798 and ctpG, were identified in this study. As part of our future undertakings, we hope to determine which of these aspects of Mtb physiology persist in specific Mtb lesions through continued comprehensive and targeted proteomic profiling during infection, in an effort to define novel, specific targets that lead to advances in vaccine and drug discovery efforts.