The composition of the large, single mitochondrion of T. brucei was characterized by mass spectrometry (2D-LC-MS/MS and gel-LC-MS/MS) analyses. A total of 2,897 proteins, representing a substantial proportion of the procyclic form cellular proteome, were identified, which confirmed the validity of the vast majority of gene predictions. The data also showed that the genes annotated as hypothetical (species specific) were over-predicted and that virtually none of the genes annotated as hypothetical unlikely are expressed. By comparing the mass spectrometry data with the genome sequence, 40 genes were identified that were not previously predicted. The data are placed in a publicly available web-based database (www.TrypsProteome.org). The total mitochondrial proteome is estimated at 1,008 proteins, with 401, 196, and 283 assigned to the mitochondrion with high, moderate, and lower confidence, respectively. The remaining mitochondrial proteins were estimated by statistical methods, although individual assignments could not be made. The identified proteins have predicted roles in macromolecular, metabolic, energy generating, and transport processes, providing a comprehensive profile of the protein content and function of the T. brucei mitochondrion.
Trypanosomes are protozoan parasites that cause an enormous disease burden: African trypanosomiasis, caused by Trypanosoma brucei, contributes 1.5 million DALYs (Disability Adjusted Life Years), Chagas disease, caused by T. cruzi, contributes 667,000 DALYs, and leishmaniasis contributes 2 million DALYs (World Health Report, 2004). Sequencing of the T. brucei, T. cruzi and L. major (the TriTryps) genomes is essentially complete [1–3], and the accumulation of extensive genome sequence information poses new challenges as well as opportunities for post-genomic research. The TriTryp genomes have substantial sequence, gene content, and gene order conservation, and most of the basic cellular processes are shared among these trypanosomatids. Bioinformatics and comparative genomics play powerful roles in identifying putative genes and in defining the potential functions and relationships of many genes in the repertoire. However, about two-thirds of the predicted genes in these organisms have no known function and are currently annotated as encoding hypothetical proteins (www.genedb.org). The T. brucei genome is predicted to encode 9,211 proteins, of which only 35.7% have been assigned functional roles based on experimental data (5.1%) or sequence similarities to proteins of known function in other organisms (30.6%) (www.genedb.org/genedb/tryp/index.jsp). The gene predictions in the TriTryps have not been systematically tested to determine if the predicted protein is present, let alone whether the predicted functions are accurate. The trypanosome genomes have an unusual organization. They consist of clusters of numerous genes on the same DNA strand (directional clusters), each of which appears to be transcribed from a single promoter-like element, and RNA abundance is primarily regulated by transcript processing and turnover [4,5]. Most biological processes are controlled at the protein rather than the RNA level, and this may be especially true in T. brucei, where regulation of transcription is rare and regulation of translation has been demonstrated [6,7]. Thus experimental evidence for gene expression at the protein level is important in defining the potential function of trypanosomatid genomes. Progress in the development of mass spectrometric proteomics technologies enables proteins to be analyzed in a high throughput, automated manner. Fortunately, essentially all trypanosomatid genes lack introns, which simplifies gene identification and aids proteomic characterization. Such an approach can identify the molecular components of organelles, sub-cellular structures, and biological macromolecular complexes, and can also determine differences in protein expression between two cell states and the various post-translational modifications that control regulatory pathways. Yet while the availability of the TriTryp genome sequences has accelerated research progress in many laboratories, only limited information has been generated at the proteome level for these organisms [8–14].
In this study we used a shotgun proteomics approach to identify proteins present in the mitochondrion of T. brucei procyclic form (PF) cells. The resultant profile was compared to the genome database and used to assess the validity of gene annotation. The results substantially and efficiently advance the annotation of trypanosomatid genomes. The proteomic data also enabled us to identify a set of new genes in T. brucei. Identified proteins were assigned to mitochondrion (mt) by criteria including enrichment in the organelle fraction, demonstrated or putative role in relevant biological processes, and association with known mitochondrial complexes, especially for those with unknown functions. We also identified a large set of proteins with unknown function that are likely associated with multi-protein mt complexes. We have created a web-based database “www.TrypsProteome.org” for dissemination of the proteomic data from these analyses.
Trypanosoma brucei procyclic form (PF) cells IsTaR 1.7a were grown at 27°C in SDM-79 media containing hemin (7.5 mg/ml) and 10% FBS. The cells were harvested at mid-log phase of growth by centrifugation at 6,000 × g for 10 min at 4°C. The mitochondrial vesicles were isolated from PF cells by hypotonic lysis followed by Percoll gradient flotation as described. Briefly, ~2 × 10^10 PF cells were harvested at mid-log phase of growth and washed with 30 ml of SBG buffer (20 mM phosphate buffer, pH 7.9, 150 mM NaCl, 6 mM glucose). The cells were resuspended in 20 ml of DTE buffer (1 mM Tris-HCl, pH 8.0, 1 mM EDTA) and disrupted by 5 strokes in a Dounce homogenizer, and sucrose was immediately added to a final concentration of 0.25 M (3.34 ml of 60% sucrose solution). After mixing, the lysate was centrifuged at 15,000 × g for 10 min at 4°C. The organelle enriched pellet was resuspended in 3.9 ml of STM buffer (20 mM Tris-HCl pH 8.0, 250 mM sucrose, 2 mM MgCl2) and treated with DNase (9 μg/ml final concentration). The sample was incubated on ice for 60 min, after which an equal volume of STE buffer (20 mM Tris-HCl pH 8.0, 250 mM sucrose, 2 mM EDTA) was added, and the sample was mixed and centrifuged as above. The pellet was resuspended in 4 ml of 70% Percoll using a small Dounce homogenizer with tight fitting pestle B for 5 strokes, layered at the bottom of a 32 ml 20–35% linear Percoll gradient, and centrifuged at 103,900 × g for 60 min at 4°C. The mitochondria enriched fraction, which appears in the density range of 1.052 to 1.069 g/ml, was collected using a syringe and 18-gauge needle and washed 4 times with STE buffer, and the mitochondrial vesicles were pelleted by centrifugation at 32,530 × g for 15 min.
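The sucrose addition in the lysis step can be checked with a short dilution calculation. The sketch below assumes the 60% stock is weight per volume (the text does not say) and uses the molar mass of sucrose (342.3 g/mol); the function name is illustrative only.

```python
SUCROSE_MW = 342.30  # g/mol, molar mass of sucrose


def final_molarity(stock_percent_wv, stock_ml, sample_ml, mw=SUCROSE_MW):
    """Molarity after adding stock_ml of a % (w/v) stock to sample_ml of lysate."""
    stock_molar = stock_percent_wv * 10.0 / mw  # % w/v -> g/L -> mol/L
    return stock_molar * stock_ml / (stock_ml + sample_ml)


# 3.34 ml of 60% sucrose added to 20 ml of lysate, as in the protocol above.
sucrose_final = final_molarity(60.0, 3.34, 20.0)
print(f"{sucrose_final:.3f} M")
```

Under the weight-per-volume assumption the result agrees with the stated final concentration of 0.25 M.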
The PF cells were washed with 1X PBS and lysed with 1% Triton X-100 with bi-directional mixing for 15 min at 4 °C. The lysed samples were separated to soluble supernatant and insoluble pellet fractions by centrifugation at 17,500 × g for 30 min at 4 °C. The pellet was washed thrice with 1X PBS, 1% Triton X-100 solution, and the cleared supernatant and pellet fractions were collected and analyzed by mass spectrometry (see 2.4). Similarly mitochondrial vesicles were lysed with 1% Triton X-100 and separated to cleared supernatant and pellet fractions as above.
The proteins in detergent soluble fractions were digested with sequencing grade modified trypsin (Promega) and the resulting peptides were analyzed by two-dimensional liquid chromatography tandem mass spectrometry (2D-LC-MS/MS). In first dimension the peptides were fractionated by off-line strong cation exchange (SCX) chromatography, multiple fractions were collected and in second dimension the peptides were further fractionated by on-line reverse phase (RP) chromatography. Briefly, 200 μg of proteins from detergent soluble fractions of PF cells and PF mitochondria were precipitated separately with 6 volumes of Acetone. The precipitates were dissolved in 8M Urea, 1 mM DTT and incubated at 50 °C for 1 h. After 4 fold dilution with 50 mM ammonium bicarbonate the proteins were digested with 2 μg trypsin O/N. The peptide samples were diluted 1:8 with 5% acetonitrile in 0.4% acetic acid buffer and loaded onto a 10 cm long x 2.1 mm ID polysulfoethyl column (PolyLC Inc) at a flow rate of 200 μl/min. The unbound peptides were washed away with 5% acetonitrile in 0.4% acetic acid at 200 μl/min flow rate for 10–20 min until A280 of the flow through reached the base line. The peptides were eluted with a 20 min linear gradient of 0–200 mM of ammonium acetate in 5% acetonitrile and 0.4% acetic acid, followed by a 5 min linear gradient of 200–500 mM of ammonium acetate in 5% acetonitrile and 0.4% acetic acid at 200 μl/min flow rate. Fractions of 200 μl were collected and dried in Speed Vac. The peptides in each fraction were dissolved in 10 μl of 5% acetonitrile, 0.4% acetic acid buffer and loaded onto a 10 cm long x 75 μm ID C18 capillary column at a flow rate of 200 nl/min. The peptide elution from C18 column was achieved by 5 min isocratic flow of 5% acetonitrile and 0.4% acetic acid followed by a 45 min linear gradient of 5–40% acetonitrile in 0.4% acetic acid, and a 5 min linear gradient of 40–80% acetonitrile in 0.4% acetic acid. 
The eluted peptides were analyzed on-line by electrospray ionization tandem mass spectrometry using an LTQ mass spectrometer (Thermo Electron), which was tuned at monthly intervals for optimal performance at 2.2 kV spray voltage and 200°C capillary temperature using the MRFA ion (m/z 524.3). Xcalibur software version 1.4 SR1 was used to collect mass spectrometry data, and the mass range for the MS survey scan was m/z 400–1400. Each MS scan was followed by 5 MS/MS scans, and the data were collected using dynamic exclusion, in which a specific ion was sequenced at most twice and was then excluded from the list for 45 seconds.
The proteins in insoluble fractions were dissolved in 1X SDS-PAGE buffer, separated on 10% SDS-PAGE gels and stained with SYPRO Ruby stain (Invitrogen). Each gel lane was divided into 12 approximately equivalent pieces, the proteins were digested in-gel with trypsin overnight, and the resulting peptides were extracted (with 50% acetonitrile, 5% formic acid) and dried in a Speed Vac. The peptides were fractionated by C18 RP chromatography and analyzed on-line by mass spectrometry as above.
The mass spectrometry data was analyzed against T. brucei sequence databases using TurboSEQUEST program in BioworksBrowser 3.1 software package (Thermo Electron) in a multi-processor cluster platform. The peak lists were generated using the Sequest module of Bioworks 3.1, cluster version SR1 using the default parameters (MW range: 400–3500, precursor mass tolerance: 1.4, group scan: 25 and minimum ion count: 15). The MS/MS data was compared with v4.0 predicted protein sequence database  (www.genedb.org). The database contained 9,211 T. brucei nuclear encoded protein sequences of which 612 are annotated as hypothetical unlikely plus 18 mitochondrial encoded protein sequences (we also included mouse immunoglobulin heavy and light chains, bovine serum albumin and human keratin sequences in the database). Parallel data analysis was carried out with a polypeptide database that contained all polypeptides of ≥50 amino acids (STOP codon to STOP codon) from six-frame translated T. brucei genome sequence (total 271,892 entries) (ftp://ftp.sanger.ac.uk/pub/databases/T.brucei_sequences/T.brucei_genome_v4/). No enzyme was specified during the SEQUEST search, peptide mass tolerance was set at 1.4 and fragment ion-tolerance at 0.0 (per the default parameters recommended by the manufacturer for good quality data). No fixed modification was set for any of the amino acids but differential modification for ‘M’ was set at 15.994. The output from SEQUEST search was filtered and compiled using PeptideProphet and ProteinProphet programs [16,17] using a local semiautomated platform built upon Trans-Proteomic Pipeline (TPP) (http://tools.proteomecenter.org/software.php).
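The stop-to-stop polypeptide database described above can be illustrated with a minimal six-frame translation sketch. This is not the authors' pipeline: the function names are hypothetical, the standard genetic code is assumed for the nuclear genome, ambiguous codons are translated as "X", and segments that run to the end of a sequence without a flanking stop codon are kept.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def stop_to_stop_polypeptides(dna, min_len=50):
    """Return (strand, frame, peptide) for every stop-delimited segment >= min_len."""
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    hits.append((strand, frame, segment))
    return hits


# Toy input: 60 alanine codons flanked by stop codons.
toy = "TAA" + "GCT" * 60 + "TAG"
toy_hits = stop_to_stop_polypeptides(toy)
```

Because no start codon is required, a database built this way can match peptides that lie upstream of an annotated start codon or in regions with no annotated gene at all.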
The dataset presented here includes only doubly tryptic peptides that have a minimum peptide identification probability of 0.9 and a minimum SEQUEST cross-correlation (Xcorr) value of 1.5 for +1 ions, 1.8 for +2 ions, and 2.5 for +3 ions. We excluded any peptide containing more than one missed trypsin cleavage site, as well as any peptide containing cysteine, since no alkylation step was carried out during sample preparation. Proteins containing these peptides and having a minimum identification probability of 0.9 were considered positive.
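The acceptance criteria above can be expressed as a single predicate. The sketch below is illustrative only: `PeptideHit` is a hypothetical record rather than a SEQUEST or TPP type, a lysine or arginine followed by proline is not counted as a missed cleavage (an assumed convention the text does not state), and charge states outside +1 to +3 are rejected because no threshold is given for them.

```python
from dataclasses import dataclass

# Minimum SEQUEST cross-correlation per precursor charge, as stated in the text.
MIN_XCORR = {1: 1.5, 2: 1.8, 3: 2.5}
MIN_PROBABILITY = 0.9
CLEAVAGE_RESIDUES = {"K", "R"}


@dataclass
class PeptideHit:  # hypothetical record for illustration
    sequence: str
    prev_aa: str       # residue before the peptide in the protein, "-" at a terminus
    next_aa: str       # residue after the peptide, "-" at a terminus
    charge: int
    xcorr: float
    probability: float


def missed_cleavages(sequence):
    """Count internal K/R sites; K/R followed by P is skipped (assumed rule)."""
    return sum(
        1
        for i, aa in enumerate(sequence[:-1])
        if aa in CLEAVAGE_RESIDUES and sequence[i + 1] != "P"
    )


def is_doubly_tryptic(hit):
    n_term_ok = hit.prev_aa in CLEAVAGE_RESIDUES or hit.prev_aa == "-"
    c_term_ok = hit.sequence[-1] in CLEAVAGE_RESIDUES or hit.next_aa == "-"
    return n_term_ok and c_term_ok


def passes_filter(hit):
    threshold = MIN_XCORR.get(hit.charge)
    if threshold is None:  # no threshold given for other charge states
        return False
    return (
        hit.probability >= MIN_PROBABILITY
        and hit.xcorr >= threshold
        and is_doubly_tryptic(hit)
        and missed_cleavages(hit.sequence) <= 1
        and "C" not in hit.sequence  # cysteines were not alkylated
    )
```

For example, `passes_filter(PeptideHit("LSDEAVK", "K", "A", 2, 2.1, 0.95))` is accepted, while the same peptide reported as a +3 ion with the same Xcorr falls below the 2.5 threshold and is rejected.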
In selected cases homology searches against the TriTryp databases were carried out using OmniBLAST to identify any homologous or related proteins/genes. The probable functions of the proteins were assigned based on GeneDB annotation, and for proteins with unknown function possible motifs and domains were searched for in the PROSITE, InterPro and CDD databases.
The data from different experiments were stored and accessed via the Proteomics module of the SBEAMS (Systems Biology Experiment Analysis Management System) database (http://www.sbeams.org/Proteomics/) built on MS SQL Server 2000. The results from SEQUEST, PeptideProphet and ProteinProphet analyses were imported into the database using built-in Perl scripts. The results from single experiments or sets of experiments were filtered with specific parameters (as above and in the Results section) and compiled using SQL queries, and the output data were saved in new tables. A web-based database (www.TrypsProteome.org) was developed using the Microsoft .NET Framework 1.1. It uses a web service to retrieve data from the compiled tables and from locally stored T. brucei GeneDB and Gene Ontology (GO) databases in .xml format.
We used a combination of cellular fractionation (non-ionic detergent soluble and insoluble) and sub-cellular fractionation (enrichment of mitochondrial vesicles) followed by protein fractionation (1D-gel) and peptide fractionation (SCX/RP chromatography) techniques to enhance the coverage of peptides in mass spectrometry analyses. The peptides from Triton X-100 soluble supernatant fractions of whole cells and isolated mitochondria were fractionated by two-dimensional liquid chromatography and analyzed by tandem mass spectrometry (2D-LC-MS/MS) (see Supplementary Figure 1 for representative results). Reciprocally, the proteins in insoluble pellet fractions were fractionated by size in 1D SDS-PAGE gels, and peptides were generated from multiple fractions and analyzed by RP-LC-MS/MS. The mass spectrometry data were analyzed and compiled as described in the Methods section and uploaded to the SBEAMS database. In the whole cell detergent soluble supernatant 1,689 proteins were identified by 2D-LC-MS/MS analysis, and 810 proteins were identified in the pellet fraction by 1D-gel-LC-MS/MS analysis. There were 477 proteins in common between these two datasets; thus 2,022 proteins were identified in the whole cell sample by MS/MS analysis. Similar analyses identified 1,548 proteins in the mitochondrial enriched fraction, of which 673 proteins were also identified in the whole cell analysis (Figure 1). Thus 875 additional proteins were identified by analysis of the mitochondrial enriched fraction compared to the whole cell fraction, and in total 2,897 proteins were identified in these analyses, of which 1,333 have been assigned known or putative function(s) (Supplementary Table 1). These results represent a substantial proportion of the T. brucei PF cellular proteome at mid-log phase of growth, and they also show that, for a complex proteome like that of T. brucei, analysis of sub-cellular/organelle fractions is required for maximal proteome coverage.
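The protein totals quoted above follow from simple inclusion-exclusion over the overlapping datasets. The short check below uses only the counts given in the text; the variable and function names are illustrative.

```python
def union_size(size_a, size_b, shared):
    """Size of the union of two sets given their sizes and their overlap."""
    return size_a + size_b - shared


# Whole cell: detergent-soluble supernatant and insoluble pellet.
whole_cell = union_size(1689, 810, 477)

# Mitochondrial enriched fraction, and its overlap with the whole cell set.
mito_fraction = 1548
mito_shared_with_whole_cell = 673
mito_only = mito_fraction - mito_shared_with_whole_cell

total_identified = union_size(whole_cell, mito_fraction, mito_shared_with_whole_cell)

print(whole_cell, mito_only, total_identified)
```

The three values agree with the 2,022 whole cell proteins, 875 additional proteins, and 2,897 total proteins reported in the text.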
Compiling the results from multiple mass spectrometry runs and from different sample preparation methods enhanced the protein coverage. Between runs, 65–75% of the same proteins were detected, depending on sample complexity, and second runs yielded a 13–32% increase in the proteins identified in highly complex samples, as seen by others.
In these analyses 12,131 unique peptides were identified (additionally 727 of these peptides were also identified in modified form), and 916 proteins were matched to these by a single peptide hit and 1,981 proteins by two or more peptide hits. The entire latter group of proteins had a very high protein identification probability (≥0.99 for 1,972 proteins and 0.98 for the other 9 proteins). Of the proteins identified with one peptide match 664 (72%) had protein identification probability of ≥0.99, 111 (12%) had 0.98 and the rest between 0.9 and 0.97. The identified peptide sequences and associated probability values are presented in Supplementary Table 1.
The T. brucei genome, excluding the pseudogenes, has had 9,211 protein coding sequences predicted (v4.0 database), 36% of which are annotated as encoding proteins with known or putative function, 51% as hypothetical conserved, 6% as hypothetical and 7% as hypothetical unlikely. However, of the 2,897 proteins identified in our mass spectrometry analysis representing the partial proteome 46% have assigned functions, 53% annotated as hypothetical conserved, ~1% as hypothetical and we did not identify any protein from hypothetical unlikely group (Figure 2). We observed a similar proportion in analyses of the BF cellular proteome (Results not shown). These results indicate that only a small proportion of the genes annotated as hypothetical and virtually none of the genes annotated as hypothetical unlikely are actually expressed in T. brucei cells. It also shows that a majority (but probably not all) of the genes annotated as hypothetical conserved are expressed in the cell. If we assume that almost all of the proteins annotated with assigned functions are expressed in cell during some stage then based on the observed ratio to proteins of unknown function it would extrapolate to ~7,210 proteins being expressed in T. brucei cells.
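The extrapolation to ~7,210 expressed proteins is not spelled out step by step. One reading that approximately reproduces the figure, offered here as an assumption rather than the authors' stated calculation, is to take all genes with assigned function as expressed and divide by the fraction of identified proteins that have assigned function.

```python
predicted_genes = 9211
frac_assigned_in_genome = 0.36      # genes annotated with known or putative function
frac_assigned_in_proteome = 0.46    # identified proteins with assigned function

# Assumption: essentially all function-assigned genes are expressed at some stage.
assigned_genes = predicted_genes * frac_assigned_in_genome

# If assigned-function proteins make up 46% of what is expressed,
# the total number of expressed proteins scales accordingly.
expressed_estimate = assigned_genes / frac_assigned_in_proteome

print(round(expressed_estimate))
```

With the rounded percentages from the text this gives roughly 7,209, within rounding of the quoted ~7,210.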
The acquired MS/MS data was also compared with predicted polypeptide sequences from nucleic acid (NA) database. While there was a very good concordance compared to the results obtained from v4.0 protein database some discrepancies were also apparent, especially in probability values of the identified peptides, and using the cut-off described above it missed some of the peptides identified by comparison to protein database (Results not shown). We did not identify 243 of the peptides that were identified in comparison to v4.0 protein database upon comparison of the MS/MS data to NA database (based on the best hit criteria, line 1 of .out file). It resulted in non-detection of 23 proteins, all originally identified with only one peptide hit against protein database. This predicts an error rate of 2% at peptide assignment level and 0.8% at protein assignment level.
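The quoted error rates can be reproduced from counts given in the text, assuming the denominators are the 12,131 unique peptides and the 2,897 proteins identified against the v4.0 protein database (the text does not name the denominators explicitly).

```python
unique_peptides = 12131
identified_proteins = 2897

peptides_not_reproduced = 243   # missed when searching the NA-derived database
proteins_not_reproduced = 23    # all originally single-peptide identifications

peptide_error_rate = peptides_not_reproduced / unique_peptides
protein_error_rate = proteins_not_reproduced / identified_proteins

print(f"peptide level: {peptide_error_rate:.1%}")
print(f"protein level: {protein_error_rate:.1%}")
```

Both values round to the 2% and 0.8% stated above.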
We identified 146 unique peptides that correspond to predicted polypeptide sequences that are either not annotated as predicted genes (n=53) or to annotated genes (n=22). Four of the latter 22 genes more recently have been annotated as predicted genes; and one or more peptides in the other 18 annotated genes matched to predicted amino acid sequences upstream of currently annotated AUG start codon (Supplementary Figure 2). In 14 of these genes a start codon could be predicted upstream of the identified peptide(s). However, the other 4 lack a start codon upstream of the identified peptide sequence indicating possible sequencing error or an alternative start codon in these proteins. It is also possible that the strain used for proteomic studies may have slight differences in genomic sequences that could be reflected in different start codon upstream of the identified peptide sequences. Sequencing error appears to be most likely, especially since the homologs of 3 of these proteins (Tb927.3.2740, Tb927.3.4920 and Tb10.70.3350) are larger in both L. major and T. cruzi and span the polypeptide region identified upstream of the annotated start codon. Thus, this proteomic study identifies the start codon for a group of genes.
Of the 53 polypeptides identified in this analysis that do not map to currently annotated genes, 13 had no predicted start codon upstream of the matched peptide sequences and all were identified by single peptide hit, indicating that they may be false hits. As above these results are well within the estimated error range. Thus at an increased confidence level, 19 new ORFs were identified by two or more peptide matches and are likely bona-fide ORFs that were missed in the GeneDB annotation (Supplementary Figure 3A). Five of these proteins belong to the retrotransposon hot spot (RHS) protein group and 11 others have varying degrees of homology to predicted T. cruzi and/or L. major proteins, and the 3 other have homology to polypeptides predicted from T. cruzi (and L. major in 2 of the cases) genome sequences (Supplementary Table 2A). Eleven other polypeptides (including 3 belonging to RHS protein group), were each identified by only one peptide match (Supplementary Figure 3B) but have similarities to predicted T. cruzi and L. major proteins (Supplementary Table 2B) and thus are also likely bona-fide ORFs. The other ten polypeptides were identified by single peptide hits (Supplementary Figure 3C) but have no significant homology to annotated T. cruzi or L. major proteins. However, eight of those have some similarities to predicted polypeptides from T. cruzi and/or L. major contig sequences (Supplementary Table 2c). Thus this study identified 40 additional ORFs (30 with high confidence and 10 possible) that were missed in the GeneDB annotation.
We assessed the ability of available software [Mitoprot (http://ihg.gsf.de/ihg/mitoprot.html), SignalP (http://www.cbs.dtu.dk/services/SignalP/), Predotar (http://urgi.versailles.inra.fr/predotar/predotar.html), TargetP (http://www.cbs.dtu.dk/services/TargetP/) and PSORT (http://psort.nibb.ac.jp/form2.html)] to predict the localization of T. brucei proteins to the mitochondrion using a set of known mitochondrial and non-mitochondrial proteins. The results showed that different programs have different levels of sensitivity vs. specificity and that the correlations between programs were poor. Overall, Mitoprot and SignalP performed better than the other programs (Results not shown). Representative results obtained from the Mitoprot and SignalP programs are shown in Figure 3, where the relative scores obtained for each protein are plotted for a set of known mitochondrial proteins (ten proteins each from the editosome and MRB complex 1) in panel A and a set of known non-mitochondrial proteins (ten proteins each from the glycosome and cytoplasm) in panel B. The results showed that while the Mitoprot program was able to identify most of the known mitochondrial proteins, it also had the highest false positive rate, predicting known non-mitochondrial proteins as being mitochondrial. Even the two programs combined failed to identify approximately 25% of the mitochondrial proteins (see Supplementary Table 4 for relative scores of proteins assigned to mitochondria in this study). Thus the available programs are of limited use in predicting localization of proteins to the mitochondrion in T. brucei, and additional qualifying criteria are required for sub-cellular assignment of proteins in this organism.
In this study we identified 1,548 proteins in mitochondria enriched fraction of which 607 (39%) have been assigned with a function. A randomized statistical approach was used to assess the coverage of mitochondrial proteome achieved in this analysis. We selected the proteins having ‘mitochondrion/mitochondrial’ ‘text’ in GO cellular component assignment, and calculated the proportion of those proteins that were identified in our shotgun proteomic analysis. The results indicate that we have identified ~86% of mitochondrial proteome (results not shown). This is supported by our observation that we only missed detecting one of the twenty annotated editosome proteins which are in low abundance in the mitochondrion.
We anticipate that not all of the proteins identified in this fraction will be mitochondrial. This is due in part to the fact that the single structurally complex mitochondrion is disrupted and reseals as vesicles during isolation, and also due to sample cross contamination. Based on available GO annotation, key-word search in protein description and literature references [11,13] 139 of 607 proteins that are assigned with function(s) (23%) appear to be non-mitochondrial. These proteins are assigned to other compartments of cell such as cytoplasm (8.7%), glycosome (6.6%), cytoskeleton and flagellum (4.4%), nucleus (2%) and others (Supplementary Table 3C). Additionally 52 other proteins (8.6%) may localize to membrane (Supplementary Table 3A), and we anticipate a large proportion of those would be associated with mitochondrial membrane. While 194 (32%) proteins could be assigned to mitochondrion (Table 1), the other 222 proteins (36.6%) have not been assigned to any cellular compartment (Supplementary Table 3B). It is likely that the majority of the 941 proteins that were identified in mitochondrial enriched fraction that have no known function will also be mitochondrial and the above ratio will be reflected in this group. Overall of the 385 proteins that are assignable (demonstrated or putative) to a specific sub-cellular compartment ~44% appear to be non-mitochondrial. We anticipated part of the proteins that are assigned to membrane will be non-mitochondrial. Thus, the other 56% are likely mitochondrial proteins and hence by extrapolation from our current coverage the T. brucei mitochondrial proteome is predicted to consist of ~1,008 proteins.
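The estimate of ~1,008 mitochondrial proteins can be reproduced from the figures in the text if the 1,548 proteins identified in the mitochondrial enriched fraction are scaled by the ~56% judged to be mitochondrial and then divided by the ~86% coverage estimated from GO annotation. This chain of steps is inferred from the numbers and is not stated explicitly as a formula in the text.

```python
# Compartment assignment of the 607 function-assigned proteins (counts from the text).
non_mitochondrial = 139
membrane = 52
mitochondrial = 194
unassigned = 222
function_assigned = non_mitochondrial + membrane + mitochondrial + unassigned
assignable = non_mitochondrial + membrane + mitochondrial

# Extrapolation to the whole mitochondrial proteome.
identified_in_mito_fraction = 1548
frac_mitochondrial = 0.56   # ~44% of assignable proteins appear non-mitochondrial
coverage = 0.86             # share of GO-annotated mitochondrial proteins detected

proteome_estimate = identified_in_mito_fraction * frac_mitochondrial / coverage

print(function_assigned, assignable, round(proteome_estimate))
```

The category counts sum to the 607 function-assigned proteins and the 385 assignable proteins, and the extrapolation rounds to 1,008.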
In general the mitochondrial and glycosomal proteins were identified with higher peptide coverage (peptide count) in organelle enriched fraction compared to the whole cell fraction (Table 1, Supplementary Table 3C). Reciprocally cytoplasmic and nuclear proteins were identified with higher peptide coverage in the whole cell fraction (Supplementary Table 3C). Similarly in recent proteomic studies from T. brucei large sets of mitochondrial proteins were identified in glycosome enriched fraction in addition to proteins from other cellular compartments [11,13]. Thus solely based on the identification of proteins in a specific organelle enriched fraction, they may not be assignable to the specific sub-cellular compartment and additional studies are required to determine their true localization. However, qualitatively proteins identified only in mitochondrial enriched fraction or at significantly higher peptide coverage compared to the whole cell fraction can be assigned to mitochondrion with higher confidence, especially for those lacking a predicted glycosomal localization signal.
To increase the confidence of protein assignments to the mitochondrion we carried out a glycerol gradient experiment in which the Triton X-100 soluble fraction of enriched mitochondria was fractionated, and fractions from the high ‘S’ value region (~20S, ~40S and ~80S) were analyzed by mass spectrometry. We anticipate that a majority of the proteins identified in these fractions (see Supplementary Figure 4 for the SDS-PAGE protein profile of the fractions) are associated with multi-protein complexes, which are probably mitochondrial. We identified 633 proteins by analysis of these 3 different fractions that were also identified in the mitochondrial enriched fraction. In this group 336 (53%) proteins are currently annotated as hypothetical (no known function) (Table 2 and Supplementary Table 5), and 297 (47%) have been assigned function(s). In the latter group 134 proteins (45%) are assignable to the mitochondrion (Table 1), 31 (10.4%) to the glycosome, and 16 (5.4%) to membrane, and 101 proteins (34%) have not been assigned to any cellular compartment (Supplementary Table 3). Only 5% of the proteins are assignable to another cellular compartment such as the cytoplasm or nucleus, compared to 16.3% in the mitochondrial enriched fraction. Indeed some proteins that are assigned to the cytoplasm, such as dihydrolipoamide dehydrogenase proteins (Tb927.3.4390 and Tb927.8.7380) and heat shock protein HslVU (Tb927.5.1520 and Tb11.01.4050), were identified only in mitochondrial fractions or with significantly higher peptide coverage in this fraction compared to the whole cell fraction. The data indicate that these proteins are mitochondrial although they had been assigned to the cytoplasm (Supplementary Table 3). Similarly, fructose-1,6-bisphosphatase (FBPase; Tb09.211.0540), which is annotated as cytosolic, may localize to the glycosome.
DNA topoisomerase II (TOP2; Tb09.160.4090), which localizes to the mitochondrion and was identified only in the mitochondrial enriched fraction, is currently assigned to the nucleus in the GO database. In addition, preliminary results from our lab indicate that heat shock proteins HslVU (Tb927.5.1520 and Tb11.01.4050) are mitochondrial (Acestor N, unpublished results). Thus, results from this study substantially refine the sub-cellular assignment of a large set of proteins. We estimate that most (>95%) of the 633 proteins identified in the glycerol gradient fractions localize to the mitochondrion or glycosome.
We assigned 194 of the proteins identified by mass spectrometry analyses to the mitochondrion based on known/putative function, keyword search, GO annotation and publications [11,13,21,23] and grouped those in Table 1 by association with various biological processes. The results showed a large sub-set of the proteins that we identified in glycerol gradient samples are associated with mitochondrial multi-protein complexes, e.g. we identified protein components of respiratory complexes I-V, RNA editing complex, and ribosome etc. In recent studies from our laboratory we have determined the composition of several mitochondrial complexes using affinity tag, monoclonal antibody affinity purification and mass spectrometry analyses [20,21,23]. Of the 336 proteins identified in glycerol gradient sample that have no known function 20 proteins are associated with respiratory complex I and the MRB complex 1 , 72 proteins with mt ribosomes , and 40 others are associated with other mitochondrial complexes (Alena Zikova, manuscripts in preparation). In total we have identified 207 proteins with unknown function that were in the mitochondrial enriched fraction and are associated with mitochondrial complexes (Table 2). These proteins are currently annotated as hypothetical but conserved motifs/domains that are indicative of possible function(s) were identified in 59 of these proteins (Table 2).
We anticipate that the majority of the 204 proteins with unknown function that were identified in glycerol gradient fraction but have not been assigned to any complexes based on our current knowledge are potential mitochondrial proteins (Supplementary Table 5). Six of these proteins appear to be glycosomal based on motifs and targeting signal and 2 other are non-mitochondrial based on peptide coverage compared to whole cell fraction. Furthermore, more than 500 proteins with unknown function were identified in the mitochondrial enriched fraction. These have not been included in the high confidence list, but a substantial proportion of them are likely to be mitochondrial. As indicated in section 3.3, peptide count information would be applicable for preliminary assignment to a sub-set of those proteins, however further qualifying criteria are required to assign them to mitochondria. In Supplementary Table 6, we provide a list of 283 proteins that are likely mitochondrial. These include proteins identified with at least 2 peptides in mitochondrial enriched fraction and detected only or with higher peptide count compared to whole cell fraction. We also excluded any proteins that have putative a glycosomal targeting signal.
Overall, in this study 401 proteins were assigned to the T. brucei mitochondrion (Tables 1 and 2) based on their assigned function, GO annotation, and/or specific association with mitochondrial complexes. The study also provides a list of 196 high confidence candidate proteins (Supplementary Table 5), the majority of which are expected to be associated with mt complexes, and 283 likely mitochondrial proteins (Supplementary Table 6) that need further follow-up for definitive assignment.
The dataset presented here, and that generated in our T. brucei mitochondrial proteome project, is being made available via a website (www.TrypsProteome.org). The database is searchable by several fields and is also linked to the GeneDB database. Currently, information on protein identification, the identified peptides, and their respective probability values is available on the website. The mass spectrometry identification of peptides and their assignment to specific proteins were carried out using well established and widely accepted criteria; however, it is important to note that they are based on statistical confidence levels, and the possible error rate explained above should be taken into consideration when qualifying the data. The database is planned to include detailed information on complex composition, sedimentation, and immunolocalization as it is generated.
A combination of sub-cellular, protein, and peptide fractionation was used along with high throughput tandem mass spectrometry for a comprehensive characterization of the mitochondrial proteome of PF T. brucei. Analysis of the acquired data confirmed the validity of the vast majority of predicted genes in T. brucei and identified several genes that were not previously predicted. This analysis also showed that the set of genes annotated as hypothetical (species specific) is over-predicted and that virtually all genes annotated as hypothetical, unlikely are not expressed. Overall, the proteomic analysis extrapolated to a total of 1,008 mitochondrial proteins. Of these, specific assignments were made with progressively less stringent criteria for 401, 196, and 283 proteins. The balance was estimated by statistical methods, but mitochondrial assignments could not be made to individual proteins. The data have been placed in a publicly available web-based database that we constructed. Analyses of the data reveal a complex and divergent gene expression system, divergent energy production machinery, and a less divergent metabolic system.
The study generated high quality data on a substantial proportion (more than a third) of the cellular proteome and was useful in assessing the initial genome annotation. Complete coverage of the T. brucei cellular proteome will require further studies that analyze other sub-cellular fractions, such as the cytoplasm and nucleus, and also the BF stage of the parasite. The use of stringent cut-off criteria and statistical approaches for peptide and protein assignments [16,17], together with identification of a large proportion of proteins with two or more peptides, allowed protein assignment at high confidence. The error rate was <1%, and the stringent criteria may have excluded some proteins for which mass spectrometry data were obtained. For example, two proteins with similarities to VSGs, Tb11.24.0011 and Tb11.43.0001, were identified, each with a single different peptide match; thus either these proteins are expressed in PF, which is not expected for true VSGs, or these assignments are incorrect. Protein assignments were based on assessment of the gene prediction models by comparison to the predicted protein database and to a six-frame STOP-to-STOP translated genome sequence database of ≥50 amino acids length. This resulted in identification of protein coding ORFs that were not originally predicted, which is similar to the experience with Plasmodium falciparum. The occurrence of homologs of the newly identified ORFs in the L. major and/or T. cruzi genomes enhances the confidence in gene identification. While a large proportion of the newly identified genes were in the typical size range, some were small; for example, three encode 49 (Tb11.11283), 55 (Tb02.7762), and 57 (Tb11.29019) amino acids (Supplementary Figure 3). This suggests that the whole genome encodes additional smaller ORFs that have not yet been annotated. The proteome data also allowed reassignment of start codons for some genes.
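A six-frame STOP-to-STOP translation with a minimum length filter can be sketched as follows. This Python example is a minimal illustration, not the procedure used to build the database searched in this study: it assumes the standard genetic code, treats sequence ends as segment boundaries, marks codons containing ambiguous bases as "X", and uses hypothetical function names.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumption:
# the standard code applies to the nuclear genome sequence being translated).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(dna: str, min_len: int = 50):
    """Collect stop-delimited segments of at least min_len residues.

    Returns (strand, frame, peptide) tuples for the three forward and
    three reverse-complement reading frames.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append((strand, frame, segment))
    return orfs


# Toy input: 50 alanine codons flanked by stop codons.
toy = "TAA" + "GCT" * 50 + "TAA"
for strand, frame, peptide in six_frame_orfs(toy):
    print(strand, frame, len(peptide))
```

Searching spectra against such stop-delimited segments, rather than only against annotated genes, is what allows peptides from unpredicted ORFs to be matched.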
No proteins annotated as hypothetical, unlikely were detected, suggesting that these are not functional genes or, at a minimum, are not expressed in the stages examined. Less than 1% of the proteins identified were predicted from genes in the hypothetical (brucei specific) group, while protein coverage for genes annotated as hypothetical, conserved was very high. Thus most genes annotated as hypothetical, conserved are expressed in PF T. brucei, while only a small fraction of genes annotated as species specific are expressed. None of the 18 kinetoplast encoded proteins were identified, possibly due to characteristics of these proteins, including high hydrophobicity, membrane association, failure to migrate into SDS-PAGE gels, and few potential trypsin cleavage sites, which make their peptide identification by routine mass spectrometry analysis difficult.
The large, complex, tubular mitochondrion ruptures and reseals upon cell breakage; thus isolation by the established protocol that was used [15,25] may result in mitochondrial vesicle preparations that contain abundant soluble non-mitochondrial proteins, as well as organelles with physical characteristics similar to those of the mitochondrial vesicles. Indeed, glycosome preparations have been reported to contain a substantial proportion of mitochondrial proteins [11,13], and in this study glycosomes were commonly present in purified mitochondrial preparations. Proteins from other compartments of the cell, such as the cytoplasm and nucleus, which were identified via their predicted functions, were diminished in mitochondria compared to whole cells as determined by peptide frequency (Supplementary Table 3C). The enrichment of mitochondrial and glycosomal proteins is more evident in the glycerol gradient fractions than in the total mitochondrial lysates. In general, peptide coverage was greater for mitochondrial proteins and lower for cytoplasmic and nuclear proteins in the mitochondrial fraction when compared to the whole cell fraction analyzed under similar conditions.
The shotgun proteomic experiments described here do not provide direct experimental evidence for localization of proteins to a specific organelle; however, they provide candidate proteins for downstream analyses. The results from the glycerol gradient fraction analysis provided a large set of candidate proteins for further characterization, such as identification of associated complexes by TAP-tag analyses, which are being carried out in our laboratory in a high throughput manner. In recent studies we have isolated and determined the composition of several multi-protein complexes from mitochondria [20,21,23]. We considered any protein specifically associated with a mitochondrial complex or mitochondrial process to be mitochondrial; e.g., core sub-units of respiratory complex I and its associated protein components would be mitochondrial. Using this criterion, we extended the assignment of proteins to the mitochondrion for a large set of proteins with known function beyond the currently available GO annotation (Table 1) and also for a large set of proteins with unknown function (Table 2).
The mitochondrial proteome of T. brucei is estimated to have just over one thousand proteins, and 880 proteins were assigned to the mitochondrion at varying levels of confidence (Tables 1 and 2, and Supplementary Tables 5 and 6). This coverage correlates well with the proteomic analyses of yeast mitochondria, in which 851 proteins were identified in the purified organelles and which detected 84% of the confirmed mitochondrial proteins [18,26]. Thus the trypanosome mitochondrial proteome is larger than that of yeast. This might reflect additional biological processes such as RNA editing and the metabolic complexities associated with the life cycle. The protein import process may also be more complex, as seen in our analysis, where mitochondrial signal prediction tools designed for yeast and human have limited capabilities in T. brucei (Figure 3). It is possible that multiple mechanisms exist for import of proteins into trypanosome mitochondria. The high confidence protein dataset provided in this study (Tables 1 and 2) can be used as a reference for creating new tools, or modifying existing ones, for prediction of mitochondrial import in this group of parasites.
The trypanosome mitochondrion contains complex machinery for energy metabolism through oxidative phosphorylation, iron-sulfur clusters, and a network of metabolic processes. The proteomic analyses identified many of the core components of respiratory complexes I-V. The presence of functional complex I in T. brucei is controversial; however, we recently reported the purification, by mAb affinity and TAP-tag methods, of a complex that contains proteins with homology to those in complex I, including some of the core sub-units. This study identified additional potential core components of complex I, including NDUFA5 (Tb10.70.3150), NDUFA9 (Tb09.244.2620 and Tb10.05.0070), NDUFA13 (Tb11.01.0640), and NDUFS7 (Tb11.47.0017), in the mitochondrial enriched fraction. Further, NDUFB9 (Tb11.01.7460) has been identified by mass spectrometry in a mitochondrial membrane enriched fraction (Nathalie Acestor, manuscript in preparation). Thus we were able to identify peptides corresponding to all of the core sub-units of complex I that could be identified by homology search against the human/bovine protein database. Interestingly, we identified another protein in this study, Tb927.4.1130, which has a domain related to the LYR family complex 1 protein, indicating that it may be part of this complex. Complex I also contains numerous proteins of unknown function, illustrating its divergence in trypanosomes. We have not detected peptides from the core components of complex I in the BF cell proteome (unpublished results), suggesting that they are either absent in BF or present at a much lower level than in PF cells. In contrast, the TAO had higher peptide coverage in BF cells than in PF cells (unpublished results), as did the Tb10.6k15.0550 protein, which is currently annotated as hypothetical but has an alternative oxidase (AOX) domain (previously termed TbAOX2), suggesting a role in energy metabolism.
Numerous enzymes associated with various metabolic processes as well as components associated with amino acid metabolism, protein biosynthesis and turnover, and lipid metabolism were identified in trypanosome mitochondria (Table 1). The functional assignment for most of these proteins has been possible based on sequence conservation with other organisms (www.genedb.org).
The mitochondrial DNA network, termed kDNA, is composed of a few dozen maxicircles and a few thousand minicircles. The protein components associated with kDNA replication, such as polymerases and topoisomerase II, were identified, as was the RNA polymerase associated with transcription [28,29]. The mt RNAs are post-transcriptionally modified by RNA editing, a process that has been well characterized. Three different endonucleases that are required for the editing process have been identified. However, how the poly-cistronic transcripts are processed prior to editing is not known, and it is anticipated that at least one endonuclease will be required for cleavage of such RNAs. In this study, numerous proteins of unknown function that have motifs indicating association with nucleic acids, as well as candidate nucleases, were identified, some of which may play a role in the above processes.
In summary, 1,548 proteins were identified in the T. brucei mitochondrial enriched fraction, of which 1,008 are estimated to be mitochondrial, with 880 specific proteins assigned to the mitochondrion at varying levels of confidence. Of the 880 proteins, 401 were assigned to the mitochondrion at high confidence, and additional studies are required for definitive sub-cellular assignment of the remaining proteins. Importantly, the results showed that a large proportion of the proteins assigned to the mitochondrion have no known function. This is consistent with our recent data on the compositions of multi-protein mitochondrial complexes, which show that a large fraction of the components of these complexes are unique to trypanosomes [21,23]. This reflects the complexity, and the structural and functional divergence, of multi-protein complexes and biological processes in trypanosome mitochondria. A web-based database (www.TrypsProteome.org) has been created for dissemination of proteomic data from this and ongoing proteomic studies in our laboratory.
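The relationship between the totals reported in this study can be checked with simple arithmetic. The short Python sketch below uses only counts stated in the text; the variable names and tier labels are illustrative.

```python
# Counts reported in this study.
identified_in_mito_fraction = 1548   # proteins identified in the enriched fraction
estimated_mitochondrial = 1008       # estimated size of the mitochondrial proteome
tiers = {"high": 401, "moderate": 196, "lower": 283}  # assigned by confidence tier

assigned = sum(tiers.values())                    # 401 + 196 + 283 = 880
balance = estimated_mitochondrial - assigned      # 1,008 - 880 = 128

print(f"Individually assigned proteins: {assigned}")
print(f"Estimated statistically, not individually assigned: {balance}")
print(f"Share of the estimate individually assigned: {assigned / estimated_mitochondrial:.1%}")
print(f"Share of the enriched fraction estimated to be mitochondrial: "
      f"{estimated_mitochondrial / identified_in_mito_fraction:.1%}")
```

The balance of 128 proteins corresponds to the portion of the estimated proteome for which no individual assignments could be made.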
We thank Eric Deutsch for help with the SBEAMS database and Christiane Hertz-Fowler for providing the six-frame translated genome database. T. brucei protein sequences were downloaded from the GeneDB site (www.genedb.org). Research was conducted using equipment made possible by the Economic Development Administration - US Department of Commerce and the M.J. Murdock Charitable Trust. This work was supported by NIH grant AI065935.