Chloroplasts are essential organelles of prokaryotic origin and carry out a wide range of metabolic functions. The chloroplast genome encodes only about 100 proteins, whereas the vast majority of the chloroplast proteome is encoded by the nuclear genome. These nuclear-encoded proteins are generally synthesized as precursor proteins with cleavable N-terminal chloroplast transit peptides (cTPs)
[1]. Several subcellular localization programs, such as TargetP
[2], are available that predict these cTPs, with the number of predicted chloroplast (plastid) proteins ranging from about 1500 to 4500
[3],
[4]. However, several known plastid proteins appear to have no obvious cTP, and chloroplast outer envelope proteins never have a cleavable cTP (for discussion see
[5]–
[7]). It was recently suggested that an
Arabidopsis thaliana (from here on referred to as
Arabidopsis) chloroplast protein (a carbonic anhydrase) takes an alternative route through the secretory pathway, and becomes N-glycosylated before entering the chloroplast
[8]. It is possible that more chloroplast proteins follow this route. Large-scale experimental plastid proteomics studies are needed to evaluate unusual targeting pathways and to provide new training sets to improve subcellular localization prediction.
Driven by developments in mass spectrometry (MS), the
Arabidopsis chloroplast proteome has been analyzed by MS in combination with various protein fractionation techniques to assign proteins to chloroplast compartments (reviewed in
[9]–
[11]). Collectively, these studies identified 1090 proteins (counting 1 gene model per protein), with an overall cTP prediction rate of 60% by TargetP (data not shown). However, from manual evaluation we estimate that 300–350 proteins likely represent false positive identifications and/or non-chloroplast contaminants. This shows that neither uncurated experimental proteomics data from isolated subcellular compartments nor localization predictors are of sufficient quality for reliable localization. However, the combination of multiple independent proteomics experiments, ideally from all compartments of a cell, together with cross-correlation to detailed functional and localization studies (e.g. with GFP fusion proteins), may allow high-quality subcellular localization and functional annotation
[12]. Currently, this curation process cannot be fully automated and requires manual supervision. Thus, more experimental work and curation are needed to obtain a more in-depth and accurate overview of the chloroplast proteome in
Arabidopsis.
Protein accumulation levels within a cell, or within a subcellular compartment such as the chloroplast, span five to ten orders of magnitude. To understand chloroplast function and homeostasis and to accommodate systems biology approaches to model genetic and metabolic networks
[13], it is important to determine protein accumulation levels. A recent analysis of the
Arabidopsis stromal proteome used gel-based quantification to rank the abundance of 240 stromal proteins spanning several orders of magnitude
[14]. The challenge is now to obtain accurate quantification for a larger percentage of the chloroplast proteome. Recently, large-scale MS-based studies of yeast, humans,
E. coli and other sequenced organisms have shown that the number of MS/MS spectra matched to a protein (spectral counts, SPC) correlates positively with protein abundance
[15]–
[18]. With control of several experimental conditions, careful and stringent spectral assignments, and sophisticated normalization procedures, MS-based quantification appears to provide an attractive and sensitive tool for large-scale measurement of relative protein concentrations. For further review and discussion we refer to
[19]–
[21]. These new developments provide an excellent opportunity for quantification of the chloroplast proteome as will be demonstrated in the current study.
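The idea behind spectral counting can be illustrated with a minimal sketch. It assumes one common style of normalization, in which each protein's spectral count is divided by its length and the results are rescaled to sum to 1; this is not necessarily the normalization applied in the current study. The function name, protein names, counts and lengths are all hypothetical.

```python
def normalized_spectral_abundance(spectral_counts, protein_lengths):
    """Convert raw spectral counts into length-normalized relative abundances.

    Each count is divided by the protein length (longer proteins tend to
    yield more peptides), and the results are rescaled so they sum to 1.
    """
    per_residue = {}
    for protein, count in spectral_counts.items():
        length = protein_lengths[protein]
        if length <= 0:
            raise ValueError(f"non-positive length for {protein}")
        per_residue[protein] = count / length

    total = sum(per_residue.values())
    if total == 0:
        raise ValueError("no spectra were matched to any protein")
    return {protein: value / total for protein, value in per_residue.items()}


# Hypothetical counts and lengths, for illustration only.
example_counts = {"protein_A": 120, "protein_B": 30, "protein_C": 6}
example_lengths = {"protein_A": 480, "protein_B": 240, "protein_C": 600}

abundances = normalized_spectral_abundance(example_counts, example_lengths)
ranked = sorted(abundances, key=abundances.get, reverse=True)
print(ranked)
```

Dividing by length before rescaling keeps a long, low-abundance protein from outranking a short, abundant one simply because it produces more detectable peptides.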
The half-life and function of proteins are often influenced by post-translational modifications (PTMs). N-terminal modifications of chloroplast proteins have been shown to be important for chloroplast viability. For instance, N-terminal acetylation in the cytosol of nuclear-encoded chloroplast proteins is required for chloroplast function
[22]. Furthermore, both chloroplast-localized deformylase
[23]–
[26] and methionine aminopeptidase are essential for
Arabidopsis seedling viability
[27],
[28]. It is quite likely that these N-terminal modifications improve protein stability
[29], for example to avoid degradation by the abundant chloroplast Clp protease system
[30]. However, no systematic experimental analysis of N-termini of
Arabidopsis chloroplast proteins has been carried out so far. PTMs, such as N-terminal acetylation, typically lead to a well-defined change in molecular mass that can often be detected by high-quality MS. The rapid improvements in MS instrumentation, exemplified by the linear ion trap triple quadrupole (LTQ) Fourier transform ion cyclotron resonance and LTQ-Orbitrap instruments, now facilitate high-throughput PTM analysis
[31]–
[35].
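The statement that a PTM produces a well-defined mass change can be made concrete with a small sketch. It uses the standard monoisotopic mass shifts for acetylation (+42.010565 Da) and formylation (+27.994915 Da) and checks whether the difference between an observed and a theoretical peptide mass matches one of them. The function name, the 10 ppm tolerance and the example masses are assumptions chosen for illustration, not values or software from this study.

```python
# Standard monoisotopic mass shifts in Da for two N-terminal modifications.
N_TERMINAL_SHIFTS = {
    "acetylation": 42.010565,
    "formylation": 27.994915,
}


def match_n_terminal_modification(theoretical_mass, observed_mass, tolerance_ppm=10.0):
    """Return the names of modifications whose mass shift explains the difference.

    The tolerance is expressed in parts per million of the observed mass;
    the default of 10 ppm is a hypothetical setting.
    """
    delta = observed_mass - theoretical_mass
    tolerance_da = observed_mass * tolerance_ppm * 1e-6
    return [
        name
        for name, shift in N_TERMINAL_SHIFTS.items()
        if abs(delta - shift) <= tolerance_da
    ]


# Hypothetical peptide masses in Da, for illustration only.
unmodified_mass = 1500.7500
print(match_n_terminal_modification(unmodified_mass, unmodified_mass + 42.0106))
```

The two shifts differ by about 14 Da, so they are easily told apart; distinguishing acetylation from near-isobaric modifications requires the high mass accuracy of the instruments mentioned above.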
The current study determines chloroplast stromal protein abundance and N-terminal modifications, re-evaluates chloroplast transit peptides and cleavage sites, and provides a comprehensive catalogue and annotation of the chloroplast proteome, encompassing existing literature. The plastid proteomics database, PPDB (
http://ppdb.tc.cornell.edu/), first described in
[36], is focused on the (cell-type specific) chloroplast proteomes from maize and
Arabidopsis and their functional annotation. We recently renamed the Plastid PDB to Plant PDB to better reflect its content. The dataset obtained in the current study is integrated in the PPDB and is expected to serve the plant community in small- and large-scale analyses where protein subcellular location, protein modification, function and abundance are important. Moreover, based on our experimental and theoretical analysis of the N-terminal portions and cTP cleavage sites, it is expected that the chloroplast data set presented here will facilitate improvement of subcellular protein localization predictors. Finally, the protein coverage and abundance of key chloroplast pathways and processes are discussed. This study demonstrates that ‘spectral counting’ can provide large-scale protein quantification for
Arabidopsis, which is important in the context of plant systems biology
[13],
[37].