Proteomics is the large-scale study of the structure and function of proteins in complex biological samples. Such an approach has the potential to help explain the complex nature of the organism. Current proteomic tools allow large-scale, high-throughput analyses for the detection, identification, and functional investigation of the proteome. Advances in protein fractionation and labeling techniques have improved protein identification to include the least abundant proteins. In addition, proteomics has been complemented by the analysis of posttranslational modifications and techniques for the quantitative comparison of different proteomes. However, the major limitation of proteomic investigations remains the complexity of biological structures and physiological processes, which leaves the path of exploration beset with difficulties and pitfalls. The quantity of data acquired with new techniques places new challenges on data processing and analysis. This article provides a brief overview of currently available proteomic techniques and their applications, followed by a detailed description of their advantages and technical challenges. Some solutions to circumvent technical difficulties are proposed.
The term proteomics describes the study and characterization of the complete set of proteins present in a cell, organ, or organism at a given time. In general, proteomic approaches can be used (a) for proteome profiling, (b) for comparative expression analysis of two or more protein samples, (c) for the localization and identification of posttranslational modifications, and (d) for the study of protein–protein interactions. The human genome harbors 26000–31000 protein-encoding genes, whereas the total number of human protein products, including splice variants and essential posttranslational modifications (PTMs), has been estimated to be close to one million [3, 4]. It is evident that most of the functional information on the genes resides in the proteome, which is the sum of multiple dynamic processes that include protein phosphorylation, protein trafficking, localization, and protein-protein interactions. Moreover, the proteomes of mammalian cells, tissues, and body fluids are complex and display a wide dynamic range of protein concentrations: one cell can contain between one and more than 100000 copies of a single protein. In spite of new technologies, the requirements of proteome analysis are not yet fulfilled: analysis of complex biological mixtures, the ability to quantify separated protein species, sufficient sensitivity for proteins of low abundance, quantification over a wide dynamic range, the ability to analyze protein complexes, and high throughput applications. Biomarker discovery remains a very challenging task due to the complexity of the samples (e.g., serum, other bodily fluids, or tissues) and the wide dynamic range of protein concentrations. Most of the serum biomarker studies performed to date seem to have converged on a set of proteins that are repeatedly identified in many studies and that represent only a small fraction of the entire blood proteome. Processing and analysis of proteomics data is indeed a very complex multistep process [10, 11].
The consistent and transparent analysis of LC/MS and LC-MS/MS data requires multiple stages, and this process remains the main bottleneck for many larger proteomics studies. To overcome these issues, effective sample preparation (to reduce complexity and to enrich for lower abundance components while depleting the most abundant ones), state-of-the-art mass spectrometry instrumentation, and extensive data processing and data analysis are required. A wide range of proteomic approaches is available. Gel-based applications include one-dimensional and two-dimensional polyacrylamide gel electrophoresis [13, 14]. Gel-free high throughput screening technologies are equally available, including multidimensional protein identification technology (MudPIT), isotope-coded affinity tags (ICAT), stable isotope labeling with amino acids in cell culture (SILAC), and isobaric tagging for relative and absolute quantitation (iTRAQ). Shotgun proteomics and 2D-DIGE as well as protein microarrays [21, 22] are applied to obtain overviews of protein expression in tissues, cells, and organelles. Large-scale western blot assays, multiple reaction monitoring (MRM) assays, and label-free quantification of high mass resolution LC-MS data are being explored for high throughput analysis. Many different bioinformatics tools have been developed to aid research in this field, for example by optimizing the storage and accessibility of proteomic data or by statistically ascertaining the significance of protein identifications made from a single peptide match. In this review we attempt to provide an overview of the major developments in the field of proteomics, some success stories, and the challenges that are currently being faced.
About 20–30% of all genes in an organism encode integral membrane proteins, which are involved in numerous cellular processes. Membrane proteins constitute 30% of the typical proteome, yet their propensity to aggregate and precipitate in solution confounds their analysis. The target residues for tryptic cleavage (i.e., lysine and arginine) are mainly absent in transmembrane helices and preferentially found in the hydrophilic part of these lipid bilayer-incorporated proteins. Because proteins aggregate during the IEF step, 2DE is unsuitable for the separation of integral membrane proteins and is limited to the detection of membrane-associated proteins and membrane proteins with a low hydrophobicity. Membrane solubilization methods have been deployed to analyze enriched membrane fractions and address the solubility issue by using detergents, organic solvents, and organic acids compatible with subsequent proteolytic digestion or chemical cleavage, separation, and analysis by LC/MS. Three such approaches have been described: (1) an enriched yeast membrane fraction is solubilized with 90% formic acid in the presence of cyanogen bromide; the concentrated organic acid provides the solubilization agent, and cyanogen bromide, functional under acidic conditions, allows many embedded membrane proteins to be cleaved; (2) a membrane-enriched microsomal fraction is solubilized by boiling in 0.5% SDS and, following isotope-coded affinity tag (ICAT) labeling, is diluted to reduce the concentration of SDS; and (3) using an enriched membrane sample, the proteins are thermally denatured and sonicated in 60% organic solvent (methanol) in the presence of trypsin. The resultant peptide mixture is then analyzed by LC/MS. All three of these methods are effective and optimize the identification of membrane proteins. Another method using high pH and proteinase K is optimized specifically for the global analysis of both membrane and soluble proteins.
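The point about tryptic cleavage sites can be illustrated with a minimal in silico digestion sketch. It assumes the commonly used rule that trypsin cleaves C-terminal to lysine (K) or arginine (R) unless the next residue is proline; that rule, the function name, and the example sequences are illustrative assumptions and do not come from the text above.

```python
import re


def tryptic_peptides(sequence):
    """Split a protein sequence after K or R, except when followed by P.

    This is a simplified cleavage rule (an assumption for illustration);
    missed cleavages and other enzyme behaviour are ignored.
    """
    # Zero-width split point: preceded by K or R and not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if p]


# Hypothetical hydrophilic segment: several K/R sites, several short peptides.
print(tryptic_peptides("MKTAYIAKQRQISFVK"))

# Hypothetical transmembrane-like helix: no K/R, so it stays as one long,
# hydrophobic peptide that is hard to recover and analyze.
print(tryptic_peptides("LLIVAGLLFAVLGIAVLLWFA"))
```

A segment without lysine or arginine yields a single undigested peptide, which is one reason transmembrane regions are underrepresented in tryptic digests.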
High pH favors the formation of membrane sheets, while proteinase K cleaves exposed hydrophilic domains of membrane proteins. The commercially available nonionic detergents dodecyl maltoside and decaethylene glycol monohexadecyl ether have proved to be the most efficient membrane protein solubilizers. Another, more successful approach to isolating membrane proteins relies on cell surface labeling in combination with high resolution two-dimensional (2D) LC-MS/MS. In addition, improved analytical tools should be developed, that is, multidimensional liquid chromatography of peptide mixtures generated from membrane proteins, nanoflow chromatographic techniques for hydrophobic transmembrane peptides, and native electrophoresis of membrane protein complexes, which, in combination with mass spectrometry, should lead to the identification of the majority of proteins in the membrane proteome of simple microorganisms. It is important not only to quantify the identified membrane proteins but also to determine the levels of their interacting partners. Subcellular fractionation techniques that employ a combination of centrifugation steps are a common choice for preparing plasma membrane (PM)-enriched fractions, including detergent-resistant membrane fractions, commonly known as lipid rafts. These methods can offer a significant improvement in specificity for PM proteins over approaches that do not perform any subcellular fractionation but rather use whole-cell or tissue preparations. Chemical-tagging methods have been more widely applied to enrich for PM proteins and are often used in conjunction with physical separation strategies. Chemical tagging allows a specific class of protein or modification of interest to be physically separated from other nontagged proteins.
Importantly, when chemical tags are attached to the extracellular domain of PM proteins on intact cells, they offer an unrivaled specificity for PM proteins, because they provide a way to distinguish true PM proteins from intracellular contaminants. Cell-surface biotinylation, the covalent attachment of a biotin tag to the extracellular domain of PM proteins, is also a popular choice [38–40].
Serum is a complex body fluid, containing a large diversity of proteins. More than 10000 different proteins are present in human serum, and many of them are secreted or shed by cells during different physiological or pathological processes. Serum is expected to be an excellent source of protein biomarkers because it circulates through, or comes in contact with, all tissues. Consequently, serum proteomics has raised great expectations for the discovery of biomarkers to improve diagnosis or classification of a wide range of diseases, including cancers. However, serum has been termed the most complex human proteome, with considerable differences in the concentrations of individual proteins, ranging from several milligrams to less than one picogram per milliliter. The analytical challenge for biomarker discovery arises from the high variability in the concentration and state of modification of some human plasma proteins between different individuals. Albumin is a protein of very high abundance in serum (35–50 mg/mL) that would be a prime candidate for complete selective removal prior to performing a proteomic analysis of lower abundance proteins. However, removal of albumin from serum may also result in the removal of low abundance cytokines, peptide hormones, and lipoproteins of interest. Immunoglobulins (antibodies) are also abundant proteins in serum that function by recognizing “foreign” antigens in blood and initiating their destruction. The presence of higher abundance proteins interferes with the identification and quantification of lower abundance proteins (lower than ng/mL in serum). Complexity and dynamic range of protein concentrations can be addressed with a combination of prefractionation techniques that deplete highly abundant proteins and fractionate the sample. Heparin chromatography coupled with protein G appears to be an efficient and economical strategy to pretreat serum for serum proteomics.
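The dynamic range described above can be made concrete with a short calculation. This is only a sketch: it takes the upper albumin figure quoted in the text (50 mg/mL) and one picogram per milliliter as the low end, and the function name is hypothetical.

```python
import math


def orders_of_magnitude(high_g_per_ml, low_g_per_ml):
    """Return the log10 ratio between two concentrations in the same units."""
    if high_g_per_ml <= 0 or low_g_per_ml <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(high_g_per_ml / low_g_per_ml)


albumin = 50e-3        # 50 mg/mL, upper end of the albumin range in the text
trace_protein = 1e-12  # 1 pg/mL, the low end mentioned in the text

span = orders_of_magnitude(albumin, trace_protein)
print(f"Dynamic range spans about {span:.1f} orders of magnitude")
```

Under these figures the span is between ten and eleven orders of magnitude, which is why depletion and prefractionation are treated as prerequisites for detecting low abundance species.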
Protein prefractionation by immunodepletion and reversed-phase separation of the depleted plasma on an mRP-C18 column provide methods compatible with LC-MS-based analysis. Polyclonal antibody-based systems can rapidly deplete multiple highly abundant proteins in serum, plasma, CSF, and other biological fluids; individual antibody materials are mixed in selected percentages and packed into a column format. Albumin can be removed by immunoaffinity columns, isoelectric trapping, dye-ligand chromatography, and peptide affinity chromatography. Another approach involves the removal of IgG by affinity chromatography using immobilized protein A or protein G. A recently developed depletion method (MARS) that mixes 6 high-specificity polyclonal antibodies to remove the top 6 proteins in a single purification step is commercially available. The Human-14 multiple affinity removal column depletes the top 14 abundant proteins from human serum, plasma, CSF, and other biological fluids. To address the limitations of 2DE, several types of mass spectrometry, in conjunction with various separation and analysis methods, are increasingly being adopted for proteomic measurements. In contrast to 2D-PAGE analysis, SELDI-TOF MS is a rather new method that is especially valuable for the identification of serum-derived biomarkers. This method is based on ProteinChip Arrays, which carry various chromatographic properties, such as anion exchange, cation exchange, and hydrophilic or hydrophobic surfaces. For the analysis of serum, only 5–10 μL of serum sample is applied to these surfaces; after washing off unbound material, the protein fingerprint can be determined and visualized by time-of-flight mass spectrometry. The advantages of this method are the low amount of sample necessary for analysis, its speed, and its high throughput capability.
Many different groups have used this method and related methods based on prefractionation of serum proteins by beads and subsequent MALDI analysis for the identification of biomarkers in serum, urine, pancreatic juice, and other biological fluids. The necessity of this removal or separation is also illustrated by the many proteins found useful as biomarkers. Different fractionation steps (such as electrophoresis, SELDI, and liquid chromatography) have been developed to reduce the complexity of the serum proteome and to allow the detection and identification of single proteins. 2DE and MALDI MS have been applied to identify candidate biomarkers at early and late stages of lung cancer. This method identified 46 proteins in tumor-bearing mice, including disease-regulated expression of orosomucoid-8, α-2-macroglobulin, apolipoprotein-A1, apolipoprotein-C3, glutathione peroxidase-3, plasma retinol-binding protein, and transthyretin. Recently, 1065 proteins were identified using a stable isotope labeled proteome (SILAP) standard coupled with extensive multidimensional separation and tandem mass spectrometry, of which 121 proteins were present at 1.5-fold or greater concentrations in the sera of patients with pancreatic cancer. Specimen collection (blood, serum, and plasma samples) is an integral component of clinical research. Access to high-quality specimens, collected and handled in standardized ways that minimize potential bias or confounding factors, is key to the “bench to bedside” aim of translational research. Variables that may impact analytic outcomes include (1) the type of additive in the blood collection tubes; (2) sample processing times or temperatures; (3) hemolysis of the sample; (4) sample storage parameters; and (5) the number of freeze-thaw cycles [63, 64].
The key requirement in any analysis is that the case and control samples are handled in exactly the same manner throughout the entire analytical process, from study design and collection of samples to data analysis [63, 65]. Differences in handling between samples could have a significant impact on the stability of proteins or other molecules of interest in the specimens. Small differences in the processing or handling of a specimen can have dramatic effects on analytical reliability and reproducibility, especially when multiplex methods are used. A representative internal working group on standard operating procedures, comprising members from across the early detection research network, should be formed to develop standard operating procedures (SOPs) for the various types of specimens collected and managed for biomarker discovery and validation work.
Figure 1 gives the general workflow in proteomics, and Table 1 addresses the strengths and limitations of the techniques. Two-dimensional electrophoresis (2DE) was developed two decades before the term proteomics was coined [66, 67]. 2DE entails the separation of complex protein mixtures by molecular charge in the first dimension and by mass in the second dimension. 2DE analysis provides several types of information about the hundreds of proteins investigated simultaneously, including molecular weight, pI, and quantity, as well as possible posttranslational modifications. 2DE is extensively used, but mostly for qualitative experiments, and the method falls short in its reproducibility, its inability to detect low abundance and hydrophobic proteins, and its low sensitivity in identifying proteins with pI values too low (pI < 3) or too high (pI > 10) and molecular masses too small (Mr < 10 kD) or too large (Mr > 150 kD) [2–5]. Poor separation of basic proteins due to “streaking” of spots and poor resolution of membrane proteins are limiting factors in 2DE. However, 2DE is the only technique that can be routinely applied for parallel quantitative expression profiling of complex protein mixtures such as whole cell and tissue lysates, and it is the most widely used method for efficiently separating proteins, their variants, and their modifications (up to 15000 proteins). There are two ways to study posttranslational modifications by means of 2DE. First, posttranslational modifications that alter the molecular weight and/or pI of a protein are reflected in a shift in location of the corresponding protein spot on the proteomic pattern. Second, in combination with Western blotting, antibodies specific for posttranslational modifications can reveal spots on 2DE patterns containing proteins with these modifications.
Protein extraction and solubilization are key steps for proteomic analysis using 2DE. Highly hydrophobic proteins tend to precipitate during isoelectric focusing (IEF), and the low copy number and insolubility of transmembrane proteins render quantitative analysis of these peptides and polypeptides very challenging. Different treatments and conditions are necessary to efficiently solubilize different types of protein extracts [72, 73]. The major challenge for protein visualization in 2DE is the compatibility of sensitive protein staining methods with mass spectrometric analysis. Therefore, several fluorescent staining methods have been developed for the visualization of 2DE patterns, including SYPRO stains and Cy-dyes. Although SYPRO Ruby and silver staining [76, 77] have a comparable sensitivity, SYPRO Ruby staining allows much higher reproducibility, a significantly wider dynamic range, and less false-positive staining. In addition, SYPRO Ruby allows for the detection of lipoproteins, glycoproteins, metalloproteins, calcium-binding proteins, fibrillar proteins, and low molecular weight proteins that are less “stainable” using other methods. Finally, a large number of protein spots on 2DE patterns contain several proteins with a similar pI. A pH gradient with a narrow range allows zooming into different proteins with the same molecular weight. Increasing the separation distance with 40 × 40 cm gels using CA-IEF could increase the proteome coverage up to 5000 proteins. Use of overlapping narrow range IPG “Zoom” gels and an increase in separation area could yield better membrane protein separation. This technology, however, is biased against certain classes of proteins, including low abundance and hydrophobic proteins.
Proteins can also be fluorescently labelled with Cy2, Cy3, or Cy5 prior to 2DE. CyDyes are cyanine dyes carrying an N-hydroxysuccinimidyl ester reactive group that covalently binds the ε-amino group of lysine residues in proteins. During DIGE, proteins in each of up to 3 samples can be labelled with one of these fluorescent dyes, and the differentially labelled samples can be mixed and loaded together on a single gel, allowing the quantitative comparative analysis of three samples on one gel (Figure 2). The DIGE technique has exhibited higher sensitivity as well as linearity, eliminated postelectrophoretic processing (fixing and destaining) steps, and enhanced reproducibility by directly comparing samples under similar electrophoretic conditions [81, 82]. The resulting images are then analyzed by software such as DeCyder, which is specifically designed for 2D-DIGE analysis. The major advantages of 2D-DIGE are the high sensitivity and linearity of its dyes, its straightforward protocol, and its significant reduction of intergel variability, which increases the possibility of unambiguously identifying biological variability and reduces bias from experimental variation. Moreover, the use of a pooled internal standard, loaded together with the control and experimental samples, increases quantification accuracy and statistical confidence. The DIGE technique has dramatically improved the reproducibility, sensitivity, and accuracy of quantitation; however, its labeling chemistry has some limitations: proteins without lysine cannot be labeled, the dyes require special equipment for visualization, and the fluorophores are very expensive [83, 85].
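The role of the pooled internal standard can be sketched as a simple normalization: each spot volume is expressed relative to the internal standard signal for the same spot on the same gel, so that samples run on different gels become comparable. The function names and spot volumes below are hypothetical, and this is only an illustration of the idea, not the algorithm used by DeCyder or any other package.

```python
import math


def standardized_log_ratio(sample_volume, standard_volume):
    """log2 of a spot volume relative to the internal standard on the same gel."""
    if sample_volume <= 0 or standard_volume <= 0:
        raise ValueError("spot volumes must be positive")
    return math.log2(sample_volume / standard_volume)


def fold_change(control, treated):
    """Fold change between two (sample_volume, standard_volume) pairs.

    The two pairs may come from different gels; dividing by the internal
    standard first removes gel-to-gel differences in overall signal.
    """
    difference = standardized_log_ratio(*treated) - standardized_log_ratio(*control)
    return 2 ** difference


# Hypothetical spot volumes: control on gel 1, treated on gel 2.
control_spot = (1000.0, 2000.0)   # (sample, internal standard)
treated_spot = (3000.0, 1500.0)
print(fold_change(control_spot, treated_spot))
```

In this made-up example the raw volumes differ threefold, while the standardized comparison gives a fourfold change, because the two gels differ in overall signal.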
Gel-free, or MS-based, proteomics techniques are emerging as the methods of choice for quantitatively comparing protein levels among biological proteomes, since they are more sensitive and reproducible than two-dimensional gel-based methods. ICAT is one of the most widely employed chemical isotope labeling methods and the first quantitative proteomic method to be based solely on MS [86, 87]. Each ICAT reagent consists of three essential groups: a thiol-reactive group, an isotope-coded light or heavy linker, and a biotin segment to facilitate peptide enrichment. In an ICAT experiment, protein samples are first labeled with either light or heavy ICAT reagents on cysteine thiols. The mixtures of labeled proteins are then digested by trypsin and separated through a multistep chromatographic separation procedure. Peptides are identified with tandem MS, and the relative quantities of peptides are inferred from the integrated LC peak areas of the heavy and light versions of the ICAT-labeled peptides. The ICAT concept has been widely used since its introduction [89–91]. Different software programs were developed to analyze ICAT-labeled MS data (e.g., ProICAT from Applied Biosystems, Spectrum Mill from Agilent Technologies, and Sashimi from the Institute for Systems Biology). ICAT is extremely helpful for detecting peptides with low expression levels, which is one of the bottlenecks in analytical protein techniques [93, 94]. However, major limitations of this technique include selective detection of proteins with high cysteine content and difficulties in the detection of acidic proteins [95, 96]. Methods for the direct comparison of DIGE and ICAT for the identification and quantification of proteins in complex biological mixtures are also being considered.
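The relative quantification step described above, in which the ratio is inferred from integrated LC peak areas of the heavy and light peptide forms, can be sketched as follows. The trapezoidal integration, the function names, and the chromatogram values are illustrative assumptions; the ICAT software packages named in the text may use different peak models.

```python
def peak_area(times, intensities):
    """Integrate an extracted ion chromatogram with the trapezoidal rule."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


def heavy_to_light_ratio(times, light, heavy):
    """Relative abundance of the heavy-labeled versus light-labeled peptide."""
    light_area = peak_area(times, light)
    if light_area == 0:
        raise ValueError("light peak has zero area")
    return peak_area(times, heavy) / light_area


# Hypothetical co-eluting peak pair sampled at five time points (minutes).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
light = [0.0, 10.0, 20.0, 10.0, 0.0]
heavy = [0.0, 20.0, 40.0, 20.0, 0.0]
print(heavy_to_light_ratio(times, light, heavy))
```

Using areas rather than single-scan intensities makes the ratio less sensitive to noise in any one scan.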
Because the ICAT reagent reacts only with the free sulfhydryl group of cysteine, and roughly 8% of proteins contain no cysteine, SILAC has emerged as a valuable proteomic technique that has become more common for cultured cell types and has been applied in many fields [99–101]. The SILAC technique can be effectively expanded to compare the differential expression levels of tissue proteomes at different pathological states, which allows new candidate biomarkers to be identified. Compared with ICAT, a popular in vitro labeling method, SILAC as an example of in vivo coding requires no chemical manipulation, and there is very little chemical difference between the isotopically labeled amino acid and its naturally occurring counterpart. In addition, the amount of labeled protein required for analysis using the SILAC technique is far less than that with ICAT. Therefore, the SILAC-based method has been broadly applied in many areas of cell biology and proteomics. Besides being powerful in comparative and differential proteomics, the SILAC-based quantitative method has been widely used in analyzing protein posttranslational modifications such as phosphorylation, in detecting protein-protein or peptide-protein interactions, and in investigating signal transduction pathways [104, 105]. Although SILAC-based methods have numerous advantages compared to chemical labeling, a major drawback of SILAC is that it cannot be applied to tissue protein analysis directly. To overcome this shortcoming, SILAC has been successfully applied to tissue proteomes based on 15N isotope labeling. Microorganisms such as the malaria parasite can be labeled with isoleucine. More recently, the culture-derived isotope tags (CDITs) method was developed as an alternative quantitative approach for studying the proteome of mammalian tissues based on the application of SILAC.
Differential 16O/18O coding relies on the 18O exchange that takes place at the C-terminal carboxyl group of proteolytic fragments, where two 16O atoms are typically replaced by two 18O atoms by enzyme-catalyzed oxygen exchange in the presence of H₂¹⁸O. The resulting mass shift between differentially labeled peptide ions permits identification, characterization, and quantitation of the proteins from which the peptides are proteolytically generated. In contrast to ICAT, 18O labeling does not favor peptides containing certain amino acids (e.g., cysteine), nor does it require an additional affinity step to enrich for these peptides. Unlike iTRAQ, 16O/18O labeling does not require a specific MS platform, nor does it depend on fragmentation spectra (MS2) for quantitative peptide measurements. It is amenable to the labeling of human specimens (e.g., plasma, serum, tissues), for which metabolic labeling approaches (e.g., SILAC) are limited. Taken together, recent advancements in the homogeneity of 18O incorporation, improvements in the algorithms employed for calculating 16O/18O ratios, and the inherent simplicity of this technique should result in increased use of 18O labeling. In general, 18O labeling suffers from two potential drawbacks: inhomogeneous 18O incorporation and the inability to compare multiple samples within a single experiment. A dual 18O labeling method using a non-gel-based platform has been developed to overcome the major problems of existing proteolytic 18O labeling methods.
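The mass shift produced by the exchange can be worked out directly from the isotope masses. The sketch below uses standard monoisotopic masses for 16O and 18O (values supplied here, not taken from the text), and the function names are hypothetical. It also shows why incomplete exchange, with only one oxygen replaced, produces a peak between the unlabeled and fully labeled forms.

```python
# Monoisotopic masses in daltons (standard reference values, rounded).
MASS_16O = 15.99491462
MASS_18O = 17.99915961


def label_mass_shift(n_oxygens=2):
    """Mass difference when n_oxygens 16O atoms are replaced by 18O."""
    return n_oxygens * (MASS_18O - MASS_16O)


def mz_spacing(charge, n_oxygens=2):
    """Spacing between labeled and unlabeled peaks on the m/z axis."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return label_mass_shift(n_oxygens) / charge


print(f"Full exchange (two 18O): {label_mass_shift(2):.4f} Da")
print(f"Partial exchange (one 18O): {label_mass_shift(1):.4f} Da")
print(f"Spacing for a doubly charged peptide: {mz_spacing(2):.4f} m/z")
```

The full shift is only about 4 Da, so for multiply charged peptides the labeled and unlabeled isotope envelopes can overlap, which is part of why the ratio calculation algorithms mentioned above matter.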
The iTRAQ reagent is well known for relative and absolute quantitation of proteins. The iTRAQ technology offers several advantages, which include the ability to multiplex several samples, quantification, simplified analysis, and increased analytical precision and accuracy [113–115]. The appeal of this multiplexing reagent is that 4 or 8 samples can be quantified simultaneously. In this technique, the introduction of stable isotopes using iTRAQ reagents occurs at the level of proteolytic peptides (Figure 3). This technology uses an NHS ester derivative to modify primary amino groups by linking a mass balance group (carbonyl group) and a reporter group (based on N-methylpiperazine) to proteolytic peptides via the formation of an amide bond. Due to the isobaric mass design of the iTRAQ reagents, differentially labelled peptides appear as a single peak in MS scans, reducing the probability of peak overlap. When iTRAQ-tagged peptides are subjected to MS/MS analysis, the mass-balancing carbonyl moiety is released as a neutral fragment, liberating the isotope-encoded reporter ions, which provide relative quantitative information on proteins. An inherent drawback of the reported iTRAQ technology is that proteins are enzymatically digested prior to labelling, which artificially increases sample complexity; the approach therefore needs a powerful multidimensional fractionation method for peptides before MS identification.
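How reporter ions translate into relative quantities can be sketched for a 4-plex experiment. The nominal reporter masses (114 to 117) are not stated in the text and are supplied here as an assumption about the 4-plex reagent set; the intensities, the choice of channel 114 as reference, and the summing of intensities across peptides are likewise illustrative choices rather than a prescribed method.

```python
REPORTER_CHANNELS = (114, 115, 116, 117)  # nominal 4-plex reporter m/z (assumed)


def protein_ratios(peptide_spectra, reference=114):
    """Sum reporter intensities over a protein's peptides and normalize.

    peptide_spectra is a list of dicts mapping reporter channel to intensity,
    one dict per MS/MS spectrum assigned to the protein.
    """
    totals = {channel: 0.0 for channel in REPORTER_CHANNELS}
    for spectrum in peptide_spectra:
        for channel in REPORTER_CHANNELS:
            totals[channel] += spectrum.get(channel, 0.0)
    if totals[reference] == 0:
        raise ValueError("reference channel has no signal")
    return {channel: totals[channel] / totals[reference]
            for channel in REPORTER_CHANNELS}


# Hypothetical reporter intensities from two peptides of the same protein.
spectra = [
    {114: 100.0, 115: 200.0, 116: 50.0, 117: 100.0},
    {114: 300.0, 115: 600.0, 116: 150.0, 117: 300.0},
]
print(protein_ratios(spectra))
```

Summing before dividing gives more weight to intense spectra, which tend to be less noisy; averaging per-peptide ratios is an alternative design.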
Prefractionation of proteins based on electrokinetic methodologies in free solution, essentially relying on isoelectric focusing (IEF), has gained wide acceptance. Many commercial devices are now constructed to take advantage of this principle (Table 2). Reproducible fractionation steps break down sample complexity while concentrating low abundance species, resulting in more confident protein identification and quantification by 2D gels, mass spectrometry, and protein arrays. A good example of such an innovation is liquid-phase IEF as a prefractionation tool before the first dimension of 2D gel electrophoresis [118, 119]. For more consistent pI separation, the Zoom IEF fractionator and the multicompartment electrolyser (MCE) are being used to prefractionate the proteins. The fractionated samples can be directly applied to standard narrow range IPG strips for 2D electrophoresis. This allows at least 10000 to 15000 separate proteins to be analyzed, including proteins of very low abundance. IEF, a high-resolution electrophoresis technique, has been widely used in shotgun proteomic experiments. IEF runs in a buffer-free solution containing carrier ampholytes or in immobilized pH gradient (IPG) gels. The use of IPG-IEF for the separation of complex peptide mixtures has been applied to the analysis of plasma and amniotic fluid [123, 124] as well as to bacterial material. The IPG gel strip is divided into small sections for extraction and cleanup of the peptides. This technique recovers the sample from the liquid phase and was demonstrated to be of great interest in shotgun proteomics. IEF is not only a high resolution and high capacity separation method for peptides; it also provides additional physicochemical information such as their isoelectric point [127, 128]. The pI value provided is used as an independent validating and filtering tool during database searches for MS/MS peptide sequence identification.
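The use of pI as an independent filter can be sketched as a comparison between the pI predicted for each candidate peptide and the pI of the fraction in which it was found. The record layout, the tolerance of 0.5 pH units, and the peptide entries below are hypothetical; predicted pI values are assumed to come from an external calculator and are not computed here.

```python
def filter_by_pi(identifications, tolerance=0.5):
    """Keep identifications whose predicted pI matches the fraction pI.

    Each identification is a dict with 'peptide', 'predicted_pi', and
    'fraction_pi' keys (a made-up layout for illustration).
    """
    kept = []
    for ident in identifications:
        if abs(ident["predicted_pi"] - ident["fraction_pi"]) <= tolerance:
            kept.append(ident)
    return kept


# Hypothetical database search hits from a fraction focused near pI 4.5.
hits = [
    {"peptide": "PEPTIDE_A", "predicted_pi": 4.6, "fraction_pi": 4.5},
    {"peptide": "PEPTIDE_B", "predicted_pi": 9.8, "fraction_pi": 4.5},
]
for hit in filter_by_pi(hits):
    print(hit["peptide"])
```

A hit whose predicted pI is far from the fraction pI is more likely to be a false match, so this check adds evidence that is independent of the MS/MS score.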
The recent introduction of the commercially available OFFGEL fractionator system by Agilent Technologies provides an efficient and reproducible separation technique. This separation is based on immobilized pH gradient (IPG) strips and allows peptides and proteins to be separated according to their isoelectric point (pI), but it is carried out in solution. Its micropreparative scale provides fraction volumes large enough to perform subsequent analyses such as reverse phase (RP) liquid chromatography (LC) coupled with MALDI MS/MS. The combined use of iTRAQ labeling and OFFGEL fractionation for the proteomic study of complex samples is also being considered [132, 133].
In this procedure, a large well is used to separate the sample by PAGE and lanes are created on the membrane containing immobilized protein with the use of a manifold . Compatible combinations of primary antibodies are predetermined, with the criterion of being able to identify proteins that do not comigrate. Different combinations of primary antibodies are added to each well, with appropriate dilutions of each primary antibody so that expressed proteins are detected in a single condition. The scalability of the system depends on defining suitable combinations of primary antibodies, with up to 1000 antibodies in 200 lanes being used in the largest screens. Detection software is used to identify proteins based on their expected and observed gel mobility. Unlike 2D PAGE and HPLC-MS/MS, large-scale western blotting only identifies proteins for which antibodies are already available. While this is not an appropriate screen for identifying uncharacterized proteins, it greatly simplifies the verification and functional analyses of proteins that are detected. In addition, this approach is highly flexible, and can be focused to particular sets of proteins or protein function, such as cell signaling molecules. Importantly, the foundation of this approach is the large amount of data on individual antibodies, which are already available and characterized in the literature .
Another approach to analysing proteomes without gels is “shotgun” analysis using MudPIT. In the MudPIT approach, protein samples are subjected to sequence-specific enzymatic digestion, usually with trypsin and endoproteinase Lys-C, and the resultant peptide mixtures are separated by strong cation exchange (SCX) and reversed phase (RP) high performance liquid chromatography (HPLC) [137, 138]. Peptides from the RP column enter the mass spectrometer, and the MS data are used to search the protein databases. The MudPIT technique generates an exhaustive list of the proteins present in a particular protein sample; it is fast and sensitive, with good reproducibility. However, it lacks the ability to provide quantitative information [139–141]. A combination of HPLC, liquid phase isoelectric focusing, and capillary electrophoresis provides other multimodular options for the separation of complex protein mixtures.
High-throughput production of human proteins by different methods is being developed to make the protein array approach more practical. Recently, a simple and efficient method for producing human proteins using the versatile Gateway vector system has been developed. In this approach, a protein expression system was applied to the in vitro expression of 13364 human proteins, their biological activity was assessed in two functional categories, and a “human protein factory” infrastructure was developed that includes the resources and expression technology for in vitro proteome research. In another approach, DNA array to protein array (DAPA), replicate protein arrays are “printed” directly from a DNA array template using cell-free protein synthesis. Based on the nucleic acid programmable protein array (NAPPA) concept, a high-density self-assembling protein microarray has been developed to display thousands of proteins that are produced and captured in situ from immobilized cDNA templates. This method will enable various experimental approaches to study protein function in high throughput.
The advent of protein-based microarrays allows the global observation of biochemical activities on an unprecedented scale, where hundreds or thousands of proteins can be simultaneously screened for protein-protein, protein-nucleic acid, and protein-small molecule interactions, as well as posttranslational modifications [146, 147]. The microarray format provides a robust and convenient platform for the simultaneous analysis of thousands of individual protein samples, facilitating the design of sophisticated and reproducible biochemical experiments under highly specific conditions. The principal challenges in protein array development are threefold: (1) creation of a comprehensive expression clone library; (2) high-throughput protein production, including expression, isolation, and purification; (3) adaptation of DNA microarray technology to accommodate protein substrates. Functional protein microarrays differ from analytical arrays in that they are composed of full-length functional proteins or protein domains (Figure 4). These protein chips are used to study the biochemical activities of an entire proteome in a single experiment. They are used to study numerous protein interactions, such as protein-protein, protein-DNA, protein-RNA, protein-phospholipid, and protein-small molecule interactions [150, 151]. Companies have introduced protein arrays aimed not only at proteomic analysis but also at functional analyses of proteins (e.g., Biacore AB, Ciphergen Biosystems Inc., Phylos Inc.). Affinity proteomics aims to produce antibodies to every protein expressed by the human genome; these antibodies will be characterized against purified antigens and tested on tissue arrays to collect information about their specificity for tissue antigens. Companies are focused on producing various binding partners, for example, affibodies, monoclonal antibodies, and their fragments.
Protein chips will likely be the next major manifestation of the revolution in proteomics. They offer another solution for analyzing low abundance proteins and have the potential for high throughput applications to identify biomarkers. Protein chips differ from the previously described methods: whereas screening by 2DE or LC-MS/MS can potentially detect any protein, protein chips can only provide data on the set of proteins selected by the investigator.
The development and application of high throughput, multiplex immunoassays that measure hundreds of known proteins in complex biological matrices is becoming a significant tool for quantitative proteomics studies, diagnostic discovery, and biomarker-assisted drug development. Two broad categories of antibody microarray experimental formats have been described: direct labelling, single antibody experiments and dual antibody, sandwich immunoassays [158, 159]. In the direct labelling method, all proteins in a complex mixture are tagged, providing a means for detecting bound proteins following incubation on an antibody microarray. In the sandwich immunoassay format, proteins captured on an antibody microarray are detected by a cocktail of detection antibodies, each antibody matched to one of the spotted antibodies. In addition, a variety of microarray substrates have been described, including nylon membranes, plastic microwells, planar glass slides, gel-based arrays, and beads in suspension arrays.
Much effort has been expended in optimizing antibody attachment to the microarray substrate. Finally, various signal generation and signal enhancement strategies have been employed in antibody arrays, including colorimetry, radioactivity, fluorescence, chemiluminescence, quantum dots and other nanoparticles, enzyme-linked assays, resonance light scattering, tyramide signal amplification, and rolling circle amplification. Each of these formats and procedures has distinct advantages and disadvantages, relating broadly to sensitivity, specificity, dynamic range, multiplexing capability, precision, throughput, and ease of use. In general, multiplexed microarray immunoassays are ambient analyte assays . Given the heterogeneity of antibody array formats and procedures currently in use in proteomics studies, and the absence of a “gold standard,” there exists an urgent need for development and adoption of standards that permit platform comparisons and benchmarking.
Regardless of the choice of a given proteomic separation technique, gel-based or gel-free, a mass spectrometer is always the primary tool for protein identification. During the last decade, significant improvements have been made in the application of MS for the determination of protein sequences. Mass spectrometers consist of an ion source, a mass analyzer, and an ion detection system. Analysis of proteins by MS occurs in three major steps: (a) protein ionization and generation of gas-phase ions, (b) separation of ions according to their mass to charge ratio, and (c) detection of ions. In gel-free approaches such as ICAT and MudPIT, samples are directly analyzed by MS, whereas in gel-based proteomics (2DE and 2D-DIGE), the protein spots are first excised from the gel and then digested with trypsin. The resulting peptides are then separated by LC or directly analyzed by MS. The experimentally derived peptide masses are correlated with the peptide fingerprints of known proteins in the databases using search engines (e.g., Mascot, Sequest). Two main ionization sources, matrix assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI), and four major mass analyzers, time-of-flight (TOF), ion trap, quadrupole, and Fourier transform ion cyclotron resonance (FT-ICR), are currently in use for protein identification and characterization. Combining different mass analyzers in tandem, such as quadrupole-TOF and quadrupole-ion trap, has brought together the individual strengths of the different analyzer types and greatly improved their capabilities for proteome analysis. Simple mass spectrometers such as MALDI-TOF are used only for the measurement of mass, whereas tandem mass spectrometers are used for amino acid sequence determination.
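The correlation of experimental peptide masses with database fingerprints can be sketched as follows. This is a simplified illustration, not the scoring used by Mascot or Sequest: it uses standard monoisotopic residue masses, assumes singly protonated peptides, and simply counts matches within a ppm tolerance. The protein names, peptides, and the 50 ppm default are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum of a free peptide
PROTON = 1.007276  # mass of a proton, for [M+H]+ ions


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in sequence) + WATER


def match_fingerprint(observed_mh, candidates, tol_ppm=50.0):
    """Count, per candidate protein, the observed [M+H]+ masses that match
    one of its theoretical peptides within tol_ppm."""
    scores = {}
    for protein, peptides in candidates.items():
        theoretical = [peptide_mass(p) + PROTON for p in peptides]
        scores[protein] = sum(
            any(abs(obs - t) / t * 1e6 <= tol_ppm for t in theoretical)
            for obs in observed_mh
        )
    return scores


# Hypothetical candidate proteins with their tryptic peptides.
candidates = {"protA": ["AAR", "WVTFRPLLK"], "protB": ["GGK"]}
observed = [peptide_mass("AAR") + PROTON, peptide_mass("GGK") + PROTON + 1.0]
print(match_fingerprint(observed, candidates))
```

Real search engines weight such matches statistically; a raw hit count is used here only to make the matching step concrete.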
In MALDI, the sample of interest is cocrystallized with the matrix on a metal surface, and a laser causes excitation of the matrix along with the analyte ions, which are then released into the gas phase. MALDI measures the mass of peptides derived from a trypsinized parent protein and generates a list of experimental peptide masses, often referred to as “mass fingerprints” [165, 166]. In ESI, the analyte is ionized from a solution and transferred into the gas phase by generating a fine spray from a high voltage needle, which results in multiple charging of the analyte and generation of multiple consecutive ions. Tandem mass spectrometry, or MS/MS, is performed by combining two different MS separation principles. In tandem MS, individual trypsin-digested peptides are fragmented after a liquid phase separation. Tandem MS instruments such as triple quadrupole, quadrupole ion trap, Fourier transform ion cyclotron resonance, or quadrupole time-of-flight are used in LC-MS/MS or nanospray experiments with ESI to generate peptide fragment ion spectra. Ion mobility spectrometry (IMS) has been utilized as a rapid gas-phase separation strategy for biomolecular ions [168, 169]. The strategy provides high sensitivity because the gas-phase dispersion of peptide ions separates features corresponding to low abundance species from interfering chemical noise. Reduced spectral congestion also allows for the use of shorter experimental run times (LC separations) without sacrificing throughput; short analysis time scales are key to measuring the large numbers of samples required to determine normal protein variability prior to realizing individual plasma profiling. Additionally, mobility-dispersed ions can be fragmented and mobility linked to fragment ions without ion loss from precursor mass selection. These advantages have been demonstrated in head-to-head comparisons with conventional LC-MS/MS technology using rapid (21 minutes) LC gradients.
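The multiple charging produced by ESI means one analyte appears as a series of peaks at different m/z values. A minimal sketch of the underlying arithmetic, assuming positive-mode ions formed only by protonation (the neutral mass used in the example is an arbitrary illustrative value):

```python
PROTON = 1.007276  # proton mass in Da


def mz(neutral_mass, z):
    """m/z of an ion carrying z protons."""
    return (neutral_mass + z * PROTON) / z


def charge_from_adjacent(mz_lower_charge, mz_higher_charge):
    """Charge of the higher-m/z peak, inferred from two adjacent peaks of the
    same analyte whose charges differ by one."""
    return round((mz_higher_charge - PROTON) / (mz_lower_charge - mz_higher_charge))


def neutral_mass_from_peak(mz_value, z):
    """Recover the neutral mass from one peak and its charge."""
    return z * (mz_value - PROTON)


# Two consecutive charge states of a hypothetical 16951.5 Da analyte.
m10, m11 = mz(16951.5, 10), mz(16951.5, 11)
z = charge_from_adjacent(m10, m11)
print(z, neutral_mass_from_peak(m10, z))
```

The charge follows from the spacing of neighbouring peaks alone, which is why a series of consecutive ions is sufficient to determine the mass of the analyte.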
The accurate mass and time (AMT) tag approach addresses an analogous situation in LC-MS-based proteomics studies. In this approach, initial LC-MS/MS analyses are performed on prefractionated peptide samples in order to provide peptide sequence identifications. These experiments are relatively low throughput because the peptide prefractionation can be quite extensive and requires separate LC-MS/MS analyses for each fraction. The high-throughput AMT tag proteomic approach was utilized to characterize the proteomes of the cytoplasm, cytoplasmic membrane, periplasm, and outer membrane fractions from aerobic and photosynthetic cultures of the gram-negative bacterium Rhodobacter sphaeroides 2.4.1. There has been a recent trend in proteomics toward the development and application of technologies for the targeted analysis of proteins within complex mixtures. Selected reaction monitoring (SRM) is a powerful tandem mass spectrometry method that can be used to monitor target peptides within a complex protein digest [174, 175]. The specificity and sensitivity of the approach, as well as its capability to multiplex the measurement of many analytes in parallel, have made it a technology of particular promise for hypothesis driven proteomics. Tandem mass spectrometry data acquired on an LTQ ion trap mass spectrometer can be used to accurately predict which fragment ions will produce the greatest signal in an SRM assay on a triple quadrupole mass spectrometer. One of the biggest benefits of a targeted assay on a triple quadrupole mass spectrometer is high throughput. Because they exploit the selectivity of multiple stages of mass selection in a tandem mass spectrometer, these targeted SRM assays are the mass spectrometry equivalent of a Western blot. An advantage of a targeted mass spectrometry-based assay over a traditional Western blot is that it does not rely on the creation of any immunoaffinity reagent.
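An SRM assay monitors transitions, that is, pairs of a precursor m/z and a fragment m/z. The sketch below computes such pairs for a peptide using singly charged y ions. It is an illustration under assumptions: the peptide `SAMPLER` is hypothetical, the residue table is deliberately limited to the residues it contains, and the largest y ions are picked only for simplicity, whereas a real assay would select fragments by observed or predicted intensity as described above.

```python
# Monoisotopic residue masses (Da); subset covering the example peptide only.
RESIDUE = {"S": 87.03203, "A": 71.03711, "M": 131.04049, "P": 97.05276,
           "L": 113.08406, "E": 129.04259, "R": 156.10111}
WATER = 18.01056
PROTON = 1.007276


def srm_transitions(peptide, precursor_charge=2, n_fragments=3):
    """Return (ion label, precursor m/z, fragment m/z) tuples for the
    n_fragments longest singly charged y ions of the peptide."""
    neutral = sum(RESIDUE[aa] for aa in peptide) + WATER
    q1 = (neutral + precursor_charge * PROTON) / precursor_charge

    transitions = []
    longest = len(peptide) - 1  # y ions contain 1 .. len-1 C-terminal residues
    for n in range(longest, max(longest - n_fragments, 0), -1):
        y_mz = sum(RESIDUE[aa] for aa in peptide[-n:]) + WATER + PROTON
        transitions.append((f"y{n}", round(q1, 4), round(y_mz, 4)))
    return transitions


for label, q1, q3 in srm_transitions("SAMPLER"):
    print(label, q1, q3)
```

Each tuple corresponds to one setting of the first and third quadrupoles; monitoring several transitions per peptide is what gives the assay its selectivity.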
While its application is novel in the proteomics community, SRM has been utilized for several decades in the toxicology and pharmacokinetics disciplines . Peptide-based immunofractionation methods show potential for proteome wide screening approaches but are limited by the availability of antibodies [178, 179]. The stable isotope standards with capture by antipeptide antibodies (SISCAPA) approach is based on the addition of stable isotope labeled standard peptides to the digested clinical sample followed by immunoaffinity enrichment of standard and analyte peptide by highly specific antipeptide antibodies [180, 181]. This approach enables the absolute quantification of selected diagnostic peptides from digested clinical samples down to physiologically relevant analyte concentrations (ng/mL) at high precision (10% CV) and accuracy [178, 179]. Further improvement of MRM-based biomarker quantification should be possible if whole sets of analyte peptides can be enriched by immunofractionation. Since this method relies on one specific antibody per target protein/peptide the generation of more than 10000 antibodies is necessary for proteome wide screening approaches. Novel peptide affinity enrichment strategies enabling proteome wide analyses of signature peptides may provide an important addition to future proteome workflows. Undoubtedly, the accuracy, high throughput, and robustness of MS technologies have made the characterization of entire proteomes a realistic goal [180, 181].
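The quantification principle behind stable isotope standards can be written in a few lines. This sketch assumes that the labeled standard and the endogenous peptide give equal signal per amount, so the analyte concentration is the light-to-heavy peak area ratio multiplied by the spiked standard concentration. The peak areas and concentrations are made-up numbers, and precision is expressed as the coefficient of variation (CV) of replicate measurements.

```python
from statistics import mean, stdev


def quantify(light_area, heavy_area, heavy_spike_ng_per_ml):
    """Analyte concentration from the light/heavy peak area ratio."""
    return light_area / heavy_area * heavy_spike_ng_per_ml


def cv_percent(values):
    """Coefficient of variation of replicate measurements, in percent."""
    return stdev(values) / mean(values) * 100


# Hypothetical replicate peak areas (light, heavy) with a 10 ng/mL spike.
replicates = [(2000.0, 4000.0), (2200.0, 4000.0), (1800.0, 4000.0)]
concentrations = [quantify(l, h, 10.0) for l, h in replicates]
print(concentrations, cv_percent(concentrations))
```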
The major bottlenecks in proteomics research today are related to data analysis. There is a need to create an environment in which computer scientists, biologists, and the people who collect the data can work closely together to develop the analytical tools needed to interpret the data [182–184]. Processing and analysis of proteomics data is indeed a very complex multistep process (Figure 5). The meaningful comparison, sharing, and exchange of data or analysis results obtained on different platforms or by different laboratories remain cumbersome, mainly due to the lack of standards for data formats, data processing parameters, and data quality assessment. Accurate, consistent, and transparent data processing and analysis are integral and critical parts of proteomics workflows. We can now generate huge amounts of data, and the enormous challenge at present is to work out how to analyze these data and generate real biological insights. An integrated pipeline for processing and analysis of complex proteomics data sets has therefore become a necessity.
This step consists of the assignment of MS/MS spectra to peptide sequences through a database search using one of several available engines (e.g., Sequest, Mascot, Comet, X!Tandem). One of the difficulties related to the use of Sequest for peptide identifications is the lack of methods to globally evaluate the quality of data and to assess global changes created by filtering schemes and/or database changes. Most approaches match and score large sets of experimental spectra against predicted masses of fragment ions of peptide sequences derived from a protein database. Results are scored according to a scheme specific to each search engine that also depends on the database used for the search. Usually tools are linked to one specific platform or were optimized for one instrument type. The various search engines do not yield identical results because they are based on different algorithms and scoring functions, making comparison and integration of results from different studies or experiments tedious [187, 188]. Peptide identification via database searches is computationally intensive and time-consuming. High quality data allow more effective searches due to tighter constraints, that is, tolerance on precursor ion mass and charge state assignment, which drastically reduce the search time in the case of an indexed database. In addition, accurate mass measurements of fragment ions further simplify the database searches and add confidence to the results. The association of identified peptides with their precursor proteins is a critical and difficult step in shotgun proteomics strategies because many peptides are common to several proteins, leading to ambiguous protein assignments. It is therefore critical to have an appropriate tool that is able to assess the validity of the protein inference and associate a probability with it.
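The effect of a tighter precursor mass tolerance on an indexed database can be illustrated with a sorted list of candidate peptide masses and a binary search. The index below is synthetic (evenly spaced masses) and the function name is hypothetical; the point is only that the number of candidates to be scored shrinks with the tolerance window.

```python
import bisect


def candidates_in_window(sorted_masses, precursor_mass, tol_ppm):
    """Return the indexed peptide masses within tol_ppm of the precursor mass."""
    delta = precursor_mass * tol_ppm / 1e6
    lo = bisect.bisect_left(sorted_masses, precursor_mass - delta)
    hi = bisect.bisect_right(sorted_masses, precursor_mass + delta)
    return sorted_masses[lo:hi]


# Synthetic index: masses from 1000.00 to 1002.00 Da in 0.01 Da steps.
index = [1000.0 + 0.01 * i for i in range(201)]
print(len(candidates_in_window(index, 1001.0, tol_ppm=500)))  # wide window
print(len(candidates_in_window(index, 1001.0, tol_ppm=10)))   # narrow window
```

Only the candidates returned by this lookup need to be scored against the fragment spectrum, which is where the time saving for accurate-mass data comes from.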
The ProteinProphet tool combines the probabilities assigned to peptides identified by MS/MS to compute accurate probabilities for the proteins present.
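One simple way to combine peptide probabilities into a protein probability is to treat the peptide identifications as independent evidence, so that the protein is present unless every peptide assignment is wrong. The sketch below shows only this basic idea; it is not the full published model, which applies further adjustments (for example, for peptides shared between proteins) that are not modeled here. The peptide probabilities are invented.

```python
from math import prod


def protein_probability(peptide_probs):
    """Probability that a protein is present, assuming independent peptide
    identifications: 1 minus the product of (1 - p_i) over its peptides."""
    return 1 - prod(1 - p for p in peptide_probs)


# Hypothetical peptide probabilities for two proteins.
print(protein_probability([0.9, 0.8]))   # two good peptides
print(protein_probability([0.3]))        # a single weak peptide
```

Under this assumption, several moderately confident peptides can support a protein more strongly than any one of them does alone.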
Data repositories are important for storing, retrieving, and exchanging data and results. Typically, proteomics experiments are carried out in isolation by a single laboratory, often in an uncoordinated way, making sharing and comparison of results tedious if not impossible. The lack of common standards and protocols has led to this situation and has often resulted in duplication of efforts. Results were usually reported as a set of identified proteins (i.e., a list of identified peptides and associated proteins) with minimal supporting data. The large volume of such data sets has made publication of detailed results using classical mechanisms very challenging. Sharing and exchange of data and results require the definition of standard formats for the data at all levels (including raw mass spectrometric data, processed data, and search results) as well as a better definition (and/or standardization) of the parameters used for the data processing or the database searches.
Organellar proteomics aims to describe the full complement of proteins of subcellular structures and organelles. Identification of the proteins contained in subcellular organelles has become a popular proteomics endeavor. When compared with whole-cell or whole-tissue proteomes, the more focused results from subcellular proteomic studies have yielded relatively simpler datasets from which biologically relevant information can be more easily extracted. Subcellular fractionation consists of two major steps: disruption of the cellular organization (homogenization) and fractionation of the homogenate to separate the different populations of organelles. Such a homogenate can then be resolved by differential centrifugation into several fractions containing mainly (1) nuclei, heavy mitochondria, cytoskeletal networks, and plasma membrane; (2) light mitochondria, lysosomes, and peroxisomes; (3) Golgi apparatus, endosomes and microsomes, and endoplasmic reticulum; (4) cytosol. Each population of organelles is characterized by size, density, charge, and other properties on which the separation relies. Analyzing subcellular fractions and organelles allows tracking of proteins that shuttle between different compartments, for example, between the cytoplasm and nucleus. The high dynamic range of protein abundance can be partially addressed by fractionating the proteome into subproteomes, and applying affinity purification may allow proteomic analysis of low copy number proteins. The nuclear, chloroplast, amyloplast, plasma membrane, peroxisome, endoplasmic reticulum, cell wall, and mitochondrial proteomes were successfully characterized in Arabidopsis. Several groups have taken advantage of this approach to recover a higher percentage of membrane proteins from subcellular extracts using various nonionic and zwitterionic detergents or phase-partitioning methods.
These efforts resulted in the successful determination of the protein complement of the thylakoid and envelope membrane systems of the chloroplast. Enriching for the protein class of interest based on particular chemical or physical characteristics offers the advantage of reducing sample complexity and gives access to lower abundance proteins in a discovery-driven experimental approach. Free flow electrophoresis (FFE) utilizes differences in electrophoretic mobility rather than density to separate cells or subcellular organelles. FFE has previously been used to separate endosomes from hamster ovary cells, plasma membrane from human platelets, and insulin transporting vesicles in liver cells. The separation is based on the electrophoretic mobility of cells or cell organelles suspended in a vertical free flowing buffer film to which an electric field is applied at a right angle to the flow direction. FFE has been a most valuable tool in the investigation of the composition of secretory vesicles; in addition, it has clarified how the membrane of plasma membrane vesicles is oriented after nitrogen disruption of human neutrophils. Importantly, subcellular fractionation is a flexible and adjustable approach that may be efficiently combined not only with 2D gel electrophoresis but also with gel-independent techniques. However, subcellular fractionation methods are limited by considerable cross-contamination with other subcellular organelles.
PTMs of proteins are considered to be one of the major determinants of organismal complexity. To date, more than 200 different types of PTMs have been identified, of which only a few are reversible and important for the regulation of biological processes. Specific functions are usually mediated through PTMs such as phosphorylation, acetylation, or glycosylation, which places additional demands on the sensitivity and precision of the method. One of the most studied PTMs is protein phosphorylation, because it is vital for a large number of protein functions that are important to cellular processes ranging from signal transduction, cell differentiation, and development to cell cycle control and metabolism. Enzymes and receptors can be switched “on” and “off” by phosphorylation and dephosphorylation. It has been estimated that 10–50% of proteins are phosphorylated. Phosphorylation often occurs on serine, threonine, and tyrosine residues in eukaryotic proteins. Analysis of the entire cellular phosphoproteome has been an attractive subject of study since the discovery of phosphorylation as a key regulatory mechanism of cell life. Unfortunately, phosphoprotein analysis is not straightforward, for five main reasons. First, the stoichiometry of phosphorylation is generally relatively low, because only a small fraction of the available intracellular pool of a protein is phosphorylated at any given time as a result of a stimulus. Second, the phosphorylation sites on proteins might vary, implying that any given phosphoprotein is heterogeneous (i.e., it exists in several different phosphorylated forms). Third, many of the signaling molecules, which are major targets of phosphorylation events, are present at low abundance within cells, and in these cases enrichment is a prerequisite before analysis.
Fourth, most analytical techniques used for studying protein phosphorylation have a limited dynamic range, which means that although major phosphorylation sites might be located easily, minor sites might be difficult to identify. Finally, phosphatases could dephosphorylate residues unless precautions are taken to inhibit their activity during preparation and purification steps of cell lysates. Various methods for protein phosphorylation site determination have been developed, yet this task remains a technical challenge. Western blotting has been widely used to determine the presence of PTMs. However, this technique relies on prior knowledge of the type and position of specific modifications and on the availability of antibodies. It has low throughput and is not ideal for studying highly complicated samples. Specific chemical or affinity enrichment steps are usually incorporated into the sample preparation or fractionation stages of the general scheme of proteomic studies [206, 207]. Well established methods involving the analysis of 32P-labeled phosphoproteins by Edman degradation and two-dimensional phosphopeptide mapping have proven to be powerful but not without limitations. Consequently, mass spectrometry (MS) has emerged as a reliable and sensitive method for the characterization of protein phosphorylation sites and may therefore represent a method of choice for the analysis of protein phosphorylation. Immobilized metal affinity chromatography (IMAC), metal oxide affinity chromatography (MOAC), and covalent methods are all capable of selectively enriching phosphopeptides. MOAC based on adsorption to TiO2 is especially attractive, but as with all techniques, loading, rinsing, and elution solutions must be carefully selected to minimize nonspecific adsorption and to maximize the detection of both monophosphorylated and multiphosphorylated species.
IMAC might not provide the selectivity available with TiO2 enrichment, but with appropriate reagents, IMAC can be selective and sensitive for monophosphorylated and tetraphosphorylated peptides. However, some buffers and reagents such as EDTA are not compatible with IMAC, so HPLC purification may be needed prior to this technique . When trying to isolate and identify as many phosphoproteins as possible in a cell lysate, chromatographic column-based methods are required. Multiple elutions from IMAC or MOAC columns or even gradient elutions can help to simplify fractions of proteins and reveal more peptides [212, 213]. A combination of techniques can reveal large numbers of phosphopeptides in complex samples, but comprehensive phosphoproteomics is still not possible. For the highest protein coverage, future phosphoproteomic techniques will likely employ multiple enrichment techniques along with two-dimensional separations, but such studies are time consuming. Combinations of affinity-based enrichment and extraction methods, multidimensional separation technologies, and mass spectrometry are particularly attractive for systematic investigation of posttranslationally modified proteins in proteomics .
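In MS data, each phosphate group adds a characteristic mass (HPO3, about 79.966 Da) to a peptide, and the heterogeneity of phosphoproteins follows from the number of serine, threonine, and tyrosine residues that could carry it. The sketch below lists candidate sites and the expected m/z shift; the peptide and m/z values are hypothetical, and only S, T, and Y are considered.

```python
PHOSPHO = 79.96633  # monoisotopic mass added per phosphate group (HPO3), Da


def phospho_candidates(peptide):
    """1-based positions and residues that could be phosphorylated (S, T, Y)."""
    return [(i + 1, aa) for i, aa in enumerate(peptide) if aa in "STY"]


def shifted_mz(unmodified_mz, charge, n_phospho=1):
    """m/z expected for the same peptide ion carrying n_phospho phosphate groups."""
    return unmodified_mz + n_phospho * PHOSPHO / charge


def n_phosphoforms(peptide):
    """Number of distinct phosphorylated forms if every candidate site can be
    modified independently (excluding the unmodified form)."""
    return 2 ** len(phospho_candidates(peptide)) - 1


peptide = "ASTYK"  # hypothetical peptide
print(phospho_candidates(peptide))
print(shifted_mz(500.0, charge=2))
print(n_phosphoforms(peptide))
```

Forms with the same number of phosphate groups on different residues have identical precursor masses, so fragment ion spectra are needed to localize the modified site.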
The application of proteomics and related technologies for the analysis of proteome is severely hampered by the lack of publicly available sequence information for most of the unsequenced organisms . Despite the precision of the mass information yielded by the SELDI technique, a significant number of proteins were found to have no similarity to known peptides, an aforementioned weakness of proteomics studies in nonmodel organisms . In order to circumvent this limitation, different strategies and tools were developed to make unsequenced organisms amenable to high-throughput proteomics  (Figure 6). However, an evaluation of their performance in an integrated proteomics strategy using high-throughput shotgun MS data is currently missing. In principle, two different approaches can lead to an increase in protein identifications from unsequenced organisms. In the first approach, MS/MS data are searched against a protein database of an evolutionarily closely related organism. However, as a matter of principle of database-dependent searches, only proteins can be identified that contain at least one peptide with exactly the same sequence as the peptide from a protein in the database. With increasing evolutionary distance this will be an increasingly severe restriction . In the second approach, the amino acid sequence of a peptide is extracted from the MS/MS spectrum for de novo sequencing, that is, in a fully database-independent manner using exclusively the information contained in the MS/MS spectrum. Several software tools for peptide de novo sequencing are now available and some of them provide sufficiently good results when applied to high-quality spectra . 
A basic limitation of MS de novo sequencing methods is the necessity for backbone cleavage between each pair of adjacent amino acids; a mass value representing a terminal fragment containing only one of the two residues is a first requirement for ordering a specific pair [220, 221]. This limitation has driven the need for bioinformatics approaches that can help interpret the proteomics data.
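The cleavage requirement can be made concrete by reading residues from the mass differences between consecutive fragment ions of one series. The sketch below assumes a clean ladder of singly charged ions from a single series and a 0.01 Da tolerance, both simplifications; the m/z values are those of the low-mass y ions of a hypothetical peptide. When a cleavage is missing, the gap spans two residues and cannot be assigned to a single amino acid.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ion_mz, tol=0.01):
    """Assign a residue to each gap between consecutive ions of one series.

    Returns one entry per gap: the matching residue(s), joined by '/' when
    isobaric, or the unexplained mass difference in brackets.
    """
    ions = sorted(ion_mz)
    calls = []
    for low, high in zip(ions, ions[1:]):
        diff = high - low
        matches = [aa for aa, m in RESIDUE.items() if abs(m - diff) <= tol]
        calls.append("/".join(matches) if matches else f"[{diff:.3f}]")
    return calls


# Complete ladder (y1 to y4 of a hypothetical peptide ending in ...PLER).
print(read_ladder([175.118946, 304.161536, 417.245596, 514.298356]))
# Same ladder with one ion missing: the first gap now spans two residues.
print(read_ladder([175.118946, 417.245596, 514.298356]))
```

For a y-ion ladder read from low to high m/z, the residues are obtained from the C-terminal side toward the N-terminus; leucine and isoleucine have the same mass and cannot be distinguished this way.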
In the past several years there have been very important and extremely useful advances in proteomics methods based on bottom-up display and bottom-up identification using peptides. Claims that these methods offer more sensitivity, greater rapidity, and greater proteome coverage are often made with the explicit or implicit assertion that they are bound to replace more traditional methods based on top-down analysis, especially those using 2D gels [223, 224]. The combination of bottom-up display and bottom-up identification has achieved very important successes in detecting the presence of large numbers of different proteins in cells or subcellular organelles [225, 226]. The use of specific fractionation schemes and the prudent adoption of methods that increase the number of proteins that can be identified and quantified are enabling significant biological advances. Further technological developments that enable a larger proportion of the proteome to be visualized will further enhance our ability to characterize biological systems. As such, these advances in proteomics will impact not only academic pursuits but also pharmaceutical, biotechnology, and diagnostic research and development.
In the future, gel-free techniques such as MudPIT, iTRAQ, and 18O stable isotope labeling can be expected to gain more importance as they become more established. A sample prefractionation system provides a highly valuable tool for fractionating proteins and peptides from complex eukaryotic samples such as plasma. This approach has a positive influence on the number of proteins identified compared to the SCX method. iTRAQ is a very powerful tool, recognised for its ability to quantify proteins in relative terms. The iTRAQ reagent improves MALDI ionisation, especially for peptides containing lysine. Although SILAC labelling is easy for any laboratory that uses cell culture, the MS technology that is required is still beyond the capabilities of most groups. One of the factors that contributed to the rapid acceptance of the SILAC technology was the availability of an open-source program, MSQuant, for interpreting results. Protein microarrays offer the ability to simultaneously survey multiple protein markers in an effort to develop expression profile changes across multiple protein analytes for potential use in diagnosis, prognosis, and measurement of therapeutic efficacy. This technology is an excellent high-throughput method for probing an entire collection of proteins for a specific function or biochemistry. It is an exceptional new way to discover previously unknown multifunctional proteins and to discover new functionalities for well-studied proteins. A systematic and efficient analysis of vast genomic and proteomic data sets is a major challenge for researchers today. To overcome the limitations of current proteomics strategies with regard to the dynamic range of peptides detected, alternative mass spectrometry-based approaches are being explored.
Targeted strategies exemplified by multiple reaction monitoring detect, quantify, and possibly collect a product ion spectrum to confirm the identity of a peptide with much greater sensitivity because the precursor ion is not detected in the full mass spectrum . A systematic and efficient evaluation of large-scale experimental results requires (1) automatic retrieval of user defined information to construct a customized, queryable database; (2) an intuitive graphical and query platform to display and analyze experimental data in the context of the customized database; (3) efficient utilization of web-based bioinformatics software tools for data interpretation, prediction of function, and modeling; (4) scalability and reconstruction of the database in response to changing user needs and an ever-expanding base of knowledge and bioinformatics tools . Creating a software tool to encompass the four crucial features outlined above is a challenging and ongoing task, particularly with respect to the ever-expanding publicly available base of knowledge and bioinformatics tools. The data processing and analysis bottleneck can be overcome through integration of the entire suite of tools into one linear pipeline. The good news is that all of the various proteomics strategies are in phases of very rapid technological development and that important advances in sensitivity, throughput, and proteome coverage can be expected in the near future for all of them.
This work was supported by Award No. SA-C0040/UK-C0016 of the King Abdullah University of Science and Technology (KAUST), the RGC grants of HKSAR (662408 and N_HKUST602/09), and a grant from China Ocean Mineral Resources Research and Development Association (COMRRDA06/07.SC02) to PY Qian.