Related Articles

1.  Absolute quantification of microbial proteomes at different states by directed mass spectrometry 
The developed directed mass spectrometry workflow generates consistent, system-wide quantitative maps of microbial proteomes in a single analysis. Application to the human pathogen L. interrogans revealed mechanistic proteome changes over time involved in pathogenic progression and antibiotic defense, and new insights about the regulation of absolute protein abundances within operons.
The developed, directed proteomic approach allowed consistent detection and absolute quantification of 1680 proteins of the human pathogen L. interrogans in a single LC–MS/MS experiment. The comparison of 25 extensive, consistent and quantitative proteome maps revealed new insights about the proteome changes involved in pathogenic progression and antibiotic defense of L. interrogans, and about the regulation of protein abundances within operons. The generated time-resolved data sets are compatible with pattern analysis algorithms developed for transcriptomics, including hierarchical clustering and functional enrichment analysis of the detected profile clusters. This is the first study that describes the absolute quantitative behavior of any proteome over multiple states and represents the most comprehensive proteome abundance pattern comparison for any organism to date.
Over the last decade, mass spectrometry (MS)-based proteomics has evolved as the method of choice for system-wide proteome studies and now allows for the characterization of several thousands of proteins in a single sample. Despite these great advances, redundant monitoring of protein levels over large sample numbers in a high-throughput manner remains a challenging task. New directed MS strategies have been shown to overcome some of the current limitations, thereby enabling the acquisition of consistent and system-wide data sets of proteomes with low-to-moderate complexity at high throughput.
In this study, we applied this integrated, two-stage MS strategy to investigate global proteome changes in the human pathogen L. interrogans. In the initial discovery phase, 1680 proteins (out of around 3600 gene products) could be identified (Schmidt et al, 2008) and, by focusing precious MS-sequencing time on the most dominant, specific peptides per protein, all proteins could be accurately and consistently monitored over 25 different samples within a few days of instrument time in the following scoring phase (Figure 1). Additionally, the co-analysis of heavy reference peptides enabled us to obtain absolute protein concentration estimates for all identified proteins in each perturbation (Malmström et al, 2009). The detected proteins did not show any biases against functional groups or protein classes, including membrane proteins, and span an abundance range of more than three orders of magnitude, a range that is expected to cover most of the L. interrogans proteome (Malmström et al, 2009).
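The arithmetic behind a reference-peptide estimate can be sketched as follows. This is a minimal illustration of generic isotope-dilution quantification, not the procedure used in the study (which is described in the cited reference and not reproduced here). The function name, its parameters, and the example numbers are all hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_cell(light_intensity, heavy_intensity, heavy_spiked_fmol, cells_in_sample):
    """Estimate endogenous protein copies per cell from one heavy reference peptide.

    All names and parameters are illustrative assumptions:
    light_intensity   -- MS signal of the endogenous (light) peptide
    heavy_intensity   -- MS signal of the spiked heavy reference peptide
    heavy_spiked_fmol -- known amount of heavy peptide added, in femtomoles
    cells_in_sample   -- number of cells that were digested
    """
    if heavy_intensity <= 0 or cells_in_sample <= 0:
        raise ValueError("heavy_intensity and cells_in_sample must be positive")
    # The light/heavy signal ratio scales the known spiked amount.
    endogenous_fmol = (light_intensity / heavy_intensity) * heavy_spiked_fmol
    # Convert femtomoles to molecules, then normalise by cell count.
    molecules = endogenous_fmol * 1e-15 * AVOGADRO
    return molecules / cells_in_sample


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(copies_per_cell(2.0e6, 1.0e6, 50.0, 1e9))
```

The sketch assumes one peptide per protein and equal response of the light and heavy forms; a real workflow would combine several peptides and propagate measurement error.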
To elucidate mechanistic proteome changes over time involved in pathogenic progression and antibiotic defense of L. interrogans, we generated time-resolved proteome maps of cells perturbed with serum and with three different antibiotics, currently used to treat leptospirosis, at sublethal concentrations. This yielded an information-rich proteomic data set that describes, for the first time, the absolute quantitative behavior of any proteome over multiple states, and represents the most comprehensive proteome abundance pattern comparison for any organism to date. Using this unique property of the data set, we could quantify protein components of entire pathways across several time points and subject the data sets to cluster analysis, a tool that was previously limited to the transcript level due to incomplete sampling at the protein level (Figure 4). Based on these analyses, we could demonstrate that Leptospira cells adjust the cellular abundance of a certain subset of proteins and pathways as a general response to stress, while other parts of the proteome respond highly specifically. The cells furthermore react to individual treatments by ‘fine-tuning’ the abundance of certain proteins and pathways in order to cope with the specific cause of stress. Intriguingly, the most specific and significant expression changes were observed for proteins involved in motility, tissue penetration and virulence after serum treatment, where we tried to simulate the host environment. While many of the detected protein changes demonstrate good agreement with available transcriptomics data, most proteins showed a poor correlation. This includes potential virulence factors, like Loa22 or OmpL1, with confirmed expression in vivo that were significantly up-regulated on the protein level, but not on the mRNA level, strengthening the importance of proteomic studies.
The high resolution and coverage of the proteome data set enabled us to further investigate protein abundance changes of co-regulated genes within operons. This suggests that although most proteins within an operon respond to regulation synchronously, bacterial cells seem to have subtle means to adjust the levels of individual proteins or protein groups outside of the general trend, a phenomenon that was recently also observed on the transcript level of other bacteria (Güell et al, 2009).
The method can be implemented with standard high-resolution mass spectrometers and software tools that are readily available in the majority of proteomics laboratories. It is scalable to any proteome of low-to-medium complexity and can be extended to post-translational modifications or peptide-labeling strategies for quantification. We therefore expect the approach outlined here to become a cornerstone for microbial systems biology.
Over the past decade, liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS) has evolved into the main proteome discovery technology. Up to several thousand proteins can now be reliably identified from a sample and the relative abundance of the identified proteins can be determined across samples. However, the remeasurement of substantially similar proteomes, for example those generated by perturbation experiments in systems biology, at high reproducibility and throughput remains challenging. Here, we apply a directed MS strategy to detect and quantify sets of pre-determined peptides in tryptic digests of cells of the human pathogen Leptospira interrogans at 25 different states. We show that in a single LC–MS/MS experiment around 5000 peptides, covering 1680 L. interrogans proteins, can be consistently detected and their absolute expression levels estimated, revealing new insights about the proteome changes involved in pathogenic progression and antibiotic defense of L. interrogans. This is the first study that describes the absolute quantitative behavior of any proteome over multiple states, and represents the most comprehensive proteome abundance pattern comparison for any organism to date.
PMCID: PMC3159967  PMID: 21772258
absolute quantification; directed mass spectrometry; Leptospira interrogans; microbiology; proteomics
2.  EP3 Fundamentals of Protein Sequence Characterization by Mass Spectrometry 
The first section of the tutorial will describe the instrumentation typically used in biological mass spectrometry applications related to protein identification. We focus on the relevant ionization techniques, common mass analyzers, and sample introduction systems. Attention will be given to properties, such as mass accuracy and mass resolution, which are important to protein characterization and database search strategies for protein identification. Practical considerations regarding the selection and use of instruments as well as troubleshooting information will be offered throughout the presentation.
The fundamentals of basic protein sequence characterization, including post-translational modifications, by mass spectrometry will be presented in the second section of the tutorial. Emphasis is placed on the use of tandem mass spectrometry at the peptide level to confirm and in some cases derive partial peptide sequence, identify post-translationally modified sequences, and localize the specific site of attachment. We will describe the basic principles of peptide fragmentation by collision-induced dissociation and how to use these principles to interpret MS/MS spectra. Basic sample preparation protocols compatible with mass spectrometry analysis will be described.
The third section of the tutorial will focus on mass spectrometric analyses of protein mixtures (proteomes). Besides the sheer number of proteins, the range of concentrations in certain samples is frequently an impediment to a complete analysis. Various fractionation, capture, and depletion methods will be described for dealing with very complex protein mixtures. Some of these capture methods also provide additional information regarding post-translational modifications. A brief description of database search methods for protein identification will be followed by a more extensive discussion of validating the search results. Finally, brief descriptions of protein quantitation methods will be presented, and their various advantages and disadvantages will be discussed.
PMCID: PMC2291869
3.  Methods for visual mining of genomic and proteomic data atlases 
BMC Bioinformatics  2012;13:58.
As the volume, complexity and diversity of the information that scientists work with on a daily basis continues to rise, so too does the requirement for new analytic software. The analytic software must resolve the tension between the need to allow for a high level of scientific reasoning and the requirement for an intuitive, easy-to-use tool that does not demand specialist, and often arduous, training. Information visualization provides a solution to this problem, as it allows for direct manipulation of and interaction with diverse and complex data. The challenge facing bioinformatics researchers is how to apply this knowledge to data sets that are continually growing in a field that is rapidly changing.
This paper discusses an approach to the development of visual mining tools capable of supporting the mining of massive data collections used in systems biology research, and also discusses lessons that have been learned providing tools for both local researchers and the wider community. Example tools were developed which are designed to enable the exploration and analyses of both proteomics and genomics based atlases. These atlases represent large repositories of raw and processed experiment data generated to support the identification of biomarkers through mass spectrometry (the PeptideAtlas) and the genomic characterization of cancer (The Cancer Genome Atlas). Specifically the tools are designed to allow for: the visual mining of thousands of mass spectrometry experiments, to assist in designing informed targeted protein assays; and the interactive analysis of hundreds of genomes, to explore the variations across different cancer genomes and cancer types.
The mining of massive repositories of biological data requires the development of new tools and techniques. Visual exploration of the large-scale atlas data sets allows researchers to mine data to find new meaning and make sense at scales from single samples to entire populations. Providing linked task-specific views that allow a user to start from points of interest (from diseases to single genes) enables targeted exploration of thousands of spectra and genomes. As the composition of the atlases changes, and our understanding of the biology increases, new tasks will continually arise. It is therefore important to provide the means to make the data available in a suitable manner in as short a time as possible. We have done this through the use of common visualization workflows, into which we rapidly deploy visual tools. These visualizations follow common metaphors where possible to assist users in understanding the displayed data. Rapid development of tools and task-specific views allows researchers to mine large-scale data almost as quickly as it is produced. Ultimately these visual tools enable new inferences, new analyses and further refinement of the large-scale data being provided in atlases such as PeptideAtlas and The Cancer Genome Atlas.
PMCID: PMC3352268  PMID: 22524279
4.  PEPPI: a peptidomic database of human protein isoforms for proteomics experiments 
BMC Bioinformatics  2010;11(Suppl 6):S7.
Protein isoform generation, which may derive from alternative splicing, genetic polymorphism, and posttranslational modification, is an essential source of achieving molecular diversity by eukaryotic cells. Previous studies have shown that protein isoforms play critical roles in disease diagnosis, risk assessment, sub-typing, prognosis, and treatment outcome predictions. Understanding the types, presence, and abundance of different protein isoforms in different cellular and physiological conditions is a major task in functional proteomics, and may pave ways to molecular biomarker discovery of human diseases. In tandem mass spectrometry (MS/MS) based proteomics analysis, peptide peaks with exact matches to protein sequence records in the proteomics database may be identified with mass spectrometry (MS) search software. However, due to limited annotation and poor coverage of protein isoforms in proteomics databases, high throughput protein isoform identifications, particularly those arising from alternative splicing and genetic polymorphism, have not been possible.
Therefore, we present the PEPtidomics Protein Isoform Database (PEPPI), a comprehensive database of computationally-synthesized human peptides that can identify protein isoforms derived from either alternatively spliced mRNA transcripts or SNP variations. We collected genome, pre-mRNA alternative splicing and SNP information from Ensembl. We synthesized in silico isoform transcripts that cover all exons and theoretically possible junctions of exons and introns, as well as all their variations derived from known SNPs. With three case studies, we further demonstrated that the database can help researchers discover and characterize new protein isoform biomarkers from experimental proteomics data.
We developed a new tool for the proteomics community to characterize protein isoforms from MS-based proteomics experiments. By cataloguing each peptide configuration in the PEPPI database, users can study genetic variations and alternative splicing events at the proteome level. They can also batch-download peptide sequences in FASTA format to search for MS/MS spectra derived from human samples. The database can help generate novel hypotheses on molecular risk factors and molecular mechanisms of complex diseases, leading to identification of potentially highly specific protein isoform biomarkers.
PMCID: PMC3026381  PMID: 20946618
5.  Status of complete proteome analysis by mass spectrometry: SILAC labeled yeast as a model system 
Genome Biology  2006;7(6):R50.
A mass spectrometry analysis of the yeast proteome shows that complex mixture analysis is not limited by sensitivity but by a combination of dynamic range and by effective sequencing speed.
Mass spectrometry has become a powerful tool for the analysis of large numbers of proteins in complex samples, enabling much of proteomics. Due to various analytical challenges, so far no proteome has been sequenced completely. O'Shea, Weissman and co-workers have recently determined the copy number of yeast proteins, making this proteome an excellent model system to study factors affecting coverage.
To probe the yeast proteome in depth and determine factors currently preventing complete analysis, we grew yeast cells, extracted proteins and separated them by one-dimensional gel electrophoresis. Peptides resulting from trypsin digestion were analyzed by liquid chromatography mass spectrometry on a linear ion trap-Fourier transform mass spectrometer with very high mass accuracy and sequencing speed. We achieved unambiguous identification of more than 2,000 proteins, including very low-abundance ones. Effective dynamic range was limited to about 1,000 and effective sensitivity to about 500 femtomoles, far from the subfemtomole sensitivity possible with single proteins. We used SILAC (stable isotope labeling by amino acids in cell culture) to generate one-to-one pairs of true peptide signals and investigated whether sensitivity, sequencing speed or dynamic range was limiting the analysis.
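The trypsin digestion step mentioned above can be illustrated in silico. The sketch below applies the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). It ignores missed cleavages and is not taken from the study; the function name, the `min_length` parameter, and the example sequence are hypothetical.

```python
def tryptic_digest(sequence, min_length=1):
    """Split a protein sequence into fully cleaved tryptic peptides.

    Assumed rule: cleave C-terminal to K or R, except when followed by P.
    Peptides shorter than min_length are discarded.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep the C-terminal remainder if the sequence does not end in K/R.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    print(tryptic_digest("MAKRPGKLLR"))
```

A length filter such as `min_length` is often useful because very short peptides are rarely informative for identification; the threshold to use is a judgment call.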
Advanced mass spectrometry methods can unambiguously identify more than 2,000 proteins in a single proteome. Complex mixture analysis is not limited by sensitivity but by a combination of dynamic range (high abundance peptides preventing sequencing of low abundance ones) and by effective sequencing speed. Substantially increased coverage of the yeast proteome appears feasible with further development in software and instrumentation.
PMCID: PMC1779535  PMID: 16784548
6.  The Genome Organization of Thermotoga maritima Reflects Its Lifestyle 
PLoS Genetics  2013;9(4):e1003485.
The generation of genome-scale data is becoming more routine, yet the subsequent analysis of omics data remains a significant challenge. Here, an approach that integrates multiple omics datasets with bioinformatics tools was developed that produces a detailed annotation of several microbial genomic features. This methodology was used to characterize the genome of Thermotoga maritima—a phylogenetically deep-branching, hyperthermophilic bacterium. Experimental data were generated for whole-genome resequencing, transcription start site (TSS) determination, transcriptome profiling, and proteome profiling. These datasets, analyzed in combination with bioinformatics tools, served as a basis for the improvement of gene annotation, the elucidation of transcription units (TUs), the identification of putative non-coding RNAs (ncRNAs), and the determination of promoters and ribosome binding sites. This revealed many distinctive properties of the T. maritima genome organization relative to other bacteria. This genome has a high number of genes per TU (3.3), a paucity of putative ncRNAs (12), and few TUs with multiple TSSs (3.7%). Quantitative analysis of promoters and ribosome binding sites showed increased sequence conservation relative to other bacteria. The 5′UTRs follow an atypical bimodal length distribution comprised of “Short” 5′UTRs (11–17 nt) and “Common” 5′UTRs (26–32 nt). Transcriptional regulation is limited by a lack of intergenic space for the majority of TUs. Lastly, a high fraction of annotated genes are expressed independent of growth state and a linear correlation of mRNA/protein is observed (Pearson r = 0.63, p < 2.2×10^−16, t-test). These distinctive properties are hypothesized to be a reflection of this organism's hyperthermophilic lifestyle and could yield novel insights into the evolutionary trajectory of microbial life on earth.
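The mRNA/protein correlation reported above is a Pearson coefficient. As a minimal sketch of how such a value is computed from paired abundance measurements, the following uses only the textbook formula. The function name and the example values are hypothetical, and any preprocessing the authors applied (for example log transformation) is not specified in the abstract and is not modelled here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up paired abundances (one entry per gene), for illustration only.
    mrna = [1.2, 2.5, 3.1, 4.8, 5.0]
    protein = [0.9, 2.0, 3.5, 4.1, 5.6]
    print(pearson_r(mrna, protein))
```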
Author Summary
Genomic studies have greatly benefited from the advent of high-throughput technologies and bioinformatics tools. Here, a methodology integrating genome-scale data and bioinformatics tools is developed to characterize the genome organization of the hyperthermophilic, phylogenetically deep-branching bacterium Thermotoga maritima. This approach elucidates several features of the genome organization and enables comparative analysis of these features across diverse taxa. Our results suggest that the genome of T. maritima is reflective of its hyperthermophilic lifestyle. Ultimately, constraints imposed on the genome have negative impacts on regulatory complexity and phenotypic diversity. Investigating the genome organization of Thermotogae species will help resolve various causal factors contributing to the genome organization such as phylogeny and environment. Applying a similar analysis of the genome organization to numerous taxa will likely provide insights into microbial evolution.
PMCID: PMC3636130  PMID: 23637642
7.  Clusters of Internally Primed Transcripts Reveal Novel Long Noncoding RNAs 
PLoS Genetics  2006;2(4):e37.
Non-protein-coding RNAs (ncRNAs) are increasingly being recognized as having important regulatory roles. Although much recent attention has focused on tiny 22- to 25-nucleotide microRNAs, several functional ncRNAs are orders of magnitude larger in size. Examples of such macro ncRNAs include Xist and Air, which in mouse are 18 and 108 kilobases (Kb), respectively. We surveyed the 102,801 FANTOM3 mouse cDNA clones and found that Air and Xist were present not as single, full-length transcripts but as a cluster of multiple, shorter cDNAs, which were unspliced, had little coding potential, and were most likely primed from internal adenine-rich regions within longer parental transcripts. We therefore conducted a genome-wide search for regional clusters of such cDNAs to find novel macro ncRNA candidates. Sixty-six regions were identified, each of which mapped outside known protein-coding loci and which had a mean length of 92 Kb. We detected several known long ncRNAs within these regions, supporting the basic rationale of our approach. In silico analysis showed that many regions had evidence of imprinting and/or antisense transcription. These regions were significantly associated with microRNAs and transcripts from the central nervous system. We selected eight novel regions for experimental validation by northern blot and RT-PCR and found that the majority represent previously unrecognized noncoding transcripts that are at least 10 Kb in size and predominantly localized in the nucleus. Taken together, the data not only identify multiple new ncRNAs but also suggest the existence of many more macro ncRNAs like Xist and Air.
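A genome-wide search for regional clusters of cDNAs can be pictured as grouping mapped clones that lie close together on the same chromosome. The sketch below is one simple way to do this, not the authors' algorithm. The `max_gap` and `min_members` thresholds, the tuple layout, and the example coordinates are hypothetical.

```python
from collections import defaultdict


def cluster_regions(cdnas, max_gap=10_000, min_members=3):
    """Group cDNA alignments into regional clusters.

    cdnas       -- iterable of (chromosome, start, end) tuples
    max_gap     -- assumed maximum distance between neighbours in one cluster
    min_members -- assumed minimum number of cDNAs for a cluster to be reported
    Returns a list of (chromosome, start, end, member_count) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in cdnas:
        by_chrom[chrom].append((start, end))

    clusters = []
    for chrom in sorted(by_chrom):
        current = None  # [cluster_start, cluster_end, member_count]
        for start, end in sorted(by_chrom[chrom]):
            if current is not None and start - current[1] <= max_gap:
                # Close enough to the running cluster: extend it.
                current[1] = max(current[1], end)
                current[2] += 1
            else:
                # Gap too large: report the finished cluster, start a new one.
                if current is not None and current[2] >= min_members:
                    clusters.append((chrom, current[0], current[1], current[2]))
                current = [start, end, 1]
        if current is not None and current[2] >= min_members:
            clusters.append((chrom, current[0], current[1], current[2]))
    return clusters


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    example = [("chr1", 100, 500), ("chr1", 600, 900), ("chr1", 1200, 1500),
               ("chr1", 50000, 50500), ("chr2", 10, 20)]
    print(cluster_regions(example, max_gap=1000, min_members=3))
```

The study additionally filtered for unspliced, low-coding-potential cDNAs outside known protein-coding loci; those filters would be applied to the input list before clustering and are left out of this sketch.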
The human genome has been sequenced, and, intriguingly, less than 2% specifies the information for the basic protein building blocks of our bodies. So, what does the other 98% do? It now appears that the mammalian genome also specifies the instructions for many previously undiscovered “non protein-coding RNA” (ncRNA) genes. However, what these ncRNAs do is largely unknown. In recent years, strategies have been designed that have successfully identified hundreds of short ncRNAs—termed microRNAs—many of which have since been shown to act as genetic regulators. Also known to be functionally important are a handful of ncRNAs orders of magnitude larger in size than microRNAs. The availability of complete genome and comprehensive transcript sequences allows for the systematic discovery of more large ncRNAs. The authors developed a computational strategy to screen the mouse genome and identify large ncRNAs. They detected existing large ncRNAs, thus validating their approach, but, more importantly, discovered more than 60 other candidates, some of which were subsequently confirmed experimentally. This work opens the door to a virtually unexplored world of large ncRNAs and beckons future experimental work to define the cellular functions of these molecules.
PMCID: PMC1449886  PMID: 16683026
8.  Cell-type specific analysis of translating RNAs in developing flowers reveals new levels of control 
Combining translating ribosome affinity purification with RNA-seq for cell-specific profiling of translating RNAs in developing flowers. Cell type comparisons of cell type-specific hormone responses, promoter motifs, coexpressed cognate binding factor candidates, and splicing isoforms. Widespread post-transcriptional regulation at both the intron splicing and translational stages. A new class of noncoding RNAs associated with polysomes.
What constitutes a differentiated cell type? How much do cell types differ in their transcription of genes? The development and functions of tissues rely on constant interactions among distinct and nonequivalent cell types. Answering these questions will require quantitative information on transcriptomes, proteomes, protein–protein interactions, protein–nucleic acid interactions, and metabolomes at cellular resolution. The systems approaches emerging in biology promise to explain properties of biological systems based on genome-wide measurements of expression, interaction, regulation, and metabolism. To facilitate a systems approach, it is essential first to capture such components in a global manner, ideally at cellular resolution.
Recently, microarray analysis of transcriptomes has been extended to a cellular level of resolution by using laser microdissection or fluorescence-activated sorting (for review, see Nelson et al, 2008). These methods have been limited by stresses associated with cellular separation and isolation procedures, and biases associated with mandatory RNA amplification steps. A newly developed method, translating ribosome affinity purification (TRAP; Zanetti et al, 2005; Heiman et al, 2008; Mustroph et al, 2009), circumvents these problems by epitope-tagging a ribosomal protein in specific cellular domains to selectively purify polysomes. We combined TRAP with deep sequencing, which we term TRAP-seq, to provide cell-level spatiotemporal maps for Arabidopsis early floral development at single-base resolution.
Flower development in Arabidopsis has been studied extensively and is one of the best understood aspects of plant development (for review, see Krizek and Fletcher, 2005). Genetic analysis of homeotic mutants established the ABC model, in which three classes of regulatory genes, A, B and C, work in a combinatorial manner to confer organ identities of four whorls (Coen and Meyerowitz, 1991). Each class of regulatory gene is expressed in a specific and evolutionarily conserved domain, and the action of the class A, B and C genes is necessary for specification of organ identity (Figure 1A).
Using TRAP-seq, we purified cell-specific translating mRNA populations, which we and others call the translatome, from the A, B and C domains of early developing flowers, in which floral patterning and the specification of floral organs is established. To achieve temporal specificity, we used a floral induction system to facilitate collection of early stage flowers (Wellmer et al, 2006). The combination of TRAP-seq with domain-specific promoters and this floral induction system enabled fine spatiotemporal isolation of translating mRNA in specific cellular domains, and at specific developmental stages.
Multiple lines of evidence confirmed the specificity of this approach, including detecting the expression in expected domains but not in other domains for well-studied flower marker genes and known physiological functions (Figures 1B–D and 2A–C). Furthermore, we provide numerous examples from flower development in which a spatiotemporal map of rigorously comparable cell-specific translatomes makes possible new views of the properties of cell domains not evident in data obtained from whole organs or tissues, including patterns of transcription and cis-regulation, new physiological differences among cell domains and between flower stages, putative hormone-active centers, and splicing events specific for flower domains (Figure 2A–D). Such findings may provide new targets for reverse genetics studies and may aid in the formulation and validation of interaction and pathway networks.
Besides cellular heterogeneity, the transcriptome is regulated at several steps through the life of mRNA molecules, which are not directly accessible through traditional transcriptome profiling of total mRNA abundance. By comparing the translatome and transcriptome, we integratively profiled two key posttranscriptional control points, intron splicing and translation state. From our translatome-wide profiling, we (i) confirmed that both posttranscriptional regulation control points were used by a large portion of the transcriptome; (ii) identified a number of cis-acting features within the coding or noncoding sequences that correlate with splicing or translation state; and (iii) revealed correlation between each regulation mechanism and gene function. Our transcriptome-wide surveys have highlighted target genes whose transcripts are probably under extensive posttranscriptional regulation during flower development.
Finally, we reported the finding of a large number of polysome-associated ncRNAs. About one-third of all annotated ncRNAs in the Arabidopsis genome were observed to co-purify with polysomes. Coding capacity analysis confirmed that most of them are real ncRNAs without conserved ORFs. The group of polysome-associated ncRNAs reported in this study is a potential new addition to the expanding riboregulator catalog; they could have roles in translational regulation during early flower development.
Determining both the expression levels of mRNA and the regulation of its translation is important in understanding specialized cell functions. In this study, we describe both the expression profiles of cells within spatiotemporal domains of the Arabidopsis thaliana flower and the post-transcriptional regulation of these mRNAs, at nucleotide resolution. We express a tagged ribosomal protein under the promoters of three master regulators of flower development. By precipitating tagged polysomes, we isolated cell type-specific mRNAs that are probably translating, and quantified those mRNAs through deep sequencing. Cell type comparisons identified known cell-specific transcripts and uncovered many new ones, from which we inferred cell type-specific hormone responses, promoter motifs and coexpressed cognate binding factor candidates, and splicing isoforms. By comparing translating mRNAs with steady-state overall transcripts, we found evidence for widespread post-transcriptional regulation at both the intron splicing and translational stages. Sequence analyses identified structural features associated with each step. Finally, we identified a new class of noncoding RNAs associated with polysomes. Findings from our profiling lead to new hypotheses in the understanding of flower development.
PMCID: PMC2990639  PMID: 20924354
Arabidopsis; flower; intron; transcriptome; translation
9.  Unique Signatures of Long Noncoding RNA Expression in Response to Virus Infection and Altered Innate Immune Signaling 
mBio  2010;1(5):e00206-10.
Studies of the host response to virus infection typically focus on protein-coding genes. However, non-protein-coding RNAs (ncRNAs) are transcribed in mammalian cells, and the roles of many of these ncRNAs remain enigmas. Using next-generation sequencing, we performed a whole-transcriptome analysis of the host response to severe acute respiratory syndrome coronavirus (SARS-CoV) infection across four founder mouse strains of the Collaborative Cross. We observed differential expression of approximately 500 annotated, long ncRNAs and 1,000 nonannotated genomic regions during infection. Moreover, studies of a subset of these ncRNAs and genomic regions showed the following. (i) Most were similarly regulated in response to influenza virus infection. (ii) They had distinctive kinetic expression profiles in type I interferon receptor and STAT1 knockout mice during SARS-CoV infection, including unique signatures of ncRNA expression associated with lethal infection. (iii) Over 40% were similarly regulated in vitro in response to both influenza virus infection and interferon treatment. These findings represent the first discovery of the widespread differential expression of long ncRNAs in response to virus infection and suggest that ncRNAs are involved in regulating the host response, including innate immunity. At the same time, virus infection models provide a unique platform for studying the biology and regulation of ncRNAs.
Most studies examining the host transcriptional response to infection focus only on protein-coding genes. However, there is growing evidence that thousands of non-protein-coding RNAs (ncRNAs) are transcribed from mammalian genomes. While most attention to the involvement of ncRNAs in virus-host interactions has been on small ncRNAs such as microRNAs, it is becoming apparent that many long ncRNAs (>200 nucleotides [nt]) are also biologically important. These long ncRNAs have been found to have widespread functionality, including chromatin modification and transcriptional regulation and serving as the precursors of small RNAs. With the advent of next-generation sequencing technologies, whole-transcriptome analysis of the host response, including long ncRNAs, is now possible. Using this approach, we demonstrated that virus infection alters the expression of numerous long ncRNAs, suggesting that these RNAs may be a new class of regulatory molecules that play a role in determining the outcome of infection.
PMCID: PMC2962437  PMID: 20978541
10.  Identification of CRISPR and riboswitch related RNAs among novel noncoding RNAs of the euryarchaeon Pyrococcus abyssi 
BMC Genomics  2011;12:312.
Noncoding RNA (ncRNA) has been recognized as an important regulator of gene expression networks in Bacteria and Eucaryota. Little is known about ncRNA in thermococcal archaea except for the eukaryotic-like C/D and H/ACA modification guide RNAs.
Using a combination of in silico and experimental approaches, we identified and characterized novel P. abyssi ncRNAs transcribed from 12 intergenic regions, ten of which are conserved throughout the Thermococcales. Several of them accumulate in the late-exponential phase of growth. Analysis of the genomic context and sequence conservation amongst related thermococcal species revealed two novel P. abyssi ncRNA families. The CRISPR family is comprised of crRNAs expressed from two of the four P. abyssi CRISPR cassettes. The 5'UTR derived family includes four conserved ncRNAs, two of which have features similar to known bacterial riboswitches. Several of the novel ncRNAs have sequence similarities to orphan OrfB transposase elements. Based on RNA secondary structure predictions and experimental results, we show that three of the twelve ncRNAs include Kink-turn RNA motifs, arguing for a biological role of these ncRNAs in the cell. Furthermore, our results show that several of the ncRNAs are subjected to processing events by enzymes that remain to be identified and characterized.
This work proposes a revised annotation of CRISPR loci in P. abyssi and expands our knowledge of ncRNAs in the Thermococcales, thus providing a starting point for studies needed to elucidate their biological function.
PMCID: PMC3124441  PMID: 21668986
11.  The majority of total nuclear-encoded non-ribosomal RNA in a human cell is 'dark matter' un-annotated RNA 
BMC Biology  2010;8:149.
Discovery that the transcriptional output of the human genome is far more complex than predicted by the current set of protein-coding annotations and that most RNAs produced do not appear to encode proteins has transformed our understanding of genome complexity and suggests new paradigms of genome regulation. However, the fraction of all cellular RNA whose function we do not understand and the fraction of the genome that is utilized to produce that RNA remain controversial. This is not simply a bookkeeping issue because the degree to which this un-annotated transcription is present has important implications with respect to its biologic function and to the general architecture of genome regulation. For example, efforts to elucidate how non-coding RNAs (ncRNAs) regulate genome function will be compromised if that class of RNAs is dismissed as simply 'transcriptional noise'.
We show that the relative mass of RNA whose function and/or structure we do not understand (the so-called 'dark matter' RNAs), as a proportion of all non-ribosomal, non-mitochondrial human RNA (mt-RNA), can be greater than that of protein-encoding transcripts. This observation is obscured in studies that focus only on polyA-selected RNA, a method that enriches for protein-coding RNAs and at the same time discards the vast majority of RNA prior to analysis. We further show the presence of a large number of very long, abundantly transcribed regions (hundreds of kb) in intergenic space and that expression of these regions is associated with neoplastic transformation. These regions overlap some found previously in normal human embryonic tissues, which raises an interesting hypothesis as to the function of these ncRNAs in both early development and neoplastic transformation.
We conclude that 'dark matter' RNA can constitute the majority of non-ribosomal, non-mitochondrial-RNA and a significant fraction arises from numerous very long, intergenic transcribed regions that could be involved in neoplastic transformation.
PMCID: PMC3022773  PMID: 21176148
12.  Analytical Utility of Mass Spectral Binning in Proteomic Experiments by SPectral Immonium Ion Detection (SPIID) 
Unambiguous identification of tandem mass spectra is a cornerstone in mass-spectrometry-based proteomics. As the study of post-translational modifications (PTMs) by means of shotgun proteomics progresses in depth and coverage, the ability to correctly identify PTM-bearing peptides is essential, increasing the demand for advanced data interpretation. Several PTMs are known to generate unique fragment ions during tandem mass spectrometry, the so-called diagnostic ions, which unequivocally identify a given mass spectrum as related to a specific PTM. Although such ions offer tremendous analytical advantages, algorithms to decipher MS/MS spectra for the presence of diagnostic ions in an unbiased manner are currently lacking. Here, we present a systematic spectral-pattern-based approach for the discovery of diagnostic ions and new fragmentation mechanisms in shotgun proteomics datasets. The developed software tool is designed to analyze large sets of high-resolution peptide fragmentation spectra independent of the fragmentation method, instrument type, or protease employed. To benchmark the software tool, we analyzed large higher-energy collisional activation dissociation datasets of samples containing phosphorylation, ubiquitylation, SUMOylation, formylation, and lysine acetylation. Using the developed software tool, we were able to identify known diagnostic ions by comparing histograms of modified and unmodified peptide spectra. Because the investigated tandem mass spectra data were acquired with high mass accuracy, unambiguous interpretation and determination of the chemical composition for the majority of detected fragment ions was feasible. Collectively we present a freely available software tool that allows for comprehensive and automatic analysis of analogous product ions in tandem mass spectra and systematic mapping of fragmentation mechanisms related to common amino acids.
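The histogram comparison described above can be sketched in a few lines. This is a minimal illustration, not the SPIID implementation: the function names, bin width, thresholds, and toy peak lists are all assumptions made for the example. It bins fragment m/z values, counts how many spectra populate each bin, and reports bins that are frequent in modified-peptide spectra but rare in unmodified ones.

```python
from collections import Counter

def bin_counts(spectra, bin_width=0.01):
    """For each m/z bin, count how many spectra have at least one peak in it."""
    counts = Counter()
    for peaks in spectra:
        # A set, so a spectrum contributes at most once per bin.
        counts.update({round(mz / bin_width) for mz in peaks})
    return counts

def candidate_diagnostic_bins(modified, unmodified, bin_width=0.01,
                              min_fraction=0.5, min_ratio=3.0):
    """Return (bin m/z, fraction in modified, fraction in unmodified) for
    bins enriched in the modified set. Thresholds are arbitrary choices."""
    mod = bin_counts(modified, bin_width)
    unmod = bin_counts(unmodified, bin_width)
    floor = 1.0 / (len(unmodified) + 1)  # avoids division by zero
    hits = []
    for b, n_mod in mod.items():
        frac_mod = n_mod / len(modified)
        frac_unmod = unmod.get(b, 0) / len(unmodified)
        if frac_mod >= min_fraction and frac_mod / max(frac_unmod, floor) >= min_ratio:
            hits.append((round(b * bin_width, 4), frac_mod, frac_unmod))
    return sorted(hits)

# Toy peak lists (made-up values, m/z only). The ion near m/z 126.09 occurs
# only in the "modified" set, mimicking a diagnostic ion.
modified = [[126.0913, 147.1128, 175.1190],
            [126.0912, 147.1129],
            [126.0914, 175.1189, 201.1234]]
unmodified = [[147.1128, 175.1190],
              [147.1127, 201.1235],
              [175.1190, 147.1128]]

hits = candidate_diagnostic_bins(modified, unmodified)
print(hits)
```

On real high-resolution data the bin width would be matched to the instrument's mass accuracy, and peak intensities would likely be taken into account as well.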
PMCID: PMC4125726  PMID: 24895383
13.  Quantification of mRNA and protein and integration with protein turnover in a bacterium 
Determination of the average cellular copy number of 400 proteins under different growth conditions and integration with protein turnover and absolute mRNA levels reveals the dynamics of protein expression in the genome-reduced bacterium Mycoplasma pneumoniae.
Our study provides a fine-grained, quantitative picture in unprecedented detail of an established model organism for systems-wide studies. Our integrative approach reveals a novel, dynamic view on the processes, interactions and regulations underlying the central dogma pathway and the composition of protein complexes. Simulations using our quantitative data on mRNA, protein and turnover show how an organism copes with stochastic noise in gene expression in vivo. Our data serve as an important resource for colleagues both within our field of research and in related disciplines.
A hallmark of Systems Biology is the integration of diverse, large quantitative data sets with the aim of gaining novel insights into how biological processes work. We measured individual mRNA and protein abundances as well as protein turnover in the bacterium Mycoplasma pneumoniae. This human pathogen is an ideal model organism for organism-wide studies. It can be readily cultured under laboratory conditions and it has a very small genome with only 690 protein-coding genes. This comparably low complexity allows for the exhaustive analysis of major cellular biomolecules, avoiding constraints introduced by limitations of available analysis techniques.
Using a recently developed mass spectrometry-based approach, we determined the average cellular copy number for over 400 individual proteins under different growth and stress conditions. The 20 most abundant proteins, including Elongation factor Tu, cellular chaperones, and proteins involved in metabolizing glucose, the major energy source of M. pneumoniae, account for nearly 44% of the total cellular protein mass. We observed abundance changes of many expected and several unexpected proteins in response to cellular stress, such as heat shock, DNA damage and osmotic stress, as well as along batch culture growth over 4 days.
Integration of the protein abundance data with quantitative mRNA measurements revealed a modest correlation between these two classes of biomolecules. However, for several classical stress-induced proteins, we observed a correlated induction of mRNA and protein in response to heat shock. A focused analysis of mRNA–protein abundance dynamics during batch culture growth suggested that the regulation of gene expression is largely decoupled from protein dynamics in M. pneumoniae, indicating extensive post-transcriptional and post-translational regulation influencing the cellular mRNA–protein ratios.
To investigate the factors influencing the cellular protein abundance, we measured individual protein turnover rates by mass spectrometry using a label-chase approach involving stable isotope-labelled amino acids. The average half-life of a protein in M. pneumoniae is 23 h. Based on the measured quantitative mRNA data, the protein abundances and their half-lives, we established an ordinary differential equations model for the estimation of individual in vivo protein degradation and translation efficiency rates. We found that translation efficiency rather than protein turnover is the dominating factor influencing protein abundance. Using our abundance and turnover data, we additionally performed stochastic simulations of gene expression. We observed that long protein half-life and low translational efficiency buffer gene expression noise propagating from low cellular mRNA levels in vivo.
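The summary does not give the equations of the model, so the following is only a sketch under a stated assumption: the simplest first-order form dP/dt = k_tl * m - k_deg * P, with the degradation rate derived from the half-life as k_deg = ln 2 / t_half and dilution by cell growth ignored. The function names and the mRNA and translation-rate values are hypothetical; only the 23 h average half-life comes from the text.

```python
import math

def degradation_rate(half_life_h):
    """First-order degradation rate constant (per hour) from a half-life."""
    return math.log(2) / half_life_h

def steady_state_protein(mrna, k_tl, k_deg):
    """Protein copy number at which synthesis balances degradation."""
    return k_tl * mrna / k_deg

def simulate(mrna, k_tl, k_deg, p0=0.0, hours=200.0, dt=0.01):
    """Forward-Euler integration of dP/dt = k_tl * mrna - k_deg * P."""
    p = p0
    for _ in range(round(hours / dt)):
        p += dt * (k_tl * mrna - k_deg * p)
    return p

k_deg = degradation_rate(23.0)   # average half-life reported above
mrna = 0.5                       # hypothetical mRNA copies per cell
k_tl = 10.0                      # hypothetical proteins per mRNA per hour

p_ss = steady_state_protein(mrna, k_tl, k_deg)
p_500h = simulate(mrna, k_tl, k_deg, hours=500.0)
print(f"k_deg = {k_deg:.4f} per h, steady state = {p_ss:.1f}, after 500 h = {p_500h:.1f}")
```

In this form the steady-state abundance scales linearly with both the translation rate and the half-life, which is why the two quantities have to be measured separately to decide which one dominates.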
We compared the abundance ratios of proteins associating into complexes in vivo with their expected functional stoichiometries. We observed that for stable protein complexes, such as the GroEL/ES chaperonin or DNA gyrase, our measured abundance ratios reflected the expected subunit stoichiometries. More dynamic protein complexes, such as the DnaK/J/GrpE chaperone system or RNA polymerase, showed several unusual subunit ratios, pointing towards transient interaction of sub-stoichiometric subunits for function. A detailed, quantitative analysis of the ribosome, the largest cellular protein complex, revealed large abundance differences of the 51 subunits. This observation indicates a multi-functionality for several, abundant ribosomal proteins.
Finally, a comparison of the determined average cellular protein abundances with a different pathogenic bacterium, Leptospira interrogans, revealed that cellular protein abundances closely reflect their respective lifestyles.
Our study represents an organism-wide, quantitative analysis of cellular protein abundances. Integrating our proteomics data with determined mRNA levels and protein turnover rates reveals insights into the dynamic interplay and regulation of mRNA and proteins, the central biomolecules of a cell.
Biological function and cellular responses to environmental perturbations are regulated by a complex interplay of DNA, RNA, proteins and metabolites inside cells. To understand these central processes in living systems at the molecular level, we integrated experimentally determined abundance data for mRNA, proteins, as well as individual protein half-lives from the genome-reduced bacterium Mycoplasma pneumoniae. We provide a fine-grained, quantitative analysis of basic intracellular processes under various external conditions. Proteome composition changes in response to cellular perturbations reveal specific stress response strategies. The regulation of gene expression is largely decoupled from protein dynamics and translation efficiency has a higher regulatory impact on protein abundance than protein turnover. Stochastic simulations using in vivo data show how low translation efficiency and long protein half-lives effectively reduce biological noise in gene expression. Protein abundances are regulated in functional units, such as complexes or pathways, and reflect cellular lifestyles. Our study provides a detailed integrative analysis of average cellular protein abundances and the dynamic interplay of mRNA and proteins, the central biomolecules of a cell.
PMCID: PMC3159969  PMID: 21772259
mRNA–protein; Mycoplasma pneumoniae; protein homeostasis; protein turnover; quantitative proteomics
14.  18O Stable Isotope Labeling in MS-based Proteomics 
A variety of stable isotope labeling techniques have been developed and used in mass spectrometry (MS)-based proteomics, primarily for relative quantitation of changes in protein abundances between two compared samples, but also for qualitative characterization of differentially labeled proteomes. Differential 16O/18O coding relies on the 18O exchange that takes place at the C-terminal carboxyl group of proteolytic fragments, where two 16O atoms are typically replaced by two 18O atoms by enzyme-catalyzed oxygen-exchange in the presence of H₂¹⁸O. The resulting mass shift between differentially labeled peptide ions permits identification, characterization and quantitation of proteins from which the peptides are proteolytically generated. This review focuses on the utility of 16O/18O labeling within the context of mass spectrometry-based proteome research. Different strategies employing 16O/18O are examined in the context of global comparative proteome profiling, targeted subcellular proteomics, analysis of post-translational modifications and biomarker discovery. Also discussed are analytical issues related to this technique, including variable 18O exchange along with advantages and disadvantages of 16O/18O labeling in comparison with other isotope-coding techniques.
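The size of the mass shift follows directly from the isotope masses. The short calculation below is an illustrative sketch (the function names are made up for the example); it uses the standard monoisotopic masses of 16O and 18O and shows the shift for complete exchange of two oxygens and for incomplete exchange of one.

```python
# Monoisotopic masses in unified atomic mass units (Da).
MASS_16O = 15.99491462
MASS_18O = 17.99915961

def label_mass_shift(n_exchanged=2):
    """Mass added to a peptide when n_exchanged 16O atoms are replaced by 18O."""
    return n_exchanged * (MASS_18O - MASS_16O)

def mz_shift(charge, n_exchanged=2):
    """Spacing on the m/z axis between labeled and unlabeled ions of a given charge."""
    return label_mass_shift(n_exchanged) / charge

for n in (1, 2):
    for z in (1, 2, 3):
        print(f"{n} x 18O, charge {z}+: m/z shift = {mz_shift(z, n):.4f}")
```

Full exchange therefore separates the labeled and unlabeled peptide by about 4 Da, while a single exchange gives only about 2 Da, so a partially labeled peptide overlaps the natural isotope envelope of the unlabeled form. This is one reason variable 18O exchange complicates quantitation.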
PMCID: PMC2722262  PMID: 19151093
18O labeling; enzyme-mediated isotope incorporation; stable isotope labeling; MS-based proteomics; relative protein quantitation; LC/MS/MS
15.  The Coding and Noncoding Architecture of the Caulobacter crescentus Genome 
PLoS Genetics  2014;10(7):e1004463.
Caulobacter crescentus undergoes an asymmetric cell division controlled by a genetic circuit that cycles in space and time. We provide a universal strategy for defining the coding potential of bacterial genomes by applying ribosome profiling, RNA-seq, global 5′-RACE, and liquid chromatography coupled with tandem mass spectrometry (LC-MS) data to the 4-megabase C. crescentus genome. We mapped transcript units at single base-pair resolution using RNA-seq together with global 5′-RACE. Additionally, using ribosome profiling and LC-MS, we mapped translation start sites and coding regions with near complete coverage. We found most start codons lacked corresponding Shine-Dalgarno sites although ribosomes were observed to pause at internal Shine-Dalgarno sites within the coding DNA sequence (CDS). These data suggest a more prevalent use of the Shine-Dalgarno sequence for ribosome pausing rather than translation initiation in C. crescentus. Overall 19% of the transcribed and translated genomic elements were newly identified or significantly improved by this approach, providing a valuable genomic resource to elucidate the complete C. crescentus genetic circuitry that controls asymmetric cell division.
Author Summary
Caulobacter crescentus is a model system for studying asymmetric cell division, a fundamental process that, through differential gene expression in the two daughter cells, enables the generation of cells with different fates. To explore how the genome directs and maintains asymmetry upon cell division, we performed a coordinated analysis of multiple genomic and proteomic datasets to identify the RNA and protein coding features in the C. crescentus genome. Our integrated analysis identifies many new genetic regulatory elements, adding significant regulatory complexity to the C. crescentus genome. Surprisingly, 75.4% of protein coding genes lack a canonical translation initiation sequence motif (the Shine-Dalgarno site) which hybridizes to the 3′ end of the ribosomal RNA allowing translation initiation. We find Shine-Dalgarno sites primarily inside of genes where they cause translating ribosomes to pause, possibly allowing nascent proteins to correctly fold. With our detailed map of genomic transcription and translation elements, a systems view of the genetic network that controls asymmetric cell division is within reach.
PMCID: PMC4117421  PMID: 25078267
16.  Computational prediction of novel non-coding RNAs in Arabidopsis thaliana 
BMC Bioinformatics  2009;10(Suppl 1):S36.
Non-coding RNA (ncRNA) genes do not encode proteins but produce functional RNA molecules that play crucial roles in many key biological processes. Recent genome-wide transcriptional profiling studies using tiling arrays in organisms such as human and Arabidopsis have revealed a great number of transcripts, a large portion of which have little or no capability to encode proteins. This unexpected finding suggests that the currently known repertoire of ncRNAs may only represent a small fraction of ncRNAs of the organisms. Thus, efficient and effective prediction of ncRNAs has become an important task in bioinformatics in recent years. Among the available computational methods, the comparative genomic approach seems to be the most powerful to detect ncRNAs. The recent completion of the sequencing of several major plant genomes has made the approach possible for plants.
We have developed a pipeline to predict novel ncRNAs in the Arabidopsis (Arabidopsis thaliana) genome. It starts by comparing the expressed intergenic regions of Arabidopsis as provided in two whole-genome high-density oligo-probe arrays from the literature with the intergenic nucleotide sequences of all completely sequenced plant genomes including rice (Oryza sativa), poplar (Populus trichocarpa), grape (Vitis vinifera), and papaya (Carica papaya). By using multiple sequence alignment, a popular ncRNA prediction program (RNAz), wet-bench experimental validation, protein-coding potential analysis, and stringent screening against various ncRNA databases, the pipeline resulted in 16 families of novel ncRNAs (with a total of 21 ncRNAs).
In this paper, we undertake a genome-wide search for novel ncRNAs in the genome of Arabidopsis by a comparative genomics approach. The identified novel ncRNAs are evolutionarily conserved between Arabidopsis and other recently sequenced plants, and may conduct interesting novel biological functions.
PMCID: PMC2648795  PMID: 19208137
17.  Empirical Bayes Analysis of Quantitative Proteomics Experiments 
PLoS ONE  2009;4(10):e7454.
Advances in mass spectrometry-based proteomics have enabled the incorporation of proteomic data into systems approaches to biology. However, development of analytical methods has lagged behind. Here we describe an empirical Bayes framework for quantitative proteomics data analysis. The method provides a statistical description of each experiment, including the number of proteins that differ in abundance between 2 samples, the experiment's statistical power to detect them, and the false-positive probability of each protein.
Methodology/Principal Findings
We analyzed 2 types of mass spectrometric experiments. First, we showed that the method identified the protein targets of small-molecules in affinity purification experiments with high precision. Second, we re-analyzed a mass spectrometric data set designed to identify proteins regulated by microRNAs. Our results were supported by sequence analysis of the 3′ UTR regions of predicted target genes, and we found that the previously reported conclusion that a large fraction of the proteome is regulated by microRNAs was not supported by our statistical analysis of the data.
Our results highlight the importance of rigorous statistical analysis of proteomic data, and the method described here provides a statistical framework to robustly and reliably interpret such data.
PMCID: PMC2759080  PMID: 19829701
18.  Stanford University Mass Spectrometry 
Journal of Biomolecular Techniques : JBT  2010;21(3 Suppl):S70-S71.
Stanford University Mass Spectrometry (SUMS) is Stanford University's central core facility for mass spectrometry-based analysis. SUMS wears several hats, as the Vincent Coates Foundation Mass Spectrometry Laboratory named in honor of a generous gift from Vincent and Stella Coates; a Stanford Bio-X core facility, embodying the Bio-X spirit of interdisciplinary communication and collaboration; and the Proteomics Shared Resource of the Stanford Comprehensive Cancer Center. The laboratory's expertise and support are available to researchers throughout Stanford University, Stanford Medical Center, and beyond. SUMS users have broad analytical needs and interests, ranging from general qualitative analysis to targeted quantitative assays, and proteomics to metabolomics. A total of 11 mass spectrometers interfaced with analytical- and capillary-scale HPLC and UPLC, as well as GC front ends, support these research projects. Single quad GC-MS and LC-MS instruments are operated as open access systems, available 24/7 to trained users. Projects are run by staff scientists on one or more of single quad, ion trap, triple quad, Q-Tof, hybrid Orbitrap, and benchtop Orbitrap instruments, matching the requirements of the projects to the strengths of the instrumentation. The expertise and enthusiasm of the SUMS staff are the bedrock of the laboratory. In addition to making available state-of-the-art, user-friendly facilities and services, SUMS enables education, method development, and new applications development, designed to meet the rapidly evolving needs of researchers.
PMCID: PMC2918057
19.  OmicsHub Proteomics Software Tool 
OmicsHub Proteomics integrates all the steps of a mass spectrometry experiment in a single platform, reducing time and data management complexity. The proteomics data automation and data management/analysis provided by OmicsHub Proteomics solve the typical problems your lab members encounter on a daily basis and make life easier when performing tasks such as multiple search engine support, pathways integration or custom report generation for external customers. OmicsHub has been designed as a central data management system to collect, analyze and annotate proteomics experimental data, enabling users to automate tasks. OmicsHub Proteomics helps laboratories to easily meet proteomics standards such as PRIDE or FuGE and works with controlled vocabulary experiment annotation. The software enables your lab members to take greater advantage of the unique capabilities of the Mascot and Phenyx search engines for protein identification. Multiple searches can be launched at once, allowing peak list data from several spots or chromatograms to be sent concurrently to Mascot/Phenyx. OmicsHub Proteomics works for both LC and gel workflows. The system allows users to store and compare proteomics data generated from different mass spectrometry instruments in a single platform instead of having specific software for each of them. It is a web application that installs on a single server and requires only a web browser for access. All experimental actions are user-stamped and date-stamped, allowing audit tracking of every action performed in OmicsHub. Some of the main features of OmicsHub Proteomics are protein identification, biological annotation, report customization, PRIDE standard support, pathways integration, grouping of protein results to remove redundancy, peak filtering and FDR cutoff for decoy databases. OmicsHub Proteomics is flexible enough to allow parsers for new file formats to be added easily, and it is offered with a competitively priced perpetual license.
PMCID: PMC2918172
20.  The quantitative proteomes of human-induced pluripotent stem cells and embryonic stem cells 
An in-depth proteomic comparison of human-induced pluripotent stem cells, and their parent fibroblast cells, with embryonic stem cells shows that the reprogramming process comprehensively remodels protein expression levels, creating cells that closely resemble natural stem cells.
We present here a large proteomic characterization of human embryonic stem cells, human-induced pluripotent stem cells and their parental fibroblast cell lines. Overall, 97.8% of the 2683 quantified proteins in four experiments showed no significant differences in abundance between hESC and hiPSC, highlighting the high similarity of these pluripotent cell lines. In total, 58 proteins were found significantly differentially expressed between hiPSCs and hESCs. The observed low overlap of these proteins with previous transcriptomic studies suggests that those differences do not reflect a recurrent molecular signature.
Human embryonic stem cells (hESCs) are capable of self-renewal and multi-lineage differentiation. However, the use of hESCs for clinical treatment entails ethical issues as they are derived from human embryos. Recently, reprogramming of somatic cells to an embryonic stem cell-like state, named induced pluripotent stem cells (iPSCs), was achieved through ectopic expression of defined factors. In addition to their clinical potential, hiPSCs represent a unique tool to develop cellular models for human diseases as well. Although current functional assays (e.g., tetraploid complementation) have confirmed the pluripotency of hiPSCs, there might still be significant differences (e.g., differentiation potential) when compared with their natural hESC counterparts. Consequently, an extensive molecular characterization to address differences and similarities between these two pluripotent cell lines seems to be a prerequisite before any clinical application is conducted. Although great efforts, mainly at the genomic level, have been made to address how similar hESCs and hiPSCs are, the definite answer to this fundamental question is currently still debated. Direct assessment of protein levels has yet to be incorporated into these integrative systems-level analyses. Protein levels are tuned by intricate mechanisms of gene expression regulation and it has recently been documented that mRNA and protein levels poorly correlate in mouse ESCs. Here, we use in-depth quantitative proteomics to gain insights into the differences and similarities in the protein content of two hiPS cell lines, their precursor IMR90 and 4Skin fibroblast cell lines and one hES cell line, providing novel molecular signatures that may assist in filling a gap in the understanding of pluripotency.
To study the degree of similarity, at the protein level, between hiPSCs and hESCs, four MS-based proteomic experiments were designed that use our in-house developed triplex dimethyl labeling chemistry followed by extensive fractionation by strong cation exchange (SCX) chromatography to reduce the sample complexity. High-resolution LC-MS/MS with dedicated fragmentation schemes (i.e., electron transfer dissociation, collision-induced dissociation and higher-energy collision dissociation) was subsequently used to maximize peptide identification rates. A total of 348 LC-MS/MS analyses (including technical and biological replicates) were performed. We confidently identified 1 593 446 peptide spectrum matches (peptide FDR<1%) corresponding to 10 628 unique protein groups (protein FDR∼4%). Using the extracted ion chromatograms, we also estimated the absolute abundance of the proteins within the samples spanning six orders of magnitude. To the best of our knowledge, the coverage obtained in this study represents the largest achieved by any proteomics screen on pluripotent cells.
Most importantly, our results indicate that the reprogramming process remodeled the proteome of both fibroblast cell lines to a profile that closely resembles the pluripotent hESCs proteome: 97.8% of the quantified proteins (2638 proteins in all four experiments) showed nonsignificant changes. Nevertheless, a small fraction of 58 proteins, mainly related to metabolism, antigen processing and cell adhesion, was found significantly regulated between hiPSCs and hESCs. A comparison of the regulated proteins to previously published transcriptomic studies showed a low overlap, highlighting the emerging notion that differences between both pluripotent cell lines rather reflect experimental conditions than a recurrent molecular signature. On the other side, the inclusion of the two parental fibroblast cell lines in our analysis allowed us to study changes in the proteome at both the starting and end points of the reprogramming process. As expected, the vast majority of the proteins (73.4%) showed differential expression between the parental fibroblasts and the reprogrammed pluripotent cells.
To find out if the differences observed in our study were a consequence of transcriptional or translational regulation, we performed paired genome-wide gene expression analyses on the same six samples that were used for the proteomic profiling. Overall, we observed a good correlation between mRNA and protein levels (r∼0.7). These results further authenticated the proteomic measurements and implied a high degree of control at the transcriptional level. Nevertheless, numerous genes were found to be uncorrelated, highlighting the necessity of complementing transcriptomic-based approaches with proteomics.
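A correlation of this kind can be computed as follows. The sketch assumes a Pearson coefficient on log10-transformed abundances, which the summary does not specify; the function names and the four sample values are hypothetical and are not data from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def log_pearson(mrna_levels, protein_levels):
    """Correlate abundances on a log10 scale, where both span orders of magnitude."""
    return pearson_r([math.log10(m) for m in mrna_levels],
                     [math.log10(p) for p in protein_levels])

# Made-up abundances for four genes (arbitrary units).
mrna = [10, 100, 1000, 10000]
protein = [2e3, 1.5e4, 3e5, 1e6]
r = log_pearson(mrna, protein)
print(f"r = {r:.3f}")
```

Genes whose protein level deviates strongly from the trend can then be picked out by their residuals, which is one way to find the uncorrelated cases mentioned above.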
Assessing relevant molecular differences between human-induced pluripotent stem cells (hiPSCs) and human embryonic stem cells (hESCs) is important, given that such differences may impact their potential therapeutic use. Controversy surrounds recent gene expression studies comparing hiPSCs and hESCs. Here, we present an in-depth quantitative mass spectrometry-based analysis of hESCs, two different hiPSCs and their precursor fibroblast cell lines. Our comparisons confirmed the high similarity of hESCs and hiPSCs at the proteome level as 97.8% of the proteins were found unchanged. Nevertheless, a small group of 58 proteins, mainly related to metabolism, antigen processing and cell adhesion, was found significantly differentially expressed between hiPSCs and hESCs. A comparison of the regulated proteins with previously published transcriptomic studies showed a low overlap, highlighting the emerging notion that differences between both pluripotent cell lines rather reflect experimental conditions than a recurrent molecular signature.
PMCID: PMC3261715  PMID: 22108792
human embryonic stem cells; human-induced pluripotent stem cells; proteomics; quantitation
21.  A proteomic chronology of gene expression through the cell cycle in human myeloid leukemia cells 
eLife  2014;3:e01630.
Technological advances have enabled the analysis of cellular protein and RNA levels with unprecedented depth and sensitivity, allowing for an unbiased re-evaluation of gene regulation during fundamental biological processes. Here, we have chronicled the dynamics of protein and mRNA expression levels across a minimally perturbed cell cycle in human myeloid leukemia cells using centrifugal elutriation combined with mass spectrometry-based proteomics and RNA-Seq, avoiding artificial synchronization procedures. We identify myeloid-specific gene expression and variations in protein abundance, isoform expression and phosphorylation at different cell cycle stages. We dissect the relationship between protein and mRNA levels for both bulk gene expression and for over 6000 genes individually across the cell cycle, revealing complex, gene-specific patterns. This data set, one of the deepest surveys to date of gene expression in human cells, is presented in an online, searchable database, the Encyclopedia of Proteome Dynamics (
eLife digest
Cells are complex environments: at any one time, thousands of different genes act as molecular templates to produce messenger RNA (mRNA) molecules, which themselves are templates used to produce proteins. However, not all genes are active at all times inside all cells: as cells grow and divide as part of the cell division cycle, genes are switched on and off on a regular basis. Similarly, the patterns of mRNA and protein production are different in, say, immune and skin cells.
In recent years, the tools available for detecting mRNA molecules and proteins have become more powerful, allowing researchers to move beyond just measuring the total amounts of mRNA and protein in the cell to now measuring individual amounts of specific mRNA and protein molecules encoded by specific genes. However, it has been a challenge to make these measurements at different stages of the cell cycle. Most of the methods used to do this have involved artificially ‘arresting’ the cell cycle, which can lead to side effects that are difficult to account for.
Ly et al. have now overcome these problems using a combination of three methods to measure the levels of mRNA and protein molecules associated with over 6000 genes in human cancer cells derived from myeloid leukemia. Exploiting the fact that cells change size during the cell cycle, Ly et al. used a centrifugation technique to separate cells based on their size and, therefore, the stage of the cell cycle they were at, thus avoiding the need to arrest the cell cycle. An approach called RNA-Seq was then employed to measure the levels of the different mRNA molecules in the cells, and a device called a mass spectrometer was used to identify and measure the levels of many different proteins.
In addition to being able to follow the level of mRNA and protein production for a large number of genes throughout the cell division cycle, while also obtaining detailed information about how many of the proteins are modified, Ly et al. discovered that—contrary to expectations—low numbers of mRNA molecules were sometimes associated with high numbers of the corresponding protein, and vice versa. This work provides a better understanding of the complex relationship between the levels of an mRNA and its corresponding protein product, and also demonstrates how it may be possible to detect subtle but important differences between cell types and disease states, including different types of cancer.
PMCID: PMC3936288  PMID: 24596151
proteomics; mass spectrometry; RNA-Seq; cell cycle; transcriptomics; human
22.  A multidimensional platform for the purification of non-coding RNA species 
Nucleic Acids Research  2013;41(17):e168.
A renewed interest in non-coding RNA (ncRNA) has led to the discovery of novel RNA species and post-transcriptional ribonucleoside modifications, and an emerging appreciation for the role of ncRNA in RNA epigenetics. Although much can be learned by amplification-based analysis of ncRNA sequence and quantity, there is a significant need for direct analysis of RNA, which has led to numerous methods for purification of specific ncRNA molecules. However, no single method allows purification of the full range of cellular ncRNA species. To this end, we developed a multidimensional chromatographic platform to resolve, isolate and quantify all canonical ncRNAs in a single sample of cells or tissue, as well as novel ncRNA species. The applicability of the platform is demonstrated in analyses of ncRNA from bacteria, human cells and plasmodium-infected reticulocytes, as well as a viral RNA genome. Among the many potential applications of this platform are a system-level analysis of the dozens of modified ribonucleosides in ncRNA, characterization of novel long ncRNA species, enhanced detection of rare transcript variants and analysis of viral genomes.
PMCID: PMC3783195  PMID: 23907385
23.  Experimental RNomics and genomic comparative analysis reveal a large group of species-specific small non-message RNAs in the silkworm Bombyx mori 
Nucleic Acids Research  2011;39(9):3792-3805.
Accumulating evidence shows that small non-protein coding RNAs (ncRNAs) play important roles in development, stress response and other cellular processes. The silkworm is an important model for studies on insect genetics and control of lepidopterous pests. Here, we have performed the first systematic identification and analysis of intermediate size ncRNAs (50–500 nt) in the silkworm. We identified 189 novel ncRNAs, including 141 snoRNAs, six snRNAs, three tRNAs, one SRP and 38 unclassified ncRNAs. Forty ncRNAs showed significantly altered expression during silkworm development or across specific stage transitions. Genomic comparisons revealed that 123 of these ncRNAs are potentially silkworm-specific. Analysis of the genomic organization of the ncRNA loci showed that 32.62% of the novel snoRNA loci are intergenic, and that all the intronic snoRNAs follow the pattern of one-snoRNA-per-intron. Target site analysis predicted a total of 95 2′-O-methylation and pseudouridylation modification sites of rRNAs, snRNAs and tRNAs. Together, these findings provide new clues for future functional study of ncRNA during insect development and evolution.
PMCID: PMC3089462  PMID: 21227919
24.  The phosphoproteome of toll-like receptor-activated macrophages 
First global and quantitative analysis of phosphorylation cascades induced by toll-like receptor (TLR) stimulation in macrophages identifies nearly 7000 phosphorylation sites and shows extensive and dynamic up-regulation and down-regulation after lipopolysaccharide (LPS). In addition to the canonical TLR-associated pathways, mining of the phosphorylation data suggests an involvement of ATM/ATR kinases in signalling and shows that the cytoskeleton is a hotspot of TLR-induced phosphorylation. Intersecting transcription factor phosphorylation with bioinformatic promoter analysis of genes induced by LPS identified several candidate transcriptional regulators that were previously not implicated in TLR-induced transcriptional control.
Toll-like receptors (TLR) are a family of pattern recognition receptors that enable innate immune cells to sense infectious danger. Recognition of microbial structures, such as lipopolysaccharide (LPS) of Gram-negative bacteria by TLR4, causes substantial re-programming of macrophage gene expression within hours, including up-regulation of chemokines driving inflammation, anti-microbial effector molecules and cytokines directing adaptive immune responses. TLR signalling is initiated by the adapter protein Myd88 and leads to the activation of kinase cascades that result in activation of the MAPK and NFkB pathways. Phosphorylation has an essential role in these early steps of TLR signalling, and in addition regulates critical transcription factors (TFs). Although TLR signalling has been extensively studied, a comprehensive analysis of phosphorylation events in TLR-activated macrophages is lacking. It is therefore unknown whether the canonical MAPK and NFkB pathways comprise the main phosphorylation events and which other molecular functions and processes are regulated by phosphorylation after stimulation with LPS.
Recent progress in mass spectrometry-based proteomics has opened the possibility to quantitatively investigate global changes in protein abundance and post-translational modifications. Stable isotope labelling with amino acids in cell culture (SILAC) allows highly accurate quantification, and has proved especially useful for direct comparison of phosphopeptide abundance in time-course or treatment analyses.
Here, we adapted SILAC to primary mouse macrophages, and performed a global, quantitative and kinetic analysis of the macrophage phosphoproteome after LPS stimulation. Bioinformatic analyses were used to identify kinases, pathways and biological processes enriched in the LPS-regulated phosphoproteome. To connect TF phosphorylation with transcription, we generated a parallel dataset of nascent RNA and used in silico promoter analysis to identify transcriptional regulators with binding site enrichment among the LPS-regulated gene set.
After establishing SILAC conditions for efficient labelling of primary bone marrow-derived macrophages, 1850 phosphoproteins with a total of 6956 phosphorylation sites were reproducibly identified in two independent experiments. Phosphoproteins were detected from all cellular compartments, with a clear enrichment for nuclear and cytoskeleton-associated proteins. LPS caused major regulation of a large fraction of phosphopeptides, with 24% of all sites up-regulated and 9% down-regulated after stimulation (Figure 3A and B). These changes were highly dynamic, as the majority of the regulated phosphopeptides were up-regulated or down-regulated transiently or in a delayed manner (Figure 3C). Overall, the extent of changes in the phosphoproteome was comparable to the transcriptional re-programming, underscoring the importance of phosphorylation cascades in TLR signalling. Our parallel transcriptome data also showed that widespread phosphorylation precedes massive transcriptional changes.
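To give a sense of scale, the reported percentages can be converted into approximate site counts. The sketch below assumes that both percentages refer to all 6956 identified sites, as the text states; because the percentages are rounded, the resulting counts are approximations and not figures reported by the study. All variable and function names are illustrative.

```python
# Approximate site counts implied by the rounded percentages in the text.
# Assumption: both percentages are fractions of all 6956 identified sites.
TOTAL_SITES = 6956
FRACTION_UP = 0.24    # reported share of up-regulated sites
FRACTION_DOWN = 0.09  # reported share of down-regulated sites


def approx_count(total: int, fraction: float) -> int:
    """Return the nearest whole number of sites for a given fraction."""
    return round(total * fraction)


up_sites = approx_count(TOTAL_SITES, FRACTION_UP)
down_sites = approx_count(TOTAL_SITES, FRACTION_DOWN)
unregulated_sites = TOTAL_SITES - up_sites - down_sites

print(f"up-regulated:   ~{up_sites}")
print(f"down-regulated: ~{down_sites}")
print(f"not regulated:  ~{unregulated_sites}")
```

Roughly one third of all identified sites therefore changed after LPS stimulation, which is the basis for describing the regulation as major.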
To obtain footprints of kinase activation in response to TLR ligation, we searched phosphopeptide sequences for known linear sequence motifs of 33 kinases and identified kinase motifs enriched among LPS-regulated phosphorylation sites (compared to non-regulated phosphorylation sites) (Table I). Motif ERK/MAPK was highly enriched, in accordance with the essential role of the MAPK module in TLR signalling. Other kinases with motif enrichment have also recently been linked to TLR signalling (e.g. PKD; AKT and its targets GSK3 and mTOR). However, the DNA damage-activated kinases ATM/ATR and the cell cycle-associated kinases AURORA and CHK1/2 have not been associated with the macrophage response to TLR activation yet. These findings shed new light on older data on the effect of TLR on macrophage proliferation in response to macrophage colony stimulating factor. Of interest, in follow-up experiments using pharmacological inhibitors of the kinases with motif enrichment, we observed that inhibition of ATM kinase activity caused increased LPS-induced expression of several cytokines and chemokines, suggesting that this pathway regulates inflammatory responses.
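A motif enrichment comparison of this kind (regulated versus non-regulated sites) is commonly scored with a one-sided hypergeometric test. The summary does not state which statistic the authors used, so the following is only a minimal sketch of one standard choice; the function name and the toy counts are hypothetical and are not taken from the study.

```python
from math import comb


def motif_enrichment_p(k: int, n_regulated: int, n_motif: int, n_total: int) -> float:
    """One-sided hypergeometric p-value P(X >= k).

    n_total     -- all phosphorylation sites considered
    n_motif     -- sites matching a given kinase motif
    n_regulated -- sites classified as LPS-regulated
    k           -- regulated sites that also match the motif
    """
    upper = min(n_motif, n_regulated)
    denominator = comb(n_total, n_regulated)
    # Sum the probabilities of observing k or more motif sites by chance.
    tail = sum(
        comb(n_motif, i) * comb(n_total - n_motif, n_regulated - i)
        for i in range(k, upper + 1)
    )
    return tail / denominator


# Hypothetical toy counts, chosen only to illustrate the call.
p_value = motif_enrichment_p(k=50, n_regulated=300, n_motif=100, n_total=1000)
print(f"p = {p_value:.3g}")
```

When many motifs are tested, as with the 33 kinases here, the resulting p-values would normally also be corrected for multiple testing.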
In further bioinformatic analyses, the Gene Ontology and signalling pathway annotations of phosphoproteins were used to identify signalling pathways and cellular processes targeted by TLR4-controlled phosphorylation (Table II). Among the expected hits, based on the known TLR pathways, were TLR signalling, MAPK and AKT as well as mTOR signalling. Of interest, the annotation terms ‘Rho GTPase cycle’ and ‘cytoskeleton’ were significantly enriched among LPS-regulated phosphoproteins, indicating a more prominent role for cytoskeletal proteins in the transduction of TLR signals or in the biological response to it.
We were especially interested in the phosphorylation of TFs and its regulation by LPS (Figure 6A). We hypothesised that functionally important TFs should have an increased frequency of binding sites in the promoters of LPS-regulated genes (Figure 6B). To identify transcriptionally regulated genes with high sensitivity, we isolated nascent RNA after metabolic labelling (Figure 6C–E). In silico promoter scanning using Genomatix software for binding sites for all 50 TF families with phosphorylated members was used to test for enrichment in transcriptionally induced genes (Figure 6F). At the early time point, binding site enrichment for the canonical TLR-associated TF NFkB was detected, and in addition we found that several other TF families with an established role in the transcription of individual LPS-target genes showed binding site enrichment (CEBP, MEF2, NFAT and HEAT). In addition, enrichment for OCT and HOXC binding sites at the early time point and SORY matrices later after stimulation indicated an involvement of the phosphorylated members of the respective TF families in the execution of TLR-induced transcriptional responses. An initial test of the function for a few of these candidate transcriptional regulators was performed using siRNA knockdown in primary macrophages. These experiments suggested that knockdown of the SORY-binding phosphoprotein Capicua homolog (Cic) and, to a lesser extent, of the CREB family member Atf7 selectively attenuates LPS-induced expression of Il1a and Il1b.
In summary, this study provides a novel and global perspective on innate immune activation by TLR signalling (Figure 5). We quantitatively detected a large number of previously unknown site-specific phosphorylation events, which are now publicly available through the Phosida database. By combining different data mining approaches, we consistently identified canonical and newly implicated TLR-activated signalling modules. In particular, the PI3K/AKT and the related mTOR pathway were highlighted; furthermore, DNA damage–response associated ATM/ATR kinases and the cytoskeleton emerged as unexpected hotspots for phosphorylation. Finally, weaving together corresponding phosphoproteome and nascent transcriptome datasets through the loom of in silico promoter analysis, we identified TFs with a likely role in mediating TLR-induced gene expression programmes.
Recognition of microbial danger signals by toll-like receptors (TLR) causes re-programming of macrophages. To investigate kinase cascades triggered by the TLR4 ligand lipopolysaccharide (LPS) at the systems level, we performed a global, quantitative and kinetic analysis of the phosphoproteome of primary macrophages using stable isotope labelling with amino acids in cell culture, phosphopeptide enrichment and high-resolution mass spectrometry. In parallel, nascent RNA was profiled to link transcription factor (TF) phosphorylation to TLR4-induced transcriptional activation. We reproducibly identified 1850 phosphoproteins with 6956 phosphorylation sites, two thirds of which were not reported earlier. LPS caused major dynamic changes in the phosphoproteome (24% up-regulation and 9% down-regulation). Functional bioinformatic analyses confirmed canonical players of the TLR pathway and highlighted other signalling modules (e.g. mTOR, ATM/ATR kinases) and the cytoskeleton as hotspots of LPS-regulated phosphorylation. Finally, weaving together phosphoproteome and nascent transcriptome data by in silico promoter analysis, we implicated several phosphorylated TFs in primary LPS-controlled gene expression.
PMCID: PMC2913394  PMID: 20531401
macrophage; nascent RNA; phosphoproteome; SILAC; toll-like receptors
25.  Proteomic approaches to characterize protein modifications: new tools to study the effects of environmental exposures. 
Environmental Health Perspectives  2002;110(Suppl 1):3-9.
Proteomics is the study of proteomes, which are the collections of proteins expressed in cells. Whereas genomes are essentially invariant in different cells in an organism, proteomes vary from cell to cell, with time and as a function of environmental stimuli and stress. The integration of new mass spectrometry (MS) methods, data analysis algorithms, and information from databases of protein and gene sequences has enabled the characterization of proteomes. Many environmental agents directly or indirectly generate reactive electrophiles that covalently modify proteins. Although considerable evidence supports a key role for protein adducts in adverse effects of chemicals, limitations in analytical technology have slowed progress in this area. New applications of liquid chromatography-tandem mass spectrometry (LC-MS-MS) now offer the potential to identify protein targets of reactive electrophiles and to map adducts at the level of amino acid sequence. Use of the data-analysis tools Sequest and SALSA (Scoring Algorithm for Spectral Analysis) together with LC-MS-MS analyses of protein digests enables the identification of modified forms of proteins in a sample. These approaches can map adducts to specific amino acids in protein targets and are being adapted to searches for protein adducts in complex proteomes. These tools will facilitate the identification of new biomarkers of chemical exposure and studies of mechanisms by which protein modifications contribute to the adverse effects of environmental exposures.
PMCID: PMC1241143  PMID: 11834459

