26.  It’s the machine that matters: Predicting gene function and phenotype from protein networks 
Journal of proteomics  2010;73(11):2277-2289.
Increasing knowledge about the organization of proteins into complexes, systems, and pathways has led to a flowering of theoretical approaches for exploiting this knowledge in order to better learn the functions of proteins and their roles underlying phenotypic traits and diseases. Much of this body of theory has been developed and tested in model organisms, relying on their relative simplicity and genetic and biochemical tractability to accelerate the research. In this review, we discuss several of the major approaches for computationally integrating proteomics and genomics observations into integrated protein networks, then applying guilt-by-association in these networks in order to identify genes underlying traits. Recent trends in this field include a rising appreciation of the modular network organization of proteins underlying traits or mutational phenotypes, and how to exploit such protein modularity using computational approaches related to the internet search algorithm PageRank. Many protein network-based predictions have recently been experimentally confirmed in yeast, worms, plants, and mice, and several successful approaches in model organisms have been directly translated to analyze human disease, with notable recent applications to glioma and breast cancer prognosis.
doi:10.1016/j.jprot.2010.07.005
PMCID: PMC2953423  PMID: 20637909
Data integration; Function prediction; Humans; Model organisms; Phenotype prediction; Protein interaction networks
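The network guilt-by-association idea summarized in the abstract above can be illustrated with a personalized PageRank (random walk with restart) over a gene network. This is only a minimal sketch under stated assumptions, not the method of any specific study covered by the review: the function name, the toy edge list, the damping factor of 0.85, and the fixed iteration count are all hypothetical choices, and every seed gene is assumed to appear in the network.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=100):
    """Score genes by network proximity to a set of seed genes.

    edges: iterable of (gene_a, gene_b) pairs, treated as undirected links.
    seeds: set of genes already associated with the trait of interest
           (assumed to be present in the network).
    """
    # Build an undirected adjacency map.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    genes = sorted(neighbors)

    # The walker restarts uniformly at the seed genes.
    restart = {g: (1.0 / len(seeds) if g in seeds else 0.0) for g in genes}
    score = dict(restart)

    for _ in range(iterations):
        updated = {}
        for g in genes:
            # Each neighbor passes on an equal share of its current score.
            inflow = sum(score[n] / len(neighbors[n]) for n in neighbors[g])
            updated[g] = (1 - damping) * restart[g] + damping * inflow
        score = updated
    return score


# Toy example (hypothetical gene names): a chain A - B - C - D - E, seeded at A.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
toy_scores = personalized_pagerank(toy_edges, seeds={"A"})
candidates = sorted((g for g in toy_scores if g != "A"),
                    key=toy_scores.get, reverse=True)
print(candidates)
```

Genes that are not seeds but receive high scores are the candidates one would prioritize for follow-up; in this toy chain the score falls off with distance from the seed.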
27.  Defining the pathway of cytoplasmic maturation of the 60S ribosomal subunit 
Molecular cell  2010;39(2):196-208.
In eukaryotic cells the final maturation of ribosomes occurs in the cytoplasm, where trans-acting factors are removed and critical ribosomal proteins are added for functionality. Here, we have carried out a comprehensive analysis of cytoplasmic maturation, ordering the known steps into a coherent pathway. Maturation is initiated by the ATPase Drg1. Downstream, assembly of the ribosome stalk is essential for the release of Tif6. The stalk recruits GTPases during translation. Because the GTPase Efl1, which is required for the release of Tif6, resembles the translation elongation factor eEF2, we suggest that assembly of the stalk recruits Efl1, triggering a step in 60S biogenesis that mimics aspects of translocation. Efl1 could thereby provide a mechanism to functionally check the nascent subunit. Finally, the release of Tif6 is a prerequisite for the release of the nuclear export adapter Nmd3. Establishing this pathway provides an important conceptual framework for understanding ribosome maturation.
doi:10.1016/j.molcel.2010.06.018
PMCID: PMC2925414  PMID: 20670889
ribosome; ribosome biogenesis; EFL1; NMD3; TIF6
28.  MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines 
Journal of proteome research  2011;10(7):2949-2958.
Shotgun proteomics using mass spectrometry is a powerful method for protein identification but suffers from limited sensitivity in complex samples. Integrating peptide identifications from multiple database search engines is a promising strategy to increase the number of peptide identifications and reduce the volume of unassigned tandem mass spectra. Existing methods pool statistical significance scores such as p-values or posterior probabilities of peptide-spectrum matches (PSMs) from multiple search engines after high scoring peptides have been assigned to spectra, but these methods lack reliable control of identification error rates as data are integrated from different search engines. We developed a statistically coherent method for integrative analysis, termed MSblender. MSblender converts raw search scores from search engines into a probability score for all possible PSMs and properly accounts for the correlation between search scores. The method reliably estimates false discovery rates and identifies more PSMs than any single search engine at the same false discovery rate. Increased identifications increment spectral counts for all detected proteins and allow quantification of proteins that would not have been quantified by individual search engines. We also demonstrate that enhanced quantification contributes to improved sensitivity in differential expression analyses.
doi:10.1021/pr2002116
PMCID: PMC3128686  PMID: 21488652
integrative analysis; database search; peptide identification
29.  Characterising and Predicting Haploinsufficiency in the Human Genome 
PLoS Genetics  2010;6(10):e1001154.
Haploinsufficiency, wherein a single functional copy of a gene is insufficient to maintain normal function, is a major cause of dominant disease. Human disease studies have identified several hundred haploinsufficient (HI) genes. We have compiled a map of 1,079 haplosufficient (HS) genes by systematic identification of genes unambiguously and repeatedly compromised by copy number variation among 8,458 apparently healthy individuals and contrasted the genomic, evolutionary, functional, and network properties between these HS genes and known HI genes. We found that HI genes are typically longer and have more conserved coding sequences and promoters than HS genes. HI genes exhibit higher levels of expression during early development and greater tissue specificity. Moreover, within a probabilistic human functional interaction network HI genes have more interaction partners and greater network proximity to other known HI genes. We built a predictive model on the basis of these differences and annotated 12,443 genes with their predicted probability of being haploinsufficient. We validated these predictions of haploinsufficiency by demonstrating that genes with a high predicted probability of exhibiting haploinsufficiency are enriched among genes implicated in human dominant diseases and among genes causing abnormal phenotypes in heterozygous knockout mice. We have transformed these gene-based haploinsufficiency predictions into haploinsufficiency scores for genic deletions, which we demonstrate to better discriminate between pathogenic and benign deletions than consideration of the deletion size or numbers of genes deleted. These robust predictions of haploinsufficiency support clinical interpretation of novel loss-of-function variants and prioritization of variants and genes for follow-up studies.
Author Summary
Humans, like most complex organisms, have two copies of most genes in their genome, one from the mother and one from the father. This redundancy provides a back-up copy for most genes, should one copy be lost through mutation. For a minority of genes, one functional copy is not enough to sustain normal human function, and mutations causing the loss of function of one of the copies of such genes are a major cause of childhood developmental diseases. Over the past 20 years medical geneticists have identified over 300 such genes, but it is not known how many of the 22,000 genes in our genome may also be sensitive to gene loss. By comparing these ∼300 genes known to be sensitive to gene loss with over 1,000 genes where loss of a single copy does not result in disease, we have identified some key evolutionary and functional similarities between genes sensitive to loss of a single copy. We have used these similarities to predict for most genes in the genome, whether loss of a single copy is likely to result in disease. These predictions will help in the interpretation of mutations seen in patients.
doi:10.1371/journal.pgen.1001154
PMCID: PMC2954820  PMID: 20976243
30.  Parallel Evolution in Pseudomonas aeruginosa over 39,000 Generations In Vivo 
mBio  2010;1(4):e00199-10.
The Gram-negative bacterium Pseudomonas aeruginosa is a common cause of chronic airway infections in individuals with the heritable disease cystic fibrosis (CF). After prolonged colonization of the CF lung, P. aeruginosa becomes highly resistant to host clearance and antibiotic treatment; therefore, understanding how this bacterium evolves during chronic infection is important for identifying beneficial adaptations that could be targeted therapeutically. To identify potential adaptive traits of P. aeruginosa during chronic infection, we carried out global transcriptomic profiling of chronological clonal isolates obtained from 3 individuals with CF. Isolates were collected sequentially over periods ranging from 3 months to 8 years, representing up to 39,000 in vivo generations. We identified 24 genes that were commonly regulated by all 3 P. aeruginosa lineages, including several genes encoding traits previously shown to be important for in vivo growth. Our results reveal that parallel evolution occurs in the CF lung and that at least a proportion of the traits identified are beneficial for P. aeruginosa chronic colonization of the CF lung.
IMPORTANCE
Deadly diseases like AIDS, malaria, and tuberculosis are the result of long-term chronic infections. Pathogens that cause chronic infections adapt to the host environment, avoiding the immune response and resisting antimicrobial agents. Studies of pathogen adaptation are therefore important for understanding how the efficacy of current therapeutics may change upon prolonged infection. One notorious chronic pathogen is Pseudomonas aeruginosa, a bacterium that causes long-term infections in individuals with the heritable disease cystic fibrosis (CF). We used gene expression profiles to identify 24 genes that commonly changed expression over time in 3 P. aeruginosa lineages, indicating that these changes occur in parallel in the lungs of individuals with CF. Several of these genes have previously been shown to encode traits critical for in vivo-relevant processes, suggesting that they are likely beneficial adaptations important for chronic colonization of the CF lung.
doi:10.1128/mBio.00199-10
PMCID: PMC2939680  PMID: 20856824
31.  Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line 
We provide a large-scale dataset on absolute protein and matching mRNA concentrations from the human medulloblastoma cell line Daoy. The correlation between mRNA and protein concentrations is significant and positive (Rs=0.46, R²=0.29, P-value<2e-16), although non-linear. Out of ∼200 tested sequence features, sequence length, frequency and properties of amino acids, as well as translation initiation-related features are the strongest individual correlates of protein abundance when accounting for variation in mRNA concentration. When integrating mRNA expression data and all sequence features into a non-parametric regression model (Multivariate Adaptive Regression Splines), we were able to explain up to 67% of the variation in protein concentrations. Half of the contributions were attributed to mRNA concentrations, the other half to sequence features relating to regulation of translation and protein degradation. The sequence features are primarily linked to the coding and 3′ untranslated region. To our knowledge, this is the most comprehensive predictive model of human protein concentrations achieved so far.
mRNA decay, translation regulation and protein degradation are essential parts of eukaryotic gene expression regulation (Hieronymus and Silver, 2004; Mata et al, 2005), which enable the dynamics of cellular systems and their responses to external and internal stimuli without having to rely exclusively on transcription regulation. The importance of these processes is emphasized by the generally low correlation between mRNA and protein concentrations. For many prokaryotic and eukaryotic organisms, <50% of the variation in protein abundance is explained by variation in mRNA concentrations (de Sousa Abreu et al, 2009).
Given the plethora of regulatory mechanisms involved, most studies have focused so far on individual regulators and specific targets. Particularly in human, we currently lack system-wide, quantitative analyses that evaluate the relative contribution of regulatory elements encoded in the mRNA and protein sequence. Existing studies have been carried out only in bacteria and yeast (Nie et al, 2006; Brockmann et al, 2007; Tuller et al, 2007; Wu et al, 2008). Here, we present the first comprehensive analysis on the impact of translation and protein degradation on protein abundance variation in a human cell line. For this purpose, we experimentally measured absolute protein and mRNA concentrations in the Daoy medulloblastoma cell line, using shotgun proteomics and microarrays, respectively (Figure 1). These data comprise one of the largest such sets available today for human. We focused on sequence features that likely impact protein translation and protein degradation, including length, nucleotide composition, structure of the untranslated regions (UTRs), coding sequence, composition of the translation initiation site, presence of upstream open reading frames, putative target sites of miRNAs, codon usage, amino-acid composition and protein degradation signals.
Three types of tests have been conducted: (a) we examined partial Spearman's rank correlation of numerical features (e.g. length) with protein concentration, accounting for variation in mRNA concentrations; (b) for numerical and categorical features (e.g. function), we compared two extreme populations with Welch's t-test and (c) using a Multivariate Adaptive Regression Splines model, we analyzed the combined contributions of mRNA expression and sequence features to protein abundance variation (Figure 1). To account for the non-linearity of many relationships, we use non-parametric approaches throughout the analysis.
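Test (a) above, a partial Spearman rank correlation that accounts for mRNA concentration, can be sketched as follows. This is an illustrative sketch only: the abstract does not state how the authors computed it, so the code assumes the standard first-order partial correlation formula applied to Spearman coefficients, and all function and variable names are hypothetical.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(feature, protein, mrna):
    """Rank correlation of feature and protein, controlling for mRNA.

    Undefined (division by zero) if either variable is perfectly
    rank-correlated with mRNA.
    """
    r_fp = spearman(feature, protein)
    r_fm = spearman(feature, mrna)
    r_pm = spearman(protein, mrna)
    return (r_fp - r_fm * r_pm) / math.sqrt((1 - r_fm ** 2) * (1 - r_pm ** 2))
```

Here `feature` would be a numerical sequence feature such as length, `protein` the protein concentrations, and `mrna` the matching mRNA concentrations, one value per gene.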
We observed a significant positive correlation between mRNA and protein concentrations, larger than many previous measurements (de Sousa Abreu et al, 2009). We also show that the contribution of translation and protein degradation is at least as important as the contribution of mRNA transcription and stability to the abundance variation of the final protein products. Although variation in mRNA expression explains ∼25–30% of the variation in protein abundance, another 30–40% can be accounted for by characteristics of the sequences, which we identified in a comparative assessment of global correlates. Among these characteristics, sequence length, amino-acid frequencies and also nucleotide frequencies in the coding region are of strong influence (Figure 3A). Characteristics of the 3′UTR and of the 5′UTR, that is length, nucleotide composition and secondary structures, describe another part of the variation, leaving 33% of the expression variation unexplained. The unexplained fraction may be accounted for by mechanisms not considered in this analysis (e.g. regulation by RNA-binding proteins or gene-specific structural motifs), as well as expression and measurement noise.
Our combined model including mRNA concentration and sequence features can explain 67% of the variation of protein abundance in this system—and thus has the highest predictive power for human protein abundance achieved so far (Figure 3B).
Transcription, mRNA decay, translation and protein degradation are essential processes during eukaryotic gene expression, but their relative global contributions to steady-state protein concentrations in multi-cellular eukaryotes are largely unknown. Using measurements of absolute protein and mRNA abundances in cellular lysate from the human Daoy medulloblastoma cell line, we quantitatively evaluate the impact of mRNA concentration and sequence features implicated in translation and protein degradation on protein expression. Sequence features related to translation and protein degradation have an impact similar to that of mRNA abundance, and their combined contribution explains two-thirds of protein abundance variation. mRNA sequence lengths, amino-acid properties, upstream open reading frames and secondary structures in the 5′ untranslated region (UTR) were the strongest individual correlates of protein concentrations. In a combined model, characteristics of the coding region and the 3′UTR explained a larger proportion of protein abundance variation than characteristics of the 5′UTR. The absolute protein and mRNA concentration measurements for >1000 human genes described here represent one of the largest datasets currently available, and reveal both general trends and specific examples of post-transcriptional regulation.
doi:10.1038/msb.2010.59
PMCID: PMC2947365  PMID: 20739923
gene expression regulation; protein degradation; protein stability; translation
32.  A Synthetic Genetic Edge Detection Program 
Cell  2009;137(7):1272-1281.
Summary
Edge detection is a signal processing algorithm common in artificial intelligence and image recognition programs. We have constructed a genetically encoded edge detection algorithm that programs an isogenic community of E. coli to sense an image of light, communicate to identify the light-dark edges, and visually present the result of the computation. The algorithm is implemented using multiple genetic circuits. An engineered light sensor enables cells to distinguish between light and dark regions. In the dark, cells produce a diffusible chemical signal that diffuses into light regions. Genetic logic gates are used so that only cells that sense light and the diffusible signal produce a positive output. A mathematical model constructed from first principles and parameterized with experimental measurements of the component circuits predicts the performance of the complete program. Quantitatively accurate models will facilitate the engineering of more complex biological behaviors and inform bottom-up studies of natural genetic regulatory networks.
doi:10.1016/j.cell.2009.04.048
PMCID: PMC2775486  PMID: 19563759
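The logic described in the abstract above (dark cells emit a diffusible signal, and only cells that sense both light and the signal produce output) can be sketched on a one-dimensional row of cells. This is a simplified boolean illustration, not the authors' mathematical model: the `reach` parameter, which stands in for how far the signal diffuses, and the function name are hypothetical.

```python
def detect_edges(light, reach=1):
    """Return the output state of each cell in a 1D row.

    light: list of booleans, True where the cell is illuminated.
    reach: assumed number of positions the diffusible signal travels.
    """
    n = len(light)
    # A cell senses the signal if any cell within `reach` positions
    # (including itself) is dark and therefore producing it.
    signal = [
        any(not light[j] for j in range(max(0, i - reach), min(n, i + reach + 1)))
        for i in range(n)
    ]
    # AND gate: output only where both light and signal are sensed.
    return [light[i] and signal[i] for i in range(n)]


# Dark, dark, light, light, light, dark: only the lit cells at the
# light-dark boundaries switch on.
row = [False, False, True, True, True, False]
print(detect_edges(row))  # [False, False, True, False, True, False]
```

Dark cells sense their own signal but stay off because they sense no light, and lit cells far from any dark region stay off because the signal does not reach them.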
33.  Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana 
Nature biotechnology  2010;28(2):149-156.
Plants are essential sources of food, fiber and renewable energy. Effective methods for manipulating plant traits have important agricultural and economic consequences. We introduce a rational approach for associating genes with plant traits by combined use of a genome-scale functional network and targeted reverse genetic screening. We present a probabilistic network (AraNet) of functional associations among 19,647 (73%) genes of the reference flowering plant Arabidopsis thaliana. AraNet associations have measured precision greater than literature-based protein interactions (21%) for 55% of genes, and are highly predictive for diverse biological pathways. Using AraNet, we found a 10-fold enrichment in identifying early seedling development genes. By interrogating network neighborhoods, we identify At1g80710 (now Drought sensitive 1; Drs1) and At3g05090 (now Lateral root stimulator 1; Lrs1) as novel regulators of drought sensitivity and lateral root development, respectively. AraNet (http://www.functionalnet.org/aranet/) provides a global resource for plant gene function identification and genetic dissection of plant traits.
doi:10.1038/nbt.1603
PMCID: PMC2857375  PMID: 20118918
34.  The planar cell polarity effector Fuz is essential for targeted membrane trafficking, ciliogenesis, and mouse embryonic development 
Nature cell biology  2009;11(10):1225-1232.
The planar cell polarity (PCP) signaling pathway is essential for embryonic development because it governs diverse cellular behaviors, and the “core PCP” proteins, such as Dishevelled and Frizzled, have been extensively characterized (refs. 1–4). By contrast, the “PCP effector” proteins, such as Intu and Fuz, remain largely unstudied (refs. 5, 6). These proteins are essential for PCP signaling, but they have never been investigated in a mammal and their cell biological activities remain entirely unknown. We report here that Fuz mutant mice display neural tube defects, skeletal dysmorphologies, and Hedgehog signaling defects stemming from disrupted ciliogenesis. Using bioinformatics and imaging of an in vivo mucociliary epithelium, we establish a central role for Fuz in membrane trafficking, showing that Fuz is essential for trafficking of cargo to basal bodies and to the apical tips of cilia. Fuz is also essential for exocytosis in secretory cells. Finally, we identify a novel, Rab-related small GTPase as a Fuz interaction partner that is also essential for ciliogenesis and secretion. These results are significant because they provide novel insights into the mechanisms by which developmental regulatory systems like PCP signaling interface with fundamental cellular systems such as the vesicle trafficking machinery.
doi:10.1038/ncb1966
PMCID: PMC2755648  PMID: 19767740
35.  Disorder, promiscuity, and toxic partnerships 
Cell  2009;138(1):16-18.
Many genes are toxic when overexpressed, but general mechanisms for this toxicity have proven elusive. Vavouri et al. (2009) find that intrinsic protein disorder and promiscuous molecular interactions are strong determinants of dosage sensitivity, explaining in part the toxicity of dosage-sensitive oncogenes in mice and humans.
doi:10.1016/j.cell.2009.06.024
PMCID: PMC2848715  PMID: 19596229
36.  Ribosome stalk assembly requires the dual-specificity phosphatase Yvh1 for the exchange of Mrt4 with P0 
The Journal of Cell Biology  2009;186(6):849-862.
The step by step assembly process from preribosome in the nucleus to translation-competent 60S ribosome subunit in the cytoplasm is revealed (also see Kemmler et al. in this issue).
The ribosome stalk is essential for recruitment of translation factors. In yeast, P0 and Rpl12 correspond to bacterial L10 and L11 and form the stalk base of mature ribosomes, whereas Mrt4 is a nuclear paralogue of P0. In this study, we show that the dual-specificity phosphatase Yvh1 is required for the release of Mrt4 from the pre-60S subunits. Deletion of YVH1 leads to the persistence of Mrt4 on pre-60S subunits in the cytoplasm. A mutation in Mrt4 at the protein–RNA interface bypasses the requirement for Yvh1. Pre-60S subunits associated with Yvh1 contain Rpl12 but lack both Mrt4 and P0. These results suggest a linear series of events in which Yvh1 binds to the pre-60S subunit to displace Mrt4. Subsequently, P0 loads onto the subunit to assemble the mature stalk, and Yvh1 is released. The initial assembly of the ribosome with Mrt4 may provide functional compartmentalization of ribosome assembly in addition to the spatial separation afforded by the nuclear envelope.
doi:10.1083/jcb.200904110
PMCID: PMC2753163  PMID: 19797078
37.  Human Cell Chips: Adapting DNA Microarray Spotting Technology to Cell-Based Imaging Assays 
PLoS ONE  2009;4(10):e7088.
Here we describe human spotted cell chips, a technology for determining cellular state across arrays of cells subjected to chemical or genetic perturbation. Cells are grown and treated under standard tissue culture conditions before being fixed and printed onto replicate glass slides, effectively decoupling the experimental conditions from the assay technique. Each slide is then probed using immunofluorescence or other optical reporter and assayed by automated microscopy. We show potential applications of the cell chip by assaying HeLa and A549 samples for changes in target protein abundance (of the dsRNA-activated protein kinase PKR), subcellular localization (nuclear translocation of NFκB) and activation state (phosphorylation of STAT1 and of the p38 and JNK stress kinases) in response to treatment by several chemical effectors (anisomycin, TNFα, and interferon), and we demonstrate scalability by printing a chip with ∼4,700 discrete samples of HeLa cells. Coupling this technology to high-throughput methods for culturing and treating cell lines could enable researchers to examine the impact of exogenous effectors on the same population of experimentally treated cells across multiple reporter targets potentially representing a variety of molecular systems, thus producing a highly multiplexed dataset with minimized experimental variance and at reduced reagent cost compared to alternative techniques. The ability to prepare and store chips also allows researchers to follow up on observations gleaned from initial screens with maximal repeatability.
doi:10.1371/journal.pone.0007088
PMCID: PMC2760726  PMID: 19862318
38.  Rational Extension of the Ribosome Biogenesis Pathway Using Network-Guided Genetics 
PLoS Biology  2009;7(10):e1000213.
Gene networks are an efficient route for associating candidate genes with biological processes. Here, networks are used to discover more than 15 new genes for ribosomal subunit maturation, rRNA processing, and ribosomal export from the nucleus.
Biogenesis of ribosomes is an essential cellular process conserved across all eukaryotes and is known to require >170 genes for the assembly, modification, and trafficking of ribosome components through multiple cellular compartments. Despite intensive study, this pathway likely involves many additional genes. Here, we employ network-guided genetics—an approach for associating candidate genes with biological processes that capitalizes on recent advances in functional genomic and proteomic studies—to computationally identify additional ribosomal biogenesis genes. We experimentally evaluated >100 candidate yeast genes in a battery of assays, confirming involvement of at least 15 new genes, including previously uncharacterized genes (YDL063C, YIL091C, YOR287C, YOR006C/TSR3, YOL022C/TSR4). We associate the new genes with specific aspects of ribosomal subunit maturation, ribosomal particle association, and ribosomal subunit nuclear export, and we identify genes specifically required for the processing of 5S, 7S, 20S, 27S, and 35S rRNAs. These results reveal new connections between ribosome biogenesis and mRNA splicing and add >10% new genes—most with human orthologs—to the biogenesis pathway, significantly extending our understanding of a universally conserved eukaryotic process.
Author Summary
Ribosomes are the extremely complex cellular machines responsible for constructing new proteins. In eukaryotic cells, such as yeast, each ribosome contains more than 80 protein or RNA components. These complex machines must themselves be assembled by an even more complex machinery spanning multiple cellular compartments and involving perhaps 200 components in an ordered series of processing events, resulting in delivery of the two halves of the mature ribosome, the 40S and 60S components, to the cytoplasm. The ribosome biogenesis machinery has been only partially characterized, and many lines of evidence suggest that there are additional components that are still unknown. We employed an emerging computational technique called network-guided genetics to identify new candidate genes for this pathway. We then tested the candidates in a battery of experimental assays to determine what roles the genes might play in the biogenesis of ribosomes. This approach proved an efficient route to the discovery of new genes involved in ribosome biogenesis, significantly extending our understanding of a universally conserved eukaryotic process.
doi:10.1371/journal.pbio.1000213
PMCID: PMC2749941  PMID: 19806183
39.  Mining gene functional networks to improve mass-spectrometry-based protein identification 
Bioinformatics  2009;25(22):2955-2961.
Motivation: High-throughput protein identification experiments based on tandem mass spectrometry (MS/MS) often suffer from low sensitivity and low-confidence protein identifications. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other evidence to suggest that a protein is present and confidence in individual protein identification can be updated accordingly.
Results: We develop a method that analyzes MS/MS experiments in the larger context of the biological processes active in a cell. Our method, MSNet, improves protein identification in shotgun proteomics experiments by considering information on functional associations from a gene functional network. MSNet substantially increases the number of proteins identified in the sample at a given error rate. We identify 8–29% more proteins than the original MS experiment when applied to yeast grown in different experimental conditions analyzed on different MS/MS instruments, and 37% more proteins in a human sample. We validate up to 94% of our identifications in yeast by presence in ground-truth reference sets.
Availability and Implementation: Software and datasets are available at http://aug.csres.utexas.edu/msnet
Contact: miranker@cs.utexas.edu, marcotte@icmb.utexas.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp461
PMCID: PMC2773251  PMID: 19633097
40.  Integrating shotgun proteomics and mRNA expression data to improve protein identification 
Bioinformatics  2009;25(11):1397-1403.
Motivation: Tandem mass spectrometry (MS/MS) offers fast and reliable characterization of complex protein mixtures, but suffers from low sensitivity in protein identification. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other information available, e.g. the probability of a protein's presence is likely to correlate with its mRNA concentration.
Results: We develop a Bayesian score that estimates the posterior probability of a protein's presence in the sample given its identification in an MS/MS experiment and its mRNA concentration measured under similar experimental conditions. Our method, MSpresso, substantially increases the number of proteins identified in an MS/MS experiment at the same error rate, e.g. in yeast, MSpresso increases the number of proteins identified by ∼40%. We apply MSpresso to data from different MS/MS instruments, experimental conditions and organisms (Escherichia coli, human), and predict 19–63% more proteins across the different datasets. MSpresso demonstrates that incorporating prior knowledge of protein presence into shotgun proteomics experiments can substantially improve protein identification scores.
Availability and Implementation: Software is available upon request from the authors. Mass spectrometry datasets and supplementary information are available from http://www.marcottelab.org/MSpresso/.
Contact: marcotte@icmb.utexas.edu; miranker@cs.utexas.edu
Supplementary Information: Supplementary data website: http://www.marcottelab.org/MSpresso/.
doi:10.1093/bioinformatics/btp168
PMCID: PMC2682515  PMID: 19318424
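The general idea in the abstract above, updating confidence in a protein's presence using an mRNA-informed prior, can be illustrated with Bayes' rule. This is a hedged sketch and not the MSpresso scoring function itself, whose details the abstract does not give: it assumes the MS/MS evidence and the mRNA level are conditionally independent given presence or absence, and the function name, parameter names, and example probabilities are hypothetical.

```python
def posterior_presence(prior_given_mrna, p_evidence_if_present, p_evidence_if_absent):
    """Posterior probability that a protein is present in the sample.

    prior_given_mrna:      assumed P(present | mRNA concentration)
    p_evidence_if_present: assumed P(MS/MS evidence | protein present)
    p_evidence_if_absent:  assumed P(MS/MS evidence | protein absent)
    """
    numerator = p_evidence_if_present * prior_given_mrna
    denominator = numerator + p_evidence_if_absent * (1.0 - prior_given_mrna)
    return numerator / denominator


# The same MS/MS evidence yields a higher posterior for a protein whose
# mRNA level suggests it is more likely to be present.
low_mrna = posterior_presence(0.2, 0.8, 0.2)
high_mrna = posterior_presence(0.9, 0.8, 0.2)
print(low_mrna < high_mrna)  # True
```

With an uninformative prior of 0.5 the posterior depends on the MS/MS evidence alone, which corresponds to the "all proteins equally likely" assumption the abstract contrasts against.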
41.  Systematic Definition of Protein Constituents along the Major Polarization Axis Reveals an Adaptive Reuse of the Polarization Machinery in Pheromone-Treated Budding Yeast 
Polarizing cells extensively restructure cellular components in a spatially and temporally coupled manner along the major axis of cellular extension. Budding yeast are a useful model of polarized growth, helping to define many molecular components of this conserved process. Besides budding, yeast cells also differentiate upon treatment with pheromone from the opposite mating type, forming a mating projection (the ‘shmoo’) by directional restructuring of the cytoskeleton, localized vesicular transport and overall reorganization of the cytosol. To characterize the proteomic localization changes accompanying polarized growth, we developed and implemented a novel cell microarray-based imaging assay for measuring the spatial redistribution of a large fraction of the yeast proteome, and applied this assay to identify proteins localized along the mating projection following pheromone treatment. We further trained a machine learning algorithm to refine the cell imaging screen, identifying additional shmoo-localized proteins. In all, we identified 74 proteins that specifically localize to the mating projection, including previously uncharacterized proteins (Ycr043c, Ydr348c, Yer071c, Ymr295c, and Yor304c-a) and known polarization complexes such as the exocyst. Functional analysis of these proteins, coupled with quantitative analysis of individual organelle movements during shmoo formation, suggests a model in which the basic machinery for cell polarization is generally conserved between processes forming the bud and the shmoo, with a distinct subset of proteins used only for shmoo formation. The net effect is a defined ordering of major organelles along the polarization axis, with specific proteins implicated at the proximal growth tip.
Upon sensing mating pheromone, budding yeast cells form a mating projection (the ‘shmoo’) that serves as a model for polarized cell growth, involving cytoskeletal/cytosolic restructuring and directed vesicular transport. We developed a cell microarray-based imaging assay for measuring localization of the yeast proteome during polarized growth. We find major organelles ordered along the polarization axis, localize 74 proteins to the growth tip, and observe adaptive reuse of general polarization machinery.
doi:10.1021/pr800524g
PMCID: PMC2651748  PMID: 19053807
Proteomics; polarized growth; subcellular localization; pheromone response; yeast
42.  Buffering by gene duplicates: an analysis of molecular correlates and evolutionary conservation 
BMC Genomics  2008;9:609.
Background
One mechanism to account for robustness against gene knockouts or knockdowns is through buffering by gene duplicates, but the extent and general correlates of this process in organisms is still a matter of debate. To reveal general trends of this process, we provide a comprehensive comparison of gene essentiality, duplication and buffering by duplicates across seven bacteria (Mycoplasma genitalium, Bacillus subtilis, Helicobacter pylori, Haemophilus influenzae, Mycobacterium tuberculosis, Pseudomonas aeruginosa, Escherichia coli), and four eukaryotes (Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Mus musculus (mouse)).
Results
In nine of the eleven organisms, duplicates significantly increase chances of survival upon gene deletion (P-value ≤ 0.05), but only by up to 13%. Given that duplicates make up to 80% of eukaryotic genomes, the small contribution is surprising and points to dominant roles of other buffering processes, such as alternative metabolic pathways. The buffering capacity of duplicates appears to be independent of the degree of gene essentiality and tends to be higher for genes with high expression levels. For example, buffering capacity increases to 23% amongst highly expressed genes in E. coli. Sequence similarity and the number of duplicates per gene are weak predictors of the duplicate's buffering capacity. In a case study we show that buffering gene duplicates in yeast and worm are somewhat more similar in their functions than non-buffering duplicates and have increased transcriptional and translational activity.
Conclusion
In sum, the extent of gene essentiality and buffering by duplicates is not conserved across organisms and does not correlate with the organisms' apparent complexity. This heterogeneity goes beyond what would be expected from differences in experimental approaches alone. Buffering by duplicates contributes to robustness in several organisms, but to a small extent – and the relatively large amount of buffering by duplicates observed in yeast and worm may be largely specific to these organisms. Thus, the only common factor of buffering by duplicates between different organisms may be the by-product of duplicate retention due to demands of high dosage.
doi:10.1186/1471-2164-9-609
PMCID: PMC2627895  PMID: 19087332
43.  The APEX Quantitative Proteomics Tool: Generating protein quantitation estimates from LC-MS/MS proteomics results 
BMC Bioinformatics  2008;9:529.
Background
Mass spectrometry (MS) based label-free protein quantitation has mainly focused on analysis of ion peak heights and peptide spectral counts. Most analyses of tandem mass spectrometry (MS/MS) data begin with an enzymatic digestion of a complex protein mixture to generate smaller peptides that can be separated and identified by an MS/MS instrument. Peptide spectral counting techniques attempt to quantify protein abundance by counting the number of detected tryptic peptides and their corresponding MS spectra. However, spectral counting is confounded by the fact that peptide physicochemical properties severely affect MS detection resulting in each peptide having a different detection probability. Lu et al. (2007) described a modified spectral counting technique, Absolute Protein Expression (APEX), which improves on basic spectral counting methods by including a correction factor for each protein (called Oi value) that accounts for variable peptide detection by MS techniques. The technique uses machine learning classification to derive peptide detection probabilities that are used to predict the number of tryptic peptides expected to be detected for one molecule of a particular protein (Oi). This predicted spectral count is compared to the protein's observed MS total spectral count during APEX computation of protein abundances.
Results
The APEX Quantitative Proteomics Tool, introduced here, is a free open source Java application that supports the APEX protein quantitation technique. The APEX tool uses data from standard tandem mass spectrometry proteomics experiments and provides computational support for APEX protein abundance quantitation through a set of graphical user interfaces that partition the parameter controls for the various processing tasks. The tool also provides a Z-score analysis for identification of significant differential protein expression, a utility to assess APEX classifier performance via cross validation, and a utility to merge multiple APEX results into a standardized format in preparation for further statistical analysis.
Conclusion
The APEX Quantitative Proteomics Tool provides a simple means to quickly derive hundreds to thousands of protein abundance values from standard liquid chromatography-tandem mass spectrometry proteomics datasets. The APEX tool provides a straightforward intuitive interface design overlaying a highly customizable computational workflow to produce protein abundance values from LC-MS/MS datasets.
doi:10.1186/1471-2105-9-529
PMCID: PMC2639435  PMID: 19068132
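The correction described in entry 43 (observed spectral counts compared against an expected count Oi per protein) can be illustrated with a short sketch. This is not the APEX tool's code: the function name, the argument names, and the exact normalization (corrected counts rescaled to sum to a chosen total) are assumptions made for illustration, and the Oi values are taken as given rather than derived from a trained classifier.

```python
def apex_abundances(observed_counts, expected_counts, id_probs=None, total=1.0):
    """Toy APEX-style estimate (hypothetical helper, not the tool's API).

    observed_counts: protein -> observed total spectral count
    expected_counts: protein -> Oi, expected detected peptides per molecule
    id_probs:        protein -> identification probability (assumed 1.0 if absent)
    total:           value the corrected abundances are rescaled to sum to
    """
    id_probs = id_probs or {}
    corrected = {}
    for protein, n in observed_counts.items():
        o_i = expected_counts[protein]
        if o_i <= 0:
            raise ValueError(f"Oi must be positive for {protein}")
        # Divide the observed count by the expected count, so proteins whose
        # peptides are easy to detect are not overestimated.
        corrected[protein] = n * id_probs.get(protein, 1.0) / o_i
    norm = sum(corrected.values())
    if norm == 0:
        raise ValueError("no spectral counts to normalize")
    return {p: total * c / norm for p, c in corrected.items()}


# Two proteins with equal observed counts but different detectability.
example = apex_abundances({"A": 10, "B": 10}, {"A": 1.0, "B": 4.0})
```

In this example both proteins have the same raw spectral count, but protein B is expected to yield four times as many detectable peptides per molecule, so its estimated share of the total is correspondingly smaller.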
44.  Age-Dependent Evolution of the Yeast Protein Interaction Network Suggests a Limited Role of Gene Duplication and Divergence 
PLoS Computational Biology  2008;4(11):e1000232.
Proteins interact in complex protein–protein interaction (PPI) networks whose topological properties—such as scale-free topology, hierarchical modularity, and dissortativity—have suggested models of network evolution. Currently preferred models invoke preferential attachment or gene duplication and divergence to produce networks whose topology matches that observed for real PPIs, thus supporting these as likely models for network evolution. Here, we show that the interaction density and homodimeric frequency are highly protein age–dependent in real PPI networks in a manner which does not agree with these canonical models. In light of these results, we propose an alternative stochastic model, which adds each protein sequentially to a growing network in a manner analogous to protein crystal growth (CG) in solution. The key ideas are (1) interaction probability increases with availability of unoccupied interaction surface, thus following an anti-preferential attachment rule, (2) as a network grows, highly connected sub-networks emerge into protein modules or complexes, and (3) once a new protein is committed to a module, further connections tend to be localized within that module. The CG model produces PPI networks consistent in both topology and age distributions with real PPI networks and is well supported by the spatial arrangement of protein complexes of known 3-D structure, suggesting a plausible physical mechanism for network evolution.
Author Summary
Proteins function together forming stable protein complexes or transient interactions in various cellular processes, such as gene regulation and signaling. Here, we address the basic question of how these networks of interacting proteins evolve. This is an important problem, as the structures of such networks underlie important features of biological systems, such as functional modularity, error-tolerance, and stability. It is not yet known how these network architectures originate or what driving forces underlie the observed network structure. Several models have been proposed over the past decade—in particular, a “rich get richer” model (preferential attachment) and a model based upon gene duplication and divergence—often based only on network topologies. Here, we show that real yeast protein interaction networks show a unique age distribution among interacting proteins, which rules out these canonical models. In light of these results, we developed a simple, alternative model based on well-established physical principles, analogous to the process of growing protein crystals in solution. The model better explains many features of real PPI networks, including the network topologies, their characteristic age distributions, and the spatial distribution of subunits of differing ages within protein complexes, suggesting a plausible physical mechanism of network evolution.
doi:10.1371/journal.pcbi.1000232
PMCID: PMC2583957  PMID: 19043579
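The growth rules listed in entry 44 can be made concrete with a toy simulation. The sketch below is an assumption-laden illustration, not the authors' CG model: it gives every protein a fixed number of interaction sites (`capacity`, a hypothetical parameter), picks the first partner with probability proportional to its free sites (rule 1, anti-preferential attachment), and draws any further links from that partner's neighbors (rule 3). Module emergence (rule 2) is not modeled explicitly.

```python
import random


def grow_network(n_proteins, capacity=4, extra_links=1, seed=0):
    """Toy sequential network growth; returns protein -> set of partners."""
    rng = random.Random(seed)
    neighbors = {0: set()}
    for new in range(1, n_proteins):
        # Free surface = unused interaction sites on each existing protein.
        free = {p: capacity - len(nb) for p, nb in neighbors.items()}
        candidates = [p for p in sorted(free) if free[p] > 0]
        neighbors[new] = set()
        if not candidates:
            continue  # no free surface anywhere; protein stays unconnected
        # Rule 1: more unoccupied surface means a higher chance of interaction.
        weights = [free[p] for p in candidates]
        first = rng.choices(candidates, weights=weights)[0]
        neighbors[new].add(first)
        neighbors[first].add(new)
        # Rule 3: further links stay within the first partner's neighborhood.
        for _ in range(extra_links):
            if len(neighbors[new]) >= capacity:
                break
            local = sorted(
                p for p in neighbors[first]
                if p != new
                and p not in neighbors[new]
                and len(neighbors[p]) < capacity
            )
            if not local:
                break
            partner = rng.choice(local)
            neighbors[new].add(partner)
            neighbors[partner].add(new)
    return neighbors


network = grow_network(50, capacity=4, seed=1)
```

Because a protein's attachment weight falls as its sites fill, early proteins cannot accumulate links indefinitely, which is the opposite of a "rich get richer" rule.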
45.  mspire: mass spectrometry proteomics in Ruby 
Bioinformatics  2008;24(23):2796-2797.
Summary: Mass spectrometry-based proteomics stands to gain from additional analysis of its data, but its large, complex datasets make demands on speed and memory usage requiring special consideration from scripting languages. The software library ‘mspire’—developed in the Ruby programming language—offers quick and memory-efficient readers for standard xml proteomics formats, converters for intermediate file types in typical proteomics spectral-identification work flows (including the Bioworks .srf format), and modules for the calculation of peptide false identification rates.
Availability: Freely available at http://mspire.rubyforge.org. Additional data models, usage information, and methods available at http://bioinformatics.icmb.utexas.edu/mspire
Contact: marcotte@icmb.utexas.edu
doi:10.1093/bioinformatics/btn513
PMCID: PMC2639276  PMID: 18930952
46.  Bud23 Methylates G1575 of 18S rRNA and Is Required for Efficient Nuclear Export of Pre-40S Subunits 
Molecular and Cellular Biology  2008;28(10):3151-3161.
BUD23 was identified from a bioinformatics analysis of Saccharomyces cerevisiae genes involved in ribosome biogenesis. Deletion of BUD23 leads to severely impaired growth, reduced levels of the small (40S) ribosomal subunit, and a block in processing 20S rRNA to 18S rRNA, a late step in 40S maturation. Bud23 belongs to the S-adenosylmethionine-dependent Rossmann-fold methyltransferase superfamily and is related to small-molecule methyltransferases. Nevertheless, we considered that Bud23 methylates rRNA. Methylation of G1575 is the only mapped modification for which the methylase has not been assigned. Here, we show that this modification is lost in bud23 mutants. The nuclear accumulation of the small-subunit reporters Rps2-green fluorescent protein (GFP) and Rps3-GFP, as well as the rRNA processing intermediate, the 5′ internal transcribed spacer 1, indicate that bud23 mutants are defective for small-subunit export. Mutations in Bud23 that inactivated its methyltransferase activity complemented a bud23Δ mutant. In addition, mutant ribosomes in which G1575 was changed to adenosine supported growth comparable to that of cells with wild-type ribosomes. Thus, Bud23 protein, but not its methyltransferase activity, is important for biogenesis and export of the 40S subunit in yeast.
doi:10.1128/MCB.01674-07
PMCID: PMC2423152  PMID: 18332120
47.  Mechanisms of Cell Cycle Control Revealed by a Systematic and Quantitative Overexpression Screen in S. cerevisiae 
PLoS Genetics  2008;4(7):e1000120.
Regulation of cell cycle progression is fundamental to cell health and reproduction, and failures in this process are associated with many human diseases. Much of our knowledge of cell cycle regulators derives from loss-of-function studies. To reveal new cell cycle regulatory genes that are difficult to identify in loss-of-function studies, we performed a near-genome-wide flow cytometry assay of yeast gene overexpression-induced cell cycle delay phenotypes. We identified 108 genes whose overexpression significantly delayed the progression of the yeast cell cycle at a specific stage. Many of the genes are newly implicated in cell cycle progression, for example SKO1, RFA1, and YPR015C. The overexpression of RFA1 or YPR015C delayed the cell cycle at G2/M phases by disrupting spindle attachment to chromosomes and activating the DNA damage checkpoint, respectively. In contrast, overexpression of the transcription factor SKO1 arrests cells at G1 phase by activating the pheromone response pathway, revealing new cross-talk between osmotic sensing and mating. More generally, 92%–94% of the genes exhibit distinct phenotypes when overexpressed as compared to their corresponding deletion mutants, supporting the notion that many genes may gain functions upon overexpression. This work thus implicates new genes in cell cycle progression, complements previous screens, and lays the foundation for future experiments to define more precisely roles for these genes in cell cycle progression.
Author Summary
All cells require proper cell cycle regulation; failure leads to numerous human diseases. Cell cycle mechanisms are broadly conserved across eukaryotes, with many key regulatory genes known. Nonetheless, our knowledge of regulators is incomplete. Many classic studies have analyzed yeast loss-of-function mutants to identify cell cycle genes. Studies have also implicated genes based upon their overexpression phenotypes, but the effects of gene overexpression on the cell cycle have not been quantified for all yeast genes. We individually quantified the effect of overexpression on cell cycle progression for nearly all (91%) of yeast genes, and we report the 108 genes causing the most significant and reproducible cell cycle defects, most of which have not been previously observed. We characterize three genes in more detail, implicating one in chromosomal segregation and mitotic spindle formation. A second affects mitotic stability and the DNA damage checkpoint. Curiously, overexpression of a third gene, SKO1, arrests the cell cycle by activating the pheromone response pathway, with cells mistakenly behaving as if mating pheromone is present. These results establish a basis for future experiments elucidating precise cell cycle roles for these genes. Similar assays in human cells could help further clarify the many connections between cell cycle control and cancers.
doi:10.1371/journal.pgen.1000120
PMCID: PMC2438615  PMID: 18617996
48.  Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy 
Genome Biology  2008;9(Suppl 1):S5.
The complete set of mouse genes, as with the set of human genes, is still largely uncharacterized, with many pieces of experimental evidence accumulating regarding the activities and expression of the genes, but the majority of genes as yet still of unknown function. Within the context of the MouseFunc competition, we developed and applied two distinct large-scale data mining approaches to infer the functions (Gene Ontology annotations) of mouse genes from experimental observations from available functional genomics, proteomics, comparative genomics, and phenotypic data. The two strategies — the first using classifiers to map features to annotations, the second propagating annotations from characterized genes to uncharacterized genes along edges in a network constructed from the features — offer alternative and possibly complementary approaches to providing functional annotations. Here, we re-implement and evaluate these approaches and their combination for their ability to predict the proper functional annotations of genes in the MouseFunc data set. We show that, when controlling for the same set of input features, the network approach generally outperformed a naïve Bayesian classifier approach, while their combination offers some improvement over either independently. We make our observations of predictive performance on the MouseFunc competition hold-out set, as well as on a ten-fold cross-validation of the MouseFunc data. Across all 1,339 annotated genes in the MouseFunc test set, the median predictive power was quite strong (median area under a receiver operating characteristic plot of 0.865 and average precision of 0.195), indicating that a mining-based strategy with existing data is a promising path towards discovering mammalian gene functions. 
As one product of this work, a high-confidence subset of the functional mouse gene network was produced — spanning >70% of mouse genes with >1.6 million associations — that is predictive of mouse (and therefore often human) gene function and functional associations. The network should be generally useful for mammalian gene functional analyses, such as for predicting interactions, inferring functional connections between genes and pathways, and prioritizing candidate genes. The network and all predictions are available on the worldwide web.
doi:10.1186/gb-2008-9-s1-s5
PMCID: PMC2447539  PMID: 18613949
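The second strategy in entry 48, propagating annotations from characterized genes to uncharacterized genes along network edges, can be sketched in its simplest form. The code below is a minimal guilt-by-association scorer under stated assumptions: the edge weights, gene names, and the sum-of-weights scoring rule are illustrative choices, not the scoring scheme used in the paper.

```python
def guilt_by_association(edges, annotated):
    """Rank unannotated genes by summed edge weight to annotated neighbors.

    edges:     dict mapping (gene_a, gene_b) -> association weight (undirected)
    annotated: set of genes already carrying the annotation of interest
    """
    scores = {}
    for (a, b), weight in edges.items():
        # Treat each edge as undirected: check it from both ends.
        for gene, neighbor in ((a, b), (b, a)):
            if neighbor in annotated and gene not in annotated:
                scores[gene] = scores.get(gene, 0.0) + weight
    # Highest score first; ties broken by gene name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical four-gene network with two annotated genes.
toy_edges = {
    ("g1", "g2"): 0.9,
    ("g2", "g3"): 0.4,
    ("g3", "g4"): 0.7,
    ("g1", "g4"): 0.2,
}
ranking = guilt_by_association(toy_edges, annotated={"g1", "g3"})
```

Here `g2` outranks `g4` because its links to the annotated genes carry more total weight. A classifier-based approach would instead map each gene's feature vector to the annotation directly, without using the network structure.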
49.  A critical assessment of Mus musculus gene function prediction using integrated genomic evidence 
Genome Biology  2008;9(Suppl 1):S2.
Background:
Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.
Results:
In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.
Conclusion:
We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.
doi:10.1186/gb-2008-9-s1-s2
PMCID: PMC2447536  PMID: 18613946
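Entry 49 reports precision at a fixed recall rate of 20%. A minimal sketch of how such a figure can be computed from ranked predictions follows; the function name and the toy data are assumptions, and tied scores are not handled specially.

```python
def precision_at_recall(scored, target_recall):
    """Precision at the first rank cutoff whose recall reaches target_recall.

    scored: list of (score, is_true) pairs, one per prediction
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    total_positives = sum(1 for _, is_true in scored if is_true)
    if total_positives == 0:
        raise ValueError("no true annotations to recall")
    ranked = sorted(scored, key=lambda pair: -pair[0])
    true_positives = 0
    for k, (_, is_true) in enumerate(ranked, start=1):
        if is_true:
            true_positives += 1
        # Recall = fraction of all true annotations recovered so far.
        if true_positives / total_positives >= target_recall:
            return true_positives / k
    raise AssertionError("unreachable: full recall is reached at the last rank")


# Hypothetical predictions: four true annotations among seven.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, False),
       (0.5, True), (0.4, True), (0.3, False)]
p_at_half = precision_at_recall(toy, 0.5)
p_at_quarter = precision_at_recall(toy, 0.25)
```

Lowering the recall target moves the cutoff toward the top of the ranking, where predictions are usually more reliable, which is why precision is typically quoted at a modest recall level.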
50.  Group II Intron Protein Localization and Insertion Sites Are Affected by Polyphosphate 
PLoS Biology  2008;6(6):e150.
Mobile group II introns consist of a catalytic intron RNA and an intron-encoded protein with reverse transcriptase activity, which act together in a ribonucleoprotein particle to promote DNA integration during intron mobility. Previously, we found that the Lactococcus lactis Ll.LtrB intron-encoded protein (LtrA) expressed alone or with the intron RNA to form ribonucleoprotein particles localizes to bacterial cellular poles, potentially accounting for the intron's preferential insertion in the oriC and ter regions of the Escherichia coli chromosome. Here, by using cell microarrays and automated fluorescence microscopy to screen a transposon-insertion library, we identified five E. coli genes (gppA, uhpT, wcaK, ynbC, and zntR) whose disruption results in both an increased proportion of cells with more diffuse LtrA localization and a more uniform genomic distribution of Ll.LtrB-insertion sites. Surprisingly, we find that a common factor affecting LtrA localization in these and other disruptants is the accumulation of intracellular polyphosphate, which appears to bind LtrA and other basic proteins and delocalize them away from the poles. Our findings show that the intracellular localization of a group II intron-encoded protein is a major determinant of insertion-site preference. More generally, our results suggest that polyphosphate accumulation may provide a means of localizing proteins to different sites of action during cellular stress or entry into stationary phase, with potentially wide physiological consequences.
Author Summary
Group II introns are bacterial mobile elements thought to be ancestors of introns—genetic material that is discarded from messenger RNA transcripts—and retroelements—genetic elements and viruses that replicate via reverse transcription—in higher organisms. They propagate by forming a complex consisting of the catalytically active intron RNA and an intron-encoded reverse transcriptase (which converts the RNA to DNA, which can then be reinserted in the host genome). The Ll.LtrB group II intron-encoded protein (LtrA) was found previously to localize to bacterial cellular poles, potentially accounting for the preferential insertion of Ll.LtrB in the replication origin (oriC) and terminus (ter) regions of the Escherichia coli chromosome, which are located near the poles during much of the cell cycle. Here, we identify E. coli genes whose disruption leads both to more diffuse LtrA localization and a more uniform chromosomal distribution of Ll.LtrB-insertion sites, proving that the location of the LtrA protein contributes to insertion-site preference. Surprisingly, we find that LtrA localization in the disruptants is affected by the accumulation of intracellular polyphosphate, which appears to bind basic proteins and delocalize them away from the cellular poles. Thus, polyphosphate, a ubiquitous but enigmatic molecule in prokaryotes and eukaryotes, can localize proteins to different sites of action, with potentially wide physiological consequences.
A novel cell microarray method uncovers connections between group II intron mobility, cell stress, and polyphosphate metabolism, including the finding that polyphosphate can influence intracellular protein localization.
doi:10.1371/journal.pbio.0060150
PMCID: PMC2435150  PMID: 18593213
