Inferring a protein’s function by homology is a powerful tool for biologists. The Princeton Protein Orthology Database (P-POD) offers a simple way to visualize and analyze the relationships between homologous proteins in order to infer function. P-POD contains computationally generated analysis distinguishing orthologs from paralogs combined with curated published information on functional complementation and on human diseases. P-POD also features an applet, Notung, for users to explore and modify phylogenetic trees and generate their own ortholog/paralog calls. This unit describes how to search P-POD for precomputed data, how to find and use the associated curated information from the literature, and how to use Notung to analyze and refine the results.
functional complementation; disease; conservation; phylogenetic analysis; trees; paralogs; Notung
PhosphoGRID is an online database that curates and houses experimentally verified in vivo phosphorylation sites in the Saccharomyces cerevisiae proteome (www.phosphogrid.org). Phosphosites are annotated with specific protein kinases and/or phosphatases, along with the condition(s) under which the phosphorylation occurs and/or the effects on protein function. We report here an updated data set, including nine additional high-throughput (HTP) mass spectrometry studies. The version 2.0 data set contains information on 20 177 unique phosphorylated residues, representing a 4-fold increase from version 1.0, and includes 1614 unique phosphosites derived from focused low-throughput (LTP) studies. The overlap between HTP and LTP studies represents only ∼3% of the total unique sites, but importantly 45% of sites from LTP studies with defined function were discovered in at least two independent HTP studies. The majority of new phosphosites in this update occur on previously documented proteins, suggesting that coverage of phosphoproteins in the yeast proteome is approaching saturation. We will continue to update the PhosphoGRID data set, with the expectation that the integration of information from LTP and HTP studies will enable the development of predictive models of phosphorylation-based signaling networks.
The Biological General Repository for Interaction Datasets (BioGRID: http://thebiogrid.org) is an open access archive of genetic and protein interactions that are curated from the primary biomedical literature for all major model organism species. As of September 2012, BioGRID houses more than 500 000 manually annotated interactions from more than 30 model organisms. BioGRID maintains complete curation coverage of the literature for the budding yeast Saccharomyces cerevisiae, the fission yeast Schizosaccharomyces pombe and the model plant Arabidopsis thaliana. A number of themed curation projects in areas of biomedical importance are also supported. BioGRID has established collaborations and/or shares data records for the annotation of interactions and phenotypes with most major model organism databases, including Saccharomyces Genome Database, PomBase, WormBase, FlyBase and The Arabidopsis Information Resource. BioGRID also actively engages with the text-mining community to benchmark and deploy automated tools to expedite curation workflows. BioGRID data are freely accessible through both a user-defined interactive interface and in batch downloads in a wide variety of formats, including PSI-MI2.5 and tab-delimited files. BioGRID records can also be interrogated and analyzed with a series of new bioinformatics tools, which include a post-translational modification viewer, a graphical viewer, a REST service and a Cytoscape plugin.
Comparative analysis of predicted protein sequences encoded by the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae suggests that most of the core biological functions are carried out by orthologous proteins (proteins of different species that can be traced back to a common ancestor) that occur in comparable numbers. The specialized processes of signal transduction and regulatory control that are unique to the multicellular worm appear to use novel proteins, many of which re-use conserved domains. Major expansion of the number of some of these domains seen in the worm may have contributed to the advent of multicellularity. The proteins conserved in yeast and worm are likely to have orthologs throughout eukaryotes; in contrast, the proteins unique to the worm may well define metazoans.
The S. cerevisiae genome is the most well-characterized eukaryotic genome and one of the simplest in terms of identifying open reading frames (ORFs), yet its primary annotation has been updated continually in the decade since its initial release in 1996 (Goffeau et al., 1996). The Saccharomyces Genome Database (SGD; www.yeastgenome.org) (Hirschman et al., 2006), the community-designated repository for this reference genome, strives to ensure that the S. cerevisiae annotation is as accurate and useful as possible. At SGD, the S. cerevisiae genome sequence and annotation are treated as a working hypothesis, which must be repeatedly tested and refined. In this paper, in celebration of the tenth anniversary of the completion of the S. cerevisiae genome sequence, we discuss the ways in which the S. cerevisiae sequence and annotation have changed, consider the multiple sources of experimental and comparative data on which these changes are based, and describe our methods for evaluating, incorporating and documenting these new data.
S. cerevisiae; genome sequence; genome annotation; comparative genomics; exon/intron boundaries
The completion of the Saccharomyces cerevisiae genome sequencing project [11] and the continued development of improved technology for large-scale genome analysis have led to tremendous growth in the amount of new yeast genetics and molecular biology data. Efficient organization, presentation, and dissemination of this information are essential if researchers are to exploit this knowledge. In addition, the development of tools that provide efficient analysis of this information and link it with pertinent information from other systems is becoming increasingly important at a time when the complete genome sequences of other organisms are becoming available. The aim of this review is to familiarize biologists with the type of data resources currently available on the World Wide Web (WWW).
World Wide Web; Saccharomyces Genome Database; Munich Information Center for Protein Sequences; Yeast Protein Database
A scientific database can be a powerful tool for biologists in an era where large-scale genomic analysis, combined with smaller-scale scientific results, provides new insights into the roles of genes and their products in the cell. However, the collection and assimilation of data is, in itself, not enough to make a database useful. The data must be incorporated into the database and presented to the user in an intuitive and biologically significant manner. Most importantly, this presentation must be driven by the user’s point of view; that is, from a biological perspective. The success of a scientific database can therefore be measured by the response of its users – statistically, by usage numbers and, in a less quantifiable way, by its relationship with the community it serves and its ability to serve as a model for similar projects. Since its inception ten years ago, the Saccharomyces Genome Database (SGD) has seen a dramatic increase in its usage, has developed and maintained a positive working relationship with the yeast research community, and has served as a template for at least one other database. The success of SGD, as measured by these criteria, is due in large part to philosophies that have guided its mission and organisation since it was established in 1993. This paper aims to detail these philosophies and how they shape the organisation and presentation of the database.
S. cerevisiae; database; genome-wide analysis; bioinformatics; yeast
Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.
The Biological General Repository for Interaction Datasets (BioGRID) is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans (http://www.thebiogrid.org). BioGRID currently holds 347 966 interactions (170 162 genetic, 177 804 protein) curated from both high-throughput data sets and individual focused studies, as derived from over 23 000 publications in the primary literature. Complete coverage of the entire literature is maintained for budding yeast (Saccharomyces cerevisiae), fission yeast (Schizosaccharomyces pombe) and thale cress (Arabidopsis thaliana), and efforts to expand curation across multiple metazoan species are underway. The BioGRID houses 48 831 human protein interactions that have been curated from 10 247 publications. Current curation drives are focused on particular areas of biology to enable insights into conserved networks and pathways that are relevant to human health. The BioGRID 3.0 web interface contains new search and display features that enable rapid queries across multiple data types and sources. An automated Interaction Management System (IMS) is used to prioritize, coordinate and track curation across international sites and projects. BioGRID provides interaction data to several model organism databases, resources such as Entrez-Gene and other interaction meta-databases. The entire BioGRID 3.0 data collection may be downloaded in multiple file formats, including PSI MI XML. Source code for BioGRID 3.0 is freely available without any restrictions.
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is a scientific database for the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker’s or budding yeast. The information in SGD includes functional annotations, mapping and sequence information, protein domains and structure, expression data, mutant phenotypes, physical and genetic interactions and the primary literature from which these data are derived. Here we describe how published phenotypes and genetic interaction data are annotated and displayed in SGD.
The Biological General Repository for Interaction Datasets (BioGRID) database (http://www.thebiogrid.org) was developed to house and distribute collections of protein and genetic interactions from major model organism species. BioGRID currently contains over 198 000 interactions from six different species, as derived from both high-throughput studies and conventional focused studies. Through comprehensive curation efforts, BioGRID now includes a virtually complete set of interactions reported to date in the primary literature for both the budding yeast Saccharomyces cerevisiae and the fission yeast Schizosaccharomyces pombe. A number of new features have been added to the BioGRID including an improved user interface to display interactions based on different attributes, a mirror site and a dedicated interaction management system to coordinate curation across different locations. The BioGRID provides interaction data with monthly updates to Saccharomyces Genome Database, FlyBase and Entrez Gene. Source code for the BioGRID and the linked Osprey network visualization system is now freely available without restriction.
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current.
Many biological databases that provide comparative genomics information and tools are now available on the internet. While certainly quite useful, to our knowledge none of the existing databases combine results from multiple comparative genomics methods with manually curated information from the literature. Here we describe the Princeton Protein Orthology Database (P-POD, http://ortholog.princeton.edu), a user-friendly database system that allows users to find and visualize the phylogenetic relationships among predicted orthologs (based on the OrthoMCL method) to a query gene from any of eight eukaryotic organisms, and to see the orthologs in a wider evolutionary context (based on the Jaccard clustering method). In addition to the phylogenetic information, the database contains experimental results manually collected from the literature that can be compared to the computational analyses, as well as links to relevant human disease and gene information via the OMIM, model organism, and sequence databases. Our aim is for the P-POD resource to be extremely useful to typical experimental biologists wanting to learn more about the evolutionary context of their favorite genes. P-POD is based on the commonly used Generic Model Organism Database (GMOD) schema and can be downloaded in its entirety for installation on one's own system. Thus, bioinformaticians and software developers may also find P-POD useful because they can use the P-POD database infrastructure when developing their own comparative genomics resources and database tools.
Ten years have passed since the genome of Saccharomyces cerevisiae (more precisely, the S288c strain) was completely sequenced. However, experimental work in yeast is commonly performed using strains that are of unknown genetic relationship to S288c. Here, we characterized the nucleotide-level similarity between S288c and seven commonly used lab strains (A364A, W303, FL100, CEN.PK, Σ1278b, SK1 and BY4716) using 25mer oligonucleotide microarrays that provide complete and redundant coverage of the ∼12 Mb Saccharomyces cerevisiae genome. Using these data, we assessed the frequency and distribution of nucleotide variation in comparison to the sequenced reference genome. These data allow us to infer the relationships between experimentally important strains of yeast and provide insight for experimental designs that are sensitive to sequence variation. We propose a rational approach for near complete sequencing of strains related to the reference using these data and directed re-sequencing. These data and new visualization tools are accessible online in a new resource: the Yeast SNPs Browser (YSB; http://gbrowse.princeton.edu/cgi-bin/gbrowse/yeast_strains_snps) that is available to all researchers.
The recent explosion in protein data generated from both directed small-scale studies and large-scale proteomics efforts has greatly expanded the quantity of available protein information and has prompted the Saccharomyces Genome Database (SGD) to enhance the depth and accessibility of protein annotations. In particular, we have expanded ongoing efforts to improve the integration of experimental information and sequence-based predictions and have redesigned the protein information web pages. A key feature of this redesign is the development of a GBrowse-derived interactive Proteome Browser customized to improve the visualization of sequence-based protein information. This Proteome Browser has enabled SGD to unify the display of hidden Markov model (HMM) domains, protein family HMMs, motifs, transmembrane regions, signal peptides, hydropathy plots and profile hits using several popular prediction algorithms. In addition, a physico-chemical properties page has been introduced to provide easy access to basic protein information. Improvements to the layout of the Protein Information page and integration of the Proteome Browser will facilitate the ongoing expansion of sequence-specific experimental information captured in SGD, including post-translational modifications and other user-defined annotations. Finally, SGD continues to improve upon the availability of genetic and physical interaction data in an ongoing collaboration with BioGRID by providing direct access to more than 82 000 manually curated interactions.
The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference.
We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID and SGD databases.
Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks.
Sequencing and annotation of the entire Saccharomyces cerevisiae genome has made it possible to gain a genome-wide perspective on yeast genes and gene products. To make this information available on an ongoing basis, the Saccharomyces Genome Database (SGD) has created the Genome Snapshot. The Genome Snapshot summarizes the current state of knowledge about the genes and chromosomal features of S. cerevisiae. The information is organized into two categories: (i) number of each type of chromosomal feature annotated in the genome and (ii) number and distribution of genes annotated to Gene Ontology terms. Detailed lists are accessible through SGD's Advanced Search tool, and all the data presented on this page are available from the SGD ftp site.
BioPIXIE is a probabilistic system for query-based discovery of pathway-specific networks through integration of diverse genome-wide data.
We have developed a general probabilistic system for query-based discovery of pathway-specific networks through integration of diverse genome-wide data. This framework was validated by accurately recovering known networks for 31 biological processes in Saccharomyces cerevisiae and experimentally verifying predictions for the process of chromosomal segregation. Our system, bioPIXIE, a public, comprehensive system for integration, analysis, and visualization of biological network predictions for S. cerevisiae, is freely accessible over the World Wide Web.
We studied the physiological response to glucose limitation in batch and steady-state (chemostat) cultures of Saccharomyces cerevisiae by following global patterns of gene expression. Glucose-limited batch cultures of yeast go through two sequential exponential growth phases, beginning with a largely fermentative phase, followed by an essentially completely aerobic use of residual glucose and evolved ethanol. Judging from the patterns of gene expression, the state of the cells growing at steady state in glucose-limited chemostats corresponds most closely with the state of cells in batch cultures just before they undergo this “diauxic shift.” Essentially the same pattern was found between chemostats having a fivefold difference in steady-state growth rate (the lower rate approximating that of the second phase respiratory growth rate in batch cultures). Although in both cases the cells in the chemostat consumed most of the glucose, in neither case did they seem to be metabolizing it primarily through respiration. Although there was some indication of a modest oxidative stress response, the chemostat cultures did not exhibit the massive environmental stress response associated with starvation that also is observed, at least in part, during the diauxic shift in batch cultures. We conclude that despite the theoretical possibility of a switch to fully aerobic metabolism of glucose in the chemostat under conditions of glucose scarcity, homeostatic mechanisms are able to carry out metabolic adjustment as if fermentation of the glucose is the preferred option until the glucose is entirely depleted. These results suggest that some aspect of actual starvation, possibly a component of the stress response, may be required for triggering the metabolic remodeling associated with the diauxic shift.
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/), a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, has recently developed several new resources that allow the comparison and integration of information on a genome-wide scale, enabling the user not only to find detailed information about individual genes, but also to make connections across groups of genes with common features and across different species. The Fungal Alignment Viewer displays alignments of sequences from multiple fungal genomes, while the Sequence Similarity Query tool displays PSI-BLAST alignments of each S. cerevisiae protein with similar proteins from any species whose sequences are contained in the non-redundant (nr) protein data set at NCBI. The Yeast Biochemical Pathways tool integrates groups of genes by their common roles in metabolism and displays the metabolic pathways in a graphical form. Finally, the Find Chromosomal Features search interface provides a versatile tool for querying multiple types of information in SGD.
The Saccharomyces Genome Database (SGD: http://genome-www.stanford.edu/Saccharomyces/) has recently developed new resources to provide more complete information about proteins from the budding yeast Saccharomyces cerevisiae. The PDB Homologs page provides structural information from the Protein Data Bank (PDB) about yeast proteins and/or their homologs. SGD has also created a resource that utilizes the eMOTIF database for motif information about a given protein. A third new resource is the Protein Information page, which contains protein physical and chemical properties, such as molecular weight and hydropathicity scores, predicted from the translated ORF sequence.
The Saccharomyces Genome Database (SGD) resources, ranging from genetic and physical maps to genome-wide analysis tools, reflect the scientific progress in identifying genes and their functions over the last decade. As emphasis shifts from identification of the genes to identification of the role of their gene products in the cell, SGD seeks to provide its users with annotations that will allow relationships to be made between gene products, both within Saccharomyces cerevisiae and across species. To this end, SGD is annotating genes to the Gene Ontology (GO), a structured representation of biological knowledge that can be shared across species. The GO consists of three separate ontologies describing molecular function, biological process and cellular component. The goal is to use published information to associate each characterized S. cerevisiae gene product with one or more GO terms from each of the three ontologies. To be useful, this must be done in a manner that allows accurate associations based on experimental evidence, modifications to GO when necessary, and careful documentation of the annotations through evidence codes for given citations. Reaching this goal is an ongoing process at SGD. For information on the current progress of GO annotations at SGD and other participating databases, as well as a description of each of the three ontologies, please visit the GO Consortium page at http://www.geneontology.org. SGD gene associations to GO can be found by visiting our site at http://genome-www.stanford.edu/Saccharomyces/.
Since the completion of the Saccharomyces cerevisiae genomic sequence in 1996 [Goffeau, A. et al. (1997) Nature, 387, 5], several creative and ambitious projects have been initiated to explore the functions of gene products or gene expression on a genome-wide scale. To help researchers take advantage of these projects, the Saccharomyces Genome Database (SGD) has created two new tools, Function Junction and Expression Connection. Together, the tools form a central resource for querying multiple large-scale analysis projects for data about individual genes. Function Junction provides information from diverse projects that shed light on the role a gene product plays in the cell, while Expression Connection delivers information produced by the ever-increasing number of microarray projects. WWW access to SGD is available at genome-www.stanford.edu/Saccharomyces/.
The Saccharomyces Genome Database (SGD) stores and organizes information about the nearly 6200 genes in the yeast genome. The information is organized around the ‘locus page’ and directs users to the detailed information they seek. SGD is endeavoring to integrate the existing information about yeast genes with the large volume of data generated by functional analyses that are beginning to appear in the literature and on web sites. New features will include searches of systematic analyses and Gene Summary Paragraphs that succinctly review the literature for each gene. In addition to current information, such as gene product and phenotype descriptions, the new locus page will also describe a gene product’s cellular process, function and localization using a controlled vocabulary developed in collaboration with two other model organism databases. We describe these developments in SGD through the newly reorganized locus page. The SGD is accessible via the WWW at http://genome-www.stanford.edu/Saccharomyces/.