The molecular biology, biochemistry and genetics of the budding yeast Saccharomyces cerevisiae have been intensively studied for decades; it remains the best-understood eukaryote at the molecular genetic level. Completion of the S. cerevisiae genome sequence nearly a decade ago spawned a host of functional genomic tools for interrogation of gene and protein function, including DNA microarrays for global gene-expression profiling and location of DNA-binding factors, and a comprehensive set of gene deletion strains for phenotypic analysis [1]. In the post-genome sequence era, high-throughput (HTP) screening techniques aimed at identifying novel protein complexes and gene networks have begun to complement conventional biochemical and genetic approaches [3]. Systematic elucidation of protein interactions in S. cerevisiae has been carried out by the two-hybrid method, which detects pair-wise interactions [5], and by mass spectrometric (MS) analysis of purified protein complexes [8]. In parallel, the synthetic genetic array (SGA) and synthetic lethal analysis by microarray (dSLAM) methods have been used to systematically uncover synthetic lethal genetic interactions, in which non-lethal gene mutations combine to cause inviability [10]. In addition to HTP analyses of yeast protein-interaction networks, initial yeast two-hybrid maps have been generated for the nematode worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster and, most recently, for humans [14]. The various datasets generated by these techniques have begun to unveil the global network that underlies cellular complexity.
The networks implicit in HTP datasets from yeast, and to a limited extent from other organisms, have been analyzed using graph theory. A primary attribute of biological interaction networks is a scale-free distribution of connections, as described by an apparent power-law formulation [18]. Most nodes (that is, genes or proteins) in biological networks are sparsely connected, whereas a few nodes, called hubs, are highly connected. This class of network is robust to the random disruption of individual nodes, but sensitive to an attack on specific highly connected hubs [19]. Whether this property has actually been selected for in biological networks or is a simple consequence of multilayered regulatory control is open to debate [20]. Biological networks also appear to exhibit small-world organization, namely locally dense regions that are sparsely connected to other regions but with a short average path length [21]. Recurrent patterns of regulatory interactions, termed motifs, have also recently been discerned [24]. In conjunction with global profiles of gene expression, HTP datasets have been used in a variety of schemes to predict biological function for characterized and uncharacterized proteins [3]. These initial network approaches to system-level understanding hold considerable promise.
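The graph-theoretic quantities mentioned above (degree distribution, hubs, local clustering and average path length) can be illustrated with a minimal sketch. The node labels, the toy edge list and the hub threshold are hypothetical and chosen only for illustration; the function names are likewise assumptions, not part of any published analysis pipeline.

```python
from collections import Counter, deque
from itertools import combinations


def build_graph(edges):
    """Return an undirected adjacency map {node: set(neighbours)}; self-loops are skipped."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree_distribution(adj):
    """Map each degree k to the number of nodes that have exactly k neighbours."""
    return Counter(len(nbrs) for nbrs in adj.values())


def hubs(adj, min_degree):
    """Nodes with at least min_degree neighbours (the threshold is an arbitrary choice)."""
    return sorted(n for n, nbrs in adj.items() if len(nbrs) >= min_degree)


def clustering_coefficient(adj, node):
    """Fraction of pairs of a node's neighbours that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs (BFS from each node)."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy network: one highly connected node and several sparse ones.
toy_edges = [("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("HUB", "D"), ("A", "B"), ("D", "E")]
toy = build_graph(toy_edges)
print(dict(degree_distribution(toy)))
print(hubs(toy, min_degree=4))
print(clustering_coefficient(toy, "A"), average_path_length(toy))
```

On real interaction data, the degree counts would typically be plotted on log-log axes to judge whether a power law is a plausible description; a toy graph of this size is far too small to show that behaviour.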
Despite these successes, all network analyses undertaken so far have relied exclusively on HTP datasets that are burdened with false-positive and false-negative interactions [33]. The inherent noise in these datasets has compromised attempts to build a comprehensive view of cellular architecture. For example, yeast two-hybrid datasets in general exhibit poor concordance [35]. The unreliability of such datasets, together with the still sparse coverage of known biological interaction space, clearly limits studies of biological networks and may well bias conclusions obtained to date.
A vast resource of previously discovered physical and genetic interactions is recorded in the primary literature for many species, including yeast. In general, interactions reported in the literature are reliable: many have been verified by multiple experimental methods and/or more than one research group; most are based on methods of known sensitivity and reproducibility in well controlled experiments; most are reported in the context of supporting cell biological information; and all have been subjected to the scrutiny of peer review. Although publications on individual genes are readily accessed through public databases such as PubMed, the embedded interaction data have not been systematically compiled in a searchable relational database. The Yeast Proteome Database (YPD) represented the first systematic effort to compile protein-interaction and other data from the literature [36]; but although originally free of charge to academic users, YPD is now available only on a subscription basis. A number of important databases that curate protein and genetic interactions from the literature have been developed, including the Munich Information Center for Protein Sequences (MIPS) database [37], the Molecular Interactions (MINT) database [38], the IntAct database [39], the Database of Interacting Proteins (DIP) [40], the Biomolecular Interaction Network Database (BIND) [41], the Human Protein Reference Database (HPRD) [42], and the BioGRID database [43]. At present, however, interactions recorded in these databases represent only partial coverage of the primary literature. The efforts of these databases will be facilitated by a recently established consortium of interaction databases, termed the International Molecular Exchange Consortium (IMEx) [45], which aims both to implement a structured vocabulary to describe interaction data (the Proteomics Standards Initiative-Molecular Interaction, PSI-MI [46]) and to openly disseminate interaction records. A systematic international effort to codify gene function by the Gene Ontology (GO) Consortium also records protein and genetic interactions as functional evidence codes [47], which can therefore be used to infer interaction networks [48].
Although many interactions are clearly documented in the literature, these data are not yet in a form that can be readily applied to network or system-level analysis. Manual curation of the literature specifically for gene and protein interactions poses a number of problems, including curation consistency, the myriad possible levels of annotation detail, and the sheer volume of text that must be distilled. Moreover, because structured vocabularies have not been implemented in biological publications, automated machine-learning methods are unable to reliably extract most interaction information from full-text sources [49]. Budding yeast represents an ideal test case for systematic literature curation, both because the genome is annotated to an unparalleled degree of accuracy and because a large fraction of genes are characterized [50]. Approximately 4,200 budding yeast open reading frames (ORFs) have been functionally interrogated by one means or another [51]. At the same time, because some 1,500 are currently classified by the GO term 'biological process unknown', a substantial number of gene functions remain to be assigned or inferred.
Here we report a literature-curated (LC) dataset of 33,311 protein and genetic interactions, representing 19,499 non-redundant interactions, from a total of 6,148 publications in the primary literature. The low overlap between the LC dataset and existing HTP datasets suggests that known physical and genetic interaction space may be far from saturated. Analysis of the network properties of the LC dataset supports some conclusions based on HTP data but refutes others. The systematic LC dataset improves prediction of gene function and provides a resource for future endeavors in network biology.
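As a rough illustration of how curated records might be collapsed to non-redundant interactions and compared with an HTP dataset, consider the sketch below. The record layout (two gene identifiers, an interaction type and a source label), the rule that an interaction is an unordered gene pair of a given type, and all identifiers are assumptions made for illustration; they are not taken from the curation procedure itself.

```python
def non_redundant(records):
    """Collapse (gene_a, gene_b, kind, source) records to unique interactions.

    Assumption: an interaction is an unordered gene pair plus its kind
    ('physical' or 'genetic'), so A-B and B-A from different sources count once.
    """
    return {(frozenset((a, b)), kind) for a, b, kind, _source in records}


def overlap_fraction(reference, other):
    """Fraction of interactions in `reference` that also occur in `other`."""
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Hypothetical records: gene names and source labels are placeholders.
lc_records = [
    ("G1", "G2", "physical", "paper_1"),
    ("G2", "G1", "physical", "paper_2"),  # same pair reported again, reversed
    ("G1", "G3", "genetic", "paper_3"),
    ("G3", "G4", "physical", "paper_3"),
]
htp_records = [
    ("G2", "G1", "physical", "screen_1"),
    ("G5", "G6", "physical", "screen_1"),
]

lc = non_redundant(lc_records)
htp = non_redundant(htp_records)
print(len(lc_records), "records ->", len(lc), "non-redundant interactions")
print("fraction of LC interactions also found by HTP:", overlap_fraction(lc, htp))
```

Using a `frozenset` for the gene pair makes the comparison independent of the order in which the two partners were recorded; a directed representation would be needed if bait and prey roles were to be preserved.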