Since the advent of genetic mapping, the approximate genomic locations of the polymorphisms that cause thousands of human phenotypes have been reported. As compiled by the Online Mendelian Inheritance in Man (OMIM) database, more than 1,500 human genes have been found to be associated with over 3,000 disorders [1]. This impressive level of success is tempered by the fact that more than 1,000 disorders have been mapped to a genomic region, but the underlying 'disease gene' has not yet been identified for these disorders [1]. Although some fraction of these 1,000 loci are surely false positives, the statistical significance associated with them indicates that most are likely to contain true Mendelian disease genes that have yet to be pinpointed.
This set of mapped disease loci represents an exciting opportunity for rapid advancement in our understanding of human disease genetics. Any method that can generate high-confidence predictions for which genes within the mapped regions are responsible for the diseases in question would be an important step forward. Indeed, some such methods were recently proposed; for example, genomic screens for mitochondria-related genes identified several candidate disease genes for mitochondrial disorders [2
Another route to gaining insights into a particular disease is to study a model of the disease in a nonhuman organism. Such models, if they are faithful reproductions of a specific human disease, can be informative by revealing aspects of the function of both wild-type and mutant versions of the disease gene (or its ortholog in the model organism) and by providing a testing ground for potential therapies. The mouse has been a particularly useful model in this regard. In general, the more diverged a model organism is from human, the more difficult it is to create an accurate model of a human disease; more deeply divergent lineages are less likely to have human disease gene orthologs, and they are also less likely to have a phenotype similar enough to humans to allow detailed study of a particular disease phenotype.
It is unfortunate that most diseases cannot accurately be modeled in species such as the bacterium Escherichia coli or the budding yeast Saccharomyces cerevisiae, considering the ease of growing, storing, manipulating, and studying these organisms. Indeed, largely because of the simplicity of genetic manipulation in yeast, more functional genomic data have been generated for this species than for any other. For most genes/proteins, the mRNA expression level is known in thousands of conditions, as are the protein subcellular localization, the mRNA and protein decay rate, the mRNA translation rate, the protein abundance, the growth rates of systematic knockout strains across many conditions, a substantial fraction of the physical and genetic interactions, and much more.
Despite the vast amount of published functional genomic data, yeast and other unicellular organisms generally lack a morphologic phenotype rich enough to allow for detailed phenotypic descriptions based on a single growth condition. For example, even though different yeast strains may have distinct differences in their size, shape, and growth rate, in general very little information can be gleaned about a gene's knockout phenotype by observing growing cells in a single environment. However, if multiple environments are utilized in defining the phenotype, then even just one characteristic (such as growth rate) can be used to describe the phenotype with greater specificity, limited only by the diversity of environments tested. The description of a phenotype is simply a list of growth rates (or other measured characteristics) in all conditions tested, and two genes can be said to cause the same knockout phenotype if strains deleted for each gene exhibit similar growth rates across all tested environments. It is worth noting that this definition of phenotype is analogous to human disease, because any disease is simply a specific phenotype of lowered fitness in some set of environments.
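As an illustration of this definition, the following minimal sketch represents each knockout phenotype as a list of growth rates and calls two genes a phenotype pair when their profiles are highly correlated. The use of Pearson correlation, the 0.9 threshold, the gene names, and the growth-rate values are all assumptions made for the example and are not taken from the analysis described here. Note that correlation compares the shape of two profiles across conditions rather than their absolute growth rates.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length growth-rate profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


def phenotype_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold.

    The threshold is an arbitrary illustrative choice, not a published value.
    """
    genes = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            if pearson(profiles[g1], profiles[g2]) >= threshold:
                pairs.append((g1, g2))
    return pairs


# Hypothetical relative growth rates (1.0 = wild type) of three knockout
# strains measured in the same four conditions.
profiles = {
    "geneA": [1.0, 0.4, 0.9, 0.2],
    "geneB": [0.9, 0.5, 1.0, 0.3],
    "geneC": [0.3, 1.0, 0.2, 0.9],
}

print(phenotype_pairs(profiles))
```

In this toy data set only geneA and geneB are reported as a phenotype pair; adding more conditions to each profile would make such a call more specific, in line with the argument above.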
The concept of identifying genes whose mutation or deletion leads to similar phenotypes is by no means novel. Indeed, much of classical genetics is based on this idea. Since the development of nearly comprehensive gene knockout or RNA interference knockdown resources in yeast, Caenorhabditis elegans, and Drosophila melanogaster, many researchers have systematically measured various phenotypes and identified clusters of genes with similar phenotypic profiles across a set of conditions [4
Given such a phenotypic profile, we can ask what other types of data best predict 'phenotype pairs', that is, pairs of genes whose loss leads to similar phenotypic profiles. If we could identify an effective predictor of phenotype pairs, and if the predictor is sufficiently generic to apply to other species, then this predictor may be useful for elucidating human disease phenotypes as well. If this predictor has been measured for at least some human gene pairs, then it could be used to predict human disease genes simply by searching the genome for gene pairs that score highly and for which one of the two genes is a known disease gene. For example, if co-expressed genes in yeast were found to be the best predictor of phenotype pairs, then it stands to reason that co-expressed human genes may also lead to the same phenotype when mutated. If so, then identifying all genes that are co-expressed with a known disease gene would give a list of candidates for additional genes that cause the same disease. By combining the candidate list with mapped but unidentified disease genes, more confidence could be given to candidate genes that fall within the mapped susceptibility loci. In this manner, genes that are likely to be responsible for any type of disease could be identified, as long as the disease has at least one known causative gene. Others have previously used physical interactions between proteins or multiple gene-ranking algorithms to predict new disease genes from within mapped susceptibility loci [7]. However, because the functional genomic data for humans are currently rather sparse (compared with what is available for some model organisms), it remains to be seen whether some type of data not yet explored in humans could be even more predictive of human disease genes.
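The candidate-prioritization strategy outlined above can be sketched as follows. This is only an illustration under stated assumptions: the function `rank_candidates`, the choice to score each candidate by its best pairwise score with any known disease gene, and all gene names and scores are hypothetical and do not correspond to a published scoring scheme.

```python
def rank_candidates(known_genes, locus_genes, pair_score):
    """Rank the genes in a mapped locus by similarity to known disease genes.

    known_genes: set of genes already known to cause the disease.
    locus_genes: genes lying within a mapped susceptibility locus.
    pair_score:  dict mapping frozenset({gene1, gene2}) to a pairwise
                 predictor score (for example, a co-expression score).
    Taking the maximum over known genes is an assumption of this sketch.
    """
    ranked = []
    for candidate in locus_genes:
        if candidate in known_genes:
            continue  # already a known disease gene, not a new candidate
        scores = [
            pair_score.get(frozenset((candidate, known)), 0.0)
            for known in known_genes
        ]
        ranked.append((candidate, max(scores, default=0.0)))
    # Highest score first; ties broken alphabetically for reproducibility.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical pairwise predictor scores between human genes.
pair_score = {
    frozenset(("KNOWN1", "CAND_A")): 0.8,
    frozenset(("KNOWN1", "CAND_B")): 0.1,
    frozenset(("KNOWN2", "CAND_C")): 0.5,
}
known_genes = {"KNOWN1", "KNOWN2"}
locus_genes = ["CAND_A", "CAND_B", "CAND_D"]  # genes in one mapped locus

ranking = rank_candidates(known_genes, locus_genes, pair_score)
print(ranking)
```

Restricting the ranking to genes inside a mapped locus reflects the idea that a high score carries more weight when the candidate also falls within a region already linked to the disease; CAND_C scores well against a known gene but is not considered here because it lies outside the locus.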
To discover what predictor(s) might be the most effective in humans, we turned to yeast as a model. We reasoned that if quantitative phenotypes can be studied in yeast, then the vast amount of functional genomic data available could be used to predict the phenotypic effects of gene mutations or deletions. Here, we utilize a general framework for studying phenotypes to find what types of data are predictive of phenotypes in yeast; we then apply this framework to human disease phenotypes.