Cells are complex molecular machines that employ multiple levels of regulation to respond to genetic and environmental perturbations. Advances over the past several years in elucidating the complexity of this regulation have been astonishing. However, despite transformative advances in technology, it remains difficult to assess how far our current understanding of cell regulation is from a complete comprehension of the process. One of the primary difficulties in making such an assessment is that the research tools available to us seldom provide insight into aspects of the system that are not directly measured. Different technologies provide information that our analytical tools, both algorithmic and intellectual, seek to combine into a coherent picture. Yet most analytical tools in use today focus on single dimensions of data rather than integrating data across many dimensions simultaneously, which would give a more complete view of these processes and, in turn, a greater understanding of them.
The full suite of interacting parts in a cell over time, if it could be viewed collectively, would enable us to achieve a more complete understanding of cellular processes, much as we achieve understanding by watching a movie. The continuous flow of information in a movie enables our minds to exercise an array of priors that provide context and constrain the possible relationships (structures), while our internal network reconstruction engine pieces together all of the information on the highly complex and nonlinear relationships represented in the movie, so that in the end we understand what is depicted at a hierarchy of levels. If, instead of viewing a movie as a continuous stream of frames of coherent pixels and sound, we viewed single dimensions of the information independently, understanding would be difficult if not impossible to achieve. For example, consider viewing a movie as independent, one-dimensional slices through its frames, where each slice is seen as pixel intensities along that one dimension changing over time (like a dynamic mass spec trace). It would be very difficult to understand the meaning of the movie by looking at all of these one-dimensional traces independently.
Despite the complexity of biological systems, even at the cellular level, research on large-scale, high-dimensional -omics data has tended to focus on single data dimensions, whether constructing coexpression networks from gene expression data, carrying out genome-wide association analyses from DNA variation information, or constructing protein interaction networks from protein–protein interaction data. While we achieve some understanding in this way, progress is limited because no single dimension provides a complete enough context within which to interpret results fully. This limitation has become apparent in genome-wide association studies (GWAS), where many hundreds of highly replicated loci have been identified as associated with disease, yet our understanding of disease remains limited because the genetic loci do not necessarily inform on the gene affected, on how gene function is altered, or, more generally, on how the biological processes involving a given gene are altered [1]–[4]. It is apparent that if different biological data dimensions could be formally considered simultaneously, we would achieve a more complete understanding of biological systems [2], [3], [5]–[7]. (See the documentary film The New Biology at http://www.youtube.com/watch?v=sjTQD6E3lH4.)
Therefore, to form a more complete understanding of biological systems, we must not only evolve technologies to sample systems at ever higher rates and with ever greater breadth, but also develop methods that consider many different dimensions of information to produce more descriptive models (movies) of the system.
Methods are emerging that integrate pairs of data dimensions. For example, we recently developed methods that simultaneously integrate DNA variation and RNA expression data generated in a population context to identify coherent modules of interconnected gene expression traits driven by common genetic factors [2], [8]. In addition, many groups have begun incorporating a time dimension in the context of high-dimensional molecular-profiling data to elucidate how networks can transform over time [9], [10].
Here we develop and apply a network reconstruction approach that simultaneously integrates six different types of data: endogenous metabolite concentration, RNA expression, DNA variation, DNA–protein binding, protein–metabolite interaction, and protein–protein interaction data, to construct probabilistic causal networks that elucidate the complexity of cell regulation. The goals of our integrative analysis are not only to find causal regulators underlying expression quantitative trait loci (eQTL) hot spots, but also to uncover the mechanisms by which these predicted causal regulators affect genes and metabolites whose transcriptional or metabolite profiles are linked to the eQTL hot spots. We leveraged a previously described cross between laboratory (BY) and wild (RM) yeast strains (referred to here as the BXR cross), for which DNA variation and RNA expression had been assessed [11], [12], to carry out quantitative metabolite profiling using quantitative NMR (qNMR) under the same experimental conditions as the gene expression study [12]–[14]. We demonstrate that, like transcript and protein levels, concentrations of many metabolites are strongly linked to metabolite QTLs (metQTLs). Several of the metQTLs colocalize with eQTLs previously identified in the same yeast population [13], enabling us to infer causal relationships between metabolites and expression traits [13], [14]. Then, by extending a previously described Bayesian network (BN) reconstruction algorithm [13], we constructed a probabilistic causal network by integrating metabolite levels, genotype, gene expression, transcription factor (TF) binding, and protein–protein interaction data. The resulting network not only validates the functional importance of eQTL hot spots in the BXR cross, but also elucidates the mechanisms by which variation in DNA at eQTL hot spots affects gene expression. By systematically using the networks to elucidate the regulators of these eQTL hot spots, we not only recapitulate known regulatory mechanisms, but also provide a number of novel and experimentally supported causal relationships predicted by our network. These include the finding that cellular amino acid concentrations are related to both amino acid biosynthesis and amino acid degradation pathways, with VPS9 predicted and prospectively validated as a key driver of a previously identified eQTL hot spot that could not previously be well characterized. We further demonstrated experimentally that PHM7, a previously predicted and validated causal regulator of stress response genes whose expression variations are linked to the PHM7 locus on Chromosome XV, affected trehalose, a yeast metabolite product of the stress response pathway. Combined, these results not only help uncover the mechanisms by which gene expression profiles are regulated by metabolite profiles, but also confirm the importance of gene expression in understanding system-wide variation linked to genetic perturbations.
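To make the notion of a metQTL colocalizing with an eQTL concrete, the following is a minimal sketch on simulated data, not the analysis pipeline used in this study. It assumes a haploid two-parent cross with alleles coded 0/1, a simple single-marker linkage scan, and an illustrative LOD threshold and peak window. The function names (`simulate_cross`, `lod_scan`, `colocalized`), the effect sizes, and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np

def simulate_cross(n_segregants, n_markers, switch_prob, rng):
    """Simulate 0/1 parental alleles along one chromosome (hypothetical helper).

    Adjacent markers are linked: the allele switches between neighbours
    with probability `switch_prob`, mimicking a crossover.
    """
    start = rng.integers(0, 2, size=(n_segregants, 1))
    switches = rng.random((n_segregants, n_markers - 1)) < switch_prob
    flips = np.cumsum(switches, axis=1) % 2
    return np.concatenate([start, (start + flips) % 2], axis=1)

def lod_scan(genotypes, trait):
    """Single-marker LOD scores for one quantitative trait.

    For a biallelic marker, regressing the trait on genotype gives
    LOD = -(n / 2) * log10(1 - r^2), with r the Pearson correlation.
    """
    g = genotypes - genotypes.mean(axis=0)
    t = trait - trait.mean()
    r = (g * t[:, None]).sum(axis=0) / np.sqrt((g ** 2).sum(axis=0) * (t ** 2).sum())
    return -(len(trait) / 2.0) * np.log10(1.0 - r ** 2)

def colocalized(lod_a, lod_b, threshold=3.0, window=5):
    """True if both traits have a peak above `threshold` within `window` markers.

    The threshold and window are illustrative assumptions; in practice a
    significance threshold would be set empirically (e.g., by permutation).
    """
    peak_a, peak_b = int(np.argmax(lod_a)), int(np.argmax(lod_b))
    significant = lod_a[peak_a] >= threshold and lod_b[peak_b] >= threshold
    return bool(significant and abs(peak_a - peak_b) <= window)

# Simulated example: one causal marker drives both a metabolite and a transcript.
rng = np.random.default_rng(0)
genotypes = simulate_cross(n_segregants=120, n_markers=200, switch_prob=0.05, rng=rng)
causal = 80  # hypothetical causal marker index
metabolite = 2.0 * genotypes[:, causal] + rng.normal(0.0, 0.5, size=120)
transcript = 2.0 * genotypes[:, causal] + rng.normal(0.0, 0.5, size=120)

met_lod = lod_scan(genotypes, metabolite)
expr_lod = lod_scan(genotypes, transcript)
print("metQTL peak marker:", int(np.argmax(met_lod)))
print("eQTL peak marker:", int(np.argmax(expr_lod)))
print("colocalized:", colocalized(met_lod, expr_lod))
```

Colocalization of this kind only nominates candidate relationships. Distinguishing a causal link between a metabolite and an expression trait from independent effects of the same locus requires a formal causality test, such as those described in [13], [14].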