We describe a multistep process to extract causal information from gene-expression data related to complex phenotypes such as obesity and gene expression. Central to this process is a likelihood-based test for causality that takes into account genotypic, RNA and clinical data in a segregating population to identify genes in the trait-specific transcriptional network that are under the control of multiple QTLs for the trait of interest but still upstream of the trait. Whereas previous methods allow for tests of pleiotropy versus close linkage to determine whether multiple traits are under the control of common QTLs (ref. 20), the LCMS procedure described here makes it possible to unravel the nature of such associations.
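The idea of a likelihood-based comparison among candidate relationships can be illustrated with a minimal sketch. It assumes simple Gaussian linear regressions among a single locus genotype L, an RNA trait R and a clinical trait C, and it uses AIC to rank the models; the function names, the simulated effect sizes and the "independent" factorization are illustrative assumptions and are not necessarily the likelihoods used in the LCMS procedure itself.

```python
import numpy as np

def gaussian_loglik(y, x):
    """Maximized log-likelihood of the regression y ~ Normal(a + b*x, sigma^2)."""
    n = len(y)
    design = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / n  # maximum-likelihood estimate of residual variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def compare_models(L, R, C):
    """Return the AIC of three assumed relationships among L, R and C.

    The term for P(L) is shared by all models and is therefore omitted.
    """
    factors = {
        "causal":      [(R, L), (C, R)],  # L -> R -> C
        "reactive":    [(C, L), (R, C)],  # L -> C -> R
        "independent": [(R, L), (C, L)],  # L -> R and L -> C separately
    }
    n_params = 6  # two regressions, each with intercept, slope and variance
    return {
        name: 2 * n_params - 2 * sum(gaussian_loglik(y, x) for y, x in terms)
        for name, terms in factors.items()
    }

def simulate_causal(n=2000, seed=0):
    """Simulate hypothetical data in which R is truly upstream of C."""
    rng = np.random.default_rng(seed)
    L = rng.binomial(2, 0.5, size=n).astype(float)  # F2-like genotype coded 0/1/2
    R = L + rng.normal(size=n)                      # expression driven by the locus
    C = R + rng.normal(size=n)                      # trait driven by expression
    return L, R, C

aic = compare_models(*simulate_causal())
best = min(aic, key=aic.get)  # model with the smallest AIC
print(best, aic)
```

Because all three models in this sketch have the same number of parameters, ranking by AIC is equivalent to ranking by maximized likelihood; AIC is kept only so that the comparison would extend to models of unequal size.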
We applied the LCMS procedure to a segregating mouse population phenotyped for OFPM and identified known (Hsd11b1) and new susceptibility genes (Tgfbr2) for fat mass in this population, in addition to significantly predicting the transcriptional response to perturbation of Hsd11b1. The three new susceptibility genes that we identified have not previously been directly associated with obesity-related traits. In addition to these three genes, a SNP in lipoprotein lipase (ranked number 9 in ) was recently reported to be associated with obesity and other components of the metabolic syndrome in a human population (ref. 40).
Our results indicate that integrating genotypic and expression data may help the search for new targets for common human diseases. But certain issues surrounding this process will require more careful consideration. One such issue is the dependency of the LCMS procedure on measurement and modeling errors. Suppose RNA trait R is causal for trait C, but the measurement errors related to the expression of R far exceed those for C. This might lead to a failure to detect R as causal for C or, worse, to the incorrect identification of C as causal for R. A second issue is that the LCMS procedure will fail to discriminate between traits that are very highly correlated (Supplementary Fig. 3 online). Thus, for cases in which a causal gene is almost completely correlated with a complex trait of interest or tightly regulates the expression of other genes unrelated to the complex trait, the power to resolve the true relationships will be reduced. Furthermore, our procedure introduces a very simplistic view of the gene networks associated with disease, focused on identifying genes in the causal-reactive interval. The true situation is more complicated, however, because the causal-reactive genes interact in a larger network and may be subject to negative and positive feedback control. Finally, the high-dimensional nature of this problem, involving potentially tens of thousands of molecular profiling traits, combined with the complexities of genetic model selection procedures, has only recently begun to be explored in this context. Many statistical issues remain to be addressed (refs. 41–43), and many of the steps in our overall process that are herein only heuristically justified will require more careful statistical consideration before the approach can be automatically applied to general data sets.
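The measurement-error concern raised above can be made concrete with a small simulation. The sketch below is hypothetical: it assumes Gaussian linear relationships with unit effect sizes, a single locus L, and a score equal to the sum of log residual variances (lower corresponds to higher Gaussian likelihood when the models have equal numbers of parameters). The function names and noise levels are illustrative choices, not part of the LCMS procedure.

```python
import numpy as np

def resid_var(y, x):
    """Residual variance of the least-squares regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return np.mean((y - (intercept + slope * x)) ** 2)

def model_scores(L, R, C):
    """Sum of log residual variances for each direction; lower is preferred."""
    return {
        "R causal for C": np.log(resid_var(R, L)) + np.log(resid_var(C, R)),
        "C causal for R": np.log(resid_var(C, L)) + np.log(resid_var(R, C)),
    }

def preferred_model(noise_sd, n=20000, seed=0):
    """Simulate L -> R -> C, add measurement noise to R, return the preferred direction."""
    rng = np.random.default_rng(seed)
    L = rng.binomial(2, 0.5, size=n).astype(float)       # genotype coded 0/1/2
    R_true = L + rng.normal(size=n)                      # true expression level
    C = R_true + rng.normal(size=n)                      # trait, measured without error
    R_obs = R_true + rng.normal(scale=noise_sd, size=n)  # noisy expression measurement
    scores = model_scores(L, R_obs, C)
    return min(scores, key=scores.get)

# Compare the preferred direction across assumed measurement-noise levels for R.
for sd in (0.0, 0.5, 2.0):
    print(sd, preferred_model(sd))
```

Under these assumed effect sizes, the true direction is expected to be preferred when the noise on R is small, whereas a noise standard deviation of 2 is expected to tip the comparison toward the reversed direction, even though R is upstream of C by construction.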
Despite these and other issues, the ability to partition genes into causal and reactive sets, and to identify those targets in the causal set that are optimally placed in the trait-associated gene network for therapeutic intervention, offers a promising approach to understanding the network of gene changes associated with complex traits such as common human diseases and, in the process, to identifying new ways to combat these diseases.