Adopting a “candidate region approach”, we performed a comprehensive analysis of genetic effects on gene expression variation in human lymphoblastoid cell lines and present evidence for cis regulatory effects for 1348 genes, together with their biological properties. The limited power of our analysis means that we detect only a subset of the functional regulatory effects that exist in these populations. In addition, as we have interrogated only a single cell type, variation manifested only in other cell types is not represented here. These two facts argue for an abundance of cis regulatory variants segregating in human populations, some of which may be responsible for higher-order phenotypic variation and susceptibility to disease.
Our analysis goes beyond the mere detection of cis regulatory effects. We have performed, to our knowledge, the most comprehensive analysis to date of the properties of the cis association signals, and we have systematically described characteristics of the expression data. Together, these analyses provide us with confidence in the detected signals. In addition, we have demonstrated that the detected association signals replicate very well across populations, even though the populations are quite divergent and the sample sizes are small. We also detect the effect of population differentiation on gene expression. In this respect, we confirm what has been previously documented in smaller-scale studies [16,28]; among-population allele frequency differences exist and provide a framework for the study of phenotypic differences among populations.
We have provided new methodological insights into the analysis of gene expression variation. By employing pooling of divergent populations and conditional permutation schemes, we increased the sensitivity of our analysis, detecting smaller regulatory effects shared across populations. One can imagine a more sophisticated conditional permutation scheme that would permit pooling of any set of populations for which the population identities or relatedness metrics are known. We have also employed a non-parametric test, namely Spearman rank correlation (SRC), and demonstrated that it has enough power to be used in such studies. In addition, SRC has some advantages over linear regression: whereas outliers can have a large impact on the p-values of a linear regression, SRC is not sensitive to them, and therefore the nominal p-values can be used directly in methods that estimate FDR (an example is given in Figure S6).
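The idea of pooling populations under a conditional permutation scheme can be illustrated with a minimal sketch. It is not the pipeline used in this study. It assumes a single SNP-gene pair, genotypes coded as 0/1/2 allele counts, and a permutation scheme in which expression values are shuffled only among individuals of the same population, so that population-level differences in expression are preserved under the null. All function names and parameters here (`ranks`, `spearman`, `permute_within_populations`, `conditional_permutation_pvalue`, `n_perm`, `seed`) are hypothetical and chosen for illustration.

```python
import random
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permute_within_populations(expr, pops, rng):
    """Shuffle expression values only among individuals of the same population."""
    permuted = list(expr)
    by_pop = {}
    for idx, pop in enumerate(pops):
        by_pop.setdefault(pop, []).append(idx)
    for idxs in by_pop.values():
        shuffled = [expr[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, value in zip(idxs, shuffled):
            permuted[i] = value
    return permuted


def conditional_permutation_pvalue(genotypes, expr, pops, n_perm=1000, seed=0):
    """Empirical two-sided p-value for one SNP-gene pair in a pooled sample.

    Assumption: the null distribution is built by permuting expression
    within each population, which keeps population structure intact.
    """
    rng = random.Random(seed)
    observed = abs(spearman(genotypes, expr))
    hits = 0
    for _ in range(n_perm):
        perm = permute_within_populations(expr, pops, rng)
        if abs(spearman(genotypes, perm)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Calling `conditional_permutation_pvalue(genotypes, expr, pops)` with three equal-length lists returns an empirical p-value for the pooled sample. Because the statistic depends only on ranks, a single extreme expression value changes it far less than it would change a regression slope, which is the robustness property discussed above.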
The evolutionary and annotation properties of cis regulatory associations are very relevant since the density of the phase II HapMap allows for a fine-scale analysis of the association signal. The vast majority of detected cis regulatory effects map very close to the TSS and are enriched in regions of high sequence conservation. This information provides a useful framework to search for cis regulatory variants in the human genome and suggests that most of the large-effect variants are in the genic and immediate intergenic regions. The association data will become available at the Ensembl web site in October 2007 as Distributed Annotation System (DAS) tracks to enable browsing and downloading.
Finally, we have attempted to analyze effects in trans by adopting a “candidate variants approach”, assigning prior relevance to those SNPs already known to be associated with cis regulation, protein sequence variation (amino acid or splicing variation), or miRNA structure; this approach made correction by permutation feasible. There were fewer genes exhibiting significant trans effects than exhibiting cis effects. This is because trans effects are often more indirect and therefore weaker, so our sample size does not provide us with enough power given the much larger number of tests for which we have to correct. In general, the detection of trans effects in humans has been less successful than in yeast [38,39]. This may be because the yeast cell comprises the entire organism, so the study of biological interactions in a yeast cell has the potential to detect all of the interactions, whereas the human cell is just a small part of the organism, so many of the intercellular effects mediating trans effects cannot be discovered. Finally, we have provided evidence that, among a set of potential variants that could have effects in trans, we observe a large enrichment in the contribution of cis regulatory variants, which may suggest that cis regulatory variation explains much of the complex phenotypic variation in humans, at least at the molecular level.
We have described the most comprehensive analysis to date of gene expression variation in human populations and have provided a detailed characterization of the genetic as well as the positional effects in the genome. This detailed analysis provides a robust and useful framework for the future analysis of gene expression variation in cohorts with larger sample sizes but lower SNP densities and potentially multiple cell types. It will also greatly facilitate the interpretation and follow-up of disease association studies by allowing the dissection of biological effects in regions that carry strong statistical signals of association. This and future studies will lead to a detailed map of functional variation in the human genome that will complement functional and variation studies towards the complete understanding of phenotypic variation in human populations.