Related Articles

1.  Fine-Scale Variation and Genetic Determinants of Alternative Splicing across Individuals 
PLoS Genetics  2009;5(12):e1000766.
Recently, thanks to the increasing throughput of new technologies, we have begun to explore the full extent of alternative pre–mRNA splicing (AS) in the human transcriptome. This is unveiling a vast layer of complexity in isoform-level expression differences between individuals. We used previously published splicing-sensitive microarray data from lymphoblastoid cell lines to conduct an in-depth analysis on splicing efficiency of known and predicted exons. By combining publicly available AS annotation with a novel algorithm designed to search for AS, we show that many real AS events can be detected within the usually unexploited, speculative majority of the array and at significance levels much below standard multiple-testing thresholds, demonstrating that the extent of cis-regulated differential splicing between individuals is potentially far greater than previously reported. Specifically, many genes show subtle but significant genetically controlled differences in splice-site usage. PCR validation shows that 42 out of 58 (72%) candidate gene regions undergo detectable AS, amounting to the largest scale validation of isoform eQTLs to date. Targeted sequencing revealed a likely causative SNP in most validated cases. In all 17 instances where a SNP affected a splice-site region, in silico splice-site strength modeling correctly predicted the direction of the microarray and PCR results. In 13 other cases, we identified likely causative SNPs disrupting predicted splicing enhancers. Using Fst and REHH analysis, we uncovered significant evidence that 2 putative causative SNPs have undergone recent positive selection. We verified the effect of five SNPs using in vivo minigene assays. This study shows that splicing differences between individuals, including quantitative differences in isoform ratios, are frequent in human populations and that causative SNPs can be identified using in silico predictions. Several cases affected disease-relevant genes and it is likely some of these differences are involved in phenotypic diversity and susceptibility to complex diseases.
Author Summary
Alternative splicing (AS), through the alternative use of exons, can produce many different mRNA transcripts from the same genomic locus, thus possibly resulting in the production of many different proteins. We know that splicing differences between individuals exist and that these changes are often associated with genetic variants. Thus far, very few of these associations have led to the precise localization of the causative polymorphisms. In this work, using in-depth analysis of previously published splicing-sensitive microarray data from human cell lines, we identified and validated a large number of splicing changes which are highly correlated with nearby genetic variations. We then sequenced the genomic DNA around candidate exons and used in silico modeling tools to identify causative SNPs for most of our candidates. Using a plasmid reporter construct, we further demonstrated that five selected SNPs reproduce the expected effect in vivo. Our results indicate that genetically controlled splicing differences between individuals may be more common than previously suggested and can be very subtle, and that most are caused by SNPs affecting either the splice-site region or exonic splicing enhancer (ESE) sequences.
PMCID: PMC2780703  PMID: 20011102
2.  Genome-Wide Associations of Gene Expression Variation in Humans 
PLoS Genetics  2005;1(6):e78.
The exploration of quantitative variation in human populations has become one of the major priorities for medical genetics. The successful identification of variants that contribute to complex traits is highly dependent on reliable assays and genetic maps. We have performed a genome-wide quantitative trait analysis of 630 genes in 60 unrelated Utah residents with ancestry from Northern and Western Europe using the publicly available phase I data of the International HapMap project. The genes are located in regions of the human genome with elevated functional annotation and disease interest including the ENCODE regions spanning 1% of the genome, Chromosome 21 and Chromosome 20q12–13.2. We apply three different methods of multiple test correction, including Bonferroni, false discovery rate, and permutations. For the 374 expressed genes, we find many regions with statistically significant association of single nucleotide polymorphisms (SNPs) with expression variation in lymphoblastoid cell lines after correcting for multiple tests. Based on our analyses, the signal proximal (cis-) to the genes of interest is more abundant and more stable than distal and trans across statistical methodologies. Our results suggest that regulatory polymorphism is widespread in the human genome and show that the 5-kb (phase I) HapMap has sufficient density to enable linkage disequilibrium mapping in humans. Such studies will significantly enhance our ability to annotate the non-coding part of the genome and interpret functional variation. In addition, we demonstrate that the HapMap cell lines themselves may serve as a useful resource for quantitative measurements at the cellular level.
With the finished reference sequence of the human genome now available, focus has shifted towards trying to identify all of the functional elements within the sequence. Although quite a lot of progress has been made towards identifying some classes of genomic elements, in particular protein-coding sequences, the characterization of regulatory elements remains a challenge. The authors describe the genetic mapping of regions of the genome that have functional effects on quantitative levels of gene expression. Gene expression of 630 genes was measured in cell lines derived from 60 unrelated human individuals, the same Utah residents of Northern and Western European ancestry that have been genetically well-characterized by The International HapMap Project. This paper reports significant variation among individuals with respect to levels of gene expression, and demonstrates that this quantitative trait has a genetic basis. For some genes, the genetic signal was localized to specific locations in the human genome sequence; in most cases the genomic region associated with expression variation was physically close to the gene whose expression it regulated. The authors demonstrate the feasibility of performing whole-genome association scans to map quantitative traits, and highlight statistical issues that are increasingly important for whole-genome disease mapping studies.
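The three multiple-test corrections named in this abstract (Bonferroni, false discovery rate, and permutations) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the Benjamini-Hochberg variant of FDR control, and the covariance-based test statistic are all assumptions chosen for illustration.

```python
import random


def bonferroni(pvalues, alpha=0.05):
    """Mark tests whose p-value passes the Bonferroni threshold alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Mark discoveries under the Benjamini-Hochberg FDR procedure (assumed variant)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


def permutation_pvalue(genotypes, expression, n_perm=1000, seed=0):
    """Empirical p-value for a SNP-expression association.

    The statistic is the absolute (unnormalised) covariance between genotype
    dosage and expression; this choice is an illustrative assumption.
    """
    rng = random.Random(seed)

    def abs_cov(x, y):
        mx = sum(x) / len(x)
        my = sum(y) / len(y)
        return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)))

    observed = abs_cov(genotypes, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-expression link
        if abs_cov(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

Bonferroni is the most conservative of the three, Benjamini-Hochberg tolerates a controlled fraction of false discoveries, and permutation keeps the correlation structure of the data at the cost of more computation.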
PMCID: PMC1315281  PMID: 16362079
3.  Convergence of Mutation and Epigenetic Alterations Identifies Common Genes in Cancer That Predict for Poor Prognosis  
PLoS Medicine  2008;5(5):e114.
The identification and characterization of tumor suppressor genes has enhanced our understanding of the biology of cancer and enabled the development of new diagnostic and therapeutic modalities. Whereas in past decades, a handful of tumor suppressors have been slowly identified using techniques such as linkage analysis, large-scale sequencing of the cancer genome has enabled the rapid identification of a large number of genes that are mutated in cancer. However, determining which of these many genes play key roles in cancer development has proven challenging. Specifically, recent sequencing of human breast and colon cancers has revealed a large number of somatic gene mutations, but virtually all are heterozygous, occur at low frequency, and are tumor-type specific. We hypothesize that key tumor suppressor genes in cancer may be subject to mutation or hypermethylation.
Methods and Findings
Here, we show that combined genetic and epigenetic analysis of these genes reveals many with a higher putative tumor suppressor status than would otherwise be appreciated. At least 36 of the 189 genes newly recognized to be mutated are targets of promoter CpG island hypermethylation, often in both colon and breast cancer cell lines. Analyses of primary tumors show that 18 of these genes are hypermethylated strictly in primary cancers and often with an incidence that is much higher than for the mutations and which is not restricted to a single tumor-type. In the identical breast cancer cell lines in which the mutations were identified, hypermethylation is usually, but not always, mutually exclusive from genetic changes for a given tumor, and there is a high incidence of concomitant loss of expression. Sixteen out of 18 (89%) of these genes map to loci deleted in human cancers. Lastly, and most importantly, the reduced expression of a subset of these genes strongly correlates with poor clinical outcome.
Using an unbiased genome-wide approach, our analysis has enabled the discovery of a number of clinically significant genes targeted by multiple modes of inactivation in breast and colon cancer. Importantly, we demonstrate that a subset of these genes predict strongly for poor clinical outcome. Our data define a set of genes that are targeted by both genetic and epigenetic events, predict for clinical prognosis, and are likely fundamentally important for cancer initiation or progression.
Stephen Baylin and colleagues show that a combined genetic and epigenetic analysis of breast and colon cancers identifies a number of clinically significant genes targeted by multiple modes of inactivation.
Editors' Summary
Cancer is one of the developed world's biggest killers—over half a million Americans die of cancer each year, for instance. As a result, there is great interest in understanding the genetic and environmental causes of cancer in order to improve cancer prevention, diagnosis, and treatment.
Cancer begins when cells begin to multiply out of control. DNA is the sequence of coded instructions—genes—for how to build and maintain the body. Certain “tumor suppressor” genes, for instance, help to prevent cancer by preventing tumors from developing, but changes that alter the DNA code sequence—mutations—can profoundly affect how a gene works. Modern techniques of genetic analysis have identified genes such as tumor suppressors that, when mutated, are linked to the development of certain cancers.
Why Was This Study Done?
However, in recent years, it has become increasingly apparent that mutations are neither necessary nor sufficient to explain every case of cancer. This has led researchers to look at so-called epigenetic factors, which also alter how a gene works without altering its DNA sequence. An example of this is “methylation,” which prevents a gene from being expressed—deactivates it—by a chemical tag. Methylation of genes is part of the normal functioning of DNA, but abnormal methylation has been linked with cancer, aging, and some rare birth abnormalities.
Previous analysis of DNA from breast and colon cancer cells had revealed 189 “candidate cancer genes”—mutated genes that were linked to the development of breast and colon cancer. However, it was not clear how those mutations gave rise to cancer, and individual mutations were present in only 5% to 15% of specific tumors. The authors of this study wanted to know whether epigenetic factors such as methylation contributed to causing the cancers.
What Did the Researchers Do and Find?
The researchers first identified 56 of the 189 candidate cancer genes as likely tumor suppressors and then determined that 36 of these genes were methylated and deactivated, often in both breast and colon (laboratory-grown) cancer cells. In nearly all cases, the methylated genes were not active but could be reactivated by being demethylated. They further showed that, in normal colon and breast tissue samples, 18 of the 36 genes were unmethylated and functioned normally, but in cells taken from breast and colon cancer tumors they were methylated.
In contrast to the genetic mutations, the 18 genes were frequently methylated across a range of tumor types, and eight genes were methylated in both the breast and colon cancers. The authors found by reviewing the genetics and epigenetics of those 18 genes in breast and colon cancer that they were either mutated, methylated, or both. A literature review showed that at least six of the 18 genes were known to have tumor suppressor properties, and the authors determined that 16 were located in parts of DNA known to be missing from cells taken from a range of cancer tumors.
Finally, the researchers analyzed data on cancer cases to show that methylation of these 18 genes was correlated with reduced function of these genes in tumors and with a greater likelihood that a cancer will be terminal or spread to other parts of the body.
What Do These Findings Mean?
The researchers considered only the 189 candidate cancer genes found in one previous study and not other genes identified elsewhere. They also did not consider the biological effects of the individual mutations found in those genes. Despite this, they have demonstrated that methylation of specific genes is likely to play a role in the development of breast and/or colon cancer cells either together with mutations or independently, most likely by turning off their tumor suppression function.
More broadly, however, the study adds to the evidence that future analysis of the role of genes in cancer should include epigenetic as well as genetic factors. In addition, the authors have also shown that a number of these genes may be useful for predicting clinical outcomes for a range of tumor types.
Additional Information.
Please access these Web sites via the online version of this summary at
A December 2006 PLoS Medicine Perspective article reviews the value of examining methylation as a factor in common cancers and its use for early detection
The Web site of the American Cancer Society has a wealth of information and resources on a variety of cancers, including breast and colon cancer
A nonprofit organization providing information about breast cancer on the Web, including research news
Cancer Research UK provides information on cancer research
The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins publishes background information on the authors' research on methylation, setting out its potential for earlier diagnosis and better treatment of cancer
PMCID: PMC2429944  PMID: 18507500
4.  Rare and Common Regulatory Variation in Population-Scale Sequenced Human Genomes 
PLoS Genetics  2011;7(7):e1002144.
Population-scale genome sequencing allows the characterization of functional effects of a broad spectrum of genetic variants underlying human phenotypic variation. Here, we investigate the influence of rare and common genetic variants on gene expression patterns, using variants identified from sequencing data from the 1000 genomes project in an African and European population sample and gene expression data from lymphoblastoid cell lines. We detect comparable numbers of expression quantitative trait loci (eQTLs) when compared to genotypes obtained from HapMap 3, but as many as 80% of the top expression quantitative trait variants (eQTVs) discovered from 1000 genomes data are novel. The properties of the newly discovered variants suggest that mapping common causal regulatory variants is challenging even with full resequencing data; however, we observe significant enrichment of regulatory effects in splice-site and nonsense variants. Using RNA sequencing data, we show that 46.2% of nonsynonymous variants are differentially expressed in at least one individual in our sample, creating widespread potential for interactions between functional protein-coding and regulatory variants. We also use allele-specific expression to identify putative rare causal regulatory variants. Furthermore, we demonstrate that outlier expression values can be due to rare variant effects, and we approximate the number of such effects harboured in an individual by effect size. Our results demonstrate that integration of genomic and RNA sequencing analyses allows for the joint assessment of genome sequence and genome function.
Author Summary
The recent availability of almost fully sequenced human genomes by the 1000 genomes project allows the direct study of genetic variants that influence levels of gene expression in the cell. In this study, we explore the effect of rare and common variants on levels of gene expression. We show that the availability of a more comprehensive list of variants brings us closer to the likely causal variants, and we discuss their genomic and evolutionary properties. We also demonstrate the effects of variants that change splicing patterns or length of the protein product, the putative joint impacts of variants that affect gene expression, and those that affect protein structure. Finally, we show the impact of rare regulatory variants that cannot be detected by the conventional methodologies of association and require the interrogation of full genome sequencing and full transcriptome sequencing. These approaches bring us closer to the implementation of these data and methodologies to a direct clinical application.
PMCID: PMC3141000  PMID: 21811411
5.  DNA sequence polymorphisms in a panel of eight candidate bovine imprinted genes and their association with performance traits in Irish Holstein-Friesian cattle 
BMC Genetics  2010;11:93.
Studies in mice and humans have shown that imprinted genes, whereby expression from one of the two parentally inherited alleles is attenuated or completely silenced, have a major effect on mammalian growth, metabolism and physiology. More recently, investigations in livestock species indicate that genes subject to this type of epigenetic regulation contribute to, or are associated with, several performance traits, most notably muscle mass and fat deposition. In the present study, a candidate gene approach was adopted to assess 17 validated single nucleotide polymorphisms (SNPs) and their association with a range of performance traits in 848 progeny-tested Irish Holstein-Friesian artificial insemination sires. These SNPs are located proximal to, or within, the bovine orthologs of eight genes (CALCR, GRB10, PEG3, PHLDA2, RASGRF1, TSPAN32, ZIM2 and ZNF215) that have been shown to be imprinted in cattle or in at least one other mammalian species (i.e. human/mouse/pig/sheep).
Heterozygosities for all SNPs analysed ranged from 0.09 to 0.46 and significant deviations from Hardy-Weinberg proportions (P ≤ 0.01) were observed at four loci. Phenotypic associations (P ≤ 0.05) were observed between nine SNPs proximal to, or within, six of the eight analysed genes and a number of performance traits evaluated, including milk protein percentage, somatic cell count, culled cow and progeny carcass weight, angularity, body conditioning score, progeny carcass conformation, body depth, rump angle, rump width, animal stature, calving difficulty, gestation length and calf perinatal mortality. Notably, SNPs within the imprinted paternally expressed gene 3 (PEG3) gene cluster were associated (P ≤ 0.05) with calving, calf performance and fertility traits, while a single SNP in the zinc finger protein 215 gene (ZNF215) was associated with milk protein percentage (P ≤ 0.05), progeny carcass weight (P ≤ 0.05), culled cow carcass weight (P ≤ 0.01), angularity (P ≤ 0.01), body depth (P ≤ 0.01), rump width (P ≤ 0.01) and animal stature (P ≤ 0.01).
Of the eight candidate bovine imprinted genes assessed, DNA sequence polymorphisms in six of these genes (CALCR, GRB10, PEG3, RASGRF1, ZIM2 and ZNF215) displayed associations with several of the phenotypes included for analyses. The genotype-phenotype associations detected here are further supported by the biological function of these six genes, each of which plays important roles in mammalian growth, development and physiology. The associations between SNPs within the imprinted PEG3 gene cluster and traits related to calving, calf performance and gestation length suggest that this domain on chromosome 18 may play a role regulating pre-natal growth and development and fertility. SNPs within the bovine ZNF215 gene were associated with bovine growth and body conformation traits and studies in humans have revealed that the human ZNF215 ortholog belongs to the imprinted gene cluster associated with Beckwith-Wiedemann syndrome, a genetic disorder characterised by growth abnormalities. Similarly, the data presented here suggest that the ZNF215 gene may have an important role in regulating bovine growth. Collectively, our results support previous work showing that (candidate) imprinted genes/loci contribute to heritable variation in bovine performance traits and suggest that DNA sequence polymorphisms within these genes/loci represent an important reservoir of genomic markers for future genetic improvement of dairy and beef cattle populations.
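The heterozygosity values and Hardy-Weinberg deviations reported in this abstract can be illustrated for a single biallelic SNP with the sketch below. It assumes a one-degree-of-freedom chi-square goodness-of-fit test; the abstract does not say which test the authors used, and the function name and return keys are hypothetical.

```python
import math


def hardy_weinberg_test(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Arguments are observed genotype counts (homozygote A, heterozygote,
    homozygote B). The chi-square approximation is an assumption; exact
    tests are often preferred when expected counts are small.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {
        "observed_het": n_ab / n,
        "expected_het": 2 * p * q,
        "chi2": chi2,
        "p_value": p_value,
    }
```

With counts of 25, 50 and 25 the observed genotypes match the expected proportions exactly, whereas counts of 40, 20 and 40 show a strong heterozygote deficit that would be flagged at the P ≤ 0.01 threshold used above.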
PMCID: PMC2965127  PMID: 20942903
6.  In Silico Detection of Sequence Variations Modifying Transcriptional Regulation 
Identification of functional genetic variation associated with increased susceptibility to complex diseases can elucidate genes and underlying biochemical mechanisms linked to disease onset and progression. For genes linked to genetic diseases, most identified causal mutations alter an encoded protein sequence. Technological advances for measuring RNA abundance suggest that a significant number of undiscovered causal mutations may alter the regulation of gene transcription. However, it remains a challenge to separate causal genetic variations from linked neutral variations. Here we present an in silico driven approach to identify possible genetic variation in regulatory sequences. The approach combines phylogenetic footprinting and transcription factor binding site prediction to identify variation in candidate cis-regulatory elements. The bioinformatics approach has been tested on a set of SNPs that are reported to have a regulatory function, as well as background SNPs. In the absence of additional information about an analyzed gene, the poor specificity of binding site prediction is prohibitive to its application. However, when additional data is available that can give guidance on which transcription factor is involved in the regulation of the gene, the in silico binding site prediction improves the selection of candidate regulatory polymorphisms for further analyses. The bioinformatics software generated for the analysis has been implemented as a Web-based application system entitled RAVEN (regulatory analysis of variation in enhancers). The RAVEN system is available at for all researchers interested in the detection and characterization of regulatory sequence variation.
Author Summary
DNA sequence variations (polymorphisms) that affect the expression levels of genes play important roles in the pathogenesis of many complex diseases. Compared with genetic variations that alter the amino acid sequences of encoded proteins, which are relatively easy to identify, sequence variants that affect the regulation of genes are difficult to pinpoint among the large amount of nonfunctional polymorphisms located in the vicinity of genes. Computational methods to distinguish functional from neutral variations could therefore prove useful to direct limited laboratory resources to sites most likely to exhibit a phenotypic effect. In this paper we present a Web-based tool for the identification of genetic variation in potential transcription factor binding sites. This tool can be used by any scientist interested in the characterization of regulatory polymorphisms. Using experimentally verified regulatory polymorphisms and background data collected from the literature, we evaluate the method's capacity to identify regulatory genetic variation, and we discuss the limitations of its application.
PMCID: PMC2211530  PMID: 18208319
7.  Genetic Modifiers of Neurofibromatosis Type 1-Associated Café-au-Lait Macule Count Identified Using Multi-platform Analysis 
PLoS Genetics  2014;10(10):e1004575.
Neurofibromatosis type 1 (NF1) is an autosomal dominant, monogenic disorder of dysregulated neurocutaneous tissue growth. Pleiotropy, variable expressivity and few NF1 genotype-phenotype correlates limit clinical prognostication in NF1. Phenotype complexity in NF1 is hypothesized to derive in part from genetic modifiers unlinked to the NF1 locus. In this study, we hypothesized that normal variation in germline gene expression confers risk for certain phenotypes in NF1. In a set of 79 individuals with NF1, we examined the association between gene expression in lymphoblastoid cell lines and NF1-associated phenotypes and sequenced select genes with significant phenotype/expression correlations. In a discovery cohort of 89 self-reported European-Americans with NF1 we examined the association between germline sequence variants of these genes and café-au-lait macule (CALM) count, a tractable, tumor-like phenotype in NF1. Two correlated, common SNPs (rs4660761 and rs7161) between DPH2 and ATP6V0B were significantly associated with the CALM count. Analysis with tiled regression also identified SNP rs4660761 as significantly associated with CALM count. SNP rs1800934 and 12 rare variants in the mismatch repair gene MSH6 were also associated with CALM count. Both SNPs rs7161 and rs4660761 (DPH2 and ATP6V0B) were highly significant in a mega-analysis in a combined cohort of 180 self-reported European-Americans; SNP rs1800934 (MSH6) was near-significant in a meta-analysis assuming dominant effect of the minor allele. SNP rs4660761 is predicted to regulate ATP6V0B, a gene associated with melanosome biology. Individuals with homozygous mutations in MSH6 can develop an NF1-like phenotype, including multiple CALMs. Through a multi-platform approach, we identified variants that influence NF1 CALM count.
Author Summary
Neurofibromatosis type 1 (NF1) is a relatively common genetic disease that increases the chance to develop a variety of benign and malignant tumors. People with NF1 also typically feature a large number of birthmarks called café-au-lait macules. It is difficult to predict severity or specific problems in NF1. We sought to identify genes (other than NF1, the gene that causes the disease) that influence severity in NF1. We determined the number of café-au-lait macules in two groups of people with NF1. We measured the gene expression of about 10,000 genes in the cultured white blood cells from one group of people. We then sequenced a group of genes whose expression level was increased in people with higher numbers of café-au-lait macules. In the first group, we found common variants in genes MSH6 and near DPH2 and ATP6V0B that were significantly associated with the number of café-au-lait macules. Some of these variants were close to significant in the second group of people. The two variants near DPH2 and ATP6V0B were very significant when analysed in both groups combined. Our work is among the first to identify genetic variants that influence the severity of NF1.
PMCID: PMC4199479  PMID: 25329635
8.  Genetic interactions affecting human gene expression identified by variance association mapping 
eLife  2014;3:e01381.
Non-additive interaction between genetic variants, or epistasis, is a possible explanation for the gap between heritability of complex traits and the variation explained by identified genetic loci. Interactions give rise to genotype dependent variance, and therefore the identification of variance quantitative trait loci can be an intermediate step to discover both epistasis and gene by environment effects (GxE). Using RNA-sequence data from lymphoblastoid cell lines (LCLs) from the TwinsUK cohort, we identify a candidate set of 508 variance associated SNPs. Exploiting the twin design we show that GxE plays a role in ∼70% of these associations. Further investigation of these loci reveals 57 epistatic interactions that replicated in a smaller dataset, explaining on average 4.3% of phenotypic variance. In 24 cases, more variance is explained by the interaction than their additive contributions. Using molecular phenotypes in this way may provide a route to uncovering genetic interactions underlying more complex traits.
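The idea of genotype-dependent variance can be made concrete with a small sketch. It uses the Brown-Forsythe statistic (a one-way ANOVA F on absolute deviations from each group's median) as a stand-in test; this is a swapped-in technique for illustration, not the variance association method the authors applied, and the function name is hypothetical.

```python
from statistics import mean, median


def brown_forsythe_f(groups):
    """Brown-Forsythe F statistic for equality of variances.

    `groups` is a list of lists of expression values, one list per genotype
    class (for example 0, 1 or 2 copies of the minor allele). A large F
    suggests that expression variance differs by genotype. The function
    assumes at least two groups and non-zero within-group spread of the
    absolute deviations.
    """
    # Absolute deviation of each value from its own group's median.
    devs = []
    for g in groups:
        centre = median(g)
        devs.append([abs(x - centre) for x in g])

    k = len(devs)
    n_total = sum(len(d) for d in devs)
    grand_mean = sum(sum(d) for d in devs) / n_total
    group_means = [mean(d) for d in devs]

    # One-way ANOVA on the deviations: between-group and within-group mean squares.
    between = sum(
        len(d) * (m - grand_mean) ** 2 for d, m in zip(devs, group_means)
    ) / (k - 1)
    within = sum(
        (z - m) ** 2 for d, m in zip(devs, group_means) for z in d
    ) / (n_total - k)
    return between / within
```

Genotype classes with the same spread but different means give an F near zero, so a mean-shifting eQTL alone does not trigger the statistic; only a change in spread across genotypes does.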
eLife digest
Every person has two copies of each gene: one is inherited from their mother and the other from their father. These two copies are often not identical because there can be many different variants of the same gene in the human population. Traits (such as height, body mass and risk of disease) vary from one person to the next—and for many traits this variation depends in part on the different gene variants that each person has inherited. Studies seeking to find the differences in DNA that can predict this variation have often assumed that the changes in DNA act on traits independently of the effect of environment and of other genetic variants.
In contrast, studies with animals have shown that some genetic variants can interact to produce a bigger (or smaller) effect than would be expected from simply ‘adding together’ their individual effects—a phenomenon called epistasis. But how much does epistasis contribute to variation in human traits, if at all? This question has been much disputed, and is difficult to test, not least because of the sheer number of interactions to assess: tens of millions of changes in DNA have been observed in the human genome, and so there are many more than billions of possible combinations of these changes to investigate.
Here, Brown et al. have examined the sequences of all the genes that were expressed in cells taken from a cohort of twins and searched for genetic variants that show these epistatic interactions. By studying gene expression, which can be greatly affected by small changes in the DNA code, Brown et al. were able to identify 508 variants that had a bigger than expected effect on the level of gene expression. This may be a sign that these variants act in combinations: if within one genome a variant increased expression and in another it decreased expression, then this would cause greater variation in gene expression. Further investigation of these 508 variants led to the discovery of 256 examples of epistasis, and 57 of these were replicated in samples from another cohort. Brown et al. calculated that these epistatic interactions explained up to 16% of the variation in gene expression. Furthermore, as well as being involved in epistatic interactions, about 70% of the genetic variants that had an effect on the variation in gene expression were also involved in interactions between genes and the environment.
In addition to showing that epistasis contributes to variation in human traits, the work of Brown et al. could help to uncover interactions behind complex traits—beyond the expression level of a gene—that could not previously be investigated.
PMCID: PMC4017648  PMID: 24771767
gene expression; epistasis; gene-environment interactions; human
9.  RHOA Is a Modulator of the Cholesterol-Lowering Effects of Statin 
PLoS Genetics  2012;8(11):e1003058.
Although statin drugs are generally efficacious for lowering plasma LDL-cholesterol levels, there is considerable variability in response. To identify candidate genes that may contribute to this variation, we used an unbiased genome-wide filter approach that was applied to 10,149 genes expressed in immortalized lymphoblastoid cell lines (LCLs) derived from 480 participants of the Cholesterol and Pharmacogenomics (CAP) clinical trial of simvastatin. The criteria for identification of candidates included genes whose statin-induced changes in expression were correlated with change in expression of HMGCR, a key regulator of cellular cholesterol metabolism and the target of statin inhibition. This analysis yielded 45 genes, from which RHOA was selected for follow-up because it has been found to participate in mediating the pleiotropic but not the lipid-lowering effects of statin treatment. RHOA knock-down in hepatoma cell lines reduced HMGCR, LDLR, and SREBF2 mRNA expression and increased intracellular cholesterol ester content as well as apolipoprotein B (APOB) concentrations in the conditioned media. Furthermore, inter-individual variation in statin-induced RHOA mRNA expression measured in vitro in CAP LCLs was correlated with the changes in plasma total cholesterol, LDL-cholesterol, and APOB induced by simvastatin treatment (40 mg/d for 6 wk) of the individuals from whom these cell lines were derived. Moreover, the minor allele of rs11716445, a SNP located in a novel cryptic RHOA exon, dramatically increased inclusion of the exon in RHOA transcripts during splicing and was associated with a smaller LDL-cholesterol reduction in response to statin treatment in 1,886 participants from the CAP and Pravastatin Inflammation and CRP Evaluation (PRINCE; pravastatin 40 mg/d) statin clinical trials.
Thus, an unbiased filter approach based on transcriptome-wide profiling identified RHOA as a gene contributing to variation in LDL-cholesterol response to statin, illustrating the power of this approach for identifying candidate genes involved in drug response phenotypes.
Author Summary
Statins, or HMG CoA reductase inhibitors, are widely used to lower plasma LDL-cholesterol levels as a means of reducing risk for cardiovascular disease. We performed an unbiased genome-wide survey to identify novel candidate genes that may be involved in statin response using genome-wide mRNA expression analysis in a sequential filtering strategy to identify those most likely to be relevant to cholesterol metabolism based on their gene expression characteristics. Among these, RHOA was selected for further functional study. A role for this gene in the maintenance of intracellular cholesterol homeostasis was confirmed by knock-down in hepatoma cell lines. In addition, statin-induced RHOA transcript levels measured in a panel of lymphoblastoid cell lines were correlated with statin-induced change in plasma LDL-cholesterol measured in individuals from whom the cell lines were derived. Lastly, a cis-acting splicing QTL associated with expression of a rare cryptic RHOA exon was also associated with statin-induced changes in plasma LDL-cholesterol levels. This result exemplifies the power of combining biological knowledge of well-understood molecular pathways with genome-wide expression data for the identification of candidate genes that influence drug response.
PMCID: PMC3499361  PMID: 23166513
10.  Transcriptome Sequencing from Diverse Human Populations Reveals Differentiated Regulatory Architecture 
PLoS Genetics  2014;10(8):e1004549.
Large-scale sequencing efforts have documented extensive genetic variation within the human genome. However, our understanding of the origins, global distribution, and functional consequences of this variation is far from complete. While regulatory variation influencing gene expression has been studied within a handful of populations, the breadth of transcriptome differences across diverse human populations has not been systematically analyzed. To better understand the spectrum of gene expression variation, alternative splicing, and the population genetics of regulatory variation in humans, we have sequenced the genomes, exomes, and transcriptomes of EBV transformed lymphoblastoid cell lines derived from 45 individuals in the Human Genome Diversity Panel (HGDP). The populations sampled span the geographic breadth of human migration history and include Namibian San, Mbuti Pygmies of the Democratic Republic of Congo, Algerian Mozabites, Pathan of Pakistan, Cambodians of East Asia, Yakut of Siberia, and Mayans of Mexico. We discover that approximately 25.0% of the variation in gene expression found amongst individuals can be attributed to population differences. However, we find few genes that are systematically differentially expressed among populations. Of this population-specific variation, 75.5% is due to expression rather than splicing variability, and we find few genes with strong evidence for differential splicing across populations. Allelic expression analyses indicate that previously mapped common regulatory variants identified in eight populations from the International Haplotype Map Phase 3 project have similar effects in our seven sampled HGDP populations, suggesting that the cellular effects of common variants are shared across diverse populations. 
Together, these results provide a resource for studies analyzing functional differences across populations by estimating the degree of shared gene expression, alternative splicing, and regulatory genetics across populations from the broadest points of human migration history yet sampled.
Author Summary
Previous gene expression studies have identified factors influencing population-level variation in gene regulation. However, these efforts have been limited to a small set of well-studied populations. By leveraging the high resolution of RNA sequencing and broad population sampling, we survey the landscape of transcriptome variation across a globally distributed set of seven populations that span a breadth of human genetic variation and major dispersal events. We assess differences in gene expression, transcript structure, and regulatory variation. We find only 44 transcripts that show significant differences in expression, likely as a result of the small sample size, but we find that 25% of the variance in gene expression is due to population differences. This is a larger fraction than previously observed, and it is likely due to the greater breadth of human diversity assayed in this study. We also find that population-specific variance is mostly due to transcription variability rather than the configuration of expressed gene products. Additionally, known common regulatory variants have similar effects across populations including those we study here. These data and results serve as a resource cataloging the wide array of gene expression regulation affecting population variation among diverse groups, improving our understanding of transcriptional diversity.
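The statement that about 25% of expression variance is due to population differences can be illustrated with a one-way variance decomposition. The sketch below is only an illustration under assumptions: the abstract does not say how the fraction was estimated, and the function name `between_group_fraction` and the toy values are hypothetical. It computes the between-population sum of squares as a fraction of the total for a single gene.

```python
from statistics import mean

def between_group_fraction(groups):
    """Fraction of the total sum of squares explained by group means.

    `groups` maps a population label to the expression values of one gene
    in that population (hypothetical layout, not the study's data format).
    """
    all_values = [v for g in groups.values() for v in g]
    grand = mean(all_values)
    ss_total = sum((v - grand) ** 2 for v in all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total

# Toy expression values for one gene in two populations (made up).
toy = {"popA": [1.0, 3.0], "popB": [2.0, 4.0]}
fraction = between_group_fraction(toy)
```

In this toy case the population means differ by 1 while individuals within a population differ by 2, so population membership accounts for a fifth of the total variation.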
PMCID: PMC4133153  PMID: 25121757
11.  Genome-wide survey of allele-specific splicing in humans 
BMC Genomics  2008;9:265.
Accurate mRNA splicing depends on multiple regulatory signals encoded in the transcribed RNA sequence. Many examples of mutations within human splice regulatory regions that alter splicing qualitatively or quantitatively have been reported and allelic differences in mRNA splicing are likely to be a common and important source of phenotypic diversity at the molecular level, in addition to their contribution to genetic disease susceptibility. However, because the effect of a mutation on the efficiency of mRNA splicing is often difficult to predict, many mutations that cause disease through an effect on splicing are likely to remain undiscovered.
We have combined a genome-wide scan for sequence polymorphisms likely to affect mRNA splicing with analysis of publicly available Expressed Sequence Tag (EST) and exon array data. The genome-wide scan used published tools and identified 30,977 SNPs located within donor and acceptor splice sites, branch points and exonic splicing enhancer elements. For 1,185 candidate splicing polymorphisms the difference in splicing between alternative alleles was corroborated by publicly available exon array data from 166 lymphoblastoid cell lines. We developed a novel probabilistic method to infer allele-specific splicing from EST data. The method uses SNPs and alternative mRNA isoforms mapped to EST sequences and models both regulated alternative splicing and allele-specific splicing. We have also estimated heritability of splicing and report that a greater proportion of genes show evidence of splicing heritability than show heritability of overall gene expression level. Our results provide an extensive resource that can be used to assess the possible effect on splicing of human polymorphisms in putative splice-regulatory sites.
We report a set of genes showing evidence of allele-specific splicing from an integrated analysis of genomic polymorphisms, EST data and exon array data, including several examples for which there is experimental evidence of polymorphisms affecting splicing in the literature. We also present a set of novel allele-specific splicing candidates and discuss the strengths and weaknesses of alternative technologies for inferring the effect of sequence variants on mRNA splicing.
PMCID: PMC2427040  PMID: 18518984
12.  Identification of Common Genetic Variation That Modulates Alternative Splicing 
PLoS Genetics  2007;3(6):e99.
Alternative splicing of genes is an efficient means of generating variation in protein function. Several disease states have been associated with rare genetic variants that affect splicing patterns. Conversely, splicing efficiency of some genes is known to vary between individuals without apparent ill effects. What is not clear is whether commonly observed phenotypic variation in splicing patterns, and hence potential variation in protein function, is to a significant extent determined by naturally occurring DNA sequence variation and in particular by single nucleotide polymorphisms (SNPs). In this study, we surveyed the splicing patterns of 250 exons in 22 individuals who had been previously genotyped by the International HapMap Project. We identified 70 simple cassette exon alternative splicing events in our experimental system; for six of these, we detected consistent differences in splicing pattern between individuals, with a highly significant association between splice phenotype and neighbouring SNPs. Remarkably, for five out of six of these events, the strongest correlation was found with the SNP closest to the intron–exon boundary, although the distance between these SNPs and the intron–exon boundary ranged from 2 bp to greater than 1,000 bp. Two of these SNPs were further investigated using a minigene splicing system, and in each case the SNPs were found to exert cis-acting effects on exon splicing efficiency in vitro. The functional consequences of these SNPs could not be predicted using bioinformatic algorithms. Our findings suggest that phenotypic variation in splicing patterns is determined by the presence of SNPs within flanking introns or exons. Effects on splicing may represent an important mechanism by which SNPs influence gene function.
Author Summary
Genetic variation, through its effects on gene expression, influences many aspects of the human phenotype. Understanding the impact of genetic variation on human disease risk has become a major goal for biomedical research and has the potential of revealing both novel disease mechanisms and novel functional elements controlling gene expression. Recent large-scale studies have suggested that a relatively high proportion of human genes show allele-specific variation in expression. Effects of common DNA polymorphisms on mRNA splicing are less well studied. Variation in splicing patterns is known to be tissue specific, and for a small number of genes has been shown to vary among individuals. What is not known is whether allele-specific splicing events are an important mechanism by which common genetic variation affects gene expression. In this study we show that allele-specific alternative splicing was observed in six out of 70 exon-skipping events. Sequence analysis of the relevant splice sites and of the regions surrounding single nucleotide polymorphisms correlated with the splicing events failed to identify any predictive bioinformatic signals. A genome-wide study of allele-specific splicing, using an experimental rather than a bioinformatic approach, is now required.
PMCID: PMC1904363  PMID: 17571926
13.  Gemcitabine and Arabinosylcytosin Pharmacogenomics: Genome-Wide Association and Drug Response Biomarkers 
PLoS ONE  2009;4(11):e7765.
Cancer patients show large individual variation in their response to chemotherapeutic agents. Gemcitabine (dFdC) and AraC, two cytidine analogues, have shown significant activity against a variety of tumors. We previously used expression data from a lymphoblastoid cell line-based model system to identify genes that might be important for the cytotoxicity of the two drugs. In the present study, we used that same model system to perform a genome-wide association (GWA) study to test the hypothesis that common genetic variation might influence both gene expression and response to the two drugs. Specifically, genome-wide single nucleotide polymorphisms (SNPs) and mRNA expression data were obtained using the Illumina 550K® HumanHap550 SNP Chip and Affymetrix U133 Plus 2.0 GeneChip, respectively, for 174 ethnically-defined “Human Variation Panel” lymphoblastoid cell lines. Gemcitabine and AraC cytotoxicity assays were performed to obtain IC50 values for the cell lines. We then performed GWA studies with SNPs, gene expression and IC50 of these two drugs. This approach identified SNPs that were associated with gemcitabine or AraC IC50 values and with the regulation of expression of 29 or 30 genes, respectively. One SNP in IQGAP2 (rs3797418) was significantly associated with variation in both the expression of multiple genes and gemcitabine and AraC IC50. A second SNP in TGM3 (rs6082527) was also significantly associated with the expression of multiple genes and with gemcitabine IC50. To confirm the association results, we performed siRNA knockdown, in tumor cells, of selected genes whose expression was associated with rs3797418 and rs6082527; the knockdown altered gemcitabine or AraC sensitivity, supporting the association results. These results suggest that the application of GWA approaches using cell-based model systems, when combined with complementary functional validation, can provide insights into mechanisms responsible for variation in cytidine analogue response.
PMCID: PMC2770319  PMID: 19898621
14.  Identifying the genetic determinants of transcription factor activity 
Genome-wide messenger RNA expression levels are highly heritable. However, the molecular mechanisms underlying this heritability are poorly understood. The influence of trans-acting polymorphisms is often mediated by changes in the regulatory activity of one or more sequence-specific transcription factors (TFs). We use a method that exploits prior information about the DNA-binding specificity of each TF to estimate its genotype-specific regulatory activity. To this end, we perform linear regression of genotype-specific differential mRNA expression on TF-specific promoter-binding affinity. Treating inferred TF activity as a quantitative trait and mapping it across a panel of segregants from an experimental genetic cross allows us to identify trans-acting loci (‘aQTLs’) whose allelic variation modulates the TF. A few of these aQTL regions contain the gene encoding the TF itself; several others contain a gene whose protein product is known to interact with the TF. Our method is strictly causal, as it only uses sequence-based features as predictors. Application to budding yeast demonstrates a dramatic increase in statistical power, compared with existing methods, to detect locus-TF associations and trans-acting loci. Our aQTL mapping strategy also succeeds in mouse.
Genetic sequence variation naturally perturbs mRNA expression levels in the cell. In recent years, analysis of parallel genotyping and expression profiling data for segregants from genetic crosses between parental strains has revealed that mRNA expression levels are highly heritable. Expression quantitative trait loci (eQTLs), whose allelic variation regulates the expression level of individual genes, have successfully been identified (Brem et al, 2002; Schadt et al, 2003). The molecular mechanisms underlying the heritability of mRNA expression are poorly understood. However, they are likely to involve mediation by transcription factors (TFs). We present a new transcription-factor-centric method that greatly increases our ability to understand what drives the genetic variation in mRNA expression (Figure 1). Our method identifies genomic loci (‘aQTLs') whose allelic variation modulates the protein-level activity of specific TFs. To map aQTLs, we integrate genotyping and expression profiling data with quantitative prior information about DNA-binding specificity of transcription factors in the form of position-specific affinity matrices (Bussemaker et al, 2007). We applied our method in two different organisms: budding yeast and mouse.
In our approach, the inferred TF activity is explicitly treated as a quantitative trait, and genetically mapped. The decrease of ‘phenotype space' from that of all genes (in the eQTL approach) to that of all TFs (in our aQTL approach) increases the statistical power to detect trans-acting loci in two distinct ways. First, as each inferred TF activity is derived from a large number of genes, it is far less noisy than mRNA levels of individual genes. Second, the number of trait/marker combinations that needs to be tested for statistical significance in parallel is roughly two orders of magnitude smaller than for eQTLs. We identified a total of 103 locus-TF associations, a more than six-fold improvement over the 17 locus-TF associations identified by several existing methods (Brem et al, 2002; Yvert et al, 2003; Lee et al, 2006; Smith and Kruglyak, 2008; Zhu et al, 2008). The total number of distinct genomic loci identified as an aQTL equals 31, which includes 11 of the 13 previously identified eQTL hotspots (Smith and Kruglyak, 2008).
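The two-step idea described here (regress expression on promoter affinity to infer a per-segregant TF activity, then treat that activity as a trait and compare it between alleles at a marker) can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the single-TF setup, the 0/1 genotype coding, and the toy numbers are all hypothetical, and a real analysis would use a proper linkage statistic and multiple-testing control rather than a raw difference in means.

```python
from statistics import mean

def slope(x, y):
    """Ordinary least-squares slope of y regressed on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def tf_activity(affinity, expression_by_segregant):
    """Inferred TF activity per segregant: slope of expression on affinity."""
    return [slope(affinity, expr) for expr in expression_by_segregant]

def allele_effect(activity, genotype):
    """Difference in mean inferred activity between alleles coded 1 and 0."""
    a1 = [a for a, g in zip(activity, genotype) if g == 1]
    a0 = [a for a, g in zip(activity, genotype) if g == 0]
    return mean(a1) - mean(a0)

# Toy data (made up): predicted promoter affinity of one TF for 4 genes,
# and differential expression of those genes in 4 segregants.
affinity = [0.0, 1.0, 2.0, 3.0]
expression = [
    [0.0, 1.0, 2.0, 3.0],  # segregant 0
    [0.0, 2.0, 4.0, 6.0],  # segregant 1
    [1.0, 1.0, 1.0, 1.0],  # segregant 2
    [0.0, 0.5, 1.0, 1.5],  # segregant 3
]
genotype = [1, 1, 0, 0]  # allele inherited at one marker, per segregant

activity = tf_activity(affinity, expression)
effect = allele_effect(activity, genotype)
```

Because each activity value summarizes many genes at once, the trait being mapped is one number per TF per segregant, which is what shrinks the number of trait/marker tests relative to gene-by-gene eQTL mapping.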
To better understand the mechanisms underlying the identified genetic linkages, we examined the genes within each aQTL region. First, we found four ‘local' aQTLs, which encompass the gene encoding the TF itself. This includes the known polymorphism in the HAP1 gene (Brem et al, 2002), but also novel predictions of trans-acting polymorphisms in RFX1, STB5, and HAP4. Second, using high-throughput protein–protein interaction data, we identified putative causal genes for several aQTLs. For example, we predict that a polymorphism in the cyclin-dependent kinase CDC28 antagonistically modulates the functionally distinct cell cycle regulators Fkh1 and Fkh2. In this and other cases, our approach naturally accounts for post-translational modulation of TF activity at the protein level.
We validated our ability to predict locus-TF associations in yeast using gene expression profiles of allele replacement strains from a previous study (Smith and Kruglyak, 2008). Chromosome 15 contains an aQTL whose allelic status influences the activity of no fewer than 30 distinct TFs. This locus includes IRA2, which controls intracellular cAMP levels. We used the gene expression profile of IRA2 replacement strains to confirm that the polymorphism within IRA2 indeed modulates a subset of the TFs whose activity was predicted to link to this locus, and no other TFs.
Application of our approach to mouse data identified an aQTL modulating the activity of a specific TF in liver cells. We identified an aQTL on mouse chromosome 7 for Zscan4, a transcription factor containing four zinc finger domains and a SCAN domain. Even though we could not detect a candidate causal gene for Zscan4p because of lack of information about the mouse genome, our result demonstrates that our method also works in higher eukaryotes.
In summary, aQTL mapping has a greatly improved sensitivity to detect molecular mechanisms underlying the heritability of gene expression. The successful application of our approach to yeast and mouse data underscores the value of explicitly treating the inferred TF activity as a quantitative trait for increasing statistical power of detecting trans-acting loci. Furthermore, our method is computationally efficient, and easily applicable to any other organism whenever prior information about the DNA-binding specificity of TFs is available.
Analysis of parallel genotyping and expression profiling data has shown that mRNA expression levels are highly heritable. Currently, only a tiny fraction of this genetic variance can be mechanistically accounted for. The influence of trans-acting polymorphisms on gene expression traits is often mediated by transcription factors (TFs). We present a method that exploits prior knowledge about the in vitro DNA-binding specificity of a TF in order to map the loci (‘aQTLs’) whose inheritance modulates its protein-level regulatory activity. Genome-wide regression of differential mRNA expression on predicted promoter affinity is used to estimate segregant-specific TF activity, which is subsequently mapped as a quantitative phenotype. In budding yeast, our method identifies six times as many locus-TF associations and more than twice as many trans-acting loci as all existing methods combined. Application to mouse data from an F2 intercross identified an aQTL on chromosome 7 modulating the activity of Zscan4 in liver cells. Our method has greatly improved statistical power over existing methods, is mechanism based, strictly causal, computationally efficient, and generally applicable.
PMCID: PMC2964119  PMID: 20865005
gene expression; gene regulatory networks; genetic variation; quantitative trait loci; transcription factors
15.  Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line 
We provide a large-scale dataset on absolute protein and matching mRNA concentrations from the human medulloblastoma cell line Daoy. The correlation between mRNA and protein concentrations is significant and positive (Rs=0.46, R²=0.29, P-value<2e-16), although non-linear. Out of ∼200 tested sequence features, sequence length, frequency and properties of amino acids, as well as translation initiation-related features are the strongest individual correlates of protein abundance when accounting for variation in mRNA concentration. When integrating mRNA expression data and all sequence features into a non-parametric regression model (Multivariate Adaptive Regression Splines), we were able to explain up to 67% of the variation in protein concentrations. Half of the contributions were attributed to mRNA concentrations, the other half to sequence features relating to regulation of translation and protein degradation. The sequence features are primarily linked to the coding and 3′ untranslated region. To our knowledge, this is the most comprehensive predictive model of human protein concentrations achieved so far.
mRNA decay, translation regulation and protein degradation are essential parts of eukaryotic gene expression regulation (Hieronymus and Silver, 2004; Mata et al, 2005), which enable the dynamics of cellular systems and their responses to external and internal stimuli without having to rely exclusively on transcription regulation. The importance of these processes is emphasized by the generally low correlation between mRNA and protein concentrations. For many prokaryotic and eukaryotic organisms, <50% of the variation in protein abundance is explained by variation in mRNA concentrations (de Sousa Abreu et al, 2009).
Given the plethora of regulatory mechanisms involved, most studies have focused so far on individual regulators and specific targets. Particularly in human, we currently lack system-wide, quantitative analyses that evaluate the relative contribution of regulatory elements encoded in the mRNA and protein sequence. Existing studies have been carried out only in bacteria and yeast (Nie et al, 2006; Brockmann et al, 2007; Tuller et al, 2007; Wu et al, 2008). Here, we present the first comprehensive analysis on the impact of translation and protein degradation on protein abundance variation in a human cell line. For this purpose, we experimentally measured absolute protein and mRNA concentrations in the Daoy medulloblastoma cell line, using shotgun proteomics and microarrays, respectively (Figure 1). These data comprise one of the largest such sets available today for human. We focused on sequence features that likely impact protein translation and protein degradation, including length, nucleotide composition, structure of the untranslated regions (UTRs), coding sequence, composition of the translation initiation site, presence of upstream open reading frames, putative target sites of miRNAs, codon usage, amino-acid composition and protein degradation signals.
Three types of tests have been conducted: (a) we examined partial Spearman's rank correlation of numerical features (e.g. length) with protein concentration, accounting for variation in mRNA concentrations; (b) for numerical and categorical features (e.g. function), we compared two extreme populations with Welch's t-test and (c) using a Multivariate Adaptive Regression Splines model, we analyzed the combined contributions of mRNA expression and sequence features to protein abundance variation (Figure 1). To account for the non-linearity of many relationships, we use non-parametric approaches throughout the analysis.
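Test (a), a partial Spearman rank correlation of a sequence feature with protein concentration while accounting for mRNA concentration, can be sketched with the standard first-order partial correlation formula applied to rank correlations. This is an illustration under assumptions: the abstract does not state which implementation was used, and the function names and toy values below are hypothetical.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def partial_spearman(x, y, z):
    """Spearman correlation of x and y with z partialled out."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

# Toy values for five genes (made up): a numerical feature, protein
# concentration, and mRNA concentration.
feature = [1, 2, 3, 4, 5]
protein = [2, 1, 4, 3, 5]
mrna = [1, 3, 2, 5, 4]

raw = spearman(feature, protein)
partial = partial_spearman(feature, protein, mrna)
```

Working on ranks rather than raw values is what makes the measure robust to the non-linear relationships the authors mention.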
We observed a significant positive correlation between mRNA and protein concentrations, larger than many previous measurements (de Sousa Abreu et al, 2009). We also show that the contribution of translation and protein degradation is at least as important as the contribution of mRNA transcription and stability to the abundance variation of the final protein products. Although variation in mRNA expression explains ∼25–30% of the variation in protein abundance, another 30–40% can be accounted for by characteristics of the sequences, which we identified in a comparative assessment of global correlates. Among these characteristics, sequence length, amino-acid frequencies and also nucleotide frequencies in the coding region are of strong influence (Figure 3A). Characteristics of the 3′UTR and of the 5′UTR, that is length, nucleotide composition and secondary structures, describe another part of the variation, leaving 33% of the variation in protein abundance unexplained. The unexplained fraction may be accounted for by mechanisms not considered in this analysis (e.g. regulation by RNA-binding proteins or gene-specific structural motifs), as well as expression and measurement noise.
Our combined model including mRNA concentration and sequence features can explain 67% of the variation of protein abundance in this system—and thus has the highest predictive power for human protein abundance achieved so far (Figure 3B).
Transcription, mRNA decay, translation and protein degradation are essential processes during eukaryotic gene expression, but their relative global contributions to steady-state protein concentrations in multi-cellular eukaryotes are largely unknown. Using measurements of absolute protein and mRNA abundances in cellular lysate from the human Daoy medulloblastoma cell line, we quantitatively evaluate the impact of mRNA concentration and sequence features implicated in translation and protein degradation on protein expression. Sequence features related to translation and protein degradation have an impact similar to that of mRNA abundance, and their combined contribution explains two-thirds of protein abundance variation. mRNA sequence lengths, amino-acid properties, upstream open reading frames and secondary structures in the 5′ untranslated region (UTR) were the strongest individual correlates of protein concentrations. In a combined model, characteristics of the coding region and the 3′UTR explained a larger proportion of protein abundance variation than characteristics of the 5′UTR. The absolute protein and mRNA concentration measurements for >1000 human genes described here represent one of the largest datasets currently available, and reveal both general trends and specific examples of post-transcriptional regulation.
PMCID: PMC2947365  PMID: 20739923
gene expression regulation; protein degradation; protein stability; translation
16.  Contrasting signals of positive selection in genes involved in human skin-color variation from tests based on SNP scans and resequencing 
Numerous genome-wide scans conducted by genotyping previously ascertained single-nucleotide polymorphisms (SNPs) have provided candidate signatures for positive selection in various regions of the human genome, including in genes involved in pigmentation traits. However, it is unclear how well the signatures discovered by such haplotype-based test statistics can be reproduced in tests based on full resequencing data. Four genes (oculocutaneous albinism II (OCA2), tyrosinase-related protein 1 (TYRP1), dopachrome tautomerase (DCT), and KIT ligand (KITLG)) implicated in human skin-color variation, have shown evidence for positive selection in Europeans and East Asians in previous SNP-scan data. In the current study, we resequenced 4.7 to 6.7 kb of DNA from each of these genes in Africans, Europeans, East Asians, and South Asians.
Applying all commonly used neutrality-test statistics for allele frequency distribution to the newly generated sequence data provided conflicting results regarding evidence for positive selection. Previous haplotype-based findings could not be clearly confirmed. Although some tests were marginally significant for some populations and genes, none of them were significant after multiple-testing correction. Combined P values for each gene-population pair did not improve these results. Applying an Approximate Bayesian Computation approach based on Markov chain Monte Carlo to these sequence data, using a simple forward simulator, revealed broad posterior distributions of the selective parameters for all four genes, providing no support for positive selection. However, when we applied this approach to published sequence data on SLC45A2, another human pigmentation candidate gene, we could readily confirm evidence for positive selection, as previously detected with sequence-based and some haplotype-based tests.
Overall, our data indicate that even genes that are strong biological candidates for positive selection and show reproducible signatures of positive selection in SNP scans do not always show the same replicability of selection signals in other tests, which should be considered in future studies on detecting positive selection in genetic data.
PMCID: PMC3287149  PMID: 22133426
17.  High-Resolution Mapping of Expression-QTLs Yields Insight into Human Gene Regulation 
PLoS Genetics  2008;4(10):e1000214.
Recent studies of the HapMap lymphoblastoid cell lines have identified large numbers of quantitative trait loci for gene expression (eQTLs). Reanalyzing these data using a novel Bayesian hierarchical model, we were able to create a surprisingly high-resolution map of the typical locations of sites that affect mRNA levels in cis. Strikingly, we found a strong enrichment of eQTLs in the 250 bp just upstream of the transcription end site (TES), in addition to an enrichment around the transcription start site (TSS). Most eQTLs lie either within genes or close to genes; for example, we estimate that only 5% of eQTLs lie more than 20 kb upstream of the TSS. After controlling for position effects, SNPs in exons are ∼2-fold more likely than SNPs in introns to be eQTLs. Our results suggest an important role for mRNA stability in determining steady-state mRNA levels, and highlight the potential of eQTL mapping as a high-resolution tool for studying the determinants of gene regulation.
Author Summary
Individual phenotypes within natural populations generally exhibit a large diversity resulting from a complex interplay of genes and environmental factors. Since the advent of molecular markers in the 1980s, quantitative genetics has made a significant step toward unraveling the genetic bases of such complex traits, in particular by developing sophisticated tools to map the genomic locations of genes that affect complex traits. These regions are known as quantitative trait loci (QTLs). More recently, these tools have been extended to the study of gene expression phenotypes on a massive scale. In this paper, we used a previously published dataset consisting of expression measurements of 11,446 genes in human cell lines derived from 210 unrelated human individuals that have been genetically characterized by the International HapMap Project. Our article develops and applies a framework for determining the genetic factors that impact gene regulation. We show that these factors cluster strongly near to the gene start and gene end and are enriched within the transcribed region. Our approach suggests a general framework for studying the genetic factors that affect variation in gene expression.
PMCID: PMC2556086  PMID: 18846210
18.  Aberrant Gene Expression in Humans 
PLoS Genetics  2015;11(1):e1004942.
Gene expression as an intermediate molecular phenotype has been a focus of research interest. In particular, studies of expression quantitative trait loci (eQTL) have offered promise for understanding gene regulation through the discovery of genetic variants that explain variation in gene expression levels. Existing eQTL methods are designed for assessing the effects of common variants, but not rare variants. Here, we address the problem by establishing a novel analytical framework for evaluating the effects of rare or private variants on gene expression. Our method starts from the identification of outlier individuals that show markedly different gene expression from the majority of a population, and then reveals the contributions of private SNPs to the aberrant gene expression in these outliers. Using population-scale mRNA sequencing data, we identify outlier individuals using a multivariate approach. We find that outlier individuals are more readily detected with respect to gene sets that include genes involved in cellular regulation and signal transduction, and less likely to be detected with respect to the gene sets with genes involved in metabolic pathways and other fundamental molecular functions. Analysis of polymorphic data suggests that private SNPs of outlier individuals are enriched in the enhancer and promoter regions of corresponding aberrantly-expressed genes, suggesting a specific regulatory role of private SNPs, while the commonly-occurring regulatory genetic variants (i.e., eQTL SNPs) show little evidence of involvement. Additional data suggest that non-genetic factors may also underlie aberrant gene expression. Taken together, our findings advance a novel viewpoint relevant to situations wherein common eQTLs fail to predict gene expression when heritable, rare inter-individual variation exists. 
The analytical framework we describe, taking into consideration the reality of differential phenotypic robustness, may be valuable for investigating complex traits and conditions.
Author Summary
The uniqueness of individuals is due to differences in the combination of genetic, epigenetic and environmental determinants. Understanding the genetic basis of phenotypic variation is a key objective in genetics. Gene expression has been considered as an intermediate phenotype, and the association between gene expression and commonly-occurring genetic variants in the general population has been convincingly established. However, there are few methods to assess the impact of rare genetic variants, such as private SNPs, on gene expression. Here we describe a systematic approach, based on the theory of multivariate outlier detection, to identify individuals that show unusual or aberrant gene expression, relative to the rest of the study cohort. Through characterizing detected outliers and corresponding gene sets, we are able to identify which gene sets tend to be aberrantly expressed and which individuals show deviant gene expression within a population. One of our major findings is that private SNPs may contribute to aberrant expression in outlier individuals. These private SNPs are more frequently located in the enhancer and promoter regions of genes that are aberrantly expressed, suggesting a possible regulatory function of these SNPs. Overall, our results provide new insight into the determinants of inter-individual variation, which have not been evaluated by large population-level cohort studies.
PMCID: PMC4305293  PMID: 25617623
19.  Tissue Effect on Genetic Control of Transcript Isoform Variation 
PLoS Genetics  2009;5(8):e1000608.
Current genome-wide association studies (GWAS) are moving towards the use of large cohorts of primary cell lines to study a disease of interest and to assign biological relevance to the genetic signals identified. Here, we use a panel of human osteoblasts (HObs) to carry out a transcriptomic survey, similar to recent studies in lymphoblastoid cell lines (LCLs). The distinct nature of HObs and LCLs is reflected by the preferential grouping of cell type–specific genes within biologically and functionally relevant pathways unique to each tissue type. We performed cis-association analysis with SNP genotypes to identify genetic variations of transcript isoforms, and our analysis indicates that differential expression of transcript isoforms in HObs is also partly controlled by cis-regulatory genetic variants. These isoforms are regulated by genetic variants in both a tissue-specific and tissue-independent fashion, and these associations have been confirmed by RT–PCR validation. Our study suggests that multiple transcript isoforms are often present in both tissues and that genetic control may affect the relative expression of one isoform to another, rather than having an all-or-none effect. Examination of the top SNPs from a GWAS of bone mineral density shows overlap with probeset associations observed in this study. The top hit corresponding to the FAM118A gene was tested for association in two additional clinical studies, revealing a novel transcript isoform variant. Our approach to examining transcriptome variation in multiple tissue types is useful for detecting the proportion of genetic variation common to different cell types and for the identification of cell-specific isoform variants that may be functionally relevant, an important follow-up step for GWAS.
Author Summary
The transcriptome of any given cell type is a complex program of controlled gene expression underlying its biological function. An additional layer of molecular complexity involving individual genetic variation can modulate the transcriptome within the same tissue type, conferring potential phenotypic differences between individuals at the cellular level. This study highlights common and unique aspects of the transcriptome between the well-characterized lymphoblastoid cell lines from the International HapMap Project and those of a cultured primary cell type, human osteoblasts. We observe that inter-individual genetic variation can regulate transcript isoform expression in tissue-specific and tissue-independent manners, indicating that genetic differences among individuals can alter the transcriptome in one or more tissues, ultimately leading to altered biological functions within the lymphoblasts and/or osteoblasts. Pursuant to this, genome wide association studies on bone mineral density (BMD) have identified a number of significant loci and polymorphisms highly linked to the BMD quantitative phenotype. A small proportion of these polymorphisms overlap with our highly significant SNPs regulating the osteoblast transcriptome, revealing a potential molecular basis for this phenotype at the transcriptional level. This study highlights the importance of examining the differing transcriptomes and cis-regulatory mechanisms governing the biological and functional roles of varied tissue types.
PMCID: PMC2719916  PMID: 19680542
20.  Tissue-Specific Effects of Genetic and Epigenetic Variation on Gene Regulation and Splicing 
PLoS Genetics  2015;11(1):e1004958.
Understanding how genetic variation affects distinct cellular phenotypes, such as gene expression levels, alternative splicing and DNA methylation levels, is essential for better understanding of complex diseases and traits. Furthermore, how inter-individual variation of DNA methylation is associated to gene expression is just starting to be studied. In this study, we use the GenCord cohort of 204 newborn Europeans’ lymphoblastoid cell lines, T-cells and fibroblasts derived from umbilical cords. The samples were previously genotyped for 2.5 million SNPs, mRNA-sequenced, and assayed for methylation levels in 482,421 CpG sites. We observe that methylation sites associated to expression levels are enriched in enhancers, gene bodies and CpG island shores. We show that while the correlation between DNA methylation and gene expression can be positive or negative, it is very consistent across cell-types. However, this epigenetic association to gene expression appears more tissue-specific than the genetic effects on gene expression or DNA methylation (observed in both sharing estimations based on P-values and effect size correlations between cell-types). This predominance of genetic effects can also be reflected by the observation that allele specific expression differences between individuals dominate over tissue-specific effects. Additionally, we discover genetic effects on alternative splicing and interestingly, a large amount of DNA methylation correlating to alternative splicing, both in a tissue-specific manner. The locations of the SNPs and methylation sites involved in these associations highlight the participation of promoter proximal and distant regulatory regions on alternative splicing. Overall, our results provide high-resolution analyses showing how genome sequence variation has a broad effect on cellular phenotypes across cell-types, whereas epigenetic factors provide a secondary layer of variation that is more tissue-specific. 
Furthermore, the details of how this tissue-specificity may vary across inter-relations of molecular traits, and where these are occurring, can yield further insights into gene regulation and cellular biology as a whole.
Author Summary
In order to better understand how genetic differences between individuals can cause diseases, it is crucial to understand how genetic variants affect cellular functions in the different tissues that compose the human body. From the umbilical cord of 195 newborn babies, we previously obtained three different cell-types: fibroblasts, T-cells and immortalized B-cells. From every individual in each cell type we measured four features across the genome: 1) genetic differences, 2) DNA methylation, an epigenetic modification of DNA that can affect its functional state, 3) gene expression—the amount of gene activity, 4) alternative splicing—which of the different versions of a gene is manifested. We find thousands of genetic variants of the DNA sequence that affect methylation, gene expression, and splicing. We show that while these genetic effects often affect multiple cell-types, the strength of these effects varies between cell-types. Also epigenetic methylation marks of DNA associate to gene expression and particularly often to splicing. Since abnormalities in gene expression, DNA methylation and alternative splicing are associated to diseases, it is important to continue studying how these traits are inter-related and affected by genetic variation across cell-types.
PMCID: PMC4310612  PMID: 25634236
21.  A towards-multidimensional screening approach to predict candidate genes of rheumatoid arthritis based on SNP, structural and functional annotations 
BMC Medical Genomics  2010;3:38.
According to the Genetic Analysis Workshops (GAW), hundreds of thousands of SNPs have been tested for association with rheumatoid arthritis. Traditional genome-wide association studies (GWAS) have been developed to identify susceptibility genes using a "most significant SNPs/genes" model. However, many minor- or modest-risk genes are likely to be missed after adjustment for multiple testing. This screening process uses a strict selection of statistical thresholds that aim to identify susceptibility genes based only on a statistical model, without considering multi-dimensional biological similarities in sequence arrangement, crystal structure, or functional categories/biological pathways between candidate and known disease genes.
Multidimensional screening approaches combined with traditional statistical genetics methods can consider multiple biological backgrounds of genetic mutation, structural, and functional annotations. Here we introduce a newly developed multidimensional screening approach for rheumatoid arthritis candidate genes that considers all SNPs with nominal evidence of Bayesian association (BFLn > 0), and structural and functional similarities of corresponding genes or proteins.
Our multidimensional screening approach extracted all risk genes (BFLn > 0) by odds ratios of hypothesis H1 to H0, and determined whether a particular group of genes shared underlying biological similarities with known disease genes. Using this method, we found 6614 risk SNPs in our Bayesian screen result set. Finally, we identified 146 likely causal genes for rheumatoid arthritis, including CD4, FGFR1, and KDR, which have been reported as high risk factors by recent studies. We note that 790 (96.1%) of genes identified by GWAS could not easily be classified into related functional categories or biological processes associated with the disease, while our candidate genes shared underlying biological similarities (e.g. were in the same pathway or GO term) and contributed to disease etiology, even though common variations in each of these genes make only modest contributions to disease risk. We also found 6141 risk SNPs that were too minor to be detected by conventional approaches, and associations between 58 candidate genes and rheumatoid arthritis were verified by literature retrieved from the NCBI PubMed module.
Our proposed approach to the analysis of GAW16 data for rheumatoid arthritis was based on an underlying biological similarities-based method applied to candidate and known disease genes. Application of our method could identify likely causal candidate disease genes of rheumatoid arthritis, and could yield biological insights that are not detected when focusing only on genes that give the strongest evidence by multiple testing. We hope that our proposed method complements the "most significant SNPs/genes" model, and provides additional insights into the pathogenesis of rheumatoid arthritis and other diseases, when searching datasets for hundreds of genetic variants.
PMCID: PMC2939610  PMID: 20727150
22.  Extensive Natural Variation for Cellular Hydrogen Peroxide Release Is Genetically Controlled 
PLoS ONE  2012;7(8):e43566.
Natural variation in DNA sequence contributes to individual differences in quantitative traits. While multiple studies have shown genetic control over gene expression variation, few additional cellular traits have been investigated. Here, we investigated the natural variation of NADPH oxidase-dependent hydrogen peroxide (H2O2) release, which is the joint effect of reactive oxygen species (ROS) production, superoxide metabolism and degradation, and is related to a number of human disorders. We assessed the normal variation of H2O2 release in lymphoblastoid cell lines (LCL) in a family-based 3-generation cohort (CEPH-HapMap), and in 3 population-based cohorts (KORA, GenCord, HapMap). Substantial individual variation was observed, 45% of which was associated with heritability in the CEPH-HapMap cohort. We identified 2 genome-wide significant loci on Hsa12 and Hsa15 in genome-wide linkage analysis. Next, we performed a genome-wide association study (GWAS) for the combined KORA-GenCord cohorts (n = 279) using enhanced marker resolution by imputation (>1.4 million SNPs). We found 5 significant associations (p<5.00×10⁻⁸) and 54 suggestive associations (p<1.00×10⁻⁵), one of which confirmed the linked region on Hsa15. To replicate our findings, we performed GWAS using 58 HapMap individuals and ∼2.1 million SNPs. We identified 40 genome-wide significant and 302 suggestive SNPs, and confirmed genome signals on Hsa1, Hsa12, and Hsa15. Genetic loci within 900 kb from the known candidate gene p67phox on Hsa1 were identified in GWAS in both cohorts. We did not find replication of SNPs across all cohorts, but replication within the same genomic region. Finally, a highly significant decrease in H2O2 release was observed in Down Syndrome (DS) individuals (p<2.88×10⁻¹²). 
Taken together, our results show strong evidence of genetic control of H2O2 in LCL of healthy and DS cohorts and suggest that cellular phenotypes, which themselves are also complex, may be used as proxies for dissection of complex disorders.
PMCID: PMC3430705  PMID: 22952707
23.  Maps of Open Chromatin Guide the Functional Follow-Up of Genome-Wide Association Signals: Application to Hematological Traits 
PLoS Genetics  2011;7(6):e1002139.
Turning genetic discoveries identified in genome-wide association (GWA) studies into biological mechanisms is an important challenge in human genetics. Many GWA signals map outside exons, suggesting that the associated variants may lie within regulatory regions. We applied the formaldehyde-assisted isolation of regulatory elements (FAIRE) method in a megakaryocytic and an erythroblastoid cell line to map active regulatory elements at known loci associated with hematological quantitative traits, coronary artery disease, and myocardial infarction. We showed that the two cell types exhibit distinct patterns of open chromatin and that cell-specific open chromatin can guide the finding of functional variants. We identified an open chromatin region at chromosome 7q22.3 in megakaryocytes but not erythroblasts, which harbors the common non-coding sequence variant rs342293 known to be associated with platelet volume and function. Resequencing of this open chromatin region in 643 individuals provided strong evidence that rs342293 is the only putative causative variant in this region. We demonstrated that the C- and G-alleles differentially bind the transcription factor EVI1 affecting PIK3CG gene expression in platelets and macrophages. A protein–protein interaction network including up- and down-regulated genes in Pik3cg knockout mice indicated that PIK3CG is associated with gene pathways with an established role in platelet membrane biogenesis and thrombus formation. Thus, rs342293 is the functional common variant at this locus; to the best of our knowledge this is the first such variant to be elucidated among the known platelet quantitative trait loci (QTLs). Our data suggested a molecular mechanism by which a non-coding GWA index SNP modulates platelet phenotype.
Author Summary
Genome-wide scans have revealed multiple genetic regions underlying complex traits. However, the transition from an initial association signal to identifying the functional DNA change(s) has proved challenging. Many of the DNA changes discovered are located outside protein-coding regions and may exert their effects through gene regulation. We screened genetic regions associated with hematological traits in erythroblasts (red blood cells) and megakaryocytes (platelet-producing cells) and mapped sites of open chromatin, which harbor active gene regulatory elements. We investigated a DNA sequence change located within a site of open chromatin at chromosome 7 in megakaryocytes, but not erythroblasts, known to be associated with platelet volume. We showed that this DNA change is functional due to alteration of the binding site of a transcription factor, which regulates the expression of a gene that affects platelet characteristics. Mice lacking this gene revealed significant differences in expression of several important platelet genes compared to wild-type mice. The approach described here can be applied in different cell types to functionally follow up association signals with many other biological traits by identification of the causative base change and how it affects gene function, thus paving the way to clinical benefit.
PMCID: PMC3128100  PMID: 21738486
24.  Candidate genes for obesity-susceptibility show enriched association within a large genome-wide association study for BMI 
Human Molecular Genetics  2012;21(20):4537-4542.
Before the advent of genome-wide association studies (GWASs), hundreds of candidate genes for obesity-susceptibility had been identified through a variety of approaches. We examined whether those obesity candidate genes are enriched for associations with body mass index (BMI) compared with non-candidate genes by using data from a large-scale GWAS. A thorough literature search identified 547 candidate genes for obesity-susceptibility based on evidence from animal studies, Mendelian syndromes, linkage studies, genetic association studies and expression studies. Genomic regions were defined to include the genes ±10 kb of flanking sequence around candidate and non-candidate genes. We used summary statistics publicly available from the discovery stage of the genome-wide meta-analysis for BMI performed by the genetic investigation of anthropometric traits consortium in 123 564 individuals. Hypergeometric, rank tail-strength and gene-set enrichment analysis tests were used to test for the enrichment of association in candidate compared with non-candidate genes. The hypergeometric test of enrichment was not significant at the 5% P-value quantile (P = 0.35), but was nominally significant at the 25% quantile (P = 0.015). The rank tail-strength and gene-set enrichment tests were nominally significant for the full set of genes and borderline significant for the subset without SNPs at P < 10⁻⁷. Taken together, the observed evidence for enrichment suggests that the candidate gene approach retains some value. However, the degree of enrichment is small despite the extensive number of candidate genes and the large sample size. Studies that focus on candidate genes have only slightly increased chances of detecting associations, and are likely to miss many true effects in non-candidate genes, at least for obesity-related traits.
PMCID: PMC3607467  PMID: 22791748
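The hypergeometric enrichment test named in the abstract of entry 24 can be illustrated with a minimal sketch. It computes the upper-tail probability of seeing at least a given number of candidate genes among the genes in a top P-value quantile. The function name, argument names, and toy numbers are assumptions made for illustration; they are not the authors' code or data.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, candidate_genes, top_genes, candidate_hits):
    """Upper-tail hypergeometric probability P(X >= candidate_hits).

    Hypothetical setup: `top_genes` genes fall in the top P-value quantile
    out of `total_genes`, of which `candidate_genes` are candidate genes.
    X is the number of candidate genes that land in the top quantile
    if genes were drawn at random without replacement.
    """
    max_hits = min(candidate_genes, top_genes)
    denominator = comb(total_genes, top_genes)
    # Sum the probability mass for every outcome at least as extreme as observed.
    tail = sum(
        comb(candidate_genes, k) * comb(total_genes - candidate_genes, top_genes - k)
        for k in range(candidate_hits, max_hits + 1)
    )
    return tail / denominator


# Toy example (made-up numbers): 10 genes, 4 candidates, 3 genes in the top quantile,
# 2 of which are candidates.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
print(round(p_value, 4))
```

A small value would suggest that candidate genes are over-represented in the top quantile; the abstract reports such tests at the 5% and 25% quantiles.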
25.  Patterns of Cis Regulatory Variation in Diverse Human Populations 
PLoS Genetics  2012;8(4):e1002639.
The genetic basis of gene expression variation has long been studied with the aim to understand the landscape of regulatory variants, but also more recently to assist in the interpretation and elucidation of disease signals. To date, many studies have looked in specific tissues and population-based samples, but there has been limited assessment of the degree of inter-population variability in regulatory variation. We analyzed genome-wide gene expression in lymphoblastoid cell lines from a total of 726 individuals from 8 global populations from the HapMap3 project and correlated gene expression levels with HapMap3 SNPs located in cis to the genes. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We further dissect the specific functional pathways differentiated between populations. We also identify 5,691 expression quantitative trait loci (eQTLs) after controlling for both non-genetic factors and population admixture and observe that half of the cis-eQTLs are replicated in one or more of the populations. We highlight patterns of eQTL-sharing between populations, which are partially determined by population genetic relatedness, and discover significant sharing of eQTL effects between Asians, European-admixed, and African subpopulations. Specifically, we observe that both the effect size and the direction of effect for eQTLs are highly conserved across populations. We observe an increasing proximity of eQTLs toward the transcription start site as sharing of eQTLs among populations increases, highlighting that variants close to TSS have stronger effects and therefore are more likely to be detected across a wider panel of populations. 
Together these results offer a unique picture and resource of the degree of differentiation among human populations in functional regulatory variation and provide an estimate for the transferability of complex trait variants across populations.
Author Summary
Variation among individuals in the degree to which genes are expressed (i.e. turned on or off) is a characteristic exhibited by all species, and studies have identified regions of the genome harboring genetic variation affecting gene expression levels. To assess the degree of human inter-population variability in regulatory variation, we describe mapping of regions of the genome that have functional effects on gene expression levels. We analyzed genome-wide gene expression in human cell lines derived from 726 unrelated individuals representing 8 global populations that have been genetically well-characterized by the International HapMap Project. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We identify ∼5,700 genes whose expression levels are associated with genetic variation located physically close to the gene, and we observe significant sharing of associations that is partially dependent on population genetic relatedness, among Asians, European-admixed, and African subpopulations. We identify biological functions affected by regulatory variation and describe common and unique characteristics of population-specific and population-shared associations. These results offer a unique picture and resource of the degree of differentiation among human populations in functional regulatory variation.
PMCID: PMC3330104  PMID: 22532805
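The cis association step described in entry 25 (correlating a gene's expression with SNPs located near it) can be sketched roughly as follows. This is an illustration under stated assumptions: genotypes are coded as allele dosages 0/1/2, the cis window is a hypothetical 1 Mb around the transcription start site, and a plain Pearson correlation stands in for whatever statistic the authors used. All names and data are made up.

```python
from math import sqrt


def cis_snps(gene_tss, snp_positions, window=1_000_000):
    """Return IDs of SNPs within `window` bp of the gene's TSS (assumed cis definition)."""
    return [snp for snp, pos in snp_positions.items() if abs(pos - gene_tss) <= window]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Toy data (hypothetical): two SNPs, one gene, six individuals.
positions = {"snp_near": 100, "snp_far": 5_000_000}
dosages = {"snp_near": [0, 1, 2, 1, 0, 2], "snp_far": [2, 0, 1, 1, 2, 0]}
expression = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0]

for snp in cis_snps(gene_tss=500, snp_positions=positions):
    print(snp, pearson_r(dosages[snp], expression))
```

In a real analysis each correlation would be followed by a significance test and multiple-testing control, and covariates such as population structure would be accounted for, as the abstract indicates.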
