Genes do not act in isolation but as part of complex regulatory networks. To understand how breast tumors adapt to the drug letrozole at the molecular level, it is necessary to consider how the expression levels of genes in these networks change relative to one another.
Using transcriptomic data generated from sequential tumor biopsy samples, taken at diagnosis, following 10-14 days and following 90 days of letrozole treatment, and a pairwise partial correlation statistic, we build temporal gene coexpression networks. We characterize the structure of each network and identify genes that hold prominent positions for maintaining network integrity and controlling information-flow.
Letrozole treatment leads to extensive rewiring of the breast tumor coexpression network. Approximately 20% of gene-gene relationships are conserved over time in the presence of letrozole while 80% of relationships are condition dependent. The positions of influence within the networks are transiently held with few genes stably maintaining high centrality scores across the three time points.
Genes integral for maintaining network integrity and controlling information flow are dynamically changing as the breast tumor coexpression network adapts to perturbation by the drug letrozole.
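As a rough illustration of why a partial correlation, rather than a plain correlation, is useful when building a coexpression network, the sketch below computes a first-order partial correlation between two genes while controlling for a third. The abstract does not specify the exact statistic used, so the three-variable formula, the function names, and the expression values here are all assumptions made for illustration only.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical expression values: gene_x and gene_y both track gene_z,
# but their remaining variation is unrelated.
gene_z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise_x = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2, 0.2, -0.2]
noise_y = [0.2, 0.2, -0.2, -0.2, 0.2, 0.2, -0.2, -0.2]
gene_x = [z + e for z, e in zip(gene_z, noise_x)]
gene_y = [z + e for z, e in zip(gene_z, noise_y)]

marginal = pearson(gene_x, gene_y)                  # high: both follow gene_z
conditional = partial_corr(gene_x, gene_y, gene_z)  # near zero once gene_z is removed
print(round(marginal, 3), round(conditional, 3))
```

In this toy case the two genes look tightly coexpressed until the shared driver is accounted for, which is the kind of indirect edge a partial correlation network is meant to suppress.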
Pancreatic cancer is the fourth leading cause of cancer death in the United States because most patients are diagnosed too late in the course of the disease to be treated effectively. Thus, there is a pressing need to more clearly understand how gene expression is regulated in cancer cells and to identify new biomarkers and therapeutic targets. Translational regulation is thought to occur primarily through non-SMAD directed signaling pathways. We tested the hypothesis that SMAD4-dependent signaling does play a role in the regulation of mRNA entry into polysomes and that novel candidate genes in pancreatic cancer could be identified using polysome RNA from the human pancreatic cancer cell line BxPC3 with or without a functional SMAD4 gene. We found that (i) differentially expressed whole cell and cytoplasm RNA levels are both poor predictors of polysome RNA levels; (ii) for a majority of RNAs, differential RNA levels are regulated independently in the nucleus, cytoplasm, and polysomes; (iii) for most of the remaining polysome RNA, levels are regulated via a “tagging” of the RNAs in the nucleus for rapid entry into the polysomes; (iv) a SMAD4-dependent pathway appears to indeed play a role in regulating mRNA entry into polysomes; and (v) a gene list derived from differentially expressed polysome RNA in BxPC3 cells generated new candidate genes and cell pathways potentially related to pancreatic cancer.
polysomes; differential gene expression; pancreatic cancer; BxPC3; SMAD4
Prior to the discovery of CLCNKB-T481S, there were no variants or clinical disorders associated with gain-of-function defects in channels or transporters of the thick ascending limb (TAL) of the kidney. CLCNKB-T481S is a novel gain-of-function variant that has been associated with essential hypertension. This finding had not been replicated until our current study. In this study we re-examine CLCNKB-T481S using a large homogeneous population from Ghana, and couple genetic analyses with the functional characterization of this polymorphism using a mammalian expression system.
We genotyped CLCNKB-T481S in four ethnically defined control populations and a homogeneous cohort of normotensive and hypertensive Ghanaians. Functional analysis was performed by whole-cell patch-clamp recording of tsA201 cells (a cell line derived from the human renal cell line, HEK-293) transiently transfected with ClC-Kb and barttin.
CLCNKB-T481S was found more commonly in the African and Caucasian Americans than in the Asian and Hispanic American populations, with minor allele frequencies of 0.20, 0.15, 0.06 and 0.01, respectively. Additionally, CLCNKB-T481S was significantly associated with hypertension in Ghanaian males. In stratified logistic regression analysis with Ghanaian males we observed a significant odds ratio of 3.29 (95% CI 1.17 - 9.20, p = 0.024) in the recessive model (TT v AT&TT). Unlike previous results obtained in Xenopus oocytes, co-expression of CLCNKB-T481S with the obligatory accessory subunit barttin in tsA201 cells did not generate larger currents than co-expression of the wild type allele.
We conclude that CLCNKB-T481S is associated with essential hypertension in males within the Ghanaian population; however further studies are needed to understand its gender and ethnic segregation as well as to identify cellular factors that account for the divergent functional expression of ClC-Kb-T481S plus barttin in Xenopus oocytes and mammalian cells.
Gene regulatory networks (GRNs) drive the cellular processes that sustain life. To do so reliably, GRNs must be robust to perturbations, such as gene deletion and the addition or removal of regulatory interactions. GRNs must also be robust to genetic changes in regulatory regions that define the logic of signal-integration, as these changes can affect how specific combinations of regulatory signals are mapped to particular gene expression states. Previous theoretical analyses have demonstrated that the robustness of a GRN is influenced by its underlying topological properties, such as degree distribution and modularity. Another important topological property is assortativity, which measures the propensity with which nodes of similar connectivity are connected to one another. How assortativity influences the robustness of the signal-integration logic of GRNs remains an open question. Here, we use computational models of GRNs to investigate this relationship. We separately consider each of the three dynamical regimes of this model for a variety of degree distributions. We find that in the chaotic regime, robustness exhibits a pronounced increase as assortativity becomes more positive, while in the critical and ordered regimes, robustness is generally less sensitive to changes in assortativity. We attribute the increased robustness to a decrease in the duration of the gene expression pattern, which is caused by a reduction in the average size of a GRN’s in-components. This study provides the first direct evidence that assortativity influences the robustness of the signal-integration logic of computational models of GRNs, illuminates a mechanistic explanation for this influence, and furthers our understanding of the relationship between topology and robustness in complex biological systems.
Boolean networks; regulatory regions; in-components; genetic regulation
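To make the notion of assortativity concrete, the sketch below computes a degree assortativity coefficient as the Pearson correlation between the degrees found at the two ends of every edge. This is a simplification: it treats the network as undirected, whereas regulatory networks are directed and the study may define assortativity over in- and out-degrees. The function name and the toy edge lists are assumptions for illustration.

```python
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation of endpoint degrees over an undirected edge list.

    Each edge is counted in both directions, so the two degree sequences
    share the same mean and variance.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    mean = sum(xs) / len(xs)
    sxy = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    sxx = sum((x - mean) ** 2 for x in xs)
    if sxx == 0:
        return float("nan")  # undefined when every node has the same degree
    return sxy / sxx

# Toy networks (hypothetical): a star is maximally disassortative, while a
# triangle plus a separate edge only joins nodes of equal degree.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
like_with_like = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]

r_star = degree_assortativity(star)
r_like = degree_assortativity(like_with_like)
print(r_star, r_like)
```

Positive values indicate that highly connected nodes tend to link to other highly connected nodes; negative values indicate that hubs mostly link to sparsely connected nodes.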
Indoor and outdoor air pollution is known to contribute to increased lung cancer incidence. This study is the first to address the contribution of home heating fuel and geographical coarse particulate matter (PM10) concentrations to lung cancer rates in New Hampshire, U.S. First, Pearson correlation analysis and geographically weighted regression were used to investigate spatial relationships between outdoor PM10 and lung cancer rates. While the aforementioned analyses did not indicate a significant contribution of PM10 to lung cancer in the state, there was a trend towards a significant association in the northern and southwestern regions of the state. Second, case-control data were used to estimate the contributions of indoor pollution and secondhand smoke to risk of lung cancer with adjustment for confounders. Increased risk was found among those who used wood or coal to heat their homes for more than 10 winters before the age of 18, with a significant increase in risk per winter. Resulting data suggest that further investigation of the relationship between heating-related air pollution levels and lung cancer risk is needed.
The rapid development of sequencing technologies makes thousands to millions of genetic attributes available for testing associations with various biological traits. Searching this enormous high-dimensional data space imposes a great computational challenge in genome-wide association studies. We introduce a network-based approach to supervise the search for three-locus models of disease susceptibility. Such statistical epistasis networks (SEN) are built using strong pairwise epistatic interactions and provide a global interaction map to search for higher-order interactions by prioritizing genetic attributes clustered together in the networks. Applying this approach to a population-based bladder cancer dataset, we found a high susceptibility three-way model of genetic variations in DNA repair and immune regulation pathways, which holds great potential for studying the etiology of bladder cancer with further biological validations. We demonstrate that our SEN-supervised search is able to find a small subset of three-locus models with significantly high associations at a substantially reduced computational cost.
Epistasis; High-order genetic interactions; GWAS; Statistical epistasis networks; MDR
Identifying high-order genetic associations with non-additive (i.e. epistatic) effects in population-based studies of common human diseases is a computational challenge. Multifactor dimensionality reduction (MDR) is a machine learning method that was designed specifically for this problem. The goal of the present study was to apply MDR to mining high-order epistatic interactions in a population-based genetic study of tuberculosis (TB).
The study used a previously published data set consisting of 19 candidate single-nucleotide polymorphisms (SNPs) in 321 pulmonary TB cases and 347 healthy controls from Guinea-Bissau in Africa. The ReliefF algorithm was applied first to generate a smaller set of the five most informative SNPs. MDR with 10-fold cross-validation was then applied to look at all possible combinations of two, three, four and five SNPs. The MDR model with the best testing accuracy (TA) consisted of SNPs rs2305619, rs187084, and rs11465421 (TA = 0.588) in PTX3, TLR9 and DC-Sign, respectively. A general 1000-fold permutation test of the null hypothesis of no association confirmed the statistical significance of the model (p = 0.008). An additional 1000-fold permutation test designed specifically to test the linear null hypothesis that the association effects are only additive confirmed the presence of non-additive (i.e. nonlinear) or epistatic effects (p = 0.013). An independent information-gain measure corroborated these results with a third-order epistatic interaction that was stronger than any lower-order associations.
We have identified statistically significant evidence for a three-way epistatic interaction that is associated with susceptibility to TB. This interaction is stronger than any previously described one-way or two-way associations. This study highlights the importance of using machine learning methods that are designed to embrace, rather than ignore, the complexity of common diseases such as TB. We recommend future studies of the genetics of TB take into account the possibility that high-order epistatic interactions might play an important role in disease susceptibility.
Epistasis; Gene-gene interactions; Machine learning; Pulmonary tuberculosis
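The core idea of MDR, pooling multi-locus genotype combinations into "high-risk" and "low-risk" groups, can be sketched in a few lines. The code below is a minimal illustration of that single step, not the software used in the study: it omits ReliefF filtering, 10-fold cross-validation and permutation testing, and the function names and toy genotypes (two binary-coded loci with an XOR-like pattern) are hypothetical.

```python
from collections import Counter

def high_risk_cells(genotypes, labels):
    """Genotype combinations whose case:control ratio exceeds the overall ratio."""
    cases = Counter(g for g, y in zip(genotypes, labels) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, labels) if y == 0)
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    # Cross-multiplied form of cases/controls > n_cases/n_controls,
    # which avoids dividing by zero for cells with no controls.
    return {g for g in set(cases) | set(controls)
            if cases[g] * n_controls > controls[g] * n_cases}

def balanced_accuracy(genotypes, labels, high_risk):
    """Mean of sensitivity and specificity for the high/low-risk rule."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    tp = sum(1 for g, y in zip(genotypes, labels) if y == 1 and g in high_risk)
    tn = sum(1 for g, y in zip(genotypes, labels) if y == 0 and g not in high_risk)
    return 0.5 * (tp / n_cases + tn / n_controls)

# Hypothetical data: (cases, controls) per two-locus genotype combination.
pattern = {(0, 1): (4, 1), (1, 0): (4, 1), (0, 0): (1, 4), (1, 1): (1, 4)}
genotypes, labels = [], []
for g, (n_case, n_ctrl) in pattern.items():
    genotypes += [g] * (n_case + n_ctrl)
    labels += [1] * n_case + [0] * n_ctrl

risk = high_risk_cells(genotypes, labels)
accuracy = balanced_accuracy(genotypes, labels, risk)
print(sorted(risk), accuracy)
```

Note that the accuracy printed here is a training accuracy on the toy data; the testing accuracy reported in the abstract comes from held-out cross-validation folds, which this sketch does not reproduce.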
Decades after the eradication of smallpox, its etiological agent, variola virus (VARV), remains a threat as a potential bioweapon. Outbreaks of smallpox around the time of the global eradication effort exhibited variable case fatality rates (CFRs), likely attributable in part to complex viral genetic determinants of smallpox virulence. We aimed to identify genome-wide single nucleotide polymorphisms associated with CFR. We evaluated unadjusted and outbreak geographic location-adjusted models of single SNPs and two- and three-way interactions between SNPs.
Using the data mining approach multifactor dimensionality reduction (MDR), we identified five VARV SNPs in models significantly associated with CFR. The top performing unadjusted and adjusted models both revealed the same two-way gene-gene interaction. We discuss the biological plausibility of the influence of the SNPs identified in these and other significant models on the strain-specific virulence of VARV.
We have identified genetic loci in the VARV genome that are statistically associated with VARV virulence as measured by CFR. While our ability to infer a causal relationship between the specific SNPs identified in our analysis and VARV virulence is limited, our results suggest that smallpox severity is in part associated with VARV strain variation and that VARV virulence may be determined by multiple genetic loci. This study represents the first application of MDR to the identification of pathogen gene-gene interactions for predicting infectious disease outbreak severity.
Smallpox; Variola virus; Single nucleotide polymorphisms; Multifactor dimensionality reduction
Mitochondrial DNA (mtDNA) variation can affect phenotypic variation; therefore, knowing its distribution within and among individuals is of importance to understanding many human diseases. Intra-individual mtDNA variation (heteroplasmy) has been generally assumed to be random. We used massively parallel sequencing to assess heteroplasmy across ten tissues and demonstrate that in unrelated individuals there are tissue-specific, recurrent mutations. Certain tissues, notably kidney, liver and skeletal muscle, displayed the identical recurrent mutations that were undetectable in other tissues in the same individuals. Using RFLP analyses we validated one of the tissue-specific mutations in the two sequenced individuals and replicated the patterns in two additional individuals. These recurrent mutations all occur within or in very close proximity to sites that regulate mtDNA replication, strongly implying that these variations alter the replication dynamics of the mutated mtDNA genome. These recurrent variants are all independent of each other and do not occur in the mtDNA coding regions. The most parsimonious explanation of the data is that these frequently repeated mutations experience tissue-specific positive selection, probably through replication advantage.
DNA mutations are expected to be formed randomly, thus any reproducible pattern of DNA somatic mutations across multiple individuals or even across organs within each individual is highly unexpected. Using next generation sequencing of multiple tissues from the same individuals we found several somatic mutations in mitochondrial DNA that appear in a heteroplasmic state in all individuals examined, but only in particular tissues. These mutations were only found in known regions of replication control for the mitochondrial DNA. These data imply the presence of tissue-specific positive selection for these variants.
Genetic association studies have become standard approaches to characterize the genetic and epigenetic variability associated with cancer development, including predispositions and mutations. However, the bewildering genetic and phenotypic heterogeneity inherent in cancer both magnifies the conceptual and methodological problems associated with these approaches and renders the translation of available genetic information into a knowledge that is both biologically sound and clinically relevant difficult. Here, we elaborate on the underlying causes of this complexity, illustrate why it represents a challenge for genetic association studies, and briefly discuss how it can be reconciled with the ultimate goal of identifying targetable disease pathways and successfully treating individual patients.
cancer heterogeneity; genetic predispositions; somatic mutations; genetic association studies
The collection and analysis of genomic data has the potential to reveal novel druggable targets by providing insight into the genetic basis of disease. However, the number of drugs targeting new molecular entities that are approved by the US Food and Drug Administration (FDA) has not increased in the years since the collection of genomic data became commonplace. The paucity of translatable results can be partly attributed to conventional analysis methods that test one gene at a time in an effort to identify disease-associated factors as candidate drug targets. By disengaging genetic factors from their position within the genetic regulatory system, much of the information stored within the genomic data set is lost. Here we discuss how genomic data is used to identify disease-associated genes or genomic regions, how disease-associated regions are validated as functional targets, and the role network analysis can play in bridging the gap between data generation and effective drug target identification.
Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. However, such methods are computationally expensive, difficult to adapt to multiple objectives, and unlikely to yield models with a precise form of epistasis which we refer to as pure and strict. Purely and strictly epistatic models constitute the worst-case in terms of detecting disease associations, since such associations may only be observed if all n-loci are included in the disease model. This makes them an attractive gold standard for simulation studies considering complex multi-locus effects.
We introduce GAMETES, a user-friendly software package and algorithm which generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. GAMETES rapidly and precisely generates random, pure, strict n-locus models with specified genetic constraints. These constraints include heritability, minor allele frequencies of the SNPs, and population prevalence. GAMETES also includes a simple dataset simulation strategy which may be utilized to rapidly generate an archive of simulated datasets for given genetic models. We highlight the utility and limitations of GAMETES with an example simulation study using MDR, an algorithm designed to detect epistasis.
GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures. While GAMETES has a limited ability to generate models with higher heritabilities, it is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms. In addition, the GAMETES modeling strategy may be flexibly combined with any dataset simulation strategy. Beyond dataset simulation, GAMETES could be employed to pursue theoretical characterization of genetic models and epistasis.
GAMETES; SNP; Epistasis; Simulation; Model; Genetics
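The constraints mentioned above (heritability, minor allele frequency, prevalence) are all quantities that can be computed from a penetrance table. The sketch below evaluates them for a textbook two-locus model with no marginal effects. It is not the GAMETES algorithm itself; it assumes Hardy-Weinberg genotype frequencies, independent loci, and the commonly used definition of heritability as penetrance variance divided by K(1 - K), where K is prevalence. The penetrance values and function names are illustrative.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for genotypes AA, Aa, aa."""
    p = 1.0 - maf
    return [p * p, 2 * p * maf, maf * maf]

def prevalence(pen, fa, fb):
    """Population prevalence K: frequency-weighted mean penetrance."""
    return sum(fa[i] * fb[j] * pen[i][j] for i in range(3) for j in range(3))

def heritability(pen, fa, fb):
    """Variance of penetrance around K, scaled by K * (1 - K)."""
    k = prevalence(pen, fa, fb)
    var = sum(fa[i] * fb[j] * (pen[i][j] - k) ** 2
              for i in range(3) for j in range(3))
    return var / (k * (1.0 - k))

def marginal_penetrances(pen, fa, fb):
    """Penetrance of each single-locus genotype, averaged over the other locus."""
    locus_a = [sum(fb[j] * pen[i][j] for j in range(3)) for i in range(3)]
    locus_b = [sum(fa[i] * pen[i][j] for i in range(3)) for j in range(3)]
    return locus_a, locus_b

# Illustrative penetrance table (rows: locus A genotypes, columns: locus B).
penetrance = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
freq_a = genotype_freqs(0.5)
freq_b = genotype_freqs(0.5)

K = prevalence(penetrance, freq_a, freq_b)
h2 = heritability(penetrance, freq_a, freq_b)
marg_a, marg_b = marginal_penetrances(penetrance, freq_a, freq_b)
print(K, round(h2, 4), marg_a, marg_b)
```

Every marginal penetrance equals the prevalence in this example, so neither locus is predictive on its own; the association only appears when both loci are considered together, which is what makes such models a hard test case for detection algorithms.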
Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e. number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e. the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. In order to design a simulation study which efficiently takes architecture into account, a reliable metric is needed for model selection.
We evaluate three metrics as predictors of relative model detection difficulty derived from previous works: (1) Penetrance table variance (PTV), (2) customized odds ratio (COR), and (3) our own Ease of Detection Measure (EDM), calculated from the penetrance values and respective genotype frequencies of each simulated genetic model. We evaluate the reliability of these metrics across three very different data search algorithms, each with the capacity to detect epistatic interactions. We find that a model’s EDM and COR are each stronger predictors of model detection success than heritability.
This study formally identifies and evaluates metrics which quantify model detection difficulty. We utilize these metrics to intelligently select models from a population of potential architectures. This allows for an improved simulation study design which accounts for differences in detection difficulty attributed to model architecture. We implement the calculation and utilization of EDM and COR into GAMETES, an algorithm which rapidly and precisely generates pure, strict, n-locus epistatic models.
EDM; COR; GAMETES; SNP; Model detection; Epistasis; Simulation; Model; Genetics
The study of common, complex multifactorial diseases in genetic epidemiology is complicated by nonlinearity in the genotype-to-phenotype mapping relationship that is due, in part, to epistasis or gene-gene interactions. Symbolic discriminant analysis (SDA) is a flexible modeling approach which uses genetic programming (GP) to evolve an optimal predictive model using a predefined collection of mathematical functions, constants, and attributes. This has been shown to be an effective strategy for modeling epistasis. In the present study, we introduce the genetic “mask” as a novel building block which exploits expert knowledge in the form of a pre-constructed relationship between two attributes. The goal of this study was to determine whether the availability of “mask” building blocks improves SDA performance. The results of this study support the idea that pre-processing data improves GP performance.
Genetic Analysis; Genetic Epidemiology; Genetic Programming; Symbolic Discriminant Analysis; Symbolic Regression; Function Set; Two-Locus Model; Genetic Mask
Neonatal sepsis due to intestinal bacterial translocation is a major cause of morbidity and mortality. Understanding microbial colonisation of the gut in prematurity may predict risk of sepsis to guide future strategies to manipulate the microbiome.
Prospective longitudinal study of premature infants. Stool samples were obtained weekly. DNA was extracted and the V6 hypervariable region of 16S rRNA was amplified followed by high throughput pyrosequencing, comparing subjects with and without sepsis.
Six neonates were 24–27 weeks gestation at birth and had 18 samples analysed. Two subjects had no sepsis during the study period, two developed late-onset culture-positive sepsis and two had culture-negative systemic inflammation. 324 350 sequences were obtained. The meconium was not sterile and had predominance of Lactobacillus, Staphylococcus and Enterobacteriales. Overall, infants who developed sepsis began life with low microbial diversity, and acquired a predominance of Staphylococcus, while healthy infants had more diversity and predominance of Clostridium, Klebsiella and Veillonella.
In very low birth weight infants, the authors found that meconium is not sterile and is less diverse from birth in infants who will develop late-onset sepsis. Empiric, prolonged antibiotics profoundly decrease microbial diversity and promote a microbiota that is associated not only with neonatal sepsis, but also with the predominant pathogen previously identified in the microbiome. Our data suggest that there may be a ‘healthy microbiome’ present in extremely premature neonates that may ameliorate risk of sepsis. More research is needed to determine whether altered antibiotics, probiotics or other novel therapies can re-establish a healthy microbiome in neonates.
Cancer is characterized by gene expression aberrations. Studies have largely focused on coding sequences and promoters, despite the fact that distal regulatory elements play a central role in controlling transcription patterns. Here we utilize the histone mark H3K4me1 to analyze gain and loss of enhancer activity genome wide in primary colon cancer lines relative to normal colon crypts. We identified thousands of variant enhancer loci (VELs) that comprise a signature that is robustly predictive of the in vivo colon cancer transcriptome. Furthermore, VELs are enriched in haplotype blocks containing colon cancer genetic risk variants, implicating these genomic regions in colon cancer pathogenesis. We propose that reproducible changes in the epigenome at enhancer elements drive a unique transcriptional program to promote colon carcinogenesis.
Genome-wide data sets are increasingly being used to identify biological pathways and networks underlying complex diseases. In particular, analyzing genomic data through sets defined by functional pathways offers the potential of greater power for discovery and natural connections to biological mechanisms. With the burgeoning availability of next-generation sequencing, this is an opportune moment to revisit strategies for pathway-based analysis of genomic data. Here, we synthesize relevant concepts and extant methodologies to guide investigators in study design and execution. We also highlight ongoing challenges and proposed solutions. As relevant analytical strategies mature, pathways and networks will be ideally placed to integrate data from diverse -omics sources in order to harness the extensive, rich information related to disease and treatment mechanisms.
pathway analysis; gene set; enrichment methods; genome-wide association study; functional annotation; complex diseases
Epistasis is recognized as ubiquitous in the genetic architecture of complex traits such as disease susceptibility. Experimental studies in model organisms have revealed extensive evidence of biological interactions among genes. Meanwhile, statistical and computational studies in human populations have suggested non-additive effects of genetic variation on complex traits. Although these studies form a baseline for understanding the genetic architecture of complex traits, to date they have only considered interactions among a small number of genetic variants. Our goal here is to use network science to determine the extent to which non-additive interactions exist beyond small subsets of genetic variants. We infer statistical epistasis networks to characterize the global space of pairwise interactions among approximately 1500 single nucleotide polymorphisms (SNPs) spanning nearly 500 cancer susceptibility genes in a large population-based study of bladder cancer.
The statistical epistasis network was built by linking pairs of SNPs if their pairwise interactions were stronger than a systematically derived threshold. Its topology clearly differentiated this real-data network from networks obtained from permutations of the same data under the null hypothesis that no association exists between genotype and phenotype. The network had a significantly higher number of hub SNPs and, interestingly, these hub SNPs did not necessarily have high main effects. The network had a largest connected component of 39 SNPs, which was absent from all of the permuted-data networks. In addition, the vertex degrees of this network followed an approximate power-law distribution, and its topology appeared scale-free.
In contrast to many existing techniques focusing on high main-effect SNPs or models of several interacting SNPs, our network approach characterized a global picture of gene-gene interactions in a population-based genetic dataset. The network was built using pairwise interactions, and its distinctive network topology and large connected components indicated joint effects in a large set of SNPs. Our observations suggested that this particular statistical epistasis network captured important features of the genetic architecture of bladder cancer that have not been described previously.
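The network construction described above (link two SNPs when their pairwise interaction exceeds a threshold, then inspect degrees and connected components) can be sketched as follows. The SNP labels, interaction strengths, and threshold value are hypothetical; the abstract does not state how interaction strength is measured or how the threshold is derived, so both are taken as given inputs here.

```python
from collections import defaultdict, deque

def build_network(strengths, threshold):
    """Adjacency sets for SNP pairs whose interaction strength exceeds threshold."""
    adjacency = defaultdict(set)
    for (a, b), value in strengths.items():
        if value > threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Connected components found by breadth-first search, largest first."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical pairwise interaction strengths between six SNPs.
strengths = {
    ("snp1", "snp2"): 0.030, ("snp2", "snp3"): 0.025,
    ("snp2", "snp4"): 0.022, ("snp5", "snp6"): 0.021,
    ("snp1", "snp6"): 0.004, ("snp3", "snp5"): 0.006,
}
network = build_network(strengths, threshold=0.02)
components = connected_components(network)
degrees = {snp: len(neighbors) for snp, neighbors in network.items()}
print(len(components[0]), degrees)
```

Running the same steps on label-permuted data and comparing the largest component size and degree distribution is one way to judge, as the study does, whether the observed topology departs from what the null hypothesis would produce.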
A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models.
Here we develop and evaluate a model-free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model-free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight hundred Pareto fronts, one for each independent run of our algorithm. In each run the predictiveness of single genetic variants and pairs of genetic variants is minimized, while the predictiveness of third-, fourth-, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive fourth- or fifth-order interactions and minimized lower-level effects.
This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
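For readers unfamiliar with Pareto fronts, the sketch below shows how a non-dominated set is selected when one objective is minimized (predictiveness of low-order effects) and another is maximized (predictiveness of the high-order combination). It illustrates only the selection step, not the evolution strategy itself, and the candidate scores are hypothetical.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives and differs.

    Each candidate is (low_order_score, high_order_score); the first value
    is minimized and the second is maximized.
    """
    return a[0] <= b[0] and a[1] >= b[1] and a != b

def pareto_front(candidates):
    """Candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Hypothetical (low-order accuracy, high-order accuracy) scores for six datasets.
candidates = [
    (0.52, 0.60), (0.51, 0.58), (0.55, 0.70),
    (0.53, 0.59), (0.56, 0.69), (0.50, 0.55),
]
front = pareto_front(candidates)
print(sorted(front))
```

Each dataset on the front represents a different trade-off: none can lower its low-order signal without also giving up some high-order signal.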