Association studies have identified several signals at the LRRK2 locus for Parkinson's disease (PD), Crohn's disease (CD) and leprosy. However, little is known about the molecular mechanisms mediating these effects. To further characterize this locus, we fine-mapped the risk association in 5,802 PD cases and 5,556 controls using a dense genotyping array (ImmunoChip). Using samples from 134 post-mortem control adult human brains (UK Human Brain Expression Consortium), where up to ten brain regions were available per individual, we studied the regional variation, splicing and regulation of LRRK2. We found convincing evidence for a common variant PD association located outside of the LRRK2 protein coding region (rs117762348, A>G, P = 2.56×10⁻⁸, case/control MAF 0.083/0.074, odds ratio 0.86 for the minor allele with 95% confidence interval [0.80–0.91]). We show that mRNA expression levels are highest in cortical regions and lowest in cerebellum. We find an exon quantitative trait locus (eQTL) in brain samples that localizes to exons 32–33 and investigate the molecular basis of this eQTL using RNA-Seq data from n = 8 brain samples. The genotype underlying this eQTL is in strong linkage disequilibrium with the CD-associated non-synonymous SNP rs3761863 (M2397T). We found two additional QTLs in liver and monocyte samples, but none of these explained the common variant PD association at rs117762348. Our results characterize the LRRK2 locus, and highlight the importance and difficulties of fine-mapping and integration of multiple datasets to delineate pathogenic variants and thus develop an understanding of disease mechanisms.
Dramatic improvements in DNA sequencing technology have revolutionized our ability to characterize most genomic diversity. However, accurate resolution of large structural events has remained challenging due to the comparatively short read lengths of second-generation technologies. Emerging third-generation sequencing technologies, which yield markedly increased read length on rapid time scales and for low cost, have the potential to address assembly limitations. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at >99.9% accuracy. Complex regions with clinically significant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 reference we obtain 14 and 8 scaffolds greater than 1 kb, respectively, correcting several errors in the underlying source data. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.
Coronary heart disease (CHD) is the leading cause of mortality in both developed and developing countries worldwide. Genome-wide association studies (GWAS) have now identified 46 independent susceptibility loci for CHD; however, the biological and disease-relevant mechanisms for these associations remain elusive. A large-scale meta-analysis of GWAS recently identified in Caucasians a CHD-associated locus at chromosome 6q23.2, a region containing the transcription factor TCF21 gene. TCF21 (Capsulin/Pod1/Epicardin) is a member of the basic-helix-loop-helix (bHLH) transcription factor family, and regulates cell fate decisions and differentiation in the developing coronary vasculature. Herein, we characterize a cis-regulatory mechanism by which the lead polymorphism rs12190287 disrupts an atypical activator protein 1 (AP-1) element, as demonstrated by allele-specific transcriptional regulation, transcription factor binding, and chromatin organization, leading to altered TCF21 expression. Further, this element is shown to mediate signaling through platelet-derived growth factor receptor beta (PDGFR-β) and Wilms tumor 1 (WT1) pathways. A second disease allele identified in East Asians also appears to disrupt an AP-1-like element. Thus, both disease-related growth factor and embryonic signaling pathways may regulate CHD risk through two independent alleles at TCF21.
As much as half of the risk of developing coronary heart disease is genetically predetermined. Genome-wide association studies in human populations have now uncovered multiple sites of common genetic variation associated with heart disease. However, the biological mechanisms responsible for linking the disease associations with changes in gene expression are still underexplored. One of these variants occurs within the vascular developmental factor, TCF21, leading to dysregulated gene expression. Using various in silico and molecular approaches, we identify an intricate allele-specific regulatory mechanism underlying altered expression of TCF21. Notably, we observe that two apparently independent risk alleles identified in distinct populations function through a similar regulatory mechanism. Together these data suggest that conserved upstream pathways may organize the complex genetic etiology of coronary heart disease and potentially lead to new treatment opportunities.
Breast cancer is the most common malignancy in women and is responsible for hundreds of thousands of deaths annually. As with most cancers, it is a heterogeneous disease and different breast cancer subtypes are treated differently. Understanding the difference in prognosis for breast cancer based on its molecular and phenotypic features is one avenue for improving treatment by matching the proper treatment with molecular subtypes of the disease. In this work, we employed a competition-based approach to modeling breast cancer prognosis using large datasets containing genomic and clinical information and an online real-time leaderboard program used to speed feedback to the modeling team and to encourage each modeler to work towards achieving a higher ranked submission. We find that machine learning methods combined with molecular features selected based on expert prior knowledge can improve survival predictions compared to current best-in-class methodologies and that ensemble models trained across multiple user submissions systematically outperform individual models within the ensemble. We also find that model scores are highly consistent across multiple independent evaluations. This study serves as the pilot phase of a much larger competition open to the whole research community, with the goal of understanding general strategies for model optimization using clinical and molecular profiling data and providing an objective, transparent system for assessing prognostic models.
We developed an extensible software framework for sharing molecular prognostic models of breast cancer survival in a transparent collaborative environment and subjecting each model to automated evaluation using objective metrics. The computational framework presented in this study, our detailed post-hoc analysis of hundreds of modeling approaches, and the use of a novel cutting-edge data resource together represent one of the largest-scale systematic studies to date assessing the factors influencing accuracy of molecular-based prognostic models in breast cancer. Our results demonstrate the ability to infer prognostic models with accuracy on par with or greater than that of previously reported studies, with significant performance improvements from state-of-the-art machine learning approaches trained on clinical covariates. Our results also demonstrate the difficulty of incorporating molecular data to achieve substantial performance improvements over clinical covariates alone. However, improvement was achieved by combining clinical feature data with intelligent selection of important molecular features based on domain-specific prior knowledge. We observe that ensemble models aggregating the information across many diverse models achieve among the highest scores of all models and systematically outperform individual models within the ensemble, suggesting a general strategy for leveraging the wisdom of crowds to develop robust predictive models.
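To make the ensemble idea concrete, the sketch below averages the rank-transformed risk predictions of several models and scores the result with a concordance index for right-censored survival data. It is only an illustration under stated assumptions: the abstract does not name the evaluation metric or the aggregation rule, so the choice of concordance index, the rank-averaging scheme, and the function names `concordance_index`, `ordinal_ranks` and `rank_average` are all hypothetical.

```python
def concordance_index(times, events, risk):
    """Fraction of comparable patient pairs ordered correctly by predicted risk.

    A pair (i, j) is comparable when patient i has the shorter follow-up time
    and an observed event (events[i] is truthy). Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


def ordinal_ranks(values):
    """Rank values from 0 (lowest) upward; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def rank_average(predictions):
    """Ensemble risk score: mean rank of each patient across all models."""
    all_ranks = [ordinal_ranks(p) for p in predictions]
    return [sum(column) / len(all_ranks) for column in zip(*all_ranks)]


if __name__ == "__main__":
    # Toy data (hypothetical): four patients, two models.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    model_a = [0.9, 0.5, 0.2, 0.1]
    model_b = [10, 7, 8, 1]
    ensemble = rank_average([model_a, model_b])
    print(concordance_index(times, events, ensemble))
```

Rank averaging is used here because it puts models with different output scales on a common footing; other aggregation rules would fit the same interface.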
DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction. This makes it possible to estimate the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model can greatly increase DNA modification detection accuracy and reduce the required coverage of control data. For some DNA modifications that have a strong signal, a control sample is not needed at all when historical data are used as an alternative to the control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in an R package named seqPatch, which is available at https://github.com/zhixingfeng/seqPatch.
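The way historical data can supplement or replace control coverage can be illustrated with a textbook normal-normal conjugate update for the expected kinetic value of a single sequence context. This is a deliberately simplified sketch, not the hierarchical model implemented in seqPatch: the function name `posterior_kinetic_mean`, the assumption of a single known per-measurement variance, and all numbers in the usage example are hypothetical.

```python
def posterior_kinetic_mean(hist_mean, hist_var, control_values, read_var):
    """Shrink the control-sample mean toward the historical mean for one context.

    hist_mean, hist_var: mean and variance of the context-level kinetic value
        estimated from historical data (acts as the prior).
    control_values: kinetic measurements from the control sample (may be empty).
    read_var: assumed known variance of a single measurement.
    Returns (posterior mean, posterior variance).
    """
    n = len(control_values)
    prior_precision = 1.0 / hist_var
    data_precision = n / read_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (hist_mean * prior_precision
                            + sum(control_values) / read_var)
    return post_mean, post_var


if __name__ == "__main__":
    # With no control reads the estimate falls back to the historical mean;
    # with more control reads it moves toward the control-sample mean.
    print(posterior_kinetic_mean(1.0, 0.25, [], 1.0))
    print(posterior_kinetic_mean(1.0, 0.25, [2.0] * 4, 1.0))
```

Under these assumptions, a context with no control coverage still receives an expected kinetic value, which is the sense in which historical data can stand in for a control sample.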
DNA modifications have been found in a wide range of living organisms, from bacteria to humans. Many existing studies have shown that they play important roles in development, disease, bacterial virulence, etc. However, for many types of DNA modification, for example N6-methyladenine and 8-oxoG, there is no efficient and accurate detection method. Single molecule real time (SMRT) sequencing not only generates DNA sequences, but also generates DNA polymerase kinetic information. The kinetic information is sensitive to DNA modifications in the sequenced DNA template, and can therefore be used for detecting a wide range of DNA modification types. The usual detection strategy is a case-control method, which compares kinetic information between a native sample and a control sample whose modifications have been removed. However, generating a control sample doubles the cost. We propose a hierarchical model that incorporates existing SMRT sequencing data to increase detection accuracy and reduce the coverage required of the control sample, or even avoid the need for a control sample in some cases. We tested our method on SMRT sequencing data from plasmids with known modified sites and from E. coli K-12 strain to demonstrate that it can greatly increase detection accuracy and reduce sequencing cost.
We report a systems genetics analysis of high-density lipoprotein (HDL) levels in an F2 intercross between inbred strains CAST/EiJ and C57BL/6J. We previously showed that there are dramatic differences in HDL metabolism in a cross between these strains, and we now report co-expression network analysis of HDL that integrates global expression data from liver and adipose with relevant metabolic traits. Using data from a total of 293 F2 intercross mice, we constructed weighted gene co-expression networks and identified modules (subnetworks) associated with HDL and clinical traits. These were examined for genes implicated in HDL levels based on large human genome-wide association studies (GWAS) and examined with respect to conservation between tissues and sexes in a total of 9 data sets. We identify genes that are consistently ranked high by association with HDL across the 9 data sets. We focus in particular on two genes, Wfdc2 and Hdac3, that are located in close proximity to HDL QTL peaks where causal testing indicates that they may affect HDL. Our results provide a rich resource for studies of complex metabolic interactions involving HDL.
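For readers unfamiliar with weighted gene co-expression networks, the core step can be sketched as follows: pairwise Pearson correlations between expression profiles are raised to a soft-thresholding power to give a weighted adjacency, and each gene's connectivity is the sum of its adjacencies. This is a generic illustration, not the analysis pipeline of the study; the unsigned form, the power `beta=6`, the function names and the toy expression values are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency a_ij = |cor(x_i, x_j)| ** beta for i != j.

    expr maps gene name -> list of expression values across samples.
    """
    genes = sorted(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g in genes for h in genes if g != h
    }


def connectivity(adjacency):
    """Sum of adjacencies for each gene (its weighted degree)."""
    k = {}
    for (g, _), a in adjacency.items():
        k[g] = k.get(g, 0.0) + a
    return k


if __name__ == "__main__":
    # Hypothetical expression profiles for three genes in four samples.
    expr = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 3, 2, 4]}
    adj = soft_threshold_adjacency(expr)
    print(connectivity(adj))
```

Raising correlations to a power suppresses weak correlations while preserving a continuous measure of connection strength; modules are then typically found by clustering genes on a dissimilarity derived from this adjacency.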
In the bacterial world, methylation is most commonly associated with restriction-modification systems that provide a defense mechanism against invading foreign genomes. In addition, it is known that methylation plays functionally important roles, including timing of DNA replication, chromosome partitioning, DNA repair, and regulation of gene expression. However, full DNA methylome analyses are scarce due to a lack of a simple methodology for rapid and sensitive detection of common epigenetic marks (i.e., N6-methyladenine (6mA) and N4-methylcytosine (4mC)) in these organisms. Here, we use Single-Molecule Real-Time (SMRT) sequencing to determine the methylomes of two related human pathogen species, Mycoplasma genitalium G-37 and Mycoplasma pneumoniae M129, with single-base resolution. Our analysis identified two new methylation motifs not previously described in bacteria: a widespread 6mA methylation motif common to both bacteria (5′-CTAT-3′), as well as a more complex Type I 6mA sequence motif in M. pneumoniae (5′-GAN7TAY-3′/3′-CTN7ATR-5′). We identify the methyltransferase responsible for the common motif and suggest the one involved in M. pneumoniae only. Analysis of the distribution of methylation sites across the genome of M. pneumoniae suggests a potential role for methylation in regulating the cell cycle, as well as in regulation of gene expression. To our knowledge, this is one of the first direct methylome profiling studies with single-base resolution from a bacterial organism.
DNA methylation in bacteria plays important roles in cell division, DNA repair, regulation of gene expression, and pathogenesis. Here, we use a novel sequencing technique, Single-Molecule Real-Time (SMRT) sequencing, to determine the methylomes of two related human pathogen species, Mycoplasma genitalium G-37 and Mycoplasma pneumoniae M129. Our analysis identified two novel methylation motifs, one of them present uniquely in M. pneumoniae and the other common to both bacteria. We also identify the methyltransferase responsible for the common methylation motif and suggest the one associated with the M. pneumoniae unique motif. Functional analysis of the data suggests a potential role for methylation in regulating the cell cycle of M. pneumoniae, as well as in regulation of gene expression. To our knowledge, this is one of the first genome-wide approaches to study the biological role of methylation in a bacterial organism.
Complex diseases result from molecular changes induced by multiple genetic factors and the environment. To derive a systems view of how genetic loci interact in the context of tissue-specific molecular networks, we constructed an F2 intercross comprised of >500 mice from diabetes-resistant (B6) and diabetes-susceptible (BTBR) mouse strains made genetically obese by the Leptin^ob/ob mutation (Lep^ob). High-density genotypes, diabetes-related clinical traits, and whole-transcriptome expression profiling in five tissues (white adipose, liver, pancreatic islets, hypothalamus, and gastrocnemius muscle) were determined for all mice. We performed an integrative analysis to investigate the inter-relationship among genetic factors, expression traits, and plasma insulin, a hallmark diabetes trait. Among the five tissues under study, there are extensive protein–protein interactions between genes responding to different loci in adipose and pancreatic islets that potentially participate jointly in the regulation of plasma insulin. We developed a novel ranking scheme based on cross-loci protein–protein network topology and gene expression to assess each gene's potential to regulate plasma insulin. Unique candidate genes were identified in adipose tissue and islets. In islets, the Alzheimer's gene App was identified as a top candidate regulator. Islets from 17-week-old, but not 10-week-old, App knockout mice showed increased insulin secretion in response to glucose or a membrane-permeant cAMP analog, in agreement with the predictions of the network model. Our result provides a novel hypothesis on the mechanism for the connection between two aging-related diseases: Alzheimer's disease and type 2 diabetes.
Alzheimer's disease and type 2 diabetes are two common aging-related diseases. Numerous studies have shown that the two diseases are associated. However, the mechanisms underlying this connection are not clear. Both are complex diseases that are induced by multiple genetic factors and the environment. To understand the molecular network regulated by complex genetic factors causing type 2 diabetes, we constructed an F2 intercross comprised of >500 mice from diabetes-resistant and diabetic mouse strains. We measured genotypes, clinical traits, and expression profiles in five tissues for each mouse. We then performed an integrative analysis to investigate the inter-relationship among genetic factors, expression traits, and plasma insulin, a hallmark diabetes trait, and developed a novel method for inferring key regulators of plasma insulin. In islets, the Alzheimer's gene App was identified as a top candidate regulator. Islets from 17-week-old, but not 10-week-old, App knockout mice showed increased insulin secretion in response to glucose, in agreement with the predictions of the network model. Our result provides a novel hypothesis on the mechanism for the connection between two aging-related diseases: Alzheimer's disease and type 2 diabetes.
A human genome-wide linkage scan for obesity identified a linkage peak on chromosome 5q13–15. Positional cloning revealed an association of a rare haplotype to high body-mass index (BMI) in males but not females. The risk locus contains a single gene, “arrestin domain containing 3” (ARRDC3), an uncharacterized α-arrestin. Inactivating Arrdc3 in mice led to a striking resistance to obesity, with greater impact on male mice. Mice with decreased ARRDC3 levels were protected from obesity due to increased energy expenditure through increased activity levels and increased thermogenesis of both brown and white adipose tissues. ARRDC3 interacted directly with β-adrenergic receptors, and loss of ARRDC3 increased the response to β-adrenergic stimulation in isolated adipose tissue. These results demonstrate that ARRDC3 is a gender-sensitive regulator of obesity and energy expenditure and reveal a surprising diversity for arrestin family protein functions.
Alternative RNA splicing greatly expands the repertoire of proteins encoded by genomes. Next-generation sequencing (NGS) is attractive for studying alternative splicing because of the efficiency and low cost per base, but short reads typical of NGS only report mRNA fragments containing one or a few splice junctions. Here, we used single-molecule amplification and long-read sequencing to study the HIV-1 provirus, which is only 9700 bp in length, but encodes nine major proteins via alternative splicing. Our data showed that the clinical isolate HIV-1 89.6 produces at least 109 different spliced RNAs, including a previously unappreciated ∼1 kb class of messages, two of which encode new proteins. HIV-1 message populations differed between cell types, longitudinally during infection, and among T cells from different human donors. These findings open a new window on a little studied aspect of HIV-1 replication, suggest therapeutic opportunities and provide advanced tools for the study of alternative splicing.
Inference about regulatory networks from high-throughput genomics data is of great interest in systems biology. We present a Bayesian approach to infer gene regulatory networks from time series expression data by integrating various types of biological knowledge.
We formulate network construction as a series of variable selection problems and use linear regression to model the data. We extend the Bayesian model averaging (BMA) variable selection method to select regulators in the regression framework, and we summarize the external biological knowledge with an informative prior probability distribution over the candidate regression models.
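The general idea of regulator selection by model averaging with an informative prior can be sketched as below. For one target gene, every small subset of candidate regulators is fitted by least squares, each model is weighted by a BIC approximation to its marginal likelihood multiplied by a prior built from per-regulator inclusion probabilities, and the weights are summed to give posterior inclusion probabilities. This is an illustrative stand-in, not the extended BMA procedure of the paper: exhaustive enumeration, the BIC approximation, the independent-inclusion prior, and the names `ols_rss` and `inclusion_probabilities` are all assumptions.

```python
import itertools
import math


def ols_rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept
    plus the given predictor columns (normal equations, Gaussian elimination).
    Assumes the predictors are not collinear."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in columns]
    p = len(cols)
    # Augmented matrix [X'X | X'y].
    m = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(p)]
         + [sum(cols[i][t] * y[t] for t in range(n))] for i in range(p)]
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, p):
            factor = m[r][k] / m[k][k]
            for j in range(k, p + 1):
                m[r][j] -= factor * m[k][j]
    beta = [0.0] * p
    for k in reversed(range(p)):
        tail = sum(m[k][j] * beta[j] for j in range(k + 1, p))
        beta[k] = (m[k][p] - tail) / m[k][k]
    fitted = [sum(beta[i] * cols[i][t] for i in range(p)) for t in range(n)]
    return sum((y[t] - fitted[t]) ** 2 for t in range(n))


def inclusion_probabilities(x, y, prior, max_size=2):
    """Posterior inclusion probability of each candidate regulator.

    x: dict regulator name -> list of expression values.
    y: list of target-gene expression values.
    prior: dict regulator name -> prior inclusion probability in (0, 1),
        e.g. derived from external biological knowledge.
    """
    names = sorted(x)
    n = len(y)
    log_weight = {}
    for size in range(max_size + 1):
        for model in itertools.combinations(names, size):
            rss = ols_rss([x[g] for g in model], y)
            bic = n * math.log(rss / n) + (size + 1) * math.log(n)
            log_prior = sum(
                math.log(prior[g]) if g in model else math.log(1.0 - prior[g])
                for g in names)
            log_weight[model] = -0.5 * bic + log_prior
    top = max(log_weight.values())
    weight = {mdl: math.exp(v - top) for mdl, v in log_weight.items()}
    total = sum(weight.values())
    return {g: sum(w for mdl, w in weight.items() if g in mdl) / total
            for g in names}
```

Raising a regulator's prior inclusion probability increases its posterior inclusion probability for the same data, which is how external knowledge would enter the selection in this simplified setting.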
We demonstrate our method on simulated data and a set of time-series microarray experiments measuring the effect of a drug perturbation on gene expression levels, and show that it outperforms leading regression-based methods in the literature.
Systems biology; Network inference; Data integration; Statistics; Time-series expression data; Model uncertainty
A common inflammatome signature, as well as disease-specific expression patterns, was identified from 11 different rodent inflammatory disease models. Causal regulatory networks and the drivers of the inflammatome signature were uncovered and validated.
Representative inflammatome gene signatures, as well as disease model-specific gene signatures, were identified from 12 gene expression profiling data sets derived from 9 different tissues isolated from 11 rodent inflammatory disease models. The inflammatome signature is highly enriched for immune response-related genes, disease causal genes, and drug targets. Regulatory relationships among the inflammatome signature genes were examined in over 70 causal networks derived from a number of large-scale genetic studies of multiple diseases, and the potential key drivers were uncovered and validated prospectively. Over 70% of the inflammatome signature genes and over 50% of the key driver genes have not been reported in previous studies of common signatures in inflammatory conditions.
Common inflammatome gene signatures as well as disease-specific signatures were identified by analyzing 12 expression profiling data sets derived from 9 different tissues isolated from 11 rodent inflammatory disease models. The inflammatome signature significantly overlaps with known drug targets and co-expressed gene modules linked to metabolic disorders and cancer. A large proportion of genes in this signature are tightly connected in tissue-specific Bayesian networks (BNs) built from multiple independent mouse and human cohorts. Both the inflammatome signature and the corresponding consensus BNs are highly enriched for immune response-related genes supported as causal for adiposity, adipokine, diabetes, aortic lesion, bone, muscle, and cholesterol traits, suggesting the causal nature of the inflammatome for a variety of diseases. Integration of this inflammatome signature with the BNs uncovered 151 key drivers that appeared to be more biologically important than the non-drivers in terms of their impact on disease phenotypes. The identification of this inflammatome signature, its network architecture, and key drivers not only highlights the shared etiology but also pinpoints potential targets for intervention of various common diseases.
Bayesian network; co-expression network; inflammatome; inflammatory diseases; key regulators
Effective targeted cancer therapeutic development depends upon distinguishing disease-associated ‘driver’ mutations, which have causative roles in malignancy pathogenesis, from ‘passenger’ mutations, which are dispensable for cancer initiation and maintenance. Translational studies of clinically active targeted therapeutics can definitively discriminate driver from passenger lesions and provide valuable insights into human cancer biology. Activating internal tandem duplication (ITD) mutations in FLT3 (FLT3-ITD) are detected in approximately 20% of acute myeloid leukaemia (AML) patients and are associated with a poor prognosis [1]. Abundant scientific [2] and clinical evidence [1,3], including the lack of convincing clinical activity of early FLT3 inhibitors [4,5], suggests that FLT3-ITD probably represents a passenger lesion. Here we report point mutations at three residues within the kinase domain of FLT3-ITD that confer substantial in vitro resistance to AC220 (quizartinib), an active investigational inhibitor of FLT3, KIT, PDGFRA, PDGFRB and RET [6,7]; evolution of AC220-resistant substitutions at two of these amino acid positions was observed in eight of eight FLT3-ITD-positive AML patients with acquired resistance to AC220. Our findings demonstrate that FLT3-ITD can represent a driver lesion and valid therapeutic target in human AML. AC220-resistant FLT3 kinase domain mutants represent high-value targets for future FLT3 inhibitor development efforts.
Motivation: The identification of condition specific sub-networks from gene expression profiles has important biological applications, ranging from the selection of disease-related biomarkers to the discovery of pathway alterations across different phenotypes. Although many methods exist for extracting these sub-networks, very few existing approaches simultaneously consider both the differential expression of individual genes and the differential correlation of gene pairs, losing potentially valuable information in the data.
Results: In this article, we propose a new method, COSINE (COndition SpecIfic sub-NEtwork), which employs a scoring function that jointly measures the condition-specific changes of both ‘nodes’ (individual genes) and ‘edges’ (gene–gene co-expression). It uses a genetic algorithm to search for the single optimal sub-network that maximizes the scoring function. We applied COSINE to simulated datasets with various differential expression patterns and to three real datasets: a prostate cancer dataset, a dataset from an across-tissue comparison of morbidly obese patients, and a dataset from an across-population comparison of the HapMap samples. Compared with previous methods, COSINE is more powerful in identifying truly significant sub-networks of appropriate size and meaningful biological relevance.
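The notion of a score that rewards both differentially expressed nodes and differentially correlated edges can be illustrated with the sketch below. It is not the COSINE scoring function or its search: the weighted average of mean node score and mean edge score, the weight `lam`, the function names, and the toy scores are all hypothetical, and a brute-force enumeration over small gene sets replaces the genetic algorithm so that the example stays deterministic.

```python
import itertools


def joint_score(genes, node_score, edge_score, lam=0.5):
    """Weighted combination of mean node score and mean edge score.

    node_score: dict gene -> condition-specific change of that gene.
    edge_score: dict (gene, gene) sorted tuple -> change in co-expression;
        pairs missing from the dict contribute 0.
    lam: weight on the node term; (1 - lam) goes to the edge term.
    """
    genes = sorted(genes)
    pairs = list(itertools.combinations(genes, 2))
    node_term = sum(node_score[g] for g in genes) / len(genes)
    edge_term = (sum(edge_score.get(p, 0.0) for p in pairs) / len(pairs)
                 if pairs else 0.0)
    return lam * node_term + (1.0 - lam) * edge_term


def best_subnetwork(node_score, edge_score, lam=0.5, min_size=2, max_size=3):
    """Exhaustively search gene sets of bounded size for the highest score.

    Only feasible for tiny examples; a heuristic search such as a genetic
    algorithm is needed at genome scale.
    """
    best, best_score = None, float("-inf")
    for size in range(min_size, max_size + 1):
        for genes in itertools.combinations(sorted(node_score), size):
            score = joint_score(genes, node_score, edge_score, lam)
            if score > best_score:
                best, best_score = genes, score
    return best, best_score


if __name__ == "__main__":
    # Hypothetical scores for four genes.
    node_score = {"A": 3.0, "B": 2.5, "C": 0.1, "D": 0.2}
    edge_score = {("A", "B"): 2.0, ("A", "C"): 0.1,
                  ("B", "C"): 0.1, ("C", "D"): 0.3}
    print(best_subnetwork(node_score, edge_score))
```

Setting `lam` to 1 or 0 reduces the score to a node-only or edge-only criterion, which is the kind of single-source approach the Motivation paragraph describes as losing information.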
Availability: The R code is available as the COSINE package on CRAN: http://cran.r-project.org/web/packages/COSINE/index.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
DNA variation can be used as a systematic source of perturbation in segregating populations as a way to infer regulatory networks via the integration of large-scale, high-dimensional molecular profiling data.
Cells employ multiple levels of regulation, including transcriptional and translational regulation, that drive core biological processes and enable cells to respond to genetic and environmental changes. Small-molecule metabolites are one category of critical cellular intermediates that can both influence and be targets of cellular regulation. Because metabolites represent the direct output of protein-mediated cellular processes, endogenous metabolite concentrations can closely reflect cellular physiological states, especially when integrated with other molecular-profiling data. Here we develop and apply a network reconstruction approach that simultaneously integrates six different types of data: endogenous metabolite concentration, RNA expression, DNA variation, DNA–protein binding, protein–metabolite interaction, and protein–protein interaction data, to construct probabilistic causal networks that elucidate the complexity of cell regulation in a segregating yeast population. Because many of the metabolites are found to be under strong genetic control, we were able to employ a causal regulator detection algorithm to identify causal regulators of the resulting network that elucidated the mechanisms by which variations in their sequence affect gene expression and metabolite concentrations. We examined all four expression quantitative trait loci (eQTL) hot spots with colocalized metabolite QTLs, two of which recapitulated known biological processes, while the other two elucidated novel putative biological mechanisms for the eQTL hot spots.
It is now possible to score variations in DNA across whole genomes, RNA levels and alternative isoforms, metabolite levels, protein levels and protein state information, protein–protein interactions, and protein–DNA interactions, in a comprehensive fashion in populations of individuals. Interactions among these molecular entities define the complex web of biological processes that give rise to all higher order phenotypes, including disease. The development of analytical approaches that simultaneously integrate different dimensions of data is essential if we are to extract the meaning from large-scale data to elucidate the complexity of living systems. Here, we use a novel Bayesian network reconstruction algorithm that simultaneously integrates DNA variation, RNA levels, metabolite levels, protein–protein interaction data, protein–DNA binding data, and protein–small-molecule interaction data to construct molecular networks in yeast. We demonstrate that these networks can be used to infer causal relationships among genes, enabling the identification of novel genes that modulate cellular regulation. We show that our network predictions either recapitulate known biology or can be prospectively validated, demonstrating a high degree of accuracy in the predicted network.
Expression quantitative trait loci (eQTL), or genetic variants associated with changes in gene expression, have the potential to assist in interpreting results of genome-wide association studies (GWAS). eQTLs also have varying degrees of tissue specificity. By correlating the statistical significance of eQTLs mapped in various tissue types to their odds ratios reported in a large GWAS by the Wellcome Trust Case Control Consortium (WTCCC), we discovered that there is a significant association between diseases studied genetically and their relevant tissues. This suggests that eQTL data sets can be used to determine tissues that play a role in the pathogenesis of a disease, thereby highlighting these tissue types for further post-GWAS functional studies.
The genetic determinants of variation in iron status are actively sought, but remain incompletely understood. Meta-analysis of two genome-wide association (GWA) studies and replication in three independent cohorts was performed to identify genetic loci associated in the general population with serum levels of iron and markers of iron status, including transferrin, ferritin, soluble transferrin receptor (sTfR) and sTfR–ferritin index. We identified and replicated a novel association of a common variant in the type-2 transferrin receptor (TFR2) gene with iron levels, with effect sizes highly consistent across samples. In addition, we identified and replicated an association between the HFE locus and ferritin and confirmed previously reported associations with the TF, TMPRSS6 and HFE genes. The five replicated variants were tested for association with expression levels of the corresponding genes in a publicly available data set of human liver samples, and nominally statistically significant expression differences by genotype were observed for all genes, although only rs3811647 in the TF gene survived the Bonferroni correction for multiple testing. In addition, we measured for the first time the effects of the common variant in TMPRSS6, rs4820268, on hepcidin mRNA in peripheral blood (n = 83 individuals) and on hepcidin levels in urine (n = 529) and observed an association in the same direction, though only borderline significant. These functional findings require confirmation in further studies with larger sample sizes, but they suggest that common variants in TMPRSS6 could modify the hepcidin-iron feedback loop in clinically unaffected individuals, thus making them more susceptible to imbalances of iron homeostasis.
A large outbreak of diarrhea and the hemolytic–uremic syndrome caused by an unusual serotype of Shiga-toxin–producing Escherichia coli (O104:H4) began in Germany in May 2011. As of July 22, a large number of cases of diarrhea caused by Shiga-toxin–producing E. coli have been reported — 3167 without the hemolytic–uremic syndrome (16 deaths) and 908 with the hemolytic–uremic syndrome (34 deaths) — indicating that this strain is notably more virulent than most of the Shiga-toxin–producing E. coli strains. Preliminary genetic characterization of the outbreak strain suggested that, unlike most of these strains, it should be classified within the enteroaggregative pathotype of E. coli.
We used third-generation, single-molecule, real-time DNA sequencing to determine the complete genome sequence of the German outbreak strain, as well as the genome sequences of seven diarrhea-associated enteroaggregative E. coli serotype O104:H4 strains from Africa and four enteroaggregative E. coli reference strains belonging to other serotypes. Genomewide comparisons were performed with the use of these enteroaggregative E. coli genomes, as well as those of 40 previously sequenced E. coli isolates.
The enteroaggregative E. coli O104:H4 strains are closely related and form a distinct clade among E. coli and enteroaggregative E. coli strains. However, the genome of the German outbreak strain can be distinguished from those of other O104:H4 strains because it contains a prophage encoding Shiga toxin 2 and a distinct set of additional virulence and antibiotic-resistance factors.
Our findings suggest that horizontal genetic exchange allowed for the emergence of the highly virulent Shiga-toxin–producing enteroaggregative E. coli O104:H4 strain that caused the German outbreak. More broadly, these findings highlight the way in which the plasticity of bacterial genomes facilitates the emergence of new pathogens.
The prognosis of hepatocellular carcinoma (HCC) varies widely following surgical resection, and this variation remains largely unexplained. Studies have revealed the ability of clinicopathologic parameters and gene expression to predict HCC prognosis. However, there has been little systematic effort to compare the performance of these two types of predictors or combine them in a comprehensive model.
Tumor and adjacent non-tumor liver tissues were collected from 272 ethnic Chinese HCC patients who received curative surgery. We combined clinicopathologic parameters and gene expression data (from both tissue types) in predicting HCC prognosis. Cross-validation and independent studies were employed to assess prediction.
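The cross-validation step mentioned above can be sketched as a simple k-fold split over patient indices. The text does not state the cross-validation scheme or the number of folds, so the contiguous k-fold layout and k = 5 are assumptions for illustration; only the cohort size of 272 comes from the text. In practice the patient order would be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Fold sizes differ by at most one when n_samples is not divisible by k.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# 272 patients as in the text; 5 folds is an assumed choice.
folds = list(k_fold_indices(272, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each patient appears in exactly one test fold, so every prediction is made by a model that did not see that patient during training.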
HCC prognosis was significantly associated with six clinicopathologic parameters, which can partition the patients into good- and poor-prognosis groups. Within each group, gene expression data further divide patients into distinct prognostic subgroups. Our predictive genes significantly overlap with previously published gene sets predictive of prognosis. Moreover, the predictive genes were enriched for genes that underwent normal-to-tumor gene network transformation. Previously documented liver eSNPs underlying the HCC predictive gene signatures were enriched for SNPs associated with HCC prognosis, providing support that these genes are involved in key processes of tumorigenesis.
When applied individually, clinicopathologic parameters and gene expression offered similar predictive power for HCC prognosis. In contrast, a combination of the two types of data dramatically improved the power to predict HCC prognosis. Our results also provided a framework for understanding the impact of gene expression on the processes of tumorigenesis and clinical outcome.
Complex diseases such as obesity and type II diabetes can result from a failure in multiple organ systems including the central nervous system and tissues involved in partitioning and disposal of nutrients. Studying the genetics of gene expression in tissues that are involved in the development of these diseases can provide insights into how these tissues interact within the context of disease. Expression quantitative trait locus (eQTL) studies identify mRNA expression changes linked to proximal genetic signals (cis eQTLs) that have been shown to affect disease. Given the high impact of recent eQTL studies, it is important to understand what roles sample size and environment play in the identification of cis eQTLs. Here we show in a genotyped obese human population that the number of cis eQTLs obeys precise scaling laws as a function of sample size in three profiled tissues, i.e. omental adipose, subcutaneous adipose and liver. We also show that genes (or transcripts) with cis eQTL associations detected in a small population are detected at a rate of approximately 90% in the largest population available for our study, indicating that genes with strong cis-acting regulatory elements can be identified with relatively high confidence in smaller populations. However, increasing the sample size allows better detection of weaker and more distantly located cis-regulatory elements. Yet, we determined that the number of tissue-specific cis eQTLs saturates in a modestly sized cohort while the number of cis eQTLs common to all tissues fails to reach a maximum value. Understanding the power laws that govern the number and specificity of eQTLs detected in different tissues will allow better utilization of the genetics of gene expression to inform the molecular mechanisms underlying complex disease traits.
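The scaling relationship described above can be illustrated with a minimal sketch that fits a power law, count = c · n^k, by least squares on log-transformed values. The exact functional form and fitting procedure used in the study are not given in the text, so the simple two-parameter power law is an assumption, and the sample sizes and counts below are synthetic values generated from a known curve, not data from the study.

```python
import math


def fit_power_law(sample_sizes, counts):
    """Fit count = c * n**k via least squares on log(count) vs log(n).

    Returns the tuple (c, k).
    """
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(y) for y in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                      # slope in log-log space = exponent
    c = math.exp(mean_y - k * mean_x)  # intercept in log-log space = log(c)
    return c, k


# Synthetic illustration: counts generated from c = 12, k = 0.8.
sizes = [50, 100, 200, 400, 800]
synthetic_counts = [12.0 * n ** 0.8 for n in sizes]

c_hat, k_hat = fit_power_law(sizes, synthetic_counts)
print(f"c = {c_hat:.3f}, k = {k_hat:.3f}")
```

An exponent that stays positive across the observed range corresponds to a count that keeps growing with sample size, whereas saturation, as reported for tissue-specific cis eQTLs, would show up as a departure from the fitted line at larger sample sizes.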