DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction. This makes it possible to estimate the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model can greatly increase DNA modification detection accuracy and reduce the control data coverage required. For some DNA modifications that have a strong signal, a control sample is not needed at all, because historical data can be used as an alternative to the control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in an R package named seqPatch, which is available at https://github.com/zhixingfeng/seqPatch.
DNA modifications have been found in a wide range of living organisms, from bacteria to humans. Many existing studies have shown that they play important roles in development, disease, bacterial virulence, etc. However, for many types of DNA modification, for example N6-methyladenine and 8-oxoG, there is no efficient and accurate detection method. Single molecule real time (SMRT) sequencing generates not only DNA sequences but also DNA polymerase kinetic information. The kinetic information is sensitive to DNA modifications in the sequenced DNA template, and therefore can be used for detecting a wide range of DNA modification types. The usual detection strategy is a case-control method, which compares kinetic information between a native sample and a control sample whose modifications have been removed. However, generating a control sample doubles the cost. We propose a hierarchical model that incorporates existing SMRT sequencing data to increase detection accuracy and to reduce the coverage required of the control sample, or even to avoid the need for a control sample in some cases. We tested our method on SMRT sequencing data from plasmids with known modified sites and from the E. coli K-12 strain to demonstrate that it can greatly increase detection accuracy and reduce sequencing cost.
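The idea of borrowing strength from historical data can be illustrated with a minimal sketch. It assumes a simple normal-normal hierarchy for one sequence context and is not the model implemented in seqPatch; the function names, the known noise variance, and all numbers are hypothetical. The prior is estimated from historical per-dataset means, combined with whatever control reads exist, and the native sample is then compared with the resulting expected value.

```python
from statistics import mean, variance


def estimate_prior(historical_means):
    """Empirical Bayes step: prior mean and variance for one sequence context,
    estimated from per-dataset mean kinetic values in historical data."""
    return mean(historical_means), variance(historical_means)


def posterior_expected_rate(prior_mean, prior_var, control_values, noise_var):
    """Posterior mean and variance of the expected kinetic value at a site.

    With no control reads the estimate falls back to the historical prior.
    """
    n = len(control_values)
    if n == 0:
        return prior_mean, prior_var
    precision = 1.0 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + sum(control_values) / noise_var) / precision
    return post_mean, 1.0 / precision


def modification_score(native_values, expected_mean, expected_var, noise_var):
    """z-like score comparing the native-sample mean with the expected value."""
    n = len(native_values)
    se = (noise_var / n + expected_var) ** 0.5
    return (mean(native_values) - expected_mean) / se


# Hypothetical numbers for a single site (not taken from the paper).
prior_mean, prior_var = estimate_prior([1.0, 2.0, 3.0])
no_control = posterior_expected_rate(prior_mean, prior_var, [], noise_var=1.0)
low_control = posterior_expected_rate(prior_mean, prior_var, [2.5, 2.5], noise_var=1.0)
score = modification_score([5.0, 5.2, 4.8], *no_control, noise_var=1.0)
```

In this sketch, low control coverage pulls the estimate only part of the way from the historical mean toward the control mean, and zero coverage leaves the historical prior as the sole reference, which mirrors the control-free use case described above.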
We report a systems genetics analysis of high-density lipoprotein (HDL) levels in an F2 intercross between inbred strains CAST/EiJ and C57BL/6J. We previously showed that there are dramatic differences in HDL metabolism in a cross between these strains, and we now report co-expression network analysis of HDL that integrates global expression data from liver and adipose with relevant metabolic traits. Using data from a total of 293 F2 intercross mice, we constructed weighted gene co-expression networks and identified modules (subnetworks) associated with HDL and clinical traits. These were examined for genes implicated in HDL levels based on large human genome-wide association studies (GWAS) and examined with respect to conservation between tissues and sexes in a total of 9 data sets. We identify genes that are consistently ranked high by association with HDL across the 9 data sets. We focus in particular on two genes, Wfdc2 and Hdac3, that are located in close proximity to HDL QTL peaks where causal testing indicates that they may affect HDL. Our results provide a rich resource for studies of complex metabolic interactions involving HDL.
In the bacterial world, methylation is most commonly associated with restriction-modification systems that provide a defense mechanism against invading foreign genomes. In addition, it is known that methylation plays functionally important roles, including timing of DNA replication, chromosome partitioning, DNA repair, and regulation of gene expression. However, full DNA methylome analyses are scarce due to the lack of a simple methodology for rapid and sensitive detection of common epigenetic marks (i.e., N6-methyladenine (6mA) and N4-methylcytosine (4mC)) in these organisms. Here, we use Single-Molecule Real-Time (SMRT) sequencing to determine the methylomes of two related human pathogen species, Mycoplasma genitalium G-37 and Mycoplasma pneumoniae M129, with single-base resolution. Our analysis identified two new methylation motifs not previously described in bacteria: a widespread 6mA methylation motif common to both bacteria (5′-CTAT-3′), as well as a more complex Type I m6A sequence motif in M. pneumoniae (5′-GAN7TAY-3′/3′-CTN7ATR-5′). We identify the methyltransferase responsible for the common motif and suggest the one involved in M. pneumoniae only. Analysis of the distribution of methylation sites across the genome of M. pneumoniae suggests a potential role for methylation in regulating the cell cycle, as well as in regulation of gene expression. To our knowledge, this is one of the first direct methylome profiling studies with single-base resolution from a bacterial organism.
DNA methylation in bacteria plays important roles in cell division, DNA repair, regulation of gene expression, and pathogenesis. Here, we use a novel sequencing technique, Single-Molecule Real-Time (SMRT) sequencing, to determine the methylomes of two related human pathogen species, Mycoplasma genitalium G-37 and Mycoplasma pneumoniae M129. Our analysis identified two novel methylation motifs, one of them present uniquely in M. pneumoniae and the other common to both bacteria. We also identify the methyltransferase responsible for the common methylation motif and suggest the one associated with the M. pneumoniae unique motif. Functional analysis of the data suggests a potential role for methylation in regulating the cell cycle of M. pneumoniae, as well as in regulation of gene expression. To our knowledge, this is one of the first genome-wide approaches to study the biological role of methylation in a bacterial organism.
Complex diseases result from molecular changes induced by multiple genetic factors and the environment. To derive a systems view of how genetic loci interact in the context of tissue-specific molecular networks, we constructed an F2 intercross comprised of >500 mice from diabetes-resistant (B6) and diabetes-susceptible (BTBR) mouse strains made genetically obese by the Leptin^ob/ob mutation (Lep^ob). High-density genotypes, diabetes-related clinical traits, and whole-transcriptome expression profiling in five tissues (white adipose, liver, pancreatic islets, hypothalamus, and gastrocnemius muscle) were determined for all mice. We performed an integrative analysis to investigate the inter-relationship among genetic factors, expression traits, and plasma insulin, a hallmark diabetes trait. Among five tissues under study, there are extensive protein–protein interactions between genes responding to different loci in adipose and pancreatic islets that potentially participate jointly in the regulation of plasma insulin. We developed a novel ranking scheme based on cross-loci protein-protein network topology and gene expression to assess each gene's potential to regulate plasma insulin. Unique candidate genes were identified in adipose tissue and islets. In islets, the Alzheimer's gene App was identified as a top candidate regulator. Islets from 17-week-old, but not 10-week-old, App knockout mice showed increased insulin secretion in response to glucose or a membrane-permeant cAMP analog, in agreement with the predictions of the network model. Our result provides a novel hypothesis on the mechanism for the connection between two aging-related diseases: Alzheimer's disease and type 2 diabetes.
Alzheimer's disease and type 2 diabetes are two common aging-related diseases. Numerous studies have shown that the two diseases are associated. However, the mechanisms underlying this connection are not clear. Both are complex diseases induced by multiple genetic factors and the environment. To understand the molecular network regulated by complex genetic factors causing type 2 diabetes, we constructed an F2 intercross comprised of >500 mice from diabetes-resistant and diabetic mouse strains. We measured genotypes, clinical traits, and expression profiles in five tissues for each mouse. We then performed an integrative analysis to investigate the inter-relationship among genetic factors, expression traits, and plasma insulin, a hallmark diabetes trait, and developed a novel method for inferring key regulators of plasma insulin. In islets, the Alzheimer's gene App was identified as a top candidate regulator. Islets from 17-week-old, but not 10-week-old, App knockout mice showed increased insulin secretion in response to glucose, in agreement with the predictions of the network model. Our result provides a novel hypothesis on the mechanism for the connection between two aging-related diseases: Alzheimer's disease and type 2 diabetes.
A human genome-wide linkage scan for obesity identified a linkage peak on chromosome 5q13–15. Positional cloning revealed an association of a rare haplotype with high body-mass index (BMI) in males but not females. The risk locus contains a single gene, “arrestin domain containing 3” (ARRDC3), an uncharacterized α-arrestin. Inactivating Arrdc3 in mice led to a striking resistance to obesity, with greater impact on male mice. Mice with decreased ARRDC3 levels were protected from obesity due to increased energy expenditure through increased activity levels and increased thermogenesis of both brown and white adipose tissues. ARRDC3 interacted directly with β-adrenergic receptors, and loss of ARRDC3 increased the response to β-adrenergic stimulation in isolated adipose tissue. These results demonstrate that ARRDC3 is a gender-sensitive regulator of obesity and energy expenditure and reveal a surprising diversity for arrestin family protein functions.
Alternative RNA splicing greatly expands the repertoire of proteins encoded by genomes. Next-generation sequencing (NGS) is attractive for studying alternative splicing because of the efficiency and low cost per base, but short reads typical of NGS only report mRNA fragments containing one or a few splice junctions. Here, we used single-molecule amplification and long-read sequencing to study the HIV-1 provirus, which is only 9700 bp in length, but encodes nine major proteins via alternative splicing. Our data showed that the clinical isolate HIV-1 89.6 produces at least 109 different spliced RNAs, including a previously unappreciated ∼1 kb class of messages, two of which encode new proteins. HIV-1 message populations differed between cell types, longitudinally during infection, and among T cells from different human donors. These findings open a new window on a little studied aspect of HIV-1 replication, suggest therapeutic opportunities and provide advanced tools for the study of alternative splicing.
Inference about regulatory networks from high-throughput genomics data is of great interest in systems biology. We present a Bayesian approach to infer gene regulatory networks from time series expression data by integrating various types of biological knowledge.
We formulate network construction as a series of variable selection problems and use linear regression to model the data. We extend the Bayesian model averaging (BMA) variable selection method to select regulators in the regression framework, and we summarize the external biological knowledge and additional data sources with an informative prior probability distribution over the candidate regression models.
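The variable selection step can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes that each candidate model is scored with the BIC approximation to its marginal likelihood, that the informative prior factorizes into independent per-regulator inclusion probabilities, and that models contain at most two regulators. All names and data are hypothetical.

```python
import itertools
import math


def ols_rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus the columns."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in columns]
    p = len(cols)
    # Augmented normal equations [X'X | X'y].
    m = [[sum(ci[k] * cj[k] for k in range(n)) for cj in cols]
         + [sum(ci[k] * y[k] for k in range(n))] for ci in cols]
    # Gaussian elimination with partial pivoting, then back substitution.
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(m[r][i]))
        m[i], m[piv] = m[piv], m[i]
        for r in range(i + 1, p):
            f = m[r][i] / m[i][i]
            for k in range(i, p + 1):
                m[r][k] -= f * m[i][k]
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (m[i][p] - sum(m[i][k] * b[k] for k in range(i + 1, p))) / m[i][i]
    fitted = [sum(b[j] * cols[j][k] for j in range(p)) for k in range(n)]
    return sum((y[k] - fitted[k]) ** 2 for k in range(n))


def bma_inclusion_probs(candidates, y, prior_incl, max_size=2):
    """Posterior inclusion probability of each candidate regulator."""
    names = list(candidates)
    n = len(y)
    log_w = {}
    for size in range(max_size + 1):
        for subset in itertools.combinations(names, size):
            rss = ols_rss([candidates[g] for g in subset], y)
            bic = n * math.log(rss / n) + (size + 1) * math.log(n)
            log_prior = sum(math.log(prior_incl[g] if g in subset else 1.0 - prior_incl[g])
                            for g in names)
            log_w[subset] = -0.5 * bic + log_prior
    top = max(log_w.values())
    w = {s: math.exp(v - top) for s, v in log_w.items()}
    total = sum(w.values())
    return {g: sum(v for s, v in w.items() if g in s) / total for g in names}


# Hypothetical expression values: the target depends on x1 only.
candidates = {
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "x2": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    "x3": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
y = [2.0 * a + e for a, e in zip(candidates["x1"], noise)]
flat_prior = {"x1": 0.5, "x2": 0.5, "x3": 0.5}
probs = bma_inclusion_probs(candidates, y, flat_prior)
```

Raising a regulator's prior inclusion probability, for example because external data support the regulatory link, raises its posterior inclusion probability for the same expression data; this is how the prior carries biological knowledge into the selection.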
We demonstrate our method on simulated data and a set of time-series microarray experiments measuring the effect of a drug perturbation on gene expression levels, and show that it outperforms leading regression-based methods in the literature.
Systems biology; Network inference; Data integration; Statistics; Time-series expression data; Model uncertainty
A common inflammatome signature, as well as disease-specific expression patterns, was identified from 11 different rodent inflammatory disease models. Causal regulatory networks and the drivers of the inflammatome signature were uncovered and validated.
Representative inflammatome gene signatures, as well as disease model-specific gene signatures, were identified from 12 gene expression profiling data sets derived from 9 different tissues isolated from 11 rodent inflammatory disease models. The inflammatome signature is highly enriched for immune response-related genes, disease causal genes, and drug targets. Regulatory relationships among the inflammatome signature genes were examined in over 70 causal networks derived from a number of large-scale genetic studies of multiple diseases, and the potential key drivers were uncovered and validated prospectively. Over 70% of the inflammatome signature genes and over 50% of the key driver genes have not been reported in previous studies of common signatures in inflammatory conditions.
Common inflammatome gene signatures as well as disease-specific signatures were identified by analyzing 12 expression profiling data sets derived from 9 different tissues isolated from 11 rodent inflammatory disease models. The inflammatome signature significantly overlaps with known drug targets and co-expressed gene modules linked to metabolic disorders and cancer. A large proportion of genes in this signature are tightly connected in tissue-specific Bayesian networks (BNs) built from multiple independent mouse and human cohorts. Both the inflammatome signature and the corresponding consensus BNs are highly enriched for immune response-related genes supported as causal for adiposity, adipokine, diabetes, aortic lesion, bone, muscle, and cholesterol traits, suggesting the causal nature of the inflammatome for a variety of diseases. Integration of this inflammatome signature with the BNs uncovered 151 key drivers that appeared to be more biologically important than the non-drivers in terms of their impact on disease phenotypes. The identification of this inflammatome signature, its network architecture, and key drivers not only highlights the shared etiology but also pinpoints potential targets for intervention of various common diseases.
Bayesian network; co-expression network; inflammatome; inflammatory diseases; key regulators
Effective targeted cancer therapeutic development depends upon distinguishing disease-associated ‘driver’ mutations, which have causative roles in malignancy pathogenesis, from ‘passenger’ mutations, which are dispensable for cancer initiation and maintenance. Translational studies of clinically active targeted therapeutics can definitively discriminate driver from passenger lesions and provide valuable insights into human cancer biology. Activating internal tandem duplication (ITD) mutations in FLT3 (FLT3-ITD) are detected in approximately 20% of acute myeloid leukaemia (AML) patients and are associated with a poor prognosis [1]. Abundant scientific [2] and clinical evidence [1,3], including the lack of convincing clinical activity of early FLT3 inhibitors [4,5], suggests that FLT3-ITD probably represents a passenger lesion. Here we report point mutations at three residues within the kinase domain of FLT3-ITD that confer substantial in vitro resistance to AC220 (quizartinib), an active investigational inhibitor of FLT3, KIT, PDGFRA, PDGFRB and RET [6,7]; evolution of AC220-resistant substitutions at two of these amino acid positions was observed in eight of eight FLT3-ITD-positive AML patients with acquired resistance to AC220. Our findings demonstrate that FLT3-ITD can represent a driver lesion and valid therapeutic target in human AML. AC220-resistant FLT3 kinase domain mutants represent high-value targets for future FLT3 inhibitor development efforts.
Motivation: The identification of condition specific sub-networks from gene expression profiles has important biological applications, ranging from the selection of disease-related biomarkers to the discovery of pathway alterations across different phenotypes. Although many methods exist for extracting these sub-networks, very few existing approaches simultaneously consider both the differential expression of individual genes and the differential correlation of gene pairs, losing potentially valuable information in the data.
Results: In this article, we propose a new method, COSINE (COndition SpecIfic sub-NEtwork), which employs a scoring function that jointly measures the condition-specific changes of both ‘nodes’ (individual genes) and ‘edges’ (gene–gene co-expression). It uses a genetic algorithm to search for the single optimal sub-network that maximizes the scoring function. We applied COSINE to simulated datasets with various differential expression patterns and to three real datasets: a prostate cancer dataset, an across-tissue comparison of morbidly obese patients, and an across-population comparison of the HapMap samples. Compared with previous methods, COSINE is more powerful in identifying truly significant sub-networks of appropriate size and meaningful biological relevance.
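A scoring function of this kind can be sketched as below. The form used here, a weighted average of mean node scores and mean edge scores controlled by a hypothetical parameter `lam`, is an assumption for illustration and is not necessarily the function implemented in the COSINE package. The search is also swapped for an exhaustive enumeration, which is only feasible for toy inputs, in place of the genetic algorithm.

```python
from itertools import combinations


def subnetwork_score(nodes, node_score, edge_score, lam=0.5):
    """Weighted combination of mean node change and mean edge change.

    node_score maps gene -> differential expression statistic;
    edge_score maps a sorted gene pair -> differential correlation statistic.
    """
    nodes = sorted(nodes)
    node_part = sum(node_score[g] for g in nodes) / len(nodes)
    edges = [pair for pair in combinations(nodes, 2) if pair in edge_score]
    edge_part = sum(edge_score[e] for e in edges) / len(edges) if edges else 0.0
    return lam * node_part + (1.0 - lam) * edge_part


def best_subnetwork(node_score, edge_score, lam=0.5, min_size=2):
    """Exhaustive search over all sub-networks (stand-in for a genetic algorithm)."""
    genes = sorted(node_score)
    subsets = (s for k in range(min_size, len(genes) + 1)
               for s in combinations(genes, k))
    return max(subsets, key=lambda s: subnetwork_score(s, node_score, edge_score, lam))


# Hypothetical statistics for three genes.
node_score = {"A": 2.0, "B": 1.0, "C": 0.0}
edge_score = {("A", "B"): 3.0, ("B", "C"): 1.0}
best = best_subnetwork(node_score, edge_score)
```

Setting `lam` to 1 reduces the score to differential expression alone and setting it to 0 reduces it to differential correlation alone, so the parameter expresses how much weight each kind of change receives.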
Availability: The R code is available as the COSINE package on CRAN: http://cran.r-project.org/web/packages/COSINE/index.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
DNA variation can be used as a systematic source of perturbation in segregating populations as a way to infer regulatory networks via the integration of large-scale, high-dimensional molecular profiling data.
Cells employ multiple levels of regulation, including transcriptional and translational regulation, that drive core biological processes and enable cells to respond to genetic and environmental changes. Small-molecule metabolites are one category of critical cellular intermediates that can influence as well as be a target of cellular regulations. Because metabolites represent the direct output of protein-mediated cellular processes, endogenous metabolite concentrations can closely reflect cellular physiological states, especially when integrated with other molecular-profiling data. Here we develop and apply a network reconstruction approach that simultaneously integrates six different types of data: endogenous metabolite concentration, RNA expression, DNA variation, DNA–protein binding, protein–metabolite interaction, and protein–protein interaction data, to construct probabilistic causal networks that elucidate the complexity of cell regulation in a segregating yeast population. Because many of the metabolites are found to be under strong genetic control, we were able to employ a causal regulator detection algorithm to identify causal regulators of the resulting network that elucidated the mechanisms by which variations in their sequence affect gene expression and metabolite concentrations. We examined all four expression quantitative trait loci (eQTL) hot spots with colocalized metabolite QTLs, two of which recapitulated known biological processes, while the other two elucidated novel putative biological mechanisms for the eQTL hot spots.
It is now possible to score variations in DNA across whole genomes, RNA levels and alternative isoforms, metabolite levels, protein levels and protein state information, protein–protein interactions, and protein–DNA interactions, in a comprehensive fashion in populations of individuals. Interactions among these molecular entities define the complex web of biological processes that give rise to all higher order phenotypes, including disease. The development of analytical approaches that simultaneously integrate different dimensions of data is essential if we are to extract the meaning from large-scale data to elucidate the complexity of living systems. Here, we use a novel Bayesian network reconstruction algorithm that simultaneously integrates DNA variation, RNA levels, metabolite levels, protein–protein interaction data, protein–DNA binding data, and protein–small-molecule interaction data to construct molecular networks in yeast. We demonstrate that these networks can be used to infer causal relationships among genes, enabling the identification of novel genes that modulate cellular regulation. We show that our network predictions either recapitulate known biology or can be prospectively validated, demonstrating a high degree of accuracy in the predicted network.
Expression quantitative trait loci (eQTL), or genetic variants associated with changes in gene expression, have the potential to assist in interpreting results of genome-wide association studies (GWAS). eQTLs also have varying degrees of tissue specificity. By correlating the statistical significance of eQTLs mapped in various tissue types to their odds ratios reported in a large GWAS by the Wellcome Trust Case Control Consortium (WTCCC), we discovered that there is a significant association between diseases studied genetically and their relevant tissues. This suggests that eQTL data sets can be used to determine tissues that play a role in the pathogenesis of a disease, thereby highlighting these tissue types for further post-GWAS functional studies.
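The tissue comparison can be illustrated with a short sketch. It assumes a Spearman rank correlation between eQTL significance (as -log10 p-values) and the magnitude of the GWAS log odds ratio for the same variants; the abstract does not specify the correlation measure, so that choice, the tissue names, and the numbers are all hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; inputs must not be constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for five variants associated with one disease.
gwas_abs_log_or = [0.05, 0.10, 0.20, 0.30, 0.40]
eqtl_neglog10_p = {
    "liver": [1.0, 2.0, 3.5, 5.0, 8.0],
    "adipose": [4.0, 1.0, 3.0, 2.0, 2.5],
}
tissue_rho = {t: spearman(p, gwas_abs_log_or) for t, p in eqtl_neglog10_p.items()}
most_relevant = max(tissue_rho, key=tissue_rho.get)
```

A tissue whose eQTL significance tracks the GWAS effect sizes most closely would be the one flagged for follow-up functional studies.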
The genetic determinants of variation in iron status are actively sought, but remain incompletely understood. Meta-analysis of two genome-wide association (GWA) studies and replication in three independent cohorts was performed to identify genetic loci associated in the general population with serum levels of iron and markers of iron status, including transferrin, ferritin, soluble transferrin receptor (sTfR) and sTfR–ferritin index. We identified and replicated a novel association of a common variant in the type-2 transferrin receptor (TFR2) gene with iron levels, with effect sizes highly consistent across samples. In addition, we identified and replicated an association between the HFE locus and ferritin and confirmed previously reported associations with the TF, TMPRSS6 and HFE genes. The five replicated variants were tested for association with expression levels of the corresponding genes in a publicly available data set of human liver samples, and nominally statistically significant expression differences by genotype were observed for all genes, although only rs3811647 in the TF gene survived the Bonferroni correction for multiple testing. In addition, we measured for the first time the effects of the common variant in TMPRSS6, rs4820268, on hepcidin mRNA in peripheral blood (n = 83 individuals) and on hepcidin levels in urine (n = 529) and observed an association in the same direction, though only borderline significant. These functional findings require confirmation in further studies with larger sample sizes, but they suggest that common variants in TMPRSS6 could modify the hepcidin-iron feedback loop in clinically unaffected individuals, thus making them more susceptible to imbalances of iron homeostasis.
A large outbreak of diarrhea and the hemolytic–uremic syndrome caused by an unusual serotype of Shiga-toxin–producing Escherichia coli (O104:H4) began in Germany in May 2011. As of July 22, a large number of cases of diarrhea caused by Shiga-toxin–producing E. coli have been reported — 3167 without the hemolytic–uremic syndrome (16 deaths) and 908 with the hemolytic–uremic syndrome (34 deaths) — indicating that this strain is notably more virulent than most of the Shiga-toxin–producing E. coli strains. Preliminary genetic characterization of the outbreak strain suggested that, unlike most of these strains, it should be classified within the enteroaggregative pathotype of E. coli.
We used third-generation, single-molecule, real-time DNA sequencing to determine the complete genome sequence of the German outbreak strain, as well as the genome sequences of seven diarrhea-associated enteroaggregative E. coli serotype O104:H4 strains from Africa and four enteroaggregative E. coli reference strains belonging to other serotypes. Genomewide comparisons were performed with the use of these enteroaggregative E. coli genomes, as well as those of 40 previously sequenced E. coli isolates.
The enteroaggregative E. coli O104:H4 strains are closely related and form a distinct clade among E. coli and enteroaggregative E. coli strains. However, the genome of the German outbreak strain can be distinguished from those of other O104:H4 strains because it contains a prophage encoding Shiga toxin 2 and a distinct set of additional virulence and antibiotic-resistance factors.
Our findings suggest that horizontal genetic exchange allowed for the emergence of the highly virulent Shiga-toxin–producing enteroaggregative E. coli O104:H4 strain that caused the German outbreak. More broadly, these findings highlight the way in which the plasticity of bacterial genomes facilitates the emergence of new pathogens.
The prognosis of hepatocellular carcinoma (HCC) varies following surgical resection and the large variation remains largely unexplained. Studies have revealed the ability of clinicopathologic parameters and gene expression to predict HCC prognosis. However, there has been little systematic effort to compare the performance of these two types of predictors or combine them in a comprehensive model.
Tumor and adjacent non-tumor liver tissues were collected from 272 ethnic Chinese HCC patients who received curative surgery. We combined clinicopathologic parameters and gene expression data (from both tissue types) in predicting HCC prognosis. Cross-validation and independent studies were employed to assess prediction.
HCC prognosis was significantly associated with six clinicopathologic parameters, which can partition the patients into good- and poor-prognosis groups. Within each group, gene expression data further divide patients into distinct prognostic subgroups. Our predictive genes significantly overlap with previously published gene sets predictive of prognosis. Moreover, the predictive genes were enriched for genes that underwent normal-to-tumor gene network transformation. Previously documented liver eSNPs underlying the HCC predictive gene signatures were enriched for SNPs associated with HCC prognosis, providing support that these genes are involved in key processes of tumorigenesis.
When applied individually, clinicopathologic parameters and gene expression offered similar predictive power for HCC prognosis. In contrast, a combination of the two types of data dramatically improved the power to predict HCC prognosis. Our results also provided a framework for understanding the impact of gene expression on the processes of tumorigenesis and clinical outcome.
Complex diseases such as obesity and type II diabetes can result from a failure in multiple organ systems including the central nervous system and tissues involved in partitioning and disposal of nutrients. Studying the genetics of gene expression in tissues that are involved in the development of these diseases can provide insights into how these tissues interact within the context of disease. Expression quantitative trait locus (eQTL) studies identify mRNA expression changes linked to proximal genetic signals (cis eQTLs) that have been shown to affect disease. Given the high impact of recent eQTL studies, it is important to understand what role sample size and environment play in the identification of cis eQTLs. Here we show in a genotyped obese human population that the number of cis eQTLs obeys precise scaling laws as a function of sample size in three profiled tissues, i.e. omental adipose, subcutaneous adipose and liver. We also show that genes (or transcripts) with cis eQTL associations detected in a small population are detected at a rate of approximately 90% in the largest population available for our study, indicating that genes with strong cis-acting regulatory elements can be identified with relatively high confidence in smaller populations. Increasing the sample size, however, allows better detection of weaker and more distantly located cis-regulatory elements. We further determined that the number of tissue-specific cis eQTLs saturates in a modestly sized cohort, while the number of cis eQTLs common to all tissues fails to reach a maximum value. Understanding the power laws that govern the number and specificity of eQTLs detected in different tissues will allow better use of the genetics of gene expression to inform the molecular mechanisms underlying complex disease traits.
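A scaling law of this kind can be estimated with a simple log-log regression. The sketch below assumes the relationship has the form count = a × n^b, where n is the sample size, and fits a and b by ordinary least squares on the logarithms; the functional form, the function name, and the example numbers are illustrative assumptions and are not values from the study.

```python
import math


def fit_power_law(sample_sizes, eqtl_counts):
    """Least-squares fit of count = a * n**b on log-transformed data."""
    lx = [math.log(n) for n in sample_sizes]
    ly = [math.log(c) for c in eqtl_counts]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    return a, b


# Hypothetical subsampling experiment that follows a power law exactly.
sizes = [50, 100, 200, 400]
counts = [2.0 * n ** 1.5 for n in sizes]
a, b = fit_power_law(sizes, counts)

# Extrapolated count for a larger, hypothetical cohort.
predicted_800 = a * 800 ** b
```

An exponent that stays constant as the cohort grows corresponds to a count that has not reached a maximum, whereas saturation shows up as a departure from the straight line on the log-log scale at larger sample sizes.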
One of the primary objectives in cancer research is to identify causal genomic alterations, such as somatic copy number variation (CNV) and somatic mutations, during tumor development. Many valuable studies lack genomic data to detect CNV; therefore, methods that are able to infer CNVs from gene expression data would help maximize the value of these studies.
We developed a framework for identifying recurrent regions of CNV and distinguishing the cancer driver genes from the passenger genes in the regions. By inferring CNV regions across many datasets we were able to identify 109 recurrent amplified/deleted CNV regions. Many of these regions are enriched for genes involved in many important processes associated with tumorigenesis and cancer progression. Genes in these recurrent CNV regions were then examined in the context of gene regulatory networks to prioritize putative cancer driver genes. The cancer driver genes uncovered by the framework include not only well-known oncogenes but also a number of novel cancer susceptibility genes validated via siRNA experiments.
To our knowledge, this is the first effort to systematically identify and validate drivers for expression-based CNV regions in breast cancer. The framework, which couples wavelet analysis of expression-based copy number alteration with gene regulatory network analysis, provides a blueprint for leveraging genomic data to identify key regulatory components and gene targets. This integrative approach can be applied to many other large-scale gene expression studies and to other novel types of cancer data, such as next-generation sequencing based expression (RNA-Seq) and CNV data.
breast cancer; copy number variation; gene regulatory networks; oncogenes
Although cholera has been present in Latin America since 1991, it had not been epidemic in Haiti for at least 100 years. Recently, however, there has been a severe outbreak of cholera in Haiti.
We used third-generation single-molecule real-time DNA sequencing to determine the genome sequences of 2 clinical Vibrio cholerae isolates from the current outbreak in Haiti, 1 strain that caused cholera in Latin America in 1991, and 2 strains isolated in South Asia in 2002 and 2008. Using primary sequence data, we compared the genomes of these 5 strains and a set of previously obtained partial genomic sequences of 23 diverse strains of V. cholerae to assess the likely origin of the cholera outbreak in Haiti.
Both single-nucleotide variations and the presence and structure of hypervariable chromosomal elements indicate that there is a close relationship between the Haitian isolates and variant V. cholerae El Tor O1 strains isolated in Bangladesh in 2002 and 2008. In contrast, analysis of genomic variation of the Haitian isolates reveals a more distant relationship with circulating South American isolates.
The Haitian epidemic is probably the result of the introduction, through human activity, of a V. cholerae strain from a distant geographic source. (Funded by the National Institute of Allergy and Infectious Diseases and the Howard Hughes Medical Institute.)
In hepatocellular carcinoma (HCC), genes predictive of survival have been found in both adjacent normal (AN) and tumor (TU) tissues. The relationships between these two sets of predictive genes and the general process of tumorigenesis and disease progression remain unclear.
Here we have investigated HCC tumorigenesis by comparing gene expression, DNA copy number variation and survival using ∼250 AN and TU samples representing, respectively, the pre-cancer state and the result of tumorigenesis. Genes that participate in tumorigenesis were defined using a gene-gene correlation meta-analysis procedure that compared AN versus TU tissues. Genes predictive of survival in AN (AN-survival genes) were found to be enriched in the differential gene-gene correlation gene set, indicating that they directly participate in the process of tumorigenesis. Additionally, the AN-survival genes were mostly not predictive after tumorigenesis in TU tissue, and this transition was associated with, and could largely be explained by, the effect of somatic DNA copy number variation (sCNV) in cis and in trans. The data were consistent with the variance of AN-survival genes being rate-limiting steps in tumorigenesis, and this was confirmed using a treatment that promotes HCC tumorigenesis and that selectively altered AN-survival genes and genes differentially correlated between AN and TU.
This suggests that the process of tumor evolution involves rate-limiting steps related to the background from which the tumor evolved, and that these steps were frequently predictive of clinical outcome. Additionally, treatments that alter the likelihood of tumorigenesis occurring may act by altering AN-survival genes, suggesting that the process can be manipulated. Furthermore, sCNV explains a substantial fraction of tumor-specific expression and may therefore be a causal driver of tumor evolution in HCC and perhaps many solid tumor types.
Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle our big data problems.