Immunologic dysfunction, mediated via monocyte activity, has been implicated in the development of HIV-associated neurocognitive disorder (HAND). We hypothesized that transcriptome changes in peripheral blood monocytes relate to neurocognitive functioning in HIV+ individuals, and that such alterations could be useful as biomarkers of worsening HAND.
mRNA was isolated from the monocytes of 86 HIV+ adults and analyzed with the Illumina HT-12 v4 Expression BeadChip. Neurocognitive functioning, HAND diagnosis, and other clinical and virologic variables were determined. Data were analyzed using standard expression analysis and weighted gene co-expression network analysis (WGCNA).
Neurocognitive functioning was correlated with multiple gene transcripts in the standard expression analysis. WGCNA identified two nominally significant co-expression modules associated with neurocognitive functioning, which were enriched with genes involved in mitotic processes and translational elongation.
Multiple modified gene transcripts involved in inflammation, cytoprotection, and neurodegeneration were correlated with neurocognitive functioning. The associations were not strong enough to justify their use as biomarkers of HAND; however, the associations of two co-expression modules with neurocognitive functioning warrant further exploration.
HIV-associated neurocognitive disorder; NeuroAIDS; monocyte; IL6R; KEAP1; LRP12; CSNK1A1; WGCNA
Neuroanatomically precise, genome-wide maps of transcript distributions are critical resources to complement genomic sequence data and to correlate functional and genetic brain architecture. Here we describe the generation and analysis of a transcriptional atlas of the adult human brain, comprising extensive histological analysis and comprehensive microarray profiling of ~900 neuroanatomically precise subdivisions in two individuals. Transcriptional regulation varies enormously by anatomical location, with different regions and their constituent cell types displaying robust molecular signatures that are highly conserved between individuals. Analysis of differential gene expression and gene co-expression relationships demonstrates that brain-wide variation strongly reflects the distributions of major cell classes such as neurons, oligodendrocytes, astrocytes and microglia. Local neighbourhood relationships between fine anatomical subdivisions are associated with discrete neuronal subtypes and genes involved with synaptic transmission. The neocortex displays a relatively homogeneous transcriptional pattern, but with distinct features associated selectively with primary sensorimotor cortices and with enriched frontal lobe expression. Notably, the spatial topography of the neocortex is strongly reflected in its molecular topography— the closer two cortical regions, the more similar their transcriptomes. This freely accessible online data resource forms a high-resolution transcriptional baseline for neurogenetic studies of normal and abnormal human brain function.
Neuroscience; Genetics; Genomics; Databases
Genetic studies have identified dozens of autism spectrum disorder (ASD) susceptibility genes, raising two critical questions: 1) do these genetic loci converge on specific biological processes, and 2) where does the phenotypic specificity of ASD arise, given its genetic overlap with intellectual disability (ID)? To address this, we mapped ASD and ID risk genes onto co-expression networks representing developmental trajectories and transcriptional profiles representing fetal and adult cortical laminae. ASD genes tightly coalesce in modules that implicate distinct biological functions during human cortical development, including early transcriptional regulation and synaptic development. Bioinformatic analyses suggest translational regulation by FMRP and transcriptional co-regulation by common transcription factors connect these processes. At a circuit level, ASD genes are enriched in superficial cortical layers and glutamatergic projection neurons. Furthermore, we show that the patterns of ASD and ID risk genes are distinct, providing a novel biological framework for investigating the pathophysiology of ASD.
gene networks; systems biology; exome; rare variants; Intellectual disability; human cortical development; gene expression; FMRP; Satb1; MEF2; RNA-seq
Common genetic variation and rare mutations in genes encoding calcium channel subunits have pleiotropic effects on risk for multiple neuropsychiatric disorders, including autism spectrum disorder (ASD) and schizophrenia. To gain further mechanistic insights by extending previous gene expression data, we constructed co-expression networks in Timothy syndrome (TS), a monogenic condition with high penetrance for ASD, caused by mutations in the L-type calcium channel, Cav1.2.
To identify patient-specific alterations in transcriptome organization, we conducted a genome-wide weighted co-expression network analysis (WGCNA) on neural progenitors and neurons from multiple lines of induced pluripotent stem cells (iPSC) derived from normal and TS (G406R in CACNA1C) individuals. We employed transcription factor binding site enrichment analysis to assess whether TS associated co-expression changes reflect calcium-dependent co-regulation.
We identified reproducible developmental and activity-dependent gene co-expression modules conserved in patient and control cell lines. By comparing cell lines from case and control subjects, we also identified co-expression modules reflecting distinct aspects of TS, including intellectual disability and ASD-related phenotypes. Moreover, by integrating co-expression with transcription factor binding analysis, we showed the TS-associated transcriptional changes were predicted to be co-regulated by calcium-dependent transcriptional regulators, including NFAT, MEF2, CREB, and FOXO, thus providing a mechanism by which altered Ca2+ signaling in TS patients leads to the observed molecular dysregulation.
We applied WGCNA to construct co-expression networks related to neural development and depolarization in iPSC-derived neural cells from TS and control individuals for the first time. These analyses illustrate how a systems biology approach based on gene networks can yield insights into the molecular mechanisms of neural development and function, and provide clues as to the functional impact of the downstream effects of Ca2+ signaling dysregulation on transcription.
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-014-0075-5) contains supplementary material, which is available to authorized users.
Global gene expression measured by DNA microarray platforms has been extensively used to classify breast carcinomas in correlation with clinical characteristics, including outcome. We generated a breast cancer Serial Analysis of Gene Expression (SAGE) high-resolution database of ~2.7 million tags to perform unsupervised statistical analyses to obtain the molecular classification of breast-invasive ductal carcinomas in correlation with clinicopathologic features. Unsupervised statistical analysis by means of a random forest approach identified two main clusters of breast carcinomas, which differed in their lymph node status (P = 0.01); this suggested that lymph node status leads to globally distinct expression profiles. A total of 245 (55 up-modulated and 190 down-modulated) transcripts were differentially expressed between lymph node (+) and lymph node (−) primary breast tumors (fold change, ≥2; P < 0.05). Various lymph node (+) up-modulated transcripts were validated in independent sets of human breast tumors by means of real-time reverse transcription-PCR (RT-PCR). We validated significant overexpression of transcripts for HOXC10 (P = 0.001), TPD52L1 (P = 0.007), ZFP36L1 (P = 0.011), PLINP1 (P = 0.013), DCTN3 (P = 0.025), DEK (P = 0.031), and CSNK1D (P = 0.04) in lymph node (+) breast carcinomas. Moreover, the DCTN3 (P = 0.022) and RHBDD2 (P = 0.002) transcripts were confirmed by real-time RT-PCR to be overexpressed in tumors that recurred within 6 years of follow-up. In addition, meta-analysis was used to compare SAGE data associated with lymph node (+) status with publicly available breast cancer DNA microarray data sets. We have generated evidence indicating that the pattern of gene expression in primary breast cancers at the time of surgical removal could discriminate those tumors with lymph node metastatic involvement, using SAGE to identify specific transcripts that also behave as predictors of recurrence.
Tuberculosis is a leading cause of infectious disease–related death worldwide; however, only 10% of people infected with Mycobacterium tuberculosis develop disease. Factors that contribute to protection could prove to be promising targets for M. tuberculosis therapies. Analysis of peripheral blood gene expression profiles of active tuberculosis patients has identified correlates of risk for disease or pathogenesis. We sought to identify potential human candidate markers of host defense by studying gene expression profiles of macrophages, cells that, upon infection by M. tuberculosis, can mount an antimicrobial response. Weighted gene coexpression network analysis revealed an association between the cytokine interleukin-32 (IL-32) and the vitamin D antimicrobial pathway in a network of interferon-γ– and IL-15–induced “defense response” genes. IL-32 induced the vitamin D–dependent antimicrobial peptides cathelicidin and DEFB4 and generated antimicrobial activity in vitro, dependent on the presence of adequate 25-hydroxyvitamin D. In addition, the IL-15–induced defense response macrophage gene network was integrated with ranked pairwise comparisons of gene expression from five different clinical data sets of latent compared with active tuberculosis or healthy controls and a coexpression network derived from gene expression in patients with tuberculosis undergoing chemotherapy. Together, these analyses identified eight common genes, including IL-32, as molecular markers of latent tuberculosis and the IL-15–induced gene network. As maintaining M. tuberculosis in a latent state and preventing transition to active disease may represent a form of host resistance, these results identify IL-32 as one functional marker and potential correlate of protection against active tuberculosis.
Little is known about how changes in DNA methylation mediate risk for human diseases including dementia. Analysis of genome-wide methylation patterns in patients with two forms of tau-related dementia – progressive supranuclear palsy (PSP) and frontotemporal dementia (FTD) – revealed significant differentially methylated probes (DMPs) in patients versus unaffected controls. Remarkably, DMPs in PSP were clustered within the 17q21.31 region, previously known to harbor the major genetic risk factor for PSP. We identified and replicated a dose-dependent effect of the risk-associated H1 haplotype on methylation levels within the region in blood and brain. These data reveal that the H1 haplotype increases risk for tauopathy via differential methylation at that locus, indicating a mediating role for methylation in dementia pathophysiology.
Progressive supranuclear palsy (PSP) and frontotemporal dementia (FTD) are two neurodegenerative diseases linked, at the pathologic and genetic level, to the microtubule associated protein tau. We studied epigenetic changes (DNA methylation levels) in peripheral blood from patients with PSP, FTD, and unaffected controls. Analysis of genome-wide methylation patterns revealed significant differentially methylated probes in patients versus unaffected controls. Remarkably, differentially methylated probes in PSP vs. controls were preferentially clustered within the 17q21.31 region, previously known to harbor the major genetic risk factor for PSP. We identified and replicated a dose-dependent effect of the risk-associated H1 haplotype on methylation levels within the region in independent datasets in blood and brain. These data reveal that the H1 haplotype increases risk for tauopathy via differential methylation, indicating a mediating role for methylation in dementia pathophysiology.
The combination of expression patterns of AGR2 and CD10 by prostate cancer provided four phenotypes that correlated with clinical outcome. Based on immunophenotyping, CD10lowAGR2high, CD10highAGR2high, CD10lowAGR2low, and CD10highAGR2low were distinguished. AGR2+ tumors were associated with longer recurrence-free survival and CD10+ tumors with shorter recurrence-free survival. In high-stage cases, the CD10lowAGR2high phenotype was associated with a 9-fold higher recurrence-free survival than the CD10highAGR2low phenotype. The CD10highAGR2high and CD10lowAGR2low phenotypes were intermediate. The CD10highAGR2low phenotype was most frequent in high-grade primary tumors. Conversely, bone and other soft tissue metastases, and derivative xenografts, expressed more AGR2 and less CD10. AGR2 protein was readily detected in tumor metastases. The CD10highAGR2low phenotype in primary tumors is predictive of poor outcome; however, the CD10lowAGR2high phenotype is more common in metastases. It appears that AGR2 has a protective function in primary tumors but may have a role in the distal spread of tumor cells.
Prostate cancer; AGR2; CD10; cancer cell phenotypes; patient stratification; bone and soft tissue metastases; xenografts
Abnormalities of the intestinal microbiota are implicated in the pathogenesis of Crohn's disease (CD) and ulcerative colitis (UC), two spectra of inflammatory bowel disease (IBD). However, the high complexity and low inter-individual overlap of intestinal microbial composition are formidable barriers to identifying microbial taxa representing this dysbiosis. These difficulties might be overcome by an ecologic analytic strategy to identify modules of interacting bacteria (rather than individual bacteria) as quantitative reproducible features of microbial composition in normal and IBD mucosa. We sequenced 16S ribosomal RNA genes from 179 endoscopic lavage samples from different intestinal regions in 64 subjects (32 controls, 16 CD and 16 UC patients in clinical remission). CD and UC patients showed a reduction in phylogenetic diversity and shifts in microbial composition, comparable to previous studies using conventional mucosal biopsies. Analysis of the weighted co-occurrence network revealed 5 microbial modules. These modules were unprecedented, as they were detectable in all individuals, and their composition and abundance were recapitulated in an independent, biopsy-based mucosal dataset. Two modules were associated with healthy, CD, or UC disease states. Imputed metagenome analysis indicated that these modules displayed distinct metabolic functionality, specifically the enrichment of oxidative response and glycan metabolism pathways relevant to host-pathogen interaction in the disease-associated modules. The highly preserved microbial modules accurately classified IBD status of individual patients during disease quiescence, suggesting that microbial dysbiosis in IBD may be an underlying disorder independent of disease activity. Microbial modules thus provide an integrative view of microbial ecology relevant to IBD.
It is not yet known whether DNA methylation levels can be used to accurately predict age across a broad spectrum of human tissues and cell types, nor whether the resulting age prediction is a biologically meaningful measure.
I developed a multi-tissue predictor of age that allows one to estimate the DNA methylation age of most tissues and cell types. The predictor, which is freely available, was developed using 8,000 samples from 82 Illumina DNA methylation array datasets, encompassing 51 healthy tissues and cell types. I found that DNA methylation age has the following properties: first, it is close to zero for embryonic and induced pluripotent stem cells; second, it correlates with cell passage number; third, it gives rise to a highly heritable measure of age acceleration; and, fourth, it is applicable to chimpanzee tissues. Analysis of 6,000 cancer samples from 32 datasets showed that all of the considered 20 cancer types exhibit significant age acceleration, with an average of 36 years. Low age-acceleration of cancer tissue is associated with a high number of somatic mutations and TP53 mutations, while mutations in steroid receptors greatly accelerate DNA methylation age in breast cancer. Finally, I characterize the 353 CpG sites that together form an aging clock in terms of chromatin states and tissue variance.
I propose that DNA methylation age measures the cumulative effect of an epigenetic maintenance system. This novel epigenetic clock can be used to address a host of questions in developmental biology, cancer and aging research.
Rationale and Objective
In this Emerging Science Review, we discuss a systems genetics strategy, which we call Gene Module Association Study (GMAS), as a novel approach complementing Genome Wide Association Studies (GWAS), to understand complex diseases by focusing on how genes work together in groups rather than singly.
The first step is to characterize phenotypic differences among a genetically diverse population. The second step is to use gene expression microarray (or other high-throughput) data from the population to construct gene co-expression networks. Co-expression analysis typically groups 20,000 genes into 20–30 modules containing tens to hundreds of genes, whose aggregate behavior can be represented by the module’s “eigengene.” The third step is to correlate expression patterns with phenotype, as in GWAS, only applied to eigengenes instead of SNPs.
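As an illustration only, the second and third steps can be sketched in a few lines of Python. The sketch assumes that a module's eigengene is taken as the first principal component of the standardized expression of its genes across samples, and that the association with phenotype is a Pearson correlation; the function names, the power-iteration shortcut, and the toy values are hypothetical and do not come from the review.

```python
import math
import statistics


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation.

    Raises ZeroDivisionError for a constant profile (no variance to scale).
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def module_eigengene(expr, n_iter=200):
    """Approximate the first principal component of a module across samples.

    expr: list of genes, each a list of expression values (one per sample).
    Returns a unit-length vector with one entry per sample.
    Assumption: power iteration started from the mean standardized profile,
    which also orients the eigengene along the average expression.
    """
    z = [standardize(gene) for gene in expr]
    n_genes, n_samples = len(z), len(z[0])
    v = [sum(g[s] for g in z) / n_genes for s in range(n_samples)]
    for _ in range(n_iter):
        # Project each gene onto the current sample vector ...
        loadings = [sum(g[s] * v[s] for s in range(n_samples)) for g in z]
        # ... then rebuild the sample vector from the weighted genes.
        v = [sum(loadings[i] * z[i][s] for i in range(n_genes))
             for s in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Toy module: 3 co-expressed genes measured in 5 individuals (made-up numbers).
module_expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.8],
    [0.5, 1.1, 1.4, 2.1, 2.4],
]
trait = [10.0, 12.0, 15.0, 19.0, 20.0]  # hypothetical phenotype per individual

eigengene = module_eigengene(module_expr)
print("eigengene-trait correlation:", round(pearson(eigengene, trait), 3))
```

In a real analysis this correlation would be computed for each of the 20–30 module eigengenes, which is what keeps the number of tests far smaller than in a SNP-by-SNP scan.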
Results and Conclusions
The goal of the GMAS approach is to identify groups of co-regulated genes that explain complex traits from a systems perspective. From an evolutionary standpoint, we hypothesize that variability in eigengene patterns reflects the “good enough solution” concept: biological systems are sufficiently complex that many possible combinations of the same elements (in this case eigengenes) can produce an equivalent output, i.e. a “good enough solution” to accomplish normal biological functions. However, when faced with environmental stresses, some “good enough solutions” adapt better than others, explaining individual variability in disease and drug susceptibility. If validated, GMAS may imply that common polygenic diseases are related as much to group interactions between normal genes as to multiple gene mutations.
systems genetics; genetics of complex diseases; scale-free networks; hybrid mouse diversity panel; computational biology
We used affinity-purification mass spectrometry to identify 747 candidate proteins that are complexed with Huntingtin (Htt) in distinct brain regions and ages in Huntington’s disease (HD) and wildtype mouse brains. To gain a systems-level view of the Htt interactome, we applied Weighted Gene Correlation Network Analysis (WGCNA) to the entire proteomic dataset to unveil a verifiable rank of Htt-correlated proteins and a network of Htt-interacting protein modules, with each module highlighting distinct aspects of Htt biology. Importantly, the Htt-containing module is highly enriched with proteins involved in 14-3-3 signaling, microtubule-based transport, and proteostasis. Top-ranked proteins in this module were validated as novel Htt interactors and genetic modifiers in an HD Drosophila model. Together, our study provides a compendium of spatiotemporal Htt-interacting proteins in the mammalian brain, and presents a conceptually novel approach to analyze proteomic interactome datasets to build in vivo protein networks in complex tissues such as the brain.
Consistent compositional shifts in the gut microbiota are observed in IBD and other chronic intestinal disorders and may contribute to pathogenesis. The identities of microbial biomolecular mechanisms and metabolic products responsible for disease phenotypes remain to be determined, as do the means by which such microbial functions may be therapeutically modified.
The microbiota composition and metabolites of gut microbiome samples from 47 subjects were determined. Samples were obtained by endoscopic mucosal lavage from the cecum and sigmoid colon regions, and each sample was sequenced using the 16S rRNA gene V4 region (Illumina-HiSeq 2000 platform) and assessed by UPLC mass spectroscopy. Spearman correlations were used to identify widespread, statistically significant microbial-metabolite relationships. Metagenomes for identified microbial OTUs were imputed using PICRUSt, and KEGG metabolic pathway modules for imputed genes were assigned using HUMAnN. The resulting metabolic pathway abundances were mostly concordant with metabolite data. Analysis of the metabolome-driven distribution of OTU phylogeny and function revealed clusters of clades that were both metabolically and metagenomically similar.
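For readers unfamiliar with the statistic, a Spearman correlation between one OTU and one metabolite can be sketched as below. This is a minimal standard-library illustration, not the pipeline used in the study: the helper names and the abundance values are hypothetical, ties receive average ranks, and no significance test or multiple-testing correction is included.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Made-up relative abundance of one OTU and intensity of one metabolite
# across five samples; the relationship is monotone but not linear.
otu_abundance = [0.1, 0.4, 0.2, 0.8, 0.5]
metabolite_intensity = [1.0, 8.0, 2.0, 50.0, 9.0]

rho = spearman(otu_abundance, metabolite_intensity)
print("Spearman rho:", rho)
```

Because only ranks enter the calculation, the statistic is insensitive to the skewed, non-linear scales typical of both abundance and mass-spectral intensity data.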
The results suggest that microbes are syntropic with mucosal metabolome composition and therefore may be the source of and/or dependent upon gut epithelial metabolites. The consistent relationship between inferred metagenomic function and assayed metabolites suggests that metagenomic composition is predictive to a reasonable degree of microbial community metabolite pools. The finding that certain metabolites strongly correlate with microbial community structure raises the possibility of targeting metabolites for monitoring and/or therapeutically manipulating microbial community function in IBD and other chronic diseases.
Microbiome; Metabolome; Inter-omic analysis
Transcriptional studies suggest Alzheimer's disease (AD) involves dysfunction of many cellular pathways, including synaptic transmission, cytoskeletal dynamics, energetics, and apoptosis. Despite known progression of AD pathologies, it is unclear how such striking regional vulnerability occurs, or which genes play causative roles in disease progression.
To address these issues, we performed a large-scale transcriptional analysis in the CA1 and relatively less vulnerable CA3 brain regions of individuals with advanced AD and nondemented controls. In our study, we assessed differential gene expression across region and disease status, compared our results to previous studies of similar design, and performed an unbiased co-expression analysis using weighted gene co-expression network analysis (WGCNA). Several disease genes were identified and validated using qRT-PCR.
We find disease signatures consistent with several previous microarray studies, then extend these results to show a relationship between disease status and brain region. Specifically, genes showing decreased expression with AD progression tend to show enrichment in CA3 (and vice versa), suggesting transcription levels may reflect a region's vulnerability to disease. Additionally, we find several candidate vulnerability (ABCA1, MT1H, PDK4, RHOBTB3) and protection (FAM13A1, LINGO2, UNC13C) genes based on expression patterns. Finally, we use a systems-biology approach based on WGCNA to uncover disease-relevant expression patterns for major cell types, including pathways consistent with a key role for early microglial activation in AD.
These results paint a picture of AD as a multifaceted disease involving slight transcriptional changes in many genes between regions, coupled with a systemic immune response, gliosis, and neurodegeneration. Despite this complexity, we find that a consistent picture of gene expression in AD is emerging.
Activation of the epidermal growth factor receptor (EGFR) in glioblastoma (GBM) occurs through mutations or deletions in the extracellular (EC) domain. Unlike lung cancers with EGFR kinase domain (KD) mutations, GBMs respond poorly to the EGFR inhibitor erlotinib. Using RNAi, we show that GBM cells carrying EGFR EC mutations display EGFR addiction. In contrast to KD mutants found in lung cancer, glioma-specific EGFR EC mutants are poorly inhibited by EGFR inhibitors that target the active kinase conformation (e.g., erlotinib). Inhibitors which bind to the inactive EGFR conformation, on the other hand, potently inhibit EGFR EC mutants and induce cell death in EGFR mutant GBM cells. Our results provide first evidence for single kinase addiction in GBM, and suggest that the disappointing clinical activity of first-generation EGFR inhibitors in GBM versus lung cancer may be attributed to the different conformational requirements of mutant EGFR in these two cancer types.
Many network analyses of fMRI data begin by defining a set of regions, extracting the mean signal from each region and then analyzing the correlations between regions. One essential question that has not been addressed in the literature is how to best define the network neighborhoods over which a signal is combined for network analyses. Here we present a novel unsupervised method for the identification of tightly interconnected voxels, or modules, from fMRI data. This approach, weighted voxel coactivation network analysis (WVCNA), is based on a method that was originally developed to find modules of genes in gene networks. This approach differs from many of the standard network approaches in fMRI in that connections between voxels are described by a continuous measure, whereas typically voxels are considered to be either connected or not connected depending on whether the correlation between the two voxels survives a hard threshold value. Additionally, instead of simply using pairwise correlations to describe the connection between two voxels, WVCNA relies on a measure of topological overlap, which not only compares how correlated two voxels are, but also the degree to which the pair of voxels is highly correlated with the same other voxels. We demonstrate the use of WVCNA to parcellate the brain into a set of modules that are reliably detected across data within the same subject and across subjects. In addition, we compare WVCNA to ICA and show that the WVCNA modules have some of the same structure as the ICA components, but tend to be more spatially focused. We also demonstrate the use of some of the WVCNA network metrics for assessing a voxel’s membership in a module and also how that voxel relates to other modules. Last, we illustrate how WVCNA modules can be used in a network analysis to find connections between regions of the brain and show that it produces reasonable results.
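The two ideas above, a continuous connection strength and a topological overlap measure, can be sketched as follows. The sketch assumes the unsigned soft-threshold adjacency and the topological overlap formula commonly used in weighted gene co-expression network analysis; whether WVCNA uses exactly these definitions, and the exponent `beta`, are assumptions made for illustration, as are the function names and the toy correlation matrix.

```python
def soft_adjacency(corr, beta=6):
    """Continuous connection strength |r|^beta, with a zero diagonal.

    beta is a hypothetical soft-threshold exponent; larger values push
    weak correlations toward zero without imposing a hard cutoff.
    """
    n = len(corr)
    return [[0.0 if i == j else abs(corr[i][j]) ** beta for j in range(n)]
            for i in range(n)]


def topological_overlap(adj):
    """Topological overlap between every pair of nodes.

    For nodes i and j:
        (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)
    where u runs over all other nodes and k is the node's total
    connection strength. The diagonal is set to 1.
    """
    n = len(adj)
    k = [sum(row) for row in adj]  # diagonal of adj is zero
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom


# Toy correlations among four voxels: 0, 1, 2 co-activate; 3 is nearly isolated.
voxel_corr = [
    [1.00, 0.90, 0.80, 0.10],
    [0.90, 1.00, 0.85, 0.20],
    [0.80, 0.85, 1.00, 0.10],
    [0.10, 0.20, 0.10, 1.00],
]

adjacency = soft_adjacency(voxel_corr, beta=2)
tom = topological_overlap(adjacency)
print("overlap(0, 1) =", round(tom[0][1], 3))
print("overlap(0, 3) =", round(tom[0][3], 3))
```

Voxels 0 and 1 score highly both because they are directly correlated and because they share the strongly connected neighbor 2, whereas voxel 3 shares almost no neighbors with any other voxel; clustering on one minus this overlap is one way such modules could then be extracted.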
Functional Magnetic Resonance Imaging; Functional Connectivity; Graph Theory; Small World Networks
Since hub nodes have been found to play important roles in many networks, highly connected hub genes are expected to play an important role in biology as well. However, the empirical evidence remains ambiguous. An open question is whether (or when) hub gene selection leads to more meaningful gene lists than a standard statistical analysis based on significance testing when analyzing genomic data sets (e.g., gene expression or DNA methylation data). Here we address this question for the special case when multiple genomic data sets are available. This is of great practical importance since for many research questions multiple data sets are publicly available. In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules. We assess the performance of these two types of approaches according to two criteria. The first criterion evaluates the biological insights gained and is relevant in basic research. The second criterion evaluates the validation success (reproducibility) in independent data sets and often applies in clinical diagnostic or prognostic applications. We compare meta-analysis with consensus network analysis based on weighted correlation network analysis (WGCNA) in three comprehensive and unbiased empirical studies: (1) finding genes predictive of lung cancer survival, (2) finding methylation markers related to age, and (3) finding mouse genes related to total cholesterol. The results demonstrate that intramodular hub gene status with respect to consensus modules is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1). However, standard meta-analysis methods perform as well as (if not better than) a consensus network approach in terms of validation success (criterion 2).
The article also reports a comparison of meta-analysis techniques applied to gene expression data and presents novel R functions for carrying out consensus network analysis, network based screening, and meta analysis.
Autism spectrum disorder (ASD) is a common, highly heritable neuro-developmental condition characterized by marked genetic heterogeneity [1–3]. Thus, a fundamental question is whether autism represents an etiologically heterogeneous disorder in which the myriad genetic or environmental risk factors perturb common underlying molecular pathways in the brain [4]. Here, we demonstrate consistent differences in transcriptome organization between autistic and normal brain by gene co-expression network analysis. Remarkably, regional patterns of gene expression that typically distinguish frontal and temporal cortex are significantly attenuated in the ASD brain, suggesting abnormalities in cortical patterning. We further identify discrete modules of co-expressed genes associated with autism: a neuronal module enriched for known autism susceptibility genes, including the neuronal specific splicing factor A2BP1/FOX1, and a module enriched for immune genes and glial markers. Using high-throughput RNA-sequencing we demonstrate dysregulated splicing of A2BP1-dependent alternative exons in ASD brain. Moreover, using a published autism GWAS dataset, we show that the neuronal module is enriched for genetically associated variants, providing independent support for the causal involvement of these genes in autism. In contrast, the immune-glial module showed no enrichment for autism GWAS signals, indicating a non-genetic etiology for this process. Collectively, our results provide strong evidence for convergent molecular abnormalities in ASD, and implicate transcriptional and splicing dysregulation as underlying mechanisms of neuronal dysfunction in this disorder.
The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets.
Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network not only generalizes correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood based tests for assessing the significance of an edge after controlling for clustering. We present a novel Majorization-Minimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bi-partite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM).
The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood based significance tests for edge counts, c) it directly models higher-order relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust.
Network decomposition; Model-based clustering; MM algorithm; Propensity; Network conformity
We report a systems genetics analysis of high-density lipoprotein (HDL) levels in an F2 intercross between inbred strains CAST/EiJ and C57BL/6J. We previously showed that there are dramatic differences in HDL metabolism in a cross between these strains, and we now report co-expression network analysis of HDL that integrates global expression data from liver and adipose with relevant metabolic traits. Using data from a total of 293 F2 intercross mice, we constructed weighted gene co-expression networks and identified modules (subnetworks) associated with HDL and clinical traits. These were examined for genes implicated in HDL levels based on large human genome-wide association studies (GWAS) and examined with respect to conservation between tissue and sexes in a total of 9 data sets. We identify genes that are consistently ranked high by association with HDL across the 9 data sets. We focus in particular on two genes, Wfdc2 and Hdac3, that are located in close proximity to HDL QTL peaks where causal testing indicates that they may affect HDL. Our results provide a rich resource for studies of complex metabolic interactions involving HDL.
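As a rough illustration of what constructing a weighted gene co-expression network involves, the sketch below builds an unsigned soft-thresholded adjacency, a_ij = |cor(x_i, x_j)|^beta, and the topological overlap measure from it. It is a minimal sketch in plain Python, not the WGCNA package code; the function names and the power beta = 6 are assumptions chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency a_ij = |cor|^beta; expr is a list of per-gene profiles.

    The diagonal is left at 0 so that row sums give the connectivity k_i.
    beta=6 is an illustrative assumption, not a recommendation.
    """
    g = len(expr)
    adj = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            adj[i][j] = adj[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return adj

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij), with TOM_ii = 1."""
    g = len(adj)
    k = [sum(row) for row in adj]
    tom = [[1.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            shared = sum(adj[i][u] * adj[u][j] for u in range(g) if u != i and u != j)
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom
```

Modules would then typically be obtained by hierarchical clustering of the dissimilarity 1 - TOM, a step not shown here.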
Human Immunodeficiency Virus-1 (HIV) infection frequently results in neurocognitive impairment. While the cause remains unclear, recent gene expression studies have identified genes whose transcription is dysregulated in individuals with HIV-associated neurocognitive disorder (HAND). However, the methods for interpreting such data have lagged behind the technical advances that allow the decoding of genetic material. Here, we employ systems biology methods novel to the field of NeuroAIDS to further interrogate extant transcriptome data derived from brains of HIV+ patients in order to further elucidate the neuropathogenesis of HAND. Additionally, we compare these data to those derived from brains of individuals with Alzheimer’s disease (AD) in order to identify common pathways of neuropathogenesis.
In Study 1, using data from three brain regions in 6 HIV-seronegative and 15 HIV+ cases, we first employed weighted gene co-expression network analysis (WGCNA) to further explore transcriptome networks specific to HAND with HIV-encephalitis (HIVE) and HAND without HIVE. We then used a symptomatic approach, employing standard expression analysis and WGCNA to identify networks associated with neurocognitive impairment (NCI), regardless of HIVE or HAND diagnosis. Finally, we examined the association between the CNS penetration effectiveness (CPE) of antiretroviral regimens and brain transcriptome. In Study 2, we identified common gene networks associated with NCI in both HIV and AD by correlating gene expression with pre-mortem neurocognitive functioning.
Study 1: WGCNA largely corroborated findings from standard differential gene expression analyses, but also identified possible meta-networks composed of multiple gene ontology categories and oligodendrocyte dysfunction. Differential expression analysis identified hub genes highly correlated with NCI, including genes implicated in gliosis, inflammation, and dopaminergic tone. Enrichment analysis identified gene ontology categories that varied across the three brain regions, the most notable being downregulation of genes involved in mitochondrial functioning. Finally, WGCNA identified dysregulated networks associated with NCI, including oligodendrocyte and mitochondrial functioning. Study 2: Common gene networks dysregulated in relation to NCI in AD and HIV included mitochondrial genes, whereas upregulation of various cancer-related genes was found.
While under-powered, this study identified possible biologically-relevant networks correlated with NCI in HIV, and common networks shared with AD, opening new avenues for inquiry in the investigation of HAND neuropathogenesis. These results suggest that further interrogation of existing transcriptome data using systems biology methods can yield important information.
HIV encephalitis; HIV-associated dementia; HIV-associated neurocognitive disorder; Weighted gene coexpression network analysis; WGCNA; CNS penetration effectiveness; National neuroAIDS tissue consortium; Coexpression module
Similarities between speech and birdsong make songbirds advantageous for investigating the neurogenetics of learned vocal communication, a complex phenotype likely supported by ensembles of interacting genes in cortico-basal ganglia pathways of both species. To date, only FoxP2 has been identified as critical to both speech and birdsong. We performed weighted gene co-expression network analysis on microarray data from singing zebra finches to discover gene ensembles regulated during vocal behavior. We found ~2,000 singing-regulated genes comprising 3 co-expression groups unique to area X, the basal ganglia subregion dedicated to learned vocalizations. These contained known targets of human FOXP2 and potential avian targets. We validated novel biological pathways for vocalization. Higher order gene co-expression patterns, rather than expression levels, molecularly distinguish area X from the ventral striato-pallidum during singing. The previously unknown structure of singing-driven networks enables prioritization of molecular interactors that likely bear on human motor disorders, especially those affecting speech.
Ensemble predictors such as the random forest are known to have superior accuracy, but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is highly interpretable, especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal, several articles have explored GLM based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have received little attention in the literature.
Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a “thinned” ensemble predictor (involving few features) that retains excellent predictive accuracy.
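A minimal sketch of two of the ingredients named above, bootstrap aggregation and the random subspace method, applied to an ordinary least-squares fit (a Gaussian GLM with identity link). It is not the randomGLM implementation: forward variable selection and interaction terms are omitted, a tiny ridge term is added purely for numerical stability, and all function names and defaults are illustrative assumptions.

```python
import random

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_ols(rows, y, features, ridge=1e-8):
    """Least-squares fit with intercept on the chosen feature columns."""
    design = [[1.0] + [row[j] for j in features] for row in rows]
    p = len(features) + 1
    xtx = [[sum(d[r] * d[c] for d in design) + (ridge if r == c else 0.0)
            for c in range(p)] for r in range(p)]
    xty = [sum(d[r] * t for d, t in zip(design, y)) for r in range(p)]
    return solve_linear(xtx, xty)

def bagged_glm(rows, y, n_bags=50, n_candidates=2, seed=0):
    """Fit one linear model per bag: bootstrap the samples, then subsample the features."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    ensemble = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]                    # bootstrap sample
        feats = sorted(rng.sample(range(n_features), n_candidates))   # random subspace
        coef = fit_ols([rows[i] for i in idx], [y[i] for i in idx], feats)
        ensemble.append((feats, coef))
    return ensemble

def predict(ensemble, row):
    """Average the predictions of all bagged models."""
    preds = [coef[0] + sum(c * row[j] for c, j in zip(coef[1:], feats))
             for feats, coef in ensemble]
    return sum(preds) / len(preds)
```

Counting how often each feature is drawn and retained across bags would be one simple way to derive a variable importance measure from such an ensemble; that step is likewise left out of this sketch.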
RGLM is a state of the art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
Co-expression measures are often used to define networks among genes. Mutual information (MI) is often used as a generalized correlation measure. It is not clear how much MI adds beyond standard (robust) correlation measures or regression model based association measures. Further, it is important to assess what transformations of these and other co-expression measures lead to biologically meaningful modules (clusters of genes).
We provide a comprehensive comparison between mutual information and several correlation measures in 8 empirical data sets and in simulations. We also study different approaches for transforming an adjacency matrix, e.g. using the topological overlap measure. Overall, we confirm close relationships between MI and correlation in all data sets which reflects the fact that most gene pairs satisfy linear or monotonic relationships. We discuss rare situations when the two measures disagree. We also compare correlation and MI based approaches when it comes to defining co-expression network modules. We show that a robust measure of correlation (the biweight midcorrelation transformed via the topological overlap transformation) leads to modules that are superior to MI based modules and maximal information coefficient (MIC) based modules in terms of gene ontology enrichment. We present a function that relates correlation to mutual information which can be used to approximate the mutual information from the corresponding correlation coefficient. We propose the use of polynomial or spline regression models as an alternative to MI for capturing non-linear relationships between quantitative variables.
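One reason a close relationship between MI and correlation is plausible can be seen in a textbook special case. The identity below holds for a bivariate normal pair with correlation coefficient $\rho$; it is offered only as an illustration and is not necessarily the function the authors present.

```latex
I(X;Y) = -\frac{1}{2}\,\log\!\left(1 - \rho^{2}\right)
```

Under that assumption, MI is a monotonically increasing function of $|\rho|$, so ranking gene pairs by either measure gives the same order.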
The biweight midcorrelation outperforms MI in terms of elucidating gene pairwise relationships. Coupled with the topological overlap matrix transformation, it often leads to more significantly enriched co-expression modules. Spline and polynomial networks form attractive alternatives to MI in case of non-linear relationships. Our results indicate that MI networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data.
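For readers unfamiliar with the biweight midcorrelation, the following is a minimal plain-Python sketch of its usual definition: values are centered on the median, scaled by nine times the median absolute deviation (MAD), and down-weighted with Tukey's biweight so that points far from the median contribute nothing. The function names are assumptions for illustration, and the sketch simply raises an error when the MAD is zero rather than falling back to another measure.

```python
import math
from statistics import median

def _biweight_normalize(x):
    """Center on the median, apply biweight weights, and scale to unit length."""
    med = median(x)
    mad = median([abs(v - med) for v in x])
    if mad == 0:
        raise ValueError("MAD is zero; the biweight midcorrelation is undefined here")
    weighted = []
    for v in x:
        u = (v - med) / (9.0 * mad)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0  # outliers get weight 0
        weighted.append((v - med) * w)
    norm = math.sqrt(sum(t * t for t in weighted))
    return [t / norm for t in weighted]

def bicor(x, y):
    """Biweight midcorrelation of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(_biweight_normalize(x), _biweight_normalize(y)))
```

Because a single extreme value receives zero weight, one outlying sample cannot reverse the sign of the estimate in the way it can for the Pearson correlation.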
A high serum triglyceride (TG) level is an established risk factor for coronary heart disease (CHD). Fat is stored in the form of TGs in human adipose tissue. We hypothesized that gene co-expression networks in human adipose tissue may be correlated with serum TG levels and help reveal novel genes involved in TG regulation.
Gene co-expression networks were constructed from two Finnish and one Mexican study sample using the blockwiseModules R function in Weighted Gene Co-expression Network Analysis (WGCNA). Overlap between TG-associated networks from each of the three study samples was calculated using Fisher’s exact test. Gene ontology was used to determine known pathways enriched in each TG-associated network.
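The overlap test described above can be sketched as a one-sided Fisher's exact test, that is, the upper tail of a hypergeometric distribution. This is a minimal sketch under stated assumptions: the one-sided form, the function name, and the choice of the gene universe (all genes present in both networks) are illustrative and are not taken from the study.

```python
from math import comb

def overlap_p_value(n_universe, size_a, size_b, n_overlap):
    """P(overlap >= n_overlap) when size_b genes are drawn at random from
    n_universe genes, of which size_a belong to the first module."""
    total = comb(n_universe, size_b)
    upper = min(size_a, size_b)
    tail = sum(comb(size_a, k) * comb(n_universe - size_a, size_b - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical sizes, for illustration only.
p = overlap_p_value(n_universe=20, size_a=5, size_b=5, n_overlap=3)
```

A small p-value indicates that two modules share more genes than expected by chance given their sizes and the size of the gene universe.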
We measured gene expression in adipose samples from two Finnish and one Mexican study sample. In each study sample, we observed a gene co-expression network that was significantly associated with serum TG levels. The TG modules observed in Finns and Mexicans significantly overlapped and shared 34 genes. Seven of the 34 genes (ARHGAP30, CCR1, CXCL16, FERMT3, HCST, RNASET2, SELPG) were identified as the key hub genes of all three TG modules. Furthermore, two of the 34 genes (ARHGAP9, LST1) reside in previous TG GWAS regions, suggesting them as the regional candidates underlying the GWAS signals.
This study presents a novel adipose gene co-expression network with 34 genes significantly correlated with serum TG across populations.
Mexicans; Finns; RNA sequencing; Triglycerides; Adipose tissue; Weighted gene co-expression network analysis