Rationale and Objective
In this Emerging Science Review, we discuss a systems genetics strategy, which we call Gene Module Association Study (GMAS), as a novel approach complementing Genome Wide Association Studies (GWAS), to understand complex diseases by focusing on how genes work together in groups rather than singly.
The first step is to characterize phenotypic differences among a genetically diverse population. The second step is to use gene expression microarray (or other high throughput) data from the population to construct gene co-expression networks. Co-expression analysis typically groups 20,000 genes into 20–30 modules containing tens to hundreds of genes, whose aggregate behavior can be represented by the module’s “eigengene.” The third step is to correlate expression patterns with phenotype, as in GWAS, only applied to eigengenes instead of SNPs.
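The second and third steps can be sketched in a few lines. This is a minimal illustration, assuming the eigengene is taken as the first principal component of the standardized module expression matrix; the simulated data, the function name `module_eigengene`, and all numeric settings are hypothetical and not taken from the review.

```python
import numpy as np

def module_eigengene(expr):
    """Summarize one module by its first principal component.

    expr: samples x genes array holding only the genes of one module.
    Returns the eigengene (one value per sample) and the fraction of
    variance it explains.
    """
    # Standardize each gene so that no single gene dominates the component.
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0, ddof=1)
    # The first left singular vector gives one score per sample.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    eigengene = u[:, 0]
    # The sign of a singular vector is arbitrary; align it with mean expression.
    if np.corrcoef(eigengene, z.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    var_explained = s[0] ** 2 / np.sum(s ** 2)
    return eigengene, var_explained

# Simulated example (assumption): 60 samples, one module of 40 co-expressed genes.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 40
latent = rng.normal(size=n_samples)  # hypothetical shared driver of the module
expr = latent[:, None] + 0.5 * rng.normal(size=(n_samples, n_genes))
trait = 2.0 * latent + rng.normal(size=n_samples)  # simulated phenotype

eigengene, var_explained = module_eigengene(expr)
r = np.corrcoef(eigengene, trait)[0, 1]
print(f"variance explained: {var_explained:.2f}, eigengene-trait r: {r:.2f}")
```

Testing a few dozen eigengenes against a trait, rather than hundreds of thousands of SNPs, greatly reduces the multiple-testing burden, which is part of the appeal of the approach.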
Results and Conclusions
The goal of the GMAS approach is to identify groups of co-regulated genes that explain complex traits from a systems perspective. From an evolutionary standpoint, we hypothesize that variability in eigengene patterns reflects the “good enough solution” concept, that biological systems are sufficiently complex so that many possible combinations of the same elements (in this case eigengenes) can produce an equivalent output, i.e., a “good enough solution” to accomplish normal biological functions. However, when faced with environmental stresses, some “good enough solutions” adapt better than others, explaining individual variability in disease and drug susceptibility. If validated, GMAS may imply that common polygenic diseases are related as much to group interactions between normal genes as to multiple gene mutations.
systems genetics; genetics of complex diseases; scale-free networks; hybrid mouse diversity panel; computational biology
We used affinity-purification mass spectrometry to identify 747 candidate proteins that are complexed with Huntingtin (Htt) in distinct brain regions and ages in Huntington’s disease (HD) and wildtype mouse brains. To gain a systems-level view of the Htt interactome, we applied Weighted Gene Correlation Network Analysis (WGCNA) to the entire proteomic dataset to unveil a verifiable rank of Htt-correlated proteins and a network of Htt-interacting protein modules, with each module highlighting distinct aspects of Htt biology. Importantly, the Htt-containing module is highly enriched with proteins involved in 14-3-3 signaling, microtubule-based transport, and proteostasis. Top-ranked proteins in this module were validated as novel Htt interactors and genetic modifiers in an HD Drosophila model. Together, our study provides a compendium of spatiotemporal Htt-interacting proteins in the mammalian brain, and presents a conceptually novel approach to analyze proteomic interactome datasets to build in vivo protein networks in complex tissues such as the brain.
Transcriptional studies suggest Alzheimer's disease (AD) involves dysfunction of many cellular pathways, including synaptic transmission, cytoskeletal dynamics, energetics, and apoptosis. Despite known progression of AD pathologies, it is unclear how such striking regional vulnerability occurs, or which genes play causative roles in disease progression.
To address these issues, we performed a large-scale transcriptional analysis in the CA1 and relatively less vulnerable CA3 brain regions of individuals with advanced AD and nondemented controls. In our study, we assessed differential gene expression across region and disease status, compared our results to previous studies of similar design, and performed an unbiased co-expression analysis using weighted gene co-expression network analysis (WGCNA). Several disease genes were identified and validated using qRT-PCR.
We find disease signatures consistent with several previous microarray studies, then extend these results to show a relationship between disease status and brain region. Specifically, genes showing decreased expression with AD progression tend to show enrichment in CA3 (and vice versa), suggesting transcription levels may reflect a region's vulnerability to disease. Additionally, we find several candidate vulnerability (ABCA1, MT1H, PDK4, RHOBTB3) and protection (FAM13A1, LINGO2, UNC13C) genes based on expression patterns. Finally, we use a systems-biology approach based on WGCNA to uncover disease-relevant expression patterns for major cell types, including pathways consistent with a key role for early microglial activation in AD.
These results paint a picture of AD as a multifaceted disease involving slight transcriptional changes in many genes between regions, coupled with a systemic immune response, gliosis, and neurodegeneration. Despite this complexity, we find that a consistent picture of gene expression in AD is emerging.
Activation of the epidermal growth factor receptor (EGFR) in glioblastoma (GBM) occurs through mutations or deletions in the extracellular (EC) domain. Unlike lung cancers with EGFR kinase domain (KD) mutations, GBMs respond poorly to the EGFR inhibitor erlotinib. Using RNAi, we show that GBM cells carrying EGFR EC mutations display EGFR addiction. In contrast to KD mutants found in lung cancer, glioma-specific EGFR EC mutants are poorly inhibited by EGFR inhibitors that target the active kinase conformation (e.g., erlotinib). Inhibitors that bind to the inactive EGFR conformation, on the other hand, potently inhibit EGFR EC mutants and induce cell death in EGFR mutant GBM cells. Our results provide the first evidence for single kinase addiction in GBM, and suggest that the disappointing clinical activity of first-generation EGFR inhibitors in GBM versus lung cancer may be attributed to the different conformational requirements of mutant EGFR in these two cancer types.
Many network analyses of fMRI data begin by defining a set of regions, extracting the mean signal from each region and then analyzing the correlations between regions. One essential question that has not been addressed in the literature is how to best define the network neighborhoods over which a signal is combined for network analyses. Here we present a novel unsupervised method for the identification of tightly interconnected voxels, or modules, from fMRI data. This approach, weighted voxel coactivation network analysis (WVCNA), is based on a method that was originally developed to find modules of genes in gene networks. This approach differs from many of the standard network approaches in fMRI in that connections between voxels are described by a continuous measure, whereas typically voxels are considered to be either connected or not connected depending on whether the correlation between the two voxels survives a hard threshold value. Additionally, instead of simply using pairwise correlations to describe the connection between two voxels, WVCNA relies on a measure of topological overlap, which captures not only how correlated two voxels are, but also the degree to which the pair of voxels is highly correlated with the same other voxels. We demonstrate the use of WVCNA to parcellate the brain into a set of modules that are reliably detected across data within the same subject and across subjects. In addition, we compare WVCNA to ICA and show that the WVCNA modules have some of the same structure as the ICA components, but tend to be more spatially focused. We also demonstrate the use of some of the WVCNA network metrics for assessing a voxel’s membership in a module and how that voxel relates to other modules. Last, we illustrate how WVCNA modules can be used in a network analysis to find connections between regions of the brain and show that it produces reasonable results.
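The two ingredients described above, a continuous connection strength and topological overlap, can be illustrated as follows. This is a sketch under stated assumptions: the adjacency is the absolute correlation raised to a power `beta` (the value 6 is an arbitrary choice here), the overlap uses the commonly cited unsigned form (shared neighbors plus direct connection, divided by the smaller connectivity plus one minus the direct connection), and the simulated "voxel" time series are hypothetical.

```python
import numpy as np

def soft_adjacency(data, beta=6):
    """Continuous connection strengths: |correlation| ** beta.

    data: time points x voxels. The diagonal is set to zero so that a
    voxel is not counted as its own neighbor.
    """
    adj = np.abs(np.corrcoef(data, rowvar=False)) ** beta
    np.fill_diagonal(adj, 0.0)
    return adj

def topological_overlap(adj):
    """Unsigned topological overlap for a symmetric adjacency in [0, 1]."""
    shared = adj @ adj               # strength of connections to common neighbors
    k = adj.sum(axis=1)              # connectivity of each voxel
    min_k = np.minimum.outer(k, k)   # smaller connectivity of each pair
    tom = (shared + adj) / (min_k + 1.0 - adj)
    np.fill_diagonal(tom, 1.0)
    return tom

# Simulated example (assumption): two groups of five voxels, each group
# driven by its own source signal plus independent noise.
rng = np.random.default_rng(1)
n_time = 200
source_a = rng.normal(size=n_time)
source_b = rng.normal(size=n_time)
group_a = source_a[:, None] + 0.5 * rng.normal(size=(n_time, 5))
group_b = source_b[:, None] + 0.5 * rng.normal(size=(n_time, 5))
data = np.hstack([group_a, group_b])

adj = soft_adjacency(data, beta=6)
tom = topological_overlap(adj)
within = tom[:5, :5][~np.eye(5, dtype=bool)].mean()
between = tom[:5, 5:].mean()
print(f"mean TOM within group: {within:.3f}, between groups: {between:.3f}")
```

Modules would then typically be found by clustering on the dissimilarity `1 - tom`; that step is omitted here.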
Functional Magnetic Resonance Imaging; Functional Connectivity; Graph Theory; Small World Networks
Since hub nodes have been found to play important roles in many networks, highly connected hub genes are expected to play an important role in biology as well. However, the empirical evidence remains ambiguous. An open question is whether (or when) hub gene selection leads to more meaningful gene lists than a standard statistical analysis based on significance testing when analyzing genomic data sets (e.g., gene expression or DNA methylation data). Here we address this question for the special case when multiple genomic data sets are available. This is of great practical importance since for many research questions multiple data sets are publicly available. In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules. We assess the performance of these two types of approaches according to two criteria. The first criterion evaluates the biological insights gained and is relevant in basic research. The second criterion evaluates the validation success (reproducibility) in independent data sets and often applies in clinical diagnostic or prognostic applications. We compare meta-analysis with consensus network analysis based on weighted correlation network analysis (WGCNA) in three comprehensive and unbiased empirical studies: (1) finding genes predictive of lung cancer survival, (2) finding methylation markers related to age, and (3) finding mouse genes related to total cholesterol. The results demonstrate that intramodular hub gene status with respect to consensus modules is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1). However, standard meta-analysis methods perform as well as (if not better than) a consensus network approach in terms of validation success (criterion 2).
The article also reports a comparison of meta-analysis techniques applied to gene expression data and presents novel R functions for carrying out consensus network analysis, network based screening, and meta analysis.
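As an illustration of what a standard meta-analysis statistic looks like for a single gene, the sketch below combines per-study Z statistics with Stouffer's method. This is one common technique chosen for illustration; the abstract does not say which techniques the article compares, and the Z values, sample sizes, and function names here are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_z(z_scores, weights=None):
    """Combine per-study Z statistics: sum(w * z) / sqrt(sum(w ** 2))."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    return numerator / denominator

def two_sided_p(z):
    """Two-sided p-value of a standard normal Z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical gene measured in three data sets of different sizes.
z_per_study = [2.1, 1.8, 2.5]
sample_sizes = [50, 80, 120]
# Weighting by the square root of the sample size is a common convention.
z_meta = stouffer_z(z_per_study, [sqrt(n) for n in sample_sizes])
print(f"meta Z = {z_meta:.2f}, p = {two_sided_p(z_meta):.2e}")
```

A gene with modest but consistent evidence in every study ends up with a stronger combined statistic than in any single study, which is the property that makes such methods attractive for validation success.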
Autism spectrum disorder (ASD) is a common, highly heritable neuro-developmental condition characterized by marked genetic heterogeneity [1–3]. Thus, a fundamental question is whether autism represents an etiologically heterogeneous disorder in which the myriad genetic or environmental risk factors perturb common underlying molecular pathways in the brain [4]. Here, we demonstrate consistent differences in transcriptome organization between autistic and normal brain by gene co-expression network analysis. Remarkably, regional patterns of gene expression that typically distinguish frontal and temporal cortex are significantly attenuated in the ASD brain, suggesting abnormalities in cortical patterning. We further identify discrete modules of co-expressed genes associated with autism: a neuronal module enriched for known autism susceptibility genes, including the neuronal specific splicing factor A2BP1/FOX1, and a module enriched for immune genes and glial markers. Using high-throughput RNA-sequencing we demonstrate dysregulated splicing of A2BP1-dependent alternative exons in ASD brain. Moreover, using a published autism GWAS dataset, we show that the neuronal module is enriched for genetically associated variants, providing independent support for the causal involvement of these genes in autism. In contrast, the immune-glial module showed no enrichment for autism GWAS signals, indicating a non-genetic etiology for this process. Collectively, our results provide strong evidence for convergent molecular abnormalities in ASD, and implicate transcriptional and splicing dysregulation as underlying mechanisms of neuronal dysfunction in this disorder.
The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets.
Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network not only generalizes correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood based tests for assessing the significance of an edge after controlling for clustering. We present a novel Majorization-Minimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bi-partite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM).
The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood based significance tests for edge counts, c) it directly models higher-order relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust.
Network decomposition; Model-based clustering; MM algorithm; Propensity; Network conformity
We report a systems genetics analysis of high-density lipoprotein (HDL) levels in an F2 intercross between inbred strains CAST/EiJ and C57BL/6J. We previously showed that there are dramatic differences in HDL metabolism in a cross between these strains, and we now report co-expression network analysis of HDL that integrates global expression data from liver and adipose with relevant metabolic traits. Using data from a total of 293 F2 intercross mice, we constructed weighted gene co-expression networks and identified modules (subnetworks) associated with HDL and clinical traits. These were examined for genes implicated in HDL levels based on large human genome-wide association studies (GWAS) and examined with respect to conservation between tissues and sexes in a total of 9 data sets. We identify genes that are consistently ranked high by association with HDL across the 9 data sets. We focus in particular on two genes, Wfdc2 and Hdac3, that are located in close proximity to HDL QTL peaks where causal testing indicates that they may affect HDL. Our results provide a rich resource for studies of complex metabolic interactions involving HDL.
The molecular complexity of genetic diseases requires novel approaches to break it down into coherent biological modules. For this purpose, many disease network models have been created and analyzed. We highlight two of them, “the human diseases networks” (HDN) and “the orphan disease networks” (ODN). However, in these models, each single node represents one disease or an ambiguous group of diseases. In these cases, the notion of diseases as unique entities reduces the usefulness of network-based methods. We hypothesize that using the clinical features (pathophenotypes) to define pathophenotypic connections between disease-causing genes improves our understanding of the molecular events caused by genetic disturbances. For this, we have built a pathophenotypic similarity gene network (PSGN) and compared it with the unipartite projections (based on gene-to-gene edges) similar to those used in previous network models (HDN and ODN). Unlike these disease network models, the PSGN uses semantic similarities. This pathophenotypic similarity has been calculated by comparing pathophenotypic annotations of genes (human abnormalities of HPO terms) in the “Human Phenotype Ontology”. The resulting network contains 1075 genes (nodes) and 26197 significant pathophenotypic similarities (edges). A global analysis of this network reveals: unnoticed pairs of genes showing significant pathophenotypic similarity, a biologically meaningful re-arrangement of the pathological relationships between genes, correlations of biochemical interactions with higher similarity scores and functional biases in metabolic and essential genes toward the pathophenotypic specificity and the pleiotropy, respectively. Additionally, pathophenotypic similarities and metabolic interactions of genes associated with maple syrup urine disease (MSUD) have been used to merge these genes into a coherent pathological module.
Our results indicate that pathophenotypes help to identify underlying co-dependencies among disease-causing genes that are useful to describe disease modularity.
Human Immunodeficiency Virus-1 (HIV) infection frequently results in neurocognitive impairment. While the cause remains unclear, recent gene expression studies have identified genes whose transcription is dysregulated in individuals with HIV-associated neurocognitive disorder (HAND). However, the methods for interpretation of such data have lagged behind the technical advances allowing the decoding of genetic material. Here, we employ systems biology methods novel to the field of NeuroAIDS to further interrogate extant transcriptome data derived from brains of HIV+ patients in order to further elucidate the neuropathogenesis of HAND. Additionally, we compare these data to those derived from brains of individuals with Alzheimer’s disease (AD) in order to identify common pathways of neuropathogenesis.
In Study 1, using data from three brain regions in 6 HIV-seronegative and 15 HIV+ cases, we first employed weighted gene co-expression network analysis (WGCNA) to further explore transcriptome networks specific to HAND with HIV-encephalitis (HIVE) and HAND without HIVE. We then used a symptomatic approach, employing standard expression analysis and WGCNA to identify networks associated with neurocognitive impairment (NCI), regardless of HIVE or HAND diagnosis. Finally, we examined the association between the CNS penetration effectiveness (CPE) of antiretroviral regimens and brain transcriptome. In Study 2, we identified common gene networks associated with NCI in both HIV and AD by correlating gene expression with pre-mortem neurocognitive functioning.
Study 1: WGCNA largely corroborated findings from standard differential gene expression analyses, but also identified possible meta-networks composed of multiple gene ontology categories and oligodendrocyte dysfunction. Differential expression analysis identified hub genes highly correlated with NCI, including genes implicated in gliosis, inflammation, and dopaminergic tone. Enrichment analysis identified gene ontology categories that varied across the three brain regions, the most notable being downregulation of genes involved in mitochondrial functioning. Finally, WGCNA identified dysregulated networks associated with NCI, including oligodendrocyte and mitochondrial functioning. Study 2: Common gene networks dysregulated in relation to NCI in AD and HIV included mitochondrial genes, whereas upregulation of various cancer-related genes was found.
While under-powered, this study identified possible biologically-relevant networks correlated with NCI in HIV, and common networks shared with AD, opening new avenues for inquiry in the investigation of HAND neuropathogenesis. These results suggest that further interrogation of existing transcriptome data using systems biology methods can yield important information.
HIV encephalitis; HIV-associated dementia; HIV-associated neurocognitive disorder; Weighted gene coexpression network analysis; WGCNA; CNS penetration effectiveness; National neuroAIDS tissue consortium; Coexpression module
Similarities between speech and birdsong make songbirds advantageous for investigating the neurogenetics of learned vocal communication, a complex phenotype likely supported by ensembles of interacting genes in cortico-basal ganglia pathways of both species. To date, only FoxP2 has been identified as critical to both speech and birdsong. We performed weighted gene co-expression network analysis on microarray data from singing zebra finches to discover gene ensembles regulated during vocal behavior. We found ~2,000 singing-regulated genes comprising 3 co-expression groups unique to area X, the basal ganglia subregion dedicated to learned vocalizations. These contained known targets of human FOXP2 and potential avian targets. We validated novel biological pathways for vocalization. Higher order gene co-expression patterns, rather than expression levels, molecularly distinguish area X from the ventral striato-pallidum during singing. The previously unknown structure of singing-driven networks enables prioritization of molecular interactors that likely bear on human motor disorders, especially those affecting speech.
Ensemble predictors such as the random forest are known to have superior accuracy but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable, especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal several articles have explored GLM based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have received little attention in the literature.
Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a “thinned” ensemble predictor (involving few features) that retains excellent predictive accuracy.
RGLM is a state of the art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
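The ingredients named above (bootstrap aggregation, a random subspace of candidate features per bag, and forward variable selection) can be sketched with plain NumPy. This is not the randomGLM implementation: it is a simplified illustration that assumes a Gaussian GLM (ordinary least squares), stops forward selection after a fixed number of terms instead of using an information criterion, omits interaction terms, and uses selection counts as a crude importance measure. All function names and settings are hypothetical.

```python
import numpy as np

def forward_select(X, y, candidates, max_terms):
    """Greedy forward selection for a least-squares model with intercept."""
    chosen = []
    for _ in range(max_terms):
        best_rss, best_j = None, None
        for j in candidates:
            if j in chosen:
                continue
            design = np.column_stack([np.ones(len(y)), X[:, chosen + [j]]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = np.sum((y - design @ coef) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        if best_j is None:
            break
        chosen.append(best_j)
    design = np.column_stack([np.ones(len(y)), X[:, chosen]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return chosen, coef

def fit_bagged_glm(X, y, n_bags=50, n_candidates=10, max_terms=3, seed=0):
    """Fit one forward-selected model per bootstrap sample."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    models = []
    for _ in range(n_bags):
        rows = rng.integers(0, n, size=n)                        # bootstrap sample
        cand = rng.choice(p, size=n_candidates, replace=False)   # random subspace
        models.append(forward_select(X[rows], y[rows], list(cand), max_terms))
    return models

def predict_bagged_glm(models, X):
    """Average the predictions of all bagged models."""
    preds = [coef[0] + X[:, chosen] @ coef[1:] for chosen, coef in models]
    return np.mean(preds, axis=0)

def selection_counts(models, p):
    """How often each feature was selected across bags."""
    counts = np.zeros(p, dtype=int)
    for chosen, _ in models:
        counts[chosen] += 1
    return counts

# Simulated data (assumption): only features 0 and 1 carry signal.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 20))
X_test = rng.normal(size=(200, 20))
y_train = 3 * X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(size=200)
y_test = 3 * X_test[:, 0] - 2 * X_test[:, 1] + rng.normal(size=200)

models = fit_bagged_glm(X_train, y_train)
test_cor = np.corrcoef(predict_bagged_glm(models, X_test), y_test)[0, 1]
counts = selection_counts(models, X_train.shape[1])
print(f"test correlation: {test_cor:.2f}; selection counts: {counts}")
```

Features that are selected in many bags are natural candidates for a "thinned" predictor, in the spirit of the variable importance measures described above.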
Co-expression measures are often used to define networks among genes. Mutual information (MI) is often used as a generalized correlation measure. It is not clear how much MI adds beyond standard (robust) correlation measures or regression model based association measures. Further, it is important to assess what transformations of these and other co-expression measures lead to biologically meaningful modules (clusters of genes).
We provide a comprehensive comparison between mutual information and several correlation measures in 8 empirical data sets and in simulations. We also study different approaches for transforming an adjacency matrix, e.g. using the topological overlap measure. Overall, we confirm close relationships between MI and correlation in all data sets which reflects the fact that most gene pairs satisfy linear or monotonic relationships. We discuss rare situations when the two measures disagree. We also compare correlation and MI based approaches when it comes to defining co-expression network modules. We show that a robust measure of correlation (the biweight midcorrelation transformed via the topological overlap transformation) leads to modules that are superior to MI based modules and maximal information coefficient (MIC) based modules in terms of gene ontology enrichment. We present a function that relates correlation to mutual information which can be used to approximate the mutual information from the corresponding correlation coefficient. We propose the use of polynomial or spline regression models as an alternative to MI for capturing non-linear relationships between quantitative variables.
The biweight midcorrelation outperforms MI in terms of elucidating gene pairwise relationships. Coupled with the topological overlap matrix transformation, it often leads to more significantly enriched co-expression modules. Spline and polynomial networks form attractive alternatives to MI in case of non-linear relationships. Our results indicate that MI networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data.
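For readers unfamiliar with the biweight midcorrelation, the following sketch implements the standard definition (median-centered deviations, down-weighted with Tukey's biweight using a tuning constant of 9 median absolute deviations) and contrasts it with the Pearson correlation on simulated data containing one outlier. The second function is the textbook relation between correlation and mutual information for a bivariate normal distribution; it is shown only as a familiar reference point and is not claimed to be the function presented in the article. The code assumes a non-zero median absolute deviation.

```python
import numpy as np

def bicor(x, y):
    """Biweight midcorrelation of two numeric vectors (assumes MAD > 0)."""
    def weighted_deviation(v):
        v = np.asarray(v, dtype=float)
        med = np.median(v)
        mad = np.median(np.abs(v - med))
        u = (v - med) / (9.0 * mad)
        # Observations more than 9 MADs from the median get weight zero.
        w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        d = (v - med) * w
        return d / np.sqrt(np.sum(d ** 2))
    return float(np.sum(weighted_deviation(x) * weighted_deviation(y)))

def gaussian_mi_from_cor(r):
    """Mutual information (in nats) of a bivariate normal with correlation r."""
    return -0.5 * np.log(1.0 - r ** 2)

# Simulated example (assumption): a strong linear relationship plus one
# grossly discordant observation.
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = x + 0.3 * rng.normal(size=100)
x[0], y[0] = 10.0, -10.0   # single outlier

pearson_r = float(np.corrcoef(x, y)[0, 1])
robust_r = bicor(x, y)
print(f"Pearson: {pearson_r:.2f}, bicor: {robust_r:.2f}")
print(f"Gaussian MI implied by bicor: {gaussian_mi_from_cor(robust_r):.2f} nats")
```

The single outlier is enough to mask the relationship for the Pearson correlation, while the biweight midcorrelation assigns it zero weight.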
High serum triglyceride (TG) levels are an established risk factor for coronary heart disease (CHD). Fat is stored in the form of TGs in human adipose tissue. We hypothesized that gene co-expression networks in human adipose tissue may be correlated with serum TG levels and help reveal novel genes involved in TG regulation.
Gene co-expression networks were constructed from two Finnish and one Mexican study sample using the blockwiseModules R function in Weighted Gene Co-expression Network Analysis (WGCNA). Overlap between TG-associated networks from each of the three study samples was calculated using Fisher’s exact test. Gene ontology was used to determine known pathways enriched in each TG-associated network.
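The overlap test can be illustrated with a one-sided Fisher's exact test, which for a 2×2 table of gene membership is the upper tail of the hypergeometric distribution. The sketch below uses only the Python standard library; the universe size and module sizes in the example are hypothetical placeholders, not values from the study.

```python
from math import comb

def overlap_p_value(n_universe, n_a, n_b, n_overlap):
    """One-sided Fisher's exact test for the overlap of two gene sets.

    Probability of observing at least n_overlap shared genes when a set of
    n_b genes is drawn at random from n_universe genes, n_a of which belong
    to the other set.
    """
    total = comb(n_universe, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_universe - n_a, n_b - k)
        for k in range(n_overlap, min(n_a, n_b) + 1)
    )
    return tail / total

# Hypothetical example: modules of 200 and 150 genes among 15,000 expressed
# genes that share 34 genes.
p_value = overlap_p_value(n_universe=15000, n_a=200, n_b=150, n_overlap=34)
print(f"overlap p-value: {p_value:.3g}")
```

Under random sampling the expected overlap in this hypothetical example is 200 × 150 / 15,000 = 2 genes, so an overlap of 34 would be far beyond chance.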
We measured gene expression in adipose samples from two Finnish and one Mexican study sample. In each study sample, we observed a gene co-expression network that was significantly associated with serum TG levels. The TG modules observed in Finns and Mexicans significantly overlapped and shared 34 genes. Seven of the 34 genes (ARHGAP30, CCR1, CXCL16, FERMT3, HCST, RNASET2, SELPG) were identified as the key hub genes of all three TG modules. Furthermore, two of the 34 genes (ARHGAP9, LST1) reside in previous TG GWAS regions, suggesting them as the regional candidates underlying the GWAS signals.
This study presents a novel adipose gene co-expression network with 34 genes significantly correlated with serum TG across populations.
Mexicans; Finns; RNA sequencing; Triglycerides; Adipose tissue; Weighted gene co-expression network analysis
Constructing coexpression networks and performing network analysis using large-scale gene expression data sets is an effective way to uncover new biological knowledge; however, the methods used for gene association in constructing these coexpression networks have not been thoroughly evaluated. Since different methods lead to structurally different coexpression networks and provide different information, selecting the optimal gene association method is critical.
Methods and Results
In this study, we compared eight gene association methods – Spearman rank correlation, Weighted Rank Correlation, Kendall, Hoeffding's D measure, Theil-Sen, Rank Theil-Sen, Distance Covariance, and Pearson – and focused on their true knowledge discovery rates in associating pathway genes and constructing coordination networks of regulatory genes. We also examined how the different methods behave on microarray data with different properties, and whether the biological processes affect the efficiency of different methods.
We found that the Spearman, Hoeffding and Kendall methods are effective in identifying coexpressed pathway genes, whereas the Theil-Sen, Rank Theil-Sen, Spearman, and Weighted Rank methods perform well in identifying coordinated transcription factors that control the same biological processes and traits. Surprisingly, the widely used Pearson method is generally less efficient, and so is the Distance Covariance method that can find gene pairs of multiple relationships. Some of our analyses clearly show that the Pearson and Distance Covariance methods behave distinctly from the other six methods. The efficiencies of different methods vary with the data properties to some degree and are largely contingent upon the biological processes, which necessitates a pre-analysis to identify the best performing method for gene association and coexpression network construction.
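To make the difference between some of these measures concrete, the sketch below implements four of them from their textbook definitions (Pearson, Spearman as Pearson on average ranks, Kendall's tau-a, and the Theil-Sen slope as the median of pairwise slopes) and applies them to simulated data. The data are hypothetical, and the implementations are simple reference versions rather than the ones used in the study.

```python
import numpy as np
from itertools import combinations

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def average_ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """(concordant - discordant pairs) / number of pairs; no tie correction."""
    score, pairs = 0.0, 0
    for i, j in combinations(range(len(x)), 2):
        score += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        pairs += 1
    return float(score / pairs)

def theil_sen_slope(x, y):
    """Median of the slopes over all pairs of points with distinct x."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return float(np.median(slopes))

# Monotonic but strongly non-linear relationship (simulated).
x = np.linspace(0.0, 5.0, 30)
y = np.exp(2.0 * x)
r_pearson, r_spearman, tau = pearson(x, y), spearman(x, y), kendall_tau_a(x, y)
print(f"Pearson {r_pearson:.2f}, Spearman {r_spearman:.2f}, Kendall {tau:.2f}")

# Exact line y = 2x + 1 with one corrupted observation (simulated).
x_line = np.arange(10, dtype=float)
y_line = 2.0 * x_line + 1.0
y_line[9] = 100.0
slope = theil_sen_slope(x_line, y_line)
print(f"Theil-Sen slope with one outlier: {slope:.2f}")
```

Rank-based measures recover a perfect monotonic association that the Pearson correlation understates, and the median-based slope ignores the corrupted point, which illustrates why the choice of measure can change the resulting network.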
The predominant model for regulation of gene expression through DNA methylation is an inverse association in which increased methylation results in decreased gene expression levels. However, recent studies suggest that the relationship between genetic variation, DNA methylation and expression is more complex.
Systems genetic approaches for examining relationships between gene expression and methylation array data were used to find both negative and positive associations between these levels. A weighted correlation network analysis revealed that i) both transcriptome and methylome are organized in modules, ii) co-expression modules are generally not preserved in the methylation data and vice-versa, and iii) highly significant correlations exist between co-expression and co-methylation modules, suggesting the existence of factors that affect expression and methylation of different modules (i.e., trans effects at the level of modules). We observed that methylation probes associated with expression in cis were more likely to be located outside CpG islands, whereas specificity for CpG island shores was present when methylation, associated with expression, was under local genetic control. A structural equation model based analysis found strong support in particular for a traditional causal model in which gene expression is regulated by genetic variation via DNA methylation instead of gene expression affecting DNA methylation levels.
Our results provide new insights into the complex mechanisms between genetic markers, epigenetic mechanisms and gene expression. We find strong support for the classical model of genetic variants regulating methylation, which in turn regulates gene expression. Moreover we show that, although the methylation and expression modules differ, they are highly correlated.
DNA methylation; Gene expression; Association; Epigenetics; WGCNA
Both avian and mammalian basal ganglia are involved in voluntary motor control. In birds, such movements include hopping, perching and flying. Two organizational features that distinguish the songbird basal ganglia are that striatal and pallidal neurons are intermingled, and that neurons dedicated to vocal-motor function are clustered together in a dense cell group known as area X that sits within the surrounding striato-pallidum. This specification allowed us to perform molecular profiling of two striato-pallidal subregions, comparing transcriptional patterns in tissue dedicated to vocal-motor function (area X) to those in tissue that contains similar cell types but supports non-vocal behaviors: the striato-pallidum ventral to area X (VSP), our focus here. Since any behavior is likely underpinned by the coordinated actions of many molecules, we constructed gene co-expression networks from microarray data to study large-scale transcriptional patterns in both subregions. Our goal was to investigate any relationship between VSP network structure and singing and identify gene co-expression groups, or modules, found in the VSP but not area X. We observed mild, but surprising, relationships between VSP modules and song spectral features, and found a group of four VSP modules that were highly specific to the region. These modules were unrelated to singing, but were composed of genes involved in many of the same biological processes as those we previously observed in area X-specific singing-related modules. The VSP-specific modules were also enriched for processes disrupted in Parkinson's and Huntington's Diseases. Our results suggest that the activation/inhibition of a single pathway is not sufficient to functionally specify area X versus the VSP and support the notion that molecular processes are not in and of themselves specialized for behavior. Instead, unique interactions between molecular pathways create functional specificity in particular brain regions during distinct behavioral states.
Understanding how gene transcription relates to behavior is challenging. Learned vocal-motor behavior is a complex trait that represents the output of multiple converging genes, pathways, and patterns of neural activity. Here, we applied a systems analytical approach to determine how thousands of genes change their expression levels simultaneously in a region of the vertebrate brain important for vocal-motor function, the basal ganglia, during a specific vocal-motor behavior, singing. We used the zebra finch species of songbird based on similarities between song learning/production and speech, and because they possess a set of brain subregions dedicated to singing. Microarrays were used to measure gene expression levels in one such song-dedicated region and in an adjacent motor area that is not thought to play a role in vocal function. This allowed us to address the question of whether distinct gene co-expression patterns could be found in each area. We found that each area contained unique patterns of transcriptional co-activity, but there were also unexpected overlaps. We conclude that the particular behaviors (singing versus non-vocal behaviors) supported by these subregions depend on the particular sets of interactions between molecular pathways that occur in each subregion.
Estrogen signaling pathways may play a significant role in the pathogenesis of non-small cell lung cancers (NSCLC) as evidenced by the expression of aromatase and estrogen receptors (ERα and ERβ) in many of these tumors. Here we examine whether ERα and ERβ levels in conjunction with aromatase define patient groups with respect to survival outcomes and possible treatment regimens. Immunohistochemistry was performed on a high-density tissue microarray with resulting data and clinical information available for 377 patients. Patients were subdivided by gender, age and tumor histology, and survival was assessed using the Cox proportional hazards model and Kaplan-Meier curves. Neither ERα nor ERβ alone was a predictor of survival in NSCLC. However, when coupled with aromatase expression, higher ERβ levels predicted worse survival in patients whose tumors expressed higher levels of aromatase. Although this finding was present in patients of both genders, it was especially pronounced in women ≥ 65 years old, where higher expression of both ERβ and aromatase indicated a markedly worse survival rate than that determined by aromatase alone. Conclusion: Expression of ERβ together with aromatase has predictive value for survival in different gender and age subgroups of NSCLC patients. This predictive value is stronger than that of either marker alone. Our results suggest treatment with aromatase inhibitors alone or combined with estrogen receptor modulators may be of benefit in some subpopulations of these patients.
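As a rough illustration of the Kaplan-Meier step mentioned above, the following Python sketch computes a product-limit survival curve from follow-up times and event flags. The function name `kaplan_meier` and the toy data are hypothetical and are not taken from the study; the Cox proportional hazards model is not shown.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times  -- follow-up duration for each patient
    events -- True if the event (death) was observed, False if censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)      # patients still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= removed
    return curve


# Toy data (hypothetical): five patients, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

In a subgroup comparison such as the one described here, one curve would be computed per group (for example, high versus low ERβ within high-aromatase tumors) and the curves compared.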
NSCLC; tissue microarray; aromatase; estrogen receptor; immunohistochemistry; prognosis
It has been debated whether human induced pluripotent stem cells (iPSCs) and embryonic stem cells (ESCs) express distinctive transcriptomes. By using the method of weighted gene co-expression network analysis, we showed here that iPSCs exhibit altered functional modules compared with ESCs. Notably, iPSCs and ESCs differentially express 17 modules that primarily function in transcription, metabolism, development, and immune response. These module activations (up- and downregulation) are highly conserved in a variety of iPSCs, and genes in each module are coherently co-expressed. Furthermore, the activation levels of these modular genes can be used as quantitative variables to discriminate iPSCs and ESCs with high accuracy (96%). Thus, differential activations of these functional modules are the conserved features distinguishing iPSCs from ESCs. Strikingly, the overall activation level of these modules is inversely correlated with the DNA methylation level, suggesting that DNA methylation may be one mechanism regulating the module differences. Overall, we conclude that human iPSCs and ESCs exhibit distinct gene expression networks, which are likely associated with different epigenetic reprogramming events during the derivation of iPSCs and ESCs.
Primary Sjögren's syndrome (pSS) is a chronic autoimmune disease with complex etiopathogenesis. Despite extensive studies to understand the disease process utilizing human and mouse models, the intersection between these species remains elusive. To address this gap, we utilized a novel systems biology approach to identify disease-related gene modules and signaling pathways that overlap between humans and mice.
Parotid gland tissues were harvested from 24 pSS and 16 non-pSS sicca patients and 25 controls. For mouse studies, salivary glands were harvested from C57BL/6.NOD-Aec1Aec2 mice at various times during development of pSS-like disease. RNA was analyzed with Affymetrix HG U133+2.0 arrays for human samples and with MOE430+2.0 arrays for mouse samples. The images were processed with Affymetrix software. Weighted gene co-expression network analysis was used to identify disease-related and functional pathways.
Nineteen co-expression modules were identified in human parotid tissue, of which four were significantly upregulated and three were downregulated in pSS patients compared with non-pSS sicca patients and controls. Notably, one of the human disease-related modules was highly preserved in the mouse model, and was enriched with genes involved in immune and inflammatory responses. Further comparison between these two species led to the identification of genes associated with leukocyte recruitment and germinal center formation.
Our systems biology analysis of genome-wide expression data from salivary gland tissue of pSS patients and from a pSS mouse model identified common dysregulated biological pathways and molecular targets underlying critical molecular alterations in pSS pathogenesis.
Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied to data sets with a relatively small number of missing values. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve further speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA.
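To make the two correlation measures concrete, here is a minimal Python sketch of (a) Pearson correlation restricted to positions where both values are present and (b) the biweight midcorrelation in its usual median/MAD form. It is an illustration only: it is not the optimized or parallelized WGCNA code, the function names are hypothetical, and representing missing values as None or NaN is an assumption of this sketch.

```python
import math
from statistics import median


def _missing(v):
    # Assumption of this sketch: missing entries are None or NaN.
    return v is None or (isinstance(v, float) and math.isnan(v))


def pearson_pairwise(x, y):
    """Pearson correlation over positions where both x and y are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if not _missing(a) and not _missing(b)]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def _biweight_terms(x):
    """Median-centered values, down-weighted by Tukey's biweight function."""
    med = median(x)
    mad = median([abs(v - med) for v in x])
    if mad == 0:
        raise ValueError("median absolute deviation is zero")
    terms = []
    for v in x:
        u = (v - med) / (9.0 * mad)
        # Points with |u| >= 1 are treated as outliers and get weight zero.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        terms.append((v - med) * w)
    return terms


def bicor(x, y):
    """Biweight midcorrelation of two complete (no missing values) vectors."""
    a = _biweight_terms(x)
    b = _biweight_terms(y)
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den


# A single extreme value flips the sign of Pearson but barely moves bicor.
x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 3, 4, 5, 6, -100]
r_pearson = pearson_pairwise(x, y)
r_bicor = bicor(x, y)
```

The per-pair loop above is the slow, straightforward formulation; the speedup described in the text comes from avoiding exactly this kind of pair-by-pair handling when only a few entries are missing.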
The hierarchical clustering algorithm implemented in R function hclust is an order n^3 (n is the number of clustered objects) version of a publicly available clustering algorithm (Murtagh 2012). We present the package flashClust that implements the original algorithm which in practice achieves order approximately n^2, leading to substantial time savings when clustering large data sets.
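To show where a cubic cost can come from, the sketch below implements a deliberately naive average-linkage agglomerative clustering in Python: every merge step rescans all remaining cluster pairs (order n^2 work), and there are n - 1 merge steps. This is only an illustration of the complexity argument; it is not the code of hclust or flashClust, and the function name and toy distance matrix are hypothetical.

```python
def naive_average_linkage(dist):
    """Naive agglomerative clustering with average linkage.

    dist -- symmetric n x n matrix (list of lists) of pairwise distances
    Returns the merges as (cluster_a, cluster_b, height); original objects
    are numbered 0..n-1 and each merge creates the next unused cluster id.
    """
    n = len(dist)
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    size = {i: 1 for i in range(n)}
    merges = []
    next_id = n
    while len(size) > 1:
        # Full scan of all remaining pairs: this is the expensive step.
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        merges.append((a, b, height))
        new_size = size[a] + size[b]
        for c in size:
            if c == a or c == b:
                continue
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # Average-linkage update: size-weighted mean of old distances.
            d[(c, next_id)] = (size[a] * dac + size[b] * dbc) / new_size
        del d[(a, b)]
        del size[a], size[b]
        size[next_id] = new_size
        next_id += 1
    return merges


# Toy example: four points on a line at 0, 1, 5 and 7.
points = [0.0, 1.0, 5.0, 7.0]
dist = [[abs(p - q) for q in points] for p in points]
merges = naive_average_linkage(dist)
```

Faster algorithms avoid the repeated full scan, for example by tracking nearest neighbors between steps, which is how near-quadratic running time becomes possible in practice.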
Pearson correlation; robust correlation; hierarchical clustering; R
Complex traits and other polygenic processes require coordinated gene expression. Co-expression networks model mRNA co-expression: the product of gene regulatory networks. To identify regulatory mechanisms underlying coordinated gene expression in a tissue-enriched context, ten Arabidopsis thaliana co-expression networks were constructed after manually sorting 4,566 RNA profiling datasets into aerial, flower, leaf, root, rosette, seedling, seed, shoot, whole plant, and global (all samples combined) groups. Collectively, the ten networks contained 30% of the measurable genes of Arabidopsis and were circumscribed into 5,491 modules. Modules were scrutinized for cis regulatory mechanisms putatively encoded in conserved non-coding sequences (CNSs) previously identified as remnants of a whole genome duplication event. We determined the non-random association of 1,361 unique CNSs to 1,904 co-expression network gene modules. Furthermore, the CNS elements were placed in the context of known gene regulatory networks (GRNs) by connecting 250 CNS motifs with known GRN cis elements. Our results provide support for a regulatory role of some CNS elements and suggest the functional consequences of CNS activation of co-expression in specific gene sets dispersed throughout the genome.
Lung cancer is the most common cause of cancer mortality in male and female patients in the US. Although it is clear that tobacco smoking is a major cause of lung cancer, about half of all women with lung cancer worldwide are never-smokers. Despite a declining smoking population, the incidence of non-small cell lung cancer (NSCLC), the predominant form of lung cancer, has reached epidemic proportions particularly in women. Emerging data suggest that factors other than tobacco, namely endogenous and exogenous female sex hormones, have a role in stimulating NSCLC progression. Aromatase, a key enzyme for estrogen biosynthesis, is expressed in NSCLC. Clinical data show that women with high levels of tumor aromatase (and high intratumoral estrogen) have worse survival than those with low aromatase. The present and previous studies also reveal significant expression and activity of estrogen receptors (ERα, ERβ) in both extranuclear and nuclear sites in most NSCLC. We now report further on the expression of progesterone receptor (PR) transcripts and protein in NSCLC. PR transcripts were significantly lower in cancerous as compared to non-malignant tissue. Using immunohistochemistry, expression of PR was observed in the nucleus and/or extranuclear compartments in the majority of human tumor specimens examined. Combinations of estrogen and progestins administered in vitro cooperate in promoting tumor secretion of vascular endothelial growth factor and, consequently, support tumor-associated angiogenesis. Further, dual treatment with estradiol and progestin increased the numbers of putative tumor stem/progenitor cells. Thus, ER- and/or PR-targeted therapies may offer new approaches to manage NSCLC.
Progesterone; Estrogen; Steroid hormone receptor; Non-small cell lung cancer; VEGF; Progenitor cells; Cancer stem cells; Angiogenesis
Sjögren’s syndrome is a tissue-specific autoimmune disease that affects exocrine tissues, especially salivary glands and lacrimal glands. Despite a large body of evidence gathered over the past 60 years, significant gaps still exist in our understanding of Sjögren’s syndrome. The goal of this study was to develop a database that collects and organizes gene and protein expression data from the existing literature for comparative analysis with future gene expression and proteomic studies of Sjögren’s syndrome.
To catalog the existing knowledge in the field, we used text mining to generate the Sjögren’s Syndrome Knowledge Base (SSKB) of published gene/protein data. Text mining of over 7,700 PubMed abstracts yielded approximately 500 potential genes/proteins. The raw data were manually evaluated to remove duplicates and false positives and to assign gene names. The database was manually curated to 477 entries, including 377 potential functional genes, which were used for enrichment and pathway analysis based on Gene Ontology and KEGG.
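The abstract does not say which statistical test was used for the enrichment analysis. A common choice is the one-sided hypergeometric test, sketched below in Python as an assumption rather than as the authors' method. The function name is hypothetical, and apart from the gene-list size of 377 taken from the text, all numbers in the example are made up for illustration.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """One-sided hypergeometric p-value for over-representation.

    N -- number of genes in the background
    K -- background genes annotated to the term or pathway
    n -- number of genes in the list of interest
    k -- genes in the list that are annotated to the term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Hypothetical example: 377 curated genes against a 20,000-gene background,
# with 30 of them falling in a pathway that has 200 annotated genes.
p = enrichment_pvalue(k=30, n=377, K=200, N=20000)
```

With many terms tested at once, the resulting p-values would normally be corrected for multiple testing before being interpreted.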
The Sjögren’s syndrome knowledge base (http://sskb.umn.edu) can form the foundation for an informed search of existing knowledge in the field as new potential therapeutic targets are identified by conventional or high throughput experimental techniques.