We report a systems genetics analysis of high-density lipoprotein (HDL) levels in an F2 intercross between the inbred strains CAST/EiJ and C57BL/6J. We previously showed that there are dramatic differences in HDL metabolism in a cross between these strains, and we now report a co-expression network analysis of HDL that integrates global expression data from liver and adipose with relevant metabolic traits. Using data from a total of 293 F2 intercross mice, we constructed weighted gene co-expression networks and identified modules (subnetworks) associated with HDL and clinical traits. These were examined for genes implicated in HDL levels by large human genome-wide association studies (GWAS) and for conservation between tissues and sexes in a total of 9 data sets. We identify genes that are consistently ranked high by association with HDL across the 9 data sets. We focus in particular on two genes, Wfdc2 and Hdac3, that are located in close proximity to HDL QTL peaks and for which causal testing indicates that they may affect HDL. Our results provide a rich resource for studies of complex metabolic interactions involving HDL.
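The network-construction step mentioned above can be illustrated with a minimal sketch. It assumes an unsigned network in which the adjacency between two genes is the absolute Pearson correlation raised to a soft-thresholding power. The power `beta=6`, the gene labels, and the expression values are illustrative assumptions, not values taken from the study, and module detection itself is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor(x_i, x_j)| ** beta.

    expr maps a gene label to its expression values across samples.
    beta is an assumed soft-thresholding power, not a value from the paper.
    """
    genes = list(expr)
    return {
        g: {h: 1.0 if g == h else abs(pearson(expr[g], expr[h])) ** beta
            for h in genes}
        for g in genes
    }

def connectivity(adj):
    """Sum of each gene's adjacencies to all other genes."""
    return {g: sum(w for h, w in row.items() if h != g)
            for g, row in adj.items()}

# Hypothetical toy data: three genes measured in five samples.
toy_expr = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "gene_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(connectivity(soft_threshold_adjacency(toy_expr)))
```

Raising the correlation to a power keeps every gene pair connected while pushing weak correlations toward zero, which is why highly connected genes stand out without a hard cutoff.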
doi:10.1016/j.bbalip.2011.07.014
PMCID: PMC3265689
PMID: 21807117
The molecular complexity of genetic diseases requires novel approaches to break it down into coherent biological modules. For this purpose, many disease network models have been created and analyzed. We highlight two of them, “the human diseases networks” (HDN) and “the orphan disease networks” (ODN). However, in these models, each single node represents one disease or an ambiguous group of diseases. In these cases, the notion of diseases as unique entities reduces the usefulness of network-based methods. We hypothesize that using clinical features (pathophenotypes) to define pathophenotypic connections between disease-causing genes improves our understanding of the molecular events triggered by genetic disturbances. To test this, we built a pathophenotypic similarity gene network (PSGN) and compared it with unipartite projections (based on gene-to-gene edges) similar to those used in previous network models (HDN and ODN). Unlike these disease network models, the PSGN uses semantic similarities. This pathophenotypic similarity was calculated by comparing the pathophenotypic annotations of genes (human abnormalities described by HPO terms) in the “Human Phenotype Ontology”. The resulting network contains 1075 genes (nodes) and 26197 significant pathophenotypic similarities (edges). A global analysis of this network reveals previously unnoticed pairs of genes showing significant pathophenotypic similarity, a biologically meaningful rearrangement of the pathological relationships between genes, correlations of biochemical interactions with higher similarity scores, and functional biases in metabolic and essential genes toward pathophenotypic specificity and pleiotropy, respectively. Additionally, we used the pathophenotypic similarities and metabolic interactions of genes associated with maple syrup urine disease (MSUD) to merge them into a coherent pathological module.
Our results indicate that pathophenotypes help identify underlying co-dependencies among disease-causing genes that are useful for describing disease modularity.
doi:10.1371/journal.pone.0056653
PMCID: PMC3578923
PMID: 23437198
Summary
Similarities between speech and birdsong make songbirds advantageous for investigating the neurogenetics of learned vocal communication, a complex phenotype likely supported by ensembles of interacting genes in cortico-basal ganglia pathways of both species. To date, only FoxP2 has been identified as critical to both speech and birdsong. We performed weighted gene co-expression network analysis on microarray data from singing zebra finches to discover gene ensembles regulated during vocal behavior. We found ~2,000 singing-regulated genes comprising 3 co-expression groups unique to area X, the basal ganglia subregion dedicated to learned vocalizations. These contained known targets of human FOXP2 and potential avian targets. We validated novel biological pathways for vocalization. Higher order gene co-expression patterns, rather than expression levels, molecularly distinguish area X from the ventral striato-pallidum during singing. The previously unknown structure of singing-driven networks enables prioritization of molecular interactors that likely bear on human motor disorders, especially those affecting speech.
doi:10.1016/j.neuron.2012.01.005
PMCID: PMC3278710
PMID: 22325205
Background
Co-expression measures are often used to define networks among genes. Mutual information (MI) is often used as a generalized correlation measure. It is not clear how much MI adds beyond standard (robust) correlation measures or regression model based association measures. Further, it is important to assess what transformations of these and other co-expression measures lead to biologically meaningful modules (clusters of genes).
Results
We provide a comprehensive comparison between mutual information and several correlation measures in 8 empirical data sets and in simulations. We also study different approaches for transforming an adjacency matrix, e.g. using the topological overlap measure. Overall, we confirm close relationships between MI and correlation in all data sets which reflects the fact that most gene pairs satisfy linear or monotonic relationships. We discuss rare situations when the two measures disagree. We also compare correlation and MI based approaches when it comes to defining co-expression network modules. We show that a robust measure of correlation (the biweight midcorrelation transformed via the topological overlap transformation) leads to modules that are superior to MI based modules and maximal information coefficient (MIC) based modules in terms of gene ontology enrichment. We present a function that relates correlation to mutual information which can be used to approximate the mutual information from the corresponding correlation coefficient. We propose the use of polynomial or spline regression models as an alternative to MI for capturing non-linear relationships between quantitative variables.
Conclusion
The biweight midcorrelation outperforms MI in terms of elucidating gene pairwise relationships. Coupled with the topological overlap matrix transformation, it often leads to more significantly enriched co-expression modules. Spline and polynomial networks form attractive alternatives to MI in case of non-linear relationships. Our results indicate that MI networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data.
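To make the robust correlation measure named above concrete, here is a minimal sketch of the biweight midcorrelation using its standard definition (median centering, scaling by nine times the raw median absolute deviation, and zero weight for points beyond that range). It is not the optimized implementation the authors refer to. The helper `gaussian_mi` is the textbook identity for a bivariate normal pair and is included only as an assumption-laden illustration; it is not necessarily the correlation-to-MI function the authors present.

```python
from math import log, sqrt
from statistics import median

def bicor(x, y):
    """Biweight midcorrelation of two equal-length numeric sequences."""
    def weighted_deviations(v):
        med = median(v)
        mad = median([abs(a - med) for a in v])  # raw median absolute deviation
        if mad == 0:
            raise ValueError("MAD is zero; bicor is undefined for this input")
        out = []
        for a in v:
            u = (a - med) / (9 * mad)
            # Points far from the median (|u| >= 1) receive weight 0.
            w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
            out.append((a - med) * w)
        return out

    xt, yt = weighted_deviations(x), weighted_deviations(y)
    num = sum(a * b for a, b in zip(xt, yt))
    den = sqrt(sum(a * a for a in xt) * sum(b * b for b in yt))
    return num / den

def gaussian_mi(r):
    """Mutual information in nats of a bivariate normal pair with correlation r."""
    return -0.5 * log(1 - r * r)

# Hypothetical toy data: a linear trend with one gross outlier in y.
x = list(range(1, 11))
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, -50]
print(bicor(x, y))
```

In this toy example the single outlier receives zero weight, so the estimate stays positive even though the ordinary Pearson correlation of the same data is negative.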
doi:10.1186/1471-2105-13-328
PMCID: PMC3586947
PMID: 23217028
Haas, Blake E | Horvath, Steve | Pietiläinen, Kirsi H | Cantor, Rita M | Nikkola, Elina | Weissglas-Volkov, Daphna | Rissanen, Aila | Civelek, Mete | Cruz-Bautista, Ivette | Riba, Laura | Kuusisto, Johanna | Kaprio, Jaakko | Tusie-Luna, Teresa | Laakso, Markku | Aguilar-Salinas, Carlos A | Pajukanta, Päivi
Background
A high serum triglyceride (TG) level is an established risk factor for coronary heart disease (CHD). Fat is stored in the form of TGs in human adipose tissue. We hypothesized that gene co-expression networks in human adipose tissue may be correlated with serum TG levels and help reveal novel genes involved in TG regulation.
Methods
Gene co-expression networks were constructed from two Finnish and one Mexican study sample using the blockwiseModules R function in Weighted Gene Co-expression Network Analysis (WGCNA). The overlap between TG-associated networks from each of the three study samples was calculated using Fisher’s exact test. Gene ontology was used to determine known pathways enriched in each TG-associated network.
Results
We measured gene expression in adipose samples from two Finnish and one Mexican study sample. In each study sample, we observed a gene co-expression network that was significantly associated with serum TG levels. The TG modules observed in Finns and Mexicans significantly overlapped and shared 34 genes. Seven of the 34 genes (ARHGAP30, CCR1, CXCL16, FERMT3, HCST, RNASET2, SELPG) were identified as the key hub genes of all three TG modules. Furthermore, two of the 34 genes (ARHGAP9, LST1) reside in previous TG GWAS regions, suggesting them as the regional candidates underlying the GWAS signals.
Conclusions
This study presents a novel adipose gene co-expression network with 34 genes significantly correlated with serum TG across populations.
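The module-overlap test described in the Methods can be sketched as a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. This is a minimal illustration under assumptions: the abstract does not say whether the test was one-sided or which gene universe was used, and the function name and all numbers in the example call are hypothetical.

```python
from math import comb

def overlap_p_value(universe, size_a, size_b, overlap):
    """Probability of seeing at least `overlap` shared genes by chance.

    universe: number of genes that could appear in either module
    size_a, size_b: number of genes in each module
    overlap: observed number of genes present in both modules
    """
    upper = min(size_a, size_b)
    # Exact integer arithmetic for the tail, with one division at the end.
    tail = sum(comb(size_a, k) * comb(universe - size_a, size_b - k)
               for k in range(overlap, upper + 1))
    return tail / comb(universe, size_b)

# Hypothetical example: two modules of 50 and 40 genes drawn from
# 1,000 expressed genes, sharing 10 genes.
print(overlap_p_value(1000, 50, 40, 10))
```

Summing the tail as integers before dividing avoids the loss of precision that can occur when many very small probabilities are added as floats.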
doi:10.1186/1755-8794-5-61
PMCID: PMC3543280
PMID: 23217153
Mexicans; Finns; RNA sequencing; Triglycerides; Adipose tissue; Weighted gene co-expression network analysis
Kumari, Sapna | Nie, Jeff | Chen, Huann-Sheng | Ma, Hao | Stewart, Ron | Li, Xiang | Lu, Meng-Zhu | Taylor, William M. | Wei, Hairong | Horvath, Steve
Background
Constructing coexpression networks and performing network analysis using large-scale gene expression data sets is an effective way to uncover new biological knowledge; however, the methods used for gene association in constructing these coexpression networks have not been thoroughly evaluated. Since different methods lead to structurally different coexpression networks and provide different information, selecting the optimal gene association method is critical.
Methods and Results
In this study, we compared eight gene association methods (Spearman rank correlation, Weighted Rank Correlation, Kendall, Hoeffding's D measure, Theil-Sen, Rank Theil-Sen, Distance Covariance, and Pearson) and focused on their true knowledge discovery rates in associating pathway genes and in constructing coordination networks of regulatory genes. We also examined how the different methods behave on microarray data with different properties, and whether the biological processes involved affect the efficiency of the different methods.
Conclusions
We found that the Spearman, Hoeffding and Kendall methods are effective in identifying coexpressed pathway genes, whereas the Theil-Sen, Rank Theil-Sen, Spearman, and Weighted Rank methods perform well in identifying coordinated transcription factors that control the same biological processes and traits. Surprisingly, the widely used Pearson method is generally less efficient, and so is the Distance Covariance method, which can find gene pairs with multiple types of relationships. Some of our analyses clearly show that the Pearson and Distance Covariance methods behave differently from the other six methods. The efficiencies of the different methods vary with the data properties to some degree and are largely contingent upon the biological processes, which necessitates a pre-analysis to identify the best-performing method for gene association and coexpression network construction.
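One reason rank-based and linear measures can disagree, as reported above, is that a relationship may be monotonic without being linear. The following minimal sketch compares Pearson and Spearman correlation on such data. It is an illustration only: the toy values are hypothetical, and the sketch does not reproduce the authors' evaluation pipeline or the other six methods.

```python
from math import exp, sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: y grows exponentially with x (monotonic, non-linear).
x = [0, 1, 2, 3, 4, 5]
y = [exp(v) for v in x]
print(pearson(x, y), spearman(x, y))
```

Because the ranks of `x` and `y` are identical here, the Spearman coefficient reaches 1 while the Pearson coefficient is pulled down by the curvature.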
doi:10.1371/journal.pone.0050411
PMCID: PMC3511551
PMID: 23226279
van Eijk, Kristel R | de Jong, Simone | Boks, Marco PM | Langeveld, Terry | Colas, Fabrice | Veldink, Jan H | de Kovel, Carolien GF | Janson, Esther | Strengman, Eric | Langfelder, Peter | Kahn, René S | van den Berg, Leonard H | Horvath, Steve | Ophoff, Roel A
Background
The predominant model for regulation of gene expression through DNA methylation is an inverse association in which increased methylation results in decreased gene expression levels. However, recent studies suggest that the relationship between genetic variation, DNA methylation and expression is more complex.
Results
Systems genetic approaches for examining relationships between gene expression and methylation array data were used to find both negative and positive associations between these levels. A weighted correlation network analysis revealed that i) both transcriptome and methylome are organized in modules, ii) co-expression modules are generally not preserved in the methylation data and vice-versa, and iii) highly significant correlations exist between co-expression and co-methylation modules, suggesting the existence of factors that affect expression and methylation of different modules (i.e., trans effects at the level of modules). We observed that methylation probes associated with expression in cis were more likely to be located outside CpG islands, whereas specificity for CpG island shores was present when methylation, associated with expression, was under local genetic control. A structural equation model based analysis found strong support in particular for a traditional causal model in which gene expression is regulated by genetic variation via DNA methylation instead of gene expression affecting DNA methylation levels.
Conclusions
Our results provide new insights into the complex mechanisms between genetic markers, epigenetic mechanisms and gene expression. We find strong support for the classical model of genetic variants regulating methylation, which in turn regulates gene expression. Moreover we show that, although the methylation and expression modules differ, they are highly correlated.
doi:10.1186/1471-2164-13-636
PMCID: PMC3583143
PMID: 23157493
DNA methylation; Gene expression; Association; Epigenetics; WGCNA
Both avian and mammalian basal ganglia are involved in voluntary motor control. In birds, such movements include hopping, perching and flying. Two organizational features that distinguish the songbird basal ganglia are that striatal and pallidal neurons are intermingled, and that neurons dedicated to vocal-motor function are clustered together in a dense cell group known as area X that sits within the surrounding striato-pallidum. This specification allowed us to perform molecular profiling of two striato-pallidal subregions, comparing transcriptional patterns in tissue dedicated to vocal-motor function (area X) to those in tissue that contains similar cell types but supports non-vocal behaviors: the striato-pallidum ventral to area X (VSP), our focus here. Since any behavior is likely underpinned by the coordinated actions of many molecules, we constructed gene co-expression networks from microarray data to study large-scale transcriptional patterns in both subregions. Our goal was to investigate any relationship between VSP network structure and singing and identify gene co-expression groups, or modules, found in the VSP but not area X. We observed mild, but surprising, relationships between VSP modules and song spectral features, and found a group of four VSP modules that were highly specific to the region. These modules were unrelated to singing, but were composed of genes involved in many of the same biological processes as those we previously observed in area X-specific singing-related modules. The VSP-specific modules were also enriched for processes disrupted in Parkinson's and Huntington's Diseases. Our results suggest that the activation/inhibition of a single pathway is not sufficient to functionally specify area X versus the VSP and support the notion that molecular processes are not in and of themselves specialized for behavior. 
Instead, unique interactions between molecular pathways create functional specificity in particular brain regions during distinct behavioral states.
Author Summary
Understanding how gene transcription relates to behavior is challenging. Learned vocal-motor behavior is a complex trait that represents the output of multiple converging genes, pathways, and patterns of neural activity. Here, we applied a systems analytical approach to determine how thousands of genes change their expression levels simultaneously in a region of the vertebrate brain important for vocal-motor function, the basal ganglia, during a specific vocal-motor behavior, singing. We used the zebra finch species of songbird based on similarities between song learning/production and speech, and because they possess a set of brain subregions dedicated to singing. Microarrays were used to measure gene expression levels in one such song-dedicated region and in an adjacent motor area that is not thought to play a role in vocal function. This allowed us to address the question of whether distinct gene co-expression patterns could be found in each area. We found that each area contained unique patterns of transcriptional co-activity, but there were also unexpected overlaps. We conclude that the particular behaviors (singing versus non-vocal behaviors) supported by these subregions depend on the particular sets of interactions between molecular pathways that occur in each subregion.
doi:10.1371/journal.pcbi.1002773
PMCID: PMC3493463
PMID: 23144607
Mah, Vei | Marquez, Diana | Alavi, Mohammad | Maresh, Erin L. | Zhang, Li | Yoon, Nam | Horvath, Steve | Bagryanova, Lora | Fishbein, Michael C. | Chia, David | Pietras, Richard | Goodglick, Lee
Estrogen signaling pathways may play a significant role in the pathogenesis of non-small cell lung cancers (NSCLC) as evidenced by the expression of aromatase and estrogen receptors (ERα and ERβ) in many of these tumors. Here we examine whether ERα and ERβ levels in conjunction with aromatase define patient groups with respect to survival outcomes and possible treatment regimens. Immunohistochemistry was performed on a high-density tissue microarray with resulting data and clinical information available for 377 patients. Patients were subdivided by gender, age and tumor histology, and survival was analyzed using the Cox proportional hazards model and Kaplan-Meier curves. Neither ERα nor ERβ alone was a predictor of survival in NSCLC. However, when coupled with aromatase expression, higher ERβ levels predicted worse survival in patients whose tumors expressed higher levels of aromatase. Although this finding was present in patients of both genders, it was especially pronounced in women ≥ 65 years old, where higher expression of both ERβ and aromatase indicated a markedly worse survival rate than that determined by aromatase alone. Conclusion: Expression of ERβ together with aromatase has predictive value for survival in different gender and age subgroups of NSCLC patients. This predictive value is stronger than that of either marker alone. Our results suggest treatment with aromatase inhibitors alone or combined with estrogen receptor modulators may be of benefit in some subpopulations of these patients.
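For readers unfamiliar with the Kaplan-Meier curves mentioned above, the estimator can be sketched in a few lines: at each time with at least one event, the running survival probability is multiplied by one minus the fraction of at-risk patients who had the event. This is a generic textbook sketch with hypothetical follow-up data, not the authors' analysis code, and it does not cover the Cox model.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up time for each patient
    events: 1 if the event (death) was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        at_risk -= leaving
    return curve

# Hypothetical follow-up times in months; 0 marks a censored patient.
print(kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0]))
```

Censored patients never lower the curve directly, but they shrink the risk set, so later events carry more weight.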
doi:10.1016/j.lungcan.2011.03.009
PMCID: PMC3175023
PMID: 21511357
NSCLC; tissue microarray; aromatase; estrogen receptor; immunohistochemistry; prognosis
It has been debated whether human induced pluripotent stem cells (iPSCs) and embryonic stem cells (ESCs) express distinctive transcriptomes. By using the method of weighted gene co-expression network analysis, we showed here that iPSCs exhibit altered functional modules compared with ESCs. Notably, iPSCs and ESCs differentially express 17 modules that primarily function in transcription, metabolism, development, and immune response. These module activations (up- and downregulation) are highly conserved in a variety of iPSCs, and genes in each module are coherently co-expressed. Furthermore, the activation levels of these modular genes can be used as quantitative variables to discriminate iPSCs and ESCs with high accuracy (96%). Thus, differential activations of these functional modules are the conserved features distinguishing iPSCs from ESCs. Strikingly, the overall activation level of these modules is inversely correlated with the DNA methylation level, suggesting that DNA methylation may be one mechanism regulating the module differences. Overall, we conclude that human iPSCs and ESCs exhibit distinct gene expression networks, which are likely associated with different epigenetic reprogramming events during the derivation of iPSCs and ESCs.
doi:10.1089/scd.2010.0574
PMCID: PMC3202894
PMID: 21542696
Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied to data sets with a relatively small number of missing values. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve further speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA.
The hierarchical clustering algorithm implemented in R function hclust is an order n^3 (n is the number of clustered objects) version of a publicly available clustering algorithm (Murtagh 2012). We present the package flashClust that implements the original algorithm which in practice achieves order approximately n^2, leading to substantial time savings when clustering large data sets.
PMCID: PMC3465711
PMID: 23050260
Pearson correlation; robust correlation; hierarchical clustering; R
Complex traits and other polygenic processes require coordinated gene expression. Co-expression networks model mRNA co-expression: the product of gene regulatory networks. To identify regulatory mechanisms underlying coordinated gene expression in a tissue-enriched context, ten Arabidopsis thaliana co-expression networks were constructed after manually sorting 4,566 RNA profiling datasets into aerial, flower, leaf, root, rosette, seedling, seed, shoot, whole plant, and global (all samples combined) groups. Collectively, the ten networks contained 30% of the measurable genes of Arabidopsis and were circumscribed into 5,491 modules. Modules were scrutinized for cis regulatory mechanisms putatively encoded in conserved non-coding sequences (CNSs) previously identified as remnants of a whole genome duplication event. We determined the non-random association of 1,361 unique CNSs to 1,904 co-expression network gene modules. Furthermore, the CNS elements were placed in the context of known gene regulatory networks (GRNs) by connecting 250 CNS motifs with known GRN cis elements. Our results provide support for a regulatory role of some CNS elements and suggest the functional consequences of CNS activation of co-expression in specific gene sets dispersed throughout the genome.
doi:10.1371/journal.pone.0045041
PMCID: PMC3443200
PMID: 23024789
Marquez-Garban, Diana C. | Mah, Vei | Alavi, Mohammad | Maresh, Erin L. | Chen, Hsiao-Wang | Bagryanova, Lora | Horvath, Steve | Chia, David | Garon, Edward | Goodglick, Lee | Pietras, Richard J.
Lung cancer is the most common cause of cancer mortality in male and female patients in the US. Although it is clear that tobacco smoking is a major cause of lung cancer, about half of all women with lung cancer worldwide are never-smokers. Despite a declining smoking population, the incidence of non-small cell lung cancer (NSCLC), the predominant form of lung cancer, has reached epidemic proportions particularly in women. Emerging data suggest that factors other than tobacco, namely endogenous and exogenous female sex hormones, have a role in stimulating NSCLC progression. Aromatase, a key enzyme for estrogen biosynthesis, is expressed in NSCLC. Clinical data show that women with high levels of tumor aromatase (and high intratumoral estrogen) have worse survival than those with low aromatase. The present and previous studies also reveal significant expression and activity of estrogen receptors (ERα, ERβ) in both extranuclear and nuclear sites in most NSCLC. We now report further on the expression of progesterone receptor (PR) transcripts and protein in NSCLC. PR transcripts were significantly lower in cancerous as compared to non-malignant tissue. Using immunohistochemistry, expression of PR was observed in the nucleus and/or extranuclear compartments in the majority of human tumor specimens examined. Combinations of estrogen and progestins administered in vitro cooperate in promoting tumor secretion of vascular endothelial growth factor and, consequently, support tumor-associated angiogenesis. Further, dual treatment with estradiol and progestin increased the numbers of putative tumor stem/progenitor cells. Thus, ER- and/or PR-targeted therapies may offer new approaches to manage NSCLC.
doi:10.1016/j.steroids.2011.04.015
PMCID: PMC3129425
PMID: 21600232
Progesterone; Estrogen; Steroid hormone receptor; Non-small cell lung cancer; VEGF; Progenitor cells; Cancer stem cells; Angiogenesis
Background
Sjögren’s syndrome is a tissue-specific autoimmune disease that affects exocrine tissues, especially salivary glands and lacrimal glands. Despite a large body of evidence gathered over the past 60 years, significant gaps still exist in our understanding of Sjögren’s syndrome. The goal of this study was to develop a database that collects and organizes gene and protein expression data from the existing literature for comparative analysis with future gene expression and proteomic studies of Sjögren’s syndrome.
Description
To catalog the existing knowledge in the field, we generated the Sjögren’s Syndrome Knowledge Base (SSKB) of published gene/protein data, which were extracted from PubMed by text mining of over 7,700 abstracts, yielding approximately 500 potential genes/proteins. The raw data were manually evaluated to remove duplicates and false positives and to assign gene names. The database was manually curated to 477 entries, including 377 potential functional genes, which were used for enrichment and pathway analysis using gene ontology and KEGG pathway analysis.
Conclusions
The Sjögren’s syndrome knowledge base (http://sskb.umn.edu) can form the foundation for an informed search of existing knowledge in the field as new potential therapeutic targets are identified by conventional or high throughput experimental techniques.
doi:10.1186/1471-2474-13-119
PMCID: PMC3495204
PMID: 22759918
The hallmarks of many haematological malignancies and solid tumours are chromosomal translocations, which may lead to gene fusions. Recently, next-generation sequencing techniques at the transcriptome level (RNA-Seq) have been used to verify known and discover novel transcribed gene fusions. We present FusionFinder, a Perl-based software tool designed to automate the discovery of candidate gene fusion partners from single-end (SE) or paired-end (PE) RNA-Seq read data. FusionFinder was applied to data from a previously published analysis of the K562 chronic myeloid leukaemia (CML) cell line. Using FusionFinder we successfully replicated the findings of this study and detected additional previously unreported fusion genes in their dataset, which were confirmed experimentally. These included two isoforms of a fusion involving the genes BRK1 and VHL, whose co-deletion has previously been associated with the prevalence and severity of renal-cell carcinoma. FusionFinder is made freely available for non-commercial use and can be downloaded from the project website (http://bioinformatics.childhealthresearch.org.au/software/fusionfinder/).
doi:10.1371/journal.pone.0039987
PMCID: PMC3384600
PMID: 22761941
de Jong, Simone | Boks, Marco P. M. | Fuller, Tova F. | Strengman, Eric | Janson, Esther | de Kovel, Carolien G. F. | Ori, Anil P. S. | Vi, Nancy | Mulder, Flip | Blom, Jan Dirk | Glenthøj, Birte | Schubart, Chris D. | Cahn, Wiepke | Kahn, René S. | Horvath, Steve | Ophoff, Roel A. | Mazza, Marianna
Despite large-scale genome-wide association studies (GWAS), the underlying genes for schizophrenia are largely unknown. Additional approaches are therefore required to identify the genetic background of this disorder. Here we report findings from a large gene expression study in peripheral blood of schizophrenia patients and controls. We applied a systems biology approach to genome-wide expression data from whole blood of 92 medicated and 29 antipsychotic-free schizophrenia patients and 118 healthy controls. We show that gene expression profiling in whole blood can identify twelve large gene co-expression modules associated with schizophrenia. Several of these disease-related modules are likely to reflect expression changes due to antipsychotic medication. However, two of the disease modules could be replicated in an independent second data set involving antipsychotic-free patients and controls. One of these robustly defined disease modules is significantly enriched with brain-expressed genes and with genetic variants that were implicated in a GWAS, which could imply a causal role in schizophrenia etiology. The most highly connected intramodular hub gene in this module, ABCF1, is located in, and regulated by, the major histocompatibility complex (MHC), which is intriguing in light of the fact that common allelic variants from the MHC region have been implicated in schizophrenia. This suggests that the MHC increases schizophrenia susceptibility via altered gene expression of regulatory genes in this network.
doi:10.1371/journal.pone.0039498
PMCID: PMC3384650
PMID: 22761806
Background
Genomic datasets generated by new technologies are increasingly prevalent in disparate areas of biological research. While many studies have sought to characterize relationships among genomic features, commensurate efforts to characterize relationships among biological samples have been less common. Consequently, the full extent of sample variation in genomic studies is often under-appreciated, complicating downstream analytical tasks such as gene co-expression network analysis.
Results
Here we demonstrate the use of network methods for characterizing sample relationships in microarray data generated from human brain tissue. We describe an approach for identifying outlying samples that does not depend on the choice or use of clustering algorithms. We introduce a battery of measures for quantifying the consistency and integrity of sample relationships, which can be compared across disparate studies, technology platforms, and biological systems. Among these measures, we provide evidence that the correlation between the connectivity and the clustering coefficient (two important network concepts) is a sensitive indicator of homogeneity among biological samples. We also show that this measure, which we refer to as cor(K,C), can distinguish biologically meaningful relationships among subgroups of samples. Specifically, we find that cor(K,C) reveals the profound effect of Huntington’s disease on samples from the caudate nucleus relative to other brain regions. Furthermore, we find that this effect is concentrated in specific modules of genes that are naturally co-expressed in human caudate nucleus, highlighting a new strategy for exploring the effects of disease on sets of genes.
Conclusions
These results underscore the importance of systematically exploring sample relationships in large genomic datasets before seeking to analyze genomic feature activity. We introduce a standardized platform for this purpose using freely available R software that has been designed to enable iterative and interactive exploration of sample networks.
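The cor(K,C) measure described in the Results can be sketched as follows. Samples are treated as network nodes, K is each sample's connectivity, and C is its weighted clustering coefficient. This is a minimal sketch under stated assumptions: the adjacency `((1 + cor) / 2) ** 2` is an assumed signed-network choice, the sample labels and values are hypothetical, and the code is not the R software the authors introduce.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def sample_adjacency(profiles):
    """Signed adjacency between samples, with a zero diagonal (assumed form)."""
    n = len(profiles)
    return [[0.0 if i == j
             else ((1 + pearson(profiles[i], profiles[j])) / 2) ** 2
             for j in range(n)] for i in range(n)]

def connectivity_and_clustering(adj):
    """Connectivity K_i and weighted clustering coefficient C_i per node."""
    n = len(adj)
    K, C = [], []
    for i in range(n):
        k = sum(adj[i])
        # The zero diagonal removes terms where j == i, l == i, or j == l.
        triangles = sum(adj[i][j] * adj[j][l] * adj[l][i]
                        for j in range(n) for l in range(n))
        denom = k * k - sum(a * a for a in adj[i])
        K.append(k)
        C.append(triangles / denom)
    return K, C

def cor_k_c(profiles):
    """Correlation between connectivity and clustering coefficient."""
    K, C = connectivity_and_clustering(sample_adjacency(profiles))
    return pearson(K, C)

# Hypothetical toy data: three similar samples and one outlying sample,
# each measured on five genes.
toy_samples = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.0, 3.2, 3.9, 5.1],
    [0.9, 2.2, 2.9, 4.1, 4.8],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
print(cor_k_c(toy_samples))
```

At least three samples are required, and each sample needs two or more non-zero adjacencies for the clustering coefficient to be defined.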
doi:10.1186/1752-0509-6-63
PMCID: PMC3441531
PMID: 22691535
Sample networks; Sample network analysis; Huntington’s disease; Clustering coefficient; cor(K,C); Standardized C(k) curve; Data pre-processing; Microarrays; Gene expression
A critical step in detecting variants from next-generation sequencing data is post hoc filtering of putative variants called or predicted by computational tools. Here, we highlight four critical parameters that could enhance the accuracy of called single nucleotide variants and insertions/deletions: quality and depth, refinement and improvement of initial mapping, allele/strand balance, and examination of spurious genes. Appropriate use of these sequence features in variant filtering could greatly improve validation rates, thereby saving time and costs in next-generation sequencing projects.
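A post hoc filter built on the parameters listed above might look like the following minimal sketch. Everything specific in it is an assumption made for illustration: the `VariantCall` fields, the threshold defaults, and the flagged gene name are hypothetical and are not recommendations from the article. The mapping-refinement step is not represented, since it happens before variants are called.

```python
from dataclasses import dataclass

@dataclass
class VariantCall:
    gene: str        # gene the variant falls in
    qual: float      # variant quality score reported by the caller
    ref_reads: int   # reads supporting the reference allele
    alt_fwd: int     # alternate-allele reads on the forward strand
    alt_rev: int     # alternate-allele reads on the reverse strand

def passes_filters(v, min_qual=30.0, min_depth=10, min_alt_fraction=0.2,
                   min_strand_fraction=0.1, flagged_genes=frozenset()):
    """Return True if the call survives all filters (thresholds are assumed)."""
    alt = v.alt_fwd + v.alt_rev
    depth = v.ref_reads + alt
    if v.qual < min_qual or depth < min_depth:
        return False                                  # quality and depth
    if alt == 0 or alt / depth < min_alt_fraction:
        return False                                  # allele balance
    if min(v.alt_fwd, v.alt_rev) / alt < min_strand_fraction:
        return False                                  # strand balance
    if v.gene in flagged_genes:
        return False                                  # recurrently spurious genes
    return True

# Hypothetical calls: the second is supported by one strand only.
calls = [
    VariantCall("GENE1", 60.0, 20, 9, 11),
    VariantCall("GENE2", 60.0, 20, 20, 0),
]
print([passes_filters(c) for c in calls])
```

Keeping each criterion as a separate early return makes it easy to log which filter rejected a call when validation rates are later reviewed.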
doi:10.1371/journal.pone.0038470
PMCID: PMC3371040
PMID: 22715385
Background
Pathway analysis of a set of genes represents an important area in large-scale omic data analysis. However, the application of traditional pathway enrichment methods to next-generation sequencing (NGS) data is prone to several potential biases, including genomic/genetic factors (e.g., the particular disease and gene length) and environmental factors (e.g., personal life-style and frequency and dosage of exposure to mutagens). Therefore, novel methods are urgently needed for these new data types, especially for individual-specific genome data.
Methodology
In this study, we proposed a novel method for the pathway analysis of NGS mutation data by explicitly taking into account the gene-wise mutation rate. We estimated the gene-wise mutation rate based on the individual-specific background mutation rate along with the gene length. Taking the mutation rate as a weight for each gene, our weighted resampling strategy builds the null distribution for each pathway while matching the gene length patterns. The empirical P value obtained then provides an adjusted statistical evaluation.
Principal Findings/Conclusions
We applied our weighted resampling method to a lung adenocarcinoma dataset and a glioblastoma dataset, and compared it with other widely applied methods. By explicitly adjusting for gene length, the weighted resampling method performs as well as the standard methods for significant pathways with strong evidence. Importantly, our method could effectively reject many marginally significant pathways detected by standard methods, including several long-gene-based, cancer-unrelated pathways. We further demonstrated that by reducing such biases, pathway crosstalk for each individual and pathway co-mutation maps across multiple individuals can be objectively explored and evaluated. This method performs pathway analysis in a sample-centered fashion, and provides an alternative way for accurate analysis of cancer-personalized genomes. It can be extended to other types of genomic data (genotyping and methylation) that have similar bias problems.
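The weighted resampling idea in the Methodology paragraph can be sketched as follows. This is one plausible reading of the abstract, not the authors' implementation: the weight formula (probability that a gene of a given length carries at least one mutation at the sample's background rate), the overlap-count statistic, the weighted sampling scheme, and every function and parameter name are assumptions introduced for illustration.

```python
import random


def gene_weights(gene_lengths, background_rate):
    """Assumed gene-wise mutation probability: 1 - (1 - rate) ** length."""
    return {
        gene: 1.0 - (1.0 - background_rate) ** length
        for gene, length in gene_lengths.items()
    }


def weighted_sample(weights, k, rng):
    """Draw k distinct genes with probability proportional to weight.

    Uses the Efraimidis-Spirakis key trick: each gene gets the key
    u ** (1 / w) with u uniform in [0, 1), and the k largest keys win.
    All weights must be positive.
    """
    keys = {gene: rng.random() ** (1.0 / w) for gene, w in weights.items()}
    return sorted(keys, key=keys.get, reverse=True)[:k]


def empirical_p(pathway, mutated, weights, n_resamples=1000, seed=0):
    """Empirical P value for the overlap between a pathway and the
    mutated genes of one sample, under the weighted null distribution."""
    rng = random.Random(seed)
    observed = len(pathway & mutated)
    hits = 0
    for _ in range(n_resamples):
        null_genes = set(weighted_sample(weights, len(mutated), rng))
        if len(pathway & null_genes) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

Because long genes receive larger weights, they appear more often in the resampled gene sets, so a pathway dominated by long genes needs a larger observed overlap to reach a small P value. This is the kind of gene-length adjustment the abstract describes.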
doi:10.1371/journal.pone.0037595
PMCID: PMC3356304
PMID: 22624051
Flachner, Beáta | Lörincz, Zsolt | Carotti, Angelo | Nicolotti, Orazio | Kuchipudi, Praveena | Remez, Nikita | Sanz, Ferran | Tóvári, József | Szabó, Miklós J. | Bertók, Béla | Cseh, Sándor | Mestres, Jordi | Dormán, György | Horvath, Steve
A novel chemocentric approach to identifying cancer-relevant targets is introduced. Starting with a large chemical collection, the strategy uses the list of small molecule hits arising from a differential cytotoxicity screening on tumor HCT116 and normal MRC-5 cell lines to identify proteins associated with cancer emerging from a differential virtual target profiling of the most selective compounds detected in both cell lines. It is shown that this smart combination of differential in vitro and in silico screenings (DIVISS) is capable of detecting a list of proteins that are already well accepted cancer drug targets, while complementing it with additional proteins that, targeted selectively or in combination with others, could lead to synergistic benefits for cancer therapeutics. The complete list of 115 proteins identified as being hit uniquely by compounds showing selective antiproliferative effects for tumor cell lines is provided.
doi:10.1371/journal.pone.0035582
PMCID: PMC3338416
PMID: 22558171
Diekstra, Frank P. | Saris, Christiaan G. J. | van Rheenen, Wouter | Franke, Lude | Jansen, Ritsert C. | van Es, Michael A. | van Vught, Paul W. J. | Blauw, Hylke M. | Groen, Ewout J. N. | Horvath, Steve | Estrada, Karol | Rivadeneira, Fernando | Hofman, Albert | Uitterlinden, Andre G. | Robberecht, Wim | Andersen, Peter M. | Melki, Judith | Meininger, Vincent | Hardiman, Orla | Landers, John E. | Brown, Robert H. | Shatunov, Aleksey | Shaw, Christopher E. | Leigh, P. Nigel | Al-Chalabi, Ammar | Ophoff, Roel A. | van den Berg, Leonard H. | Veldink, Jan H. | van der Brug, Marcel P.
Amyotrophic lateral sclerosis (ALS) is a progressive, neurodegenerative disease characterized by loss of upper and lower motor neurons. ALS is considered to be a complex trait and genome-wide association studies (GWAS) have implicated a few susceptibility loci. However, many more causal loci remain to be discovered. Since it has been shown that genetic variants associated with complex traits are more likely to be eQTLs than frequency-matched variants from GWAS platforms, we conducted a two-stage genome-wide screening for eQTLs associated with ALS. In addition, we applied an eQTL analysis to fine-map association loci. Expression profiles using peripheral blood of 323 sporadic ALS patients and 413 controls were mapped to genome-wide genotyping data. Subsequently, data from a two-stage GWAS (3,568 patients and 10,163 controls) were used to prioritize eQTLs identified in the first stage (162 ALS, 207 controls). These prioritized eQTLs were carried forward to the second sample with both gene-expression and genotyping data (161 ALS, 206 controls). Replicated eQTL SNPs were then tested for association in the second-stage GWAS data to find disease-associated SNPs that survived correction for multiple testing. We thus identified twelve cis eQTLs with nominally significant associations in the second-stage GWAS data. Eight SNP-transcript pairs of highest significance (lowest p = 1.27×10^-51) withstood multiple-testing correction in the second stage and modulated CYP27A1 gene expression. Additionally, we show that C9orf72 appears to be the only gene in the 9p21.2 locus that is regulated in cis, showing the potential of this approach in identifying causative genes in association loci in ALS. This study has identified candidate genes for sporadic ALS, most notably CYP27A1. Mutations in CYP27A1 cause cerebrotendinous xanthomatosis, which can present as a clinical mimic of ALS with progressive upper motor neuron loss, making it a plausible susceptibility gene for ALS.
doi:10.1371/journal.pone.0035333
PMCID: PMC3324559
PMID: 22509407
Major depression is nearly twice as prevalent in women as in men. In bipolar disorder, depressive episodes have been reported to be more common amongst female patients. Furthermore, periods of depression often correlate with periods of hormonal fluctuations. Many studies have therefore suggested a link between hormone signaling and these mood disorders. Estrogen, one of the primary female sex hormones, mediates its effect mostly by binding to estrogen receptors (ERs). Nuclear ERs function as transcription factors and regulate gene transcription by binding to specific DNA sequences. A nucleotide change in the binding sequence might alter the binding efficiency, which could affect transcription levels of nearby genes. In order to investigate whether variation in ER DNA-binding sequences may be involved in mood disorders, we conducted a genome-wide study of ER DNA-binding in patients diagnosed with major depression or bipolar disorder. Association studies were performed within each gender separately and the results were corrected for multiple testing by the Bonferroni method. In the female bipolar disorder sample, a significant association was found for rs6023059 (corrected p-value = 0.023; odds ratio (OR) 0.681, 95% confidence interval (CI) 0.570–0.814), a single nucleotide polymorphism (SNP) located downstream of the gene coding for transglutaminase 2 (TGM2). Thus, females with a specific genotype at this SNP may be more vulnerable to fluctuating estrogen levels, which may then act as a triggering factor for bipolar disorder.
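For readers unfamiliar with the Bonferroni method mentioned above, a minimal sketch follows: each raw p-value is multiplied by the number of tests and capped at 1. The function name and the example values are hypothetical; the abstract does not state how many tests were corrected for.

```python
def bonferroni(p_values):
    """Bonferroni-corrected p-values: min(1, p * number_of_tests)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from three association tests.
corrected = bonferroni([0.01, 0.04, 0.5])
```

A test is then declared significant when its corrected p-value falls below the chosen threshold (for example 0.05), which is equivalent to comparing the raw p-value with the threshold divided by the number of tests.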
doi:10.1371/journal.pone.0032304
PMCID: PMC3289647
PMID: 22389694
Symptoms of Major Depressive Disorder (MDD) are hypothesized to arise from dysfunction in brain networks linking the limbic system and cortical regions. Alterations in brain functional cortical connectivity in resting-state networks have been detected with functional imaging techniques, but neurophysiologic connectivity measures have not been systematically examined. We used weighted network analysis to examine resting state functional connectivity as measured by quantitative electroencephalographic (qEEG) coherence in 121 unmedicated subjects with MDD and 37 healthy controls. Subjects with MDD had significantly higher overall coherence as compared to controls in the delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), and beta (12–20 Hz) frequency bands. The frontopolar region contained the greatest number of “hub nodes” (surface recording locations) with high connectivity. MDD subjects expressed higher theta and alpha coherence primarily in longer distance connections between frontopolar and temporal or parietooccipital regions, and higher beta coherence primarily in connections within and between electrodes overlying the dorsolateral prefrontal cortical (DLPFC) or temporal regions. Nearest centroid analysis indicated that MDD subjects were best characterized by six alpha band connections primarily involving the prefrontal region. The present findings indicate a loss of selectivity in resting functional connectivity in MDD. The overall greater coherence observed in depressed subjects establishes a new context for the interpretation of previous studies showing differences in frontal alpha power and synchrony between subjects with MDD and normal controls. These results can inform the development of qEEG state and trait biomarkers for MDD.
doi:10.1371/journal.pone.0032508
PMCID: PMC3286480
PMID: 22384265
Colorectal cancer (CRC) is one of the leading malignant cancers with a rapid increase in incidence and mortality. Recurrence of CRC after curative resection is sometimes unavoidable and often takes place within the first year after surgery. MicroRNAs may serve as biomarkers to predict early recurrence of CRC, but identifying them from over 1,400 known human microRNAs is challenging and costly. An alternative approach is to analyze existing expression data of messenger RNAs (mRNAs) because, generally speaking, the expression levels of microRNAs and their target mRNAs are inversely correlated. In this study, we extracted six mRNA expression datasets of CRC from four studies (GSE12032, GSE17538, GSE4526 and GSE17181) in the Gene Expression Omnibus (GEO). We inferred microRNA expression profiles and performed computational analysis to identify microRNAs associated with CRC recurrence using the IMRE method based on the MicroCosm database, which includes 568,071 microRNA-target connections between 711 microRNAs and 20,884 gene targets. Two microRNAs, miR-29a and miR-29c, were identified, and further meta-analysis of the six mRNA expression datasets showed that these two microRNAs were highly significant based on the Fisher p-value combination (p = 9.14×10^-9 for miR-29a and p = 1.14×10^-6 for miR-29c). Furthermore, these two microRNAs were experimentally tested in 78 human CRC samples to validate their effect on early recurrence. Our empirical results showed that the two microRNAs were significantly down-regulated (p = 0.007 for miR-29a and p = 0.007 for miR-29c) in the early-recurrence patients. This study shows the feasibility of using mRNA profiles to identify candidate microRNAs. We also show that miR-29a/c could be potential biomarkers for early CRC recurrence.
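The Fisher p-value combination used in the meta-analysis above can be illustrated with a short sketch. Fisher's method sums -2 ln(p) over k independent p-values and compares the result with a chi-square distribution with 2k degrees of freedom. The function name is an assumption for this example, and the code is a generic implementation of the method rather than the authors' analysis pipeline; it uses the closed-form chi-square tail for even degrees of freedom so that no external library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even
    degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)**i / i!.
    All p-values must be strictly greater than 0.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single p-value the function returns that p-value unchanged, and several moderately small p-values pointing in the same direction combine into a much smaller one, which is how evidence from six datasets can yield a highly significant combined result.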
doi:10.1371/journal.pone.0031587
PMCID: PMC3278467
PMID: 22348113
Cell
2009;136(2):364-377.
Summary
Induced pluripotent stem (iPS) cells can be obtained from fibroblasts upon expression of Oct4, Sox2, Klf4 and c-Myc. To understand how these factors induce pluripotency, we carried out genome-wide analyses of their promoter binding and expression in iPS and partially reprogrammed cells. We find that target genes of the four factors strongly overlap in iPS and embryonic stem (ES) cells. In partially reprogrammed cells, many genes co-occupied by c-Myc and any of the other three factors already show an ES-like binding and expression pattern. In contrast, genes that are specifically co-bound by Oct4, Sox2 and Klf4 in ES cells and encode pluripotency regulators severely lack binding and transcriptional activation. Among the four factors, c-Myc promotes the most ES cell-like transcription pattern when expressed individually in fibroblasts. These data uncover temporal and separable contributions of the four factors during the reprogramming process and indicate that ectopic c-Myc predominantly acts before pluripotency regulators are activated.
doi:10.1016/j.cell.2009.01.001
PMCID: PMC3273494
PMID: 19167336