Fibroproliferative diseases of organs are poorly understood and generally lack effective anti-fibrotic treatments. Our goal was to identify the key regulatory factors in pathologic fibrosis that are common across organ-based fibrotic diseases. We analyzed 9 microarray datasets publicly available in GEO from lung, heart, liver and kidney fibrotic disease tissue (489 microarrays total, disease and control). We identified a set of 90 genes differentially expressed in at least five microarray datasets. We used IPA and DAVID analysis to identify gene networks and their molecular functions. A mutual information based network activity analysis showed that a connective tissue disorders network was the most active for all types of fibrosis included in this analysis. Conclusion: Our analysis indicates that, despite different disease manifestations, organ fibroses share a specific set of genes, suggesting the potential for a common origin.
Fibrotic diseases are responsible for 45% of deaths1 in the developed world, hence interventions targeting excessive fibrosis are a major therapeutic goal. The concept of 'core' and 'regulatory' pathways involved in multi-organ fibrosis, specific to human fibrotic disease rather than mouse models, is still an enigma, as recently reviewed in Nature Medicine2. Fibrosis is a degenerative process that can lead to end-stage disease and loss of function in the lung3,4, heart5, liver6 or kidney7 (among the top 15 causes of death in the U.S.)8. Organ fibrosis is marked by fibroblast activation and abundant extracellular matrix (ECM) deposition, suggesting abnormal wound healing. Fibroblasts are the main contributors to ECM remodeling and excessive collagen deposition in fibrosis. Fibroblast transition to myofibroblast is a common denominator for pathologic fibrosis1 and tumor stroma9–11. In addition, pathological organ fibroses share several common pathways and biological processes, such as TGF-β, MAPK, and PDGF activation; epithelial-to-mesenchymal transition (EMT); metalloproteinase (MMP) activation; mechanical tension; oxidative stress; and inflammation5,6,12–18.
Despite extensive efforts, there are still large gaps in understanding the fundamental molecular pathways in fibrotic diseases across multiple organs. To date, predictive gene signatures have been interrogated for liver or kidney fibrotic disease, but this approach treats fibrosis as separate organ-based diseases rather than as a connected pathological syndrome. For example, a seven-gene predictive signature19 has been outlined for hepatitis C viral infection-induced liver fibrosis. For kidney fibrosis, comparative analysis of four independent microarray datasets generated the “molecular Banff” signature, a 70-gene signature for the acute rejection transcript set and a molecular diagnostic approach for early rejection in renal transplant20. Recently, one study using a cross-organ classifier of fibrotic conditions identified a set of markers of human solid organ fibrosis, applicable to renal post-transplant outcome21. Since fibrotic disorders share similar pathological characteristics, including increased collagen deposition and myofibroblast activation, we hypothesized that common genes and pathways regulating these processes are perturbed in fibrosis, and we selected gene expression data from fibrosis in multiple organs.
Identifying common differentially expressed genes in multiple datasets is an effective way to increase the power of discovering genes related to key disease phenotypes regardless of different etiology and tissue specificity. Similar methods have been applied to other diseases such as diabetes22, multiple types of cancer23 and neural diseases24. In this study, we extend this method to fibrotic diseases. However, it has become increasingly accepted that for complicated diseases, genes and their protein products act in networks with orchestrated activities. It is therefore critical to identify the protein-protein interaction (PPI) networks involved in the disease development processes. In this study, we use the common differentially expressed genes as “baits” to fish out relevant PPI networks. Starting with these genes, we adopt a commonly used network analysis software, Ingenuity Pathway Analysis (IPA, http://www.ingenuity.com), to infer potentially relevant PPI networks. The IPA KnowledgeBase has a large collection of known PPI relationships and thus can serve as an adequate resource for our study. The relevance of the networks to individual fibrotic conditions is determined using a mutual information based network activity score25, which has been applied in many disease studies such as those of colon cancer26,27.
To summarize, we developed a translational bioinformatics pipeline for identifying key genes and networks related to fibrosis. We use techniques that have been validated in various diseases such as diabetes22, cancer23, and neural diseases24 in a novel application to identify common pathways underlying fibrosis of different organs. Our analysis provides a novel view for better understanding the common fundamental pathways in organ fibrosis, by using a bioinformatics approach to screen genes and networks across multiple organ fibroses.
The GEO database was searched for the following fibrotic disorders: idiopathic pulmonary fibrosis (IPF), liver cirrhosis, kidney fibrosis, and heart failure (Figure 1). Our search criteria were: 1) both control and disease groups are present in the dataset, 2) the tissue samples are collected from the diseased organ (e.g., immortal cell lines and blood are excluded), and 3) each group contains at least 5 samples, as required for robust statistical analysis. Three lung, two liver, two kidney, and two heart datasets were downloaded and used for analysis (Table 1).
Figure 1 gives an overview of our workflow for analyzing genes and PPI networks involved in multiple organ fibrosis. The details of the algorithms are given in the following sections.
In general, we first identify genes that are consistently differentially regulated in more than half (5 out of 9) of the datasets. PPI networks associated with these genes are then established using the network building function of the Ingenuity Pathway Analysis (IPA) software. These networks are further screened for their association with specific fibrotic disease status using a mutual information based network activity score developed by Chuang et al.25. To determine the threshold for selecting gene networks with high mutual information (and hence high relevance to fibrosis), 1,000 random sets of genes are selected and the mutual information score is calculated for each. The threshold is set at the top five percentile of the random simulations. Finally, the selected gene networks with high mutual information are subjected to gene set enrichment analysis using DAVID (http://david.abcc.ncifcrf.gov/).
All expression values were quantile normalized within each dataset using the MATLAB Bioinformatics Toolbox (R2010a). An unpaired two-sample Student's t-test was performed on each gene between fibrosis and control expression values for each dataset. Genes with p < 0.05 and at least 1.5 mean fold change that were consistently up- or downregulated were identified as significantly differentially expressed. A list of genes that satisfied these criteria was compiled, and the number of datasets in which each gene was significant was counted. Expression values were averaged if multiple probes mapped to the same gene. Probes that did not map to a known gene were eliminated from further analysis. Mean fold change between fibrosis and control was calculated for each gene and averaged across all datasets in which the gene was differentially expressed.
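The normalization and filtering steps above can be illustrated with a minimal Python sketch. This is not the MATLAB code used in the study: the function names are hypothetical, ties are broken by position (a simplification), and the fold change assumes positive, linear-scale expression values. The t-test itself is omitted here, since its p-value would normally come from a statistics library.

```python
from statistics import mean

def quantile_normalize(columns):
    """Quantile-normalize equal-length sample columns (one value per gene).

    Assumption: ties are broken by position, which may differ from the
    behaviour of the toolbox routine used in the study.
    """
    n_genes = len(columns[0])
    # Sort each sample, then average across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[r] for col in sorted_cols) for r in range(n_genes)]
    normalized = []
    for col in columns:
        # order[r] is the index of the gene holding rank r in this sample.
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized

def signed_fold_change(disease, control):
    """Mean fold change, reported as -1/ratio when the ratio is below 1.

    Assumption: values are positive and on a linear (not log) scale.
    """
    ratio = mean(disease) / mean(control)
    return ratio if ratio >= 1 else -1 / ratio

def is_differential(p_value, fold, alpha=0.05, min_fold=1.5):
    """Apply the p < 0.05 and at least 1.5 fold change criteria."""
    return p_value < alpha and abs(fold) >= min_fold
```

For example, `is_differential(0.01, signed_fold_change([3, 3], [1, 1]))` evaluates to `True`, because the gene is three-fold higher in disease samples.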
All genes significant in at least 5 datasets were input into Ingenuity Pathway Analysis (IPA) (http://www.ingenuity.com) which has a well curated protein-protein interaction knowledgebase. For this analysis we chose genes that met the criteria for differential expression (>1.5 fold change, p < 0.05) and were found in more than half of the total number of datasets.
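The cross-dataset selection can be sketched as a simple counting step. The sketch below is illustrative only; it assumes that "consistently" means a gene changes in the same direction in every dataset where it is significant, and the function and gene names are hypothetical.

```python
from collections import Counter

def consistent_genes(calls_per_dataset, min_datasets=5):
    """Select genes significant in at least `min_datasets` datasets.

    calls_per_dataset: one dict per dataset, mapping each gene that passed
    the p-value and fold-change filter to "up" or "down".
    Returns {gene: (direction, number_of_datasets)}.
    """
    up, down = Counter(), Counter()
    for calls in calls_per_dataset:
        for gene, direction in calls.items():
            (up if direction == "up" else down)[gene] += 1
    selected = {}
    for gene in set(up) | set(down):
        # Assumption: drop genes that change direction between datasets.
        if up[gene] and down[gene]:
            continue
        n = up[gene] + down[gene]
        if n >= min_datasets:
            selected[gene] = ("up" if up[gene] else "down", n)
    return selected

# Toy example with 9 datasets and three hypothetical genes.
toy = ([{"GENE_A": "up", "GENE_B": "up"}] * 5
       + [{"GENE_B": "down"}]
       + [{"GENE_C": "down"}] * 3)
print(consistent_genes(toy))
```

In this toy input only GENE_A is retained: GENE_B is significant in six datasets but with mixed direction, and GENE_C is significant in only three.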
Given a network with a set of genes, a network activity score over different groups of samples was developed previously in 25 based on mutual information (MI). While the details of the method are given in 25, here we outline the steps:
Compared with other methods such as the t-score or Wilcoxon score, MI does not require assumptions about the data distribution. In addition, this metric can accommodate cases where there are subgroups within each group, which is possible among patients with complicated diseases such as fibrosis.
In this study, significant networks identified using IPA were tested using a MATLAB implementation of the mutual information approach applied to all 9 datasets 25.
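A mutual information based activity score of this kind can be sketched in Python as follows. This is our reading of the general approach in ref. 25, not the MATLAB implementation used here: the per-gene z-scoring, the division by the square root of the network size, and the number of bins (floor(log2(n) + 1) for n samples) are assumptions that should be checked against the original method, and all function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def zscore(values):
    """Standardize one gene's expression across samples."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

def network_activity(gene_rows):
    """Combine z-scored genes into one activity value per sample."""
    z_rows = [zscore(row) for row in gene_rows]
    k = len(z_rows)
    n_samples = len(z_rows[0])
    return [sum(z[j] for z in z_rows) / math.sqrt(k) for j in range(n_samples)]

def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def activity_score(gene_rows, labels):
    """Score S: MI between discretized network activity and class labels."""
    activity = network_activity(gene_rows)
    n_bins = int(math.floor(math.log2(len(labels)) + 1))
    return mutual_information(discretize(activity, n_bins), labels)
```

Here `gene_rows` holds one list of expression values per network gene (one value per sample) and `labels` marks each sample as fibrosis or control. A network whose activity separates the two groups perfectly in a balanced dataset scores 1 bit, while an uninformative network scores close to 0.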
Given the network activity scores for all the networks, we need to select the ones with high scores. To find the threshold, we carry out random simulations by randomly selecting a set of N0 genes, where N0 is between 5 and 17, and computing its network activity score. This range was chosen based on the number of genes from our list that IPA used to generate the original networks, called “focus molecules”. This is repeated 1,000 times, and the top five-percentile level of the random scores, Srand, is used as the 5% FDR threshold for selecting networks. Networks with S > Srand were considered significantly active in that dataset. Each network was assigned a count corresponding to the number of datasets in which its S > Srand. Gene ontology information for the genes of the selected networks was accessed using the NIH DAVID webtool.
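The random simulation can be sketched as follows. This is a minimal illustration under stated assumptions: `score_fn` is a hypothetical callable that returns the activity score of a gene set in one dataset, N0 is drawn uniformly from 5 to 17, and the top five-percentile level is taken as the smallest of the top 5% of random scores.

```python
import random

def random_threshold(score_fn, all_genes, n_iter=1000,
                     size_range=(5, 17), top_fraction=0.05, seed=0):
    """Estimate Srand from the scores of randomly drawn gene sets.

    score_fn: hypothetical callable mapping a list of genes to a score.
    all_genes: list of candidate genes to sample from.
    """
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_iter):
        n0 = rng.randint(*size_range)  # both bounds are inclusive
        null_scores.append(score_fn(rng.sample(all_genes, n0)))
    null_scores.sort()
    # Index of the smallest score within the top `top_fraction` of the null.
    idx = min(int(len(null_scores) * (1 - top_fraction)), len(null_scores) - 1)
    return null_scores[idx]

def count_active(scores, thresholds):
    """Number of datasets in which a network's S exceeds that dataset's Srand."""
    return sum(s > t for s, t in zip(scores, thresholds))
```

A network would then be called active in a dataset when its score exceeds the value returned by `random_threshold` for that dataset, and `count_active` tallies this over the datasets.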
After applying quantile normalization and a t-test for each dataset as described in Methods, the combined list contained 17,335 genes that were expressed in at least one dataset with p < 0.05, regardless of mean fold change (Figure 2). There were no genes expressed in all 9 datasets with p < 0.05. COL1A1, ITSN1, RUNX3, SMAD2, and WIPF1 were the only genes expressed in 8 out of 9 datasets with p < 0.05, without considering the fold difference vs. control. These genes have important implications for fibrosis, particularly COL1A1, which encodes the pro-alpha1 chain of type I collagen39. SMAD2 is known to be activated by TGF-β and is responsible for downstream effects of TGF-β such as fibroblast activation, myofibroblast production, cell apoptosis and proliferation40,41. ITSN1 (Intersectin 1)42 and WIPF1 (WAS/WASL interacting protein family, member 1)43, which are involved in regulating the actin cytoskeleton, are novel for fibrosis. RUNX3 (Runt-related transcription factor 3) is a transcription factor involved in tumor suppression44. WIPF1, an important protein in Wiskott-Aldrich syndrome, was the most frequently differentially expressed gene and met the criterion of 1.5 fold change in 8 datasets. TGFBI (Transforming growth factor, beta-induced) and RNASET2 (Ribonuclease T2) were the only genes differentially expressed in 7 datasets.
Of the 17,335 genes expressed in at least one dataset (p < 0.05), only 839 genes were present in 5 or more datasets, that is, in over half of the datasets. Of these 839 genes, 90 were significantly differentially expressed (p < 0.05, |MFC| > 1.5) in at least 5 different datasets (Table 2). Of these, 83 genes were consistently upregulated and 7 genes were consistently downregulated. These 90 genes were input to IPA to uncover the biological functions and pathways involved.
IPA generated 9 regulatory networks from this list of 90 genes and ignored the 2 genes that were unmapped. Networks 1–7 each contained 35 genes, proteins, other molecules and the regulatory relationships between them, while networks 8 and 9 each contained 3 genes. Networks 8 and 9 were excluded from further analysis due to the small size of the network.
The seven resulting networks shared several genes. Network 5 shared CTSK and the molecule P38 MAPK with Network 1, CCL23 and IL32 with Network 3, and SERPINB3 with Network 6. Network 6 also shared LUM with Network 3 and PLC with Network 7. Networks 4 and 7 had CXCL12 in common.
To test the association of these networks' activity with the fibrosis datasets, we sought the networks that were most relevant to multi-organ fibrosis25. Applying the mutual information approach to the IPA networks returned the activity score (S) for each network in every dataset (Table 3). Network 2 had the highest average score, S = 0.399. A random permutation test of 1,000 iterations was performed for each of the 9 datasets, as described in Methods. Networks were deemed significantly active for a dataset if S > Srand. Network 1 was significantly active in the greatest number of datasets (4), with a mean S = 0.3740. Our analysis focused on these two networks because they were the most active based on the mutual information scores (Figure 3).
Using DAVID45,46 we identified the molecular functions of the genes belonging to Networks 1 and 2 (Table 4). Proteins and other molecules that were not explicitly genes were ignored by DAVID. DAVID calculated the significance of the molecular functions for the gene list. The molecular functions identified for these genes relate to the top overall functions of their network. The molecular functions of integrin binding and platelet-derived growth factor (PDGF) binding relate the genes of Network 1 to their overall role in connective tissue disorders and tissue development and function. Likewise for Network 2, the broad function of genetic disorders could include dysregulation of MHC I binding and MHC II receptor activity.
Since fibrotic disorders are often associated with common syndromes such as chronic inflammation, the goal of our research was to observe foundational gene and network perturbations in multiple types of organ-based fibrosis. The bioinformatic analysis of microarray datasets publicly available for lung, heart, liver and kidney fibrosis has provided an opportunity to investigate the hypothesis that there are foundational molecular mechanisms for all types of organ-based fibrosis. This approach is vital to understanding the core genes involved in fibrosis that mark potential targets for therapy.
The results paint a picture of the development of fibrosis characterized by a core set of genes and molecular pathways. Approximately 17,000 genes were identified as being expressed in at least one dataset with p < 0.05, but only 0.52% of these genes (83 genes up-regulated and 7 genes down-regulated) were significantly differentially expressed across these datasets and were used for further analysis.
Numerous observations indicate that fibrosis, aging and abnormal wound healing may be linked, as all may represent loss of cellular reserve pathways47,48. This would explain the abnormal expression of genes implicated in stem cell biology, wound repair and epithelial damage besides the known “myofibroblast” genes.
To understand the function of these core genes, we used two different programs to analyze the genes and identify their biological and molecular functions. First, IPA software provided the top biological functions and revealed dermatological diseases as the disorder category with the most significant fit for the differentially expressed genes (p < 10^-16). Other relevant functions identified by this approach included inflammatory disease, connective tissue disorders, genetic disorder, and respiratory and cardiac disease, all with p-values < 10^-7. IPA then generated networks using the list of genes as “seeds”, with the activity of these networks scored by the mutual information algorithm and confirmed by testing each network for significance against a random simulation of mutual information. This aided in identifying which processes were most perturbed by fibrosis. The network deemed active in the greatest number of datasets (Network 1) and the network most active overall, with the highest average mutual information score (Network 2), are involved in critical cellular, immune and matrix functions. The genes and molecules of Network 1 play a key role in connective tissue disorders. Network 2 is important in genetic, skeletal and muscular disorders through several HLA and MHC genes involved in antigen presentation and recognition in the immune system response. Second, DAVID analysis of the selected genes identified molecular functions that are in line with the biological functions of their networks and the pathogenesis of fibrosis. Network 1 contained genes important in cell-matrix interactions, such as integrins, which are significantly dysregulated per our analysis of the microarray datasets and potentially play important roles in fibrosis.
In addition, there were many genes with p < 0.05 that were expressed in 5 or more datasets but had mean fold changes of smaller magnitude than 1.5 (data not shown). These genes are members of pathways not directly involved in fibrosis, and represent other potential targets. The top biological functions of these networks were involved in hematopoiesis, tissue morphology, cell cycle and death, and may indicate the important influence of stem cells, developmental pathways and repair on organ fibrosis. These functions had p-values ranging from 10^-5 to 10^-2, less significant than the functions of the differentially expressed genes.
Our analysis revealed several genes previously unknown for fibrosis and widely up-regulated in multi-organ fibrosis: WIPF1 (the most frequently differentially expressed, in 8 microarray datasets); ITGBL1 (Integrin, beta-like 1 (with EGF-like repeat domains)), involved in inflammation49; EPHA3 (EPH receptor A3), which mediates developmental processes and homing of hematopoietic cells50,51; and GRN (Granulin), involved in epithelial development, tissue remodeling and wound healing52,53. Most of the common down-regulated genes present in our analysis are novel for organ fibrosis: MT1M (Metallothionein 1M), linked to progressive degeneration of motoneurons in sporadic amyotrophic lateral sclerosis54; FZD5 (Frizzled family receptor 5), which is part of WNT signaling, and ST3GAL6 (ST3 beta-galactoside alpha-2,3-sialyltransferase 6), important for stem cell development and regeneration55,56; and SLC1A1 (Solute carrier family 1 (neuronal/epithelial high affinity glutamate transporter, system Xag), member 1), a major epithelial transporter of glutamate and aspartate57. Chronic inflammation, extracellular matrix remodeling and epithelial development, as parts of the abnormal wound healing involved in fibrosis, are well represented in our analysis by up-regulated genes such as LUM (Lumican)58 and GPNMB (Glycoprotein (transmembrane) nmb)59 and by down-regulated genes such as AQP3 (Aquaporin 3 (Gill blood group))60,61 and IL6R (Interleukin 6 receptor)62,63. Notably, our analysis also identified, at the level of multi-organ fibrosis, a set of well-known fibrosis genes such as the chemokine receptor CXCR4 (Chemokine (C-X-C motif) receptor 4), SDF1/CXCL12 (Chemokine (C-X-C motif) ligand 12), and the metalloproteinases MMP7 and MMP2. In addition, several genes from our analysis have been previously described in gene signatures for organ fibrosis. Fourteen genes from our list are present in the molecular Banff signature for acute kidney transplant rejection20.
In another fibrosis study, two genes from our list, MMP2 and COL3A1, were found in fibrotic heart, kidney, lung and pancreas tissues, while ADAM28 was found only in lung tissues21.
Our analysis indicates that besides the usual fibroblast-myofibroblast genes, there is a common set of abnormally expressed genes involved in epithelial development, stem cell regeneration and inflammation. We go beyond the approach of single organ fibrosis to address core genes and molecules involved in fibrosis that are present in multiple organs. While these genes may not be the original disease-causing genes whose genetic variations lead to the onset of or predisposition to these diseases, they reflect a set of potential commonly changed phenotypes at the gene transcription level, which can be experimentally tested as well as targeted in therapy. Nevertheless, our results are only the beginning: the genes identified open the way to experimental research to confirm their role in multi-organ fibrosis and to identify therapeutic targets to slow or even reverse fibrotic activity.