Related Articles

1.  IPAD: the Integrated Pathway Analysis Database for Systematic Enrichment Analysis 
BMC Bioinformatics  2012;13(Suppl 15):S7.
Background
Next-Generation Sequencing (NGS) technologies and Genome-Wide Association Studies (GWAS) generate millions of reads and hundreds of datasets, and there is an urgent need for a better way to accurately interpret and distill such large amounts of data. Extensive pathway and network analysis allows for the discovery of highly significant pathways from a set of disease vs. healthy samples in NGS and GWAS. Knowledge of activation of these processes will lead to elucidation of the complex biological pathways affected by drug treatment, to patient stratification studies of new and existing drug treatments, and to understanding the underlying anti-cancer drug effects. There are approximately 141 biological human pathway resources as of Jan 2012 according to the Pathguide database. However, most currently available resources do not contain disease, drug or organ specificity information such as disease-pathway, drug-pathway, and organ-pathway associations. Systematically integrating pathway, disease, drug and organ specificity together becomes increasingly crucial for understanding the interrelationships between signaling, metabolic and regulatory pathways, drug action, disease susceptibility, and organ specificity from high-throughput omics data (genomics, transcriptomics, proteomics and metabolomics).
Results
We designed the Integrated Pathway Analysis Database for Systematic Enrichment Analysis (IPAD, http://bioinfo.hsc.unt.edu/ipad), defining inter-association between pathway, disease, drug and organ specificity, based on six criteria: 1) comprehensive pathway coverage; 2) gene/protein to pathway/disease/drug/organ association; 3) inter-association between pathway, disease, drug, and organ; 4) multiple and quantitative measurement of enrichment and inter-association; 5) assessment of enrichment and inter-association analysis with the context of the existing biological knowledge and a "gold standard" constructed from reputable and reliable sources; and 6) cross-linking of multiple available data sources.
IPAD is a comprehensive database covering about 22,498 genes, 25,469 proteins, 1956 pathways, 6704 diseases, 5615 drugs, and 52 organs integrated from databases including BioCarta, KEGG, NCI-Nature curated, Reactome, CTD, PharmGKB, DrugBank, and HOMER. The database has a web-based user interface that allows users to perform enrichment analysis from genes/proteins/molecules and inter-association analysis from a pathway, disease, drug, and organ.
Moreover, the quality of the database was validated with the context of the existing biological knowledge and a "gold standard" constructed from reputable and reliable sources. Two case studies were also presented to demonstrate: 1) self-validation of enrichment analysis and inter-association analysis on brain-specific markers, and 2) identification of previously undiscovered components by the enrichment analysis from a prostate cancer study.
Conclusions
IPAD is a new resource for analyzing, identifying, and validating pathway, disease, drug, organ specificity and their inter-associations. The statistical method we developed for enrichment and similarity measurement and the two criteria we described for setting the threshold parameters can be extended to other enrichment applications. Enriched pathways, diseases, drugs, organs and their inter-associations can be searched, displayed, and downloaded from our online user interface. The current IPAD database can help users address a wide range of biological pathway related, disease susceptibility related, drug target related and organ specificity related questions in human disease studies.
doi:10.1186/1471-2105-13-S15-S7
PMCID: PMC3439721  PMID: 23046449
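The IPAD abstract above does not state which enrichment statistic it uses. As a rough illustration of what gene-to-pathway enrichment analysis typically computes, the sketch below uses a one-sided hypergeometric over-representation test, which is a common choice and an assumption here, not IPAD's documented method. The function name and the example counts are hypothetical; only the universe size of 22,498 genes comes from the abstract.

```python
from math import comb


def enrichment_p_value(hits: int, query_size: int, set_size: int, universe: int) -> float:
    """One-sided hypergeometric tail P(X >= hits).

    hits       -- query genes that belong to the pathway
    query_size -- number of genes in the user's query list
    set_size   -- number of genes annotated to the pathway
    universe   -- number of genes in the background
    """
    upper = min(query_size, set_size)
    total = comb(universe, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, query_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical example: 8 of 100 query genes fall in a 150-gene pathway,
# against a background of 22,498 genes (the gene count quoted for IPAD).
p = enrichment_p_value(hits=8, query_size=100, set_size=150, universe=22498)
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for large backgrounds; in practice a multiple-testing correction across all tested pathways would follow.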
2.  Survival-Related Profile, Pathways, and Transcription Factors in Ovarian Cancer 
PLoS Medicine  2009;6(2):e1000024.
Background
Ovarian cancer has a poor prognosis due to advanced stage at presentation and either intrinsic or acquired resistance to classic cytotoxic drugs such as platinum and taxoids. Recent large clinical trials with different combinations and sequences of classic cytotoxic drugs indicate that further significant improvement in prognosis by this type of drug is not to be expected. Currently, a large number of drugs targeting dysregulated molecular pathways in cancer cells have been developed and are being introduced in the clinic. A major challenge is to identify those patients who will benefit from drugs targeting these specific dysregulated pathways. The aims of our study were (1) to develop a gene expression profile associated with overall survival in advanced stage serous ovarian cancer, (2) to assess the association of pathways and transcription factors with overall survival, and (3) to validate our identified profile and pathways/transcription factors in an independent set of ovarian cancers.
Methods and Findings
According to a randomized design, profiling of 157 advanced stage serous ovarian cancers was performed in duplicate using ∼35,000 70-mer oligonucleotide microarrays. A continuous predictor of overall survival was built taking into account well-known issues in microarray analysis, such as multiple testing and overfitting. A functional class scoring analysis was utilized to assess pathways/transcription factors for their association with overall survival. The prognostic value of genes that constitute our overall survival profile was validated on a fully independent, publicly available dataset of 118 well-defined primary serous ovarian cancers. Furthermore, functional class scoring analysis was also performed on this independent dataset to assess the similarities with results from our own dataset. An 86-gene overall survival profile discriminated between patients with unfavorable and favorable prognosis (median survival, 19 versus 41 mo, respectively; permutation p-value of log-rank statistic = 0.015) and maintained its independent prognostic value in multivariate analysis. Genes that composed the overall survival profile were also able to discriminate between the two risk groups in the independent dataset. In our dataset 17/167 pathways and 13/111 transcription factors were associated with overall survival, of which 16 and 12, respectively, were confirmed in the independent dataset.
Conclusions
Our study provides new clues to genes, pathways, and transcription factors that contribute to the clinical outcome of serous ovarian cancer and might be exploited in designing new treatment strategies.
Ate van der Zee and colleagues analyze the gene expression profiles of ovarian cancer samples from 157 patients, and identify an 86-gene expression profile that seems to predict overall survival.
Editors' Summary
Background.
Ovarian cancer kills more than 100,000 women every year and is one of the most frequent causes of cancer death in women in Western countries. Most ovarian cancers develop when an epithelial cell in one of the ovaries (two small organs in the pelvis that produce eggs) acquires genetic changes that allow it to grow uncontrollably and to spread around the body (metastasize). In its early stages, ovarian cancer is confined to the ovaries and can often be treated successfully by surgery alone. Unfortunately, early ovarian cancer rarely has symptoms so a third of women with ovarian cancer have advanced disease when they first visit their doctor with symptoms that include vague abdominal pains and mild digestive disturbances. That is, cancer cells have spread into their abdominal cavity and metastasized to other parts of the body (so-called stage III and IV disease). The outlook for women diagnosed with stage III and IV disease, which are treated with a combination of surgery and chemotherapy, is very poor. Only 30% of women with stage III, and 5% with stage IV, are still alive five years after their cancer is diagnosed.
Why Was This Study Done?
If the cellular pathways that determine the biological behavior of ovarian cancer could be identified, it might be possible to develop more effective treatments for women with stage III and IV disease. One way to identify these pathways is to use gene expression profiling (a technique that catalogs all the genes expressed by a cell) to compare gene expression patterns in the ovarian cancers of women who survive for different lengths of time. Genes with different expression levels in tumors with different outcomes could be targets for new treatments. For example, it might be worth developing inhibitors of proteins whose expression is greatest in tumors with short survival times. In this study, the researchers develop an expression profile that is associated with overall survival in advanced-stage serous ovarian cancer (more than half of ovarian cancers originate in serous cells, epithelial cells that secrete a watery fluid). The researchers also assess the association of various cellular pathways and transcription factors (proteins that control the expression of other proteins) with survival in this type of ovarian carcinoma.
What Did the Researchers Do and Find?
The researchers analyzed the gene expression profiles of tumor samples taken from 157 patients with advanced stage serous ovarian cancer and used the “supervised principal components” method to build a predictor of overall survival from these profiles and patient survival times. This 86-gene predictor discriminated between patients with favorable and unfavorable outcomes (average survival times of 41 and 19 months, respectively). It also discriminated between groups of patients with these two outcomes in an independent dataset collected from 118 additional serous ovarian cancers. Next, the researchers used “functional class scoring” analysis to assess the association between pathway and transcription factor expression in the tumor samples and overall survival. Seventeen of 167 KEGG pathways (“wiring” diagrams of molecular interactions, reactions and relations involved in cellular processes and human diseases listed in the Kyoto Encyclopedia of Genes and Genomes) were associated with survival, 16 of which were confirmed in the independent dataset. Finally, 13 of 111 analyzed transcription factors were associated with overall survival in the tumor samples, 12 of which were confirmed in the independent dataset.
What Do These Findings Mean?
These findings identify an 86-gene overall survival gene expression profile that seems to predict overall survival for women with advanced serous ovarian cancer. However, before this profile can be used clinically, further validation of the profile and more robust methods for determining gene expression profiles are needed. Importantly, these findings also provide new clues about the genes, pathways and transcription factors that contribute to the clinical outcome of serous ovarian cancer, clues that can now be exploited in the search for new treatment strategies. Finally, these findings suggest that it might eventually be possible to tailor therapies to the needs of individual patients by analyzing which pathways are activated in their tumors and thus improve survival times for women with advanced ovarian cancer.
Additional Information.
Please access these Web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1000024.
This study is further discussed in a PLoS Medicine Perspective by Simon Gayther and Kate Lawrenson
See also a related PLoS Medicine Research Article by Huntsman and colleagues
The US National Cancer Institute provides a brief description of what cancer is and how it develops, and information on all aspects of ovarian cancer for patients and professionals (in English and Spanish)
The UK charity Cancerbackup provides general information about cancer, and more specific information about ovarian cancer
MedlinePlus also provides links to other information about ovarian cancer (in English and Spanish)
The KEGG Pathway database provides pathway maps of known molecular networks involved in a wide range of cellular processes
doi:10.1371/journal.pmed.1000024
PMCID: PMC2634794  PMID: 19192944
3.  Seeking unique and common biological themes in multiple gene lists or datasets: pathway pattern extraction pipeline for pathway-level comparative analysis 
BMC Bioinformatics  2009;10:200.
Background
One of the challenges in the analysis of microarray data is to integrate and compare the selected (e.g., differential) gene lists from multiple experiments for common or unique underlying biological themes. A common way to approach this problem is to extract common genes from these gene lists and then subject these genes to enrichment analysis to reveal the underlying biology. However, the capacity of this approach is largely restricted by the limited number of common genes shared by datasets from multiple experiments, which could be caused by the complexity of the biological system itself.
Results
We now introduce a new Pathway Pattern Extraction Pipeline (PPEP), which extends the existing WPS application by providing a new pathway-level comparative analysis scheme. To facilitate comparing and correlating results from different studies and sources, PPEP contains new interfaces that allow evaluation of the pathway-level enrichment patterns across multiple gene lists. As an exploratory tool, this analysis pipeline may help reveal the underlying biological themes at both the pathway and gene levels. The analysis scheme provided by PPEP begins with multiple gene lists, which may be derived from different studies in terms of the biological contexts, applied technologies, or methodologies. These lists are then subjected to pathway-level comparative analysis for extraction of pathway-level patterns. This analysis pipeline helps to explore the commonality or uniqueness of these lists at the level of pathways or biological processes from different but relevant biological systems using a combination of statistical enrichment measurements, pathway-level pattern extraction, and graphical display of the relationships of genes and their associated pathways as Gene-Term Association Networks (GTANs) within the WPS platform. As a proof of concept, we have used the new method to analyze many datasets from our collaborators as well as some public microarray datasets.
Conclusion
This tool provides a new pathway-level analysis scheme for integrative and comparative analysis of data derived from different but relevant systems. The tool is freely available as a Pathway Pattern Extraction Pipeline implemented in our existing software package WPS, which can be obtained at
doi:10.1186/1471-2105-10-200
PMCID: PMC2709625  PMID: 19563622
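The pathway-level comparison described for PPEP above can be pictured as grouping pathways by which gene lists they are enriched in. The sketch below is a minimal illustration of that idea only, assuming the per-list enrichment step has already produced a set of significant pathways for each list. It is not the WPS/PPEP implementation, and the function name and example data are hypothetical.

```python
from collections import defaultdict


def pathway_patterns(enriched: dict[str, set[str]]) -> dict[tuple[bool, ...], list[str]]:
    """Group pathways by their presence/absence pattern across gene lists.

    enriched maps a gene-list name to the set of pathways found enriched in it.
    The pattern tuple follows the alphabetical order of the gene-list names.
    """
    list_names = sorted(enriched)
    all_pathways = sorted(set().union(*enriched.values()))
    groups: dict[tuple[bool, ...], list[str]] = defaultdict(list)
    for pathway in all_pathways:
        pattern = tuple(pathway in enriched[name] for name in list_names)
        groups[pattern].append(pathway)
    return dict(groups)


# Hypothetical input: pathways already called significant in two experiments.
example = {
    "experiment_A": {"Apoptosis", "Cell cycle"},
    "experiment_B": {"Cell cycle", "MAPK signaling"},
}
patterns = pathway_patterns(example)
common = patterns.get((True, True), [])    # enriched in both lists
only_a = patterns.get((True, False), [])   # unique to experiment_A
```

Comparing at the pathway level in this way can reveal a shared theme even when the underlying gene lists have few genes in common, which is the limitation of gene-overlap approaches that the abstract points out.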
4.  COFECO: composite function annotation enriched by protein complex data 
Nucleic Acids Research  2009;37(Web Server issue):W350-W355.
COFECO is a web-based tool for a composite annotation of protein complexes, KEGG pathways and Gene Ontology (GO) terms within a class of genes and their orthologs under study. Widely used functional enrichment tools using GO and KEGG pathways create large lists of annotations that make it difficult to derive consolidated information and often include over-generalized terms. The interrelationship of annotation terms can be more clearly delineated by integrating the information of physically interacting proteins with biological pathways and GO terms. COFECO has the following advanced characteristics: (i) The composite annotation sets of correlated functions and cellular processes for a given gene set can be identified in a more comprehensive and specified way by the employment of protein complex data together with GO and KEGG pathways as annotation resources. (ii) Orthology based integrative annotations among different species complement the defective annotations in an individual genome and provide the information of evolutionarily conserved correlations. (iii) A term filtering feature enables users to collect the specified annotations enriched with selected function terms. (iv) A cross-comparison of annotation results between two different datasets is possible. In addition, COFECO provides a web-based GO hierarchical viewer and KEGG pathway viewer where the enrichment results can be summarized and further explored. COFECO is freely accessible at http://piech.kaist.ac.kr/cofeco.
doi:10.1093/nar/gkp331
PMCID: PMC2703949  PMID: 19429688
5.  Bi-directional gene set enrichment and canonical correlation analysis identify key diet-sensitive pathways and biomarkers of metabolic syndrome 
BMC Bioinformatics  2010;11:499.
Background
Currently, a number of bioinformatics methods are available to generate appropriate lists of genes from a microarray experiment. While these lists represent an accurate primary analysis of the data, fewer options exist to contextualise those lists. The development and validation of such methods is crucial to the wider application of microarray technology in the clinical setting. Two key challenges in clinical bioinformatics involve appropriate statistical modelling of dynamic transcriptomic changes, and extraction of clinically relevant meaning from very large datasets.
Results
Here, we apply an approach to gene set enrichment analysis that allows for detection of bi-directional enrichment within a gene set. Furthermore, we apply canonical correlation analysis and Fisher's exact test, using plasma marker data with known clinical relevance to aid identification of the most important gene and pathway changes in our transcriptomic dataset. After a 28-day dietary intervention with high-CLA beef, a range of plasma markers indicated a marked improvement in the metabolic health of genetically obese mice. Tissue transcriptomic profiles indicated that the effects were most dramatic in liver (1270 genes significantly changed; p < 0.05), followed by muscle (601 genes) and adipose (16 genes). Results from modified GSEA showed that the high-CLA beef diet affected diverse biological processes across the three tissues, and that the majority of pathway changes reached significance only with the bi-directional test. Combining the liver tissue microarray results with plasma marker data revealed 110 CLA-sensitive genes showing strong canonical correlation with one or more plasma markers of metabolic health, and 9 significantly overrepresented pathways among this set; each of these pathways was also significantly changed by the high-CLA diet. Closer inspection of two of these pathways - selenoamino acid metabolism and steroid biosynthesis - illustrated clear diet-sensitive changes in constituent genes, as well as strong correlations between gene expression and plasma markers of metabolic syndrome independent of the dietary effect.
Conclusion
Bi-directional gene set enrichment analysis more accurately reflects dynamic regulatory behaviour in biochemical pathways, and as such highlighted biologically relevant changes that were not detected using a traditional approach. In such cases where transcriptomic response to treatment is exceptionally large, canonical correlation analysis in conjunction with Fisher's exact test highlights the subset of pathways showing strongest correlation with the clinical markers of interest. In this case, we have identified selenoamino acid metabolism and steroid biosynthesis as key pathways mediating the observed relationship between metabolic health and high-CLA beef. These results indicate that this type of analysis has the potential to generate novel transcriptome-based biomarkers of disease.
doi:10.1186/1471-2105-11-499
PMCID: PMC3098081  PMID: 20929581
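The abstract above reports that most pathway changes reached significance only with a bi-directional test. The toy sketch below illustrates why that can happen: when a gene set contains both up- and down-regulated genes, a directional mean score cancels out, while a mean of absolute scores does not. This is a simplified mean-score permutation test chosen for illustration, not the modified GSEA procedure used in the paper; all names and scores are invented.

```python
import random
from statistics import fmean


def permutation_p(scores, gene_set, statistic, n_perm=2000, seed=0):
    """Permutation p-value: how often a random gene set of the same size
    scores at least as high as the observed set."""
    rng = random.Random(seed)
    genes = list(scores)
    observed = statistic([scores[g] for g in gene_set])
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(gene_set))
        if statistic([scores[g] for g in sample]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def directional_stat(values):
    # Signed scores are averaged first, so opposite changes cancel.
    return abs(fmean(values))


def bidirectional_stat(values):
    # Magnitudes are averaged, so changes in either direction count.
    return fmean([abs(v) for v in values])


# Invented toy data: 190 background genes with small scores, plus a
# 10-gene set in which half the genes go strongly up and half strongly down.
scores = {f"g{i}": ((i % 21) - 10) / 10 for i in range(190)}
gene_set = [f"up{i}" for i in range(5)] + [f"down{i}" for i in range(5)]
scores.update({f"up{i}": 3.0 for i in range(5)})
scores.update({f"down{i}": -3.0 for i in range(5)})

p_directional = permutation_p(scores, gene_set, directional_stat)
p_bidirectional = permutation_p(scores, gene_set, bidirectional_stat)
```

In this constructed case the directional statistic is exactly zero, so every permutation matches or beats it, whereas the bi-directional statistic is far above anything a random set of background genes can reach.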
6.  CMS: A Web-Based System for Visualization and Analysis of Genome-Wide Methylation Data of Human Cancers 
PLoS ONE  2013;8(4):e60980.
Background
DNA methylation of promoter CpG islands is associated with gene suppression, and its unique genome-wide profiles have been linked to tumor progression. Coupled with high-throughput sequencing technologies, it is now possible to efficiently determine genome-wide methylation profiles in cancer cells. Also, experimental and computational technologies make it possible to find the functional relationship between cancer-specific methylation patterns and their clinicopathological parameters.
Methodology/Principal Findings
Cancer methylome system (CMS) is a web-based database application designed for the visualization, comparison and statistical analysis of human cancer-specific DNA methylation. Methylation intensities were obtained from MBDCap-sequencing, pre-processed and stored in the database. 191 patient samples (169 tumor and 22 normal specimens) and 41 breast cancer cell-lines are deposited in the database, comprising about 6.6 billion uniquely mapped sequence reads. This provides comprehensive and genome-wide epigenetic portraits of human breast cancer and endometrial cancer to date. Two views are proposed for users to better understand methylation structure at the genomic level or systemic methylation alteration at the gene level. In addition, a variety of annotation tracks are provided to cover genomic information. CMS includes important analytic functions for interpretation of methylation data, such as the detection of differentially methylated regions, statistical calculation of global methylation intensities, multiple gene sets of biologically significant categories, and interactivity with UCSC via custom-track data. We also present examples of discoveries utilizing the framework.
Conclusions/Significance
CMS provides visualization and analytic functions for cancer methylome datasets. A comprehensive collection of datasets, a variety of embedded analytic functions and extensive applications with biological and translational significance make this system powerful and unique in cancer methylation research. CMS is freely accessible at: http://cbbiweb.uthscsa.edu/KMethylomes/.
doi:10.1371/journal.pone.0060980
PMCID: PMC3632540  PMID: 23630576
7.  Extending pathways based on gene lists using InterPro domain signatures 
BMC Bioinformatics  2008;9:3.
Background
High-throughput technologies like functional screens and gene expression analysis produce extended lists of candidate genes. Gene-Set Enrichment Analysis is a commonly used and well established technique to test for the statistically significant over-representation of particular pathways. A shortcoming of this method, however, is that most genes that are investigated in the experiments have very sparse functional or pathway annotation and therefore cannot be the target of such an analysis. The approach presented here aims to assign lists of genes with limited annotation to previously described functional gene collections or pathways. This works by comparing InterPro domain signatures of the candidate gene lists with domain signatures of gene sets derived from known classifications, e.g. KEGG pathways.
Results
In order to validate our approach, we designed a simulation study. Based on all pathways available in the KEGG database, we create test gene lists by randomly selecting pathway genes, removing these genes from the known pathways and adding variable amounts of noise in the form of genes not annotated to the pathway. We show that we can recover pathway memberships based on the simulated gene lists with high accuracy. We further demonstrate the applicability of our approach on a biological example.
Conclusion
Results based on simulation and data analysis show that domain based pathway enrichment analysis is a very sensitive method to test for enrichment of pathways in sparsely annotated lists of genes. An R based software package domainsignatures, to routinely perform this analysis on the results of high-throughput screening, is available via Bioconductor.
doi:10.1186/1471-2105-9-3
PMCID: PMC2245903  PMID: 18177498
8.  Meta-Analysis of Pathway Enrichment: Combining Independent and Dependent Omics Data Sets 
PLoS ONE  2014;9(2):e89297.
A major challenge in current systems biology is the combination and integrative analysis of large data sets obtained from different high-throughput omics platforms, such as mass spectrometry based Metabolomics and Proteomics or DNA microarray or RNA-seq-based Transcriptomics. Especially in the case of non-targeted Metabolomics experiments, where it is often impossible to unambiguously map ion features from mass spectrometry analysis to metabolites, the integration of more reliable omics technologies is highly desirable. A popular method for the knowledge-based interpretation of single data sets is the (Gene) Set Enrichment Analysis. In order to combine the results from different analyses, we introduce a methodical framework for the meta-analysis of p-values obtained from Pathway Enrichment Analysis (Set Enrichment Analysis based on pathways) of multiple dependent or independent data sets from different omics platforms. For dependent data sets, e.g. obtained from the same biological samples, the framework utilizes a covariance estimation procedure based on the nonsignificant pathways in single data set enrichment analysis. The framework is evaluated and applied in the joint analysis of Metabolomics mass spectrometry and Transcriptomics DNA microarray data in the context of plant wounding. In extensive studies of simulated data set dependence, the introduced correlation could be fully reconstructed by means of the covariance estimation based on pathway enrichment. By restricting the range of p-values of pathways considered in the estimation, the overestimation of correlation, which is introduced by the significant pathways, could be reduced. When applying the proposed methods to the real data sets, the meta-analysis was shown not only to be a powerful tool to investigate the correlation between different data sets and summarize the results of multiple analyses but also to distinguish experiment-specific key pathways.
doi:10.1371/journal.pone.0089297
PMCID: PMC3938450  PMID: 24586671
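The abstract above describes a meta-analysis of pathway enrichment p-values but does not name the combination rule. For the independent case only, a standard way to combine p-values is Fisher's method, sketched below as an assumed stand-in. The covariance-based treatment of dependent data sets that the paper introduces is not reproduced here. The function name is hypothetical.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    so no statistics library is needed. All p-values must be > 0.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    half_x = -sum(log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


# Hypothetical example: the same pathway tested in two independent omics
# data sets (e.g. transcriptomics and metabolomics).
combined = fisher_combined_p([0.05, 0.05])
```

Applying this rule to data sets measured on the same biological samples would overstate significance, which is the situation the paper's covariance estimation is designed to address.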
9.  The phosphoproteome of toll-like receptor-activated macrophages 
First global and quantitative analysis of phosphorylation cascades induced by toll-like receptor (TLR) stimulation in macrophages identifies nearly 7000 phosphorylation sites and shows extensive and dynamic up-regulation and down-regulation after lipopolysaccharide (LPS). In addition to the canonical TLR-associated pathways, mining of the phosphorylation data suggests an involvement of ATM/ATR kinases in signalling and shows that the cytoskeleton is a hotspot of TLR-induced phosphorylation. Intersecting transcription factor phosphorylation with bioinformatic promoter analysis of genes induced by LPS identified several candidate transcriptional regulators that were previously not implicated in TLR-induced transcriptional control.
Toll-like receptors (TLR) are a family of pattern recognition receptors that enable innate immune cells to sense infectious danger. Recognition of microbial structures, like lipopolysaccharide (LPS) of Gram-negative bacteria by TLR4, causes within hours substantial re-programming of macrophage gene expression, including up-regulation of chemokines driving inflammation, anti-microbial effector molecules and cytokines directing adaptive immune responses. TLR signalling is initiated by the adapter protein Myd88 and leads to the activation of kinase cascades that result in activation of the MAPK and NFkB pathways. Phosphorylation has an essential role in these early steps of TLR signalling, and in addition regulates critical transcription factors (TFs). Although TLR signalling has been extensively studied, a comprehensive analysis of phosphorylation events in TLR-activated macrophages is lacking. It is therefore unknown whether the canonical MAPK and NFkB pathways comprise the main phosphorylation events and which other molecular functions and processes are regulated by phosphorylation after stimulation with LPS.
Recent progress in mass spectrometry-based proteomics has opened the possibility to quantitatively investigate global changes in protein abundance and post-translational modifications. Stable isotope labelling with amino acids in cell culture (SILAC) allows highly accurate quantification, and has proved especially useful for direct comparison of phosphopeptide abundance in time-course or treatment analyses.
Here, we adapted SILAC to primary mouse macrophages, and performed a global, quantitative and kinetic analysis of the macrophage phosphoproteome after LPS stimulation. Bioinformatic analyses were used to identify kinases, pathways and biological processes enriched in the LPS-regulated phosphoproteome. To connect TF phosphorylation with transcription, we generated a parallel dataset of nascent RNA and used in silico promoter analysis to identify transcriptional regulators with binding site enrichment among the LPS-regulated gene set.
After establishing SILAC conditions for efficient labelling of primary bone marrow-derived macrophages in two independent experiments 1850 phosphoproteins with a total of 6956 phosphorylation sites were reproducibly identified. Phosphoproteins were detected from all cellular compartments, with a clear enrichment for nuclear and cytoskeleton-associated proteins. LPS caused major regulation of a large fraction of phosphopeptides, with 24% of all sites up-regulated and 9% down-regulated after stimulation (Figure 3A and B). These changes were highly dynamic, as the majority of the regulated phosphopeptides were up-regulated or down-regulated transiently or in a delayed manner (Figure 3C). Overall, the extent of changes in the phosphoproteome was comparable to the transcriptional re-programming, underscoring the importance of phosphorylation cascades in TLR signalling. Our parallel transcriptome data also showed that widespread phosphorylation precedes massive transcriptional changes.
To obtain footprints of kinase activation in response to TLR ligation, we searched phosphopeptide sequences for known linear sequence motifs of 33 kinases and identified kinase motifs enriched among LPS-regulated phosphorylation sites (compared to non-regulated phosphorylation sites) (Table I). Motif ERK/MAPK was highly enriched, in accordance with the essential role of the MAPK module in TLR signalling. Other kinases with motif enrichment have also recently been linked to TLR signalling (e.g. PKD; AKT and its targets GSK3 and mTOR). However, the DNA damage-activated kinases ATM/ATR and the cell cycle-associated kinases AURORA and CHK1/2 have not been associated with the macrophage response to TLR activation yet. These findings shed new light on older data on the effect of TLR on macrophage proliferation in response to macrophage colony stimulating factor. Of interest, in follow-up experiments using pharmacological inhibitors of the kinases with motif enrichment, we observed that inhibition of ATM kinase activity caused increased LPS-induced expression of several cytokines and chemokines, suggesting that this pathway regulates inflammatory responses.
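The motif search described above can be sketched as counting how often a kinase motif matches regulated versus non-regulated phosphosite windows. The example below is illustrative only: it assumes 7-residue windows centred on the phosphorylated residue and uses a minimal proline-directed pattern (Ser/Thr followed by Pro) as a stand-in for an ERK/MAPK-like motif. The motif definitions actually used in the study are not given in this summary, and the sequences are invented toy data.

```python
import re


def motif_table(regulated, background, pattern):
    """Return the 2x2 counts (a, b, c, d):
    a = regulated sites matching the motif, b = regulated sites not matching,
    c = background sites matching,          d = background sites not matching.
    """
    rx = re.compile(pattern)
    a = sum(1 for seq in regulated if rx.search(seq))
    c = sum(1 for seq in background if rx.search(seq))
    return a, len(regulated) - a, c, len(background) - c


def fold_enrichment(a, b, c, d):
    """Ratio of motif frequency in regulated vs. background sites."""
    if c == 0:
        return float("inf")
    return (a / (a + b)) / (c / (c + d))


# Assumed pattern: phosphosite at window position 4 (S or T), proline at +1.
PROLINE_DIRECTED = r"^.{3}[ST]P"

# Invented 7-residue windows, for illustration only.
regulated_sites = ["AAASPAA", "GGGTPGG", "AAASAAA"]
background_sites = ["AAASAAA", "GGGTPGG", "KKKSKKK", "LLLTLLL"]

table = motif_table(regulated_sites, background_sites, PROLINE_DIRECTED)
ratio = fold_enrichment(*table)
```

The resulting 2x2 table is the natural input for a significance test such as Fisher's exact test, which would be needed before calling a motif enriched.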
In further bioinformatic analyses, the Gene Ontology and signalling pathway annotations of phosphoproteins were used to identify signalling pathways and cellular processes targeted by TLR4-controlled phosphorylation (Table II). Among the expected hits, based on the known TLR pathways, were TLR signalling, MAPK and AKT as well as mTOR signalling. Of interest, the annotation terms ‘Rho GTPase cycle' and ‘cytoskeleton' were significantly enriched among LPS-regulated phosphoproteins, indicating a more prominent role for cytoskeletal proteins in the transduction of TLR signals or in the biological response to it.
We were especially interested in the phosphorylation of TFs and its regulation by LPS (Figure 6A). We hypothesised that functionally important TFs should have an increased frequency of binding sites in the promoters of LPS-regulated genes (Figure 6B). To identify transcriptionally regulated genes with high sensitivity, we isolated nascent RNA after metabolic labelling (Figure 6C–E). In silico promoter scanning using Genomatix software for binding sites for all 50 TF families with phosphorylated members was used to test for enrichment in transcriptionally induced genes (Figure 6F). At the early time point, binding site enrichment for the canonical TLR-associated TF NFkB was detected, and in addition we found that several other TF families with an established role in the transcription of individual LPS-target genes showed binding site enrichment (CEBP, MEF2, NFAT and HEAT). In addition, enrichment for OCT and HOXC binding sites at the early time point and SORY matrices later after stimulation indicated an involvement of the phosphorylated members of the respective TF families in the execution of TLR-induced transcriptional responses. An initial test of the function for a few of these candidate transcriptional regulators was performed using siRNA knockdown in primary macrophages. These experiments suggested that knockdown of the SORY binding phosphoprotein Capicua homolog (Cic) and to a lesser extent of the CREB family member Atf7 selectively attenuates LPS-induced expression of Il1a and Il1b.
In summary, this study provides a novel and global perspective on innate immune activation by TLR signalling (Figure 5). We quantitatively detected a large number of previously unknown site-specific phosphorylation events, which are now publicly available through the Phosida database. By combining different data mining approaches, we consistently identified canonical and newly implicated TLR-activated signalling modules. In particular, the PI3K/AKT and the related mTOR pathway were highlighted; furthermore, DNA damage–response associated ATM/ATR kinases and the cytoskeleton emerged as unexpected hotspots for phosphorylation. Finally, weaving together corresponding phophoproteome and nascent transcriptome datasets through the loom of in silico promoter analysis we identified TFs with a likely role in mediating TLR-induced gene expression programmes.
Recognition of microbial danger signals by toll-like receptors (TLR) causes re-programming of macrophages. To investigate kinase cascades triggered by the TLR4 ligand lipopolysaccharide (LPS) on systems level, we performed a global, quantitative and kinetic analysis of the phosphoproteome of primary macrophages using stable isotope labelling with amino acids in cell culture, phosphopeptide enrichment and high-resolution mass spectrometry. In parallel, nascent RNA was profiled to link transcription factor (TF) phosphorylation to TLR4-induced transcriptional activation. We reproducibly identified 1850 phosphoproteins with 6956 phosphorylation sites, two thirds of which were not reported earlier. LPS caused major dynamic changes in the phosphoproteome (24% up-regulation and 9% down-regulation). Functional bioinformatic analyses confirmed canonical players of the TLR pathway and highlighted other signalling modules (e.g. mTOR, ATM/ATR kinases) and the cytoskeleton as hotspots of LPS-regulated phosphorylation. Finally, weaving together phosphoproteome and nascent transcriptome data by in silico promoter analysis, we implicated several phosphorylated TFs in primary LPS-controlled gene expression.
doi:10.1038/msb.2010.29
PMCID: PMC2913394  PMID: 20531401
macrophage; nascent RNA; phosphoproteome; SILAC; toll-like receptors
10.  PathNet: a tool for pathway analysis using topological information 
Background
Identification of canonical pathways through enrichment of differentially expressed genes in a given pathway is a widely used method for interpreting gene lists generated from high-throughput experimental studies. However, most algorithms treat pathways as sets of genes, disregarding any inter- and intra-pathway connectivity information, and do not provide insights beyond identifying lists of pathways.
Results
We developed an algorithm (PathNet) that utilizes the connectivity information in canonical pathway descriptions to help identify study-relevant pathways and characterize non-obvious dependencies and connections among pathways using gene expression data. PathNet considers both the differential expression of genes and their pathway neighbors to strengthen the evidence that a pathway is implicated in the biological conditions characterizing the experiment. As an adjunct to this analysis, PathNet uses the connectivity of the differentially expressed genes among all pathways to score pathway contextual associations and statistically identify biological relations among pathways. In this study, we used PathNet to identify biologically relevant results in two Alzheimer’s disease microarray datasets, and compared its performance with existing methods. Importantly, PathNet identified de-regulation of the ubiquitin-mediated proteolysis pathway as an important component in Alzheimer’s disease progression, despite the absence of this pathway in the standard enrichment analyses.
Conclusions
PathNet is a novel method for identifying enrichment and association between canonical pathways in the context of gene expression data. It takes into account topological information present in pathways to reveal biological information. PathNet is available as an R workspace image from http://www.bhsai.org/downloads/pathnet/.
doi:10.1186/1751-0473-7-10
PMCID: PMC3563509  PMID: 23006764
Canonical pathways; Pathway enrichment; Pathway association; Pathway interaction; Pathway topology
11.  In Vivo Response to Methotrexate Forecasts Outcome of Acute Lymphoblastic Leukemia and Has a Distinct Gene Expression Profile 
PLoS Medicine  2008;5(4):e83.
Background
Childhood acute lymphoblastic leukemia (ALL) is the most common cancer in children, and can now be cured in approximately 80% of patients. Nevertheless, drug resistance is the major cause of treatment failure in children with ALL. The drug methotrexate (MTX), which is widely used to treat many human cancers, is used in essentially all treatment protocols worldwide for newly diagnosed ALL. Although MTX has been extensively studied for many years, relatively little is known about mechanisms of de novo resistance in primary cancer cells, including leukemia cells. This lack of knowledge is due in part to the fact that existing in vitro methods are not sufficiently reliable to permit assessment of MTX resistance in primary ALL cells. Therefore, we measured the in vivo antileukemic effects of MTX and identified genes whose expression differed significantly in patients with a good versus poor response to MTX.
Methods and Findings
We utilized measures of decreased circulating leukemia cells of 293 newly diagnosed children after initial “up-front” in vivo MTX treatment (1 g/m²) to elucidate interpatient differences in the antileukemic effects of MTX. To identify genomic determinants of these effects, we performed a genome-wide assessment of gene expression in primary ALL cells from 161 of these newly diagnosed children (1–18 y). We identified 48 genes and two cDNA clones whose expression was significantly related to the reduction of circulating leukemia cells after initial in vivo treatment with MTX. This finding was validated in an independent cohort of children with ALL. Furthermore, this measure of initial MTX in vivo response and the associated gene expression pattern were predictive of long-term disease-free survival (p < 0.001, p = 0.02).
Conclusions
Together, these data provide new insights into the genomic basis of MTX resistance and interpatient differences in MTX response, pointing to new strategies to overcome MTX resistance in childhood ALL.
Trial registrations: Total XV, Therapy for Newly Diagnosed Patients With Acute Lymphoblastic Leukemia, http://www.ClinicalTrials.gov (NCT00137111); Total XIIIBH, Phase III Randomized Study of Antimetabolite-Based Induction plus High-Dose MTX Consolidation for Newly Diagnosed Pediatric Acute Lymphocytic Leukemia at Intermediate or High Risk of Treatment Failure (NCI-T93-0101D); Total XIIIBL, Phase III Randomized Study of Antimetabolite-Based Induction plus High-Dose MTX Consolidation for Newly Diagnosed Pediatric Acute Lymphocytic Leukemia at Lower Risk of Treatment Failure (NCI-T93-0103D).
William Evans and colleagues investigate the genomic determinants of methotrexate resistance and interpatient differences in methotrexate response in patients newly diagnosed with childhood acute lymphoblastic leukemia.
Editors' Summary
Background.
Every year about 10,000 children develop cancer in the US. Acute lymphoblastic leukemia (ALL), a rapidly progressing blood cancer, accounts for a quarter of these childhood cancers. Normally, cells in the bone marrow (the spongy material inside bones) develop into lymphocytes (white blood cells that fight infections), red blood cells (which carry oxygen round the body), platelets (which prevent excessive bleeding), and granulocytes (another type of white blood cell). However, in ALL, genetic changes in immature lymphocytes (lymphoblasts) mean that these cells divide uncontrollably and fail to mature. Eventually, the bone marrow fills up with these abnormal cells and can no longer make healthy blood cells. As a result, children with ALL cannot fight infections. They also bruise and bleed easily and, because they do not have enough red blood cells, they often complain of tiredness and weakness. With modern chemotherapy protocols (combinations of drugs that kill the fast-dividing cancer cells but leave the normal, nondividing cells in the body largely unscathed), more than 80% of children with ALL live for at least 5 years.
Why Was This Study Done?
Although this survival rate is good, some patients still die because their cancer cells are resistant to one or more chemotherapy drugs. For some drugs, the genetic characteristics of the ALL cells that make them resistant are known. Unfortunately, little is known about why some ALL cells are resistant to methotrexate, a component of most treatment protocols for newly diagnosed ALL. Methotrexate kills dividing cells by interfering with DNA synthesis and repair. Cancer cells can be resistant to methotrexate for many reasons—they may have acquired genetic changes that stop the drug from entering them, for example. These resistance mechanisms need to be understood better before new strategies can be developed for the treatment of methotrexate-resistant ALL. In this study, the researchers have determined the response of newly diagnosed patients to methotrexate and have investigated the gene expression patterns in ALL cells that correlate with good and bad responses to methotrexate.
What Did the Researchers Do and Find?
The researchers measured the reduction in circulating leukemia cells that followed the first treatment with methotrexate of nearly 300 patients with newly diagnosed ALL. They also used “microarray” analysis to investigate the gene expression patterns in lymphoblast samples taken from the bone marrow of 161 patients before treatment. They found that the expression of 50 genes was significantly related to the reduction in circulating leukemia cells after methotrexate treatment (a result confirmed in an independent group of patients). Of these genes, the expression of 29 was higher in patients who responded poorly to methotrexate than in patients who responded well. A “global analysis test,” which examined the gene expression profile of different cellular pathways in relation to the methotrexate response, found a significant association between the nucleotide biosynthesis pathway (which is needed for DNA synthesis and cellular proliferation) and the methotrexate response. Finally, patients with the best methotrexate response and the 50-gene expression profile indicative of a good response were more likely to be alive after 5 years than patients with the worst methotrexate response and the poor-response gene expression profile.
What Do These Findings Mean?
These findings provide important new insights into the genetic basis of methotrexate resistance in newly diagnosed childhood ALL and begin to explain why some patients fail to respond to this drug. They also show that the reduction in circulating leukemic cells shortly after the first methotrexate dose and a specific gene expression profile both predict the long-term survival of patients. These findings also suggest new ways to modulate sensitivity to methotrexate. Down-regulation of the expression of the genes that are expressed more highly in poor responders than in good responders might improve patient responses to methotrexate. Alternatively, it might be possible to find ways to increase the expression of the genes that are underexpressed in methotrexate poor responders and so improve the outlook for at least some of the children with ALL who fail to respond to current chemotherapy protocols.
Additional Information.
Please access these Web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.0050083.
• The US National Cancer Institute provides a fact sheet for patients and caregivers about ALL in children and information about its treatment (in English and Spanish)
• The UK charity Cancerbackup provides information for patients and caregivers on ALL in children and on methotrexate
• The US Leukemia and Lymphoma Society also provides information for patients and caregivers about ALL
• The Children's Cancer and Leukaemia Group (a UK charity) provides information for children with cancer and their families
• MedlinePlus provides additional information about methotrexate (in English and Spanish)
doi:10.1371/journal.pmed.0050083
PMCID: PMC2292747  PMID: 18416598
12.  Genome-wide pathway analysis of memory impairment in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort implicates gene candidates, canonical pathways, and networks 
Brain imaging and behavior  2012;6(4):634-648.
Memory deficits are prominent features of mild cognitive impairment (MCI) and Alzheimer’s disease (AD). The genetic architecture underlying these memory deficits likely involves the combined effects of multiple genetic variants operative within numerous biological pathways. In order to identify functional pathways associated with memory impairment, we performed a pathway enrichment analysis on genome-wide association data from 742 Alzheimer’s Disease Neuroimaging Initiative (ADNI) participants. A composite measure of memory was generated as the phenotype for this analysis by applying modern psychometric theory to item-level data from the ADNI neuropsychological test battery. Using the GSA-SNP software tool, we identified 27 canonical, expertly-curated pathways with enrichment (FDR-corrected p-value < 0.05) against this composite memory score. Processes classically understood to be involved in memory consolidation, such as neurotransmitter receptor-mediated calcium signaling and long-term potentiation, were highly represented among the enriched pathways. In addition, pathways related to cell adhesion, neuronal differentiation and guided outgrowth, and glucose- and inflammation-related signaling were also enriched. Among genes that were highly-represented in these enriched pathways, we found indications of coordinated relationships, including one large gene set that is subject to regulation by the SP1 transcription factor, and another set that displays co-localized expression in normal brain tissue along with known AD risk genes. These results 1) demonstrate that psychometrically-derived composite memory scores are an effective phenotype for genetic investigations of memory impairment and 2) highlight the promise of pathway analysis in elucidating key mechanistic targets for future studies and for therapeutic interventions.
doi:10.1007/s11682-012-9196-x
PMCID: PMC3713637  PMID: 22865056
memory; psychometrics; Alzheimer’s disease; mild cognitive impairment; pathway analysis; genome-wide association study
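The abstract above reports pathways at an FDR-corrected p-value below 0.05 but does not say which correction was applied. As a minimal sketch, assuming the widely used Benjamini-Hochberg procedure (an assumption, not a description of GSA-SNP itself), adjusted p-values can be computed as follows; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used here purely for illustration; the abstract
    only states that p-values were "FDR-corrected".
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw p = {p:.3f}  adjusted = {q:.3f}")
```

A pathway would then be called enriched when its adjusted value falls below the chosen threshold (0.05 in the abstract).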
13.  Investigating the concordance of Gene Ontology terms reveals the intra- and inter-platform reproducibility of enrichment analysis 
BMC Bioinformatics  2013;14:143.
Background
Reliability and reproducibility of differentially expressed genes (DEGs) are essential for the biological interpretation of microarray data. The microarray quality control (MAQC) project launched by the US Food and Drug Administration (FDA) showed that the lists of DEGs generated by intra- and inter-platform comparisons can reach a high level of concordance, which mainly depended on the statistical criteria used for ranking and selecting DEGs. Generally, combining fold change ranking with a non-stringent p-value cutoff produces reproducible lists of DEGs. For further interpretation of the gene expression data, statistical methods of gene enrichment analysis provide powerful tools for associating the DEGs with prior biological knowledge, e.g. Gene Ontology (GO) terms and pathways, and are widely used in genome-wide research. Although the DEG lists generated from the same compared conditions proved to be reliable, reproducible enrichment results are still crucial to the discovery of the underlying molecular mechanism differentiating the two conditions. Therefore, it is important to know whether the enrichment results are still reproducible when using the lists of DEGs generated by different statistical criteria from inter-laboratory and cross-platform comparisons. In our study, we used the MAQC data sets to systematically assess the intra- and inter-platform concordance of GO terms enriched by Gene Set Enrichment Analysis (GSEA) and LRpath.
Results
In intra-platform comparisons, the percentage of overlapping enriched GO terms was as high as ~80% when the input lists of DEGs were generated by fold change ranking and Significance Analysis of Microarrays (SAM), whereas the percentages decreased by about 20% when generating the lists of DEGs by using fold change ranking and t-test, or by using SAM and t-test. Similar results were found in inter-platform comparisons.
Conclusions
Our results demonstrated that highly concordant lists of DEGs can ensure highly concordant enrichment results. Importantly, based on the lists of DEGs generated by a straightforward method of combining fold change ranking with a non-stringent p-value cutoff, enrichment analysis will produce reproducible enriched GO terms for biological interpretation.
doi:10.1186/1471-2105-14-143
PMCID: PMC3644270  PMID: 23627640
DNA microarray; Intra-/inter-platform comparison; Gene Ontology enrichment; Microarray quality control (MAQC)
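The concordance measure in the abstract above is a percentage of overlapping enriched GO terms, but the exact denominator is not given. The sketch below assumes the shared terms are divided by the size of the smaller list; that choice, the function name, and the GO identifiers used as sample input are all illustrative assumptions.

```python
def overlap_percentage(terms_a, terms_b):
    """Percentage of enriched terms shared by two result lists.

    Assumption: the denominator is the size of the smaller list; the
    abstract does not state which normalisation was used.
    """
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    # Hypothetical enriched-term lists from two DEG selection methods.
    fold_change_terms = ["GO:0000001", "GO:0000002", "GO:0000003",
                         "GO:0000004", "GO:0000005"]
    sam_terms = ["GO:0000001", "GO:0000002", "GO:0000003",
                 "GO:0000004", "GO:0000009"]
    print(overlap_percentage(fold_change_terms, sam_terms))
```

Using the union as the denominator instead (a Jaccard index) would give a stricter score for the same pair of lists.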
14.  Chapter 8: Biological Knowledge Assembly and Interpretation 
PLoS Computational Biology  2012;8(12):e1002858.
Most methods for large-scale gene expression microarray and RNA-Seq data analysis are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. The most challenging and rate-limiting step, however, is to determine what the resulting lists of genes and/or transcripts biologically mean. Biomedical ontology and pathway-based functional enrichment analysis is widely used to interpret the functional role of tightly correlated or differentially expressed genes. The groups of genes are assigned to the associated biological annotations using Gene Ontology terms or biological pathways and then tested if they are significantly enriched with the corresponding annotations. Unlike previous approaches, Gene Set Enrichment Analysis takes quite the reverse approach by using pre-defined gene sets. Differential co-expression analysis determines the degree of co-expression difference of paired gene sets across different conditions. Outcomes in DNA microarray and RNA-Seq data can be transformed into the graphical structure that represents biological semantics. A number of biomedical annotation and external repositories including clinical resources can be systematically integrated by biological semantics within the framework of concept lattice analysis. This array of methods for biological knowledge assembly and interpretation has been developed during the past decade and clearly improved our biological understanding of large-scale genomic data from the high-throughput technologies.
doi:10.1371/journal.pcbi.1002858
PMCID: PMC3531281  PMID: 23300429
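The enrichment test described in the abstract above, in which a gene list is checked for significant over-representation of an annotation, is commonly carried out with a one-sided hypergeometric test. The abstract does not name a specific test, so the following is a minimal sketch under that assumption; the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, annotated_in_universe,
                           list_size, annotated_in_list):
    """One-sided hypergeometric p-value, P(X >= annotated_in_list).

    X is the number of annotated genes seen when list_size genes are
    drawn without replacement from a universe of universe_size genes,
    of which annotated_in_universe carry the annotation.
    """
    N, K, n, k = (universe_size, annotated_in_universe,
                  list_size, annotated_in_list)
    upper = min(K, n)
    # Sum the probability mass from k up to the largest possible overlap.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 genes, 200 in the pathway,
    # 500 genes in the list, 15 of them in the pathway.
    print(hypergeom_enrichment_p(20000, 200, 500, 15))
```

In practice the resulting p-values are computed for many annotations at once and then corrected for multiple testing.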
15.  Integrative pathway analysis of genome-wide association studies and gene expression data in prostate cancer 
BMC Systems Biology  2012;6(Suppl 3):S13.
Background
Pathway analysis of large-scale omics data assists us with the examination of the cumulative effects of multiple functionally related genes, which are difficult to detect using the traditional single gene/marker analysis. So far, most of the genomic studies have been conducted in a single domain, e.g., by genome-wide association studies (GWAS) or microarray gene expression investigation. A combined analysis of disease susceptibility genes across multiple platforms at the pathway level is an urgent need because it can reveal more reliable and more biologically important information.
Results
We performed an integrative pathway analysis of a GWAS dataset and a microarray gene expression dataset in prostate cancer. We obtained a comprehensive pathway annotation set from knowledge-based public resources, including KEGG pathways and the prostate cancer candidate gene set, and gene sets specifically defined based on cross-platform information. Leveraging this pathway collection, we first searched for significant pathways in the GWAS dataset using four methods, which represent two broad groups of pathway analysis approaches. The significant pathways identified by each method varied greatly, but the results were more consistent within each method group than between groups. Next, we conducted a gene set enrichment analysis of the microarray gene expression data and found 13 pathways with cross-platform evidence, including "Fc gamma R-mediated phagocytosis" (P_GWAS = 0.003, P_expr < 0.001, and P_combined = 6.18 × 10⁻⁸), "regulation of actin cytoskeleton" (P_GWAS = 0.003, P_expr = 0.009, and P_combined = 3.34 × 10⁻⁴), and "Jak-STAT signaling pathway" (P_GWAS = 0.001, P_expr = 0.084, and P_combined = 8.79 × 10⁻⁴).
Conclusions
Our results provide evidence at both the genetic variation and expression levels that several key pathways might have been involved in the pathological development of prostate cancer. Our framework that employs gene expression data to facilitate pathway analysis of GWAS data is not only feasible but also much needed in studying complex disease.
doi:10.1186/1752-0509-6-S3-S13
PMCID: PMC3524313  PMID: 23281744
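The abstract above reports a combined p-value for each pathway without stating how the GWAS and expression p-values were merged. One standard option is Fisher's method, which for exactly two independent p-values has the closed form k(1 − ln k) with k = p1 · p2. The sketch below assumes Fisher's method; this is an illustrative assumption, not a statement of the authors' procedure, and the function name is hypothetical.

```python
from math import log


def fisher_combine_two(p1, p2):
    """Combine two independent p-values with Fisher's method.

    The statistic -2 * (ln p1 + ln p2) follows a chi-square distribution
    with 4 degrees of freedom under the null, whose upper tail reduces
    to k * (1 - ln k) for k = p1 * p2.
    """
    if not (0.0 < p1 <= 1.0 and 0.0 < p2 <= 1.0):
        raise ValueError("p-values must lie in (0, 1]")
    k = p1 * p2
    return k * (1.0 - log(k))


if __name__ == "__main__":
    # Rounded values quoted in the abstract for
    # "regulation of actin cytoskeleton".
    print(fisher_combine_two(0.003, 0.009))
```

With the rounded inputs 0.003 and 0.009 this gives roughly 3.1 × 10⁻⁴, close to but not identical with the reported 3.34 × 10⁻⁴; the gap may reflect rounding of the published inputs or a different combination rule.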
16.  ReactomeFIViz: a Cytoscape app for pathway and network-based data analysis 
F1000Research  2014;3:146.
High-throughput experiments are routinely performed in modern biological studies. However, extracting meaningful results from massive experimental data sets is a challenging task for biologists. Projecting data onto pathway and network contexts is a powerful way to unravel patterns embedded in seemingly scattered large data sets and assist knowledge discovery related to cancer and other complex diseases. We have developed a Cytoscape app called “ReactomeFIViz”, which utilizes a highly reliable gene functional interaction network combined with human curated pathways derived from Reactome and other pathway databases. This app provides a suite of features to assist biologists in performing pathway- and network-based data analysis in a biologically intuitive and user-friendly way. Biologists can use this app to uncover network and pathway patterns related to their studies, search for gene signatures from gene expression data sets, reveal pathways significantly enriched by genes in a list, and integrate multiple genomic data types into a pathway context using probabilistic graphical models. We believe our app will give researchers substantial power to analyze intrinsically noisy high-throughput experimental data to find biologically relevant information.
doi:10.12688/f1000research.4431.2
PMCID: PMC4184317  PMID: 25309732
17.  Graphite Web: web tool for gene set analysis exploiting pathway topology 
Nucleic Acids Research  2013;41(Web Server issue):W89-W97.
Graphite web is a novel web tool for pathway analyses and network visualization for gene expression data of both microarray and RNA-seq experiments. Several pathway analyses have been proposed either in the univariate or in the global and multivariate context to tackle the complexity and the interpretation of expression results. These methods can be further divided into ‘topological’ and ‘non-topological’ methods according to their ability to gain power from pathway topology. Biological pathways are, in fact, not only gene lists but can be represented through a network where genes and connections are, respectively, nodes and edges. To this day, the most used approaches are non-topological and univariate although they miss the relationship among genes. In contrast, topological and multivariate approaches are more powerful, but difficult to use for researchers without bioinformatic skills. Here we present Graphite web, the first public web server for pathway analysis on gene expression data that combines topological and multivariate pathway analyses with an efficient system of interactive network visualizations for easy results interpretation. Specifically, Graphite web implements five different gene set analyses on three model organisms and two pathway databases. Graphite Web is freely available at http://graphiteweb.bio.unipd.it/.
doi:10.1093/nar/gkt386
PMCID: PMC3977659  PMID: 23666626
18.  Concordant integrative gene set enrichment analysis of multiple large-scale two-sample expression data sets 
BMC Genomics  2014;15(Suppl 1):S6.
Background
Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment.
Methods
We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets.
Results
We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis based on the first two data sets was conducted to compare our result with a previously published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and also all three data sets. Both results showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method.
Conclusions
This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.
doi:10.1186/1471-2164-15-S1-S6
PMCID: PMC4046697  PMID: 24564564
19.  Lists2Networks: Integrated analysis of gene/protein lists 
BMC Bioinformatics  2010;11:87.
Background
Systems biologists are faced with the difficulty of analyzing results from large-scale studies that profile the activity of many genes, RNAs and proteins, applied in different experiments, under different conditions, and reported in different publications. To address this challenge it is desirable to compare the results from different related studies such as mRNA expression microarrays, genome-wide ChIP-X, RNAi screens, proteomics and phosphoproteomics experiments in a coherent global framework. In addition, linking high-content multilayered experimental results with prior biological knowledge can be useful for identifying functional themes and forming novel hypotheses.
Results
We present Lists2Networks, a web-based system that allows users to upload lists of mammalian genes/proteins onto a server-based program for integrated analysis. The system includes web-based tools to manipulate lists with different set operations, to expand lists using existing mammalian networks of protein-protein interactions, co-expression correlation, or background knowledge co-annotation correlation, as well as to apply gene-list enrichment analyses against many gene-list libraries of prior biological knowledge such as pathways, gene ontology terms, kinase-substrate, microRNA-mRNA and protein-protein interactions, metabolites, and protein domains. Such analyses can be applied to several lists at once against many prior knowledge libraries of gene-lists associated with specific annotations. The system also contains features that allow users to export networks and share lists with other users of the system.
Conclusions
Lists2Networks is a user friendly web-based software system expected to significantly ease the computational analysis process for experimental systems biologists employing high-throughput experiments at multiple layers of regulation. The system is freely available at http://www.lists2networks.org.
doi:10.1186/1471-2105-11-87
PMCID: PMC2843617  PMID: 20152038
20.  Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data 
BMC Bioinformatics  2011;12:282.
Background
Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data.
Description
The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.
Conclusions
The Algal Functional Annotation Tool aims to provide an integrated data-mining environment for algal genomics by combining data from multiple annotation databases into a centralized tool. This site is designed to expedite the process of functional annotation and the interpretation of gene lists, such as those derived from high-throughput RNA-seq experiments. The tool is publicly available at http://pathways.mcdb.ucla.edu.
doi:10.1186/1471-2105-12-282
PMCID: PMC3144025  PMID: 21749710
21.  Comparison and consolidation of microarray data sets of human tissue expression 
BMC Genomics  2010;11:305.
Background
Human tissue displays a remarkable diversity in structure and function. To understand how such diversity emerges from the same DNA, systematic measurements of gene expression across different tissues in the human body are essential. Several recent studies addressed this formidable task using microarray technologies. These large tissue expression data sets have provided us an important basis for biomedical research. However, it is well known that microarray data can be compromised by high noise level and various experimental artefacts. Critical comparison of different data sets can help to reveal such errors and to avoid pitfalls in their application.
Results
We present here the first comparison and integration of four freely available tissue expression data sets generated using three different microarray platforms and containing a total of 377 microarray hybridizations. When assessing the tissue expression of genes, we found that the results considerably depend on the chosen data set. Nevertheless, the comparison also revealed statistically significant similarity of gene expression profiles across different platforms. This enabled us to construct consolidated lists of platform-independent tissue-specific genes using a set of complementary measures. Follow-up analyses showed that results based on consolidated data tend to be more reliable.
Conclusions
Our study strongly indicates that the consolidation of the four different tissue expression data sets can increase data quality and can lead to biologically more meaningful results. The provided compendium of platform-independent gene lists should facilitate the identification of novel tissue-specific marker genes.
doi:10.1186/1471-2164-11-305
PMCID: PMC2885367  PMID: 20465848
22.  Integrated Enrichment Analysis of Variants and Pathways in Genome-Wide Association Studies Indicates Central Role for IL-2 Signaling Genes in Type 1 Diabetes, and Cytokine Signaling Genes in Crohn's Disease 
PLoS Genetics  2013;9(10):e1003770.
Pathway analyses of genome-wide association studies aggregate information over sets of related genes, such as genes in common pathways, to identify gene sets that are enriched for variants associated with disease. We develop a model-based approach to pathway analysis, and apply this approach to data from the Wellcome Trust Case Control Consortium (WTCCC) studies. Our method offers several benefits over existing approaches. First, our method not only interrogates pathways for enrichment of disease associations, but also estimates the level of enrichment, which yields a coherent way to promote variants in enriched pathways, enhancing discovery of genes underlying disease. Second, our approach allows for multiple enriched pathways, a feature that leads to novel findings in two diseases where the major histocompatibility complex (MHC) is a major determinant of disease susceptibility. Third, by modeling disease as the combined effect of multiple markers, our method automatically accounts for linkage disequilibrium among variants. Interrogation of pathways from eight pathway databases yields strong support for enriched pathways, indicating links between Crohn's disease (CD) and cytokine-driven networks that modulate immune responses; between rheumatoid arthritis (RA) and “Measles” pathway genes involved in immune responses triggered by measles infection; and between type 1 diabetes (T1D) and IL2-mediated signaling genes. Prioritizing variants in these enriched pathways yields many additional putative disease associations compared to analyses without enrichment. For CD and RA, 7 of 8 additional non-MHC associations are corroborated by other studies, providing validation for our approach. For T1D, prioritization of IL-2 signaling genes yields strong evidence for 7 additional non-MHC candidate disease loci, as well as suggestive evidence for several more. Of the 7 strongest associations, 4 are validated by other studies, and 3 (near IL-2 signaling genes RAF1, MAPK14, and FYN) constitute novel putative T1D loci for further study.
Author Summary
Genome-wide association studies have helped locate gene variants that affect our susceptibility to diseases. The analysis of these studies is typically straightforward: test whether each genetic variant is correlated with predisposition to disease. This approach often works well for identifying commonly occurring variants with moderate effects on disease risk. However, the effects of many variants are so small they fail to register statistically significant correlations. This is a concern because many diseases are modulated by many genetic factors with small effects on disease risk. An alternative is to examine groups of variants, such as variants sharing a common pathway, and assess whether these groups are “enriched” for correlations with disease. This can be a more effective approach to identifying genetic factors relevant to disease. However, it does not tell us which genes are associated with disease. To address this limitation, we describe an approach that integrates enrichment analysis with tests for disease-variant correlations within a single framework. We illustrate this approach in genome-wide studies of seven complex diseases. We show that our approach supports enriched pathways in several diseases, and uncovers disease-susceptibility genes in these pathways not identified in conventional analyses of the same data.
doi:10.1371/journal.pgen.1003770
PMCID: PMC3789883  PMID: 24098138
23.  iRegulon: From a Gene List to a Gene Regulatory Network Using Large Motif and Track Collections 
PLoS Computational Biology  2014;10(7):e1003731.
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org.
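The "ranking-and-recovery" idea can be sketched as follows: given a genome-wide ranking of genes for one motif, count how quickly the genes of the input set are recovered while walking down the ranking, and summarize the recovery curve by its area. The abstract does not give iRegulon's actual scoring formula, so the normalized area used here, the function name, and the toy data are assumptions for illustration only.

```python
def recovery_auc(ranked_genes, gene_set, top_n):
    """Normalized area under the recovery curve over the first top_n ranks.

    ranked_genes : genes ordered from best to worst score for one motif
    gene_set     : the co-expressed input genes to recover
    top_n        : how many top ranks to consider
    """
    gene_set = set(gene_set)
    considered = ranked_genes[:top_n]
    hits = 0
    area = 0
    for gene in considered:
        if gene in gene_set:
            hits += 1
        area += hits  # height of the recovery curve at this rank
    # Best case: every recoverable gene sits at the very top of the ranking.
    best = min(len(gene_set), len(considered))
    max_area = sum(min(rank, best) for rank in range(1, len(considered) + 1))
    return area / max_area if max_area else 0.0


if __name__ == "__main__":
    ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]  # hypothetical ranking
    print(recovery_auc(ranking, {"g1", "g3"}, top_n=4))
```

A motif whose ranking places the input genes near the top gets a value close to 1; comparing this value across many motifs is one way to flag enriched motifs.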
Author Summary
Gene regulatory networks control developmental, homeostatic, and disease processes by governing precise levels and spatio-temporal patterns of gene expression. Determining their topology can provide mechanistic insight into these processes. Gene regulatory networks consist of interactions between transcription factors and their direct target genes. Each regulatory interaction represents the binding of the transcription factor to a specific DNA binding site near its target gene. Here we present a computational method, called iRegulon, to identify master regulators and direct target genes in a human gene signature, i.e. a set of co-expressed genes. iRegulon relies on the analysis of the regulatory sequences around each gene in the gene set to detect enriched TF motifs or ChIP-seq peaks, using databases of nearly 10,000 TF motifs and 1000 ChIP-seq data sets or “tracks”. Next, it associates enriched motifs and tracks with candidate transcription factors and determines the optimal subset of direct target genes. We validate iRegulon on ENCODE data, and use it in combination with RNA-seq and ChIP-seq data to map a p53 downstream network with new predicted co-factors and targets. iRegulon is available as a Cytoscape plugin, supporting human, mouse, and Drosophila genes, and provides access to hundreds of cancer-related TF-target subnetworks or “regulons”.
doi:10.1371/journal.pcbi.1003731
PMCID: PMC4109854  PMID: 25058159
24.  Appearance frequency modulated gene set enrichment testing 
BMC Bioinformatics  2011;12:81.
Background
Gene set enrichment testing has helped bridge the gap from an individual gene to a systems biology interpretation of microarray data. Although gene sets are defined a priori based on biological knowledge, current methods for gene set enrichment testing treat all genes equally. It is well-known that some genes, such as those responsible for housekeeping functions, appear in many pathways, whereas other genes are more specialized and play a unique role in a single pathway. Drawing inspiration from the field of information retrieval, we have developed and present here an approach to incorporate gene appearance frequency (in KEGG pathways) into two current methods, Gene Set Enrichment Analysis (GSEA) and logistic regression-based LRpath framework, to generate more reproducible and biologically meaningful results.
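To make "appearance frequency" concrete, the sketch below counts how many pathways each gene appears in and turns the count into a weight that is small for genes found almost everywhere. The abstract does not state the weighting function used in GSEA-AF or LRpath-AF; the inverse-document-frequency style formula, the function names, and the toy pathways are assumptions borrowed from information retrieval.

```python
from collections import Counter
from math import log


def appearance_frequency(pathways):
    """Count the number of pathways in which each gene appears."""
    counts = Counter()
    for genes in pathways.values():
        counts.update(set(genes))  # count a gene once per pathway
    return counts


def idf_weights(pathways):
    """Assumed IDF-style weight: log(total pathways / pathways containing gene)."""
    n_pathways = len(pathways)
    counts = appearance_frequency(pathways)
    return {gene: log(n_pathways / c) for gene, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical pathway memberships, not real KEGG data.
    toy = {
        "pathway_1": ["GENE_A", "GENE_B"],
        "pathway_2": ["GENE_A", "GENE_C"],
        "pathway_3": ["GENE_A"],
    }
    print(idf_weights(toy))
```

Under this assumed scheme a gene present in every pathway receives weight 0, while a gene unique to one pathway receives the largest weight.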
Results
Two breast cancer microarray datasets were analyzed to identify gene sets differentially expressed between histological grade 1 and 3 breast cancer. The correlation of Normalized Enrichment Scores (NES) between gene sets, generated by the original GSEA and GSEA with the appearance frequency of genes incorporated (GSEA-AF), was compared. GSEA-AF resulted in higher correlation between experiments and more overlapping top gene sets. Several cancer-related gene sets achieved higher NES in GSEA-AF as well. The same datasets were also analyzed by LRpath and LRpath with the appearance frequency of genes incorporated (LRpath-AF). Two well-studied lung cancer datasets were also analyzed in the same manner to demonstrate the validity of the method, and similar results were obtained.
Conclusions
We introduce an alternative way to integrate KEGG PATHWAY information into gene set enrichment testing. The performance of GSEA and LRpath can be enhanced with the integration of appearance frequency of genes. We conclude that, generally, gene set analysis methods with the integration of information from KEGG PATHWAY perform better both statistically and biologically.
doi:10.1186/1471-2105-12-81
PMCID: PMC3213687  PMID: 21418606
25.  A chain reaction approach to modelling gene pathways 
Translational cancer research  2012;1(2):61-73.
Background
Of great interest in cancer prevention is how nutrient components affect gene pathways associated with the physiological events of puberty. Nutrient-gene interactions may cause changes in breast or prostate cells and, therefore, may result in cancer risk later in life. Analysis of gene pathways can lead to insights about nutrient-gene interactions and the development of more effective prevention approaches to reduce cancer risk. To date, researchers have relied heavily upon experimental assays (such as microarray analysis) to identify genes and their associated pathways that are affected by nutrients and diets. However, the vast number of genes and combinations of gene pathways, coupled with the expense of the experimental analyses, has delayed the progress of gene-pathway research. The development of an analytical approach based on available test data could greatly benefit the evaluation of gene pathways, and thus advance the study of nutrient-gene interactions in cancer prevention. In the present study, we have proposed a chain reaction model to simulate gene pathways, in which the gene expression changes through the pathway are represented by the species undergoing a set of chemical reactions. We have also developed a numerical tool to solve for the species changes due to the chain reactions over time. Through this approach we can examine the impact of nutrient-containing diets on the gene pathway; moreover, transformation of genes over time with a nutrient treatment can be observed numerically, which is very difficult to achieve experimentally. We apply this approach to microarray analysis data from an experiment which involved the effects of three polyphenols (nutrient treatments), epigallo-catechin-3-O-gallate (EGCG), genistein, and resveratrol, in a study of nutrient-gene interaction in the estrogen synthesis pathway during puberty.
Results
In this preliminary study, the estrogen synthesis pathway was simulated by a chain reaction model. By applying it to microarray data, the chain reaction model computed a set of reaction rates to examine the effects of three polyphenols (EGCG, genistein, and resveratrol) on gene expression in this pathway during puberty. We first performed statistical analysis to test the time factor on the estrogen synthesis pathway. Global tests were used to evaluate an overall gene expression change during puberty for each experimental group. Then, a chain reaction model was employed to simulate the estrogen synthesis pathway. Specifically, the model computed the reaction rates in a set of ordinary differential equations to describe interactions between genes in the pathway (a reaction rate K of A to B means that gene A will induce gene B per unit at a rate of K; we give details in the “method” section). Since disparate changes of gene expression may cause numerical error problems in solving these differential equations, we used an implicit scheme to address this issue. We first applied the chain reaction model to obtain the reaction rates for the control group. A sensitivity study was conducted to evaluate how well the model fits the control group data at Day 50. Results showed a small bias and mean square error. These observations indicated the model is robust to low levels of random noise and has a good fit for the control group. Then the chain reaction model derived from the control group data was used to predict gene expression at Day 50 for the three polyphenol groups. If these nutrients affect the estrogen synthesis pathways during puberty, we expect a discrepancy between observed and expected expressions. Results indicated some genes had large differences in the EGCG (e.g., Hsd3b and Sts) and the resveratrol (e.g., Hsd3b and Hrmt12) groups.
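The role of an implicit scheme can be illustrated with a minimal sketch of a linear chain A -> B -> C integrated by backward Euler. The paper's actual equations are in its methods section and are not reproduced in this abstract, so the first-order kinetics, the strictly linear chain topology, the function name, and all rate values below are assumptions chosen only to show why an implicit step stays stable when rates differ by orders of magnitude.

```python
def implicit_euler_chain(x0, rates, dt, n_steps):
    """Backward Euler for an assumed first-order chain reaction.

    Model (assumption): dx_0/dt = -k_0 * x_0
                        dx_i/dt = k_{i-1} * x_{i-1} - k_i * x_i
    x0    : initial expression level of each species along the chain
    rates : rates[i] is the rate at which species i feeds species i+1
            (use 0.0 for the last species so that it only accumulates)
    """
    x = list(x0)
    for _ in range(n_steps):
        new_x = []
        inflow = 0.0  # k_{i-1} * x_{i-1} evaluated at the new time level
        for xi, ki in zip(x, rates):
            # Implicit update: (1 + dt*k_i) * x_i_new = x_i_old + dt * inflow.
            # The system is lower triangular, so forward substitution suffices.
            xi_new = (xi + dt * inflow) / (1.0 + dt * ki)
            new_x.append(xi_new)
            inflow = ki * xi_new
        x = new_x
    return x


if __name__ == "__main__":
    # Hypothetical rates that differ by two orders of magnitude.
    print(implicit_euler_chain([1.0, 0.0, 0.0], [50.0, 0.5, 0.0], dt=1.0, n_steps=10))
```

With these rates an explicit Euler step of the same size would overshoot and produce negative values, whereas the implicit update keeps every level non-negative and preserves the total.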
Conclusions
In the present study, we have presented (I) experimental studies of the effect of nutrient diets on the gene expression changes in a selected estrogen synthesis pathway. This experiment is valuable because it allows us to examine how the nutrient-containing diets regulate gene expression in the estrogen synthesis pathway during puberty; (II) global tests to assess an overall association of this particular pathway with the time factor by utilizing generalized linear models to analyze microarray data; and (III) a chain reaction model to simulate the pathway. This is a novel application because we are able to translate the gene pathway into the chemical reactions in which each reaction channel describes a gene-gene relationship in the pathway. In the chain reaction model, the implicit scheme is employed to efficiently solve the differential equations. Data analysis results show the proposed model is capable of predicting gene expression changes and demonstrating the effect of nutrient-containing diets on gene expression changes in the pathway.
One of the objectives of this study is to explore and develop a numerical approach for simulating the gene expression change so that it can be applied and calibrated when the data of more time slices are available, and thus can be used to interpolate the expression change at a desired time point without conducting expensive experiments for a large number of time points. Hence, we are not claiming this is either essential or the most efficient way for simulating this problem, but rather a mathematical/numerical approach that can model the expression change of a large set of genes of a complex pathway. In addition, we understand the limitation of this experiment and realize that it is still far from being a complete model of predicting nutrient-gene interactions. The reason is that in the present model, the reaction rates were estimated based on available data at two time points; hence, the gene expression change is dependent upon the reaction rates and a linear function of the gene expressions. More data sets containing gene expression at various time slices are needed in order to improve the present model so that a non-linear variation of gene expression changes at different times can be predicted.
PMCID: PMC3431024  PMID: 22943042
Chain reaction; gene expression change; estrogen synthesis pathway; nutrient diets
