Fueled by widespread applications of high-throughput next generation sequencing (NGS) technologies and urgent need to counter threats of pathogenic viruses, large-scale studies were conducted recently to investigate virus integration in host genomes (for example, human tumor genomes) that may cause carcinogenesis or other diseases. A limiting factor in these studies, however, is rapid virus evolution and resulting polymorphisms, which prevent reads from aligning readily to commonly used virus reference genomes, and, accordingly, make virus integration sites difficult to detect. Another confounding factor is host genomic instability as a result of virus insertions. To tackle these challenges and improve our capability to identify cryptic virus-host fusions, we present a new approach that detects Virus intEgration sites through iterative Reference SEquence customization (VERSE). To the best of our knowledge, VERSE is the first approach to improve detection through customizing reference genomes. Using 19 human tumors and cancer cell lines as test data, we demonstrated that VERSE substantially enhanced the sensitivity of virus integration site detection. VERSE is implemented in the open source package VirusFinder 2 that is available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-015-0126-6) contains supplementary material, which is available to authorized users.
Recent genome-wide association studies have identified many promising schizophrenia candidate genes and demonstrated that common polygenic variation contributes to schizophrenia risk. However, whether these genes represent perturbations to a common but limited set of underlying molecular processes (pathways) that modulate risk to schizophrenia remains elusive, and it is not known whether these genes converge on common biological pathways (networks) or represent different pathways. In addition, the theoretical and genetic mechanisms underlying the strong genetic heterogeneity of schizophrenia remain largely unknown. Using 4 well-defined data sets that contain top schizophrenia susceptibility genes and applying protein-protein interaction (PPI) network analysis, we investigated the interactions among proteins encoded by top schizophrenia susceptibility genes. We found proteins encoded by top schizophrenia susceptibility genes formed a highly significant interconnected network, and, compared with random networks, these PPI networks are statistically highly significant for both direct connectivity and indirect connectivity. We further validated these results using empirical functional data (transcriptome data from a clinical sample). These highly significant findings indicate that top schizophrenia susceptibility genes encode proteins that significantly directly interacted and formed a densely interconnected network, suggesting perturbations of common underlying molecular processes or pathways that modulate risk to schizophrenia. Our findings that schizophrenia susceptibility genes encode a highly interconnected protein network may also provide a novel explanation for the observed genetic heterogeneity of schizophrenia, ie, mutation in any member of this molecular network will lead to same functional consequences that eventually contribute to risk of schizophrenia.
genome-wide association study; schizophrenia susceptibility genes; protein-protein interaction; common molecular networks; genetic heterogeneity; enrichment
Prostate cancer (PrCa) is the most commonly diagnosed cancer in men in the world. Despite the fact that a large number of its genes have been investigated, its etiology remains poorly understood. Furthermore, most PrCa candidate genes have not been rigorously replicated, and the methods by which they biologically function in PrCa remain largely unknown.
Aiming to identify key players in the complex prostate cancer system, we reconstructed PrCa co-expressed modules within functional gene sets defined by the Gene Ontology (GO) annotation (biological process, GO_BP). We primarily identified 118 GO_BP terms that were well-preserved between two independent gene expression datasets and a consequent 55 conserved co-expression modules within them. Five modules were then found to be significantly enriched with PrCa candidate genes collected from expression Quantitative Trait Loci (eQTL), somatic copy number alteration (SCNA), somatic mutation data, or prognostic analyses. Specifically, two transcription factors (TFs) (NFAT and SP1) and three microRNAs (hsa-miR-19a, hsa-miR-15a, and hsa-miR-200b) regulating these five candidate modules were found to be critical to the development of PrCa.
Collectively, our results indicated that genes with similar functions may play important roles in disease through co-expression, and modules with different functions could be regulated by similar genetic components, such as TFs and microRNAs, in a synergistic manner.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-1015) contains supplementary material, which is available to authorized users.
Prostate cancer; Co-expression; Gene Ontology; Module; Transcription factor; MicroRNA
Many cancer genes form mutation hotspots that disrupt their functional domains or active sites, leading to gain- or loss-of-function. We propose a mutation set enrichment analysis (MSEA) implemented by two novel methods, MSEA-clust and MSEA-domain, to predict cancer genes based on mutation hotspot patterns. MSEA methods are evaluated by both simulated and real cancer data. We find approximately 51% of the eligible known cancer genes form detectable mutation hotspots. Application of MSEA in eight cancers reveals a total of 82 genes with mutation hotspots, including well-studied cancer genes, known cancer genes re-found in new cancer types, and novel cancer genes.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0489-9) contains supplementary material, which is available to authorized users.
HIV sensory neuropathy and distal neuropathic pain (DNP) are common, disabling complications associated with combination antiretroviral therapy (cART). We previously associated iron-regulatory genetic polymorphisms with a reduced risk of HIV sensory neuropathy during more neurotoxic types of cART. We here evaluated the impact of polymorphisms in 19 iron-regulatory genes on DNP in 560 HIV-infected subjects from a prospective, observational study, who underwent neurological examinations to ascertain peripheral neuropathy and structured interviews to ascertain DNP. Genotype-DNP associations were explored by logistic regression and permutation-based analytical methods. Among 559 evaluable subjects, 331 (59%) developed HIV-SN, and 168 (30%) reported DNP. Fifteen polymorphisms in 8 genes (p<0.05) and 5 variants in 4 genes (p<0.01) were nominally associated with DNP: polymorphisms in TF, TFRC, BMP6, ACO1, SLC11A2, and FXN conferred reduced risk (adjusted odds ratios [ORs] ranging from 0.2 to 0.7, all p<0.05); other variants in TF, CP, ACO1, BMP6, and B2M conferred increased risk (ORs ranging from 1.3 to 3.1, all p<0.05). Risks associated with some variants were statistically significant either in black or white subgroups but were consistent in direction. ACO1 rs2026739 remained significantly associated with DNP in whites (permutation p<0.0001) after correction for multiple tests. Several of the same iron-regulatory-gene polymorphisms, including ACO1 rs2026739, were also associated with severity of DNP (all p<0.05). Common polymorphisms in iron-management genes are associated with DNP and with DNP severity in HIV-infected persons receiving cART. Consistent risk estimates across population subgroups and persistence of the ACO1 rs2026739 association after adjustment for multiple testing suggest that genetic variation in iron-regulation and transport modulates susceptibility to DNP.
While genome-wide association studies (GWAS) have revealed thousands of disease risk single nucleotide polymorphisms (SNPs), their functions remain largely unknown. Recent studies have suggested the regulatory roles of GWAS risk variants in several common diseases; however, the complex regulatory structure in prostate cancer is unclear.
We investigated the potential regulatory roles of risk variants in two prostate cancer GWAS datasets by their interactions with expression quantitative trait loci (eQTL) and/or transcription factor binding sites (TFBSs) in three populations.
Our results indicated that the moderately associated GWAS SNPs were significantly enriched with cis-eQTLs and TFBSs in Caucasians (CEU), but not in African Americans (AA) or Japanese (JPT); this was also observed in an independent pan-cancer related SNPs from the GWAS Catalog. We found that the eQTL enrichment in the CEU population was tissue-specific to eQTLs from CEU lymphoblastoid cell lines. Importantly, we pinpointed two SNPs, rs2861405 and rs4766642, by overlapping results from cis-eQTL and TFBS as applied to the CEU data.
These results suggested that prostate cancer associated SNPs and pan-cancer associated SNPs are likely to play regulatory roles in CEU. However, the negative enrichment results in AA or JPT and the potential mechanisms remain to be elucidated in additional samples.
prostate cancer; genome-wide association studies; eQTL; TFBS; regulatory variants
We conducted data-mining analyses of genome wide association (GWA) studies of the CATIE and MGS-GAIN datasets, and found 13 markers in the two physically linked genes, PTPN21 and EML5, showing nominally significant association with schizophrenia. Linkage disequilibrium (LD) analysis indicated that all 7 markers from PTPN21 shared high LD (r2>0.8), including rs2274736 and rs2401751, the two non-synonymous markers with the most significant association signals (rs2401751, P=1.10×10−3 and rs2274736, P=1.21×10−3). In a meta-analysis of all 13 replication datasets with a total of 13,940 subjects, we found that the two non-synonymous markers are significantly associated with schizophrenia (rs2274736, OR=0.92, 95% CI: 0.86–0.97, P=5.45×10−3 and rs2401751, OR = 0.92, 95% CI: 0.86–0.97, P=5.29×10−3). One SNP (rs7147796) in EML5 is also significantly associated with the disease (OR = 1.08, 95% CI: 1.02-1.14, P=6.43×10−3). These 3 markers remain significant after Bonferroni correction. Furthermore, haplotype conditioned analyses indicated that the association signals observed between rs2274736/rs2401751 and rs7147796 are statistically independent. Given the results that 2 non-synonymous markers in PTPN21 are associated with schizophrenia, further investigation of this locus is warranted.
Data-mining; Informatic prioritization; Genetic association study; PTPN21; Non-synonymous SNP
Gene fusions are important genomic events in human cancer because their fusion gene products can drive the development of cancer and thus are potential prognostic tools or therapeutic targets in anti-cancer treatment. Major advancements have been made in computational approaches for fusion gene discovery over the past 3 years due to improvements and widespread applications of high-throughput next generation sequencing (NGS) technologies. To identify fusions from NGS data, existing methods typically leverage the strengths of both sequencing technologies and computational strategies. In this article, we review the NGS and computational features of existing methods for fusion gene detection and suggest directions for future development.
gene fusion; next generation sequencing; cancer; whole genome sequencing; transcriptome sequencing; computational tools
MicroRNAs (miRNAs) play critical roles in tumor development and progression. The fact that a single miRNA can regulate hundreds of genes places miRNAs at critical hubs of signaling pathways. In this study, we investigated the miRNA expression profile in gastric adenocarcinomas and compared it to esophageal adenocarcinomas to better identify a unique miRNA signature of gastric adenocarcinoma.
Methods and Results
The miRNA expression profile was obtained using Agilent and Exiqon microarray platforms on primary gastric adenocarcinoma tissue samples. The cross comparison of results identified 17 up-regulated and 12 down-regulated miRNAs that overlapped in both platforms. Quantitative real-time RT-PCR was performed for independent validation of a representative set of 8 miRNAs in gastric and esophageal adenocarcinomas as compared to normal gastric mucosa or esophageal mucosa, respectively. The de-regulation of miR-146b-5p, -375, -148a, -31, and -451 was significantly associated with gastric adenocarcinomas. On the other hand, de-regulation of miR-21 (up-regulation) and miR-133b (down-regulation) was detectable in both gastric and esophageal adenocarcinomas. Interestingly, miR-200a was significantly down-regulated in gastric adenocarcinoma (p=0.04) but up-regulated in esophageal adenocarcinoma samples (p=0.001). In addition, the expression level of miR-146b-5p displayed a strong correlation with the tumor staging of gastric cancer.
Gastric adenocarcinoma displays a unique miRNA signature that distinguishes it from esophageal adenocarcinoma. This specific signature could reflect differences in the etiology and/or molecular signaling in these two closely related cancers. Our findings suggest important miRNA candidates that can be investigated for their molecular functions and possible diagnostic, prognostic, and therapeutic role in gastric adenocarcinoma.
miRNA; esophageal adenocarcinoma; gastric adenocarcinoma; microarray; prognosis
The human kinome is gaining importance through its promising cancer therapeutic targets, yet no general model to address the kinase inhibitor resistance has emerged. Here, we constructed a systems biology-based framework to catalogue the human kinome, including 538 kinase genes, in the broader context of the human interactome. Specifically, we constructed three networks: a kinase-substrate interaction network containing 7,346 pairs connecting 379 kinases to 36,576 phosphorylation sites in 1,961 substrates, a protein-protein interaction network (PPIN) containing 92,699 pairs, and an atomic resolution PPIN containing 4,278 pairs. We identified the conserved regulatory phosphorylation motifs (e.g., Ser/Thr-Pro) using a sequence logo analysis. We found the typical anticancer target selection strategy that uses network hubs as drug targets, might lead to a high adverse drug reaction risk. Furthermore, we found the distinct network centrality of kinases creates a high anticancer drug resistance risk by feedback or crosstalk mechanisms within cellular networks. This notion is supported by the systematic network and pathway analyses that anticancer drug resistance genes are significantly enriched as hubs and heavily participate in multiple signaling pathways. Collectively, this comprehensive human kinome interactome map sheds light on anticancer drug resistance mechanisms and provides an innovative resource for rational kinase inhibitor design.
Kinome; kinase-substrate interaction; phosphorylation; interactome; resistance; systems biology
Cancer genomes harbor hundreds to thousands of somatic nonsynonymous mutations. DNA damage and deficiency of DNA repair systems are two major forces to cause somatic mutations, marking cancer genomes with specific somatic mutation patterns. Recently, several pan-cancer genome studies revealed more than 20 mutation signatures across multiple cancer types. However, detailed cancer-type specific mutation signatures and their different features within (intra-) and between (inter-) cancer types remain largely unexplored.
We employed a matrix decomposition algorithm, namely Non-negative Matrix Factorization, to survey the somatic mutations in nine major human cancers, involving a total of ~2100 genomes.
Our results revealed 3-5 independent mutational signatures in each cancer, implying that a range of 3-5 predominant mutational processes likely underlie each cancer genome. Both mutagen exposure (tobacco and sun) and changes in DNA repair systems (APOBEC family, POLE, and MLH1) were found as mutagenesis forces, each of which marks the genome with an evident mutational signature. We studied the features of several signatures and their combinatory patterns within and across cancers. On one hand, we found each signature may influence a cancer genome with different influential magnitudes even in the same cancer type and the signature-specific load reflects intra-cancer heterogeneity (e.g., the smoking-related signature in lung cancer smokers and never smokers). On the other hand, inter-cancer heterogeneity is characterized by combinatory patterns of mutational signatures, where no cancers share the same signature profile, even between two lung cancer subtypes (lung adenocarcinoma and squamous cell lung cancer).
Our work provides a detailed overview of the mutational characteristics in each of nine major cancers and highlights that the mutational signature profile is representative of each cancer.
Somatic mutation; Cancer; Kataegis; Mutation signature; Mutagen; Heterogeneity
A major challenge in interpreting the large volume of mutation data identified by next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations to facilitate the identification of targetable genes and new drugs. Current approaches are primarily based on mutation frequencies of single-genes, which lack the power to detect infrequently mutated driver genes and ignore functional interconnection and regulation among cancer genes. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. VarWalker fits generalized additive models for each sample based on sample-specific mutation profiles and builds on the joint frequency of both mutation genes and their close interactors. These interactors are selected and optimized using the Random Walk with Restart algorithm in a protein-protein interaction network. We applied the method in >300 tumor genomes in two large-scale NGS benchmark datasets: 183 lung adenocarcinoma samples and 121 melanoma samples. In each cancer, we derived a consensus mutation subnetwork containing significantly enriched consensus cancer genes and cancer-related functional pathways. These cancer-specific mutation networks were then validated using independent datasets for each cancer. Importantly, VarWalker prioritizes well-known, infrequently mutated genes, which are shown to interact with highly recurrently mutated genes yet have been ignored by conventional single-gene-based approaches. Utilizing VarWalker, we demonstrated that network-assisted approaches can be effectively adapted to facilitate the detection of cancer driver genes in NGS data.
A cancer genome typically harbors both driver mutations, which contribute to tumorigenesis, and passenger mutations, which tend to be neutral and occur randomly. Cancer genomes differ dramatically due to genetic and environmental factors. A major challenge in interpreting the large volume of mutation data identified in cancer genomes using next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. Applying our approach in a large cohort of lung adenocarcinoma samples and melanoma samples, we derived a consensus mutation subnetwork for each cancer containing significantly enriched cancer genes and cancer-related functional pathways. Our results indicated that driver genes occur within a broad spectrum of frequency, interact with each other, and converge in several key pathways that play critical roles in tumorigenesis.
Recent studies have demonstrated that gene set analysis, which tests disease association with genetic variants in a group of functionally related genes, is a promising approach for analyzing and interpreting genome-wide association studies (GWAS) data. These approaches aim to increase power by combining association signals from multiple genes in the same gene set. In addition, gene set analysis can also shed more light on the biological processes underlying complex diseases. However, current approaches for gene set analysis are still in an early stage of development in that analysis results are often prone to sources of bias, including gene set size and gene length, linkage disequilibrium patterns and the presence of overlapping genes. In this paper, we provide an in-depth review of the gene set analysis procedures, along with parameter choices and the particular methodology challenges at each stage. In addition to providing a survey of recently developed tools, we also classify the analysis methods into larger categories and discuss their strengths and limitations. In the last section, we outline several important areas for improving the analytical strategies in gene set analysis.
Genome-wide association study; Gene set; Pathway; Gene-set enrichment analysis; Statistical significance; Complex disease
A number of genetic studies have suggested numerous susceptibility genes for dental caries over the past decade with few definite conclusions. The rapid accumulation of relevant information, along with the complex architecture of the disease, provides a challenging but also unique opportunity to review and integrate the heterogeneous data for follow-up validation and exploration. In this study, we collected and curated candidate genes from four major categories: association studies, linkage scans, gene expression analyses, and literature mining. Candidate genes were prioritized according to the magnitude of evidence related to dental caries. We then searched for dense modules enriched with the prioritized candidate genes through their protein-protein interactions (PPIs). We identified 23 modules comprising of 53 genes. Functional analyses of these 53 genes revealed three major clusters: cytokine network relevant genes, matrix metalloproteinases (MMPs) family, and transforming growth factor-beta (TGF-β) family, all of which have been previously implicated to play important roles in tooth development and carious lesions. Through our extensive data collection and an integrative application of gene prioritization and PPI network analyses, we built a dental caries-specific sub-network for the first time. Our study provided insights into the molecular mechanisms underlying dental caries. The framework we proposed in this work can be applied to other complex diseases.
Driven by high throughput next generation sequencing technologies and the pressing need to decipher cancer genomes, computational approaches for detecting somatic single nucleotide variants (sSNVs) have undergone dramatic improvements during the past 2 years. The recently developed tools typically compare a tumor sample directly with a matched normal sample at each variant locus in order to increase the accuracy of sSNV calling. These programs also address the detection of sSNVs at low allele frequencies, allowing for the study of tumor heterogeneity, cancer subclones, and mutation evolution in cancer development.
We used whole genome sequencing (Illumina Genome Analyzer IIx platform) of a melanoma sample and matched blood, whole exome sequencing (Illumina HiSeq 2000 platform) of 18 lung tumor-normal pairs and seven lung cancer cell lines to evaluate six tools for sSNV detection: EBCall, JointSNVMix, MuTect, SomaticSniper, Strelka, and VarScan 2, with a focus on MuTect and VarScan 2, two widely used publicly available software tools. Default/suggested parameters were used to run these tools. The missense sSNVs detected in these samples were validated through PCR and direct sequencing of genomic DNA from the samples. We also simulated 10 tumor-normal pairs to explore the ability of these programs to detect low allelic-frequency sSNVs.
Out of the 237 sSNVs successfully validated in our cancer samples, VarScan 2 and MuTect detected the most of any tools (that is, 204 and 192, respectively). MuTect identified 11 more low-coverage validated sSNVs than VarScan 2, but missed 11 more sSNVs with alternate alleles in normal samples than VarScan 2. When examining the false calls of each tool using 169 invalidated sSNVs, we observed >63% false calls detected in the lung cancer cell lines had alternate alleles in normal samples. Additionally, from our simulation data, VarScan 2 identified more sSNVs than other tools, while MuTect characterized most low allelic-fraction sSNVs.
Our study explored the typical false-positive and false-negative detections that arise from the use of sSNV-calling tools. Our results suggest that despite recent progress, these tools have significant room for improvement, especially in the discrimination of low coverage/allelic-frequency sSNVs and sSNVs with alternate alleles in normal samples.
Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) or genotyping arrays have been standard technologies to detect large regions subject to copy number changes in genomes until most recently high-resolution sequence data can be analyzed by next-generation sequencing (NGS). During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development.
Kinase inhibitors are accepted treatment for metastatic melanomas that harbor specific driver mutations in BRAF or KIT, but only 40–50% of cases are positive. To uncover other potential targetable mutations, we performed whole-genome sequencing of a highly aggressive BRAF (V600) and KIT (W557, V559, L576, K642, D816) wildtype melanoma. Surprisingly, we found a somatic BRAF L597R mutation in exon 15. Analysis of BRAF exon 15 in 49 tumors negative for BRAF V600 mutations as well as driver mutations in KIT, NRAS, GNAQ, and GNA11, showed that 2 (4%) harbored L597 mutations and another 2 involved BRAF D594 and K601 mutations. In vitro signaling induced by L597R/S/Q mutants was suppressed by MEK inhibition. A patient with BRAF L597S mutant metastatic melanoma responded significantly to treatment with the MEK inhibitor, TAK-733. Collectively, these data demonstrate clinical significance to BRAF L597 mutations in melanoma.
melanoma; BRAF L597; whole genome sequencing; BRAF inhibitor; MEK inhibitor; TAK-733
Gene set-based analysis of genome-wide association study (GWAS) data has recently emerged as a useful approach to examine the joint effects of multiple risk loci in complex human diseases or phenotypes. Dental caries is a common, chronic, and complex disease leading to a decrease in quality of life worldwide. In this study, we applied the approaches of gene set enrichment analysis to a major dental caries GWAS dataset, which consists of 537 cases and 605 controls. Using four complementary gene set analysis methods, we analyzed 1331 Gene Ontology (GO) terms collected from the Molecular Signatures Database (MSigDB). Setting false discovery rate (FDR) threshold as 0.05, we identified 13 significantly associated GO terms. Additionally, 17 terms were further included as marginally associated because they were top ranked by each method, although their FDR is higher than 0.05. In total, we identified 30 promising GO terms, including ‘Sphingoid metabolic process,’ ‘Ubiquitin protein ligase activity,’ ‘Regulation of cytokine secretion,’ and ‘Ceramide metabolic process.’ These GO terms encompass broad functions that potentially interact and contribute to the oral immune response related to caries development, which have not been reported in the standard single marker based analysis. Collectively, our gene set enrichment analysis provided complementary insights into the molecular mechanisms and polygenic interactions in dental caries, revealing promising association signals that could not be detected through single marker analysis of GWAS data.
Integrating evidence from multiple domains is useful in prioritizing disease candidate genes for subsequent testing. We ranked all known human genes (n = 3819) under linkage peaks in the Irish Study of High-Density Schizophrenia Families using three different evidence domains: 1) a meta-analysis of microarray gene expression results using the Stanley Brain collection, 2) a schizophrenia protein-protein interaction network, and 3) a systematic literature search. Each gene was assigned a domain-specific p-value and ranked after evaluating the evidence within each domain. For comparison to this ranking process, a large-scale candidate gene hypothesis was also tested by including genes with Gene Ontology terms related to neurodevelopment. Subsequently, genotypes of 3725 SNPs in 167 genes from a custom Illumina iSelect array were used to evaluate the top ranked vs. hypothesis selected genes. Seventy-three genes were both highly ranked and involved in neurodevelopment (category 1) while 42 and 52 genes were exclusive to neurodevelopment (category 2) or highly ranked (category 3), respectively. The most significant associations were observed in genes PRKG1, PRKCE, and CNTN4 but no individual SNPs were significant after correction for multiple testing. Comparison of the approaches showed an excess of significant tests using the hypothesis-driven neurodevelopment category. Random selection of similar sized genes from two independent genome-wide association studies (GWAS) of schizophrenia showed the excess was unlikely by chance. In a further meta-analysis of three GWAS datasets, four candidate SNPs reached nominal significance. Although gene ranking using integrated sources of prior information did not enrich for significant results in the current experiment, gene selection using an a priori hypothesis (neurodevelopment) was superior to random selection. As such, further development of gene ranking strategies using more carefully selected sources of information is warranted.
Background and Methods
Esophageal adenocarcinoma (EAC) is characterized by a steep rise in incidence rates in the Western population. The unique miRNA signature that distinguishes EAC from other upper gastrointestinal cancers remains unclear. Herein, we performed a comprehensive microarray profiling for the specific miRNA signature associated with EAC. We validated this signature by qRT-PCR.
Microarray analysis showed that 21 miRNAs were consistently deregulated in EAC. miR-194, miR-192, miR-200a, miR-21, miR-203, miR-205, miR-133b, and miR-31 were selected for validation using 46 normal squamous (NS), 23 Barrett’s esophagus (BE), 17 Barrett’s high grade dysplasia (HGD), 34 EAC, 33 gastric adenocarcinoma (GC), and 45 normal gastric (NG) tissues. The qRT-PCR analysis indicated that 2 miRNAs (miR-21 and miR-133b) were deregulated in both EAC and GC, and 6 miRNAs (up-regulated: miR-194, miR-31, miR-192, and miR-200a; down-regulated: miR-203 and miR-205) in EAC, as compared to BE but not in GC, indicating their potential unique role in EAC. Our data showed that miR-194, miR-192, miR-21, and miR-31 were up-regulated in BE adjacent to HGD lesions relative to isolated BE samples. Analysis of clinicopathological features indicated that down-regulation of miR-203 is significantly associated with progression and tumor stages in EAC. Interestingly, the overexpression levels of miR-194, miR-200a, and miR-192 were significantly higher in early EAC stages, suggesting that these miRNAs may be involved in EAC tumor development rather than progression.
Our findings demonstrate the presence of a unique miRNA signature for EAC. This may provide some clues for the distinct molecular features of EAC to be considered in future studies of the role of miRNAs in EAC and their utility as disease biomarkers.
Next generation sequencing (NGS) technologies allow us to explore virus interactions with host genomes that lead to carcinogenesis or other diseases; however, this effort is largely hindered by the dearth of efficient computational tools. Here, we present a new tool, VirusFinder, for the identification of viruses and their integration sites in host genomes using NGS data, including whole transcriptome sequencing (RNA-Seq), whole genome sequencing (WGS), and targeted sequencing data. VirusFinder’s unique features include the characterization of insertion loci of virus of arbitrary type in the host genome and high accuracy and computational efficiency as a result of its well-designed pipeline. The source code as well as additional data of VirusFinder is publicly available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.
Pathway analysis of large-scale omics data assists us with the examination of the cumulative effects of multiple functionally related genes, which are difficult to detect using the traditional single gene/marker analysis. So far, most of the genomic studies have been conducted in a single domain, e.g., by genome-wide association studies (GWAS) or microarray gene expression investigation. A combined analysis of disease susceptibility genes across multiple platforms at the pathway level is an urgent need because it can reveal more reliable and more biologically important information.
We performed an integrative pathway analysis of a GWAS dataset and a microarray gene expression dataset in prostate cancer. We obtained a comprehensive pathway annotation set from knowledge-based public resources, including KEGG pathways and the prostate cancer candidate gene set, and gene sets specifically defined based on cross-platform information. By leveraging on this pathway collection, we first searched for significant pathways in the GWAS dataset using four methods, which represent two broad groups of pathway analysis approaches. The significant pathways identified by each method varied greatly, but the results were more consistent within each method group than between groups. Next, we conducted a gene set enrichment analysis of the microarray gene expression data and found 13 pathways with cross-platform evidence, including "Fc gamma R-mediated phagocytosis" (PGWAS = 0.003, Pexpr < 0.001, and Pcombined = 6.18 × 10-8), "regulation of actin cytoskeleton" (PGWAS = 0.003, Pexpr = 0.009, and Pcombined = 3.34 × 10-4), and "Jak-STAT signaling pathway" (PGWAS = 0.001, Pexpr = 0.084, and Pcombined = 8.79 × 10-4).
Our results provide evidence at both the genetic variation and expression levels that several key pathways might have been involved in the pathological development of prostate cancer. Our framework that employs gene expression data to facilitate pathway analysis of GWAS data is not only feasible but also much needed in studying complex disease.
A variety of species and experimental designs have been used to study genetic influences on alcohol dependence, ethanol response, and related traits. Integration of these heterogeneous data can be used to produce a ranked target gene list for additional investigation.
In this study, we performed a unique multi-species evidence-based data integration using three microarray experiments in mice or humans that generated an initial alcohol dependence (AD) related genes list, human linkage and association results, and gene sets implicated in C. elegans and Drosophila. We then used permutation and false discovery rate (FDR) analyses on the genome-wide association studies (GWAS) dataset from the Collaborative Study on the Genetics of Alcoholism (COGA) to evaluate the ranking results and weighting matrices. We found one weighting score matrix could increase FDR based q-values for a list of 47 genes with a score greater than 2. Our follow up functional enrichment tests revealed these genes were primarily involved in brain responses to ethanol and neural adaptations occurring with alcoholism.
These results, along with our experimental validation of specific genes in mice, C. elegans and Drosophila, suggest that a cross-species evidence-based approach is useful to identify candidate genes contributing to alcoholism.
Genome-wide association studies (GWAS) have generated a wealth of valuable genotyping data for complex diseases/traits. A large proportion of these data are embedded with many weakly associated markers that have been missed in traditional single marker analyses, but they may provide valuable insights in dissecting the genetic components of diseases. Gene set analysis (GSA) augmented by protein-protein interaction network data provides a promising way to examine GWAS data by analyzing the combined effects of multiple genes/markers, each of which may have only individually weak to moderate association effects. A critical issue in GSA of GWAS data is the definition of gene-wise P values based on multiple SNPs mapped to a gene.
In this study, we proposed an alternative restricted search approach based on our previously developed dense module search algorithm, and we demonstrated it in the CATIE GWAS dataset for schizophrenia. Specifically, we explored three ways of computing gene-wise P values and examined their effects on the resultant module genes. These methods calculate gene-wise P values based on all the SNPs, the top ranked SNPs, or the most significant SNP among all the SNPs mapped to a gene. We applied the restricted search approach and identified a module gene set for each of the gene-wise P value data set. In our evaluation using an independent method, ALIGATOR, we showed that although each of these input datasets generated a unique set of module genes, all of them were significant in the GWAS dataset. Further functional enrichment analysis of these module genes showed that at the pathway level, they were all consistently related to neuro- and immune-related pathways. Finally, we compared our method with a previously reported method.
Our results showed that the approaches to computing gene-wise P values in GWAS data are critical in GSA. This work is useful for evaluating key factors in GSA of GWAS data.
QT prolongation is associated with increased risk of cardiac arrhythmias. Identifying the genetic variants that mediate antipsychotic induced prolongation may help to minimize this risk, which might prevent the removal of efficacious drugs from the market. We performed candidate gene analysis and five drug specific genome-wide association studies (GWAS) with 492K SNPs to search for genetic variation mediating antipsychotic induced QT prolongation in 738 schizophrenia patients from the Clinical Antipsychotic Trial of Intervention Effectiveness (CATIE) study.
Our candidate gene study suggests the involvement of NOS1AP and NUBPL (p-values =1.45×10−05 and 2.66×10−13, respectively). Furthermore, our top GWAS hit achieving genome-wide significance, defined as a q-value <0.10, (p-value =1.54×10−7, q-value =0.07), located in SLC22A23, mediated the effects of quetiapine on prolongation. SLC22A23 belongs to a family of organic ion transporters that shuttle a variety of compounds including drugs, environmental toxins, and endogenous metabolites across the cell membrane. This gene is expressed in the heart and is integral in mouse heart development. The genes mediating antipsychotic induced QT prolongation partially overlap with the genes affecting normal QT interval variation. However, some genes may also be unique for drug induced prolongation. This study demonstrates the potential of GWAS to discover genes and pathways that mediate antipsychotic induced QT prolongation.
candidate gene analysis; genome-wide association study; schizophrenia; adverse effects; CATIE