Results 1-25 (48)
author:("Jia, pailin")
1.  VERSE: a novel approach to detect virus integration in host genomes through reference genome customization 
Genome Medicine  2015;7(1):2.
Fueled by widespread applications of high-throughput next generation sequencing (NGS) technologies and urgent need to counter threats of pathogenic viruses, large-scale studies were conducted recently to investigate virus integration in host genomes (for example, human tumor genomes) that may cause carcinogenesis or other diseases. A limiting factor in these studies, however, is rapid virus evolution and resulting polymorphisms, which prevent reads from aligning readily to commonly used virus reference genomes, and, accordingly, make virus integration sites difficult to detect. Another confounding factor is host genomic instability as a result of virus insertions. To tackle these challenges and improve our capability to identify cryptic virus-host fusions, we present a new approach that detects Virus intEgration sites through iterative Reference SEquence customization (VERSE). To the best of our knowledge, VERSE is the first approach to improve detection through customizing reference genomes. Using 19 human tumors and cancer cell lines as test data, we demonstrated that VERSE substantially enhanced the sensitivity of virus integration site detection. VERSE is implemented in the open source package VirusFinder 2 that is available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-015-0126-6) contains supplementary material, which is available to authorized users.
doi:10.1186/s13073-015-0126-6
PMCID: PMC4333248
2.  Protein-Protein Interaction and Pathway Analyses of Top Schizophrenia Genes Reveal Schizophrenia Susceptibility Genes Converge on Common Molecular Networks and Enrichment of Nucleosome (Chromatin) Assembly Genes in Schizophrenia Susceptibility Loci 
Schizophrenia Bulletin  2013;40(1):39-49.
Recent genome-wide association studies have identified many promising schizophrenia candidate genes and demonstrated that common polygenic variation contributes to schizophrenia risk. However, whether these genes represent perturbations to a common but limited set of underlying molecular processes (pathways) that modulate risk to schizophrenia remains elusive, and it is not known whether these genes converge on common biological pathways (networks) or represent different pathways. In addition, the theoretical and genetic mechanisms underlying the strong genetic heterogeneity of schizophrenia remain largely unknown. Using 4 well-defined data sets that contain top schizophrenia susceptibility genes and applying protein-protein interaction (PPI) network analysis, we investigated the interactions among proteins encoded by top schizophrenia susceptibility genes. We found that proteins encoded by top schizophrenia susceptibility genes formed a highly significant interconnected network, and, compared with random networks, these PPI networks are statistically highly significant for both direct connectivity and indirect connectivity. We further validated these results using empirical functional data (transcriptome data from a clinical sample). These highly significant findings indicate that top schizophrenia susceptibility genes encode proteins that interact directly to a significant degree and form a densely interconnected network, suggesting perturbations of common underlying molecular processes or pathways that modulate risk to schizophrenia. Our findings that schizophrenia susceptibility genes encode a highly interconnected protein network may also provide a novel explanation for the observed genetic heterogeneity of schizophrenia, i.e., mutation in any member of this molecular network will lead to the same functional consequences that eventually contribute to risk of schizophrenia.
doi:10.1093/schbul/sbt066
PMCID: PMC3885298  PMID: 23671194
genome-wide association study; schizophrenia susceptibility genes; protein-protein interaction; common molecular networks; genetic heterogeneity; enrichment
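The comparison against random networks described in the abstract above can be illustrated with a small permutation test. This is a minimal sketch, not the authors' pipeline: the toy interaction list, the gene labels, and the helper names `count_direct_edges` and `permutation_p_value` are all hypothetical, and random gene sets are drawn uniformly, without the degree matching that a real analysis would probably require.

```python
import random


def count_direct_edges(genes, edges):
    """Count interactions whose two endpoints both lie in `genes`."""
    gene_set = set(genes)
    return sum(1 for a, b in edges if a in gene_set and b in gene_set)


def permutation_p_value(candidates, edges, n_perm=2000, seed=0):
    """Empirical p-value: share of random gene sets of the same size that
    are at least as densely connected as the candidate set."""
    rng = random.Random(seed)
    universe = sorted({g for edge in edges for g in edge})
    observed = count_direct_edges(candidates, edges)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(universe, len(candidates))
        if count_direct_edges(sample, edges) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy network: a tight clique A-B-C-D plus a sparse chain.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
             ("C", "D"), ("D", "E"), ("E", "F"), ("F", "G"), ("G", "H"),
             ("H", "I"), ("I", "J")]
observed, p_value = permutation_p_value(["A", "B", "C", "D"], toy_edges)
print(observed, p_value)
```

A small empirical p-value here only says that the candidate set has more internal edges than randomly chosen sets of the same size; indirect connectivity would need a separate statistic such as shortest-path length.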
3.  Key regulators in prostate cancer identified by co-expression module analysis 
BMC Genomics  2014;15(1):1015.
Background
Prostate cancer (PrCa) is the most commonly diagnosed cancer in men in the world. Despite the fact that a large number of its genes have been investigated, its etiology remains poorly understood. Furthermore, most PrCa candidate genes have not been rigorously replicated, and the methods by which they biologically function in PrCa remain largely unknown.
Results
Aiming to identify key players in the complex prostate cancer system, we reconstructed PrCa co-expressed modules within functional gene sets defined by the Gene Ontology (GO) annotation (biological process, GO_BP). We primarily identified 118 GO_BP terms that were well-preserved between two independent gene expression datasets and a consequent 55 conserved co-expression modules within them. Five modules were then found to be significantly enriched with PrCa candidate genes collected from expression Quantitative Trait Loci (eQTL), somatic copy number alteration (SCNA), somatic mutation data, or prognostic analyses. Specifically, two transcription factors (TFs) (NFAT and SP1) and three microRNAs (hsa-miR-19a, hsa-miR-15a, and hsa-miR-200b) regulating these five candidate modules were found to be critical to the development of PrCa.
Conclusions
Collectively, our results indicated that genes with similar functions may play important roles in disease through co-expression, and modules with different functions could be regulated by similar genetic components, such as TFs and microRNAs, in a synergistic manner.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-1015) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2164-15-1015
PMCID: PMC4258300  PMID: 25418933
Prostate cancer; Co-expression; Gene Ontology; Module; Transcription factor; MicroRNA
4.  MSEA: detection and quantification of mutation hotspots through mutation set enrichment analysis 
Genome Biology  2014;15(10):489.
Many cancer genes form mutation hotspots that disrupt their functional domains or active sites, leading to gain- or loss-of-function. We propose a mutation set enrichment analysis (MSEA) implemented by two novel methods, MSEA-clust and MSEA-domain, to predict cancer genes based on mutation hotspot patterns. MSEA methods are evaluated by both simulated and real cancer data. We find approximately 51% of the eligible known cancer genes form detectable mutation hotspots. Application of MSEA in eight cancers reveals a total of 82 genes with mutation hotspots, including well-studied cancer genes, known cancer genes re-found in new cancer types, and novel cancer genes.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0489-9) contains supplementary material, which is available to authorized users.
doi:10.1186/s13059-014-0489-9
PMCID: PMC4226881  PMID: 25348067
5.  Genetic Variation in Iron Metabolism Is Associated with Neuropathic Pain and Pain Severity in HIV-Infected Patients on Antiretroviral Therapy 
PLoS ONE  2014;9(8):e103123.
HIV sensory neuropathy and distal neuropathic pain (DNP) are common, disabling complications associated with combination antiretroviral therapy (cART). We previously associated iron-regulatory genetic polymorphisms with a reduced risk of HIV sensory neuropathy during more neurotoxic types of cART. We here evaluated the impact of polymorphisms in 19 iron-regulatory genes on DNP in 560 HIV-infected subjects from a prospective, observational study, who underwent neurological examinations to ascertain peripheral neuropathy and structured interviews to ascertain DNP. Genotype-DNP associations were explored by logistic regression and permutation-based analytical methods. Among 559 evaluable subjects, 331 (59%) developed HIV-SN, and 168 (30%) reported DNP. Fifteen polymorphisms in 8 genes (p<0.05) and 5 variants in 4 genes (p<0.01) were nominally associated with DNP: polymorphisms in TF, TFRC, BMP6, ACO1, SLC11A2, and FXN conferred reduced risk (adjusted odds ratios [ORs] ranging from 0.2 to 0.7, all p<0.05); other variants in TF, CP, ACO1, BMP6, and B2M conferred increased risk (ORs ranging from 1.3 to 3.1, all p<0.05). Risks associated with some variants were statistically significant either in black or white subgroups but were consistent in direction. ACO1 rs2026739 remained significantly associated with DNP in whites (permutation p<0.0001) after correction for multiple tests. Several of the same iron-regulatory-gene polymorphisms, including ACO1 rs2026739, were also associated with severity of DNP (all p<0.05). Common polymorphisms in iron-management genes are associated with DNP and with DNP severity in HIV-infected persons receiving cART. Consistent risk estimates across population subgroups and persistence of the ACO1 rs2026739 association after adjustment for multiple testing suggest that genetic variation in iron-regulation and transport modulates susceptibility to DNP.
doi:10.1371/journal.pone.0103123
PMCID: PMC4140681  PMID: 25144566
6.  Top associated SNPs in prostate cancer are significantly enriched in cis-expression quantitative trait loci and at transcription factor binding sites 
Oncotarget  2014;5(15):6168-6177.
While genome-wide association studies (GWAS) have revealed thousands of disease risk single nucleotide polymorphisms (SNPs), their functions remain largely unknown. Recent studies have suggested the regulatory roles of GWAS risk variants in several common diseases; however, the complex regulatory structure in prostate cancer is unclear.
We investigated the potential regulatory roles of risk variants in two prostate cancer GWAS datasets by their interactions with expression quantitative trait loci (eQTL) and/or transcription factor binding sites (TFBSs) in three populations.
Our results indicated that the moderately associated GWAS SNPs were significantly enriched with cis-eQTLs and TFBSs in Caucasians (CEU), but not in African Americans (AA) or Japanese (JPT); this was also observed in an independent set of pan-cancer-related SNPs from the GWAS Catalog. We found that the eQTL enrichment in the CEU population was tissue-specific to eQTLs from CEU lymphoblastoid cell lines. Importantly, we pinpointed two SNPs, rs2861405 and rs4766642, by overlapping results from cis-eQTL and TFBS as applied to the CEU data.
These results suggested that prostate cancer associated SNPs and pan-cancer associated SNPs are likely to play regulatory roles in CEU. However, the negative enrichment results in AA or JPT and the potential mechanisms remain to be elucidated in additional samples.
PMCID: PMC4171620  PMID: 25026280
prostate cancer; genome-wide association studies; eQTL; TFBS; regulatory variants
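One common way to ask whether a set of associated SNPs overlaps an annotation such as cis-eQTLs more often than expected is a hypergeometric upper-tail test. The abstract above does not state which test the authors used, so the following is only an illustrative sketch; the function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from a
    population that contains `successes` annotated items."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 1000 tested SNPs, 100 of them cis-eQTLs,
# 50 top GWAS SNPs, of which 15 are cis-eQTLs (5 expected by chance).
p_enrichment = hypergeom_upper_tail(15, population=1000, successes=100, draws=50)
print(p_enrichment)
```

The test assumes SNPs are independent draws; linkage disequilibrium between SNPs violates that assumption, so a real analysis would need pruning or a permutation scheme.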
7.  Two non-synonymous markers in PTPN21, identified by genome-wide association study data-mining and replication, are associated with schizophrenia 
Schizophrenia research  2011;131(0):43-51.
We conducted data-mining analyses of genome wide association (GWA) studies of the CATIE and MGS-GAIN datasets, and found 13 markers in the two physically linked genes, PTPN21 and EML5, showing nominally significant association with schizophrenia. Linkage disequilibrium (LD) analysis indicated that all 7 markers from PTPN21 shared high LD (r² > 0.8), including rs2274736 and rs2401751, the two non-synonymous markers with the most significant association signals (rs2401751, P = 1.10×10⁻³ and rs2274736, P = 1.21×10⁻³). In a meta-analysis of all 13 replication datasets with a total of 13,940 subjects, we found that the two non-synonymous markers are significantly associated with schizophrenia (rs2274736, OR = 0.92, 95% CI: 0.86–0.97, P = 5.45×10⁻³ and rs2401751, OR = 0.92, 95% CI: 0.86–0.97, P = 5.29×10⁻³). One SNP (rs7147796) in EML5 is also significantly associated with the disease (OR = 1.08, 95% CI: 1.02–1.14, P = 6.43×10⁻³). These 3 markers remain significant after Bonferroni correction. Furthermore, haplotype conditioned analyses indicated that the association signals observed between rs2274736/rs2401751 and rs7147796 are statistically independent. Given the results that 2 non-synonymous markers in PTPN21 are associated with schizophrenia, further investigation of this locus is warranted.
doi:10.1016/j.schres.2011.06.023
PMCID: PMC4117700  PMID: 21752600
Data-mining; Informatic prioritization; Genetic association study; PTPN21; Non-synonymous SNP
8.  Application of next generation sequencing to human gene fusion detection: computational tools, features and perspectives 
Briefings in Bioinformatics  2012;14(4):506-519.
Gene fusions are important genomic events in human cancer because their fusion gene products can drive the development of cancer and thus are potential prognostic tools or therapeutic targets in anti-cancer treatment. Major advancements have been made in computational approaches for fusion gene discovery over the past 3 years due to improvements and widespread applications of high-throughput next generation sequencing (NGS) technologies. To identify fusions from NGS data, existing methods typically leverage the strengths of both sequencing technologies and computational strategies. In this article, we review the NGS and computational features of existing methods for fusion gene detection and suggest directions for future development.
doi:10.1093/bib/bbs044
PMCID: PMC3713712  PMID: 22877769
gene fusion; next generation sequencing; cancer; whole genome sequencing; transcriptome sequencing; computational tools
9.  Gastric adenocarcinoma has a unique microRNA signature not present in esophageal adenocarcinoma 
Cancer  2013;119(11):1985-1993.
Background
MicroRNAs (miRNAs) play critical roles in tumor development and progression. The fact that a single miRNA can regulate hundreds of genes places miRNAs at critical hubs of signaling pathways. In this study, we investigated the miRNA expression profile in gastric adenocarcinomas and compared it to esophageal adenocarcinomas to better identify a unique miRNA signature of gastric adenocarcinoma.
Methods and Results
The miRNA expression profile was obtained using Agilent and Exiqon microarray platforms on primary gastric adenocarcinoma tissue samples. The cross comparison of results identified 17 up-regulated and 12 down-regulated miRNAs that overlapped in both platforms. Quantitative real-time RT-PCR was performed for independent validation of a representative set of 8 miRNAs in gastric and esophageal adenocarcinomas as compared to normal gastric mucosa or esophageal mucosa, respectively. The de-regulation of miR-146b-5p, -375, -148a, -31, and -451 was significantly associated with gastric adenocarcinomas. On the other hand, de-regulation of miR-21 (up-regulation) and miR-133b (down-regulation) was detectable in both gastric and esophageal adenocarcinomas. Interestingly, miR-200a was significantly down-regulated in gastric adenocarcinoma (p=0.04) but up-regulated in esophageal adenocarcinoma samples (p=0.001). In addition, the expression level of miR-146b-5p displayed a strong correlation with the tumor staging of gastric cancer.
Conclusion
Gastric adenocarcinoma displays a unique miRNA signature that distinguishes it from esophageal adenocarcinoma. This specific signature could reflect differences in the etiology and/or molecular signaling in these two closely related cancers. Our findings suggest important miRNA candidates that can be investigated for their molecular functions and possible diagnostic, prognostic, and therapeutic role in gastric adenocarcinoma.
doi:10.1002/cncr.28002
PMCID: PMC3731210  PMID: 23456798
miRNA; esophageal adenocarcinoma; gastric adenocarcinoma; microarray; prognosis
10.  Quantitative network mapping of the human kinome interactome reveals new clues for rational kinase inhibitor discovery and individualized cancer therapy 
Oncotarget  2014;5(11):3697-3710.
The human kinome is gaining importance through its promising cancer therapeutic targets, yet no general model to address kinase inhibitor resistance has emerged. Here, we constructed a systems biology-based framework to catalogue the human kinome, including 538 kinase genes, in the broader context of the human interactome. Specifically, we constructed three networks: a kinase-substrate interaction network containing 7,346 pairs connecting 379 kinases to 36,576 phosphorylation sites in 1,961 substrates, a protein-protein interaction network (PPIN) containing 92,699 pairs, and an atomic resolution PPIN containing 4,278 pairs. We identified the conserved regulatory phosphorylation motifs (e.g., Ser/Thr-Pro) using a sequence logo analysis. We found that the typical anticancer target selection strategy, which uses network hubs as drug targets, might lead to a high risk of adverse drug reactions. Furthermore, we found that the distinct network centrality of kinases creates a high risk of anticancer drug resistance through feedback or crosstalk mechanisms within cellular networks. This notion is supported by systematic network and pathway analyses showing that anticancer drug resistance genes are significantly enriched as hubs and participate heavily in multiple signaling pathways. Collectively, this comprehensive human kinome interactome map sheds light on anticancer drug resistance mechanisms and provides an innovative resource for rational kinase inhibitor design.
PMCID: PMC4116514  PMID: 25003367
Kinome; kinase-substrate interaction; phosphorylation; interactome; resistance; systems biology
11.  Patterns and processes of somatic mutations in nine major cancers 
BMC Medical Genomics  2014;7:11.
Background
Cancer genomes harbor hundreds to thousands of somatic nonsynonymous mutations. DNA damage and deficiency of DNA repair systems are two major forces to cause somatic mutations, marking cancer genomes with specific somatic mutation patterns. Recently, several pan-cancer genome studies revealed more than 20 mutation signatures across multiple cancer types. However, detailed cancer-type specific mutation signatures and their different features within (intra-) and between (inter-) cancer types remain largely unexplored.
Methods
We employed a matrix decomposition algorithm, namely Non-negative Matrix Factorization, to survey the somatic mutations in nine major human cancers, involving a total of ~2100 genomes.
Results
Our results revealed 3-5 independent mutational signatures in each cancer, implying that a range of 3-5 predominant mutational processes likely underlie each cancer genome. Both mutagen exposure (tobacco and sun) and changes in DNA repair systems (APOBEC family, POLE, and MLH1) were found as mutagenesis forces, each of which marks the genome with an evident mutational signature. We studied the features of several signatures and their combinatory patterns within and across cancers. On one hand, we found each signature may influence a cancer genome with different influential magnitudes even in the same cancer type and the signature-specific load reflects intra-cancer heterogeneity (e.g., the smoking-related signature in lung cancer smokers and never smokers). On the other hand, inter-cancer heterogeneity is characterized by combinatory patterns of mutational signatures, where no cancers share the same signature profile, even between two lung cancer subtypes (lung adenocarcinoma and squamous cell lung cancer).
Conclusions
Our work provides a detailed overview of the mutational characteristics in each of nine major cancers and highlights that the mutational signature profile is representative of each cancer.
doi:10.1186/1755-8794-7-11
PMCID: PMC3942057  PMID: 24552141
Somatic mutation; Cancer; Kataegis; Mutation signature; Mutagen; Heterogeneity
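The Methods of the entry above name Non-negative Matrix Factorization as the decomposition algorithm. The sketch below shows the textbook Lee-Seung multiplicative updates on a tiny, hypothetical count matrix (rows are mutation categories, columns are samples). It is not the authors' implementation; the data, the rank, the iteration count, and every function name are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V into W (categories x signatures) and
    H (signatures x samples) with multiplicative updates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_cols)] for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(rank)] for i in range(n_rows)]
    return w, h


# Hypothetical counts: 6 mutation categories x 5 tumour samples.
toy_counts = [
    [20, 18, 2, 1, 10],
    [15, 14, 1, 2, 8],
    [1, 2, 25, 22, 12],
    [2, 1, 18, 20, 10],
    [5, 6, 5, 4, 5],
    [3, 2, 9, 8, 6],
]
w0, h0 = nmf(toy_counts, rank=2, n_iter=0)   # random start, no updates
w, h = nmf(toy_counts, rank=2, n_iter=200)
print(frobenius_error(toy_counts, w0, h0), frobenius_error(toy_counts, w, h))
```

Each column of `w` plays the role of one signature and each row of `h` gives that signature's load per sample. Choosing the rank (3 to 5 per cancer in the abstract) is a separate model-selection step that this sketch does not cover.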
12.  VarWalker: Personalized Mutation Network Analysis of Putative Cancer Genes from Next-Generation Sequencing Data 
PLoS Computational Biology  2014;10(2):e1003460.
A major challenge in interpreting the large volume of mutation data identified by next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations to facilitate the identification of targetable genes and new drugs. Current approaches are primarily based on mutation frequencies of single-genes, which lack the power to detect infrequently mutated driver genes and ignore functional interconnection and regulation among cancer genes. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. VarWalker fits generalized additive models for each sample based on sample-specific mutation profiles and builds on the joint frequency of both mutation genes and their close interactors. These interactors are selected and optimized using the Random Walk with Restart algorithm in a protein-protein interaction network. We applied the method in >300 tumor genomes in two large-scale NGS benchmark datasets: 183 lung adenocarcinoma samples and 121 melanoma samples. In each cancer, we derived a consensus mutation subnetwork containing significantly enriched consensus cancer genes and cancer-related functional pathways. These cancer-specific mutation networks were then validated using independent datasets for each cancer. Importantly, VarWalker prioritizes well-known, infrequently mutated genes, which are shown to interact with highly recurrently mutated genes yet have been ignored by conventional single-gene-based approaches. Utilizing VarWalker, we demonstrated that network-assisted approaches can be effectively adapted to facilitate the detection of cancer driver genes in NGS data.
Author Summary
A cancer genome typically harbors both driver mutations, which contribute to tumorigenesis, and passenger mutations, which tend to be neutral and occur randomly. Cancer genomes differ dramatically due to genetic and environmental factors. A major challenge in interpreting the large volume of mutation data identified in cancer genomes using next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations. We propose a novel mutation network method, VarWalker, to prioritize driver genes in large scale cancer mutation data. Applying our approach in a large cohort of lung adenocarcinoma samples and melanoma samples, we derived a consensus mutation subnetwork for each cancer containing significantly enriched cancer genes and cancer-related functional pathways. Our results indicated that driver genes occur within a broad spectrum of frequency, interact with each other, and converge in several key pathways that play critical roles in tumorigenesis.
doi:10.1371/journal.pcbi.1003460
PMCID: PMC3916227  PMID: 24516372
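The abstract above states that interactors are selected with the Random Walk with Restart algorithm in a protein-protein interaction network. The following is a generic sketch of that algorithm on a hypothetical five-gene graph; it does not reproduce VarWalker's sample-specific weighting or its generalized additive models, and the restart probability of 0.5, the gene labels, and the function name are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5,
                             tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node.

    `adjacency` maps each node to its neighbours in an undirected graph
    (every node must have at least one neighbour); `seeds` are the
    nodes the walker jumps back to with probability `restart`."""
    nodes = sorted(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for n in nodes:
            # Probability arriving from each neighbour, which splits its
            # own mass evenly over its edges.
            inflow = sum(p[m] / degree[m] for m in adjacency[n])
            nxt[n] = (1 - restart) * inflow + restart * p0[n]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical network: g1, g2, g3 form a triangle; g4 and g5 trail off g3.
toy_graph = {
    "g1": ["g2", "g3"],
    "g2": ["g1", "g3"],
    "g3": ["g1", "g2", "g4"],
    "g4": ["g3", "g5"],
    "g5": ["g4"],
}
scores = random_walk_with_restart(toy_graph, seeds={"g1"})
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

Genes close to the seed (here a mutated gene) receive higher scores than distant ones, which is the property that lets a network method rank close interactors of mutated genes.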
13.  Gene set analysis of genome-wide association studies: methodological issues and perspectives 
Genomics  2011;98(1):10.1016/j.ygeno.2011.04.006.
Recent studies have demonstrated that gene set analysis, which tests disease association with genetic variants in a group of functionally related genes, is a promising approach for analyzing and interpreting genome-wide association studies (GWAS) data. These approaches aim to increase power by combining association signals from multiple genes in the same gene set. In addition, gene set analysis can also shed more light on the biological processes underlying complex diseases. However, current approaches for gene set analysis are still in an early stage of development in that analysis results are often prone to sources of bias, including gene set size and gene length, linkage disequilibrium patterns and the presence of overlapping genes. In this paper, we provide an in-depth review of the gene set analysis procedures, along with parameter choices and the particular methodology challenges at each stage. In addition to providing a survey of recently developed tools, we also classify the analysis methods into larger categories and discuss their strengths and limitations. In the last section, we outline several important areas for improving the analytical strategies in gene set analysis.
doi:10.1016/j.ygeno.2011.04.006
PMCID: PMC3852939  PMID: 21565265
Genome-wide association study; Gene set; Pathway; Gene-set enrichment analysis; Statistical significance; Complex disease
14.  Multi-Dimensional Prioritization of Dental Caries Candidate Genes and Its Enriched Dense Network Modules 
PLoS ONE  2013;8(10):e76666.
A number of genetic studies have suggested numerous susceptibility genes for dental caries over the past decade with few definite conclusions. The rapid accumulation of relevant information, along with the complex architecture of the disease, provides a challenging but also unique opportunity to review and integrate the heterogeneous data for follow-up validation and exploration. In this study, we collected and curated candidate genes from four major categories: association studies, linkage scans, gene expression analyses, and literature mining. Candidate genes were prioritized according to the magnitude of evidence related to dental caries. We then searched for dense modules enriched with the prioritized candidate genes through their protein-protein interactions (PPIs). We identified 23 modules comprising 53 genes. Functional analyses of these 53 genes revealed three major clusters: cytokine network relevant genes, the matrix metalloproteinase (MMP) family, and the transforming growth factor-beta (TGF-β) family, all of which have previously been implicated in tooth development and carious lesions. Through our extensive data collection and an integrative application of gene prioritization and PPI network analyses, we built a dental caries-specific sub-network for the first time. Our study provided insights into the molecular mechanisms underlying dental caries. The framework we proposed in this work can be applied to other complex diseases.
doi:10.1371/journal.pone.0076666
PMCID: PMC3795720  PMID: 24146904
15.  Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers 
Genome Medicine  2013;5(10):91.
Background
Driven by high throughput next generation sequencing technologies and the pressing need to decipher cancer genomes, computational approaches for detecting somatic single nucleotide variants (sSNVs) have undergone dramatic improvements during the past 2 years. The recently developed tools typically compare a tumor sample directly with a matched normal sample at each variant locus in order to increase the accuracy of sSNV calling. These programs also address the detection of sSNVs at low allele frequencies, allowing for the study of tumor heterogeneity, cancer subclones, and mutation evolution in cancer development.
Methods
We used whole genome sequencing (Illumina Genome Analyzer IIx platform) of a melanoma sample and matched blood, whole exome sequencing (Illumina HiSeq 2000 platform) of 18 lung tumor-normal pairs and seven lung cancer cell lines to evaluate six tools for sSNV detection: EBCall, JointSNVMix, MuTect, SomaticSniper, Strelka, and VarScan 2, with a focus on MuTect and VarScan 2, two widely used publicly available software tools. Default/suggested parameters were used to run these tools. The missense sSNVs detected in these samples were validated through PCR and direct sequencing of genomic DNA from the samples. We also simulated 10 tumor-normal pairs to explore the ability of these programs to detect low allelic-frequency sSNVs.
Results
Of the 237 sSNVs successfully validated in our cancer samples, VarScan 2 and MuTect detected more than any other tool (204 and 192, respectively). MuTect identified 11 more low-coverage validated sSNVs than VarScan 2 but missed 11 more sSNVs with alternate alleles in normal samples. When examining the false calls of each tool using 169 invalidated sSNVs, we observed that >63% of the false calls detected in the lung cancer cell lines had alternate alleles in normal samples. Additionally, in our simulation data, VarScan 2 identified more sSNVs than the other tools, while MuTect characterized most low allelic-fraction sSNVs.
Conclusions
Our study explored the typical false-positive and false-negative detections that arise from the use of sSNV-calling tools. Our results suggest that despite recent progress, these tools have significant room for improvement, especially in the discrimination of low coverage/allelic-frequency sSNVs and sSNVs with alternate alleles in normal samples.
doi:10.1186/gm495
PMCID: PMC3971343  PMID: 24112718
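The detection counts reported in the Results above translate directly into the fraction of validated sSNVs that each tool recovered. The short calculation below uses only the three numbers given in the abstract (237, 204, and 192); the variable names are illustrative.

```python
validated_total = 237                       # validated sSNVs
detected = {"VarScan 2": 204, "MuTect": 192}

# Fraction of validated sSNVs recovered by each tool.
recovered = {tool: n / validated_total for tool, n in detected.items()}
for tool, value in recovered.items():
    print(f"{tool}: {value:.1%}")
# VarScan 2: 86.1%
# MuTect: 81.0%
```

These fractions describe recall on the validated set only; they say nothing about false calls, which the abstract assesses separately with the 169 sSNVs that failed validation.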
16.  Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives 
BMC Bioinformatics  2013;14(Suppl 11):S1.
Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) and genotyping arrays were the standard technologies for detecting large regions subject to copy number changes in genomes until, most recently, high-resolution sequence data became available through next-generation sequencing (NGS). During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development.
doi:10.1186/1471-2105-14-S11-S1
PMCID: PMC3846878  PMID: 24564169
17.  BRAF L597 mutations in melanoma are associated with sensitivity to MEK inhibitors 
Cancer discovery  2012;2(9):791-797.
Kinase inhibitors are an accepted treatment for metastatic melanomas that harbor specific driver mutations in BRAF or KIT, but only 40–50% of cases are positive. To uncover other potential targetable mutations, we performed whole-genome sequencing of a highly aggressive BRAF (V600) and KIT (W557, V559, L576, K642, D816) wildtype melanoma. Surprisingly, we found a somatic BRAF L597R mutation in exon 15. Analysis of BRAF exon 15 in 49 tumors negative for BRAF V600 mutations as well as driver mutations in KIT, NRAS, GNAQ, and GNA11 showed that 2 (4%) harbored L597 mutations and another 2 involved BRAF D594 and K601 mutations. In vitro signaling induced by L597R/S/Q mutants was suppressed by MEK inhibition. A patient with BRAF L597S mutant metastatic melanoma responded significantly to treatment with the MEK inhibitor TAK-733. Collectively, these data demonstrate the clinical significance of BRAF L597 mutations in melanoma.
doi:10.1158/2159-8290.CD-12-0097
PMCID: PMC3449158  PMID: 22798288
melanoma; BRAF L597; whole genome sequencing; BRAF inhibitor; MEK inhibitor; TAK-733
18.  Association Signals Unveiled by a Comprehensive Gene Set Enrichment Analysis of Dental Caries Genome-Wide Association Studies 
PLoS ONE  2013;8(8):e72653.
Gene set-based analysis of genome-wide association study (GWAS) data has recently emerged as a useful approach to examine the joint effects of multiple risk loci in complex human diseases or phenotypes. Dental caries is a common, chronic, and complex disease leading to a decrease in quality of life worldwide. In this study, we applied the approaches of gene set enrichment analysis to a major dental caries GWAS dataset, which consists of 537 cases and 605 controls. Using four complementary gene set analysis methods, we analyzed 1331 Gene Ontology (GO) terms collected from the Molecular Signatures Database (MSigDB). Setting false discovery rate (FDR) threshold as 0.05, we identified 13 significantly associated GO terms. Additionally, 17 terms were further included as marginally associated because they were top ranked by each method, although their FDR is higher than 0.05. In total, we identified 30 promising GO terms, including ‘Sphingoid metabolic process,’ ‘Ubiquitin protein ligase activity,’ ‘Regulation of cytokine secretion,’ and ‘Ceramide metabolic process.’ These GO terms encompass broad functions that potentially interact and contribute to the oral immune response related to caries development, which have not been reported in the standard single marker based analysis. Collectively, our gene set enrichment analysis provided complementary insights into the molecular mechanisms and polygenic interactions in dental caries, revealing promising association signals that could not be detected through single marker analysis of GWAS data.
doi:10.1371/journal.pone.0072653
PMCID: PMC3743773  PMID: 23967329
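The FDR thresholding step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say which FDR procedure was used, so the Benjamini-Hochberg adjustment here is an assumption, and the function names and example GO term labels are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order.

    Assumption: the abstract only states an FDR threshold of 0.05; BH is
    used here as one common way to control the FDR.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def significant_terms(term_p_values, fdr=0.05):
    """Return the names of terms whose q-value is below the FDR threshold."""
    names = list(term_p_values)
    q_values = benjamini_hochberg([term_p_values[name] for name in names])
    return [name for name, q in zip(names, q_values) if q < fdr]


# Hypothetical nominal p-values for five made-up gene set labels.
example = {
    "term_a": 0.001,
    "term_b": 0.008,
    "term_c": 0.039,
    "term_d": 0.041,
    "term_e": 0.6,
}
print(significant_terms(example))
```

With a real analysis, the input would hold one nominal p-value per GO term for each gene set method, and the adjustment would be applied per method.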
19.  Association Study of 167 Candidate Genes for Schizophrenia Selected by a Multi-Domain Evidence-Based Prioritization Algorithm and Neurodevelopmental Hypothesis 
PLoS ONE  2013;8(7):e67776.
Integrating evidence from multiple domains is useful in prioritizing disease candidate genes for subsequent testing. We ranked all known human genes (n = 3819) under linkage peaks in the Irish Study of High-Density Schizophrenia Families using three different evidence domains: 1) a meta-analysis of microarray gene expression results using the Stanley Brain collection, 2) a schizophrenia protein-protein interaction network, and 3) a systematic literature search. Each gene was assigned a domain-specific p-value and ranked after evaluating the evidence within each domain. For comparison to this ranking process, a large-scale candidate gene hypothesis was also tested by including genes with Gene Ontology terms related to neurodevelopment. Subsequently, genotypes of 3725 SNPs in 167 genes from a custom Illumina iSelect array were used to evaluate the top ranked vs. hypothesis selected genes. Seventy-three genes were both highly ranked and involved in neurodevelopment (category 1) while 42 and 52 genes were exclusive to neurodevelopment (category 2) or highly ranked (category 3), respectively. The most significant associations were observed in genes PRKG1, PRKCE, and CNTN4 but no individual SNPs were significant after correction for multiple testing. Comparison of the approaches showed an excess of significant tests using the hypothesis-driven neurodevelopment category. Random selection of similar sized genes from two independent genome-wide association studies (GWAS) of schizophrenia showed the excess was unlikely by chance. In a further meta-analysis of three GWAS datasets, four candidate SNPs reached nominal significance. Although gene ranking using integrated sources of prior information did not enrich for significant results in the current experiment, gene selection using an a priori hypothesis (neurodevelopment) was superior to random selection. As such, further development of gene ranking strategies using more carefully selected sources of information is warranted.
doi:10.1371/journal.pone.0067776
PMCID: PMC3726675  PMID: 23922650
20.  Deciphering the Unique MicroRNA Signature in Human Esophageal Adenocarcinoma 
PLoS ONE  2013;8(5):e64463.
Background and Methods
Esophageal adenocarcinoma (EAC) is characterized by a steep rise in incidence rates in the Western population. The unique miRNA signature that distinguishes EAC from other upper gastrointestinal cancers remains unclear. Herein, we performed a comprehensive microarray profiling for the specific miRNA signature associated with EAC. We validated this signature by qRT-PCR.
Results
Microarray analysis showed that 21 miRNAs were consistently deregulated in EAC. miR-194, miR-192, miR-200a, miR-21, miR-203, miR-205, miR-133b, and miR-31 were selected for validation using 46 normal squamous (NS), 23 Barrett’s esophagus (BE), 17 Barrett’s high grade dysplasia (HGD), 34 EAC, 33 gastric adenocarcinoma (GC), and 45 normal gastric (NG) tissues. The qRT-PCR analysis indicated that 2 miRNAs (miR-21 and miR-133b) were deregulated in both EAC and GC, and 6 miRNAs (up-regulated: miR-194, miR-31, miR-192, and miR-200a; down-regulated: miR-203 and miR-205) in EAC, as compared to BE but not in GC, indicating their potential unique role in EAC. Our data showed that miR-194, miR-192, miR-21, and miR-31 were up-regulated in BE adjacent to HGD lesions relative to isolated BE samples. Analysis of clinicopathological features indicated that down-regulation of miR-203 is significantly associated with progression and tumor stages in EAC. Interestingly, the overexpression levels of miR-194, miR-200a, and miR-192 were significantly higher in early EAC stages, suggesting that these miRNAs may be involved in EAC tumor development rather than progression.
Conclusion
Our findings demonstrate the presence of a unique miRNA signature for EAC. This may provide some clues for the distinct molecular features of EAC to be considered in future studies of the role of miRNAs in EAC and their utility as disease biomarkers.
doi:10.1371/journal.pone.0064463
PMCID: PMC3665888  PMID: 23724052
21.  VirusFinder: Software for Efficient and Accurate Detection of Viruses and Their Integration Sites in Host Genomes through Next Generation Sequencing Data 
PLoS ONE  2013;8(5):e64465.
Next generation sequencing (NGS) technologies allow us to explore virus interactions with host genomes that lead to carcinogenesis or other diseases; however, this effort is largely hindered by the dearth of efficient computational tools. Here, we present a new tool, VirusFinder, for the identification of viruses and their integration sites in host genomes using NGS data, including whole transcriptome sequencing (RNA-Seq), whole genome sequencing (WGS), and targeted sequencing data. VirusFinder’s unique features include the characterization of insertion loci of virus of arbitrary type in the host genome and high accuracy and computational efficiency as a result of its well-designed pipeline. The source code as well as additional data of VirusFinder is publicly available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.
doi:10.1371/journal.pone.0064465
PMCID: PMC3663743  PMID: 23717618
22.  Integrative pathway analysis of genome-wide association studies and gene expression data in prostate cancer 
BMC Systems Biology  2012;6(Suppl 3):S13.
Background
Pathway analysis of large-scale omics data assists us with the examination of the cumulative effects of multiple functionally related genes, which are difficult to detect using the traditional single gene/marker analysis. So far, most of the genomic studies have been conducted in a single domain, e.g., by genome-wide association studies (GWAS) or microarray gene expression investigation. A combined analysis of disease susceptibility genes across multiple platforms at the pathway level is an urgent need because it can reveal more reliable and more biologically important information.
Results
We performed an integrative pathway analysis of a GWAS dataset and a microarray gene expression dataset in prostate cancer. We obtained a comprehensive pathway annotation set from knowledge-based public resources, including KEGG pathways and the prostate cancer candidate gene set, and gene sets specifically defined based on cross-platform information. Using this pathway collection, we first searched for significant pathways in the GWAS dataset using four methods, which represent two broad groups of pathway analysis approaches. The significant pathways identified by each method varied greatly, but the results were more consistent within each method group than between groups. Next, we conducted a gene set enrichment analysis of the microarray gene expression data and found 13 pathways with cross-platform evidence, including "Fc gamma R-mediated phagocytosis" (P_GWAS = 0.003, P_expr < 0.001, and P_combined = 6.18 × 10^-8), "regulation of actin cytoskeleton" (P_GWAS = 0.003, P_expr = 0.009, and P_combined = 3.34 × 10^-4), and "Jak-STAT signaling pathway" (P_GWAS = 0.001, P_expr = 0.084, and P_combined = 8.79 × 10^-4).
Conclusions
Our results provide evidence at both the genetic variation and expression levels that several key pathways might have been involved in the pathological development of prostate cancer. Our framework that employs gene expression data to facilitate pathway analysis of GWAS data is not only feasible but also much needed in studying complex disease.
doi:10.1186/1752-0509-6-S3-S13
PMCID: PMC3524313  PMID: 23281744
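The abstract above reports a combined P value for each pathway alongside its GWAS and expression P values, but it does not say how the two were combined. The sketch below uses Fisher's method purely as an assumed illustration of combining independent p-values; the function name is hypothetical. Applied to the rounded values printed in the abstract, it gives results of the same order of magnitude but not the exact reported figures.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    so no external statistics library is needed.

    Assumption: Fisher's method is an illustrative choice; the abstract
    does not name the combination rule it used.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Rounded GWAS and expression p-values as printed for two of the pathways.
print(fisher_combine([0.003, 0.009]))
print(fisher_combine([0.001, 0.084]))
```

Fisher's method assumes the two p-values are independent, which is plausible when they come from separate GWAS and expression datasets but should be checked for any real application.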
23.  Multi-species data integration and gene ranking enrich significant results in an alcoholism genome-wide association study 
BMC Genomics  2012;13(Suppl 8):S16.
Background
A variety of species and experimental designs have been used to study genetic influences on alcohol dependence, ethanol response, and related traits. Integration of these heterogeneous data can be used to produce a ranked target gene list for additional investigation.
Results
In this study, we performed a unique multi-species evidence-based data integration using three microarray experiments in mice or humans that generated an initial alcohol dependence (AD) related gene list, human linkage and association results, and gene sets implicated in C. elegans and Drosophila. We then used permutation and false discovery rate (FDR) analyses on the genome-wide association studies (GWAS) dataset from the Collaborative Study on the Genetics of Alcoholism (COGA) to evaluate the ranking results and weighting matrices. We found one weighting score matrix could increase FDR-based q-values for a list of 47 genes with a score greater than 2. Our follow-up functional enrichment tests revealed these genes were primarily involved in brain responses to ethanol and neural adaptations occurring with alcoholism.
Conclusions
These results, along with our experimental validation of specific genes in mice, C. elegans and Drosophila, suggest that a cross-species evidence-based approach is useful to identify candidate genes contributing to alcoholism.
doi:10.1186/1471-2164-13-S8-S16
PMCID: PMC3535715  PMID: 23282140
24.  Searching joint association signals in CATIE schizophrenia genome-wide association studies through a refined integrative network approach 
BMC Genomics  2012;13(Suppl 6):S15.
Background
Genome-wide association studies (GWAS) have generated a wealth of valuable genotyping data for complex diseases/traits. A large proportion of these data are embedded with many weakly associated markers that have been missed in traditional single marker analyses, but they may provide valuable insights in dissecting the genetic components of diseases. Gene set analysis (GSA) augmented by protein-protein interaction network data provides a promising way to examine GWAS data by analyzing the combined effects of multiple genes/markers, each of which may have only individually weak to moderate association effects. A critical issue in GSA of GWAS data is the definition of gene-wise P values based on multiple SNPs mapped to a gene.
Results
In this study, we proposed an alternative restricted search approach based on our previously developed dense module search algorithm, and we demonstrated it in the CATIE GWAS dataset for schizophrenia. Specifically, we explored three ways of computing gene-wise P values and examined their effects on the resultant module genes. These methods calculate gene-wise P values based on all the SNPs, the top ranked SNPs, or the most significant SNP among all the SNPs mapped to a gene. We applied the restricted search approach and identified a module gene set for each of the gene-wise P value data sets. In our evaluation using an independent method, ALIGATOR, we showed that although each of these input datasets generated a unique set of module genes, all of them were significant in the GWAS dataset. Further functional enrichment analysis of these module genes showed that, at the pathway level, they were all consistently related to neuro- and immune-related pathways. Finally, we compared our method with a previously reported method.
Conclusion
Our results showed that the approaches to computing gene-wise P values in GWAS data are critical in GSA. This work is useful for evaluating key factors in GSA of GWAS data.
doi:10.1186/1471-2164-13-S6-S15
PMCID: PMC3481439  PMID: 23134571
25.  Genome-wide association study of antipsychotic induced QTc interval prolongation 
The Pharmacogenomics Journal  2010;12(2):165-172.
QT prolongation is associated with increased risk of cardiac arrhythmias. Identifying the genetic variants that mediate antipsychotic induced prolongation may help to minimize this risk, which might prevent the removal of efficacious drugs from the market. We performed candidate gene analysis and five drug specific genome-wide association studies (GWAS) with 492K SNPs to search for genetic variation mediating antipsychotic induced QT prolongation in 738 schizophrenia patients from the Clinical Antipsychotic Trial of Intervention Effectiveness (CATIE) study.
Our candidate gene study suggests the involvement of NOS1AP and NUBPL (p-values = 1.45×10^-5 and 2.66×10^-13, respectively). Furthermore, our top GWAS hit, located in SLC22A23, achieved genome-wide significance, defined as a q-value < 0.10 (p-value = 1.54×10^-7, q-value = 0.07), and mediated the effects of quetiapine on prolongation. SLC22A23 belongs to a family of organic ion transporters that shuttle a variety of compounds including drugs, environmental toxins, and endogenous metabolites across the cell membrane. This gene is expressed in the heart and is integral in mouse heart development. The genes mediating antipsychotic induced QT prolongation partially overlap with the genes affecting normal QT interval variation. However, some genes may also be unique for drug induced prolongation. This study demonstrates the potential of GWAS to discover genes and pathways that mediate antipsychotic induced QT prolongation.
doi:10.1038/tpj.2010.76
PMCID: PMC3388904  PMID: 20921969
candidate gene analysis; genome-wide association study; schizophrenia; adverse effects; CATIE
