1.  ContrastRank: a new method for ranking putative cancer driver genes and classification of tumor samples 
Bioinformatics  2014;30(17):i572-i578.
Motivation: Recent advances in high-throughput sequencing technologies are generating huge amounts of data that are becoming an important resource for deciphering the genotype underlying a given phenotype. Genome sequencing has been extensively applied to the study of cancer genomes. Although a few methods have already been proposed for the detection of cancer-related genes, their automatic identification is still a challenging task. Using the genomic data made available by The Cancer Genome Atlas Consortium (TCGA), we propose a new prioritization approach based on the analysis of the distribution of putative deleterious variants in a large cohort of cancer samples.
Results: In this paper, we present ContrastRank, a new method for the prioritization of putative impaired genes in cancer. The method is based on the comparison of the putative defective rate of each gene in tumor versus normal and 1000 Genomes samples. We show that the method is able to provide a ranked list of putative impaired genes for colon, lung and prostate adenocarcinomas. The list significantly overlaps with the list of known cancer driver genes previously published. More importantly, by using our scoring approach, we can successfully discriminate between TCGA normal and tumor samples. A binary classifier based on the ContrastRank score reaches an overall accuracy >90% and an area under the receiver operating characteristic (ROC) curve (AUC) >0.95 for all three types of adenocarcinoma analyzed in this paper. In addition, using the ContrastRank score, we are able to discriminate the three tumor types with a minimum overall accuracy of 77% and AUC of 0.83.
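The core scoring idea (contrasting per-gene rates of putative deleterious variants between cohorts) can be sketched as follows; this is a toy illustration with invented sample data, not the published implementation:

```python
def defective_rate(samples, gene):
    """Fraction of samples carrying a putative deleterious variant in `gene`."""
    return sum(1 for sample in samples if gene in sample) / len(samples)

def contrast_rank(genes, tumor_samples, normal_samples):
    """Rank genes by tumor-vs-normal defective-rate contrast (highest first)."""
    score = {
        g: defective_rate(tumor_samples, g) - defective_rate(normal_samples, g)
        for g in genes
    }
    return sorted(score, key=score.get, reverse=True)

tumor = [{"APC", "KRAS"}, {"APC", "TP53"}, {"APC"}]   # genes hit per tumor sample
normal = [{"TP53"}, set(), set()]                     # genes hit per normal sample
ranking = contrast_rank(["APC", "KRAS", "TP53"], tumor, normal)
print(ranking)  # ['APC', 'KRAS', 'TP53']
```

The resulting per-gene contrasts could then feed a binary tumor/normal classifier, as the abstract describes.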
Conclusions: We describe ContrastRank, a method for prioritizing putative impaired genes in cancer. The method is based on the comparison of exome sequencing data from different cohorts and can detect putative cancer driver genes.
ContrastRank can also be used to estimate a global adenocarcinoma risk score for an individual genome based on the genetic variant information in a whole-exome VCF (Variant Call Format) file. We believe that the application of ContrastRank can be an important step in genomic medicine to enable genome-based diagnosis.
Availability and implementation: The lists of ContrastRank scores of all genes in each tumor type are available as supplementary materials. A webserver for evaluating the risk of the three studied adenocarcinomas starting from a whole-exome VCF file is under development.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4147919  PMID: 25161249
2.  Disease candidate gene identification and prioritization using protein interaction networks 
BMC Bioinformatics  2009;10:73.
Although most of the current disease candidate gene identification and prioritization methods depend on functional annotations, the coverage of the gene functional annotations is a limiting factor. In the current study, we describe a candidate gene prioritization method that is entirely based on protein-protein interaction network (PPIN) analyses.
For the first time, extended versions of the PageRank and HITS algorithms, and the K-Step Markov method are applied to prioritize disease candidate genes in a training-test schema. Using a list of known disease-related genes from our earlier study as a training set ("seeds"), and the rest of the known genes as a test list, we perform large-scale cross validation to rank the candidate genes and also evaluate and compare the performance of our approach. Under appropriate settings – for example, a back probability of 0.3 for PageRank with Priors and HITS with Priors, and step size 6 for K-Step Markov method – the three methods achieved a comparable AUC value, suggesting a similar performance.
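The "with Priors" variants bias the random walk toward the seed genes: at each step the walker jumps back to a seed with the back probability (0.3 above). A minimal sketch on a toy PPI network, assuming a simple adjacency-list representation, might look like this:

```python
def pagerank_with_priors(graph, seeds, back_prob=0.3, iters=100):
    """Power iteration for PageRank with Priors: the teleport mass goes only
    to the seed nodes instead of being spread uniformly over all nodes."""
    nodes = list(graph)
    prior = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(prior)
    for _ in range(iters):
        nxt = {n: back_prob * prior[n] for n in nodes}  # jump back to seeds
        for n in nodes:
            neighbors = graph[n]
            if neighbors:
                share = (1.0 - back_prob) * rank[n] / len(neighbors)
                for m in neighbors:  # spread the rest over neighbors
                    nxt[m] += share
        rank = nxt
    return rank

# Toy PPI network as adjacency lists; "A" is the known disease gene (seed).
ppi = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = pagerank_with_priors(ppi, seeds={"A"})
best_candidate = max((n for n in ppi if n != "A"), key=scores.get)
print(best_candidate)  # B, the candidate closest to the seed, scores highest
```

Candidates are then ranked by their stationary scores, so genes well connected to the seed set rise to the top.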
Even though network-based methods are generally not as effective as integrated functional annotation-based methods for disease candidate gene prioritization, in a one-to-one comparison, PPIN-based candidate gene prioritization performs better than all other gene features or annotations. Additionally, we demonstrate that methods used for studying both social and Web networks can be successfully used for disease candidate gene prioritization.
PMCID: PMC2657789  PMID: 19245720
3.  SVM classifier to predict genes important for self-renewal and pluripotency of mouse embryonic stem cells 
BMC Systems Biology  2010;4:173.
Mouse embryonic stem cells (mESCs) are derived from the inner cell mass of a developing blastocyst and can be cultured indefinitely in vitro. Their distinguishing features are their ability to self-renew and to differentiate into all adult cell types. Genes that maintain mESC self-renewal and pluripotency identity are of interest to stem cell biologists. Although significant steps have been made toward the identification and characterization of such genes, the list is still incomplete and controversial. For example, the overlap among candidate self-renewal and pluripotency genes across different RNAi screens is surprisingly small. Meanwhile, machine learning approaches have been used to analyze multi-dimensional experimental data and integrate results from many studies, yet they have not been applied to specifically tackle the task of predicting and classifying self-renewal and pluripotency gene membership.
For this study, we developed a supervised machine learning classifier for predicting mESC stemness membership genes (MSMG), i.e. self-renewal and pluripotency genes, using support vector machines (SVMs). The data used to train the classifier were derived from mESC-related studies using mRNA microarrays measuring gene expression at various stages of early differentiation, as well as from ChIP-seq studies in mESCs profiling the genome-wide binding of key transcription factors, such as Nanog, Oct4 and Sox2, to the regulatory regions of other genes. Comparison to other classification methods using leave-one-out cross-validation was employed to evaluate the accuracy and generality of the classification. Finally, two sets of candidate genes from genome-wide RNA interference screens were used to test the generality and potential application of the classifier.
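SVM classification with leave-one-out cross-validation can be sketched with scikit-learn; the two features and the labels below are invented toy data, not the study's actual inputs:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Rows: genes. Columns (hypothetical): expression change on differentiation
# and an aggregate transcription-factor binding score from ChIP-seq.
X = np.array([[2.0, 0.9], [1.8, 0.8], [2.2, 0.7],
              [0.1, 0.1], [0.2, 0.2], [0.0, 0.3]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = stemness gene, 0 = not

clf = SVC(kernel="linear")
# Each fold trains on all genes but one and tests on the held-out gene.
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(scores.mean())  # fraction of held-out genes classified correctly
```

On real data, the held-out accuracy from this loop is what would be compared across classifiers, as the abstract describes.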
Our results reveal that an SVM approach can be useful for prioritizing genes for functional validation experiments and complement the analyses of high-throughput profiling experimental data in stem cell research.
PMCID: PMC3019180  PMID: 21176149
4.  Exploiting Protein-Protein Interaction Networks for Genome-Wide Disease-Gene Prioritization 
PLoS ONE  2012;7(9):e43557.
Complex genetic disorders often involve products of multiple genes acting cooperatively. Hence, the pathophenotype is the outcome of the perturbations in the underlying pathways, where gene products cooperate through various mechanisms such as protein-protein interactions. Pinpointing the decisive elements of such disease pathways is still challenging. Over recent years, computational approaches exploiting interaction network topology have been successfully applied to prioritize individual genes involved in diseases. Although linkage intervals provide a list of disease-gene candidates, recent genome-wide studies demonstrate that genes not associated with any known linkage interval may also contribute to the disease phenotype. Network-based prioritization methods help to highlight such associations. Still, there is a need for robust methods that capture the interplay among disease-associated genes mediated by the topology of the network. Here, we propose a genome-wide network-based prioritization framework named GUILD. This framework implements four network-based disease-gene prioritization algorithms. We analyze the performance of these algorithms in dozens of disease phenotypes. The algorithms in GUILD are compared to state-of-the-art network topology based algorithms for prioritization of genes. As a proof of principle, we investigate top-ranking genes in Alzheimer's disease (AD), diabetes and AIDS using disease-gene associations from various sources. We show that GUILD is able to significantly highlight disease-gene associations that are not used a priori. Our findings suggest that GUILD helps to identify genes implicated in the pathology of human disorders independent of the loci associated with the disorders.
PMCID: PMC3448640  PMID: 23028459
5.  Deducing corticotropin-releasing hormone receptor type 1 signaling networks from gene expression data by usage of genetic algorithms and graphical Gaussian models 
BMC Systems Biology  2010;4:159.
Dysregulation of the hypothalamic-pituitary-adrenal (HPA) axis is a hallmark of complex and multifactorial psychiatric diseases such as anxiety and mood disorders. About 50-60% of patients with major depression show HPA axis dysfunction, i.e. hyperactivity and impaired negative feedback regulation. The neuropeptide corticotropin-releasing hormone (CRH) and its receptor type 1 (CRHR1) are key regulators of this neuroendocrine stress axis. Therefore, we analyzed CRH/CRHR1-dependent gene expression data obtained from the pituitary corticotrope cell line AtT-20, a well-established in vitro model for CRHR1-mediated signal transduction. To extract significantly regulated genes from a genome-wide microarray data set and to deduce underlying CRHR1-dependent signaling networks, we combined supervised and unsupervised algorithms.
We present an efficient variable selection strategy that consecutively applies univariate as well as multivariate methods followed by graphical models. First, feature preselection was used to exclude genes not differentially regulated over time from the dataset. For multivariate variable selection, a maximum likelihood (MLHD) discriminant function within GALGO, an R package based on a genetic algorithm (GA), was chosen. The topmost genes representing major nodes in the expression network were ranked to find highly separating candidate genes. By using groups of five genes (the chromosome size) in the discriminant function and repeating the genetic algorithm separately four times, we found eleven genes occurring in at least three of the top-ranked result lists of the four repetitions. In addition, we compared the results of GA/MLHD with the alternative optimization algorithms greedy selection and simulated annealing, as well as with the state-of-the-art method random forest. In every case we obtained a clear overlap of the selected genes, independently confirming the results of MLHD in combination with a genetic algorithm.
With two unsupervised algorithms, principal component analysis and graphical Gaussian models, putative interactions of the candidate genes were determined and reconstructed by literature mining. Differential regulation of six candidate genes was validated by qRT-PCR.
The combination of supervised and unsupervised algorithms in this study allowed extracting a small subset of meaningful candidate genes from the genome-wide expression data set. Thereby, variable selection using different optimization algorithms based on linear classifiers as well as the nonlinear random forest method resulted in congruent candidate genes. The calculated interacting network connecting these new target genes was bioinformatically mapped to known CRHR1-dependent signaling pathways. Additionally, the differential expression of the identified target genes was confirmed experimentally.
PMCID: PMC3002901  PMID: 21092110
6.  Combining Genome-Wide Association Mapping and Transcriptional Networks to Identify Novel Genes Controlling Glucosinolates in Arabidopsis thaliana 
PLoS Biology  2011;9(8):e1001125.
Genome-wide association mapping is highly sensitive to environmental changes, but network analysis allows rapid causal gene identification.
Genome-wide association (GWA) is gaining popularity as a means to study the architecture of complex quantitative traits, partially due to the improvement of high-throughput low-cost genotyping and phenotyping technologies. Glucosinolate (GSL) secondary metabolites within Arabidopsis spp. can serve as a model system to understand the genomic architecture of adaptive quantitative traits. GSL are key anti-herbivory defenses that impart adaptive advantages within field trials. While little is known about how variation in the external or internal environment of an organism may influence the efficiency of GWA, GSL variation is known to be highly dependent upon the external stresses and developmental processes of the plant, making it an excellent model for studying conditional GWA.
Methodology/Principal Findings
To understand how development and environment can influence GWA, we conducted a study using 96 Arabidopsis thaliana accessions, >40 GSL phenotypes across three conditions (one developmental comparison and one environmental comparison) and ∼230,000 SNPs. Developmental stage had dramatic effects on the outcome of GWA, with each stage identifying different loci associated with GSL traits. Further, while the molecular bases of numerous quantitative trait loci (QTL) controlling GSL traits have been identified, there is currently no estimate of how many additional genes may control natural variation in these traits. We developed a novel co-expression network approach to prioritize the thousands of GWA candidates and successfully validated a large number of these genes as influencing GSL accumulation within A. thaliana using single gene isogenic lines.
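The co-expression idea, ranking GWA candidates by how strongly their expression tracks known glucosinolate genes, can be illustrated in miniature (gene names and profiles below are invented):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy expression profiles across six conditions.
expression = {
    "known_GSL_gene": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "candidate_1":    [1.1, 2.2, 2.9, 4.2, 4.8, 6.1],  # co-expressed
    "candidate_2":    [5.0, 1.0, 4.0, 2.0, 3.0, 1.5],  # not co-expressed
}

def prioritize(candidates, seed):
    """Sort GWA candidates by absolute co-expression with the seed gene."""
    return sorted(
        candidates,
        key=lambda g: abs(pearson(expression[g], expression[seed])),
        reverse=True,
    )

ranked = prioritize(["candidate_1", "candidate_2"], "known_GSL_gene")
print(ranked[0])  # the co-expressed candidate ranks first
```

Candidates at the top of such a list would then be the ones taken forward to single-gene isogenic lines for validation.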
Together, these results suggest that complex traits imparting environmentally contingent adaptive advantages are likely influenced by up to thousands of loci that are sensitive to fluctuations in the environment or developmental state of the organism. Additionally, while GWA is highly conditional upon genetics, the use of additional genomic information can rapidly identify causal loci en masse.
Author Summary
Understanding how genetic variation can control phenotypic variation is a fundamental goal of modern biology. A major push has been made using genome-wide association mapping in all organisms in an attempt to rapidly identify the genes contributing to phenotypes such as disease and nutritional disorders. But a number of fundamental questions have not been answered about the use of genome-wide association: for example, how does the internal or external environment influence the genes found? Furthermore, the simple question of how many genes may influence a trait is unknown. Finally, a number of studies have identified significant false-positive and -negative issues within genome-wide association studies that are not solvable by direct statistical approaches. We have used genome-wide association mapping in the plant Arabidopsis thaliana to begin exploring these questions. We show that both external and internal environments significantly alter the identified genes, such that using different tissues can lead to the identification of nearly completely different gene sets. Given the large number of potential false-positives, we developed an orthogonal approach to filtering the possible genes, by identifying co-functioning networks using the nominal candidate gene list derived from genome-wide association studies. This allowed us to rapidly identify and validate a large number of novel and unexpected genes that affect Arabidopsis thaliana defense metabolism within phenotypic ranges that have been shown to be selectable within the field. These genes and the associated networks suggest that Arabidopsis thaliana defense metabolism is more consistent with the infinite gene hypothesis, according to which there is a vast number of causative genes controlling natural variation in this phenotype. It remains to be seen how frequently this is true for other organisms and other phenotypes.
PMCID: PMC3156686  PMID: 21857804
7.  Functional Genomics Complements Quantitative Genetics in Identifying Disease-Gene Associations 
PLoS Computational Biology  2010;6(11):e1000991.
An ultimate goal of genetic research is to understand the connection between genotype and phenotype in order to improve the diagnosis and treatment of diseases. The quantitative genetics field has developed a suite of statistical methods to associate genetic loci with diseases and phenotypes, including quantitative trait loci (QTL) linkage mapping and genome-wide association studies (GWAS). However, each of these approaches has technical and biological shortcomings. For example, the amount of heritable variation explained by GWAS is often surprisingly small and the resolution of many QTL linkage mapping studies is poor. The predictive power and interpretation of QTL and GWAS results are consequently limited. In this study, we propose a complementary approach to quantitative genetics by interrogating the vast amount of high-throughput genomic data in model organisms to functionally associate genes with phenotypes and diseases. Our algorithm combines the genome-wide functional relationship network for the laboratory mouse and a state-of-the-art machine learning method. We demonstrate the superior accuracy of this algorithm through predicting genes associated with each of 1157 diverse phenotype ontology terms. Comparison between our prediction results and a meta-analysis of quantitative genetic studies reveals both overlapping candidates and distinct, accurate predictions uniquely identified by our approach. Focusing on bone mineral density (BMD), a phenotype related to osteoporotic fracture, we experimentally validated two of our novel predictions (not observed in any previous GWAS/QTL studies) and found significant bone density defects for both Timp2 and Abcg8 deficient mice. Our results suggest that the integration of functional genomics data into networks, which itself is informative of protein function and interactions, can successfully be utilized as a complementary approach to quantitative genetics to predict disease risks.
All supplementary material is available at
Author Summary
Many recent efforts to understand the genetic origins of complex diseases utilize statistical approaches to analyze phenotypic traits measured in genetically well-characterized populations. While these quantitative genetics methods are powerful, their success is limited by sampling biases and other confounding factors, and the biological interpretation of results can be challenging since these methods are not based on any functional information for candidate loci. On the other hand, the functional genomics field has greatly expanded in past years, both in terms of experimental approaches and analytical algorithms. However, functional approaches have been applied to understanding phenotypes in only the most basic ways. In this study, we demonstrate that functional genomics can complement traditional quantitative genetics by analytically extracting protein function information from large collections of high throughput data, which can then be used to predict genotype-phenotype associations. We applied our prediction methodology to the laboratory mouse, and we experimentally confirmed a role in osteoporosis for two of our predictions that were not candidates from any previous quantitative genetics study. The ability of our approach to produce accurate and unique predictions implies that functional genomics can complement quantitative genetics and can help address previous limitations in identifying disease genes.
PMCID: PMC2978695  PMID: 21085640
8.  RankAggreg, an R package for weighted rank aggregation 
BMC Bioinformatics  2009;10:62.
Researchers in the field of bioinformatics often face a challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise.
The RankAggreg package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments.
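For intuition, the aggregation problem the package solves can be illustrated with the simpler (unweighted) Borda count; RankAggreg itself instead optimizes a weighted distance criterion via Cross-Entropy or a genetic algorithm:

```python
from collections import defaultdict

def borda_aggregate(ordered_lists):
    """Consensus ordering: each list awards len(list) - position points per item."""
    points = defaultdict(float)
    for lst in ordered_lists:
        for pos, item in enumerate(lst):
            points[item] += len(lst) - pos
    return sorted(points, key=points.get, reverse=True)

lists = [
    ["TP53", "BRCA1", "EGFR"],   # e.g. gene ranking from platform 1
    ["BRCA1", "TP53", "MYC"],    # platform 2
    ["TP53", "MYC", "BRCA1"],    # platform 3
]
consensus = borda_aggregate(lists)
print(consensus)  # ['TP53', 'BRCA1', 'MYC', 'EGFR']
```

Note how the consensus tolerates lists with partially disjoint membership, which is exactly the cross-platform situation the abstract describes.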
The two examples described in the manuscript clearly show the utility of the RankAggreg package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.
PMCID: PMC2669484  PMID: 19228411
9.  Using a large-scale knowledge database on reactions and regulations to propose key upstream regulators of various sets of molecules participating in cell metabolism 
BMC Systems Biology  2014;8:32.
Most of the existing methods to analyze high-throughput data are based on gene ontology principles, providing information on the main functions and biological processes. However, these methods do not indicate the regulations behind the biological pathways. A critical point in this context is the extraction of information from many possible relationships between the regulated genes, and its combination with biochemical regulations. This study aimed at developing an automatic method to propose a reasonable number of upstream regulatory candidates from lists of various regulated molecules by confronting experimental data with encyclopedic information.
A new formalism of regulated reactions combining biochemical transformations and regulatory effects was proposed to unify the different mechanisms contained in knowledge libraries. Based on the related causality graph, an algorithm was developed to propose a reasonable set of upstream regulators from lists of target molecules. Scores were added to candidates according to their ability to explain the greatest number of targets or only a few specific ones. By testing 250 lists of target genes as inputs, each with a known solution, the success of the method in providing the expected transcription factor among 50 or 100 proposed regulatory candidates was evaluated at 62.6% and 72.5% of the situations, respectively. An additional prioritization among candidates might be further achieved by adding functional ontology information. The benefit of this strategy was proved by identifying PPAR isotypes and their partners as the upstream regulators of a list of experimentally identified targets of PPARA, a pivotal transcription factor in lipid oxidation. The proposed candidates participated in various biological functions that further enriched the original information. The efficiency of the method in merging reactions and regulations was also illustrated by identifying gene candidates participating in glucose homeostasis from an input list of metabolites involved in cell glycolysis.
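The candidate-scoring step, crediting a regulator for each input target it can explain in the causality graph, can be sketched as follows (direct regulator-to-target edges only; the toy knowledge base below is a drastic simplification of the real library):

```python
def score_regulators(causality, targets):
    """Score each candidate regulator by the number of input targets it explains."""
    targets = set(targets)
    scored = {reg: len(set(downstream) & targets)
              for reg, downstream in causality.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

causality = {                        # regulator -> molecules it acts on
    "PPARA": ["ACOX1", "CPT1A", "HMGCS2"],
    "SREBF1": ["FASN", "ACOX1"],
    "NR1H3": ["FASN"],
}
targets = ["ACOX1", "CPT1A", "HMGCS2"]   # experimentally regulated genes
ranked = score_regulators(causality, targets)
print(ranked[0])  # ('PPARA', 3): PPARA explains all three targets
```

The real method walks longer causal paths and also scores specificity (explaining few but exclusive targets), but the ranking principle is the same.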
This method proposes a reasonable number of regulatory candidates for lists of input molecules that may include transcripts of genes and metabolites. The proposed upstream regulators are the transcription factors themselves and protein complexes, so that a multi-level description of how cell metabolism is regulated is obtained.
PMCID: PMC4004165  PMID: 24635915
Biochemical reactions; Causalities; Gene expression; Knowledge integration; Protein partners; Upstream regulators
10.  Computational selection and prioritization of candidate genes for Fetal Alcohol Syndrome 
BMC Genomics  2007;8:389.
Fetal alcohol syndrome (FAS) is a serious global health problem and is observed at high frequencies in certain South African communities. Although in utero alcohol exposure is the primary trigger, there is evidence for genetic and other susceptibility factors in FAS development. No genome-wide association or linkage studies have been performed for FAS, making computational selection and prioritization of candidate disease genes an attractive approach.
A total of 10,174 candidate genes were initially selected from the whole genome using a previously described method, which selects candidate genes according to their expression in disease-affected tissues. Thereafter, candidates were prioritized for experimental investigation by assessing criteria pertinent to FAS and applying binary filtering. Twenty-nine criteria were assessed by mining various database sources to populate criteria-specific gene lists. Candidate genes were then prioritized for experimental investigation using a binary system that assessed the criteria gene lists against the candidate list, and candidate genes were scored accordingly. A group of 87 genes was prioritized as candidates for future experimental validation. The validity of the binary prioritization method was assessed by investigating the protein-protein interactions, functional enrichment and common promoter element binding sites of the top-ranked genes.
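The binary filtering scheme, scoring each candidate by the number of criteria gene lists that contain it, can be sketched as follows (criteria and gene names below are invented placeholders, not the study's actual lists):

```python
def binary_score(candidates, criteria_lists):
    """Score each candidate gene by how many criteria gene lists contain it."""
    return {g: sum(g in lst for lst in criteria_lists) for g in candidates}

criteria_lists = [
    {"TGFB1", "SHH", "MAPK1"},   # e.g. expressed in the developing brain
    {"TGFB1", "SHH"},            # e.g. linked to craniofacial development
    {"TGFB1", "GAPDH"},          # e.g. alcohol-responsive
]
scores = binary_score(["TGFB1", "SHH", "MAPK1", "GAPDH"], criteria_lists)
top = max(scores, key=scores.get)
print(top, scores[top])  # TGFB1 3
```

Candidates whose score clears a chosen cutoff would form the prioritized group taken forward to validation.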
This analysis highlighted a list of strong candidate genes from the TGF-β, MAPK and Hedgehog signalling pathways, which are all integral to fetal development and potential targets for alcohol's teratogenic effect. We conclude that this novel bioinformatics approach effectively prioritizes credible candidate genes for further experimental analysis.
PMCID: PMC2194724  PMID: 17961254
11.  mirTarPri: Improved Prioritization of MicroRNA Targets through Incorporation of Functional Genomics Data 
PLoS ONE  2013;8(1):e53685.
MicroRNAs (miRNAs) are a class of small (19–25 nt) non-coding RNAs. This important class of gene regulators downregulates gene expression through sequence-specific binding to the 3′ untranslated regions (3′UTRs) of target mRNAs. Several computational approaches have been developed for predicting miRNA targets. However, the predicted target lists often have high false positive rates. To construct a workable target list for subsequent experimental studies, novel approaches are needed to properly rank the candidate targets produced by traditional methods. We performed a systematic analysis of experimentally validated miRNA targets using functional genomics data, and found significant functional associations between genes that were targeted by the same miRNA. Based on this finding, we developed a miRNA target prioritization method named mirTarPri to rank the predicted target lists from commonly used target prediction methods. Leave-one-out cross-validation proved successful in identifying known targets, achieving an AUC score of up to 0.84. Validation on high-throughput data showed that mirTarPri is an unbiased method. Applying mirTarPri to prioritize the results of six commonly used target prediction methods allowed us to find more positive targets at the top of the prioritized candidate list. In comparison with other methods, mirTarPri had an outstanding performance on gold standard and CLIP data. mirTarPri is a valuable method for improving the efficacy of current miRNA target prediction methods. We have also developed a web-based server implementing the mirTarPri method, which is freely accessible at
PMCID: PMC3541237  PMID: 23326485
12.  End-Stage Liver Disease Candidates at the Highest MELD Scores Have Higher Wait-list Mortality than Status-1A Candidates 
Hepatology (Baltimore, Md.)  2011;55(1):192-198.
Candidates with fulminant hepatic failure (Status-1A) receive the highest priority for liver transplantation (LT) in the United States. However, no studies have compared wait-list mortality risk among end-stage liver disease (ESLD) candidates with high Model for End-stage Liver Disease (MELD) scores to those listed as Status-1A. We aimed to determine if there are MELD scores for ESLD candidates at which their wait-list mortality risk is higher than that of Status-1A, and to identify the factors predicting wait-list mortality among Status-1A.
Data were obtained from the Scientific Registry of Transplant Recipients for adult LT candidates (n=52,459) listed between 09/01/2001 and 12/31/2007. Candidates listed for repeat LT as Status-1A were excluded. Starting from the date of wait-listing, candidates were followed for 14 days or until the earliest of death, transplant, or granting of an exception MELD score. ESLD candidates were categorized by MELD score, with a separate category for those with calculated MELD >40. We compared wait-list mortality between each MELD category and Status-1A (reference) using time-dependent Cox regression.
ESLD candidates with MELD >40 had almost twice the wait-list mortality risk of Status-1A candidates, with a covariate-adjusted hazard ratio of 1.96 (p=0.004). There was no difference in wait-list mortality risk between candidates with MELD 36–40 and Status-1A, while candidates with MELD <36 had significantly lower mortality risk than Status-1A candidates. MELD score did not significantly predict wait-list mortality among Status-1A candidates (p=0.18). Among Status-1A candidates with acetaminophen toxicity, however, MELD was a significant predictor of wait-list mortality (p<0.0009). Post-transplant survival was similar for Status-1A and ESLD candidates with MELD >20 (p=0.6).
Candidates with MELD >40 have significantly higher wait-list mortality than, and similar post-transplant survival to, Status-1A candidates, and should therefore be assigned higher priority than Status-1A for allocation. Since ESLD candidates with MELD 36–40 and Status-1A candidates have similar wait-list mortality risk and post-transplant survival, these candidates should be assigned similar rather than sequential priority for deceased donor LT.
PMCID: PMC3235236  PMID: 21898487
decompensated end-stage liver disease; fulminant hepatic failure; model for end-stage liver disease; Status-1A; Status-1B; survival
13.  A gene signature based method for identifying subtypes and subtype-specific drivers in cancer with an application to medulloblastoma 
BMC Bioinformatics  2013;14(Suppl 18):S1.
Subtypes are widely found in cancer. They are characterized by different behaviors in clinical and molecular profiles, such as survival rates, gene signatures and copy number aberrations (CNAs). While cancer is generally believed to be caused by genetic aberrations, the number of such events in cancer tissue is tremendous and only a small subset of them may be tumorigenic. On the other hand, the gene expression signature of a subtype represents residuals of the subtype-specific cancer mechanisms. Using high-throughput data to link these factors, to define subtype boundaries and to identify subtype-specific drivers is a promising yet largely unexplored topic.
We report a systematic method to automate the identification of cancer subtypes and candidate drivers. Specifically, we propose an iterative algorithm that alternates between gene expression clustering and gene signature selection. We applied the method to datasets of the pediatric cerebellar tumor medulloblastoma (MB). The subtyping algorithm consistently converges on multiple medulloblastoma datasets, and the converged signatures and copy number landscapes are also highly reproducible across the datasets. Based on the identified subtypes, we developed a PCA-based approach for subtype-specific identification of cancer drivers. The top-ranked driver candidates are enriched with known pathways in certain subtypes of MB, which may provide new insights into these subtypes.
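The alternation between expression clustering and signature selection can be sketched in miniature with scikit-learn; the matrix below is a toy two-subtype dataset, not medulloblastoma data, and the selection rule (largest between-cluster mean shift) is a stand-in for the paper's signature criterion:

```python
import numpy as np
from sklearn.cluster import KMeans

# samples x genes; the first two genes separate two subtypes, the rest are flat.
X = np.array([
    [5.0, 5.1, 0.2, 0.1],
    [4.9, 5.0, 0.1, 0.2],
    [5.1, 4.9, 0.0, 0.0],
    [0.1, 0.2, 0.1, 0.2],
    [0.0, 0.1, 0.2, 0.0],
    [0.2, 0.0, 0.0, 0.1],
])

signature = [0, 1, 2, 3]             # start from all genes
for _ in range(3):                   # alternate clustering and selection
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, signature])
    # Reselect the signature: keep the genes with the largest between-cluster shift.
    diff = np.abs(X[labels == 0].mean(axis=0) - X[labels == 1].mean(axis=0))
    signature = sorted(np.argsort(diff)[-2:].tolist())

print(signature)  # converges to the informative genes [0, 1]
```

Once the loop stabilizes, the cluster labels define the subtypes and the surviving genes form the subtype signature.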
This article is an extended abstract of our ICCABS '12 paper (Chen et al. 2012), with revised methods in iterative subtyping, the use of canonical correlation analysis for driver-identification, and an extra dataset (Northcott90 dataset) for cross-validations. Discussions of the algorithm performance and of the slightly different gene lists identified are also added.
Our study indicates that the subtype signature defines the subtype boundaries, characterizes the subtype-specific processes and can be used to prioritize signature-related drivers.
PMCID: PMC3820164  PMID: 24564171
subtypes of cancer; medulloblastoma; gene signature; copy number aberrations; microarrays; driver genes
14.  Investigating the concordance of Gene Ontology terms reveals the intra- and inter-platform reproducibility of enrichment analysis 
BMC Bioinformatics  2013;14:143.
The reliability and reproducibility of differentially expressed genes (DEGs) are essential for the biological interpretation of microarray data. The MicroArray Quality Control (MAQC) project launched by the US Food and Drug Administration (FDA) showed that the lists of DEGs generated by intra- and inter-platform comparisons can reach a high level of concordance, which mainly depends on the statistical criteria used for ranking and selecting DEGs. In general, combining fold-change ranking with a non-stringent p-value cutoff produces reproducible lists of DEGs. For further interpretation of gene expression data, statistical methods of gene enrichment analysis provide powerful tools for associating DEGs with prior biological knowledge, e.g. Gene Ontology (GO) terms and pathways, and are widely used in genome-wide research. Although the DEG lists generated from the same compared conditions proved to be reliable, reproducible enrichment results are still crucial to the discovery of the underlying molecular mechanisms differentiating the two conditions. Therefore, it is important to know whether the enrichment results remain reproducible when using lists of DEGs generated by different statistical criteria from inter-laboratory and cross-platform comparisons. In our study, we used the MAQC data sets to systematically assess the intra- and inter-platform concordance of GO terms enriched by Gene Set Enrichment Analysis (GSEA) and LRpath.
In intra-platform comparisons, the percentage of overlapping enriched GO terms was as high as ~80% when the input lists of DEGs were generated by fold-change ranking and Significance Analysis of Microarrays (SAM), whereas the percentage decreased by about 20% when the lists of DEGs were generated using fold-change ranking and the t-test, or using SAM and the t-test. Similar results were found in inter-platform comparisons.
Our results demonstrate that highly concordant lists of DEGs ensure a high concordance of enrichment results. Importantly, based on lists of DEGs generated by the straightforward method of combining fold-change ranking with a non-stringent p-value cutoff, enrichment analysis produces reproducible enriched GO terms for biological interpretation.
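The overlap percentages reported above reduce to a set intersection once the two lists of enriched GO terms are in hand. A minimal sketch, using intersection over the shorter list (one common convention; the paper's exact definition may differ, and the GO identifiers below are placeholders):

```python
def overlap_percentage(terms_a, terms_b):
    """Share of enriched GO terms common to both lists, expressed as a
    percentage of the shorter list (one possible convention, assumed
    here for illustration)."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))
```

For example, two lists of four and five terms that share two terms score 50%.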
PMCID: PMC3644270  PMID: 23627640
DNA microarray; Intra-/inter-platform comparison; Gene Ontology enrichment; Microarray quality control (MAQC)
15.  Huvariome: a web server resource of whole genome next-generation sequencing allelic frequencies to aid in pathological candidate gene selection 
Next generation sequencing provides clinical research scientists with a direct readout of innumerable variants, including personal, pathological and common benign variants. The aim of resequencing studies is to determine the candidate pathogenic variants from individual genomes, or from family-based or tumor/normal genome comparisons. Whilst the use of appropriate controls within the experimental design will minimize the number of false-positive variants selected, this number can be reduced further with the use of high-quality whole genome reference data to remove false-positive variants prior to candidate gene selection. In addition, the use of platform-related sequencing error models can help in the recovery of ambiguous genotypes from lower-coverage data.
We have developed a whole genome database of human genetic variations, Huvariome, determined by whole genome deep sequencing data with high coverage and low error rates. The database was designed to be sequencing technology independent but is currently populated with 165 individual whole genomes, consisting of small pedigrees and matched tumor/normal samples sequenced with the Complete Genomics sequencing platform. Common variants have been determined for a Benelux population cohort and represented as genotypes alongside the results of two sets of control data (73 of the 165 genomes): Huvariome Core, which comprises 31 healthy individuals from the Benelux region, and a Diversity Panel consisting of 46 healthy individuals representing 10 different populations and 21 samples in three pedigrees. Users can query the database by gene or position via a web interface, and the results are displayed as the frequency of the variations as detected in the datasets. We demonstrate that Huvariome can provide accurate reference allele frequencies to disambiguate sequencing inconsistencies produced in resequencing experiments. Huvariome has been used to support the selection of candidate cardiomyopathy-related genes which have a homozygous genotype in the reference cohorts. The database allows users to see which selected variants are common variants (>5% minor allele frequency) in the Huvariome Core samples, thus aiding the selection of potentially pathogenic variants by filtering out common variants that are not listed in other public genomic variation databases. The no-call rate and the accuracy of allele calling in Huvariome provide the user with the possibility of identifying platform-dependent errors associated with specific regions of the human genome.
Huvariome is a simple-to-use resource for the validation of resequencing results obtained in NGS experiments. The high sequence coverage and low error rates provide scientists with the ability to remove false-positive results from pedigree studies. Results are returned via a web interface that displays location-based genetic variation frequency, impact on protein function, association with known genetic variations, and a quality score of the variant base derived from the Huvariome Core and Diversity Panel data. These results may be used to identify and prioritize rare variants that, for example, might be disease relevant. In testing the accuracy of the Huvariome database, alleles of a selection of ambiguously called coding single nucleotide variants were successfully predicted in all cases. Data protection of individuals is ensured by restricting access to patient-derived genomes to the host institution, which is relevant for future molecular diagnostics.
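The >5% minor-allele-frequency filter described above amounts to a lookup of each candidate against the reference cohort's frequencies. A hypothetical sketch (function names, variant identifiers and data structures are invented, not Huvariome's API):

```python
def filter_common_variants(candidates, cohort_maf, maf_cutoff=0.05):
    """Remove candidate variants whose minor allele frequency (MAF) in a
    reference cohort exceeds the cutoff; variants absent from the cohort
    are kept as potentially rare. All names here are illustrative."""
    return [v for v in candidates if cohort_maf.get(v, 0.0) <= maf_cutoff]
```

A variant common in the reference cohort is discarded, while rare or unobserved variants survive for downstream candidate gene selection.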
PMCID: PMC3549785  PMID: 23164068
Medical genetics; Medical genomics; Whole genome sequencing; Allele frequency; Cardiomyopathy
16.  Genomic convergence and network analysis approach to identify candidate genes in Alzheimer's disease 
BMC Genomics  2014;15:199.
Alzheimer’s disease (AD) is one of the leading genetically complex and heterogeneous disorders, influenced by both genetic and environmental factors, and its underlying risk factors remain largely unclear. In recent years, high throughput methodologies, such as genome-wide linkage analysis (GWL), genome-wide association (GWA) studies, and genome-wide expression profiling (GWE), have led to the identification of several candidate genes associated with AD. However, due to lack of consistency within their findings, an integrative approach is warranted. Here, we have designed a rank-based gene prioritization approach involving convergent analysis of multi-dimensional data and protein-protein interaction (PPI) network modelling.
Our approach integrates three different AD datasets (GWL, GWA and GWE) to identify overlapping candidate genes, ranked using a novel cumulative rank score (SR) based method and then prioritized using clusters derived from the PPI network. The SR for each gene is calculated by summing the ranks assigned to that gene, based on p-value or score, in the three datasets. This analysis yielded 108 plausible AD genes. Network modelling, by creating a PPI network from the proteins encoded by these genes and their direct interactors, resulted in a layered network of 640 proteins. Clustering of these proteins identified 6 significant clusters, with 7 proteins (EGFR, ACTB, CDC2, IRAK1, APOE, ABCA1 and AMPH) forming the central hub nodes. Functional annotation of the 108 genes revealed their role in several biological activities, such as neurogenesis, regulation of MAP kinase activity, response to calcium ion and endocytosis, paralleling AD-specific attributes. Finally, 3 potential biochemical biomarkers were found from the overlap of the 108 AD proteins with proteins from the CSF and plasma proteomes. EGFR and ACTB were found to be the two most significant AD risk genes.
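As a concrete illustration of the cumulative rank score, the sketch below ranks genes by p-value within each dataset and sums the ranks. The p-values are invented and the abstract also allows ranking by score, so treat this as a minimal reading of the SR definition rather than the authors' code:

```python
def cumulative_rank_score(datasets):
    """datasets: list of {gene: p_value} dicts, one per platform.
    Returns {gene: SR} for genes shared by all datasets, where SR is the
    sum of the gene's within-dataset ranks (rank 1 = smallest p-value),
    so a smaller SR means more consistent top ranking across platforms."""
    common = set.intersection(*(set(d) for d in datasets))
    scores = dict.fromkeys(common, 0)
    for d in datasets:
        for rank, g in enumerate(sorted(common, key=d.get), start=1):
            scores[g] += rank
    return scores
```

A gene ranked first in all three datasets attains the minimum possible SR of 3.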
With the assumption that common genetic signals obtained from different methodological platforms may serve as more robust AD risk markers than candidates identified using a single-dimension approach, we demonstrate an integrated genomic convergence approach for disease candidate gene prioritization from heterogeneous data sources linked to AD.
PMCID: PMC4028079  PMID: 24628925
Gene prioritization; Protein-protein interaction; Clustering; Functional annotation
17.  Integrated Assessment of Genomic Correlates of Protein Evolutionary Rate 
PLoS Computational Biology  2009;5(6):e1000413.
Rates of evolution differ widely among proteins, but the causes and consequences of such differences remain under debate. With the advent of high-throughput functional genomics, it is now possible to rigorously assess the genomic correlates of protein evolutionary rate. However, dissecting the correlations among evolutionary rate and these genomic features remains a major challenge. Here, we use an integrated probabilistic modeling approach to study genomic correlates of protein evolutionary rate in Saccharomyces cerevisiae. We measure and rank degrees of association between (i) an approximate measure of protein evolutionary rate with high genome coverage, and (ii) a diverse list of protein properties (sequence, structural, functional, network, and phenotypic). We observe, among many statistically significant correlations, that slowly evolving proteins tend to be regulated by more transcription factors, deficient in predicted structural disorder, involved in characteristic biological functions (such as translation), biased in amino acid composition, and are generally more abundant, more essential, and enriched for interaction partners. Many of these results are in agreement with recent studies. In addition, we assess information contribution of different subsets of these protein properties in the task of predicting slowly evolving proteins. We employ a logistic regression model on binned data that is able to account for intercorrelation, non-linearity, and heterogeneity within features. Our model considers features both individually and in natural ensembles (“meta-features”) in order to assess joint information contribution and degree of contribution independence. Meta-features based on protein abundance and amino acid composition make strong, partially independent contributions to the task of predicting slowly evolving proteins; other meta-features make additional minor contributions. 
The combination of all meta-features yields predictions comparable to those based on paired species comparisons, and approaching the predictive limit of optimal lineage-insensitive features. Our integrated assessment framework can be readily extended to other correlational analyses at the genome scale.
Author Summary
Proteins encoded within a given genome are known to evolve at drastically different rates. Through recent large-scale studies, researchers have measured a wide variety of properties for all proteins in yeast. We are interested to know how these properties relate to one another and to what extent they explain evolutionary rate variation. Protein properties are a heterogeneous mix, a factor which complicates research in this area. For example, some properties (e.g., protein abundance) are numerical, while others (e.g., protein function) are descriptive; protein properties may also suffer from noise and hidden redundancies. We have addressed these issues within a flexible and robust statistical framework. We first ranked a large list of protein properties by the strength of their relationships with evolutionary rate; this confirms many known evolutionary relationships and also highlights several new ones. Similar protein properties were then grouped and applied to predict slowly evolving proteins. Some of these groups were as effective as paired species comparison in making correct predictions, although in both cases a great deal of evolutionary rate variation remained to be explained. Our work has helped to refine the set of protein properties that researchers should consider as they investigate the mechanisms underlying protein evolution.
PMCID: PMC2688033  PMID: 19521505
18.  Integrative Data Mining Highlights Candidate Genes for Monogenic Myopathies 
PLoS ONE  2014;9(10):e110888.
Inherited myopathies are a heterogeneous group of disabling disorders with still barely understood pathological mechanisms. Around 40% of afflicted patients remain without a molecular diagnosis after exclusion of known genes. The advent of high-throughput sequencing has opened avenues to the discovery of new implicated genes, but a working list of prioritized candidate genes is necessary to deal with the complexity of analyzing large-scale sequencing data. Here we used an integrative data mining strategy to analyze the genetic network linked to myopathies, derive specific signatures for inherited myopathy and related disorders, and identify and rank candidate genes for these groups. Training sets of genes were selected after literature review and used in Manteia, a public web-based data mining system, to extract disease group signatures in the form of enriched descriptor terms, which include functional annotation, human and mouse phenotypes, as well as biological pathways and protein interactions. These specific signatures were then used as an input to mine and rank candidate genes, followed by filtration against skeletal muscle expression and association with known diseases. The signatures and identified candidate genes highlight both potential common pathological mechanisms and allelic disease groups. Recent discoveries of gene associations to diseases, like B3GALNT2, GMPPB and B3GNT1 to congenital muscular dystrophies, were prioritized in the ranked lists, suggesting a posteriori validation of our approach and predictions. We show an example of how the ranked lists can be used to help analyze high-throughput sequencing data to identify candidate genes, and highlight the best candidate genes matching genomic regions linked to myopathies without known causative genes. This strategy can be automated to generate fresh candidate gene lists, helping to cope with database annotation updates as new knowledge is incorporated.
PMCID: PMC4213015  PMID: 25353622
19.  Factors affecting reproducibility between genome-scale siRNA-based screens 
Journal of biomolecular screening  2010;15(7):735-747.
RNA interference-based screening is a powerful genomic technology that addresses gene function en masse. To evaluate factors influencing hit list composition and reproducibility, we performed two identically designed small interfering RNA (siRNA)-based, whole genome screens for host factors supporting yellow fever virus infection. These screens represent two separate experiments completed five months apart and allow the direct assessment of the reproducibility of a given siRNA technology when performed in the same environment. Candidate hit lists generated by sum rank, median absolute deviation, z-score, and strictly standardized mean difference were compared within and between whole genome screens. Application of these analysis methodologies within a single screening dataset, using a fixed threshold equivalent to a p-value ≤ 0.001, resulted in hit lists ranging from 82 to 1,140 members, highlighting the tremendous impact the analysis methodology has on hit list composition. Intra- and inter-screen reproducibility was significantly influenced by the analysis methodology and ranged from 32% to 99%. This study also highlighted the power of testing at least two independent siRNAs for each gene product in primary screens. To facilitate validation, we conclude by suggesting methods to reduce false discovery at the primary screening stage.
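Of the hit-selection statistics compared above, the robust z-score built on the median absolute deviation (MAD) is easy to sketch. The 1.4826 consistency constant and the symmetric threshold are common choices assumed here; the study's exact variant may differ:

```python
import statistics

def robust_z_scores(values):
    """Score each well as (x - median) / (1.4826 * MAD); the constant
    makes the MAD estimate the standard deviation for normal data."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return [(x - med) / (1.4826 * mad) for x in values]

def select_hits(values, threshold=3.0):
    """Indices of wells whose robust z-score magnitude crosses the cutoff."""
    return [i for i, z in enumerate(robust_z_scores(values))
            if abs(z) >= threshold]
```

Because the median and MAD are insensitive to outliers, a single strong well stands out even when it would inflate a mean-and-standard-deviation z-score.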
In this study we present the first comprehensive comparison of multiple analysis strategies, and demonstrate the impact of the analysis methodology on the composition of the “hit list”. Therefore, we propose that the entire dataset derived from functional genome-scale screens, especially if publicly funded, should be made available as is done with data derived from gene expression and genome-wide association studies.
PMCID: PMC3149892  PMID: 20625183
RNA interference; analysis; RNAi screen analysis; siRNA; RNAi; siRNA screening; sum rank; median absolute deviation; strictly standardized mean difference; genome-wide; whole-genome; comparison; overlap; hit list
20.  64-Slice Computed Tomographic Angiography for the Diagnosis of Intermediate Risk Coronary Artery Disease 
Executive Summary
In July 2009, the Medical Advisory Secretariat (MAS) began work on Non-Invasive Cardiac Imaging Technologies for the Diagnosis of Coronary Artery Disease (CAD), an evidence-based review of the literature surrounding different cardiac imaging modalities to ensure that appropriate technologies are accessed by patients suspected of having CAD. This project came about when the Health Services Branch at the Ministry of Health and Long-Term Care asked MAS to provide an evidentiary platform on effectiveness and cost-effectiveness of non-invasive cardiac imaging modalities.
After an initial review of the strategy and consultation with experts, MAS identified five key non-invasive cardiac imaging technologies for the diagnosis of CAD. Evidence-based analyses have been prepared for each of these five imaging modalities: cardiac magnetic resonance imaging, single photon emission computed tomography, 64-slice computed tomographic angiography, stress echocardiography, and stress echocardiography with contrast. For each technology, an economic analysis was also completed (where appropriate). A summary decision analytic model was then developed to encapsulate the data from each of these reports (available on the OHTAC and MAS website).
The Non-Invasive Cardiac Imaging Technologies for the Diagnosis of Coronary Artery Disease series is made up of the following reports, which can be publicly accessed at the MAS website:
Single Photon Emission Computed Tomography for the Diagnosis of Coronary Artery Disease: An Evidence-Based Analysis
Stress Echocardiography for the Diagnosis of Coronary Artery Disease: An Evidence-Based Analysis
Stress Echocardiography with Contrast for the Diagnosis of Coronary Artery Disease: An Evidence-Based Analysis
64-Slice Computed Tomographic Angiography for the Diagnosis of Coronary Artery Disease: An Evidence-Based Analysis
Cardiac Magnetic Resonance Imaging for the Diagnosis of Coronary Artery Disease: An Evidence-Based Analysis
Please note that two related evidence-based analyses of non-invasive cardiac imaging technologies for the assessment of myocardial viability are also available on the MAS website:
Positron Emission Tomography for the Assessment of Myocardial Viability: An Evidence-Based Analysis
Magnetic Resonance Imaging for the Assessment of Myocardial Viability: an Evidence-Based Analysis
The Toronto Health Economics and Technology Assessment Collaborative has also produced an associated economic report entitled:
The Relative Cost-effectiveness of Five Non-invasive Cardiac Imaging Technologies for Diagnosing Coronary Artery Disease in Ontario [Internet]. Available from:
The objective of this report is to determine the accuracy of computed tomographic angiography (CTA) compared to the more invasive option of coronary angiography (CA) in the detection of coronary artery disease (CAD) in stable (non-emergent) symptomatic patients.
CT Angiography
CTA is a cardiac imaging test that assesses the presence or absence, as well as the extent, of coronary artery stenosis for the diagnosis of CAD. As such, it is a test of cardiac structure and anatomy, in contrast to the other cardiac imaging modalities, which assess cardiac function. It is, however, unclear whether cardiac structural features alone, in the absence of cardiac function information, are sufficient to determine the presence or absence of CAD in patients at intermediate pretest risk.
CTA technology is changing rapidly with increasing scan speeds and anticipated reductions in radiation exposure. Initial scanners based on 4, 8, 16, 32, and 64 slice machines have been available since the end of 2004. Although 320-slice machines are now available, these are not widely diffused and the existing published evidence is specific to 64-slice scanners. In general, CTA allows for 3-dimensional (3D) viewing of the coronary arteries derived from software algorithms of 2-dimensional (2D) images.
The advantage of CTA over CA, the gold standard for the diagnosis of CAD, is that it is less invasive and may serve as a test for determining which patients are best suited for CA. CA requires insertion of a catheter through an artery in the arm or leg up to the area being studied, although both tests involve contrast agents and radiation exposure. Therefore, identifying the patients for whom CTA or CA is more appropriate may help to avoid more invasive tests, treatment delays, and unnecessary radiation exposure. The main advantage of CA, however, is that treatment can be administered in the same session as the test procedure; as such, it is recommended for patients with a pre-test probability of CAD of ≥80%. Progressing directly to the more invasive CA allows for diagnosis and treatment in one session, without the added radiation exposure from a previous CTA.
The visibility of arteries in CTA images is best in populations with a disease prevalence, or pre-test probabilities of CAD, of 40% to 80%, beyond which patients are considered at high pre-test probability. Visibility decreases with increasing prevalence as arteries become increasingly calcified (coronary artery calcification is based on the Agaston score). Such higher risk patients are not candidates for the less invasive diagnostic procedures and should proceed directly to CA, where treatment can be administered in conjunction with the test itself, while bypassing the radiation exposure from CTA.
CTA requires the addition of an iodinated contrast, which can be administered only in patients with sufficient renal function (creatinine levels >30 micromoles/litre) to allow for the clearing of the contrast from the body. In some cases, the contrast is administered in patients with creatinine levels less than 30 micromoles/litre.
A second important criterion for the administration of CTA is patient heart rate, which should be less than 65 beats/min for single source CTA machines and less than 80 beats/min for dual source machines. To decrease heart rates to these levels, beta-blockers are often required. Although the accuracy of the two machine types does not differ, the dual source machines can be utilized in a higher proportion of patients, accommodating heart rates of up to 80 beats/min. Approximately 10% of patients are considered ineligible for CTA because of this inability to decrease heart rates to the required levels. Additional contraindications include renal insufficiency as described above and atrial fibrillation, with approximately 10% of intermediate risk patients ineligible for CTA due to these contraindications. The duration of the procedure may be between 1 and 1.5 hours, with about 15 minutes for the CTA and the remaining time for the preparation of the patient.
CTA is licensed by Health Canada as a Class III device. Currently, two companies have licenses for 64-slice CT scanners, Toshiba Medical Systems Corporation (License 67604) and Philips Medical Systems (License 67599 and 73260).
Research Questions
How does the accuracy of CTA compare to the more invasive CA in the diagnosis of CAD in symptomatic patients at intermediate risk of the disease?
How does the accuracy for CTA compare to other modalities in the detection of CAD?
Research Methods
Literature Search
A literature search was performed on July 20, 2009 using OVID MEDLINE, MEDLINE In-Process and Other Non-Indexed Citations, EMBASE, the Cumulative Index to Nursing & Allied Health Literature (CINAHL), the Cochrane Library, and the International Network of Agencies for Health Technology Assessment (INAHTA) for studies published from January 1, 2004 until July 20, 2009. Abstracts were reviewed by a single reviewer and, for those studies meeting the eligibility criteria, full-text articles were obtained. Reference lists were also examined for any relevant studies not identified through the search. The quality of evidence was assessed as high, moderate, low or very low according to GRADE methodology.
Inclusion Criteria
English language articles and English or French-language HTAs published from January 1, 2004 to July 20, 2009.
Randomized controlled trials (RCTs), non-randomized clinical trials, systematic reviews and meta-analyses.
Studies of symptomatic patients at intermediate pre-test probability of CAD.
Studies of single source CTA compared to CA for the diagnosis of CAD.
Studies in which sensitivity, specificity, negative predictive value (NPV) and positive predictive value (PPV) could be established. HTAs, SRs, clinical trials, observational studies.
Exclusion Criteria
Non-English studies.
Pediatric populations.
Studies of patients at low or high pre-test probability of CAD.
Studies of unstable patients, e.g., emergency room visits, or a prior diagnosis of CAD.
Studies in patients with non-ischemic heart disease.
Studies in which outcomes were not specific to those of interest in this report.
Studies in which CTA was not compared to CA in a stable population.
Outcomes of Interest
CAD defined as ≥50% stenosis.
Coronary angiography.
Measures of Interest
Sensitivity, specificity;
Negative predictive value (NPV), positive predictive value (PPV);
Area under the curve (AUC) and diagnostic odds ratios (DOR).
Results of Literature Search and Evidence-Based Analysis
The literature search yielded two HTAs, the first published by MAS in April 2005, the other from the Belgian Health Care Knowledge Centre published in 2008, as well as three recent non-randomized clinical studies. The three most significant studies concerning the accuracy of CTA versus CA are the CORE-64 study, the ACCURACY trial, and a prospective, multicenter, multivendor study conducted in the Netherlands. Five additional non-randomized studies were extracted from the Belgian Health Technology Assessment (2008).
To provide summary estimates of sensitivity, specificity, area under the SROC curve (AUC) and diagnostic odds ratios (DORs), a meta-analysis of the above-mentioned studies was conducted. Pooled estimates of sensitivity and specificity were 97.7% (95% CI: 95.5%-99.9%) and 78.8% (95% CI: 70.8%-86.8%), respectively. These results indicate that the sensitivity of CTA is almost as good as that of CA, while its specificity is poorer. The diagnostic odds ratio (DOR) was estimated at 157.0 (95% CI: 11.2-302.7) and the AUC was found to be 0.94; however, the inability to provide a confidence interval for the AUC decreased its utility as an outcome measure in this review.
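As a sanity check on the pooled figures above, a DOR can be back-calculated from a sensitivity/specificity pair via the standard identity DOR = [sens/(1 - sens)] / [(1 - spec)/spec]. The sketch below is not the review's own computation, but it roughly reproduces the reported point estimate:

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """Odds of a positive test in diseased vs. non-diseased patients,
    expressed as the ratio of the positive and negative likelihood ratios."""
    positive_lr = sensitivity / (1.0 - specificity)   # LR+
    negative_lr = (1.0 - sensitivity) / specificity   # LR-
    return positive_lr / negative_lr

dor = diagnostic_odds_ratio(0.977, 0.788)  # pooled estimates quoted above
```

The result, about 158, is close to the reported 157.0; exact agreement is not expected, since the review pooled study-level data rather than back-calculating from summary estimates.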
This meta-analysis was limited by the significant heterogeneity between studies for both the pooled sensitivity and specificity (heterogeneity chi-square p=0.000). To minimize these statistical concerns, the analysis was restricted to studies of intermediate risk patients with no previous history of cardiac events. Nevertheless, the underlying prevalence of CAD ranged from 24.8% to 78% between studies, indicating that there was still some variability in the pre-test probabilities of disease within this stable population. The variation in the prevalence of CAD, together with differences in the proportion of calcification, likely affected the specificity directly and the sensitivity indirectly across studies.
In February 2010, the results of the Ontario Multi-detector Computed Tomography Coronary Angiography Study (OMCAS) became available and were thus included in a second meta-analysis of the above studies. The OMCAS was a non-randomized double-blind study conducted in 3 centers in Ontario as a result of a 2005 MAS review requesting an evaluation of the accuracy of 64-slice CTA for CAD detection. Within 10 days of their scheduled CA, all patients received an additional evaluation with CTA. Adding the 117 symptomatic patients with intermediate probability of CAD (10%-90% probability) to the meta-analysis of the above-mentioned studies resulted in a pooled sensitivity of 96.1% (95% CI: 94.0%-98.3%) and pooled specificity of 81.5% (95% CI: 73.0%-89.9%).
Summary of Findings
CTA is almost as good as CA in detecting true positives but poorer in the rate of false positives. The main value of CTA may be in ruling out significant CAD.
Increased prevalence of CAD decreases study specificity; specificity is also decreased in the presence of increased arterial calcification, even in lower-prevalence studies.
Positive CT angiograms may require additional tests such as stress tests or the more invasive CA, partly to identify false positives.
Radiation exposure is an important safety concern that needs to be considered, particularly the cumulative exposures from repeat CTAs.
PMCID: PMC3377576  PMID: 23074388
21.  Use of Non-Steroidal Anti-Inflammatory Drugs That Elevate Cardiovascular Risk: An Examination of Sales and Essential Medicines Lists in Low-, Middle-, and High-Income Countries 
PLoS Medicine  2013;10(2):e1001388.
Patricia McGettigan and David Henry find that, although some non-steroidal anti-inflammatory drugs (NSAIDs) such as diclofenac are known to increase cardiovascular risk, diclofenac is included on 74 countries' essential medicine lists and was the most commonly used NSAID in the 15 countries they evaluated.
Certain non-steroidal anti-inflammatory drugs (NSAIDs) (e.g., rofecoxib [Vioxx]) increase the risk of heart attack and stroke and should be avoided in patients at high risk of cardiovascular events. Rates of cardiovascular disease are high and rising in many low- and middle-income countries. We studied the extent to which evidence on cardiovascular risk with NSAIDs has translated into guidance and sales in 15 countries.
Methods and Findings
Data on the relative risk (RR) of cardiovascular events with individual NSAIDs were derived from meta-analyses of randomised trials and controlled observational studies. Listing of individual NSAIDs on Essential Medicines Lists (EMLs) was obtained from the World Health Organization. NSAID sales or prescription data for 15 low-, middle-, and high-income countries were obtained from Intercontinental Medical Statistics Health (IMS Health) or national prescription pricing audit (in the case of England and Canada). Three drugs (rofecoxib, diclofenac, etoricoxib) ranked consistently highest in terms of cardiovascular risk compared with nonuse. Naproxen was associated with a low risk. Diclofenac was listed on 74 national EMLs, naproxen on just 27. Rofecoxib use was not documented in any country. Diclofenac and etoricoxib accounted for one-third of total NSAID usage across the 15 countries (median 33.2%, range 14.7–58.7%). This proportion did not vary between low- and high-income countries. Diclofenac was by far the most commonly used NSAID, with a market share close to that of the next three most popular drugs combined. Naproxen had an average market share of less than 10%.
Listing of NSAIDs on national EMLs should take account of cardiovascular risk, with preference given to low risk drugs. Diclofenac has a risk very similar to rofecoxib, which was withdrawn from worldwide markets owing to cardiovascular toxicity. Diclofenac should be removed from EMLs.
Please see later in the article for the Editors' Summary
Editors' Summary
Non-steroidal anti-inflammatory drugs (NSAIDs) are among the most widely used drugs. Aspirin, the first NSAID, was developed in 1897 but there are now many different NSAIDs. Some can be bought over-the-counter but others are available only with prescription. NSAIDs can help relieve short- and long-term pain, reduce inflammation (redness and swelling), and reduce high fevers. Common conditions that are treated with NSAIDs include headaches, toothache, back ache, and arthritis. NSAIDs work by stopping a class of enzymes called cyclo-oxygenases (COXs) from making prostaglandins, some of which cause pain and inflammation. Like all drugs, NSAIDs have some unwanted side effects. Because certain prostaglandins protect the stomach lining from the stomach acid that helps to digest food, NSAID use can cause indigestion and stomach ulcers (gastrointestinal complications). In addition, NSAIDs increase the risk of heart attacks and stroke to varying degrees and therefore should be avoided by people at high risk of cardiovascular diseases—conditions that affect the heart and/or blood vessels.
Why Was This Study Done?
Different NSAIDs are associated with different levels of cardiovascular risk. Selective COX-2 inhibitors (e.g., rofecoxib, celecoxib, etoricoxib) generally have fewer stomach-related side effects than non-selective COX inhibitors (e.g., naproxen, ibuprofen, diclofenac). However, some NSAIDs (rofecoxib, diclofenac, etoricoxib) are more likely to cause cardiovascular events than others (e.g., naproxen). When doctors prescribe NSAIDs, they need to consider the patient's risk profile. Particularly for patients with higher risk of cardiovascular events, a doctor should either advise against NSAID use or recommend one that has a relatively low cardiovascular risk. Information on the cardiovascular risk associated with different NSAIDs has been available for several years, but have doctors changed their prescribing of NSAIDs based on the information? This question is of particular concern in low- and middle-income countries where cardiovascular disease is increasingly common. In this study, the researchers investigate the extent to which evidence on the cardiovascular risk associated with different NSAIDs has translated into guidance and sales in 15 low-, middle-, and high-income countries.
What Did the Researchers Do and Find?
The researchers derived data on the relative risk of cardiovascular events associated with individual NSAIDs compared to non-use of NSAIDs from published meta-analyses of randomized trials and observational studies. They obtained information on the NSAIDs recommended in 100 countries from national Essential Medicines Lists (EMLs; essential medicines are drugs that satisfy the priority health care needs of a population). Finally, they obtained information on NSAID sales for 13 countries in the South Asian, Southeast Asian, and Asian Pacific regions and NSAID prescription data for Canada and England. Rofecoxib, diclofenac, and etoricoxib consistently increased cardiovascular risk compared with no NSAIDs. All three had a higher relative risk of cardiovascular events than naproxen in pairwise analyses. Naproxen was associated with the lowest cardiovascular risk. No national EMLs recommended rofecoxib, which was withdrawn from world markets 8 years ago because of its cardiovascular risk. Seventy-four national EMLs listed diclofenac, but only 27 EMLs listed naproxen. Diclofenac was the most commonly used NSAID, with an average market share across the 15 countries of nearly 30%. By contrast, naproxen had an average market share of less than 10%. Finally, across both high- and low-/middle-income countries, diclofenac and etoricoxib accounted for one-third of total NSAID usage.
What Do These Findings Mean?
These findings show that NSAIDs with a higher risk of cardiovascular events are widely used in low-/middle- as well as high-income countries. Diclofenac is the most popular NSAID, despite its higher relative risk of cardiovascular events, which is similar to that of rofecoxib. Diclofenac is also widely listed on EMLs even though information on its higher cardiovascular risk has been available since 2006. In contrast, naproxen, in relative terms one of the safest of the NSAIDs examined, was among the least popular and was listed on a minority of EMLs. Some aspects of the study's design may affect the accuracy of these findings. For example, the researchers did not look at the risk profiles of the patients actually taking NSAIDs. However, given the volume of use of high-risk NSAIDs, it is likely that these drugs are taken by many individuals at high risk of cardiovascular events. Overall, these findings have important implications for public health and, given the wide availability of safer alternatives, the researchers suggest that diclofenac should be removed from national EMLs and that its marketing authorizations should be revoked globally.
Additional Information
Please access these Web sites via the online version of this summary at 10.1371/journal.pmed.1001388.
This study is further discussed in a PLOS Medicine Perspective by K. Srinath Reddy and Ambuj Roy
The UK National Health Service Choices website provides detailed information on NSAIDs
MedlinePlus provides information about aspirin, ibuprofen, naproxen, and diclofenac; it also provides links to other information about pain relievers (in English and Spanish)
The American Heart Association has information on cardiovascular disease; Can Patients With Cardiovascular Disease Take Nonsteroidal Antiinflammatory Drugs? is a Cardiology Patient Page in the AHA journal Circulation
The British Heart Foundation also provides information about cardiovascular disease and has a factsheet on NSAIDs and cardiovascular disease
The World Health Organization has a fact sheet on essential medicines; the WHO Model List of Essential Medicines (in English and French), and national EMLs are available
PMCID: PMC3570554  PMID: 23424288
22.  Prioritizing Genomic Drug Targets in Pathogens: Application to Mycobacterium tuberculosis 
PLoS Computational Biology  2006;2(6):e61.
We have developed a software program that weights and integrates specific properties on the genes in a pathogen so that they may be ranked as drug targets. We applied this software to produce three prioritized drug target lists for Mycobacterium tuberculosis, the causative agent of tuberculosis, a disease for which a new drug is desperately needed. Each list is based on an individual criterion. The first list prioritizes metabolic drug targets by the uniqueness of their roles in the M. tuberculosis metabolome (“metabolic chokepoints”) and their similarity to known “druggable” protein classes (i.e., classes whose activity has previously been shown to be modulated by binding a small molecule). The second list prioritizes targets that would specifically impair M. tuberculosis, by weighting heavily those that are closely conserved within the Actinobacteria class but lack close homology to the host and gut flora. M. tuberculosis can survive asymptomatically in its host for many years by adapting to a dormant state referred to as “persistence.” The final list aims to prioritize potential targets involved in maintaining persistence in M. tuberculosis. The rankings of current, candidate, and proposed drug targets are highlighted with respect to these lists. Some features were found to be more accurate than others in prioritizing studied targets. It can also be shown that targets can be prioritized by using evolutionary programming to optimize the weights of each desired property. We demonstrate this approach in prioritizing persistence targets.
The search for drugs to prevent or treat infections remains an urgent focus in infectious disease research. The authors of this article have developed a new software program that can be used to rank genes as potential drug targets in pathogens. Traditional prioritization approaches to drug target identification, such as searching the literature and trying to mentally integrate varied criteria, can quickly become overwhelming for the drug discovery researcher. Alternatively, one can computationally integrate different criteria to create a ranking function that helps to identify targets. The authors demonstrate the applicability of this approach on the genome of Mycobacterium tuberculosis, the organism that causes tuberculosis (TB), a disease for which new drug treatments are especially needed because of emerging drug-resistant strains. The experiences gained from this work will be useful for both wet-lab and informatics scientists working in infectious disease research: first, it demonstrates that ample public data already exist on the M. tuberculosis genome that can be tuned effectively for prioritizing drug targets. Second, the output from numerous freely available bioinformatics tools can be integrated to achieve these goals. Third, the methodology can easily be extended to other pathogens of interest. Currently studied TB targets are also highlighted in terms of the authors' ranking system, which should be useful for researchers focusing on TB drug discovery.
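The weighted integration described above can be illustrated with a minimal sketch (this is not the authors' software): each gene gets a combined score as a weighted sum of normalized property scores, and genes are ranked by that score. The property names, values, and weights below are hypothetical illustrations.

```python
# Minimal sketch of ranking genes by a weighted sum of property scores.
# Property names and weights are hypothetical, not the authors' actual criteria.

def rank_targets(genes, weights):
    """genes: {name: {property: score in [0, 1]}}; weights: {property: float}.
    Returns (name, combined_score) pairs, best-ranked first."""
    scored = {
        name: sum(weights.get(prop, 0.0) * val for prop, val in props.items())
        for name, props in genes.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

genes = {
    "geneA": {"chokepoint": 1.0, "druggability": 0.8, "host_homology": 0.1},
    "geneB": {"chokepoint": 0.2, "druggability": 0.9, "host_homology": 0.7},
}
# A negative weight penalizes close homology to host proteins.
weights = {"chokepoint": 1.0, "druggability": 1.0, "host_homology": -1.0}
ranking = rank_targets(genes, weights)
```

Optimizing the weight vector itself (as the authors do with evolutionary programming) would wrap this scoring function in a search over candidate weight settings.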
PMCID: PMC1475714  PMID: 16789813
23.  Whole genome identification of Mycobacterium tuberculosis vaccine candidates by comprehensive data mining and bioinformatic analyses 
BMC Medical Genomics  2008;1:18.
Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), infects ~8 million people annually, resulting in ~2 million deaths. Moreover, about one third of the world's population is latently infected, 10% of whom develop disease during their lifetime. Currently approved prophylactic TB vaccines (BCG and derivatives thereof) are of variable efficacy in protecting adults against pulmonary TB (0%–80%), and are directed essentially against early phase infection.
A genome-scale dataset was constructed by analyzing published data of: (1) global gene expression studies under conditions which simulate intra-macrophage stress, dormancy, persistence and/or reactivation; (2) cellular and humoral immunity, and vaccine potential. This information was compiled along with revised annotation/bioinformatic characterization of selected gene products and in silico mapping of T-cell epitopes. Protocols for scoring, ranking and prioritization of the antigens were developed and applied.
Cross-matching of literature and in silico-derived data, in conjunction with the prioritization scheme and biological rationale, allowed for selection of 189 putative vaccine candidates from the entire genome. Within the 189 set, the relative distribution of antigens in 3 functional categories differs significantly from their distribution in the whole genome, with reduction in the Conserved hypothetical category (due to improved annotation) and enrichment in the Lipid and Virulence categories. Other prominent representatives in the 189 set are the PE/PPE proteins; iron sequestration, nitroreductases and proteases, all within the Intermediary metabolism and respiration category; and ESX secretion systems, resuscitation promoting factors and lipoproteins, all within the Cell wall category. Application of a ranking scheme based on qualitative and quantitative scores resulted in a list of 45 best-scoring antigens, of which: 74% belong to the dormancy/reactivation/resuscitation classes; 30% belong to the Cell wall category; 13% are classical vaccine candidates; 9% are categorized Conserved hypotheticals, all potentially very potent T-cell antigens.
The comprehensive literature and in silico-based analyses allowed for the selection of a repertoire of 189 vaccine candidates, out of the whole-genome 3989 ORF products. This repertoire, which was ranked to generate a list of 45 top-hits antigens, is a platform for selection of genes covering all stages of M. tuberculosis infection, to be incorporated in rBCG or subunit-based vaccines.
PMCID: PMC2442614  PMID: 18505592
24.  Gene Prospector: An evidence gateway for evaluating potential susceptibility genes and interacting risk factors for human diseases 
BMC Bioinformatics  2008;9:528.
Millions of single nucleotide polymorphisms have been identified as a result of the human genome project and the rapid advance of high throughput genotyping technology. Genetic association studies, such as recent genome-wide association studies (GWAS), have provided a springboard for exploring the contribution of inherited genetic variation and gene/environment interactions in relation to disease. Given the capacity of such studies to produce a plethora of information that may then be described in a number of publications, selecting possible disease susceptibility genes and identifying related modifiable risk factors is a major challenge. A Web-based application for finding evidence of such relationships is key to the development of follow-up studies and evidence for translational research.
We developed a Web-based application that selects and prioritizes potential disease-related genes by using a highly curated and updated literature database of genetic association studies. The application, called Gene Prospector, also provides a comprehensive set of links to additional data sources.
We compared Gene Prospector results for the query "Parkinson" with a list of 13 leading candidate genes (Top Results) from a curated, specialty database for genetic associations with Parkinson disease (PDGene). Nine of the thirteen leading candidate genes from PDGene were in the top 10th percentile of the ranked list from Gene Prospector. In fact, Gene Prospector included more published genetic association studies for the 13 leading candidate genes than PDGene did.
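The percentile comparison described above reduces to a simple check: given a ranked gene list, count how many candidate genes fall within its top 10th percentile. The sketch below uses hypothetical placeholder gene names, not the actual PDGene candidates.

```python
# Sketch of the benchmark comparison: which candidate genes land in the
# top fraction of a ranked list? Gene names below are placeholders.

def in_top_percentile(ranked, candidates, fraction=0.10):
    """ranked: gene names, best-ranked first. Returns the candidates that
    fall within the top `fraction` of the ranked list."""
    cutoff = max(1, int(len(ranked) * fraction))
    top = set(ranked[:cutoff])
    return [g for g in candidates if g in top]

ranked = [f"gene{i}" for i in range(100)]        # best-ranked first
candidates = ["gene3", "gene7", "gene50"]
hits = in_top_percentile(ranked, candidates)     # gene3 and gene7 qualify
```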
Gene Prospector provides an online gateway for searching for evidence about human genes in relation to diseases, other phenotypes, and risk factors, and provides links to published literature and other online data sources. Gene Prospector can be accessed via .
PMCID: PMC2613935  PMID: 19063745
25.  Prioritization and Evaluation of Depression Candidate Genes by Combining Multidimensional Data Resources 
PLoS ONE  2011;6(4):e18696.
Large-scale and individual genetic studies have suggested numerous susceptibility genes for depression in the past decade without conclusive results. There is a strong need to review and integrate multi-dimensional data for follow-up validation. The present study aimed to apply prioritization procedures to build up an evidence-based candidate genes dataset for depression.
Depression candidate genes were collected in human and animal studies across various data resources. Each gene was scored according to its magnitude of evidence related to depression and was multiplied by a source-specific weight to form a combined score measure. All genes were evaluated through a prioritization system to obtain an optimal weight matrix to rank their relative importance with depression using the combined scores. The resulting candidate gene list for depression (DEPgenes) was further evaluated by a genome-wide association (GWA) dataset and microarray gene expression in human tissues.
A total of 5,055 candidate genes (4,850 genes from human and 387 genes from animal studies, with 182 overlapping) were included from seven data sources. Through the prioritization procedures, we identified 169 DEPgenes, which were significantly more likely to be associated with depression in the GWA dataset (Wilcoxon rank-sum test, p = 0.00005). Additionally, the DEPgenes were more likely than non-DEPgenes to be expressed in human brain or nerve-related tissues, supporting the neurotransmitter and neuroplasticity theories of depression.
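The combined-score idea described above can be sketched simply: each gene's evidence score from a source is multiplied by a source-specific weight, the products are summed across sources, and genes are ranked by the total. The sources, genes, and weights below are hypothetical illustrations; the study additionally optimized its weight matrix, which this sketch does not attempt.

```python
# Sketch of a source-weighted combined score for gene prioritization.
# Source names, genes, and weights are hypothetical examples.

def combined_scores(evidence, source_weights):
    """evidence: {gene: {source: score}}; source_weights: {source: weight}.
    Returns gene names ranked by weighted total, best first."""
    totals = {
        gene: sum(source_weights[src] * score for src, score in per_source.items())
        for gene, per_source in evidence.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

evidence = {
    "SLC6A4": {"gwas": 2.0, "expression": 1.0},
    "HTR2A": {"gwas": 1.0, "expression": 0.5},
}
source_weights = {"gwas": 1.5, "expression": 1.0}  # GWAS evidence weighted higher
ranked = combined_scores(evidence, source_weights)
```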
With comprehensive data collection and curation and the application of an integrative approach, we successfully generated DEPgenes through an effective gene prioritization system. The prioritized DEPgenes are promising candidates for future biological experiments or replication efforts to discover the underlying molecular mechanisms of depression.
PMCID: PMC3071871  PMID: 21494644
