author:("Tong, meida")
1.  Estimating relative noise to signal in DNA microarray data 
A method was proposed for estimating noise relative to signal in microarray data. A signal to noise index, SNI, was defined and used to measure the level of signal compared to the noise contained in two microarray data sets. Simulations were conducted to generate the quantitative relationship between the SNI and its measurement of relative noise. The method was applied to two well known microarray data sets. Relative noise was estimated for both data sets, and the results were consistent with the observations in the original papers, demonstrating the proposed method is reliable for estimating relative noise in microarray data.
PMCID: PMC3983989  PMID: 24001721
microarray; gene expression; noise; signal; estimation
2.  Perspectives on Validation of High-Throughput Assays Supporting 21st Century Toxicity Testing
ALTEX  2013;30(1):51-56.
In vitro, high-throughput screening (HTS) assays are seeing increasing use in toxicity testing. HTS assays can simultaneously test many chemicals, but have seen limited use in the regulatory arena, in part because of the need to undergo rigorous, time-consuming formal validation. Here we discuss streamlining the validation process, specifically for prioritization applications in which HTS assays are used to identify a high-concern subset of a collection of chemicals. The high-concern chemicals could then be tested sooner rather than later in standard guideline bioassays. The streamlined validation process would continue to ensure the reliability and relevance of assays for this application. We discuss the following practical guidelines: (1) follow current validation practice to the extent possible and practical; (2) make increased use of reference compounds to better demonstrate assay reliability and relevance; (3) deemphasize the need for cross-laboratory testing; and (4) implement a web-based, transparent and expedited peer review process.
PMCID: PMC3934015  PMID: 23338806
Validation; in vitro; high-throughput screening
3.  Chemometric Analysis for Identification of Botanical Raw Materials for Pharmaceutical Use: A Case Study Using Panax notoginseng 
PLoS ONE  2014;9(1):e87462.
The overall control of the quality of botanical drugs starts from the botanical raw material, continues through preparation of the botanical drug substance and culminates with the botanical drug product. Chromatographic and spectroscopic fingerprinting has been widely used as a tool for the quality control of herbal/botanical medicines. However, discussions are still ongoing on whether a single technique provides adequate information to control the quality of botanical drugs. In this study, high performance liquid chromatography (HPLC), ultra performance liquid chromatography (UPLC), capillary electrophoresis (CE) and near infrared spectroscopy (NIR) were used to generate fingerprints of different plant parts of Panax notoginseng. The power of these chromatographic and spectroscopic techniques to evaluate the identity of botanical raw materials was further compared and investigated in light of their capability to distinguish different parts of Panax notoginseng. Principal component analysis (PCA) and clustering results showed that samples were classified better when UPLC- and HPLC-based fingerprints were employed, which suggested that UPLC- and HPLC-based fingerprinting are superior to CE- and NIR-based fingerprinting. The UPLC- and HPLC-based fingerprinting with PCA were able to correctly distinguish between samples sourced from rhizomes and main root. Using chemometrics and its ability to distinguish between different plant parts could be a powerful tool to help assure the identity and quality of the botanical raw materials and to support the safety and efficacy of the botanical drug products.
PMCID: PMC3909187  PMID: 24498109
4.  A systems approach for analysis of high content screening assay data with topic modeling 
BMC Bioinformatics  2013;14(Suppl 14):S11.
High Content Screening (HCS) has become an important tool for toxicity assessment, partly due to its advantage of handling multiple measurements simultaneously. This approach has provided insight and contributed to the understanding of systems biology at the cellular level. To fully realize this potential, the simultaneously measured multiple endpoints from a live cell should be considered in a probabilistic relationship to assess the cell's condition in response to stress from a treatment, which poses a great challenge in extracting hidden knowledge and relationships from these measurements.
In this work, we applied a text mining method, Latent Dirichlet Allocation (LDA), to analyze cellular endpoints from in vitro HCS assays and related the findings to in vivo histopathological observations. We measured multiple HCS assay endpoints for 122 drugs. Since LDA requires the data to be represented in document-term format, we first converted the continuous values of the measurements to word frequencies that can be processed by the text mining tool. For each of the drugs, we generated a document for each of the 4 time points. Thus, we ended with 488 documents (drug-hour), each having different values for the 10 endpoints, which are treated as words. We extracted three topics using LDA and examined these to identify diagnostic topics for 45 common drugs also present in the in vivo experiments of the Japanese Toxicogenomics Project (TGP), observing their necrosis findings at 6 and 24 hours after treatment.
We found that assay endpoints assigned to particular topics were in concordance with the histopathology observed. Drugs showing necrosis at 6 hours were linked to severe damage events such as Steatosis, DNA Fragmentation, Mitochondrial Potential, and Lysosome Mass. DNA Damage and Apoptosis were associated with drugs causing necrosis at 24 hours, suggesting an interplay of the two pathways in these drugs. Drugs with no sign of necrosis were related to the Cell Loss and Nuclear Size assays, which is suggestive of hepatocyte regeneration.
The evidence from this study suggests that topic modeling with LDA can enable us to interpret relationships of endpoints of in vitro assays along with an in vivo histological finding, necrosis. Effectiveness of this approach may add substantially to our understanding of systems biology.
PMCID: PMC3851019  PMID: 24267543
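The document-term conversion described in this abstract (one document per drug-hour, endpoints treated as words) can be sketched as follows. This is a minimal illustration only: the abstract does not say how continuous values were mapped to word frequencies, so the scaling rule, the `scale` parameter, the function names, and the endpoint names are all assumptions.

```python
from collections import Counter


def to_word_counts(measurements, scale=10):
    """Turn endpoint values into integer pseudo word counts.

    Assumption: values are already normalized to be non-negative;
    `scale` is a hypothetical multiplier, not taken from the paper.
    """
    counts = Counter()
    for endpoint, value in measurements.items():
        counts[endpoint] = max(0, round(value * scale))
    return counts


def build_documents(data):
    """Build one bag-of-words document per (drug, hour) pair."""
    return {
        f"{drug}-{hour}h": to_word_counts(values)
        for (drug, hour), values in data.items()
    }
```

With 122 drugs and 4 time points, `build_documents` would return the 488 drug-hour documents mentioned above, each ready to be passed to an LDA implementation as a term-count vector.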
5.  Homology modeling, molecular docking, and molecular dynamics simulations elucidated α-fetoprotein binding modes 
BMC Bioinformatics  2013;14(Suppl 14):S6.
An important mechanism of endocrine activity is chemicals entering target cells via transport proteins and then interacting with hormone receptors such as the estrogen receptor (ER). α-Fetoprotein (AFP) is a major transport protein in rodent serum that can bind and sequester estrogens, thus preventing their entry into the target cell, where they could otherwise induce ER-mediated endocrine activity. Recently, we reported rat AFP binding affinities for a large set of structurally diverse chemicals, including 53 binders and 72 non-binders. However, the lack of three-dimensional (3D) structures of rat AFP hinders further understanding of the structural dependence for binding. Therefore, a 3D structure of rat AFP was built using homology modeling in order to elucidate rat AFP-ligand binding modes through docking analyses and molecular dynamics (MD) simulations.
Homology modeling was first applied to build a 3D structure of rat AFP. Molecular docking and Molecular Mechanics-Generalized Born Surface Area (MM-GBSA) scoring were then used to examine potential rat AFP ligand binding modes. MD simulations and free energy calculations were performed to refine models of binding modes.
A rat AFP tertiary structure was first obtained using homology modeling and MD simulations. The rat AFP-ligand binding modes of 13 structurally diverse, representative binders were calculated using molecular docking, (MM-GBSA) ranking and MD simulations. The key residues for rat AFP-ligand binding were postulated through analyzing the binding modes.
The optimized 3D rat AFP structure and associated ligand binding modes shed light on rat AFP-ligand binding interactions that, in turn, provide a means to estimate binding affinity of unknown chemicals. Our results will assist in the evaluation of the endocrine disruption potential of chemicals.
PMCID: PMC3851483  PMID: 24266910
6.  Dissecting the Characteristics and Dynamics of Human Protein Complexes at Transcriptome Cascade Using RNA-Seq Data 
PLoS ONE  2013;8(6):e66521.
Human protein complexes play crucial roles in various biological processes as functional modules. However, the expression features of human protein complexes at the transcriptome cascade are poorly understood. Here, we used RNA-Seq data from 16 disparate tissues and four types of human cancers to explore the characteristics and dynamics of human protein complexes. We observed that many individual components of human protein complexes can be generated by multiple distinct transcripts. Similar to yeast, human protein complex constituents are inclined to co-express in diverse tissues. The dominant isoforms of the genes involved in protein complexes tend to encode the complex constituents in each tissue. Our results indicate that the protein complex dynamics not only correlate with the presence or absence of complexes, but may also be related to the major isoform switching for complex subunits. Between any two cancers of breast, colon, lung and prostate, we found that only a few of the differentially expressed transcripts associated with complexes were identical, but 5–10 times more protein complexes involved in differentially expressed transcripts were common. Collectively, our study reveals novel properties and dynamics of human protein complexes at the transcriptome cascade in diverse normal tissues and different cancers.
PMCID: PMC3688907  PMID: 23824284
7.  Investigating drug repositioning opportunities in FDA drug labels through topic modeling 
BMC Bioinformatics  2012;13(Suppl 15):S6.
Drug repositioning offers an opportunity to revitalize the slowing drug discovery pipeline by finding new uses for currently existing drugs. Our hypothesis is that drugs sharing similar side effect profiles are likely to be effective for the same disease, and thus repositioning opportunities can be identified by finding drug pairs with similar side effects documented in U.S. Food and Drug Administration (FDA) approved drug labels. The safety information in the drug labels is usually obtained in clinical trials and augmented with observations from the post-market use of the drug. Therefore, our drug repositioning approach can take advantage of more comprehensive safety information compared with the conventional de novo approach.
A probabilistic topic model was constructed based on the terms in the Medical Dictionary for Regulatory Activities (MedDRA) that appeared in the Boxed Warning, Warnings and Precautions, and Adverse Reactions sections of the labels of 870 drugs. Fifty-two unique topics, each containing a set of terms, were identified by using topic modeling. The resulting probabilistic topic associations were used to measure the distance (similarity) between drugs. The success of the proposed model was evaluated by comparing a drug and its nearest neighbor (i.e., a drug pair) for common indications found in the Indications and Usage Section of the drug labels.
Given a drug with more than three indications, the model yielded a 75% recall, meaning 75% of drug pairs shared one or more common indications. This is significantly higher than the 22% recall rate achieved by random selection. Additionally, the recall rate grows rapidly as the number of drug indications increases and reaches 84% for drugs with 11 indications. The analysis also demonstrated that 65 drugs with a Boxed Warning, which indicates significant risk of serious and possibly life-threatening adverse effects, might be replaced with safer alternatives that do not have a Boxed Warning. In addition, we identified two therapeutic groups of drugs (Musculo-skeletal system and Anti-infective for systemic use) where over 80% of the drugs have a potential replacement with high significance.
Topic modeling can be a powerful tool for the identification of repositioning opportunities by examining the adverse event terms in FDA approved drug labels. The proposed framework not only suggests drugs that can be repurposed, but also provides insight into the safety of repositioned drugs.
PMCID: PMC3439728  PMID: 23046522
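The evaluation described in this abstract (pair each drug with its nearest neighbor in topic space, then count pairs sharing an indication) can be sketched as below. The abstract does not name the distance measure, so the Hellinger distance used here is an assumption, as are the function names and the toy data layout.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two topic distributions (assumed metric)."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


def nearest_neighbor(drug, topics):
    """Return the other drug whose topic distribution is closest to `drug`."""
    others = (d for d in topics if d != drug)
    return min(others, key=lambda d: hellinger(topics[drug], topics[d]))


def recall(topics, indications):
    """Fraction of drugs sharing at least one indication with their neighbor."""
    hits = sum(
        1 for d in topics
        if indications[d] & indications[nearest_neighbor(d, topics)]
    )
    return hits / len(topics)
```

Here `topics` maps each drug to its topic probability vector and `indications` maps each drug to a set of indication terms; both structures are hypothetical stand-ins for the label-derived data.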
8.  Technical Reproducibility of Genotyping SNP Arrays Used in Genome-Wide Association Studies 
PLoS ONE  2012;7(9):e44483.
During the last several years, high-density genotyping SNP arrays have facilitated genome-wide association studies (GWAS) that successfully identified common genetic variants associated with a variety of phenotypes. However, each of the identified genetic variants only explains a very small fraction of the underlying genetic contribution to the studied phenotypic trait. Moreover, discordance observed in results between independent GWAS indicates the potential for Type I and II errors. High reliability of genotyping technology is needed to have confidence in using SNP data and interpreting GWAS results. Therefore, reproducibility of two widely used genotyping technology platforms from Affymetrix and Illumina was assessed by analyzing four technical replicates from each of six individuals in five laboratories. Genotype concordance of 99.40% to 99.87% within a laboratory for the same platform, 98.59% to 99.86% across laboratories for the same platform, and 98.80% across genotyping platforms was observed. Moreover, arrays with low quality data were detected when comparing genotyping data from technical replicates, but they could not be detected according to vendors’ quality control (QC) suggestions. Our results demonstrated the technical reliability of currently available genotyping platforms but also indicated the importance of incorporating some technical replicates for genotyping QC in order to improve the reliability of GWAS results. The impact of discordant genotypes on association analysis results was simulated and could explain, at least in part, the irreproducibility of some GWAS findings when the effect size (i.e. the odds ratio) and the minor allele frequencies are low.
PMCID: PMC3436888  PMID: 22970228
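Genotype concordance between two technical replicates, the quantity reported in this abstract, can be computed along the following lines. This is a sketch under stated assumptions: the abstract does not say how no-calls were handled, so excluding them is an assumption, and the `"NN"` no-call code and function name are hypothetical.

```python
def concordance(calls_a, calls_b, missing="NN"):
    """Fraction of SNPs with identical genotype calls in two replicates.

    Assumption: SNPs with a no-call in either replicate are excluded.
    """
    compared = [
        (a, b) for a, b in zip(calls_a, calls_b)
        if a != missing and b != missing
    ]
    if not compared:
        raise ValueError("no SNPs were called in both replicates")
    return sum(a == b for a, b in compared) / len(compared)
```

Comparing every pair of replicates this way is one simple route to flagging an array whose concordance falls well below that of its siblings.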
9.  Toward interoperable bioscience data 
Nature genetics  2012;44(2):121-126.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
PMCID: PMC3428019  PMID: 22281772
10.  atBioNet– an integrated network analysis tool for genomics and biomarker discovery 
BMC Genomics  2012;13:325.
Large amounts of mammalian protein-protein interaction (PPI) data have been generated and are available for public use. From a systems biology perspective, protein/gene interactions encode the key mechanisms distinguishing disease and health, and such mechanisms can be uncovered through network analysis. An effective network analysis tool should integrate different content-specific PPI databases into a comprehensive network format with a user-friendly platform to identify key functional modules/pathways and the underlying mechanisms of disease and toxicity.
atBioNet integrates seven publicly available PPI databases into a network-specific knowledge base. Knowledge expansion is achieved by expanding a user-supplied protein/gene list with interactions from its integrated PPI network. The statistically significant functional modules are determined by applying a fast network-clustering algorithm (SCAN: a Structural Clustering Algorithm for Networks). The functional modules can be visualized either separately or together in the context of the whole network. Integration of pathway information enables enrichment analysis and assessment of the biological function of modules. Three case studies are presented using publicly available disease gene signatures as a basis to discover new biomarkers for acute leukemia, systemic lupus erythematosus, and breast cancer. The results demonstrated that atBioNet can not only identify functional modules and pathways related to the studied diseases, but also provide information that can be used to hypothesize novel biomarkers for future analysis.
atBioNet is a free web-based network analysis tool that provides systematic insight into protein/gene interactions through examining significant functional modules. The identified functional modules are useful for determining underlying mechanisms of disease and biomarker discovery. It can be accessed at:
PMCID: PMC3443675  PMID: 22817640
Protein-protein interaction; Network analysis; Functional module; Disease biomarker; KEGG pathway analysis; Visualization tool; Genomics
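SCAN clusters a network using a structural similarity between adjacent nodes, defined in the published algorithm as the size of the shared closed neighborhood divided by the geometric mean of the two neighborhood sizes. The sketch below shows only that similarity measure; how atBioNet parameterizes or implements SCAN is not stated in the abstract, and the graph representation and function names are assumptions.

```python
import math


def closed_neighborhood(graph, node):
    """Neighbors of `node` plus the node itself."""
    return set(graph[node]) | {node}


def structural_similarity(graph, u, v):
    """SCAN-style structural similarity between nodes u and v."""
    nu = closed_neighborhood(graph, u)
    nv = closed_neighborhood(graph, v)
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))
```

Here `graph` is a hypothetical adjacency mapping from each protein to the set of its interaction partners. Edges whose similarity exceeds a chosen threshold are the ones SCAN treats as belonging to the same module.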
11.  SNPTrack™: an integrated bioinformatics system for genetic association studies 
Human Genomics  2012;6(1):5.
A genetic association study is a complicated process that involves collecting phenotypic data, generating genotypic data, analyzing associations between genotypic and phenotypic data, and interpreting genetic biomarkers identified. SNPTrack is an integrated bioinformatics system developed by the US Food and Drug Administration (FDA) to support the review and analysis of pharmacogenetics data resulting from FDA research or submitted by sponsors. The system integrates data management, analysis, and interpretation in a single platform for genetic association studies. Specifically, it stores genotyping data and single-nucleotide polymorphism (SNP) annotations along with study design data in an Oracle database. It also integrates popular genetic analysis tools, such as PLINK and Haploview. SNPTrack provides genetic analysis capabilities and captures analysis results in its database as SNP lists that can be cross-linked for biological interpretation to gene/protein annotations, Gene Ontology, and pathway analysis data. With SNPTrack, users can do the entire stream of bioinformatics jobs for genetic association studies. SNPTrack is freely available to the public at
PMCID: PMC3437569  PMID: 23245293
12.  Next generation sequencing for profiling expression of miRNAs: technical progress and applications in drug development 
miRNAs are non-coding RNAs that play a regulatory role in expression of genes and are associated with diseases. Quantitatively measuring expression levels of miRNAs can help in understanding the mechanisms of human diseases and discovering new drug targets. There are three major methods that have been used to measure the expression levels of miRNAs: real-time reverse transcription PCR (qRT-PCR), microarray, and the newly introduced next-generation sequencing (NGS). NGS is suitable not only for profiling known miRNAs, as qRT-PCR and microarray are, but also for detecting unknown miRNAs, which the other two methods cannot do. Profiling of miRNAs by NGS has progressed rapidly and is a promising field for applications in drug development. This paper reviews the technical advancement of NGS for profiling miRNAs, including comparative analyses between different platforms and software packages for analyzing NGS data. Examples and future perspectives of applications of NGS-based miRNA profiling in drug development are discussed.
PMCID: PMC3312786  PMID: 22457835
miRNAs; Next-Generation Sequencing; Expression; Data Analysis; Drug Development
13.  Shifting from Population-wide to Personalized Cancer Prognosis with Microarrays 
PLoS ONE  2012;7(1):e29534.
The era of personalized medicine for cancer therapeutics has taken an important step forward in making accurate prognoses for individual patients with the adoption of high-throughput microarray technology. However, microarray technology in cancer diagnosis or prognosis has been primarily used for the statistical evaluation of patient populations, and thus excludes inter-individual variability and patient-specific predictions. Here we propose a metric called clinical confidence that serves as a measure of prognostic reliability to facilitate the shift from population-wide to personalized cancer prognosis using microarray-based predictive models. The performance of sample-based models predicted with different clinical confidences was evaluated and compared systematically using three large clinical datasets studying the following cancers: breast cancer, multiple myeloma, and neuroblastoma. Survival curves for patients, with different confidences, were also delineated. The results show that the clinical confidence metric separates patients with different prediction accuracies and survival times. Samples with high clinical confidence were likely to have accurate prognoses from predictive models. Moreover, patients with high clinical confidence would be expected to live for a notably longer or shorter time if their prognosis was good or grim based on the models, respectively. We conclude that clinical confidence could serve as a beneficial metric for personalized cancer prognosis prediction utilizing microarrays. Ascribing a confidence level to prognosis with the clinical confidence metric provides the clinician an objective, personalized basis for decisions, such as choosing the severity of the treatment.
PMCID: PMC3266237  PMID: 22295060
14.  Maximum predictive power of the microarray-based models for clinical outcomes is limited by correlation between endpoint and gene expression profile 
BMC Genomics  2011;12(Suppl 5):S3.
Microarray data have been used for gene signature selection to predict clinical outcomes. Many studies have attempted to identify factors that affect models' performance with little success. Fine-tuning of model parameters and optimizing each step of the modeling process often result in over-fitting problems without improving performance.
We propose a quantitative measurement, termed consistency degree, to detect the correlation between disease endpoint and gene expression profile. Different endpoints were shown to have different consistency degrees to gene expression profiles. The validity of this measurement to estimate the consistency was tested with significance at a p-value less than 2.2e-16 for all of the studied endpoints. According to the consistency degree score, overall survival milestone outcome of multiple myeloma was proposed to extend from 730 days to 1561 days, which is more consistent with gene expression profile.
For various clinical endpoints, the maximum predictive powers of different microarray-based models are limited by the correlation between endpoint and gene expression profile of disease samples as indicated by the consistency degree score. In addition, previously defined clinical outcomes can also be reassessed and refined to be more coherent with the related disease gene expression profile. Our findings point to an entirely new direction for assessing microarray-based predictive models and provide important information for gene signature based clinical applications.
PMCID: PMC3287499  PMID: 22369035
15.  Maximizing biomarker discovery by minimizing gene signatures 
BMC Genomics  2011;12(Suppl 5):S6.
The use of gene signatures can potentially be of considerable value in the field of clinical diagnosis. However, gene signatures defined with different methods can vary considerably even when applied to the same disease and the same endpoint. Previous studies have shown that the correct selection of subsets of genes from microarray data is key for the accurate classification of disease phenotypes, and a number of methods have been proposed for the purpose. However, these methods refine the subsets by only considering each single feature, and they do not confirm the association between the genes identified in each gene signature and the phenotype of the disease. We proposed an innovative new method termed Minimize Feature's Size (MFS) based on multiple level similarity analyses and association between the genes and disease for breast cancer endpoints by comparing classifier models generated from the second phase of MicroArray Quality Control (MAQC-II), with the aim of developing effective meta-analysis strategies to transform the MAQC-II signatures into a robust and reliable set of biomarkers for clinical applications.
We analyzed the similarity of the multiple gene signatures within an endpoint and between the two endpoints of breast cancer at probe and gene levels. The results indicate that disease-related genes can be preferentially selected as the components of a gene signature, and that the gene signatures for the two endpoints could be interchangeable. The minimized signatures were built at probe level by using MFS for each endpoint. By applying the approach, we generated a much smaller gene signature with predictive power similar to that of the gene signatures from MAQC-II.
Our results indicate that gene signatures of both large and small sizes could perform equally well in clinical applications. In addition, consistency and biological significance can be detected among different gene signatures, reflecting the studied endpoints. New classifiers built with MFS exhibit improved performance with both internal and external validation, suggesting that the MFS method generally reduces redundancies for features within gene signatures and improves the performance of the model. Consequently, our strategy will be beneficial for microarray-based clinical applications.
PMCID: PMC3287502  PMID: 22369133
16.  Translating Clinical Findings into Knowledge in Drug Safety Evaluation - Drug Induced Liver Injury Prediction System (DILIps) 
PLoS Computational Biology  2011;7(12):e1002310.
Drug-induced liver injury (DILI) is a significant concern in drug development due to the poor concordance between preclinical and clinical findings of liver toxicity. We hypothesized that the DILI types (hepatotoxic side effects) seen in the clinic can be translated into the development of predictive in silico models for use in the drug discovery phase. We identified 13 hepatotoxic side effects with high accuracy for classifying marketed drugs for their DILI potential. We then developed in silico predictive models for each of these 13 side effects, which were further combined to construct a DILI prediction system (DILIps). The DILIps yielded 60–70% prediction accuracy for three independent validation sets. To enhance the confidence for identification of drugs that cause severe DILI in humans, the “Rule of Three” was developed in DILIps by using a consensus strategy based on 13 models. This gave high positive predictive value (91%) when applied to an external dataset containing 206 drugs from three independent literature datasets. Using the DILIps, we screened all the drugs in DrugBank and investigated their DILI potential in terms of protein targets and therapeutic categories through network modeling. We demonstrated that two therapeutic categories, anti-infectives for systemic use and musculoskeletal system drugs, were enriched for DILI, which is consistent with current knowledge. We also identified protein targets and pathways that are related to drugs that cause DILI by using pathway analysis and co-occurrence text mining. While marketed drugs were the focus of this study, the DILIps has a potential as an evaluation tool to screen and prioritize new drug candidates or chemicals, such as environmental chemicals, to avoid those that might cause liver toxicity. We expect that the methodology can be also applied to other drug safety endpoints, such as renal or cardiovascular toxicity.
Author Summary
Translational research involves utilization of clinical data to address challenges in drug discovery and development. The rationale behind this study is that the side effects observed in clinical trials and post-marketing surveillance can be translated into a screening system for use in drug discovery. As a proof-of-concept study, we developed an in silico system based on 13 hepatotoxic side effects to predict drug-induced liver injury (DILI), which is one of the most frequent causes of drug failure in clinical trials and withdrawal from post-marketing application, and also one of the most difficult clinical endpoints to predict from preclinical studies. We first identified 13 types of liver injury which yielded high prediction accuracy to distinguish drugs known to cause DILI from those that do not. To effectively apply these 13 hepatotoxic side effects to the drug discovery process for DILI, we developed in silico models for each of these side effects solely based on chemical structure data. Finally, we constructed a DILI prediction system (DILIps) by combining these 13 in silico models in a consensus fashion, which yielded >91% positive predictive value for DILI in humans. The DILIps methodology can be extended in applications for addressing other drug safety issues, such as renal and cardiovascular toxicity.
PMCID: PMC3240589  PMID: 22194678
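A consensus call over the 13 side-effect models, together with the positive predictive value used to evaluate it, might look like the sketch below. The abstract does not define the "Rule of Three", so the threshold of three positive models is only an assumption suggested by the name; `min_positive` and both function names are hypothetical.

```python
def consensus_call(model_calls, min_positive=3):
    """Flag a drug when at least `min_positive` models predict a positive.

    Assumption: the threshold of 3 is inferred from the rule's name only.
    """
    return sum(model_calls) >= min_positive


def positive_predictive_value(predicted, actual):
    """Share of predicted positives that are true positives."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    if true_pos + false_pos == 0:
        raise ValueError("no positive predictions to evaluate")
    return true_pos / (true_pos + false_pos)
```

A stricter consensus threshold generally trades sensitivity for a higher positive predictive value, which fits the stated goal of raising confidence in drugs flagged as causing severe DILI.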
17.  Mining FDA drug labels using an unsupervised learning technique - topic modeling 
BMC Bioinformatics  2011;12(Suppl 10):S11.
The Food and Drug Administration (FDA) approved drug labels contain a broad array of information, ranging from adverse drug reactions (ADRs) to drug efficacy, risk-benefit consideration, and more. However, the labeling language used to describe this information is free text, often containing ambiguous semantic descriptions, which poses a great challenge in retrieving useful information from the labeling text in a consistent and accurate fashion for comparative analysis across drugs. Consequently, this task has largely relied on the manual reading of the full text by experts, which is time consuming and labor intensive.
In this study, a novel text mining method that is unsupervised in nature, called topic modeling, was applied to the drug labeling with a goal of discovering “topics” that group drugs with similar safety concerns and/or therapeutic uses together. A total of 794 FDA-approved drug labels were used in this study. First, the three labeling sections (i.e., Boxed Warning, Warnings and Precautions, Adverse Reactions) of each drug label were processed by the Medical Dictionary for Regulatory Activities (MedDRA) to convert the free text of each label to the standard ADR terms. Next, the topic modeling approach with latent Dirichlet allocation (LDA) was applied to generate 100 topics, each associated with a set of drugs grouped together based on the probability analysis. Lastly, the efficacy of the topic modeling was evaluated based on known information about the therapeutic uses and safety data of drugs.
The results demonstrate that drugs grouped by topics are associated with the same safety concerns and/or therapeutic uses with statistical significance (P<0.05). The identified topics have distinct context that can be directly linked to specific adverse events (e.g., liver injury or kidney injury) or therapeutic application (e.g., antiinfectives for systemic use). We were also able to identify potential adverse events that might arise from specific medications via topics.
The successful application of topic modeling on the FDA drug labeling demonstrates its potential utility as a hypothesis generation means to infer hidden relationships of concepts such as, in this study, drug safety and therapeutic use in the study of biomedical documents.
PMCID: PMC3236833  PMID: 22166012
18.  Selecting a single model or combining multiple models for microarray-based classifier development? – A comparative analysis based on large and diverse datasets generated from the MAQC-II project 
BMC Bioinformatics  2011;12(Suppl 10):S3.
Genomic biomarkers play an increasing role in both preclinical and clinical applications. Development of genomic biomarkers with microarrays is an area of intensive investigation. However, despite sustained and continuing effort, developing microarray-based predictive models (i.e., genomic biomarkers) capable of reliable prediction for an observed or measured outcome (i.e., endpoint) of unknown samples in preclinical and clinical practice remains a considerable challenge. No straightforward guidelines exist for selecting a single model that will perform best when presented with unknown samples. In the second phase of the MicroArray Quality Control (MAQC-II) project, 36 analysis teams produced a large number of models for 13 preclinical and clinical endpoints. Before external validation was performed, each team nominated one model per endpoint (referred to here as 'nominated models'), from which MAQC-II experts selected 13 'candidate models' to represent the best model for each endpoint. Both the nominated and candidate models from MAQC-II provide benchmarks to assess other methodologies for developing microarray-based predictive models.
We developed a simple ensemble method by taking a number of the top performing models from cross-validation and developing an ensemble model for each of the MAQC-II endpoints. We compared the ensemble models with both nominated and candidate models from MAQC-II using blinded external validation.
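The ensemble idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how the top models are combined, so majority voting is assumed here, and the model names, cross-validation scores, and features (`g1`, `g2`) are hypothetical toy values.

```python
from collections import Counter

def top_models(cv_scores, n):
    """Return the names of the n models with the highest cross-validation score."""
    return sorted(cv_scores, key=cv_scores.get, reverse=True)[:n]

def ensemble_predict(models, selected, sample):
    """Majority vote over the selected models' class predictions for one sample.

    Majority voting is an assumption of this sketch, not a detail from the abstract.
    """
    votes = Counter(models[name](sample) for name in selected)
    return votes.most_common(1)[0][0]

# Hypothetical cross-validation scores for four toy classifiers.
cv_scores = {"m1": 0.81, "m2": 0.78, "m3": 0.74, "m4": 0.60}

# Hypothetical classifiers: each maps a sample (feature dict) to class 0 or 1.
models = {
    "m1": lambda x: int(x["g1"] > 0.5),
    "m2": lambda x: int(x["g2"] > 0.5),
    "m3": lambda x: int(x["g1"] + x["g2"] > 1.0),
    "m4": lambda x: 0,
}

selected = top_models(cv_scores, 3)   # an odd count avoids tied votes
sample = {"g1": 0.9, "g2": 0.2}
prediction = ensemble_predict(models, selected, sample)
```

Selecting an odd number of models keeps a two-class vote from tying; other combination rules (for example, averaging predicted probabilities) would fit the same structure.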
For 10 of the 13 MAQC-II endpoints originally analyzed by the MAQC-II data analysis team from the National Center for Toxicological Research (NCTR), the ensemble models achieved equal or better predictive performance than the NCTR nominated models. Additionally, the ensemble models had performance comparable to the MAQC-II candidate models. Most ensemble models also had better performance than the nominated models generated by five other MAQC-II data analysis teams that analyzed all 13 endpoints.
Our findings suggest that an ensemble method can often attain a higher average predictive performance in an external validation set than a corresponding “optimized” model method. Using an ensemble method to determine a final model is a potentially important supplement to the good modeling practices recommended by the MAQC-II project for developing microarray-based genomic biomarkers.
PMCID: PMC3236846  PMID: 22166133
19.  Constructing a robust protein-protein interaction network by integrating multiple public databases 
BMC Bioinformatics  2011;12(Suppl 10):S7.
Protein-protein interactions (PPIs) are a critical component for many underlying biological processes. A PPI network can provide insight into the mechanisms of these processes, as well as the relationships among different proteins and toxicants that are potentially involved in the processes. There are many PPI databases publicly available, each with a specific focus. The challenge is how to effectively combine their contents to generate a robust and biologically relevant PPI network.
In this study, seven public PPI databases, BioGRID, DIP, HPRD, IntAct, MINT, REACTOME, and SPIKE, were used to explore a powerful approach to combine multiple PPI databases for an integrated PPI network. We developed a novel method called k-votes to create seven different integrated networks by using values of k ranging from 1 to 7. Functional modules were mined by using SCAN, a Structural Clustering Algorithm for Networks. Overall module qualities were evaluated for each integrated network using the following statistical and biological measures: (1) modularity, (2) similarity-based modularity, (3) clustering score, and (4) enrichment.
Each integrated human PPI network was constructed based on the number of votes (k) for a particular interaction from the committee of the original seven PPI databases. The performance of functional modules obtained by SCAN from each integrated network was evaluated. The optimal value for k was determined by the functional module analysis. Our results demonstrate that the k-votes method outperforms the traditional union approach in terms of both statistical significance and biological meaning. The best network is achieved at k=2, which is composed of interactions that are confirmed in at least two PPI databases. In contrast, the traditional union approach yields an integrated network that consists of all interactions of seven PPI databases, which might be subject to high false positives.
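The voting rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the database names and protein identifiers are hypothetical, and treating interactions as undirected pairs is an assumption of the sketch.

```python
from collections import Counter

def k_votes(databases, k):
    """Keep interactions reported by at least k of the given databases.

    `databases` maps a database name to an iterable of (protein, protein) pairs.
    """
    votes = Counter()
    for interactions in databases.values():
        # Assumption: interactions are undirected, so (A, B) equals (B, A).
        # Building a set first means each database votes at most once per pair.
        normalized = {frozenset(pair) for pair in interactions}
        votes.update(normalized)
    return {pair for pair, n in votes.items() if n >= k}

# Hypothetical toy databases.
toy = {
    "db_a": [("P1", "P2"), ("P2", "P3")],
    "db_b": [("P2", "P1"), ("P3", "P4")],
    "db_c": [("P1", "P2"), ("P3", "P4"), ("P4", "P5")],
}

union_network = k_votes(toy, 1)    # every interaction seen in any database
robust_network = k_votes(toy, 2)   # interactions confirmed by at least two databases
```

In this sketch, `k=1` reproduces the plain union of all databases, while larger values of `k` trade coverage for agreement among sources.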
We determined that the k-votes method for constructing a robust PPI network by integrating multiple public databases outperforms previously reported approaches and that a value of k=2 provides the best results. The developed strategies for combining databases show promise in the advancement of network construction and modeling.
PMCID: PMC3236850  PMID: 22165958
20.  Genomic indicators in the blood predict drug-induced liver injury 
The pharmacogenomics journal  2010;10(4):267-277.
Genomic biomarkers for the detection of drug-induced liver injury (DILI) from blood are urgently needed for monitoring drug safety. We used a unique data set from the Food and Drug Administration-led MicroArray Quality Control Phase-II (MAQC-II) project, consisting of gene expression data from two tissues (blood and liver), to test the cross-tissue predictability of genomic indicators for a form of chemically-induced liver injury. We then used the genomic indicators from the blood as biomarkers for prediction of acetaminophen-induced liver injury and showed that the cross-tissue predictability of a response to the pharmaceutical agent (accuracy as high as 92.1%) is better than, or at least comparable to, that of non-therapeutic compounds. We provide a database of gene expression for the highly informative predictors, which brings biological context to the possible mechanisms involved in DILI. Pathway-based predictors were associated with inflammation, angiogenesis, Toll-like receptor signaling, apoptosis and mitochondrial damage. The results provide the first demonstration in support of the hypothesis that genomic indicators in the blood can serve as potential diagnostic biomarkers predictive of DILI.
PMCID: PMC3180890  PMID: 20676066
prediction; acetaminophen; blood; cross tissue; liver injury; microarray gene expression
21.  Identification of New SRF Binding Sites in Genes Modulated by SRF Over-Expression in Mouse Hearts 
To identify new in vivo cardiac binding sites of serum response factor (SRF) in genes and to study the response of these genes to mild over-expression of SRF, we employed a cardiac-specific, transgenic mouse model with mild over-expression of SRF (Mild-O-SRF Tg).
Microarray experiments were performed on hearts of Mild-O-SRF Tg mice at 6 months of age. We identified 207 genes that are important for cardiac function and were differentially expressed in vivo. Among them, the promoter regions of 192 genes had SRF binding motifs, either the classic CArG or CArG-like (CArG-L) elements. Fifty-one of the 56 genes with classic SRF binding sites had not been previously reported. These SRF-modulated genes were grouped into 12 categories based on their function. Expression of genes associated with cardiac energy metabolism shifted toward carbohydrate metabolism and away from fatty acid metabolism. The expression of genes involved in transcription and ion regulation was decreased, but expression of cytoskeletal genes was significantly increased. Using public databases of mouse models of hemodynamic stress (GEO database), we also found that similar altered expression of the SRF-modulated genes occurred in hearts with cardiac ischemia or aortic constriction.
Conclusion and significance:
SRF-modulated genes are actively regulated under various physiological and pathological conditions. We have discovered that a large number of cardiac genes have classic SRF binding sites and were significantly modulated in the Mild-O-SRF Tg mouse hearts. Hence, the mild elevation of SRF protein in the heart that is observed during typical adult aging may have a major impact on many SRF-modulated genes, thereby affecting cardiac structure and performance. The results from our study could help to enhance our understanding of SRF regulation of cellular processes in the aged heart.
PMCID: PMC3140411  PMID: 21792293
SRF modulated genes; SRF binding sites; mouse heart; mild-SRF over-expression; gene expression; striated muscle
22.  Meta-analysis of microarray data using a pathway-based approach identifies a 37-gene expression signature for systemic lupus erythematosus in human peripheral blood mononuclear cells 
BMC Medicine  2011;9:65.
A number of publications have reported the use of microarray technology to identify gene expression signatures to infer mechanisms and pathways associated with systemic lupus erythematosus (SLE) in human peripheral blood mononuclear cells. However, meta-analysis approaches with microarray data have not been well-explored in SLE.
In this study, a pathway-based meta-analysis was applied to four independent gene expression oligonucleotide microarray data sets to identify gene expression signatures for SLE, and the resulting signature was confirmed with a fifth independent data set.
Differentially expressed genes (DEGs) were identified in each data set by comparing expression microarray data from control samples and SLE samples. Using Ingenuity Pathway Analysis software, pathways associated with the DEGs were identified in each of the four data sets. Using the leave-one-data-set-out pathway-based meta-analysis approach, a 37-gene metasignature was identified. This SLE metasignature clearly distinguished SLE patients from controls as observed by unsupervised learning methods. The final confirmation of the metasignature was achieved by applying it to a fifth independent data set.
The novel pathway-based meta-analysis approach proved to be a useful technique for grouping disparate microarray data sets. This technique allowed for validated conclusions to be drawn across four different data sets and confirmed by an independent fifth data set. The metasignature and pathways identified by using this approach may serve as a source for identifying therapeutic targets for SLE and may possibly be used for diagnostic and monitoring purposes. Moreover, the meta-analysis approach provides a simple, intuitive solution for combining disparate microarray data sets to identify a strong metasignature.
PMCID: PMC3126731  PMID: 21624134
23.  Differential gene expression profiling of mouse skin after sulfur mustard exposure: Extended time response and inhibitor effect 
Toxicology and applied pharmacology  2008;234(2):156-165.
Sulfur mustard (HD, SM) is a chemical warfare agent that within hours causes extensive blistering at the dermal–epidermal junction of skin. To better understand the progression of SM-induced blistering, gene expression profiling for mouse skin was performed after a single high dose of SM exposure. Punch biopsies of mouse ears were collected at both early and late time periods following SM exposure (previous studies only considered early time periods). The biopsies were examined for pathological disturbances and the samples further assayed for gene expression profiling using the Affymetrix microarray analysis system. Principal component analysis and hierarchical cluster analysis of the differentially expressed genes, performed with ArrayTrack, showed clear separation of the various groups. Pathway analysis employing the KEGG library and Ingenuity Pathway Analysis (IPA) indicated that cytokine–cytokine receptor interaction, cell adhesion molecules (CAMs), and hematopoietic cell lineage are common pathways affected at different time points. Gene ontology analysis identified the most significantly altered biological processes as the immune response, inflammatory response, and chemotaxis; these findings are consistent with other reported results for shorter time periods. Selected genes were chosen for RT-PCR verification, and the results followed the general trends seen in the microarrays. Interleukin 1 beta protein was assayed to confirm that its presence correlated with the corresponding microarray data. The impact of a matrix metalloproteinase inhibitor, MMP-2/MMP-9 inhibitor I, against SM exposure was assessed. These results can help in understanding the molecular mechanism of SM-induced blistering, as well as in testing the efficacy of different inhibitors.
PMCID: PMC3066660  PMID: 18955075
Vesicant; Sulfur mustard; Microarray; Alkylating agent; Skin; MMP inhibitor; MMP; Matrix metalloproteinase
24.  Cross-Platform Comparison of Microarray-Based Multiple-Class Prediction 
PLoS ONE  2011;6(1):e16067.
High-throughput microarray technology has been widely applied in biological and medical decision-making research during the past decade. However, the diversity of platforms has made it a challenge to re-use and/or integrate datasets generated in different experiments or labs for constructing array-based diagnostic models. Using large toxicogenomics datasets generated using both Affymetrix and Agilent microarray platforms, we carried out a benchmark evaluation of cross-platform consistency in multiple-class prediction using three widely-used machine learning algorithms. After an initial assessment of model performance on different platforms, we evaluated whether predictive signature features selected in one platform could be directly used to train a model in the other platform and whether predictive models trained using data from one platform could predict datasets profiled using the other platform with comparable performance. Our results established that it is possible to successfully apply multiple-class prediction models across different commercial microarray platforms, offering a number of important benefits such as accelerating the possible translation of biomarkers identified with microarrays to clinically-validated assays. However, this investigation focuses on a technical platform comparison and is actually only the beginning of exploring cross-platform consistency. Further studies are needed to confirm the feasibility of microarray-based cross-platform prediction, especially using independent datasets.
PMCID: PMC3019174  PMID: 21264309
25.  Evaluation of gene expression data generated from expired Affymetrix GeneChip® microarrays using MAQC reference RNA samples 
BMC Bioinformatics  2010;11(Suppl 6):S10.
The Affymetrix GeneChip® system is a commonly used platform for microarray analysis, but the technology is inherently expensive. Unfortunately, changes in experimental planning and execution, such as the unavailability of previously anticipated samples or a shift in research focus, may leave significant numbers of pre-purchased GeneChip® microarrays unprocessed when their manufacturer’s expiration dates pass. Researchers and microarray core facilities wonder whether expired microarrays are still useful for gene expression analysis. In addition, it was not clear whether the two human reference RNA samples established by the MAQC project in 2005 had maintained their transcriptome integrity over a period of four years. Experiments were conducted to answer these questions.
Microarray data were generated in 2009 in three replicates for each of the two MAQC samples with either expired Affymetrix U133A or unexpired U133Plus2 microarrays. These results were compared with data obtained in 2005 on the U133Plus2 microarray. The percentage of overlap between the lists of differentially expressed genes (DEGs) from U133Plus2 microarray data generated in 2009 and in 2005 was 97.44%. While there was some degree of fold change compression in the expired U133A microarrays, the percentage of overlap between the lists of DEGs from the expired and unexpired microarrays was as high as 96.99%. Moreover, the microarray data generated using the expired U133A microarrays in 2009 were highly concordant with microarray and TaqMan® data generated by the MAQC project in 2005.
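As a rough illustration of how a percentage of overlap between two DEG lists might be computed: the abstract does not state the exact formula, so this sketch assumes the shared genes are expressed relative to the shorter list, and the gene identifiers are hypothetical.

```python
def percent_overlap(list_a, list_b):
    """Percentage of genes shared by two DEG lists.

    Assumption of this sketch: the denominator is the size of the smaller list.
    """
    a, b = set(list_a), set(list_b)
    if not a or not b:
        raise ValueError("both gene lists must be non-empty")
    return 100.0 * len(a & b) / min(len(a), len(b))

# Hypothetical DEG lists from two experiments.
degs_2005 = ["G1", "G2", "G3", "G4"]
degs_2009 = ["G2", "G3", "G4", "G5"]

overlap = percent_overlap(degs_2005, degs_2009)
```

Other denominators (for example, the size of the union) give different percentages, so the definition should be fixed before comparing values across studies.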
Our results demonstrated that microarray data generated using U133A microarrays, which were more than four years past the manufacturer’s expiration date, were highly specific and consistent with those from unexpired microarrays in identifying DEGs despite some appreciable fold change compression and decrease in sensitivity. Our data also suggested that the MAQC reference RNA samples, stored at -80°C, were stable over a time frame of at least four years.
PMCID: PMC3026357  PMID: 20946593
