Human protein complexes play crucial roles in various biological processes as functional modules. However, the expression features of human protein complexes at the transcriptome cascade are poorly understood. Here, we used RNA-Seq data from 16 disparate tissues and four types of human cancers to explore the characteristics and dynamics of human protein complexes. We observed that many individual components of human protein complexes can be generated by multiple distinct transcripts. As in yeast, the constituents of human protein complexes tend to be co-expressed in diverse tissues. The dominant isoform of each gene involved in protein complexes tends to encode the complex constituent in each tissue. Our results indicate that protein complex dynamics not only correlate with the presence or absence of complexes, but may also be related to major isoform switching for complex subunits. Between any two of the breast, colon, lung and prostate cancers, we found that only a few of the differentially expressed transcripts associated with complexes were identical, but 5–10 times more protein complexes involved in differentially expressed transcripts were common. Collectively, our study reveals novel properties and dynamics of human protein complexes at the transcriptome cascade in diverse normal tissues and different cancers.
Drug repositioning offers an opportunity to revitalize the slowing drug discovery pipeline by finding new uses for existing drugs. Our hypothesis is that drugs sharing similar side effect profiles are likely to be effective for the same disease, and thus repositioning opportunities can be identified by finding drug pairs with similar side effects documented in U.S. Food and Drug Administration (FDA) approved drug labels. The safety information in the drug labels is usually obtained in clinical trials and augmented with observations from post-market use of the drug. Therefore, our drug repositioning approach can take advantage of more comprehensive safety information compared with the conventional de novo approach.
A probabilistic topic model was constructed based on the terms in the Medical Dictionary for Regulatory Activities (MedDRA) that appeared in the Boxed Warning, Warnings and Precautions, and Adverse Reactions sections of the labels of 870 drugs. Fifty-two unique topics, each containing a set of terms, were identified by using topic modeling. The resulting probabilistic topic associations were used to measure the distance (similarity) between drugs. The success of the proposed model was evaluated by comparing a drug and its nearest neighbor (i.e., a drug pair) for common indications found in the Indications and Usage Section of the drug labels.
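The evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the distance measure, so the Hellinger distance is assumed here, and the drug names, topic probabilities, and indications are hypothetical toy values.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def nearest_neighbor(drug, topics):
    """Return the other drug whose topic distribution is closest to `drug`."""
    others = (d for d in topics if d != drug)
    return min(others, key=lambda d: hellinger(topics[drug], topics[d]))


def shared_indication_rate(topics, indications):
    """Fraction of drugs whose nearest neighbor shares at least one indication."""
    hits = sum(
        1 for d in topics
        if indications[d] & indications[nearest_neighbor(d, topics)]
    )
    return hits / len(topics)


# Hypothetical toy data: four drugs described by three topic probabilities.
topics = {
    "drug_a": [0.8, 0.1, 0.1],
    "drug_b": [0.7, 0.2, 0.1],
    "drug_c": [0.1, 0.1, 0.8],
    "drug_d": [0.1, 0.2, 0.7],
}
indications = {
    "drug_a": {"hypertension"},
    "drug_b": {"hypertension", "angina"},
    "drug_c": {"epilepsy"},
    "drug_d": {"migraine"},
}

rate = shared_indication_rate(topics, indications)
```

In this toy set, drug_a and drug_b are each other's nearest neighbors and share an indication, while drug_c and drug_d pair up without sharing one. The resulting rate corresponds to what the study reports as recall.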
Given a drug with more than three indications, the model yielded a 75% recall, meaning 75% of drug pairs shared one or more common indications. This is significantly higher than the 22% recall rate achieved by random selection. Additionally, the recall rate grows rapidly as the number of drug indications increases and reaches 84% for drugs with 11 indications. The analysis also demonstrated that 65 drugs with a Boxed Warning, which indicates significant risk of serious and possibly life-threatening adverse effects, might be replaced with safer alternatives that do not have a Boxed Warning. In addition, we identified two therapeutic groups of drugs (Musculo-skeletal system and Anti-infective for systemic use) where over 80% of the drugs have a potential replacement with high significance.
Topic modeling can be a powerful tool for the identification of repositioning opportunities by examining the adverse event terms in FDA approved drug labels. The proposed framework not only suggests drugs that can be repurposed, but also provides insight into the safety of repositioned drugs.
During the last several years, high-density genotyping SNP arrays have facilitated genome-wide association studies (GWAS) that successfully identified common genetic variants associated with a variety of phenotypes. However, each of the identified genetic variants explains only a very small fraction of the underlying genetic contribution to the studied phenotypic trait. Moreover, discordance observed in results between independent GWAS indicates the potential for Type I and II errors. High reliability of genotyping technology is needed to have confidence in using SNP data and interpreting GWAS results. Therefore, the reproducibility of two widely used genotyping technology platforms from Affymetrix and Illumina was assessed by analyzing four technical replicates from each of six individuals in five laboratories. Genotype concordance of 99.40% to 99.87% within a laboratory for the same platform, 98.59% to 99.86% across laboratories for the same platform, and 98.80% across genotyping platforms was observed. Moreover, arrays with low quality data were detected when comparing genotyping data from technical replicates, but they could not be detected according to vendors’ quality control (QC) suggestions. Our results demonstrated the technical reliability of currently available genotyping platforms but also indicated the importance of incorporating some technical replicates for genotyping QC in order to improve the reliability of GWAS results. The impact of discordant genotypes on association analysis results was simulated and could explain, at least in part, the irreproducibility of some GWAS findings when the effect size (i.e., the odds ratio) and the minor allele frequencies are low.
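Genotype concordance between two technical replicates can be computed along the following lines. This is a sketch under stated assumptions: the "NN" no-call code, the exclusion of no-calls from the denominator, and the toy genotype calls are all illustrative choices, not details taken from the study.

```python
def concordance(calls_a, calls_b, missing="NN"):
    """Percent of SNPs with identical genotype calls in two replicates.

    SNPs with a no-call in either replicate are excluded (an assumption).
    """
    compared = [(a, b) for a, b in zip(calls_a, calls_b) if missing not in (a, b)]
    if not compared:
        raise ValueError("no SNPs were called in both replicates")
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# Hypothetical calls for six SNPs in two replicates of one individual.
replicate_1 = ["AA", "AB", "BB", "AB", "NN", "AA"]
replicate_2 = ["AA", "AB", "BB", "AA", "AB", "AA"]

pct = concordance(replicate_1, replicate_2)
```

Comparing each array against the other replicates of the same sample in this way is what allows a low-quality array to stand out even when it passes per-array QC thresholds.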
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
Large amounts of mammalian protein-protein interaction (PPI) data have been generated and are available for public use. From a systems biology perspective, protein/gene interactions encode the key mechanisms distinguishing disease and health, and such mechanisms can be uncovered through network analysis. An effective network analysis tool should integrate different content-specific PPI databases into a comprehensive network format with a user-friendly platform to identify key functional modules/pathways and the underlying mechanisms of disease and toxicity.
atBioNet integrates seven publicly available PPI databases into a network-specific knowledge base. Knowledge expansion is achieved by expanding a user-supplied protein/gene list with interactions from its integrated PPI network. The statistically significant functional modules are determined by applying a fast network-clustering algorithm (SCAN: a Structural Clustering Algorithm for Networks). The functional modules can be visualized either separately or together in the context of the whole network. Integration of pathway information enables enrichment analysis and assessment of the biological function of modules. Three case studies are presented using publicly available disease gene signatures as a basis to discover new biomarkers for acute leukemia, systemic lupus erythematosus, and breast cancer. The results demonstrated that atBioNet can not only identify functional modules and pathways related to the studied diseases, but also provide information that can be used to hypothesize novel biomarkers for future analysis.
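The knowledge-expansion step can be illustrated with a short sketch. It assumes that expansion means adding the direct interactors of the user-supplied list; atBioNet's actual expansion rules are not given in the abstract, and the function name and protein labels below are hypothetical.

```python
def expand_seeds(seeds, interactions):
    """Add the direct interactors of the seed proteins.

    Returns the expanded node set and the undirected edges that touch a seed.
    """
    seeds = set(seeds)
    edges = {frozenset(pair) for pair in interactions if seeds & set(pair)}
    nodes = seeds.union(*edges)
    return nodes, edges


# Hypothetical integrated network and user-supplied seed list.
network = [("P1", "P2"), ("P2", "P3"), ("P4", "P5"), ("P6", "P7")]
seed_list = ["P1", "P4"]

nodes, edges = expand_seeds(seed_list, network)
```

The expanded network is then the input on which a clustering algorithm such as SCAN would search for functional modules.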
atBioNet is a free web-based network analysis tool that provides a systematic insight into proteins/genes interactions through examining significant functional modules. The identified functional modules are useful for determining underlying mechanisms of disease and biomarker discovery. It can be accessed at: http://www.fda.gov/ScienceResearch/BioinformaticsTools/ucm285284.htm.
Protein-protein interaction; Network analysis; Functional module; Disease biomarker; KEGG pathway analysis; Visualization tool; Genomics
A genetic association study is a complicated process that involves collecting phenotypic data, generating genotypic data, analyzing associations between genotypic and phenotypic data, and interpreting genetic biomarkers identified. SNPTrack is an integrated bioinformatics system developed by the US Food and Drug Administration (FDA) to support the review and analysis of pharmacogenetics data resulting from FDA research or submitted by sponsors. The system integrates data management, analysis, and interpretation in a single platform for genetic association studies. Specifically, it stores genotyping data and single-nucleotide polymorphism (SNP) annotations along with study design data in an Oracle database. It also integrates popular genetic analysis tools, such as PLINK and Haploview. SNPTrack provides genetic analysis capabilities and captures analysis results in its database as SNP lists that can be cross-linked for biological interpretation to gene/protein annotations, Gene Ontology, and pathway analysis data. With SNPTrack, users can do the entire stream of bioinformatics jobs for genetic association studies. SNPTrack is freely available to the public at http://www.fda.gov/ScienceResearch/BioinformaticsTools/SNPTrack/default.htm.
miRNAs are non-coding RNAs that play a regulatory role in the expression of genes and are associated with diseases. Quantitatively measuring expression levels of miRNAs can help in understanding the mechanisms of human diseases and discovering new drug targets. Three major methods have been used to measure the expression levels of miRNAs: real-time reverse transcription PCR (qRT-PCR), microarray, and the newly introduced next-generation sequencing (NGS). Like qRT-PCR and microarray, NGS is suitable for profiling known miRNAs, but unlike those two methods it can also detect unknown miRNAs. Profiling of miRNAs by NGS has progressed rapidly and is a promising field for applications in drug development. This paper reviews the technical advancement of NGS for profiling miRNAs, including comparative analyses between different platforms and software packages for analyzing NGS data. Examples and future perspectives of applications of NGS miRNA profiling in drug development are discussed.
miRNAs; Next-Generation Sequencing; Expression; Data Analysis; Drug Development
The era of personalized medicine for cancer therapeutics has taken an important step forward in making accurate prognoses for individual patients with the adoption of high-throughput microarray technology. However, microarray technology in cancer diagnosis or prognosis has been primarily used for the statistical evaluation of patient populations, and thus excludes inter-individual variability and patient-specific predictions. Here we propose a metric called clinical confidence that serves as a measure of prognostic reliability to facilitate the shift from population-wide to personalized cancer prognosis using microarray-based predictive models. The performance of sample-based models predicted with different clinical confidences was evaluated and compared systematically using three large clinical datasets studying the following cancers: breast cancer, multiple myeloma, and neuroblastoma. Survival curves for patients, with different confidences, were also delineated. The results show that the clinical confidence metric separates patients with different prediction accuracies and survival times. Samples with high clinical confidence were likely to have accurate prognoses from predictive models. Moreover, patients with high clinical confidence would be expected to live for a notably longer or shorter time if their prognosis was good or grim based on the models, respectively. We conclude that clinical confidence could serve as a beneficial metric for personalized cancer prognosis prediction utilizing microarrays. Ascribing a confidence level to prognosis with the clinical confidence metric provides the clinician an objective, personalized basis for decisions, such as choosing the severity of the treatment.
Microarray data have been used for gene signature selection to predict clinical outcomes. Many studies have attempted to identify factors that affect model performance, with little success. Fine-tuning of model parameters and optimizing each step of the modeling process often result in over-fitting problems without improving performance.
We propose a quantitative measurement, termed consistency degree, to detect the correlation between disease endpoint and gene expression profile. Different endpoints were shown to have different consistency degrees to gene expression profiles. The validity of this measurement to estimate the consistency was tested with significance at a p-value less than 2.2e-16 for all of the studied endpoints. According to the consistency degree score, overall survival milestone outcome of multiple myeloma was proposed to extend from 730 days to 1561 days, which is more consistent with gene expression profile.
For various clinical endpoints, the maximum predictive powers of different microarray-based models are limited by the correlation between endpoint and gene expression profile of disease samples as indicated by the consistency degree score. In addition, previously defined clinical outcomes can also be reassessed and refined to be more coherent with the related disease gene expression profile. Our findings point to an entirely new direction for assessing microarray-based predictive models and provide important information for gene signature based clinical applications.
The use of gene signatures can potentially be of considerable value in the field of clinical diagnosis. However, gene signatures defined with different methods can vary considerably even when applied to the same disease and the same endpoint. Previous studies have shown that the correct selection of subsets of genes from microarray data is key for the accurate classification of disease phenotypes, and a number of methods have been proposed for the purpose. However, these methods refine the subsets by considering each feature individually, and they do not confirm the association between the genes identified in each gene signature and the phenotype of the disease. We proposed an innovative method termed Minimize Feature's Size (MFS), based on multiple-level similarity analyses and the association between genes and disease, for breast cancer endpoints by comparing classifier models generated in the second phase of MicroArray Quality Control (MAQC-II). The aim was to develop effective meta-analysis strategies to transform the MAQC-II signatures into a robust and reliable set of biomarkers for clinical applications.
We analyzed the similarity of the multiple gene signatures within an endpoint and between the two endpoints of breast cancer at the probe and gene levels. The results indicate that disease-related genes are preferentially selected as components of gene signatures, and that the gene signatures for the two endpoints could be interchangeable. The minimized signatures were built at the probe level by using MFS for each endpoint. By applying this approach, we generated a much smaller gene signature with predictive power similar to that of the gene signatures from MAQC-II.
Our results indicate that gene signatures of both large and small sizes could perform equally well in clinical applications. In addition, consistency and biological significance can be detected among different gene signatures, reflecting the endpoints studied. New classifiers built with MFS exhibit improved performance in both internal and external validation, suggesting that the MFS method generally reduces redundancy among features within gene signatures and improves the performance of the model. Consequently, our strategy will be beneficial for microarray-based clinical applications.
Drug-induced liver injury (DILI) is a significant concern in drug development due to the poor concordance between preclinical and clinical findings of liver toxicity. We hypothesized that the DILI types (hepatotoxic side effects) seen in the clinic can be translated into the development of predictive in silico models for use in the drug discovery phase. We identified 13 hepatotoxic side effects with high accuracy for classifying marketed drugs for their DILI potential. We then developed in silico predictive models for each of these 13 side effects, which were further combined to construct a DILI prediction system (DILIps). The DILIps yielded 60–70% prediction accuracy for three independent validation sets. To enhance the confidence for identification of drugs that cause severe DILI in humans, the “Rule of Three” was developed in DILIps by using a consensus strategy based on 13 models. This gave high positive predictive value (91%) when applied to an external dataset containing 206 drugs from three independent literature datasets. Using the DILIps, we screened all the drugs in DrugBank and investigated their DILI potential in terms of protein targets and therapeutic categories through network modeling. We demonstrated that two therapeutic categories, anti-infectives for systemic use and musculoskeletal system drugs, were enriched for DILI, which is consistent with current knowledge. We also identified protein targets and pathways that are related to drugs that cause DILI by using pathway analysis and co-occurrence text mining. While marketed drugs were the focus of this study, the DILIps has a potential as an evaluation tool to screen and prioritize new drug candidates or chemicals, such as environmental chemicals, to avoid those that might cause liver toxicity. We expect that the methodology can be also applied to other drug safety endpoints, such as renal or cardiovascular toxicity.
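The consensus step and the positive predictive value can be sketched as below. Note the assumptions: the abstract does not spell out the "Rule of Three", so the threshold of at least three positive models is only a reading of its name, and the votes and outcomes are hypothetical toy values rather than study data.

```python
def consensus_call(model_votes, min_positive=3):
    """Flag a drug as DILI-positive when enough of the models agree.

    The default threshold of three is an assumption based on the rule's name.
    """
    return sum(model_votes) >= min_positive


def positive_predictive_value(predicted, actual):
    """PPV = true positives / all predicted positives."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    if true_pos + false_pos == 0:
        raise ValueError("no positive predictions")
    return true_pos / (true_pos + false_pos)


# Hypothetical votes from 13 side-effect models for four drugs.
votes = [
    [True] * 4 + [False] * 9,
    [True] * 2 + [False] * 11,
    [True] * 3 + [False] * 10,
    [False] * 13,
]
predicted = [consensus_call(v) for v in votes]
actual = [True, True, False, False]  # hypothetical known DILI status

ppv = positive_predictive_value(predicted, actual)
```

Raising `min_positive` trades sensitivity for a higher positive predictive value, which is the design choice a consensus strategy of this kind makes.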
Translational research involves utilization of clinical data to address challenges in drug discovery and development. The rationale behind this study is that the side effects observed in clinical trials and post-marketing surveillance can be translated into a screening system for use in drug discovery. As a proof-of-concept study, we developed an in silico system based on 13 hepatotoxic side effects to predict drug-induced liver injury (DILI), which is one of the most frequent causes of drug failure in clinical trials and of post-marketing withdrawal, and also one of the most difficult clinical endpoints to predict from preclinical studies. We first identified 13 types of liver injury that yielded high prediction accuracy in distinguishing drugs known to cause DILI from those that do not. To effectively apply these 13 hepatotoxic side effects to the drug discovery process for DILI, we developed in silico models for each of these side effects based solely on chemical structure data. Finally, we constructed a DILI prediction system (DILIps) by combining these 13 in silico models in a consensus fashion, which yielded >91% positive predictive value for DILI in humans. The DILIps methodology can be extended to address other drug safety issues, such as renal and cardiovascular toxicity.
The Food and Drug Administration (FDA) approved drug labels contain a broad array of information, ranging from adverse drug reactions (ADRs) to drug efficacy, risk-benefit consideration, and more. However, the labeling language used to describe this information is free text, often containing ambiguous semantic descriptions, which poses a great challenge in retrieving useful information from the labeling text in a consistent and accurate fashion for comparative analysis across drugs. Consequently, this task has largely relied on the manual reading of the full text by experts, which is time consuming and labor intensive.
In this study, an unsupervised text mining method called topic modeling was applied to drug labeling with the goal of discovering “topics” that group together drugs with similar safety concerns and/or therapeutic uses. A total of 794 FDA-approved drug labels were used in this study. First, three labeling sections (i.e., Boxed Warning, Warnings and Precautions, Adverse Reactions) of each drug label were processed with the Medical Dictionary for Regulatory Activities (MedDRA) to convert the free text of each label to standard ADR terms. Next, the topic modeling approach with latent Dirichlet allocation (LDA) was applied to generate 100 topics, each associated with a set of drugs grouped together based on the probability analysis. Lastly, the efficacy of the topic modeling was evaluated based on known information about the therapeutic uses and safety data of drugs.
The results demonstrate that drugs grouped by topics are associated with the same safety concerns and/or therapeutic uses with statistical significance (P<0.05). The identified topics have distinct context that can be directly linked to specific adverse events (e.g., liver injury or kidney injury) or therapeutic application (e.g., antiinfectives for systemic use). We were also able to identify potential adverse events that might arise from specific medications via topics.
The successful application of topic modeling on the FDA drug labeling demonstrates its potential utility as a hypothesis generation means to infer hidden relationships of concepts such as, in this study, drug safety and therapeutic use in the study of biomedical documents.
Genomic biomarkers play an increasing role in both preclinical and clinical applications. Development of genomic biomarkers with microarrays is an area of intensive investigation. However, despite sustained and continuing effort, developing microarray-based predictive models (i.e., genomic biomarkers) capable of reliable prediction for an observed or measured outcome (i.e., endpoint) of unknown samples in preclinical and clinical practice remains a considerable challenge. No straightforward guidelines exist for selecting a single model that will perform best when presented with unknown samples. In the second phase of the MicroArray Quality Control (MAQC-II) project, 36 analysis teams produced a large number of models for 13 preclinical and clinical endpoints. Before external validation was performed, each team nominated one model per endpoint (referred to here as 'nominated models') from which MAQC-II experts selected 13 'candidate models' to represent the best model for each endpoint. Both the nominated and candidate models from MAQC-II provide benchmarks to assess other methodologies for developing microarray-based predictive models.
We developed a simple ensemble method by taking a number of the top performing models from cross-validation and developing an ensemble model for each of the MAQC-II endpoints. We compared the ensemble models with both nominated and candidate models from MAQC-II using blinded external validation.
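A minimal sketch of such an ensemble is shown below. It assumes that the selected models are combined by majority vote, which the abstract does not state; the model names, cross-validation scores, and toy classifiers are hypothetical.

```python
from collections import Counter


def top_models(cv_scores, n):
    """Names of the n models with the highest cross-validation score."""
    return sorted(cv_scores, key=cv_scores.get, reverse=True)[:n]


def ensemble_predict(models, selected, sample):
    """Majority vote over the selected models (use an odd n to avoid ties)."""
    votes = Counter(models[name](sample) for name in selected)
    return votes.most_common(1)[0][0]


# Hypothetical classifiers: each maps an expression profile to class 0 or 1.
models = {
    "m1": lambda x: int(x["g1"] > 0.5),
    "m2": lambda x: int(x["g2"] > 0.5),
    "m3": lambda x: int(x["g1"] + x["g2"] > 1.0),
    "m4": lambda x: 0,
}
cv_scores = {"m1": 0.81, "m2": 0.78, "m3": 0.84, "m4": 0.55}

selected = top_models(cv_scores, 3)
label = ensemble_predict(models, selected, {"g1": 0.9, "g2": 0.3})
```

Because no single model has to be picked as the winner of cross-validation, the ensemble is less exposed to a model that scored well by chance on the training folds.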
For 10 of the 13 MAQC-II endpoints originally analyzed by the MAQC-II data analysis team from the National Center for Toxicological Research (NCTR), the ensemble models achieved equal or better predictive performance than the NCTR nominated models. Additionally, the ensemble models had performance comparable to the MAQC-II candidate models. Most ensemble models also had better performance than the nominated models generated by five other MAQC-II data analysis teams that analyzed all 13 endpoints.
Our findings suggest that an ensemble method can often attain a higher average predictive performance in an external validation set than a corresponding “optimized” model method. Using an ensemble method to determine a final model is a potentially important supplement to the good modeling practices recommended by the MAQC-II project for developing microarray-based genomic biomarkers.
Protein-protein interactions (PPIs) are a critical component for many underlying biological processes. A PPI network can provide insight into the mechanisms of these processes, as well as the relationships among different proteins and toxicants that are potentially involved in the processes. There are many PPI databases publicly available, each with a specific focus. The challenge is how to effectively combine their contents to generate a robust and biologically relevant PPI network.
In this study, seven public PPI databases (BioGRID, DIP, HPRD, IntAct, MINT, REACTOME, and SPIKE) were used to explore an approach for combining multiple PPI databases into an integrated PPI network. We developed a novel method called k-votes to create seven different integrated networks by using values of k ranging from 1 to 7. Functional modules were mined by using SCAN, a Structural Clustering Algorithm for Networks. Overall module qualities were evaluated for each integrated network using the following statistical and biological measures: (1) modularity, (2) similarity-based modularity, (3) clustering score, and (4) enrichment.
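SCAN groups nodes according to a structural similarity between adjacent nodes, defined as the number of shared neighbors (each node counted as its own neighbor) normalized by the geometric mean of the two neighborhood sizes. The sketch below computes only that similarity on a hypothetical toy network; the full clustering procedure, with its similarity threshold and minimum-neighbor parameters, is omitted.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of node pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def structural_similarity(adj, u, v):
    """|N[u] & N[v]| / sqrt(|N[u]| * |N[v]|), where N[x] includes x itself."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


# Hypothetical network: a triangle (a, b, c) with a pendant node d.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

sim_ab = structural_similarity(adj, "a", "b")  # inside the triangle
sim_cd = structural_similarity(adj, "c", "d")  # edge to the pendant node
```

Edges inside a densely connected module score close to 1, whereas edges leading out of it score lower, which is what lets the algorithm separate modules from hubs and outliers.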
Each integrated human PPI network was constructed based on the number of votes (k) for a particular interaction from the committee of the original seven PPI databases. The performance of functional modules obtained by SCAN from each integrated network was evaluated. The optimal value for k was determined by the functional module analysis. Our results demonstrate that the k-votes method outperforms the traditional union approach in terms of both statistical significance and biological meaning. The best network is achieved at k=2, which is composed of interactions that are confirmed in at least two PPI databases. In contrast, the traditional union approach yields an integrated network that consists of all interactions of seven PPI databases, which might be subject to high false positives.
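The k-votes construction can be sketched in a few lines: an interaction is kept when at least k source databases report it, so k=1 reproduces the union. The database names and interactions below are hypothetical, and interactions are treated as undirected and counted once per database, which is an assumption about how votes are tallied.

```python
from collections import Counter


def k_votes(databases, k):
    """Interactions reported by at least k of the source databases."""
    votes = Counter()
    for interactions in databases.values():
        # frozenset makes (A, B) and (B, A) the same undirected interaction;
        # the set ensures each database votes at most once per interaction.
        votes.update({frozenset(pair) for pair in interactions})
    return {edge for edge, n in votes.items() if n >= k}


# Hypothetical contents of three source databases.
databases = {
    "db1": [("A", "B"), ("B", "C")],
    "db2": [("B", "A"), ("C", "D")],
    "db3": [("A", "B"), ("C", "D"), ("E", "F")],
}

union_network = k_votes(databases, 1)
two_vote_network = k_votes(databases, 2)
```

Here the union contains four interactions, while requiring two votes keeps only the two that are independently confirmed.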
We determined that the k-votes method for constructing a robust PPI network by integrating multiple public databases outperforms previously reported approaches and that a value of k=2 provides the best results. The developed strategies for combining databases show promise in the advancement of network construction and modeling.
Genomic biomarkers for the detection of drug-induced liver injury (DILI) from blood are urgently needed for monitoring drug safety. We used a unique data set from the Food and Drug Administration-led MicroArray Quality Control Phase-II (MAQC-II) project, consisting of gene expression data from two tissues (blood and liver), to test the cross-tissue predictability of genomic indicators of a form of chemically-induced liver injury. We then used the genomic indicators from the blood as biomarkers for prediction of acetaminophen-induced liver injury and showed that the cross-tissue predictability of a response to the pharmaceutical agent (accuracy as high as 92.1%) is better than, or at least comparable to, that of non-therapeutic compounds. We provide a database of gene expression for the highly informative predictors, which brings biological context to the possible mechanisms involved in DILI. Pathway-based predictors were associated with inflammation, angiogenesis, Toll-like receptor signaling, apoptosis and mitochondrial damage. The results demonstrate for the first time, in support of our hypothesis, that genomic indicators in the blood can serve as potential diagnostic biomarkers predictive of DILI.
prediction; acetaminophen; blood; cross tissue; liver injury; microarray gene expression
To identify in vivo new cardiac binding sites of serum response factor (SRF) in genes and to study the response of these genes to mild over-expression of SRF, we employed a cardiac-specific, transgenic mouse model, with mild over-expression of SRF (Mild-O SRF Tg).
Microarray experiments were performed on hearts of Mild-O-SRF Tg at 6 months of age. We identified 207 genes that are important for cardiac function that were differentially expressed in vivo. Among them, the promoter regions of 192 genes had SRF binding motifs, the classic CArG or CArG-like (CArG-L) elements. Fifty-one of the 56 genes with classic SRF binding sites had not been previously reported. These SRF-modulated genes were grouped into 12 categories based on their function. It was observed that the expression of genes associated with cardiac energy metabolism shifted toward carbohydrate metabolism and away from fatty acid metabolism. The expression of genes involved in transcription and ion regulation was decreased, but the expression of cytoskeletal genes was significantly increased. Using public data from mouse models of hemodynamic stress (GEO database), we also found similarly altered expression of the SRF-modulated genes in hearts with cardiac ischemia or aortic constriction.
Conclusion and significance:
SRF-modulated genes are actively regulated under various physiological and pathological conditions. We have discovered that a large number of cardiac genes have classic SRF binding sites and were significantly modulated in the Mild-O-SRF Tg mouse hearts. Hence, the mild elevation of SRF protein in the heart that is observed during typical adult aging may have a major impact on many SRF-modulated genes, thereby affecting cardiac structure and performance. The results from our study could help to enhance our understanding of SRF regulation of cellular processes in the aged heart.
SRF modulated genes; SRF binding sites; mouse heart; mild-SRF over-expression; gene expression; striated muscle
A number of publications have reported the use of microarray technology to identify gene expression signatures to infer mechanisms and pathways associated with systemic lupus erythematosus (SLE) in human peripheral blood mononuclear cells. However, meta-analysis approaches with microarray data have not been well-explored in SLE.
In this study, a pathway-based meta-analysis was applied to four independent gene expression oligonucleotide microarray data sets to identify gene expression signatures for SLE, and these data sets were confirmed by a fifth independent data set.
Differentially expressed genes (DEGs) were identified in each data set by comparing expression microarray data from control samples and SLE samples. Using Ingenuity Pathway Analysis software, pathways associated with the DEGs were identified in each of the four data sets. Using the leave one data set out pathway-based meta-analysis approach, a 37-gene metasignature was identified. This SLE metasignature clearly distinguished SLE patients from controls as observed by unsupervised learning methods. The final confirmation of the metasignature was achieved by applying the metasignature to a fifth independent data set.
The novel pathway-based meta-analysis approach proved to be a useful technique for grouping disparate microarray data sets. This technique allowed for validated conclusions to be drawn across four different data sets and confirmed by an independent fifth data set. The metasignature and pathways identified by using this approach may serve as a source for identifying therapeutic targets for SLE and may possibly be used for diagnostic and monitoring purposes. Moreover, the meta-analysis approach provides a simple, intuitive solution for combining disparate microarray data sets to identify a strong metasignature.
Please see Research Highlight: http://genomemedicine.com/content/3/5/30
Sulfur mustard (HD, SM) is a chemical warfare agent that within hours causes extensive blistering at the dermal–epidermal junction of skin. To better understand the progression of SM-induced blistering, gene expression profiling of mouse skin was performed after a single high dose of SM exposure. Punch biopsies of mouse ears were collected at both early and late time periods following SM exposure (previous studies only considered early time periods). The biopsies were examined for pathological disturbances and the samples were further assayed for gene expression profiling using the Affymetrix microarray analysis system. Principal component analysis and hierarchical cluster analysis of the differentially expressed genes, performed with ArrayTrack, showed clear separation of the various groups. Pathway analysis employing the KEGG library and Ingenuity Pathway Analysis (IPA) indicated that cytokine–cytokine receptor interaction, cell adhesion molecules (CAMs), and hematopoietic cell lineage are common pathways affected at different time points. Gene ontology analysis identified the most significantly altered biological processes as the immune response, inflammatory response, and chemotaxis; these findings are consistent with other reported results for shorter time periods. Selected genes were verified by RT-PCR, and their expression followed the same general trends as the microarray data. Interleukin 1 beta was assayed to confirm that protein levels correlated with the corresponding microarray data. The impact of a matrix metalloproteinase inhibitor, MMP-2/MMP-9 inhibitor I, against SM exposure was assessed. These results can help in understanding the molecular mechanism of SM-induced blistering, as well as in testing the efficacy of different inhibitors.
Vesicant; Sulfur mustard; Microarray; Alkylating agent; Skin; MMP inhibitor; MMP; Matrix metalloproteinase
High-throughput microarray technology has been widely applied in biological and medical decision-making research during the past decade. However, the diversity of platforms has made it a challenge to re-use and/or integrate datasets generated in different experiments or labs for constructing array-based diagnostic models. Using large toxicogenomics datasets generated using both Affymetrix and Agilent microarray platforms, we carried out a benchmark evaluation of cross-platform consistency in multiple-class prediction using three widely-used machine learning algorithms. After an initial assessment of model performance on different platforms, we evaluated whether predictive signature features selected in one platform could be directly used to train a model in the other platform and whether predictive models trained using data from one platform could predict datasets profiled using the other platform with comparable performance. Our results established that it is possible to successfully apply multiple-class prediction models across different commercial microarray platforms, offering a number of important benefits such as accelerating the possible translation of biomarkers identified with microarrays to clinically-validated assays. However, this investigation focuses on a technical platform comparison and is actually only the beginning of exploring cross-platform consistency. Further studies are needed to confirm the feasibility of microarray-based cross-platform prediction, especially using independent datasets.
The Affymetrix GeneChip® system is a commonly used platform for microarray analysis but the technology is inherently expensive. Unfortunately, changes in experimental planning and execution, such as the unavailability of previously anticipated samples or a shift in research focus, may render significant numbers of pre-purchased GeneChip® microarrays unprocessed before their manufacturer’s expiration dates. Researchers and microarray core facilities wonder whether expired microarrays are still useful for gene expression analysis. In addition, it was not clear whether the two human reference RNA samples established by the MAQC project in 2005 still maintained their transcriptome integrity over a period of four years. Experiments were conducted to answer these questions.
Microarray data were generated in 2009 in three replicates for each of the two MAQC samples with either expired Affymetrix U133A or unexpired U133Plus2 microarrays. These results were compared with data obtained in 2005 on the U133Plus2 microarray. The percentage of overlap between the lists of differentially expressed genes (DEGs) from U133Plus2 microarray data generated in 2009 and in 2005 was 97.44%. While there was some degree of fold change compression in the expired U133A microarrays, the percentage of overlap between the lists of DEGs from the expired and unexpired microarrays was as high as 96.99%. Moreover, the microarray data generated using the expired U133A microarrays in 2009 were highly concordant with microarray and TaqMan® data generated by the MAQC project in 2005.
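The overlap percentages reported above compare two lists of DEGs. A minimal sketch of such a comparison follows; the abstract does not state how the overlap was normalized, so dividing the shared count by the size of the shorter list is an assumption made here, and the gene identifiers are placeholders rather than data from the study.

```python
def percent_overlap(degs_a, degs_b):
    """Percentage of genes shared by two DEG lists.

    Assumption: the shared count is divided by the size of the
    smaller list; the original study may have used another denominator.
    """
    a, b = set(degs_a), set(degs_b)
    if not a or not b:
        return 0.0  # no meaningful overlap with an empty list
    shared = a & b
    return 100.0 * len(shared) / min(len(a), len(b))


# Placeholder identifiers, not genes from the MAQC data.
degs_2005 = ["GENE1", "GENE2", "GENE3", "GENE4"]
degs_2009 = ["GENE2", "GENE3", "GENE4", "GENE5"]
print(percent_overlap(degs_2005, degs_2009))  # 75.0
```

Using sets makes the result independent of list order and of duplicate probe identifiers mapped to the same gene.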
Our results demonstrated that microarray data generated using U133A microarrays, which were more than four years past the manufacturer’s expiration date, were highly specific and consistent with those from unexpired microarrays in identifying DEGs despite some appreciable fold change compression and decrease in sensitivity. Our data also suggested that the MAQC reference RNA samples, stored at -80°C, were stable over a time frame of at least four years.
Advances in microbial genomics and bioinformatics are offering greater insights into the emergence and spread of foodborne pathogens in outbreak scenarios. The Food and Drug Administration (FDA) has developed a genomics tool, ArrayTrack™, which provides extensive functionalities to manage, analyze, and interpret genomic data for mammalian species. ArrayTrack™ has been widely adopted by the research community and used for pharmacogenomics data review in the FDA’s Voluntary Genomics Data Submission program.
ArrayTrack™ has been extended to manage and analyze genomics data from bacterial pathogens of human, animal, and food origin. It was populated with bioinformatics data from public databases such as NCBI, Swiss-Prot, KEGG Pathway, and Gene Ontology to facilitate pathogen detection and characterization. ArrayTrack™’s data processing and visualization tools were enhanced with analysis capabilities designed specifically for microbial genomics including flag-based hierarchical clustering analysis (HCA), flag concordance heat maps, and mixed scatter plots. These specific functionalities were evaluated on data generated from a custom Affymetrix array (FDA-ECSG) previously developed within the FDA. The FDA-ECSG array represents 32 complete genomes of Escherichia coli and Shigella. The new functions were also used to analyze microarray data focusing on antimicrobial resistance genes from Salmonella isolates in a poultry production environment using a universal antimicrobial resistance microarray developed by the United States Department of Agriculture (USDA).
The application of ArrayTrack™ to different microarray platforms demonstrates its utility in microbial genomics research, and thus will improve the capabilities of the FDA to rapidly identify foodborne bacteria and their genetic traits (e.g., antimicrobial resistance, virulence, etc.) during outbreak investigations. ArrayTrack™ is free to use and available to public, private, and academic researchers at http://www.fda.gov/ArrayTrack.
Endocrine disruptors (EDs) and their broad range of potential adverse effects in humans and other animals have been a concern for nearly two decades. Many putative EDs are widely used in commercial products regulated by the Food and Drug Administration (FDA) such as food packaging materials, ingredients of cosmetics, medical and dental devices, and drugs. The Endocrine Disruptor Knowledge Base (EDKB) project was initiated in the mid-1990s by the FDA as a resource for the study of EDs. The EDKB database, a component of the project, contains data across multiple assay types for chemicals across a broad structural diversity. This paper demonstrates the utility of the EDKB database for understanding and prioritizing EDs for testing.
The EDKB database currently contains 3,257 records of over 1,800 EDs from different assays including estrogen receptor binding, androgen receptor binding, uterotropic activity, cell proliferation, and reporter gene assays. Information for each compound such as chemical structure, assay type, potency, etc. is organized to enable efficient searching. A user-friendly interface provides rapid navigation, Boolean searches on EDs, and both spreadsheet and graphical displays for viewing results. The search engine implemented in the EDKB database enables searching by one or more of the following fields: chemical structure (including exact search and similarity search), name, molecular formula, CAS registration number, experiment source, molecular weight, etc. The data can be cross-linked to other publicly available and related databases including TOXNET, Cactus, ChemIDplus, ChemACX, Chem Finder, and NCI DTP.
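The field-based Boolean searching described above can be pictured with a small sketch. It is not the EDKB implementation: the record layout, field names, and helper functions below are hypothetical, and the three records are placeholders rather than entries from the database.

```python
# Hypothetical records; field names and values are illustrative only.
RECORDS = [
    {"name": "compound_A", "assay": "ER binding", "mol_weight": 228.3},
    {"name": "compound_B", "assay": "AR binding", "mol_weight": 312.4},
    {"name": "compound_C", "assay": "ER binding", "mol_weight": 410.0},
]


def field_equals(field, value):
    """Predicate: the record's field equals the given value."""
    return lambda rec: rec.get(field) == value


def field_between(field, low, high):
    """Predicate: the record's numeric field lies in [low, high]."""
    return lambda rec: field in rec and low <= rec[field] <= high


def all_of(*preds):
    """Boolean AND over predicates."""
    return lambda rec: all(p(rec) for p in preds)


def any_of(*preds):
    """Boolean OR over predicates."""
    return lambda rec: any(p(rec) for p in preds)


def search(records, pred):
    """Return the names of records matching the predicate."""
    return [rec["name"] for rec in records if pred(rec)]


# ER-binding records with molecular weight between 200 and 300.
query = all_of(field_equals("assay", "ER binding"),
               field_between("mol_weight", 200, 300))
print(search(RECORDS, query))  # ['compound_A']
```

Composing small predicates keeps each search field independent, so new fields (for example a CAS number match) could be added without changing the search function.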
The EDKB database enables scientists and regulatory reviewers to quickly access ED data from multiple assays for specific or similar compounds. The data have been used to categorize chemicals according to potential risks for endocrine activity, thus providing a basis for prioritizing chemicals for more definitive but expensive testing. The EDKB database is publicly available and can be found online at http://edkb.fda.gov/webstart/edkb/index.html.
Disclaimer: The views presented in this article do not necessarily reflect those of the US Food and Drug Administration.
Recent advances in high-throughput genotyping technology are paving the way for research in personalized medicine and nutrition. However, most of the genetic markers identified from association studies account for a small contribution to the total risk/benefit of the studied phenotypic trait. Testing whether the candidate genes identified by association studies are causal is critically important to the development of personalized medicine and nutrition. An efficient data mining strategy and a set of sophisticated tools are necessary to help better understand and utilize the findings from genetic association studies.
SNP (single nucleotide polymorphism) and QTL (quantitative trait locus) libraries were constructed and incorporated into ArrayTrack, with user-friendly interfaces and powerful search features. Data from several public repositories were collected in the SNP and QTL libraries and connected to other domain libraries (genes, proteins, metabolites, and pathways) in ArrayTrack. Linking the data sets within ArrayTrack allows searching of SNP and QTL data as well as their relationships to other biological molecules. The SNP library includes approximately 15 million human SNPs and their annotations, while the QTL library contains publicly available QTLs identified in mouse, rat, and human. The QTL library was developed for finding the overlap between the map position of a candidate or metabolic gene and QTLs from these species. Two use cases were included to demonstrate the utility of these tools. The SNP and QTL libraries are freely available to the public through ArrayTrack at http://www.fda.gov/ArrayTrack.
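Finding the overlap between a gene's map position and a set of QTLs amounts to an interval-overlap test on the same chromosome. The sketch below illustrates that idea only; it does not reproduce ArrayTrack's implementation, and the `Interval` type, function names, and coordinates are hypothetical placeholders.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """A named region on a chromosome (coordinates are inclusive)."""
    name: str
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and intersect."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def qtls_overlapping_gene(gene: Interval, qtls: List[Interval]) -> List[str]:
    """Names of the QTLs whose region overlaps the gene's map position."""
    return [q.name for q in qtls if overlaps(gene, q)]


# Placeholder coordinates, not real annotations.
gene = Interval("gene_X", "chr1", 1_000, 2_000)
qtls = [
    Interval("qtl_1", "chr1", 500, 1_200),    # overlaps the gene start
    Interval("qtl_2", "chr1", 2_500, 3_000),  # same chromosome, no overlap
    Interval("qtl_3", "chr2", 1_000, 2_000),  # different chromosome
]
print(qtls_overlapping_gene(gene, qtls))  # ['qtl_1']
```

Comparing across species, as the abstract describes, would additionally require mapping positions between genomes, which this sketch leaves out.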
These libraries developed in ArrayTrack contain comprehensive information on SNPs and QTLs and are further cross-linked to other libraries. Connecting domain specific knowledge is a cornerstone of systems biology strategies and allows for a better understanding of the genetic and biological context of the findings from genetic association studies.
Summary: The first open source software suite for experimentalists and curators that (i) assists in the annotation and local management of experimental metadata from high-throughput studies employing one or a combination of omics and other technologies; (ii) empowers users to uptake community-defined checklists and ontologies; and (iii) facilitates submission to international public repositories.
Availability and Implementation: Software, documentation, case studies and implementations at http://www.isa-tools.org
ArrayTrack™ is a Food and Drug Administration (FDA) bioinformatics tool that has been widely adopted by the research community for genomics studies. It provides an integrated environment for microarray data management, analysis and interpretation. Most of its functionality for statistical, pathway and gene ontology analysis can also be applied independently to data generated by other molecular technologies. ArrayTrack has been undergoing active development and enhancement since its inception in 2001. This review summarises its key functionalities, with emphasis on the most recent extensions in support of the evolving needs of FDA's research programmes. ArrayTrack has added capability to manage, analyse and interpret proteomics and metabolomics data after quantification of peptide and metabolite abundance, respectively. Annotation information about single nucleotide polymorphisms and quantitative trait loci has been integrated to support genetics-related studies. Other extensions have been added to manage and analyse genomics data related to bacterial food-borne pathogens.
Microarray; bioinformatics; omics integration; SNP; food-borne pathogens