1.  Interaction networks for identifying coupled molecular processes in microbial communities 
BioData Mining  2015;8:21.
Microbial communities adapt to environmental conditions for optimizing metabolic flux. Such adaption may include cooperative mechanisms eventually resulting in phenotypic observables as emergent properties that cannot be attributed to an individual species alone. Understanding the molecular basis of cross-species cooperation adds to utilization of microbial communities in industrial applications including metal bioleaching and bioremediation processes. With significant advancements in metagenomics the composition of microbial communities became amenable for integrative analysis on the level of entangled molecular processes involving more than one species, in turn offering a data matrix for analyzing the molecular basis of cooperative phenomena.
We present an analysis framework aligned with a dynamical hierarchies concept for unraveling emergent properties in microbial communities, and exemplify this approach for a co-culture setting of At. ferrooxidans and At. thiooxidans. This minimum microbial community demonstrates a significant increase in bioleaching efficiency compared to the activity of individual species, involving mechanisms of the thiosulfate, the polysulfide and the iron oxidation pathway.
Populating gene-centric data structures holding rich functional annotation and interaction information allows deriving network models at the functional level coupling energy production and transport processes of both microbial species. Applying a network segmentation approach on the interaction network of ortholog genes covering energy production and transport proposes a set of specific molecular processes of relevance in bioleaching. The resulting molecular process model essentially involves functionalities such as iron oxidation, nitrogen metabolism and proton transport, complemented by sulfur oxidation and nitrogen metabolism, as well as a set of ion transporter functionalities. At. ferrooxidans-specific genes embedded in the molecular model representation hold gene functions supportive for ammonia utilization as well as for biofilm formation, resembling key elements for effective chalcopyrite bioleaching as emergent property in the co-culture situation.
Analyzing the entangled molecular processes of a microbial community on the level of segmented, gene-centric interaction networks allows identification of core molecular processes and functionalities adding to our mechanistic understanding of emergent properties of microbial consortia.
PMCID: PMC4502522  PMID: 26180552
Network biology; Microbial cooperation; Bioleaching; Chalcopyrite; Acidithiobacillus; Emergence
2.  Uncovering correlated variability in epigenomic datasets using the Karhunen-Loeve transform 
BioData Mining  2015;8:20.
Larger variation exists in epigenomes than in genomes, as a single genome shapes the identity of multiple cell types. With the advent of next-generation sequencing, one of the key problems in computational epigenomics is the poor understanding of correlations and quantitative differences between large scale data sets.
Here we bring to genomics a scenario of functional principal component analysis, a finite Karhunen-Loève transform, and explicitly decompose the variation in the coverage profiles of 27 chromatin mark ChIP-seq datasets at transcription start sites for H1, one of the most used human embryonic stem cell lines. Using this approach we identify positive correlations between H3K4me3 and H3K36me3, as well as between H3K9ac and H3K36me3, so far undetected by the most commonly used Pearson correlation between read enrichment coverages. We uncover highly negative correlations between H2A.Z, H3K4me3, and several histone acetylation marks, but these occur only between principal components of first and second order. We also demonstrate that levels of gene expression correlate significantly with scores of components of order higher than one, demonstrating that transcriptional regulation by histone marks escapes simple one-to-one relationships. These correlations were higher in significance and magnitude in protein coding genes than in non-coding RNAs.
In summary, we present a methodology to explore and uncover novel patterns of epigenomic variability and covariability in genomic data sets by using a functional eigenvalue decomposition of genomic data. R code is available at:
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0051-7) contains supplementary material, which is available to authorized users.
PMCID: PMC4488123  PMID: 26140054
Histone modifications; ChIP-seq; Functional data analysis; Stem cells; H1; Roadmap Epigenomics Consortium; H3K4me3; H3K36me3; H3K9ac; H2A.Z
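The decomposition described in this abstract can be illustrated with a minimal sketch. It is written in Python rather than the authors' R and is not their code. It assumes that each row of `profiles` is one coverage profile sampled at the same positions around a transcription start site. The function name `leading_component` is hypothetical. Only the first principal component is extracted, by power iteration on the sample covariance matrix between positions.

```python
from math import sqrt


def leading_component(profiles, iterations=1000):
    """Return (mean_profile, component, scores) for the first principal component.

    Assumes at least two profiles of equal length and non-zero variance.
    """
    n, m = len(profiles), len(profiles[0])
    # Mean profile across all rows (for example, across all start sites).
    mean = [sum(row[j] for row in profiles) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in profiles]
    # Sample covariance between positions along the profile.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration from an asymmetric start vector; it converges to the
    # leading eigenvector unless the start is orthogonal to it.
    vec = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(m)) for a in range(m)]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Score of each profile: projection of the centered profile on the component.
    scores = [sum(r[j] * vec[j] for j in range(m)) for r in centered]
    return mean, vec, scores
```

Correlating the `scores` of two marks, rather than their raw coverages, is the kind of comparison the abstract contrasts with the plain Pearson correlation. Higher-order components would require deflation or a full eigendecomposition, which this sketch omits.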
3.  TaxKB: a knowledge base for new taxane-related drug discovery 
BioData Mining  2015;8:19.
Taxanes are naturally occurring compounds which belong to a powerful group of chemotherapeutic drugs with anticancer properties. Their current use, clinical efficacy, and unique mechanism of action indicate their potential for cancer drug discovery and development, thereby promising to reduce the high economic burden associated with cancer worldwide. Extensive research has been carried out on taxanes with the aim to combat issues of drug resistance, side effects, limited natural supply, and also to increase the therapeutic index of these molecules. These efforts have led to the isolation of many naturally occurring compounds belonging to this family (more than 350 different kinds), and the synthesis of semisynthetic analogs of the naturally existing molecules (>500), and has also led to the characterization of many (>1000) of them. A web-based database system on clinically exploitable taxanes, providing a link between the structure and the pharmacological property of these molecules could help to reduce the druggability gap for these molecules.
Taxane knowledge base (TaxKB) is an online multi-tier relational database that currently holds data on 42 parameters of 250 natural and 503 semisynthetic analogs of taxanes. This database provides researchers with much-needed information necessary for drug development. TaxKB enables the user to search data on the structure, drug-likeness, and physicochemical properties of both natural and synthetic taxanes with a “General Search” option in addition to a “Parameter Specific Search.” It displays 2D structure and allows the user to download the 3D structure (a PDB file) of taxanes that can be viewed with any molecular visualization tool. The ultimate aim of TaxKB is to provide information on Absorption, Distribution, Metabolism, and Excretion/Toxicity (ADME/T) as well as data on bioavailability and target interaction properties of candidate anticancer taxanes, ahead of expensive clinical trials.
This first web-based single-information portal will play a central role and help researchers to move forward in taxane-based cancer drug research.
PMCID: PMC4485360  PMID: 26131021
Anticancer drugs; Database; Drug discovery; Taxanes; Semisynthetic taxanes
4.  DNA microarray integromics analysis platform 
BioData Mining  2015;8:18.
The study of interactions between molecules belonging to different biochemical families (such as lipids and nucleic acids) requires specialized data analysis methods. This article describes the DNA Microarray Integromics Analysis Platform, a unique web application that focuses on computational integration and analysis of “multi-omics” data. Our tool supports a range of complex analyses, including – among others – low- and high-level analyses of DNA microarray data, integrated analysis of transcriptomics and lipidomics data and the ability to infer miRNA-mRNA interactions.
We demonstrate the characteristics and benefits of the DNA Microarray Integromics Analysis Platform using two different test cases. The first test case involves the analysis of the nutrimouse dataset, which contains measurements of the expression of genes involved in nutritional problems and the concentrations of hepatic fatty acids. The second test case involves the analysis of miRNA-mRNA interactions in polysaccharide-stimulated human dermal fibroblasts infected with porcine endogenous retroviruses.
The DNA Microarray Integromics Analysis Platform is a web-based graphical user interface for “multi-omics” data management and analysis. Its intuitive nature and wide range of available workflows make it an effective tool for molecular biology research. The platform is hosted at
PMCID: PMC4479227  PMID: 26110022
5.  Testing multiple hypotheses through IMP weighted FDR based on a genetic functional network with application to a new zebrafish transcriptome study 
BioData Mining  2015;8:17.
In genome-wide studies, hundreds of thousands of hypothesis tests are performed simultaneously. Bonferroni correction and False Discovery Rate (FDR) can effectively control type I error but often yield a high false negative rate. We aim to develop a more powerful method to detect differentially expressed genes. We present a Weighted False Discovery Rate (WFDR) method that incorporates biological knowledge from genetic networks. We first identify weights using Integrative Multi-species Prediction (IMP) and then apply the weights in WFDR to identify differentially expressed genes through an IMP-WFDR algorithm. We performed a gene expression experiment to identify zebrafish genes that change expression in the presence of arsenic during a systemic Pseudomonas aeruginosa infection. Zebrafish were exposed to arsenic at 10 parts per billion and/or infected with P. aeruginosa. Appropriate controls were included. We then applied IMP-WFDR during the analysis of differentially expressed genes. We compared the mRNA expression for each group and found over 200 differentially expressed genes and several enriched pathways including defense response pathways, arsenic response pathways, and the Notch signaling pathway.
PMCID: PMC4474579  PMID: 26097506
False discovery rate; Family-wise error rate; Genomic studies; Data integration
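The abstract does not spell out the WFDR procedure, so the following is only a sketch of one common weighted variant of the Benjamini-Hochberg step-up rule: each p-value is divided by its weight, with weights rescaled to average 1, and the usual step-up thresholds are then applied. The function name `weighted_bh` is hypothetical, and the weights are plain inputs here; in the study they are derived from IMP, which is not modeled.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Return sorted indices rejected by a weighted Benjamini-Hochberg step-up.

    Assumes every weight is strictly positive.
    """
    if len(pvalues) != len(weights):
        raise ValueError("pvalues and weights must have the same length")
    m = len(pvalues)
    mean_w = sum(weights) / m
    # Rescale weights so that they average 1.
    norm_w = [w / mean_w for w in weights]
    # Weighted p-values, capped at 1.
    q = [min(1.0, p / w) for p, w in zip(pvalues, norm_w)]
    order = sorted(range(m), key=lambda i: q[i])
    # Step-up: find the largest rank k with q_(k) <= alpha * k / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            k_max = rank
    return sorted(order[:k_max])
```

With equal weights this reduces to the ordinary Benjamini-Hochberg procedure. A gene with a large weight needs less extreme evidence to be rejected, which is how prior network knowledge can raise power.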
7.  The effects of electronic medical record phenotyping details on genetic association studies: HDL-C as a case study 
BioData Mining  2015;8:15.
Biorepositories linked to de-identified electronic medical records (EMRs) have the potential to complement traditional epidemiologic studies in genotype-phenotype studies of complex human diseases and traits. A major challenge in meeting this potential is the use of EMR-derived data to extract phenotypes and covariates for genetic association studies. Unlike traditional epidemiologic data, EMR-derived data are collected for clinical care and are therefore highly variable across patients. The variability of clinical data coupled with the challenges associated with searching unstructured clinical notes requires the development of algorithms to extract phenotypes for analysis. Given the number of possible algorithms that could be developed for any one EMR-derived phenotype, we explored here the impact algorithm decision logic has on genetic association study results for a single quantitative trait, high density lipoprotein cholesterol (HDL-C).
We used five different algorithms to extract HDL-C from African American subjects genotyped on the Illumina Metabochip (n = 11,519) as part of Epidemiologic Architecture for Genes Linked to Environment (EAGLE). Tests of association between HDL-C and genetic risk scores for HDL-C associated variants suggest that the genetic effect size does not vary substantially across the five HDL-C definitions.
These data collectively suggest that, at least for this quantitative trait, algorithm decision logic and phenotyping details do not appreciably impact genetic association study test statistics.
PMCID: PMC4428098  PMID: 25969697
Electronic medical record; Genetic risk score; HDL-C; eMERGE network; PAGE I study
8.  Predicting linear B-cell epitopes using amino acid anchoring pair composition 
BioData Mining  2015;8:14.
Accurate identification of linear B-cell epitopes plays an important role in peptide vaccine design, immunodiagnosis, and antibody production. Although several prediction methods have been reported, unsatisfactory accuracy has limited their broad use in linear B-cell epitope prediction. Therefore, developing a reliable model with significantly improved prediction accuracy is highly desirable.
In this study, we developed a novel model for prediction of linear B-cell epitopes, APCpred, which was derived from the combination of amino acid anchoring pair composition (APC) and Support Vector Machine (SVM) methods. Systematic comparisons with the existing prediction models demonstrated that the APCpred method significantly improved the prediction accuracy both in fivefold cross-validation of training datasets and in independent blind datasets. In the fivefold cross-validation test with the Chen872 dataset at a window size of 20, APCpred achieved an AUC of 0.809 and an accuracy (ACC) of 72.94%, which was much more accurate than the existing models, e.g., Bayesb, Chen’s AAP methods and the enhanced combination method of AAP with five AP scales. For the fivefold cross-validation test with the ABC16 dataset, APCpred achieved an improved AUC of 0.794 and ACC of 73.00% at a window size of 16, and attained an AUC of 0.748 and ACC of 67.96% on the Blind387 dataset after being trained with the ABC16 dataset. Trained with the Lbtope_Confirm dataset, APCpred achieved an increased ACC of 55.09% on the FBC934 dataset. Within sequence window sizes from 12 to 20, the final APCpred model on the homology-reduced dataset achieved an optimal AUC of 0.748 and ACC of 68.43% in fivefold cross-validation at a window size of 20.
The APCpred model demonstrated a significant improvement in predicting linear B-cell epitopes using the features of amino acid anchoring pair composition (APC). Based on our study, a webserver has been developed for on-line prediction of linear B-cell epitopes, which is freely accessible at: http:/
PMCID: PMC4449562  PMID: 26029265
Linear B-cell epitopes; Epitopes prediction; Amino acid anchoring pair composition
9.  Mining causal relationships among clinical variables for cancer diagnosis based on Bayesian analysis 
BioData Mining  2015;8:13.
Cancer is the second leading cause of death around the world after cardiovascular diseases. Over the past decades, various data mining studies have tried to predict the outcome of cancer. However, only a few reports describe the causal relationships among clinical variables or attributes, which may provide theoretical guidance for cancer diagnosis and therapy. Different restricted Bayesian classifiers have been used to discover information from numerous domains. This research work designed a novel Bayesian learning strategy to predict cause-specific death classes and proposed a graphical structure of key attributes to clarify the implicit relationships implicated in the data set.
The working mechanisms of 3 classical restricted Bayesian classifiers, namely, NB, TAN and KDB, were analysed and summarised. To retain the properties of global optimisation and high-order dependency representation, the proposed learning algorithm, i.e., flexible K-dependence Bayesian network (FKBN), applies the greedy search of conditional mutual information space to identify the globally optimal ordering of the attributes and to allow the classifiers to be constructed at arbitrary points (values of K) along the attribute dependence spectrum. This method represents the relationships between different attributes by using a directed acyclic graph (DAG) model. A total of 12 data sets were selected from the SEER database and KRBM repository by 10-fold cross-validation for evaluation purposes. The findings revealed that the FKBN model outperformed NB, TAN and KDB.
A Bayesian classifier can graphically describe the conditional dependency among attributes. The proposed algorithm offers a trade-off between probability estimation and network structure complexity. The direct and indirect relationships between the predictive attributes and class variable should be considered simultaneously to achieve global optimisation and high-order dependency representation. By analysing the DAG inferred from the breast cancer data set of the SEER database we divided the attributes into two subgroups, namely, key attributes that should be considered first for cancer diagnosis and those that are independent of each other but are closely related to key attributes. The statistical analysis results clarify some of the causal relationships implicated in the DAG.
PMCID: PMC4404584  PMID: 25901184
Causal relationship; Cancer diagnosis; Restricted Bayesian classifier
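The search described in this abstract is driven by conditional mutual information between attributes given the class. A minimal sketch of that quantity, estimated from counts over discrete attributes, is shown here. It is not the FKBN algorithm itself: the greedy ordering and the choice of K are not reproduced, and the function name `conditional_mutual_information` is hypothetical.

```python
from collections import Counter
from math import log


def conditional_mutual_information(xs, ys, cs):
    """Estimate I(X; Y | C) in nats from three equal-length discrete sequences."""
    n = len(xs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    cmi = 0.0
    for (x, y, c), count in n_xyc.items():
        # p(x,y,c) * log[ p(x,y,c) p(c) / (p(x,c) p(y,c)) ], written in counts.
        cmi += (count / n) * log(count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)]))
    return cmi
```

A value near zero means the two attributes carry no information about each other once the class is known, so an edge between them adds little to a K-dependence structure. Larger values mark candidate attribute-to-attribute dependencies.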
10.  Mining severe drug-drug interaction adverse events using Semantic Web technologies: a case study 
BioData Mining  2015;8:12.
Drug-drug interactions (DDIs) are a major contributing factor for unexpected adverse drug events (ADEs). However, few knowledge resources cover the severity information of ADEs that is critical for prioritizing the medical need. The objective of the study is to develop and evaluate a Semantic Web-based approach for mining severe DDI-induced ADEs.
We utilized a normalized FDA Adverse Event Reporting System (AERS) dataset and performed a case study of three frequently prescribed cardiovascular drugs: Warfarin, Clopidogrel and Simvastatin. We extracted putative DDI-ADE pairs and their associated outcome codes. We developed a pipeline to filter the associations using ADE datasets from SIDER and PharmGKB. We also performed signal enrichment using electronic medical records (EMR) data. We leveraged the Common Terminology Criteria for Adverse Events (CTCAE) grading system and classified the DDI-induced ADEs into the CTCAE in the Web Ontology Language (OWL).
We identified 601 DDI-ADE pairs for the three drugs using the filtering pipeline, of which 61 pairs are in Grade 5, 56 pairs in Grade 4 and 484 pairs in Grade 3. Among 601 pairs, the signals of 59 DDI-ADE pairs were identified from the EMR data.
The approach developed could be generalized to detect the signals of putative severe ADEs induced by DDIs in other drug domains and would be useful for supporting translational and pharmacovigilance study of severe ADEs.
PMCID: PMC4379609  PMID: 25829948
Drug-drug Interaction; Adverse drug event; Data mining; Semantic web technology; Electronic medical records
11.  Bacterial rose garden for metagenomic SNP-based phylogeny visualization 
BioData Mining  2015;8:10.
One of the most challenging tasks in genomic analysis nowadays is metagenomics. Biomedical applications of metagenomics give rise to datasets containing hundreds and thousands of samples from various body sites for hundreds of patients. A metagenome is inherently far more complex than a single genome, as the abundance of the bacteria comprising it varies over time. Other levels of data complexity include the geography of the samples and the phylogenetic distance between the genomes of the same operational taxonomic unit (OTU). We have developed a visualization concept for the representation of multilayer metagenomics data: the bacterial rose garden. The approach displays the taxonomic distance between the representatives of the same OTU in different samples and can use a variety of metadata for display.
We have developed a visualization principle that allows multilayer information to be represented. We have incorporated data on OTU diversity across metagenomes and the origin of the samples. The visual representation, which we call a “rose”, focuses on the phylogenetic distance between the representatives of the same OTU. It is realized as an interactive data chart that allows the user to interact with the data and explore variables. It is known that the classical representation of the taxonomic tree is a reduction of information from the original pairwise distance matrix. The visualization presented is a way to retain all the available information through projection of the distance matrix into the one-dimensional space of one sample. It could serve as a basis for further, more complex information representation. We used the proposed principle to visualize the phylogenetic distances of 101 bacterial OTUs, and we provide open code for web page generation.
The bacterial rose garden is a versatile visualization principle that copes with the major difficulties of metagenomic big-data visualization without loss of data. The proposed method shows the interconnectedness of variables and is realized as a user-friendly web page allowing for dynamic data exploration. The concept serves as an original approach for metagenomic data representation and sharing. A fully functional prototype can be found at
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0045-5) contains supplementary material, which is available to authorized users.
PMCID: PMC4374582  PMID: 25815061
Metagenomic data visualization; Rose garden; Gut microbiota; Phylogeny visualization
12.  A bibliometric analysis on tobacco regulation investigators 
BioData Mining  2015;8:11.
To facilitate the implementation of the Family Smoking Prevention and Tobacco Control Act of 2009, the Food and Drug Administration (FDA) Center for Tobacco Products (CTP) has identified research priorities under the umbrella of tobacco regulatory science (TRS). As a newly integrated field, the current boundaries and landscape of TRS research are in need of definition. In this work, we conducted a bibliometric study of TRS research by applying author topic modeling (ATM) on MEDLINE citations published by currently-funded TRS principal investigators (PIs).
We compared topics generated with ATM on a dataset collected with TRS PIs and topics generated with ATM on a dataset collected with a TRS keyword list. All of these topics align well with the FDA’s funding protocols. More interestingly, we see clear interactive relationships among PIs and between PIs and topics. Based on those interactions, we can discover how diverse each PI is, how productive they are, which topics are more popular and what main components each topic involves. Temporal trend analysis of keywords shows the significant evaluation in four prime TRS areas.
The results show that ATM can efficiently group articles into discriminative categories without any supervision. This indicates that we may incorporate ATM into author identification systems to infer the identity of an author of articles using topics generated by the model. It can also be useful to grantees and funding administrators in suggesting potential collaborators or identifying those that share common research interests for data harmonization or other purposes. The incorporation of temporal analysis can be employed to assess the change over time in TRS as new projects are funded and the extent to which new research reflects the funding priorities of the FDA.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0043-7) contains supplementary material, which is available to authorized users.
PMCID: PMC4432889  PMID: 25984237
Author topic modeling; Bibliometric analysis; Tobacco regulation science; FDA; Principal investigators
13.  Cancer based pharmacogenomics network supported with scientific evidences: from the view of drug repurposing 
BioData Mining  2015;8:9.
Pharmacogenomics (PGx), as an emerging field, is poised to change the way we practice medicine and deliver health care by customizing drug therapies on the basis of each patient’s genetic makeup. A large volume of PGx data including information among drugs, genes, and single nucleotide polymorphisms (SNPs) has been accumulated. Normalized and integrated PGx information could facilitate revelation of hidden relationships among drug treatments, genomic variations, and phenotype traits to better support drug discovery and the next generation of treatment.
In this study, we generated a normalized and scientific evidence supported cancer based PGx network (CPN) by integrating cancer related PGx information from multiple well-known PGx resources including the Pharmacogenomics Knowledge Base (PharmGKB), the FDA PGx Biomarkers in Drug Labeling, and the Catalog of Published Genome-Wide Association Studies (GWAS). We successfully demonstrated the capability of the CPN for drug repurposing by conducting two case studies.
The CPN established in this study offers comprehensive cancer based PGx information to support cancer orientated research, especially for drug repurposing.
PMCID: PMC4345035  PMID: 25729430
Pharmacogenomics; Cancer; Network; Drug repurposing
14.  Visualisation of quadratic discriminant analysis and its application in exploration of microbial interactions 
BioData Mining  2015;8:8.
When comparing diseased and non-diseased patients in order to discriminate between the aspects associated with the specific disease, it is often observed that the diseased patients have more variability than the non-diseased patients. Such cases require quadratic discriminant analysis, which is based on the estimation of different covariance structures for the different groups. Having different covariance matrices means that the canonical variate transformation cannot be used to obtain a visual representation of the discrimination and group separation.
In this paper an alternative method is proposed: combining the different transformations for the different groups into a single representation of the sample points with classification regions. In order to associate the differences in variables with group discrimination, a biplot is produced which includes information on the variables, samples and their relationship.
PMCID: PMC4369096  PMID: 25798196
Quadratic discriminant analysis; Canonical variate analysis; Biplots
15.  Big data - a 21st century science Maginot Line? No-boundary thinking: shifting from the big data paradigm 
BioData Mining  2015;8:7.
Whether your interests lie in scientific arenas, the corporate world, or in government, you have certainly heard the praises of big data: Big data will give you new insights, allow you to become more efficient, and/or will solve your problems. While big data has had some outstanding successes, many are now beginning to see that it is not the Silver Bullet that it has been touted to be. Here our main concern is the overall impact of big data; the current manifestation of big data is constructing a Maginot Line in science in the 21st century. Big data is not “lots of data” as a phenomenon anymore; the big data paradigm is putting the spirit of the Maginot Line into lots of data. Big data overall is disconnecting researchers and science challenges. We propose No-Boundary Thinking (NBT), applying no-boundary thinking in problem defining to address science challenges.
PMCID: PMC4323225  PMID: 25670967
Big data; Maginot Line; No-Boundary thinking
16.  An investigation of gene-gene interactions in dose-response studies with Bayesian nonparametrics 
BioData Mining  2015;8:6.
Best practice for statistical methodology in cell-based dose-response studies has yet to be established. We examine the ability of MANOVA to detect trait-associated genetic loci in the presence of gene-gene interactions. We present a novel Bayesian nonparametric method designed to detect such interactions.
MANOVA and the Bayesian nonparametric approach show good ability to detect trait-associated genetic variants under various possible genetic models. It is shown through several sets of analyses that this may be due to marginal effects being present, even if the underlying genetic model does not explicitly contain them.
Understanding how genetic interactions affect drug response continues to be a critical goal. MANOVA and the novel Bayesian framework present a trade-off between computational complexity and model flexibility.
PMCID: PMC4330980  PMID: 25691918
Dose-response; Epistasis; Bayesian nonparametric; Neural network; Machine learning
17.  Differential co-expression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scale-free structure 
BioData Mining  2015;8:5.
Biological insights into group differences, such as disease status, have been achieved through differential co-expression analysis of microarray data. Additional understanding of group differences may be achieved by integrating the connectivity structure of the differential co-expression network and per-gene differential expression between phenotypic groups. Such a global differential co-expression network strategy may increase sensitivity to detect gene-gene interactions (or expression epistasis) that may act as candidates for rewiring susceptibility co-expression networks.
We test two methods for inferring Genetic Association Interaction Networks (GAIN) incorporating both differential co-expression effects and differential expression effects: a generalized linear model (GLM) regression method with interaction effects (reGAIN) and a Fisher test method for correlation differences (dcGAIN). We rank the importance of each gene with complete interaction network centrality (CINC), which integrates each gene’s differential co-expression effects in the GAIN model along with each gene’s individual differential expression measure. We compare these methods with statistical learning methods Relief-F, Random Forests and Lasso. We also develop a mixture model and permutation approach for determining significant importance score thresholds for network centralities, Relief-F and Random Forest. We introduce a novel simulation strategy that generates microarray case–control data with embedded differential co-expression networks and underlying correlation structure based on scale-free or Erdos-Renyi (ER) random networks.
Using the network simulation strategy, we find that Relief-F and reGAIN provide the best balance between detecting interactions and main effects, plus reGAIN has the ability to adjust for covariates and model quantitative traits. The dcGAIN approach performs best at finding differential co-expression effects by design but worst for main effects, and it does not adjust for covariates and is limited to dichotomous outcomes. When the underlying network is scale free instead of ER, all interaction network methods have greater power to find differential co-expression effects. We apply these methods to a public microarray study of the differential immune response to influenza vaccine, and we identify effects that suggest a role in influenza vaccine immune response for genes from the PI3K family, which includes genes with known immunodeficiency function, and KLRG1, which is a known marker of senescence.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0040-x) contains supplementary material, which is available to authorized users.
PMCID: PMC4326454  PMID: 25685197
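The abstract describes dcGAIN only as a Fisher test for correlation differences. One standard test of that kind, assumed here rather than confirmed by the source, applies the Fisher r-to-z transform to a gene pair's correlation in each group and compares the two. The function name `correlation_difference_z` is hypothetical.

```python
from math import atanh, erfc, sqrt


def correlation_difference_z(r_case, n_case, r_ctrl, n_ctrl):
    """Return (z, two-sided p) for the difference of two Pearson correlations.

    r_case, r_ctrl: correlations of one gene pair in cases and in controls,
    each strictly between -1 and 1. n_case, n_ctrl: group sizes, each > 3.
    """
    if n_case <= 3 or n_ctrl <= 3:
        raise ValueError("each group needs more than 3 samples")
    # Fisher transform makes the sampling distribution approximately normal
    # with variance 1 / (n - 3).
    diff = atanh(r_case) - atanh(r_ctrl)
    se = sqrt(1.0 / (n_case - 3) + 1.0 / (n_ctrl - 3))
    z = diff / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p = erfc(abs(z) / sqrt(2.0))
    return z, p
```

Applied to every gene pair, the resulting z statistics could populate the off-diagonal entries of an interaction matrix. How the study combines them with main effects for the centrality ranking is not shown here.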
18.  Mining the entire Protein DataBank for frequent spatially cohesive amino acid patterns 
BioData Mining  2015;8:4.
The three-dimensional structure of a protein is an essential aspect of its functionality. Despite the large diversity in protein structures and functionality, it is known that there are common patterns and preferences in the contacts between amino acid residues, or between residues and other biomolecules, such as DNA. The discovery and characterization of these patterns is an important research topic within structural biology as it can give fundamental insight into protein structures and can aid in the prediction of unknown structures.
Here we apply an efficient spatial pattern miner to search for sets of amino acids that occur frequently in close spatial proximity in the protein structures of the Protein DataBank. This allowed us to mine for a new class of amino acid patterns, that we term FreSCOs (Frequent Spatially Cohesive Component sets), which feature synergetic combinations. To demonstrate the relevance of these FreSCOs, they were compared in relation to the thermostability of the protein structure and the interaction preferences of DNA-protein complexes. In both cases, the results matched well with prior investigations using more complex methods on smaller data sets.
The currently characterized protein structures feature a diverse set of frequent amino acid patterns that can be related to the stability of the protein molecular structure and that are independent from protein function or specific conserved domains.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0038-4) contains supplementary material, which is available to authorized users.
PMCID: PMC4318390  PMID: 25657820
Protein structure; Frequent pattern mining; Thermostability; Protein-DNA complexes
19.  Microarray enriched gene rank 
BioData Mining  2015;8:2.
We develop a new concept that reflects how genes are connected based on microarray data using the coefficient of determination (the squared Pearson correlation coefficient). Our measure, called the microarray enriched gene rank, or simply gene rank (GR), combines a priori knowledge about gene connectivity, say, from the Gene Ontology (GO) database, with the microarray expression data at hand. GR, similarly to Google PageRank, is defined in a recursive fashion and is computed as the left maximum eigenvector of a stochastic matrix derived from microarray expression data. An efficient algorithm is devised that allows computation of GR for 50 thousand genes with 500 samples within minutes on a personal computer using the public domain statistical package R.
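The recursive definition can be illustrated with a small sketch. The code below builds a row-stochastic matrix from squared Pearson correlations between genes and finds its dominant left eigenvector by power iteration. It is a toy version under stated assumptions, not the authors' algorithm: the damping factor, the row normalization, the omission of prior GO connectivity, and the function names `pearson_r2` and `gene_rank` are all choices made here for illustration.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return (sxy * sxy) / (sxx * syy)


def gene_rank(expr, damping=0.85, tol=1e-12, max_iter=1000):
    """expr: list of per-gene expression vectors (one list per gene).

    Returns a list of ranks that sums to 1. The damping factor is an
    assumption borrowed from PageRank, not taken from the paper.
    """
    g = len(expr)
    # Edge weights: r^2 between distinct genes, zero on the diagonal.
    weights = [
        [pearson_r2(expr[i], expr[j]) if i != j else 0.0 for j in range(g)]
        for i in range(g)
    ]
    # Normalise each row to sum to 1; isolated genes get a uniform row.
    trans = []
    for row in weights:
        s = sum(row)
        trans.append([v / s for v in row] if s > 0 else [1.0 / g] * g)
    rank = [1.0 / g] * g
    for _ in range(max_iter):
        # Left multiplication: new_j = sum_i rank_i * trans[i][j]
        new = [
            (1 - damping) / g
            + damping * sum(rank[i] * trans[i][j] for i in range(g))
            for j in range(g)
        ]
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank
```

With three toy genes, two of which are perfectly correlated and one uncorrelated with both, the correlated pair receives equal and higher rank than the isolated gene.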
Computation of GR is illustrated with several microarray data sets. In particular, we apply GR (1) to answer whether bad genes are more connected than good genes in relation to cancer patient survival, (2) to associate gene connectivity with clusters/subtypes in ovarian cancer tumors, and to determine whether gene connectivity changes (3) from organ to organ within the same organism and (4) between organisms.
We have shown by examples that findings based on GR confirm biological expectations. GR may be used for hypothesis generation on gene pathways. It may be used for a homogeneous sample, for comparison of gene connectivity among cases and controls, or in a longitudinal setting.
PMCID: PMC4305247  PMID: 25649242
20.  Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks 
BioData Mining  2015;8:1.
Genetic studies are increasingly based on short, noisy next generation scanners. Typically, complete DNA sequences are assembled by matching short NextGen sequences against reference genomes. Despite considerable algorithmic gains since the turn of the millennium, matching both single-ended and paired-end strings to a reference remains computationally demanding. Further, tailoring bioinformatics tools to each new task or scanner remains highly skilled and labour intensive. With this in mind, we recently demonstrated a genetic programming based automated technique which generated a version of the state-of-the-art alignment tool Bowtie2 which was considerably faster on short sequences produced by a scanner at the Broad Institute and released as part of The Thousand Genome Project.
Bowtie2 GP and the original Bowtie2 release were compared on bioplanet’s GCAT synthetic benchmarks. Bowtie2 GP enhancements were also applied to the latest Bowtie2 release (2.2.3, 29 May 2014) and retained both the GP and the manually introduced improvements.
On both single-ended and paired-end synthetic next generation DNA sequence GCAT benchmarks, Bowtie2 GP runs up to 45% faster than Bowtie2. The loss in accuracy can be as little as 0.2–0.5% but is up to 2.5% for longer sequences.
PMCID: PMC4304608  PMID: 25621011
Double-ended DNA sequence; High throughput Solexa 454 nextgen NGS sequence query; Rapid fuzzy string matching; Homo sapiens genome reference consortium HG19
21.  Linked vaccine adverse event data from VAERS for biomedical data analysis and longitudinal studies 
BioData Mining  2014;7:36.
Vaccines have been one of the most successful public health interventions to date. The use of vaccination, however, sometimes comes with possible adverse events. The U.S. FDA/CDC Vaccine Adverse Event Reporting System (VAERS) currently contains more than 200,000 reports for post-vaccination events that occur after the administration of vaccines licensed in the United States. Although the data from VAERS have been applied to many public health and vaccine safety studies, each individual report does not necessarily indicate a causal relationship between the vaccine and the reported symptoms. Further statistical analysis and summarization need to be done before these data can be leveraged.
This paper introduces our efforts on representing the vaccine-symptom correlations and their corresponding meta-information extracted from the VAERS database using the Resource Description Framework (RDF). The numbers of occurrences of vaccine-symptom pairs reported to VAERS were summarized, and the corresponding proportional reporting ratios (PRR) were calculated. All the data were stored in an RDF file. We then applied network analysis approaches to the RDF data to illustrate a use case of the data for longitudinal studies. We further discuss our vision of integrating the data with vaccine information from other sources using an RDF linked approach to facilitate more comprehensive analyses.
The 1990–2013 data were extracted from the VAERS database. There are 83,148 unique vaccine-symptom pairs with 75 vaccine types and 5,865 different reported symptoms. The yearly and overall PRR values for each reported vaccine-symptom pair were calculated. The network properties of networks consisting of significant vaccine-symptom associations (i.e., PRR larger than 1) were then investigated. The results indicated that the vaccine-symptom association network is a dense network, with any given node connected to all other nodes through an average of approximately two other nodes and a maximum of five nodes.
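For readers unfamiliar with the PRR, the sketch below computes it from a flat list of (vaccine, symptom) occurrences using the usual 2x2 definition: the share of a vaccine's reports that mention a symptom, divided by the share of all other vaccines' reports that mention the same symptom. The input layout, the function name, and the handling of zero denominators are assumptions made for illustration; the abstract does not specify how the authors counted reports.

```python
from collections import Counter


def proportional_reporting_ratios(reports):
    """reports: iterable of (vaccine, symptom) pairs, one per occurrence.

    Returns {(vaccine, symptom): PRR}. A pair whose symptom was never
    reported for any other vaccine gets float('inf') (an assumption).
    """
    pair_counts = Counter(reports)
    vaccine_totals = Counter()
    symptom_totals = Counter()
    for (vaccine, symptom), n in pair_counts.items():
        vaccine_totals[vaccine] += n
        symptom_totals[symptom] += n
    total = sum(pair_counts.values())

    prr = {}
    for (vaccine, symptom), a in pair_counts.items():
        # a: this symptom with this vaccine
        # c: this symptom with all other vaccines
        c = symptom_totals[symptom] - a
        other_reports = total - vaccine_totals[vaccine]
        if c == 0 or other_reports == 0:
            prr[(vaccine, symptom)] = float("inf")
        else:
            prr[(vaccine, symptom)] = (a / vaccine_totals[vaccine]) / (c / other_reports)
    return prr
```

For example, if a hypothetical vaccine "A" has fever in 4 of its 10 reports while all other vaccines have fever in 2 of 10, the PRR for ("A", "fever") is 0.4 / 0.2 = 2, which would count as an association under the PRR > 1 threshold mentioned above.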
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-014-0036-y) contains supplementary material, which is available to authorized users.
PMCID: PMC4333877  PMID: 25699091
22.  Combining functional genomics strategies identifies modular heterogeneity of breast cancer intrinsic subtypes 
BioData Mining  2014;7:27.
The discovery of breast cancer subtypes and subsequent development of treatments aimed at them has allowed for a great reduction in the mortality of breast cancer. But despite this progress, tumors with similar characteristics that belong to the same subtype continue to respond differently to the same treatment. Five subtypes of breast cancer, namely intrinsic subtypes, have been characterized to date based on their gene expression profiles. Among other characteristics, subtypes vary in their degree of intra-subtype heterogeneity. It is not clear, however, whether this heterogeneity is shared across all tumor traits. It is also unclear whether individual traits can be highly heterogeneous among a majority of homogeneous traits.
We employ network theory to uncover gene modules and accordingly consider them as tumor traits, which capture shared biological processes among the subtypes. We then use the β-diversity metric from ecology to quantify the heterogeneity in these gene modules. In doing so, we show that breast cancer heterogeneity is contained in gene modules and that this modular heterogeneity increases monotonically across the subtypes. We identify a core of two modules that are shared among all subtypes which contain nucleosome assembly and mammary morphogenesis genes, and a number of modules that are specific to subtypes. This modular heterogeneity, which increases with global heterogeneity, relates to tumor aggressiveness. Indeed, we observe that Luminal A, the most treatable of subtypes, has the lowest modular heterogeneity whereas the Basal-like subtype, which is among the hardest to treat, has the highest. Furthermore, our analysis shows that a higher degree of global heterogeneity does not imply higher heterogeneity for all modules, as Luminal B shows the highest heterogeneity for core modules.
Overall, modular heterogeneity provides a framework with which to dissect cancer heterogeneity and better understand its underpinnings, thereby ultimately advancing our knowledge towards a more effective personalized cancer therapy.
Electronic supplementary material
The online version of this article (doi:10.1186/1756-0381-7-27) contains supplementary material, which is available to authorized users.
PMCID: PMC4350320  PMID: 25745517
Breast cancer subtype; Heterogeneity; β-diversity; Gene module
23.  Identifying genetic interactions associated with late-onset Alzheimer’s disease 
BioData Mining  2014;7:35.
Identifying genetic interactions in data obtained from genome-wide association studies (GWASs) can help in understanding the genetic basis of complex diseases. The large number of single nucleotide polymorphisms (SNPs) in GWASs, however, makes the identification of genetic interactions computationally challenging. We developed the Bayesian Combinatorial Method (BCM), which can identify pairs of SNPs that in combination have high statistical association with disease.
We applied BCM to two late-onset Alzheimer’s disease (LOAD) GWAS datasets to identify SNPs that interact with known Alzheimer associated SNPs. We also compared BCM with logistic regression that is implemented in PLINK. Gene Ontology analysis of genes from the top 200 dataset SNPs for both GWAS datasets showed overrepresentation of LOAD-related terms. Four genes were common to both datasets: APOE and APOC1, which have well established associations with LOAD, and CAMK1D and FBXL13, not previously linked to LOAD but having evidence of involvement in LOAD. Supporting evidence was also found for additional genes from the top 30 dataset SNPs.
BCM performed well in identifying several SNPs having evidence of involvement in the pathogenesis of LOAD that would not have been identified by univariate analysis due to small main effect. These results provide support for applying BCM to identify potential genetic variants such as SNPs from high dimensional GWAS datasets.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-014-0035-z) contains supplementary material, which is available to authorized users.
PMCID: PMC4300162  PMID: 25649863
Genome-wide association study; Epistasis; Alzheimer’s disease; Bayesian networks
24.  Synthetic learning machines 
BioData Mining  2014;7:28.
Using a collection of random forests constructed with different terminal node sizes, each generating a synthetic feature, a synthetic random forest is defined as a kind of hyperforest, calculated using the new synthetic input features along with the original features.
Using a large collection of regression and multiclass datasets, we show that synthetic random forests outperform both conventional random forests and the optimized forest from the regression portfolio.
Synthetic forests remove the need for tuning random forests, with no additional effort on the part of the researcher. Importantly, the synthetic forest does this with evidently no loss in prediction compared to a well-optimized single random forest.
PMCID: PMC4279689  PMID: 25614764
Machine; Nodesize; Random forest; Trees; Synthetic feature
25.  Integrative genomics and transcriptomics analysis of human embryonic and induced pluripotent stem cells 
BioData Mining  2014;7:32.
Human genomic variations, including single nucleotide polymorphisms (SNPs) and copy number variations (CNVs), are associated with several phenotypic traits varying from mild features to hereditary diseases. Several genome-wide studies have reported genomic variants that correlate with gene expression levels in various tissue and cell types.
We studied human embryonic stem cells (hESCs) and human induced pluripotent stem cells (hiPSCs), measuring the SNPs and CNVs with Affymetrix SNP 6 microarrays and expression values with Affymetrix Exon microarrays. We computed the linear relationships between SNPs and expression levels of exons, transcripts and genes, and the associations between gene CNVs and gene expression levels. Further, for a few of the resulting genes, the expression value was associated with both CNVs and SNPs. Our results revealed altogether 217 genes and 584 SNPs whose genomic alterations affect the transcriptome in the same cells. We analyzed the enriched pathways and gene ontologies within these groups of genes, and found that the terms related to alternative splicing and development were enriched.
Our results revealed that in human pluripotent stem cells, the expression values of several genes, transcripts and exons were affected by genomic variation.
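A linear SNP-expression relationship of the kind described can be sketched as an ordinary least-squares fit of expression on genotype. The coding of genotypes as allele counts (0, 1, 2), the function name, and the returned statistics are assumptions for illustration; the abstract does not state the exact model or software the authors used.

```python
import math


def linear_association(genotypes, expression):
    """Least-squares fit of expression on genotype dosage.

    genotypes: allele counts per sample (assumed coding 0, 1, 2).
    expression: expression value per sample, same order.
    Returns (slope, intercept, pearson_r).
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need at least three paired samples")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_ee = sum((e - mean_e) ** 2 for e in expression)
    s_ge = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if s_gg == 0:
        raise ValueError("SNP is monomorphic in this sample")
    slope = s_ge / s_gg
    intercept = mean_e - slope * mean_g
    r = s_ge / math.sqrt(s_gg * s_ee) if s_ee > 0 else 0.0
    return slope, intercept, r
```

In a genome-wide setting the same fit would be repeated for every SNP-feature pair and followed by multiple-testing correction, which this sketch leaves out.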
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-014-0032-2) contains supplementary material, which is available to authorized users.
PMCID: PMC4298950  PMID: 25649046
hESC; hiPSC; Association analysis; SNP; CNV; Gene expression; Exon expression; Transcript expression
