Search tips
Search criteria

Results 1-25 (139)

Clipboard (0)

Select a Filter Below

Year of Publication
1.  Big data - a 21st century science Maginot Line? No-boundary thinking: shifting from the big data paradigm 
BioData Mining  2015;8:7.
Whether your interests lie in scientific arenas, the corporate world, or in government, you have certainly heard the praises of big data: Big data will give you new insights, allow you to become more efficient, and/or will solve your problems. While big data has had some outstanding successes, many are now beginning to see that it is not the Silver Bullet that it has been touted to be. Here our main concern is the overall impact of big data; the current manifestation of big data is constructing a Maginot Line in science in the 21st century. Big data is not “lots of data” as a phenomena anymore; The big data paradigm is putting the spirit of the Maginot Line into lots of data. Big data overall is disconnecting researchers and science challenges. We propose No-Boundary Thinking (NBT), applying no-boundary thinking in problem defining to address science challenges.
PMCID: PMC4323225  PMID: 25670967
Big data; Maginot Line; No-Boundary thinking
2.  An investigation of gene-gene interactions in dose-response studies with Bayesian nonparametrics 
BioData Mining  2015;8:6.
Best practice for statistical methodology in cell-based dose-response studies has yet to be established. We examine the ability of MANOVA to detect trait-associated genetic loci in the presence of gene-gene interactions. We present a novel Bayesian nonparametric method designed to detect such interactions.
MANOVA and the Bayesian nonparametric approach show good ability to detect trait-associated genetic variants under various possible genetic models. It is shown through several sets of analyses that this may be due to marginal effects being present, even if the underlying genetic model does not explicitly contain them.
Understanding how genetic interactions affect drug response continues to be a critical goal. MANOVA and the novel Bayesian framework present a trade-off between computational complexity and model flexibility.
PMCID: PMC4330980
Dose-response; Epistasis; Bayesian nonparametric; Neural network; Machine learning
3.  Differential co-expression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scale-free structure 
BioData Mining  2015;8:5.
Biological insights into group differences, such as disease status, have been achieved through differential co-expression analysis of microarray data. Additional understanding of group differences may be achieved by integrating the connectivity structure of the differential co-expression network and per-gene differential expression between phenotypic groups. Such a global differential co-expression network strategy may increase sensitivity to detect gene-gene interactions (or expression epistasis) that may act as candidates for rewiring susceptibility co-expression networks.
We test two methods for inferring Genetic Association Interaction Networks (GAIN) incorporating both differential co-expression effects and differential expression effects: a generalized linear model (GLM) regression method with interaction effects (reGAIN) and a Fisher test method for correlation differences (dcGAIN). We rank the importance of each gene with complete interaction network centrality (CINC), which integrates each gene’s differential co-expression effects in the GAIN model along with each gene’s individual differential expression measure. We compare these methods with statistical learning methods Relief-F, Random Forests and Lasso. We also develop a mixture model and permutation approach for determining significant importance score thresholds for network centralities, Relief-F and Random Forest. We introduce a novel simulation strategy that generates microarray case–control data with embedded differential co-expression networks and underlying correlation structure based on scale-free or Erdos-Renyi (ER) random networks.
Using the network simulation strategy, we find that Relief-F and reGAIN provide the best balance between detecting interactions and main effects, plus reGAIN has the ability to adjust for covariates and model quantitative traits. The dcGAIN approach performs best at finding differential co-expression effects by design but worst for main effects, and it does not adjust for covariates and is limited to dichotomous outcomes. When the underlying network is scale free instead of ER, all interaction network methods have greater power to find differential co-expression effects. We apply these methods to a public microarray study of the differential immune response to influenza vaccine, and we identify effects that suggest a role in influenza vaccine immune response for genes from the PI3K family, which includes genes with known immunodeficiency function, and KLRG1, which is a known marker of senescence.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0040-x) contains supplementary material, which is available to authorized users.
PMCID: PMC4326454
4.  Mining the entire Protein DataBank for frequent spatially cohesive amino acid patterns 
BioData Mining  2015;8:4.
The three-dimensional structure of a protein is an essential aspect of its functionality. Despite the large diversity in protein structures and functionality, it is known that there are common patterns and preferences in the contacts between amino acid residues, or between residues and other biomolecules, such as DNA. The discovery and characterization of these patterns is an important research topic within structural biology as it can give fundamental insight into protein structures and can aid in the prediction of unknown structures.
Here we apply an efficient spatial pattern miner to search for sets of amino acids that occur frequently in close spatial proximity in the protein structures of the Protein DataBank. This allowed us to mine for a new class of amino acid patterns, that we term FreSCOs (Frequent Spatially Cohesive Component sets), which feature synergetic combinations. To demonstrate the relevance of these FreSCOs, they were compared in relation to the thermostability of the protein structure and the interaction preferences of DNA-protein complexes. In both cases, the results matched well with prior investigations using more complex methods on smaller data sets.
The currently characterized protein structures feature a diverse set of frequent amino acid patterns that can be related to the stability of the protein molecular structure and that are independent from protein function or specific conserved domains.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-015-0038-4) contains supplementary material, which is available to authorized users.
PMCID: PMC4318390  PMID: 25657820
Protein structure; Frequent pattern mining; Thermostability; Protein-DNA complexes
5.  Microarray enriched gene rank 
BioData Mining  2015;8:2.
We develop a new concept that reflects how genes are connected based on microarray data using the coefficient of determination (the squared Pearson correlation coefficient). Our gene rank combines a priori knowledge about gene connectivity, say, from the Gene Ontology (GO) database, and the microarray expression data at hand, called the microarray enriched gene rank, or simply gene rank (GR). GR, similarly to Google PageRank, is defined in a recursive fashion and is computed as the left maximum eigenvector of a stochastic matrix derived from microarray expression data. An efficient algorithm is devised that allows computation of GR for 50 thousand genes with 500 samples within minutes on a personal computer using the public domain statistical package R.
Computation of GR is illustrated with several microarray data sets. In particular, we apply GR (1) to answer whether bad genes are more connected than good genes in relation with cancer patient survival, (2) to associate gene connectivity with cluster/subtypes in ovarian cancer tumors, and to determine whether gene connectivity changes (3) from organ to organ within the same organism and (4) between organisms.
We have shown by examples that findings based on GR confirm biological expectations. GR may be used for hypothesis generation on gene pathways. It may be used for a homogeneous sample or for comparison of gene connectivity among cases and controls, or in longitudinal setting.
PMCID: PMC4305247  PMID: 25649242
6.  Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks 
BioData Mining  2015;8:1.
Genetic studies are increasingly based on short noisy next generation scanners. Typically complete DNA sequences are assembled by matching short NextGen sequences against reference genomes. Despite considerable algorithmic gains since the turn of the millennium, matching both single ended and paired end strings to a reference remains computationally demanding. Further tailoring Bioinformatics tools to each new task or scanner remains highly skilled and labour intensive. With this in mind, we recently demonstrated a genetic programming based automated technique which generated a version of the state-of-the-art alignment tool Bowtie2 which was considerably faster on short sequences produced by a scanner at the Broad Institute and released as part of The Thousand Genome Project.
Bowtie2 GP and the original Bowtie2 release were compared on bioplanet’s GCAT synthetic benchmarks. Bowtie2 GP enhancements were also applied to the latest Bowtie2 release (2.2.3, 29 May 2014) and retained both the GP and the manually introduced improvements.
On both singled ended and paired-end synthetic next generation DNA sequence GCAT benchmarks Bowtie2GP runs up to 45% faster than Bowtie2. The lost in accuracy can be as little as 0.2–0.5% but up to 2.5% for longer sequences.
PMCID: PMC4304608  PMID: 25621011
Double-ended DNA sequence; High throughput Solexa 454 nextgen NGS sequence query; Rapid fuzzy string matching; Homo sapiens genome reference consortium HG19
7.  Linked vaccine adverse event data from VAERS for biomedical data analysis and longitudinal studies 
BioData Mining  2014;7:36.
Vaccines have been one of the most successful public health interventions to date. The use of vaccination, however, sometimes comes with possible adverse events. The U.S. FDA/CDC Vaccine Adverse Event Reporting System (VAERS) currently contains more than 200,000 reports for post-vaccination events that occur after the administration of vaccines licensed in the United States. Although the data from the VAERS has been applied to many public health and vaccine safety studies, each individual report does not necessarily indicate a casuality relationship between the vaccine and the reported symptoms. Further statistical analysis and summarization needs to be done before this data can be leveraged.
This paper introduces our efforts on representing the vaccine-symptom correlations and their corresponding meta-information extracted from the VAERS database using Resource Description Framework (RDF). Numbers of occurrences of vaccine-symptom pairs reported to the VAERS were summarized with corresponding proportional reporting ratios (PRR) calculated. All the data was stored in an RDF file. We then applied network analysis approaches to the RDF data to illustrate a use case of the data for longititual studies. We further dicussed our vision on integrating the data with vaccine information from other sources using RDF linked approach to facilitate more comprehensive analyses.
The 1990–2013 data from VAERS has been extracted from the VAERS database. There are 83,148 unique vaccine-symptom pairs with 75 vaccine types and 5,865 different reported symptoms. The yearly and over PRR values for each reported vaccine-symptom pair were calculated. The network properties of networks consisting of significant vaccine-symptom associations (i.e., PRR larger than 1) were then investigated. The results indicated that vaccine-symptom association network is a dense network, with any given node connected to all other nodes through an average of approximately two other nodes and a maximum of five nodes.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-014-0036-y) contains supplementary material, which is available to authorized users.
PMCID: PMC4333877
8.  Identifying genetic interactions associated with late-onset Alzheimer’s disease 
BioData Mining  2014;7:35.
Identifying genetic interactions in data obtained from genome-wide association studies (GWASs) can help in understanding the genetic basis of complex diseases. The large number of single nucleotide polymorphisms (SNPs) in GWASs however makes the identification of genetic interactions computationally challenging. We developed the Bayesian Combinatorial Method (BCM) that can identify pairs of SNPs that in combination have high statistical association with disease.
We applied BCM to two late-onset Alzheimer’s disease (LOAD) GWAS datasets to identify SNPs that interact with known Alzheimer associated SNPs. We also compared BCM with logistic regression that is implemented in PLINK. Gene Ontology analysis of genes from the top 200 dataset SNPs for both GWAS datasets showed overrepresentation of LOAD-related terms. Four genes were common to both datasets: APOE and APOC1, which have well established associations with LOAD, and CAMK1D and FBXL13, not previously linked to LOAD but having evidence of involvement in LOAD. Supporting evidence was also found for additional genes from the top 30 dataset SNPs.
BCM performed well in identifying several SNPs having evidence of involvement in the pathogenesis of LOAD that would not have been identified by univariate analysis due to small main effect. These results provide support for applying BCM to identify potential genetic variants such as SNPs from high dimensional GWAS datasets.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-014-0035-z) contains supplementary material, which is available to authorized users.
PMCID: PMC4300162  PMID: 25649863
Genome-wide association study; Epistasis; Alzheimer’s disease; Bayesian networks
9.  Synthetic learning machines 
BioData Mining  2014;7:28.
Using a collection of different terminal nodesize constructed random forests, each generating a synthetic feature, a synthetic random forest is defined as a kind of hyperforest, calculated using the new input synthetic features, along with the original features.
Using a large collection of regression and multiclass datasets we show that synthetic random forests outperforms both conventional random forests and the optimized forest from the regresssion portfolio.
Synthetic forests removes the need for tuning random forests with no additional effort on the part of the researcher. Importantly, the synthetic forest does this with evidently no loss in prediction compared to a well-optimized single random forest.
PMCID: PMC4279689  PMID: 25614764
Machine; Nodesize; Random forest; Trees; Synthetic feature
10.  Integrative genomics and transcriptomics analysis of human embryonic and induced pluripotent stem cells 
BioData Mining  2014;7:32.
Human genomic variations, including single nucleotide polymorphisms (SNPs) and copy number variations (CNVs), are associated with several phenotypic traits varying from mild features to hereditary diseases. Several genome-wide studies have reported genomic variants that correlate with gene expression levels in various tissue and cell types.
We studied human embryonic stem cells (hESCs) and human induced pluripotent stem cells (hiPSCs) measuring the SNPs and CNVs with Affymetrix SNP 6 microarrays and expression values with Affymetrix Exon microarrays. We computed the linear relationships between SNPs and expression levels of exons, transcripts and genes, and the associations between gene CNVs and gene expression levels. Further, for a few of the resulted genes, the expression value was associated with both CNVs and SNPs. Our results revealed altogether 217 genes and 584 SNPs whose genomic alterations affect the transcriptome in the same cells. We analyzed the enriched pathways and gene ontologies within these groups of genes, and found out that the terms related to alternative splicing and development were enriched.
Our results revealed that in the human pluripotent stem cells, the expression values of several genes, transcripts and exons were affected due to the genomic variation.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-014-0032-2) contains supplementary material, which is available to authorized users.
PMCID: PMC4298950  PMID: 25649046
hESC; hiPSC; Association analysis; SNP; CNV; Gene expression; Exon expression; Transcript expression
12.  Motif mining based on network space compression 
BioData Mining  2014;8:29.
A network motif is a recurring subnetwork within a network, and it takes on certain functions in practical biological macromolecule applications. Previous algorithms have focused on the computational efficiency of network motif detection, but some problems in storage space and searching time manifested during earlier studies. The considerable computational and spacial complexity also presents a significant challenge. In this paper, we provide a new approach for motif mining based on compressing the searching space. According to the characteristic of the parity nodes, we cut down the searching space and storage space in real graphs and random graphs, thereby reducing the computational cost of verifying the isomorphism of sub-graphs. We obtain a new network with smaller size after removing parity nodes and the “repeated edges” connected with the parity nodes. Random graph structure and sub-graph searching are based on the Back Tracking Method; all sub-graphs can be searched for by adding edges progressively. Experimental results show that this algorithm has higher speed and better stability than its alternatives.
PMCID: PMC4269098  PMID: 25525470
Associated matrix; Space compression; Sub-graph mark; Parity nodes
13.  Accurate prediction of major histocompatibility complex class II epitopes by sparse representation via ℓ 1-minimization 
BioData Mining  2014;7:23.
The major histocompatibility complex (MHC) is responsible for presenting antigens (epitopes) on the surface of antigen-presenting cells (APCs). When pathogen-derived epitopes are presented by MHC class II on an APC surface, T cells may be able to trigger an specific immune response. Prediction of MHC-II epitopes is particularly challenging because the open binding cleft of the MHC-II molecule allows epitopes to bind beyond the peptide binding groove; therefore, the molecule is capable of accommodating peptides of variable length. Among the methods proposed to predict MHC-II epitopes, artificial neural networks (ANNs) and support vector machines (SVMs) are the most effective methods. We propose a novel classification algorithm to predict MHC-II called sparse representation via ℓ1-minimization.
We obtained a collection of experimentally confirmed MHC-II epitopes from the Immune Epitope Database and Analysis Resource (IEDB) and applied our ℓ1-minimization algorithm. To benchmark the performance of our proposed algorithm, we compared our predictions against a SVM classifier. We measured sensitivity, specificity abd accuracy; then we used Receiver Operating Characteristic (ROC) analysis to evaluate the performance of our method. The prediction performance of MHC-II epitopes of the ℓ1-minimization algorithm was generally comparable and, in some cases, superior to the standard SVM classification method and overcame the lack of robustness of other methods with respect to outliers. While our method consistently favoured DPPS encoding with the alleles tested, SVM showed a slightly better accuracy when “11-factor” encoding was used.
ℓ1-minimization has similar accuracy than SVM, and has additional advantages, such as overcoming the lack of robustness with respect to outliers. With ℓ1-minimization no model selection dependency is involved.
PMCID: PMC4225598  PMID: 25392716
Peptide binding; Sparse representation; MHC class II; Epitope prediction; Immunoinformatics; Machine learning; Classification algorithms
14.  Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends 
BioData Mining  2014;7:22.
The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so called “big data” challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital those big data solutions are multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data.
The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphical processing unit (GPU)), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage resulting in reliable data processing by replicating the computing tasks, and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation.
In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools. This paper is concluded by summarizing the potential usage of the MapReduce programming framework and Hadoop platform to process huge volumes of clinical data in medical health informatics related fields.
PMCID: PMC4224309  PMID: 25383096
MapReduce; Hadoop; Big data; Clinical big data analysis; Clinical data analysis; Bioinformatics; Distributed programming
15.  CardioGxE, a catalog of gene-environment interactions for cardiometabolic traits 
BioData Mining  2014;7:21.
Genetic understanding of complex traits has developed immensely over the past decade but remains hampered by incomplete descriptions of contribution to phenotypic variance. Gene-environment (GxE) interactions are one of these contributors and in the guise of diet and physical activity are important modulators of cardiometabolic phenotypes and ensuing diseases.
We mined the scientific literature to collect GxE interactions from 386 publications for blood lipids, glycemic traits, obesity anthropometrics, vascular measures, inflammation and metabolic syndrome, and introduce CardioGxE, a gene-environment interaction resource. We then analyzed the genes and SNPs supporting cardiometabolic GxEs in order to demonstrate utility of GxE SNPs and to discern characteristics of these important genetic variants. We were able to draw many observations from our extensive analysis of GxEs. 1) The CardioGxE SNPs showed little overlap with variants identified by main effect GWAS, indicating the importance of environmental interactions with genetic factors on cardiometabolic traits. 2) These GxE SNPs were enriched in adaptation to climatic and geographical features, with implications on energy homeostasis and response to physical activity. 3) Comparison to gene networks responding to plasma cholesterol-lowering or regression of atherosclerotic plaques showed that GxE genes have a greater role in those responses, particularly through high-energy diets and fat intake, than do GWAS-identified genes for the same traits. Other aspects of the CardioGxE dataset were explored.
Overall, we demonstrate that SNPs supporting cardiometabolic GxE interactions often exhibit transcriptional effects or are under positive selection. Still, not all such SNPs can be assigned potential functional or regulatory roles often because data are lacking in specific cell types or from treatments that approximate the environmental factor of the GxE. With research on metabolic related complex disease risk embarking on genome-wide GxE interaction tests, CardioGxE will be a useful resource.
PMCID: PMC4217104  PMID: 25368670
Cardiovascular diseases; Diet; Gene-environment interaction; Genetic variants; Phenotypic variance; Physical activity; Type 2 diabetes
16.  Knowledge-driven genomic interactions: an application in ovarian cancer 
BioData Mining  2014;7:20.
Effective cancer clinical outcome prediction for understanding of the mechanism of various types of cancer has been pursued using molecular-based data such as gene expression profiles, an approach that has promise for providing better diagnostics and supporting further therapies. However, clinical outcome prediction based on gene expression profiles varies between independent data sets. Further, single-gene expression outcome prediction is limited for cancer evaluation since genes do not act in isolation, but rather interact with other genes in complex signaling or regulatory networks. In addition, since pathways are more likely to co-operate together, it would be desirable to incorporate expert knowledge to combine pathways in a useful and informative manner.
Thus, we propose a novel approach for identifying knowledge-driven genomic interactions and applying it to discover models associated with cancer clinical phenotypes using grammatical evolution neural networks (GENN). In order to demonstrate the utility of the proposed approach, an ovarian cancer data from the Cancer Genome Atlas (TCGA) was used for predicting clinical stage as a pilot project.
We identified knowledge-driven genomic interactions associated with cancer stage from single knowledge bases such as sources of pathway-pathway interaction, but also knowledge-driven genomic interactions across different sets of knowledge bases such as pathway-protein family interactions by integrating different types of information. Notably, an integration model from different sources of biological knowledge achieved 78.82% balanced accuracy and outperformed the top models with gene expression or single knowledge-based data types alone. Furthermore, the results from the models are more interpretable because they are framed in the context of specific biological pathways or other expert knowledge.
The success of the pilot study we have presented herein will allow us to pursue further identification of models predictive of clinical cancer survival and recurrence. Understanding the underlying tumorigenesis and progression in ovarian cancer through the global view of interactions within/between different biological knowledge sources has the potential for providing more effective screening strategies and therapeutic targets for many types of cancer.
PMCID: PMC4161273  PMID: 25214892
Knowledge-driven genomic interaction; Integrative analysis; Grammatical evolution neural network; Clinical outcome prediction; Ovarian cancer
17.  The genetic interacting landscape of 63 candidate genes in Major Depressive Disorder: an explorative study 
BioData Mining  2014;7:19.
Genetic contributions to major depressive disorder (MDD) are thought to result from multiple genes interacting with each other. Different procedures have been proposed to detect such interactions. Which approach is best for explaining the risk of developing disease is unclear.
This study sought to elucidate the genetic interaction landscape in candidate genes for MDD by conducting a SNP-SNP interaction analysis using an exhaustive search through 3,704 SNP-markers in 1,732 cases and 1,783 controls provided from the GAIN MDD study. We used three different methods to detect interactions, two logistic regressions models (multiplicative and additive) and one data mining and machine learning (MDR) approach.
Although none of the interaction survived correction for multiple comparisons, the results provide important information for future genetic interaction studies in complex disorders. Among the 0.5% most significant observations, none had been reported previously for risk to MDD. Within this group of interactions, less than 0.03% would have been detectable based on main effect approach or an a priori algorithm. We evaluated correlations among the three different models and conclude that all three algorithms detected the same interactions to a low degree. Although the top interactions had a surprisingly large effect size for MDD (e.g. additive dominant model Puncorrected = 9.10E-9 with attributable proportion (AP) value = 0.58 and multiplicative recessive model with Puncorrected = 6.95E-5 with odds ratio (OR estimated from β3) value = 4.99) the area under the curve (AUC) estimates were low (< 0.54). Moreover, the population attributable fraction (PAF) estimates were also low (< 0.15).
We conclude that the top interactions on their own did not explain much of the genetic variance of MDD. The different statistical interaction methods we used in the present study did not identify the same pairs of interacting markers. Genetic interaction studies may uncover previously unsuspected effects that could provide novel insights into MDD risk, but much larger sample sizes are needed before this strategy can be powerfully applied.
PMCID: PMC4181757  PMID: 25279001
Additive interaction; Multiplicative interaction; Logistic regression; Data mining and machine learning; Major depressive disorder
18.  ExpressionData - A public resource of high quality curated datasets representing gene expression across anatomy, development and experimental conditions 
BioData Mining  2014;7:18.
Reference datasets are often used to compare, interpret or validate experimental data and analytical methods. In the field of gene expression, several reference datasets have been published. Typically, they consist of individual baseline or spike-in experiments carried out in a single laboratory and representing a particular set of conditions.
Here, we describe a new type of standardized datasets representative for the spatial and temporal dimensions of gene expression. They result from integrating expression data from a large number of globally normalized and quality controlled public experiments. Expression data is aggregated by anatomical part or stage of development to yield a representative transcriptome for each category. For example, we created a genome-wide expression dataset representing the FDA tissue panel across 35 tissue types. The proposed datasets were created for human and several model organisms and are publicly available at
PMCID: PMC4165432  PMID: 25228922
19.  Computational genetics analysis of grey matter density in Alzheimer’s disease 
BioData Mining  2014;7:17.
Alzheimer’s disease is the most common form of progressive dementia and there is currently no known cure. The cause of onset is not fully understood but genetic factors are expected to play a significant role. We present here a bioinformatics approach to the genetic analysis of grey matter density as an endophenotype for late onset Alzheimer’s disease. Our approach combines machine learning analysis of gene-gene interactions with large-scale functional genomics data for assessing biological relationships.
We found a statistically significant synergistic interaction among two SNPs located in the intergenic region of an olfactory gene cluster. This model did not replicate in an independent dataset. However, genes in this region have high-confidence biological relationships and are consistent with previous findings implicating sensory processes in Alzheimer’s disease.
Previous genetic studies of Alzheimer’s disease have revealed only a small portion of the overall variability due to DNA sequence differences. Some of this missing heritability is likely due to complex gene-gene and gene-environment interactions. We have introduced here a novel bioinformatics analysis pipeline that embraces the complexity of the genetic architecture of Alzheimer’s disease while at the same time harnessing the power of functional genomics. These findings represent novel hypotheses about the genetic basis of this complex disease and provide open-access methods that others can use in their own studies.
PMCID: PMC4145360  PMID: 25165488
20.  Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets 
BioData Mining  2014;7:16.
Advances in genomic technologies have enabled the accumulation of vast amount of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression data, which suffers from spurious coexpression.
We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on Human gene expression datasets show that the reported modules are functionally homogeneous as evident by their enrichment with biological process GO terms and KEGG pathways.
PMCID: PMC4151083  PMID: 25221624
Coexpression; Biological networks; Mining; Frequent subnetworks; Hybrid similarity
21.  An iteration normalization and test method for differential expression analysis of RNA-seq data 
BioData Mining  2014;7:15.
Next generation sequencing technologies are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key to analyzing massive and complex sequencing data. In order to derive gene expression measures and compare these measures across samples or libraries, we first need to normalize read counts to adjust for varying sample sequencing depths and other potentially technical effects.
In this paper, we develop a normalization method based on iterating median of M-values (IMM) for detecting the differentially expressed (DE) genes. Compared to a previous approach TMM, the IMM method improves the accuracy of DE detection. Simulation studies show that the IMM method outperforms other methods for the sample normalization. We also look into the real data and find that the genes detected by IMM but not by TMM are much more accurate than the genes detected by TMM but not by IMM. What’s more, we discovered that gene UNC5C is highly associated with kidney cancer and so on.
PMCID: PMC4181730  PMID: 25285156
RNA-seq; Normalize; Expression level; TMM; IMM
22.  A simple structure-based model for the prediction of HIV-1 co-receptor tropism 
BioData Mining  2014;7:14.
Human Immunodeficiency Virus 1 enters host cells through interaction of its V3 loop (which is part of the gp120 protein) with the host cell receptor CD4 and one of two co-receptors, namely CCR5 or CXCR4. Entry inhibitors binding the CCR5 co-receptor can prevent viral entry. As these drugs are only available for CCR5-using viruses, accurate prediction of this so-called co-receptor tropism is important in order to ensure an effective personalized therapy. With the development of next-generation sequencing technologies, it is now possible to sequence representative subpopulations of the viral quasispecies.
Here we present T-CUP 2.0, a model for predicting co-receptor tropism. Based on our recently published T-CUP model, we developed a more accurate and even faster solution. Similarly to its predecessor, T-CUP 2.0 models co-receptor tropism using information of the electrostatic potential and hydrophobicity of V3-loops. However, extracting this information from a simplified structural vacuum-model leads to more accurate and faster predictions. The area-under-the-ROC-curve (AUC) achieved with T-CUP 2.0 on the training set is 0.968±0.005 in a leave-one-patient-out cross-validation. When applied to an independent dataset, T-CUP 2.0 has an improved prediction accuracy of around 3% when compared to the original T-CUP.
We found that it is possible to model co-receptor tropism in HIV-1 based on a simplified structure-based model of the V3 loop. In this way, genotypic prediction of co-receptor tropism is very accurate, fast and can be applied to large datasets derived from next-generation sequencing technologies. The reduced complexity of the electrostatic modeling makes T-CUP 2.0 independent from third-party software, making it easy to install and use.
PMCID: PMC4124776  PMID: 25120583
23.  First complex, then simple 
BioData Mining  2014;7:13.
PMCID: PMC4115213  PMID: 25076985
24.  Innovation is often unnerving: the door into summer 
BioData Mining  2014;7:12.
PMCID: PMC4107591  PMID: 25057294
25.  Hypothesis exploration with visualization of variance 
BioData Mining  2014;7:11.
The Consortium for Neuropsychiatric Phenomics (CNP) at UCLA was an investigation into the biological bases of traits such as memory and response inhibition phenotypes—to explore whether they are linked to syndromes including ADHD, Bipolar disorder, and Schizophrenia. An aim of the consortium was in moving from traditional categorical approaches for psychiatric syndromes towards more quantitative approaches based on large-scale analysis of the space of human variation. It represented an application of phenomics—wide-scale, systematic study of phenotypes—to neuropsychiatry research.
This paper reports on a system for exploration of hypotheses in data obtained from the LA2K, LA3C, and LA5C studies in CNP. ViVA is a system for exploratory data analysis using novel mathematical models and methods for visualization of variance. An example of these methods is called VISOVA, a combination of visualization and analysis of variance, with the flavor of exploration associated with ANOVA in biomedical hypothesis generation. It permits visual identification of phenotype profiles—patterns of values across phenotypes—that characterize groups. Visualization enables screening and refinement of hypotheses about variance structure of sets of phenotypes.
The ViVA system was designed for exploration of neuropsychiatric hypotheses by interdisciplinary teams. Automated visualization in ViVA supports ‘natural selection’ on a pool of hypotheses, and permits deeper understanding of the statistical architecture of the data. Large-scale perspective of this kind could lead to better neuropsychiatric diagnostics.
PMCID: PMC4114111  PMID: 25097666
Methodologies; Hypothesis generation and refinement; Visualization; EDA; ANOVA; Covariance structure

Results 1-25 (139)