Search tips
Search criteria

Results 1-25 (66)

Clipboard (0)

Select a Filter Below

Year of Publication
Document Types
1.  A Novel Scoring Based Distributed Protein Docking Application to Improve Enrichment 
Molecular docking is a computational technique which predicts the binding energy and the preferred binding mode of a ligand to a protein target. Virtual screening is a tool which uses docking to investigate large chemical libraries to identify ligands that bind favorably to a protein target.
We have developed a novel scoring based distributed protein docking application to improve enrichment in virtual screening. The application addresses the issue of time and cost of screening in contrast to conventional systematic parallel virtual screening methods in two ways. Firstly, it automates the process of creating and launching multiple independent dockings on a high performance computing cluster. Secondly, it uses a Nȧive Bayes scoring function to calculate binding energy of un-docked ligands to identify and preferentially dock (Autodock predicted) better binders.
The application was tested on four proteins using a library of 10,573 ligands. In all the experiments, (i). 200 of the 1000 best binders are identified after docking only ∼ 14% of the chemical library, (ii). 9 or 10 best-binders are identified after docking only ∼ 19% of the chemical library, and (iii). no significant enrichment is observed after docking ∼ 70% of the chemical library. The results show significant increase in enrichment of potential drug leads in early rounds of virtual screening.
PMCID: PMC4784258  PMID: 26671816
Virtual Screening; High Performance Computing; Distributed Protein Docking; HTCondor; Nȧive Bayes; Scoring Function
2.  Functional Module Analysis for Gene Coexpression Networks with Network Integration 
Network has been a general tool for studying the complex interactions between different genes, proteins and other small molecules. Module as a fundamental property of many biological networks has been widely studied and many computational methods have been proposed to identify the modules in an individual network. However, in many cases a single network is insufficient for module analysis due to the noise in the data or the tuning of parameters when building the biological network. The availability of a large amount of biological networks makes network integration study possible. By integrating such networks, more informative modules for some specific disease can be derived from the networks constructed from different tissues, and consistent factors for different diseases can be inferred.
In this paper, we have developed an effective method for module identification from multiple networks under different conditions. The problem is formulated as an optimization model, which combines the module identification in each individual network and alignment of the modules from different networks together. An approximation algorithm based on eigenvector computation is proposed. Our method outperforms the existing methods, especially when the underlying modules in multiple networks are different in simulation studies. We also applied our method to two groups of gene coexpression networks for humans, which include one for three different cancers, and one for three tissues from the morbidly obese patients. We identified 13 modules with 3 complete subgraphs, and 11 modules with 2 complete subgraphs, respectively. The modules were validated through Gene Ontology enrichment and KEGG pathway enrichment analysis. We also showed that the main functions of most modules for the corresponding disease have been addressed by other researchers, which may provide the theoretical basis for further studying the modules experimentally.
PMCID: PMC4664071  PMID: 26451826
functional module identification; gene coexpression networks; network integration; spectral clustering
3.  An independent filter for gene set testing based on spectral enrichment 
Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
PMCID: PMC4666312  PMID: 26451820
Gene set testing; gene set enrichment; screening-testing; principal component analysis; random matrix theory; Tracy-Widom
4.  Multivariate Hypergeometric Similarity Measure 
We propose a similarity measure based on the multivariate hypergeometric distribution for the pairwise comparison of images and data vectors. The formulation and performance of the proposed measure are compared with other similarity measures using synthetic data. A method of piecewise approximation is also implemented to facilitate application of the proposed measure to large samples. Example applications of the proposed similarity measure are presented using mass spectrometry imaging data and gene expression microarray data. Results from synthetic and biological data indicate that the proposed measure is capable of providing meaningful discrimination between samples, and that it can be a useful tool for identifying potentially related samples in large-scale biological data sets.
PMCID: PMC4983430  PMID: 24407308
Similarity measures; contingency tables; multivariate statistics; biology and genetics; chemistry
5.  Bayesian Normalization Model for Label-Free Quantitative Analysis by LC-MS 
We introduce a new method for normalization of data acquired by liquid chromatography coupled with mass spectrometry (LC-MS) in label-free differential expression analysis. Normalization of LC-MS data is desired prior to subsequent statistical analysis to adjust variabilities in ion intensities that are not caused by biological differences but experimental bias. There are different sources of bias including variabilities during sample collection and sample storage, poor experimental design, noise, etc. In addition, instrument variability in experiments involving a large number of LC-MS runs leads to a significant drift in intensity measurements. Although various methods have been proposed for normalization of LC-MS data, there is no universally applicable approach. In this paper, we propose a Bayesian normalization model (BNM) that utilizes scan-level information from LC-MS data. Specifically, the proposed method uses peak shapes to model the scan-level data acquired from extracted ion chromatograms (EIC) with parameters considered as a linear mixed effects model. We extended the model into BNM with drift (BNMD) to compensate for the variability in intensity measurements due to long LC-MS runs. We evaluated the performance of our method using synthetic and experimental data. In comparison with several existing methods, the proposed BNM and BNMD yielded significant improvement.
PMCID: PMC4838204  PMID: 26357332
Liquid chromatography; mass spectrometry; normalization; bayesian hierarchical model
6.  Unfold high-dimensional clouds for exhaustive gating of flow cytometry data 
Flow cytometry is able to measure the expressions of multiple proteins simultaneously at the single-cell level. A flow cytometry experiment on one biological sample provides measurements of several protein markers on or inside a large number of individual cells in that sample. Analysis of such data often aims to identify subpopulations of cells with distinct phenotypes. Currently, the most widely used analytical approach in the flow cytometry community is manual gating on a sequence of nested biaxial plots, which is highly subjective, labor intensive, and not exhaustive. To address those issues, a number of methods have been developed to automate the gating analysis by clustering algorithms. However, completely removing the subjectivity can be quite challenging. This paper describes an alternative approach. Instead of automating the analysis, we develop novel visualizations to facilitate manual gating. The proposed method views single-cell data of one biological sample as a high-dimensional point cloud of cells, derives the skeleton of the cloud, and unfolds the skeleton to generate 2D visualizations. We demonstrate the utility of the proposed visualization using real data, and provide quantitative comparison to visualizations generated from principal component analysis and multidimensional scaling.
PMCID: PMC4866872  PMID: 26357042
7.  MMBIRFinder: A Tool to Detect Microhomology-Mediated Break-Induced Replication 
The introduction of next-generation sequencing technologies has radically changed the way we view structural genetic events. Microhomology-mediated break-induced replication (MMBIR) is just one of the many mechanisms that can cause genomic destabilization that may lead to cancer. Although the mechanism for MMBIR remains unclear, it has been shown that MMBIR is typically associated with template-switching events. Currently, to our knowledge, there is no existing bioinformatics tool to detect these template-switching events. We have developed MMBIRFinder, a method that detects template-switching events associated with MMBIR from whole-genome sequenced data. MMBIRFinder uses a half-read alignment approach to identify potential regions of interest. Clustering of these potential regions helps narrow the search space to regions with strong evidence. Subsequent local alignments identify the template-switching events with single-nucleotide accuracy. Using simulated data, MMBIRFinder identified 83 percent of the MMBIR regions within a five nucleotide tolerance. Using real data, MMBIRFinder identified 16 MMBIR regions on a normal breast tissue data sample and 51 MMBIR regions on a triple-negative breast cancer tumor sample resulting in detection of 37 novel template-switching events. Finally, we identified template-switching events residing in the promoter region of seven genes that have been implicated in breast cancer. The program is freely available for download at
PMCID: PMC4857593  PMID: 26357319
Biology and genetics; life and medical sciences
8.  Evolutionary model selection and parameter estimation for protein-protein interaction network based on differential evolution algorithm 
Revealing the underlying evolutionary mechanism plays an important role in understanding protein interaction networks in the cell. While many evolutionary models have been proposed, the problem about applying these models to real network data, especially for differentiating which model can better describe evolutionary process for the observed network urgently remains as a challenge. The traditional way is to use a model with presumed parameters to generate a network, and then evaluate the fitness by summary statistics, which however cannot capture the complete network structures information and estimate parameter distribution.
In this work we developed a novel method based on Approximate Bayesian Computation and modified Differential Evolution (ABC-DEP) that is capable of conducting model selection and parameter estimation simultaneously and detecting the underlying evolutionary mechanisms more accurately. We tested our method for its power in differentiating models and estimating parameters on the simulated data and found significant improvement in performance benchmark, as compared with a previous method. We further applied our method to real data of protein interaction networks in human and yeast. Our results show Duplication Attachment model as the predominant evolutionary mechanism for human PPI networks and Scale-Free model as the predominant mechanism for yeast PPI networks.
PMCID: PMC4719153  PMID: 26357273
Minimax approximation and algorithms; Eigenvalues and eigenvectors; Probabilistic algorithms; Convergence
9.  A Machine Learning Approach for Accurate Annotation of Noncoding RNAs 
Searching genomes to locate noncoding RNA genes with known secondary structure is an important problem in bioinformatics. In general, the secondary structure of a searched noncoding RNA is defined with a structure model constructed from the structural alignment of a set of sequences from its family. Computing the optimal alignment between a sequence and a structure model is the core part of an algorithm that can search genomes for noncoding RNAs. In practice, a single structure model may not be sufficient to capture all crucial features important for a noncoding RNA family. In this paper, we develop a novel machine learning approach that can efficiently search genomes for noncoding RNAs with high accuracy. During the search procedure, a sequence segment in the searched genome sequence is processed and a feature vector is extracted to represent it. Based on the feature vector, a classifier is used to determine whether the sequence segment is the searched ncRNA or not. Our testing results show that this approach is able to efficiently capture crucial features of a noncoding RNA family. Compared with existing search tools, it significantly improves the accuracy of genome annotation.
PMCID: PMC4726481  PMID: 26357266
noncoding RNAs; secondary structure; genome annotation; feature vector; classifier
10.  Automatic Myonuclear Detection in Isolated Single Muscle Fibers Using Robust Ellipse Fitting and Sparse Representation 
Accurate and robust detection of myonuclei in isolated single muscle fibers is required to calculate myonuclear domain size. However, this task is challenging because: 1) shape and size variations of the nuclei, 2) overlapping nuclear clumps, and 3) multiple z-stack images with out-of-focus regions. In this paper, we have proposed a novel automatic detection algorithm to robustly quantify myonuclei in isolated single skeletal muscle fibers. The original z-stack images are first converted into one all-in-focus image using multi-focus image fusion. A sufficient number of ellipse fitting hypotheses are then generated from them yonuclei contour segments using heteroscedastic errors-invariables (HEIV) regression. A set of representative training samples and a set of discriminative features are selected by a two-stage sparse model. The selected samples with representative features are utilized to train a classifier to select the best candidates. A modified inner geodesic distance based mean-shift clustering algorithm is used to produce the final nuclei detection results. The proposed method was extensively tested using 42 sets of z-stack images containing over 1,500 myonuclei. The method demonstrates excellent results that are better than current state-of-the-art approaches.
PMCID: PMC4669954  PMID: 26356342
Robust ellipse fitting; muscle; segmentation; sparse optimization
11.  A Framework for Multiple Kernel Support Vector Regression and Its Applications to siRNA Efficacy Prediction 
The cell defense mechanism of RNA interference has applications in gene function analysis and promising potentials in human disease therapy. To effectively silence a target gene, it is desirable to select appropriate initiator siRNA molecules having satisfactory silencing capabilities. Computational prediction for silencing efficacy of siRNAs can assist this screening process before using them in biological experiments. String kernel functions, which operate directly on the string objects representing siRNAs and target mRNAs, have been applied to support vector regression for the prediction and improved accuracy over numerical kernels in multidimensional vector spaces constructed from descriptors of siRNA design rules. To fully utilize information provided by string and numerical data, we propose to unify the two in a kernel feature space by devising a multiple kernel regression framework where a linear combination of the kernels are used. We formulate the multiple kernel learning into a quadratically constrained quadratic programming (QCQP) problem, which although yields global optimal solution, is computationally demanding and requires a commercial solver package. We further propose three heuristics based on the principle of kernel–target alignment and predictive accuracy. Empirical results demonstrate that multiple kernel regression can improve accuracy, decrease model complexity by reducing the number of support vectors, and speed up computational performance dramatically. In addition, multiple kernel regression evaluates the importance of constituent kernels, which for the siRNA efficacy prediction problem, compares the relative significance of the design rules. Finally, we give insights into the multiple kernel regression mechanism and point out possible extensions.
PMCID: PMC4669230  PMID: 19407344
Multiple kernel learning; multiple kernel heuristics; support vector regression; QCQP optimization; RNA interference; siRNA efficacy
12.  Latent feature decompositions for integrative analysis of multi-platform genomic data 
Increased availability of multi-platform genomics data on matched samples has sparked research efforts to discover how diverse molecular features interact both within and between platforms. In addition, simultaneous measurements of genetic and epigenetic characteristics illuminate the roles their complex relationships play in disease progression and outcomes. However, integrative methods for diverse genomics data are faced with the challenges of ultra-high dimensionality and the existence of complex interactions both within and between platforms.
We propose a novel modeling framework for integrative analysis based on decompositions of the large number of platform-specific features into a smaller number of latent features. Subsequently we build a predictive model for clinical outcomes accounting for both within- and between-platform interactions based on Bayesian model averaging procedures. Principal components, partial least squares and non-negative matrix factorization as well as sparse counterparts of each are used to define the latent features, and the performance of these decompositions is compared both on real and simulated data. The latent feature interactions are shown to preserve interactions between the original features and not only aid prediction but also allow explicit selection of outcome-related features. The methods are motivated by and applied to, a glioblastoma multiforme dataset from The Cancer Genome Atlas to predict patient survival times integrating gene expression, microRNA, copy number and methylation data.
For the glioblastoma data, we find a high concordance between our selected prognostic genes and genes with known associations with glioblastoma. In addition, our model discovers several relevant cross-platform interactions such as copy number variation associated gene dosing and epigenetic regulation through promoter methylation. On simulated data, we show that our proposed method successfully incorporates interactions within and between genomic platforms to aid accurate prediction and variable selection. Our methods perform best when principal components are used to define the latent features.
PMCID: PMC4486317  PMID: 26146492
Latent feature; genomic data; high-dimensional; interactions; integrative models; Bayesian model averaging
13.  Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate 
Over the past few decades, discovery based on sequence homology has become a widely accepted practice. Consequently, comparative accuracy of retrieval algorithms (e.g., BLAST) has been rigorously studied for improvement. Unlike most components of retrieval algorithms, the E-value threshold criterion has yet to be thoroughly investigated. An investigation of the threshold is important as it exclusively dictates which sequences are declared relevant and irrelevant. In this paper, we introduce the false discovery rate (FDR) statistic as a replacement for the uniform threshold criterion in order to improve efficacy in retrieval systems. Using NCBI’s BLAST and PSI-BLAST software packages, we demonstrate the applicability of such a replacement in both non-iterative (BLASTFDR) and iterative (PSI-BLASTFDR) homology searches. For each application, we performed an evaluation of retrieval efficacy with five different multiple testing methods on a large training database. For each algorithm, we choose the best performing method, Benjamini-Hochberg, as the default statistic. As measured by the Threshold Average Precision, BLASTFDR yielded 14.1% better retrieval performance than BLAST on a large (5,161 queries) test database and PSI-BLASTFDR attained 11.8% better retrieval performance than PSI-BLAST. The C++ source code specific to BLASTFDR and PSI-BLASTFDR and instructions are available at
PMCID: PMC4568567  PMID: 26357264
Homology search; false discovery rate; retrieval efficacy; uniform E-value thresholding
14.  RLIMS-P 2.0: A Generalizable Rule-Based Information Extraction System for Literature Mining of Protein Phosphorylation Information 
We introduce RLIMS-P version 2.0, an enhanced rule-based information extraction (IE) system for mining kinase, substrate, and phosphorylation site information from scientific literature. Consisting of natural language processing and IE modules, the system has integrated several new features, including the capability of processing full-text articles and generalizability towards different post-translational modifications (PTMs). To evaluate the system, sets of abstracts and full-text articles, containing a variety of textual expressions, were annotated. On the abstract corpus, the system achieved F-scores of 0.91, 0.92, and 0.95 for kinases, substrates, and sites, respectively. The corresponding scores on the full-text corpus were 0.88, 0.91, and 0.92. It was additionally evaluated on the corpus of the 2013 BioNLP-ST GE task, and achieved an F-score of 0.87 for the phosphorylation core task, improving upon the results previously reported on the corpus. Full-scale processing of all abstracts in MEDLINE and all articles in PubMed Central Open Access Subset has demonstrated scalability for mining rich information in literature, enabling its adoption for biocuration and for knowledge discovery. The new system is generalizable and it will be adapted to tackle other major PTM types. RLIMS-P 2.0 online system is available online ( and the developed corpora are available from iProLINK (
PMCID: PMC4568560  PMID: 26357075
Biology and genetics; context analysis and indexing; natural language processing; text mining
15.  Outlier Analysis and Top Scoring Pair for Integrated Data Analysis and Biomarker Discovery 
Pathway deregulation has been identified as a key driver of carcinogenesis, with proteins in signaling pathways serving as primary targets for drug development. Deregulation can be driven by a number of molecular events, including gene mutation, epigenetic changes in gene promoters, overexpression, and gene amplifications or deletions. We demonstrate a novel approach that identifies pathways of interest by integrating outlier analysis within and across molecular data types with gene set analysis. We use the results to seed the top-scoring pair algorithm to identify robust biomarkers associated with pathway deregulation. We demonstrate this methodology on pediatric acute myeloid leukemia (AML) data. We develop a biomarker in primary AML tumors, demonstrate robustness with an independent primary tumor data set, and show that the identified biomarkers also function well in relapsed pediatric AML tumors.
PMCID: PMC4156935  PMID: 24277953
Genomics; Data integration; Statistical analysis; Biomarker
16.  Fast Flexible Modeling of RNA Structure Using Internal Coordinates 
Modeling the structure and dynamics of large macromolecules remains a critical challenge. Molecular dynamics (MD) simulations are expensive because they model every atom independently, and are difficult to combine with experimentally derived knowledge. Assembly of molecules using fragments from libraries relies on the database of known structures and thus may not work for novel motifs. Coarse-grained modeling methods have yielded good results on large molecules but can suffer from difficulties in creating more detailed full atomic realizations. There is therefore a need for molecular modeling algorithms that remain chemically accurate and economical for large molecules, do not rely on fragment libraries, and can incorporate experimental information. RNABuilder works in the internal coordinate space of dihedral angles and thus has time requirements proportional to the number of moving parts rather than the number of atoms. It provides accurate physics-based response to applied forces, but also allows user-specified forces for incorporating experimental information. A particular strength of RNABuilder is that all Leontis-Westhof basepairs can be specified as primitives by the user to be satisfied during model construction. We apply RNABuilder to predict the structure of an RNA molecule with 160 bases from its secondary structure, as well as experimental information. Our model matches the known structure to 10.2 Angstroms RMSD and has low computational expense.
PMCID: PMC4428339  PMID: 21778523
Internal coordinate mechanics; molecular; structure; dynamics; RNA; modeling; prediction; linear; scaling
17.  Integration of network biology and imaging to study cancer phenotypes and responses 
Ever growing “omics” data and continuously accumulated biological knowledge provide an unprecedented opportunity to identify molecular biomarkers and their interactions that are responsible for cancer phenotypes that can be accurately defined by clinical measurements such as in vivo imaging. Since signaling or regulatory networks are dynamic and context-specific, systematic efforts to characterize such structural alterations must effectively distinguish significant network rewiring from random background fluctuations. Here we introduced a novel integration of network biology and imaging to study cancer phenotypes and responses to treatments at the molecular systems level. Specifically, Differential Dependence Network (DDN) analysis was used to detect statistically significant topological rewiring in molecular networks between two phenotypic conditions, and in vivo Magnetic Resonance Imaging (MRI) was used to more accurately define phenotypic sample groups for such differential analysis. We applied DDN to analyze two distinct phenotypic groups of breast cancer and study how genomic instability affects the molecular network topologies in high-grade ovarian cancer. Further, FDA-approved arsenic trioxide (ATO) and the ND2-SmoA1 mouse model of Medulloblastoma (MB) were used to extend our analyses of combined MRI and Reverse Phase Protein Microarray (RPMA) data to assess tumor responses to ATO and to uncover the complexity of therapeutic molecular biology.
PMCID: PMC4348060  PMID: 25750594
Network biology; MRI; differential network; cancer biology
18.  A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction 
Ab initio protein secondary structure (SS) predictions are utilized to generate tertiary structure predictions, which are increasingly demanded due to the rapid discovery of proteins. Although recent developments have slightly exceeded previous methods of SS prediction, accuracy has stagnated around 80% and many wonder if prediction cannot be advanced beyond this ceiling. Disciplines that have traditionally employed neural networks are experimenting with novel deep learning techniques in attempts to stimulate progress. Since neural networks have historically played an important role in SS prediction, we wanted to determine whether deep learning could contribute to the advancement of this field as well. We developed an SS predictor that makes use of the position-specific scoring matrix generated by PSI-BLAST and deep learning network architectures, which we call DNSS. Graphical processing units and CUDA software optimize the deep network architecture and efficiently train the deep networks. Optimal parameters for the training process were determined, and a workflow comprising three separately trained deep networks was constructed in order to make refined predictions. This deep learning network approach was used to predict SS for a fully independent test data set of 198 proteins, achieving a Q3 accuracy of 80.7% and a Sov accuracy of 74.2%.
PMCID: PMC4348072  PMID: 25750595
Machine Learning; Neural Nets; Protein Structure Prediction; Deep Learning
19.  Novel multi-sample scheme for inferring phylogenetic markers from whole genome tumor profiles 
Computational cancer phylogenetics seeks to enumerate the temporal sequences of aberrations in tumor evolution, thereby delineating the evolution of possible tumor progression pathways, molecular subtypes and mechanisms of action. We previously developed a pipeline for constructing phylogenies describing evolution between major recurring cell types computationally inferred from whole-genome tumor profiles. The accuracy and detail of the phylogenies, however, depends on the identification of accurate, high-resolution molecular markers of progression, i.e., reproducible regions of aberration that robustly differentiate different subtypes and stages of progression. Here we present a novel hidden Markov model (HMM) scheme for the problem of inferring such phylogenetically significant markers through joint segmentation and calling of multi-sample tumor data. Our method classifies sets of genome-wide DNA copy number measurements into a partitioning of samples into normal (diploid) or amplified at each probe. It differs from other similar HMM methods in its design specifically for the needs of tumor phylogenetics, by seeking to identify robust markers of progression conserved across a set of copy number profiles. We show an analysis of our method in comparison to other methods on both synthetic and real tumor data, which confirms its effectiveness for tumor phylogeny inference and suggests avenues for future advances.
PMCID: PMC3830698  PMID: 24407301
Biology and genetics; Health; Trees; Segmentation
20.  A Statistical Change Point Model Approach for the Detection of DNA Copy Number Variations in Array CGH Data 
Array comparative genomic hybridization (aCGH) provides a high-resolution and high-throughput technique for screening of copy number variations (CNVs) within the entire genome. This technique, compared to the conventional CGH, significantly improves the identification of chromosomal abnormalities. However, due to the random noise inherited in the imaging and hybridization process, identifying statistically significant DNA copy number changes in aCGH data is challenging. We propose a novel approach that uses the mean and variance change point model (MVCM) to detect CNVs or breakpoints in aCGH data sets. We derive an approximate p-value for the test statistic and also give the estimate of the locus of the DNA copy number change. We carry out simulation studies to evaluate the accuracy of the estimate and the p-value formulation. These simulation results show that the approach is effective in identifying copy number changes. The approach is also tested on fibroblast cancer cell line data, breast tumor cell line data, and breast cancer cell line aCGH data sets that are publicly available. Changes that have not been identified by the circular binary segmentation (CBS) method but are biologically verified are detected by our approach on these cell lines with higher sensitivity and specificity than CBS.
PMCID: PMC4154476  PMID: 19875853
Statistical hypothesis testing; aCGH microarray data; gene expression; DNA copy numbers; CNVs
21.  Genome-Guided Transcriptome Assembly in the Age of Next-Generation Sequencing 
Next-generation sequencing technologies provide unprecedented power to explore the repertoire of genes and their alternative splice variants, collectively defining the transcriptome of a species in great detail. However, assembling the short reads into full-length gene and transcript models presents significant computational challenges. We review current algorithms for assembling transcripts and genes from next-generation sequencing reads aligned to a reference genome, and lay out areas for future improvements.
PMCID: PMC4086730  PMID: 24524156
Algorithms; biology and genetics; computer applications; medicine and science
22.  Coalescent-based Method for Learning Parameters of Admixture Events from Large-Scale Genetic Variation Data 
Detecting and quantifying the timing and the genetic contributions of parental populations to a hybrid population is an important but challenging problem in reconstructing evolutionary histories from genetic variation data. With the advent of high throughput genotyping technologies, new methods suitable for large-scale data are especially needed. Furthermore, existing methods typically assume the assignment of individuals into subpopulations is known, when that itself is a difficult problem often unresolved for real data. Here we propose a novel method that combines prior work for inferring non-reticulate population structures with an MCMC scheme for sampling over admixture scenarios to both identify population assignments and learn divergence times and admixture proportions for those populations using genome-scale admixed genetic variation data. We validated our method using coalescent simulations and a collection of real bovine and human variation data. On simulated sequences, our methods show better accuracy and faster runtime than leading competitive methods in estimating admixture fractions and divergence times. Analysis on the real data further shows our methods to be effective at matching our best current knowledge about the relevant populations.
PMCID: PMC4019315  PMID: 23959633
J.3.a Biology and genetics; E.1.d Graphs and networks; H.1.1.b Information theory; F.2.2.b Computations on discrete structures
23.  Intensity-Based Skeletonization of CryoEM Gray-Scale Images Using a True Segmentation-Free Algorithm 
Cryo-electron microscopy is an experimental technique that is able to produce 3D gray-scale images of protein molecules. In contrast to other experimental techniques, cryo-electron microscopy is capable of visualizing large molecular complexes such as viruses and ribosomes. At medium resolution, the positions of the atoms are not visible and the process cannot proceed. The medium-resolution images produced by cryo-electron microscopy are used to derive the atomic structure of the proteins in de novo modeling. The skeletons of the 3D gray-scale images are used to interpret important information that is helpful in de novo modeling. Unfortunately, not all features of the image can be captured using a single segmentation. In this paper, we present a segmentation-free approach to extract the gray-scale curve-like skeletons. The approach relies on a novel representation of the 3D image, where the image is modeled as a graph and a set of volume trees. A test containing 36 synthesized maps and one authentic map shows that our approach can improve the performance of the two tested tools used in de novo modeling. The improvements were 62 and 13 percent for Gorgon and DP-TOSS, respectively.
PMCID: PMC4104753  PMID: 24384713
Image processing; graphs; modeling techniques; volumetric image representation
24.  Analytical Solution of Steady State Equations for Chemical Reaction Networks with Bilinear Rate Laws 
True steady states are a rare occurrence in living organisms, yet their knowledge is essential for quasi-steady state approximations, multistability analysis, and other important tools in the investigation of chemical reaction networks (CRN) used to describe molecular processes on the cellular level. Here we present an approach that can provide closed form steady-state solutions to complex systems, resulting from CRN with binary reactions and mass-action rate laws. We map the nonlinear algebraic problem of finding steady states onto a linear problem in a higher dimensional space. We show that the linearized version of the steady state equations obeys the linear conservation laws of the original CRN. We identify two classes of problems for which complete, minimally parameterized solutions may be obtained using only the machinery of linear systems and a judicious choice of the variables used as free parameters. We exemplify our method, providing explicit formulae, on CRN describing signal initiation of two important types of RTK receptor-ligand systems, VEGF and EGF-ErbB1.
PMCID: PMC4090023  PMID: 24334389
Chemical reaction networks; cell signaling; VEGF; EGF; linear conservation laws; analytical solution; bilinear systems; minimally parameterized solutions
25.  The Receiver Operational Characteristic for Binary Classification with Multiple Indices and Its Application to the Neuroimaging Study of Alzheimer’s Disease 
Given a single index, the receiver operational characteristic (ROC) curve analysis is routinely utilized for characterizing performances in distinguishing two conditions/groups in terms of sensitivity and specificity. Given the availability of multiple data sources (referred to as multi-indices), such as multimodal neuroimaging data sets, cognitive tests, and clinical ratings and genomic data in Alzheimer’s disease (AD) studies, the single-index-based ROC underutilizes all available information. For a long time, a number of algorithmic/analytic approaches combining multiple indices have been widely used to simultaneously incorporate multiple sources. In this study, we propose an alternative for combining multiple indices using logical operations, such as “AND,” “OR,” and “at least n” (where n is an integer), to construct multivariate ROC (multiV-ROC) and characterize the sensitivity and specificity statistically associated with the use of multiple indices. With and without the “leave-one-out” cross-validation, we used two data sets from AD studies to showcase the potentially increased sensitivity/specificity of the multiV-ROC in comparison to the single-index ROC and linear discriminant analysis (an analytic way of combining multi-indices). We conclude that, for the data sets we investigated, the proposed multiV-ROC approach is capable of providing a natural and practical alternative with improved classification accuracy as compared to univariate ROC and linear discriminant analysis.
PMCID: PMC4085147  PMID: 23702553
Alzheimer’s dementia (AD); multiple indices; multiV-ROC; receiver operational characteristic (ROC)

Results 1-25 (66)