Related Articles

1.  PHAST: A Fast Phage Search Tool 
Nucleic Acids Research  2011;39(Web Server issue):W347-W352.
PHAge Search Tool (PHAST) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage ‘cornerstone’ feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and GenBank files, provide richly annotated tables on prophage features and prophage ‘quality’ and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. PHAST is available at (
PMCID: PMC3125810  PMID: 21672955
2.  A Model-Based Analysis of GC-Biased Gene Conversion in the Human and Chimpanzee Genomes 
PLoS Genetics  2013;9(8):e1003684.
GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts about 1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. They are also significantly enriched for disease-associated polymorphisms, suggesting that they contribute to the fixation of deleterious alleles. The gBGC tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages. They supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.
Author Summary
Interpreting patterns of DNA sequence variation in the genomes of closely related species is critically important for understanding the causes and functional effects of nucleotide substitutions. Classical models describe patterns of substitution in terms of the fundamental forces of mutation, recombination, neutral drift, and natural selection. However, an entirely separate force, called GC-biased gene conversion (gBGC), also appears to have an important influence on substitution patterns in many species. gBGC is a recombination-associated evolutionary process that favors the fixation of strong (G/C) over weak (A/T) alleles. In mammals, gBGC is thought to promote variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations. However, its genome-wide influence remains poorly understood, in part because it is difficult to incorporate gBGC into statistical models of evolution. In this paper, we describe a new evolutionary model that jointly describes the effects of selection and gBGC and apply it to the human and chimpanzee genomes. Our genome-wide predictions of gBGC tracts indicate that gBGC has been an important force in recent human evolution. Our publicly available computer program, called phastBias, and our genome-wide predictions will enable other researchers to consider gBGC in their analyses.
PMCID: PMC3744432  PMID: 23966869
3.  CpGcluster: a distance-based algorithm for CpG-island detection 
BMC Bioinformatics  2006;7:446.
Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.
Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome.
CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
PMCID: PMC1617122  PMID: 17038168
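The distance-based idea in the CpGcluster abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the distance definition (difference between CpG start positions), and the `min_cpgs` filter are assumptions made for illustration, and the p-value assignment step described in the abstract is omitted. Like the published algorithm, the sketch uses only integer arithmetic, and every reported cluster starts and ends with a CpG.

```python
def cpg_positions(seq):
    """Return the 0-based start position of every CpG dinucleotide in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cluster_cpgs(seq, max_distance, min_cpgs=2):
    """Group neighboring CpGs whose start positions differ by <= max_distance.

    Returns (start, end, n_cpgs) tuples with 0-based inclusive coordinates.
    start is the C of the first CpG and end is the G of the last CpG, so
    each cluster begins and ends with a CpG dinucleotide.
    max_distance and min_cpgs are hypothetical parameters for this sketch.
    """
    clusters = []
    current = []  # start positions of the CpGs in the cluster being built
    for pos in cpg_positions(seq):
        # Close the running cluster when the gap to the previous CpG is too large.
        if current and pos - current[-1] > max_distance:
            if len(current) >= min_cpgs:
                clusters.append((current[0], current[-1] + 1, len(current)))
            current = []
        current.append(pos)
    # Flush the final cluster.
    if len(current) >= min_cpgs:
        clusters.append((current[0], current[-1] + 1, len(current)))
    return clusters
```

For example, `cluster_cpgs("CGACG" + "T" * 10 + "CGCG", 5)` groups the four CpGs into two clusters separated by the run of T bases.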
4.  PKreport: report generation for checking population pharmacokinetic model assumptions 
Graphics play an important and unique role in population pharmacokinetic (PopPK) model building by exploring hidden structure among data before modeling, evaluating model fit, and validating results after modeling.
The work described in this paper is about a new R package called PKreport, which is able to generate a collection of plots and statistics for testing model assumptions, visualizing data and diagnosing models. The metric system is utilized as the currency for communicating between data sets and the package to generate special-purpose plots. It provides ways to match output from diverse software such as NONMEM, Monolix, R nlme package, etc. The package is implemented with S4 class hierarchy, and offers an efficient way to access the output from NONMEM 7. The final reports take advantage of the web browser as user interface to manage and visualize plots.
PKreport provides 1) a flexible and efficient R class to store and retrieve NONMEM 7 output; 2) automated plots for users to visualize data and models; 3) automatically generated R scripts that are used to create the plots; 4) an archive-oriented management tool for users to store, retrieve and modify figures; and 5) high-quality graphs based on the R packages lattice and ggplot2. The general architecture, running environment and statistical methods can be readily extended with R class hierarchy. PKreport is free to download at
PMCID: PMC3121579  PMID: 21575245
5.  A Model-Based Approach to Identify Binding Sites in CLIP-Seq Data 
PLoS ONE  2014;9(4):e93248.
Cross-linking immunoprecipitation coupled with high-throughput sequencing (CLIP-Seq) has made it possible to identify the targeting sites of RNA-binding proteins in various cell culture systems and tissue types on a genome-wide scale. Here we present a novel model-based approach (MiClip) to identify high-confidence protein-RNA binding sites from CLIP-seq datasets. This approach assigns a probability score for each potential binding site to help prioritize subsequent validation experiments. The MiClip algorithm has been tested in both HITS-CLIP and PAR-CLIP datasets. In the HITS-CLIP dataset, the signal/noise ratios of miRNA seed motif enrichment produced by the MiClip approach are between 17% and 301% higher than those by the ad hoc method for the top 10 most enriched miRNAs. In the PAR-CLIP dataset, the MiClip approach can identify ∼50% more validated binding targets than the original ad hoc method and two recently published methods. To facilitate the application of the algorithm, we have released an R package, MiClip (, and a public web-based graphical user interface software ( for customized analysis.
PMCID: PMC3979666  PMID: 24714572
6.  ttime: an R Package for Translating the Timing of Brain Development Across Mammalian Species 
Neuroinformatics  2010;8(3):201-205.
Understanding relationships between the sequence and timing of brain developmental events across a given set of mammalian species can provide information about both neural development and evolution. Yet neuro-developmental event timing data available from the published literature are incomplete, particularly for humans. Experimental documentation of unknown event timings requires considerable effort that can be expensive, time consuming, and for humans, often impossible. Application of suitable statistical models for translating neurodevelopmental event timings across mammalian species is essential. The present study implements an established statistical model and related functions as an open-source R package (ttime, translating time). The model incorporated into ttime allows predictions of unknown neurodevelopmental timings and explorations of phylogenetic relationships. The open-source package will enable transparency and reproducibility while minimizing redundancy. Sustainability and widespread dissemination will be guaranteed by the active CRAN (Comprehensive R Archive Network) community. The package updates the web-service (Clancy et al. 2007b) by permitting predictions based on curated event timing databases which may include species not yet incorporated in the current model. The R package can be integrated into complex workflows that use the event predictions in their analyses. The package ttime is publicly available and can be downloaded from
PMCID: PMC3189701  PMID: 20824390
Open-source; R package; Cross-species modeling; Cross-species comparisons; Neurodevelopment
7.  Local conservation scores without a priori assumptions on neutral substitution rates 
BMC Bioinformatics  2008;9:190.
Comparative genomics aims to detect signals of evolutionary conservation as an indicator of functional constraint. Surprisingly, results of the ENCODE project revealed that about half of the experimentally verified functional elements found in non-coding DNA were classified as unconstrained by computational predictions. Following this observation, it has been hypothesized that this may be partly explained by biased estimates on neutral evolutionary rates used by existing sequence conservation metrics. All methods we are aware of rely on a comparison with the neutral rate and conservation is estimated by measuring the deviation of a particular genomic region from this rate. Consequently, it is a reasonable assumption that inaccurate neutral rate estimates may lead to biased conservation and constraint estimates.
We propose a conservation signal that is produced by local Maximum Likelihood estimation of evolutionary parameters using an optimized sliding window and present a Kullback-Leibler projection that allows multiple different estimated parameters to be transformed into a conservation measure. This conservation measure does not rely on assumptions about neutral evolutionary substitution rates, and few a priori assumptions on the properties of the conserved regions are imposed. We show the accuracy of our approach (KuLCons) on synthetic data and compare it to the scores generated by state-of-the-art methods (phastCons, GERP, SCONE) in an ENCODE region. We find that KuLCons is most often in agreement with the conservation/constraint signatures detected by GERP and SCONE, while qualitatively very different patterns from phastCons are observed. In contrast to standard methods, KuLCons can be extended to more complex evolutionary models, e.g. taking insertion and deletion events into account. Corresponding results show that scores obtained under this model can diverge significantly from scores using the simpler model.
Our results suggest that discriminating among the different degrees of conservation is possible without making assumptions about neutral rates. We find, however, that it cannot be expected to discover considerably different constraint regions than GERP and SCONE. Consequently, we conclude that the reported discrepancies between experimentally verified functional and computationally identified constraint elements are likely not to be explained by biased neutral rate estimates.
PMCID: PMC2375903  PMID: 18405366
8.  adegenet 1.3-1: new tools for the analysis of genome-wide SNP data 
Bioinformatics  2011;27(21):3070-3071.
Summary: While the R software is becoming a standard for the analysis of genetic data, classical population genetics tools are being challenged by the increasing availability of genomic sequences. Dedicated tools are needed for harnessing the large amount of information generated by next-generation sequencing technologies. We introduce new tools implemented in the adegenet 1.3-1 package for handling and analyzing genome-wide single nucleotide polymorphism (SNP) data. Using a bit-level coding scheme for SNP data and parallelized computation, adegenet enables the analysis of large genome-wide SNP datasets using standard personal computers.
Availability: adegenet 1.3-1 is available from CRAN: Information and support including a dedicated forum of discussion can be found on the adegenet website: adegenet is released with a manual and four tutorials totalling over 300 pages of documentation, and distributed under the GNU General Public Licence (≥2).
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3198581  PMID: 21926124
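The adegenet abstract above mentions a bit-level coding scheme for SNP data. The abstract does not describe the actual storage layout, so the following is only an illustrative sketch of the general idea: storing binary allele calls at one bit each, eight SNPs per byte. The function names are hypothetical, and missing data and non-binary genotypes are not handled.

```python
def pack_snps(calls):
    """Pack a sequence of 0/1 allele calls into bytes, eight SNPs per byte.

    SNP i is stored in bit (i % 8) of byte (i // 8). This layout is an
    assumption of the sketch, not a description of adegenet's internals.
    """
    packed = bytearray((len(calls) + 7) // 8)
    for i, call in enumerate(calls):
        if call not in (0, 1):
            raise ValueError("this sketch supports only binary calls (0 or 1)")
        if call:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_snps(packed, n_snps):
    """Recover the first n_snps allele calls from bytes made by pack_snps."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(n_snps)]
```

Nine calls fit in two bytes under this layout, and `unpack_snps(pack_snps(calls), len(calls))` returns the original list. The number of SNPs has to be stored separately because the last byte may be only partly used.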
9.  PHAST: A Collaborative Machine Translation and Post-Editing Tool for Public Health 
This paper describes a novel collaborative machine translation (MT) plus post-editing system called PHAST (Public Health Automatic System for Translation,, tailored for use in producing multilingual education materials for public health. Its collaborative features highlight a new approach in public health informatics: sharing limited bilingual translation resources via a groupware system. We report here on the design methods and requirements used to develop PHAST and on its evaluation with potential public health users. Our results indicate such a system could be a feasible means of increasing the production of multilingual public health materials by reducing the barriers of time and cost. PHAST’s design can serve as a model for other communities interested in assuring the accuracy of MT through shared language expertise.
PMCID: PMC4765627  PMID: 26958182
10.  AFLPsim: an R package to simulate and detect dominant markers under selection in hybridizing populations 
Plant Methods  2014;10:40.
In spite of a large diversity of approaches to investigate loci under selection from a population genetic perspective, very few programs have been specifically designed to date to test selection in hybrids using dominant markers. In addition, simulators of dominant markers are very scarce and they do not usually take into account hybridization.
Here, we present a new, multifunctional, R package for dominant genetic markers, AFLPsim. This package can simulate dominant markers in hybridizing populations and implements genome scan methods for detecting outlier dominant loci in hybrids. In addition, it includes tools for further manipulating the results, plotting them and other tasks. We describe and tabulate the major functions implemented in AFLPsim. In addition, we provide some demonstration of its use and we perform a comparative study with other software. Finally, we conclude by briefly describing the input and output formats.
The AFLPsim R package provides several useful tools in the context of hybridization studies. It can simulate dominant markers in hybridizing populations and predict their demographic evolution. In addition, we implement a new genome scan method for detecting outlier dominant loci in hybrids, which shows a rather high sensitivity and is very conservative in comparison with Gagnaire et al.’s, Bayescan and introgress. The application is downloadable at
PMCID: PMC4413549  PMID: 25926861
Demographic simulation; Dominant markers; Genome scan; Hybridization; Outlier loci; R package
11.  CGene: an R package for implementation of causal genetic analyses 
European Journal of Human Genetics  2011;19(12):1292-1294.
The excitement over findings from Genome-Wide Association Studies (GWASs) has been tempered by the difficulty in finding the location of the true causal disease susceptibility loci (DSLs), rather than markers that are correlated with the causal variants. In addition, many recent GWASs have studied multiple phenotypes – often highly correlated – making it difficult to understand which associations are causal and which are seemingly causal, induced by phenotypic correlations. In order to identify DSLs, which are required to understand the genetic etiology of the observed associations, statistical methodology has been proposed that distinguishes between a direct effect of a genetic locus on the primary phenotype and an indirect effect induced by the association with the intermediate phenotype that is also correlated with the primary phenotype. However, so far, the application of this important methodology has been challenging, as no user-friendly software implementation exists. The lack of software implementation of this sophisticated methodology has prevented its large-scale use in the genetic community. We have now implemented this statistical approach in a user-friendly and robust R package that has been thoroughly tested. The R package ‘CGene’ is available for download at The R code is also available at
PMCID: PMC3230361  PMID: 21731061
causal modeling; statistical genetics; software
12.  DIME: R-package for identifying differential ChIP-seq based on an ensemble of mixture models 
Bioinformatics  2011;27(11):1569-1570.
Summary: Differential Identification using Mixtures Ensemble (DIME) is a package for identification of biologically significant differential binding sites between two conditions using ChIP-seq data. It considers a collection of finite mixture models combined with a false discovery rate (FDR) criterion to find statistically significant regions. This leads to a more reliable assessment of differential binding sites based on a statistical approach. In addition to ChIP-seq, DIME is also applicable to data from other high-throughput platforms.
Availability and implementation: DIME is implemented as an R-package, which is available at It may also be downloaded from
PMCID: PMC3102220  PMID: 21471015
13.  DR-Integrator: a new analytic tool for integrating DNA copy number and gene expression data 
Bioinformatics  2009;26(3):414-416.
Summary: DNA copy number alterations (CNA) frequently underlie gene expression changes by increasing or decreasing gene dosage. However, only a subset of genes with altered dosage exhibit concordant changes in gene expression. This subset is likely to be enriched for oncogenes and tumor suppressor genes, and can be identified by integrating these two layers of genome-scale data. We introduce DNA/RNA-Integrator (DR-Integrator), a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data. DR-Integrator identifies genes with significant correlations between DNA copy number and gene expression, and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes.
Availability: DR-Integrator is freely available for non-commercial use from the Pollack Lab at and can be downloaded as a plug-in application to Microsoft Excel and as a package for the R statistical computing environment. The R package is available under the name ‘DRI’ at An example analysis using DR-Integrator is included as supplemental material.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2815664  PMID: 20031972
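The DR-Integrator abstract above describes identifying genes whose DNA copy number correlates with gene expression across paired samples. A minimal sketch of the per-gene correlation step follows. The abstract does not state which correlation measure or significance test the tool uses, so Pearson correlation is an assumption here, the function names and the dictionary input layout are hypothetical, and no significance is computed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for constant input
    return sxy / math.sqrt(sxx * syy)


def correlate_genes(copy_number, expression):
    """Correlate copy number with expression for every gene present in both.

    Both arguments map a gene name to a list of per-sample values, with
    samples in the same order in both inputs (assumed layout).
    """
    return {
        gene: pearson(copy_number[gene], expression[gene])
        for gene in copy_number
        if gene in expression
    }
```

Genes with a strongly positive coefficient are the candidates with concordant dosage and expression changes. A real analysis would also need a significance assessment, which this sketch leaves out.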
14.  Alternative forms for genomic clines 
Ecology and Evolution  2013;3(7):1951-1966.
Understanding factors regulating hybrid fitness and gene exchange is a major research challenge for evolutionary biology. Genomic cline analysis has been used to evaluate alternative patterns of introgression, but only two models have been used widely and the approach has generally lacked a hypothesis testing framework for distinguishing effects of selection and drift. I propose two alternative cline models, implement multivariate outlier detection to identify markers associated with hybrid fitness, and simulate hybrid zone dynamics to evaluate the signatures of different modes of selection. Analysis of simulated data shows that previous approaches are prone to false positives (multinomial regression) or relatively insensitive to outlier loci affected by selection (Barton's concordance). The new, theory-based logit-logistic cline model is generally best at detecting loci affecting hybrid fitness. Although some generalizations can be made about different modes of selection, there is no one-to-one correspondence between pattern and process. These new methods will enhance our ability to extract important information about the genetics of reproductive isolation and hybrid fitness. However, much remains to be done to relate statistical patterns to particular evolutionary processes. The methods described here are implemented in a freely available package “HIest” for the R statistical software (CRAN;
PMCID: PMC3728937  PMID: 23919142
Admixture; hybrid zones; introgression; reproductive isolation; speciation
15.  'SEEDY' (Simulation of Evolutionary and Epidemiological Dynamics): An R Package to Follow Accumulation of Within-Host Mutation in Pathogens 
PLoS ONE  2015;10(6):e0129745.
Genome sequencing is an increasingly common component of infectious disease outbreak investigations. However, the relationship between pathogen transmission and observed genetic data is complex, and dependent on several uncertain factors. As such, simulation of pathogen dynamics is an important tool for interpreting observed genomic data in an infectious disease outbreak setting, in order to test hypotheses and to explore the range of outcomes consistent with a given set of parameters. We introduce ‘seedy’, an R package for the simulation of evolutionary and epidemiological dynamics ( Our software implements stochastic models for the accumulation of mutations within hosts, as well as individual-level disease transmission. By allowing variables such as the transmission bottleneck size, within-host effective population size and population mixing rates to be specified by the user, our package offers a flexible framework to investigate evolutionary dynamics during disease outbreaks. Furthermore, our software provides theoretical pairwise genetic distance distributions to provide a likelihood of person-to-person transmission based on genomic observations, and using this framework, implements transmission route assessment for genomic data collected during an outbreak. Our open source software provides an accessible platform for users to explore pathogen evolution and outbreak dynamics via simulation, and offers tools to assess observed genomic data in this context.
PMCID: PMC4467979  PMID: 26075402
16.  MM2S: personalized diagnosis of medulloblastoma patients and model systems 
Medulloblastoma (MB) is a highly malignant and heterogeneous brain tumour that is the most common cause of cancer-related deaths in children. Increasing availability of genomic data over the last decade had resulted in improvement of human subtype classification methods, and the parallel development of MB mouse models towards identification of subtype-specific disease origins and signaling pathways. Despite these advances, MB classification schemes remained inadequate for personalized prediction of MB subtypes for individual patient samples and across model systems. To address this issue, we developed the Medullo-Model to Subtypes (MM2S) classifier, a new method enabling classification of individual gene expression profiles from MB samples (patient samples, mouse models, and cell lines) against well-established molecular subtypes [Genomics 106:96-106, 2015]. We demonstrated the accuracy and flexibility of MM2S in the largest meta-analysis of human patients and mouse models to date. Here, we present a new functional package that provides an easy-to-use and fully documented implementation of the MM2S method, with additional functionalities that allow users to obtain graphical and tabular summaries of MB subtype predictions for single samples and across sample replicates. The flexibility of the MM2S package promotes incorporation of MB predictions into large Medulloblastoma-driven analysis pipelines, making this tool suitable for use by researchers.
The MM2S package is applied in two case studies involving human primary patient samples, as well as sample replicates of the GTML mouse model. We highlight functions that are of use for species-specific MB classification, across individual samples and sample replicates. We emphasize the range of functions that can be used to derive both singular and meta-centric views of MB predictions, across samples and across MB subtypes.
Our MM2S package can be used to generate predictions without having to rely on an external web server or additional sources. Our open-source package facilitates and extends the MM2S algorithm in diverse computational and bioinformatics contexts. The package is available on CRAN, at the following URL:, as well as on Github at the following URLs: and
Electronic supplementary material
The online version of this article (doi:10.1186/s13029-016-0053-y) contains supplementary material, which is available to authorized users.
PMCID: PMC4827218  PMID: 27069505
Subtype classification; Medulloblastoma; Diagnosis; Single-sample; Cancer; Mouse models; Primary tumours
17.  nCal: an R package for non-linear calibration 
Bioinformatics  2013;29(20):2653-2654.
Summary: Non-linear calibration is a widely used method for quantifying biomarkers wherein concentration-response curves estimated using samples of known concentrations are used to predict the biomarker concentrations in the samples of interest. The R package nCal fills an important gap in the open source, stand-alone software for performing non-linear calibration. For curve fitting, nCal provides a new implementation of a robust, Bayesian hierarchical five-parameter logistic model. nCal supports a simple graphical user interface that can be used by laboratory scientists, and contains functionality for importing data from the multiplex bead array assay instrumentation.
Availability: The R package ‘nCal’ is available from under GPL-2 or later.
Supplementary information: Supplementary information is available in the form of an R package vignette at the above repository and an FAQ at
PMCID: PMC3789552  PMID: 23926226
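The nCal abstract above describes calibration with a five-parameter logistic (5PL) concentration-response curve: the curve is fitted on standards of known concentration and then inverted to predict unknown concentrations. The sketch below shows only the curve and its inverse for given parameter values. The parameterization is a commonly used one and is an assumption here; nCal's own parameterization and its Bayesian hierarchical fitting are not reproduced, and the function names are hypothetical.

```python
def five_pl(x, a, b, c, d, g):
    """Five-parameter logistic response at concentration x > 0.

    a: response as x approaches 0, d: response as x grows large,
    c: scale (concentration), b: slope, g: asymmetry (g = 1 gives the 4PL).
    Assumed form: d + (a - d) / (1 + (x / c) ** b) ** g.
    """
    return d + (a - d) / (1.0 + (x / c) ** b) ** g


def inverse_five_pl(y, a, b, c, d, g):
    """Concentration whose predicted response equals y (the calibration step)."""
    if y == d:
        raise ValueError("response equals the asymptote d; no finite solution")
    ratio = (a - d) / (y - d)
    if ratio <= 1.0:
        raise ValueError("response lies outside the range of the curve")
    return c * (ratio ** (1.0 / g) - 1.0) ** (1.0 / b)
```

With `a=0, b=2, c=10, d=100, g=1`, a concentration of 10 gives a response of 50, and inverting a response of 50 returns a concentration of 10. Responses at or beyond the asymptotes cannot be inverted, which is why the inverse raises an error in those cases.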
18.  Direction pathway analysis of large-scale proteomics data reveals novel features of the insulin action pathway 
Bioinformatics  2013;30(6):808-814.
Motivation: With the advancement of high-throughput techniques, large-scale profiling of biological systems with multiple experimental perturbations is becoming more prevalent. Pathway analysis incorporates prior biological knowledge to analyze genes/proteins in groups in a biological context. However, the hypotheses under investigation are often confined to a 1D space (i.e. up, down, either or mixed regulation). Here, we develop direction pathway analysis (DPA), which can be applied to test hypotheses in a high-dimensional space for identifying pathways that display distinct responses across multiple perturbations.
Results: Our DPA approach allows for the identification of pathways that display distinct responses across multiple perturbations. To demonstrate the utility and effectiveness, we evaluated DPA under various simulated scenarios and applied it to study insulin action in adipocytes. A major action of insulin in adipocytes is to regulate the movement of proteins from the interior to the cell surface membrane. Quantitative mass spectrometry-based proteomics was used to study this process on a large scale. The combined dataset comprises four separate treatments. By applying DPA, we identified that several insulin responsive pathways in the plasma membrane trafficking are only partially dependent on the insulin-regulated kinase Akt. We subsequently validated our findings through targeted analysis of key proteins from these pathways using immunoblotting and live cell microscopy. Our results demonstrate that DPA can be applied to dissect pathway networks testing diverse hypotheses and integrating multiple experimental perturbations.
Availability and implementation: The R package ‘directPA’ is distributed from CRAN under GNU General Public License (GPL)-3 and can be downloaded from:
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3957074  PMID: 24167158
19.  phangorn: phylogenetic analysis in R 
Bioinformatics  2010;27(4):592-593.
Summary: phangorn is a package for phylogenetic reconstruction and analysis in the R language. Previously it was only possible to estimate phylogenetic trees with distance methods in R. phangorn now offers the possibility of reconstructing phylogenies with distance-based methods, maximum parsimony or maximum likelihood (ML) and performing Hadamard conjugation. Extending the general ML framework, this package provides the possibility of estimating mixture and partition models. Furthermore, phangorn offers several functions for comparing trees, phylogenetic models or splits, simulating character data and performing congruence analyses.
Availability: phangorn can be obtained through the CRAN homepage phangorn is licensed under GPL 2.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3035803  PMID: 21169378
20.  EXPANDS: expanding ploidy and allele frequency on nested subpopulations 
Bioinformatics  2013;30(1):50-60.
Motivation: Several cancer types consist of multiple genetically and phenotypically distinct subpopulations. The underlying mechanism for this intra-tumoral heterogeneity can be explained by the clonal evolution model, whereby growth advantageous mutations cause the expansion of cancer cell subclones. The recurrent phenotype of many cancers may be a consequence of these coexisting subpopulations responding unequally to therapies. Methods to computationally infer tumor evolution and subpopulation diversity are emerging and they hold the promise to improve the understanding of genetic and molecular determinants of recurrence.
Results: To address cellular subpopulation dynamics within human tumors, we developed a bioinformatic method, EXPANDS. It estimates the proportion of cells harboring specific mutations in a tumor. By modeling cellular frequencies as probability distributions, EXPANDS predicts mutations that accumulate in a cell before its clonal expansion. We assessed the performance of EXPANDS on one whole genome sequenced breast cancer and performed subpopulation (SP) analyses on 118 glioblastoma multiforme samples obtained from TCGA. Our results characterize the extent of subclonal diversity in primary glioblastoma and subpopulation dynamics during recurrence, and provide a set of candidate genes mutated in the most well-adapted subpopulations. In summary, EXPANDS predicts tumor purity and subclonal composition from sequencing data.
Availability and implementation: EXPANDS is available for download at (matlab version - used in this manuscript) and (R version).
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3866558  PMID: 24177718
21.  Clusthaplo: a plug-in for MCQTL to enhance QTL detection using ancestral alleles in multi-cross design 
Key message
We enhance power and accuracy of QTL mapping in multiple related families, by clustering the founders of the families on their local genomic similarity.
MCQTL is a linkage mapping software application that allows the joint QTL mapping of multiple related families. In its current implementation, QTLs are modeled with one or two parameters for each parent that is a founder of the multi-cross design. The higher the number of parents, the higher the number of model parameters, which can impact the power and the accuracy of the mapping. We propose to make use of the increasingly dense genotyping information available on the founders to reduce the number of MCQTL parameters and thus boost QTL discovery. We developed clusthaplo, an R package that aims to cluster haplotypes using a genomic similarity that reflects the probability of sharing the same ancestral allele. Computed in a sliding window along the genome and followed by a clustering method, the genomic similarity allows the local clustering of the parent haplotypes. Our assumption is that haplotypes belonging to the same class transmit the same ancestral allele. Their putative QTL allelic effects can therefore be modeled with the same parameter, leading to a parsimonious model that is plugged into MCQTL. Intensive simulations using three maize data sets showed a significant gain in power and accuracy of the QTL mapping with the ancestral allele model compared to the classical MCQTL model. MCQTL_LD (clusthaplo outputs plugged into MCQTL) is a versatile and powerful tool for QTL mapping in multiple related families that makes use of linkage and linkage disequilibrium.
Electronic supplementary material
The online version of this article (doi:10.1007/s00122-014-2267-1) contains supplementary material, which is available to authorized users.
PMCID: PMC3964294  PMID: 24482114
22.  WAVECLOCK: wavelet analysis of circadian oscillation 
Bioinformatics  2008;24(23):2794-2795.
Summary: Oscillations in mRNA and protein of circadian clock components can be continuously monitored in vitro using synchronized cell lines. These rhythms can be highly variable due to culture conditions and are non-stationary due to baseline trends, damping and drift in period length. We present a technique for characterizing the modal frequencies of oscillation using continuous wavelet decomposition to non-parametrically model changes in amplitude and period while removing baseline effects and noise.
Availability: The method has been implemented as the package waveclock for the free statistical software program R and is available for download from
Supplementary information: Supplementary figures are available at Bioinformatics online.
PMCID: PMC2639275  PMID: 18931366
23.  CollapsABEL: an R library for detecting compound heterozygote alleles in genome-wide association studies 
BMC Bioinformatics  2016;17:156.
Compound Heterozygosity (CH) in classical genetics is the presence of two different recessive mutations at a particular gene locus. A relaxed form of CH alleles may account for an essential proportion of the missing heritability, i.e. heritability of phenotypes so far not accounted for by single genetic variants. Methods to detect CH-like effects in genome-wide association studies (GWAS) may facilitate explaining the missing heritability, but to our knowledge no viable software tools for this purpose are currently available.
In this work we present the Generalized Compound Double Heterozygosity (GCDH) test and its implementation in the R package CollapsABEL. Time-consuming procedures are optimized for computational efficiency using Java or C++. Intermediate results are stored either in an SQL database or in a so-called big.matrix file to achieve a reasonable memory footprint. Our large-scale simulation studies show that GCDH is capable of discovering genetic associations due to CH-like interactions with much higher power than a conventional single-SNP approach under various settings, whether the causal genetic variations are available or not. CollapsABEL provides a user-friendly pipeline for genotype collapsing, statistical testing, power estimation, type I error control and graphics generation in the R language.
CollapsABEL provides a computationally efficient solution for screening general forms of CH alleles in densely imputed microarray or whole genome sequencing datasets. The GCDH test provides improved power over single-SNP based methods in detecting the prevalence of CH in human complex phenotypes, offering an opportunity for tackling the missing heritability problem.
Binary and source packages of CollapsABEL are available on CRAN and the website of the GenABEL project.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1006-9) contains supplementary material, which is available to authorized users.
PMCID: PMC4826552  PMID: 27059780
Genome wide association study; Next generation sequencing; Compound heterozygosity; Missing heritability
24.  SeqFeatR for the Discovery of Feature-Sequence Associations 
PLoS ONE  2016;11(1):e0146409.
Specific selection pressures often lead to specifically mutated genomes. The open source software SeqFeatR has been developed to identify associations between mutation patterns in biological sequences and specific selection pressures (“features”). For instance, SeqFeatR has been used to discover new T cell epitopes in viral protein sequences for hosts of given HLA types. SeqFeatR supports frequentist and Bayesian methods for the discovery of statistical sequence-feature associations. Moreover, it offers novel ways to visualize results of the statistical analyses and to relate them to further properties. In this article we demonstrate various functions of SeqFeatR with real data. The most frequently used set of functions is also provided by a web server. SeqFeatR is implemented as an R package and is freely available from the R archive CRAN. The package includes a tutorial vignette. The software is distributed under the GNU General Public License (version 3 or later). The web server URL is
PMCID: PMC4701496  PMID: 26731669
25.  PurBayes: estimating tumor cellularity and subclonality in next-generation sequencing data 
Bioinformatics  2013;29(15):1888-1889.
Summary: We have developed a novel Bayesian method, PurBayes, which uses finite mixture modeling to estimate tumor purity and detect intratumor heterogeneity based on next-generation sequencing data of paired tumor-normal tissue samples. We demonstrate our approach using simulated data and discuss its performance under varying conditions.
Availability: PurBayes is implemented as an R package, and source code is available for download through CRAN at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3712213  PMID: 23749958
