Related Articles

1.  PHAST: A Fast Phage Search Tool 
Nucleic Acids Research  2011;39(Web Server issue):W347-W352.
PHAge Search Tool (PHAST) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage ‘cornerstone’ feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and GenBank files, provide richly annotated tables on prophage features and prophage ‘quality’ and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. PHAST is available at (
PMCID: PMC3125810  PMID: 21672955
2.  PHASTER: a better, faster version of the PHAST phage search tool 
Nucleic Acids Research  2016;44(Web Server issue):W16-W21.
PHASTER (PHAge Search Tool – Enhanced Release) is a significant upgrade to the popular PHAST web server for the rapid identification and annotation of prophage sequences within bacterial genomes and plasmids. Although the steps in the phage identification pipeline in PHASTER remain largely the same as in the original PHAST, numerous software improvements and significant hardware enhancements have now made PHASTER faster, more efficient, more visually appealing and much more user friendly. In particular, PHASTER is now 4.3× faster than PHAST when analyzing a typical bacterial genome. More specifically, software optimizations have made the backend of PHASTER 2.7× faster than PHAST, while the addition of 80 CPUs to the PHASTER compute cluster is responsible for the remaining speed-up. PHASTER can now process a typical bacterial genome in 3 min from the raw sequence alone, or in 1.5 min when given a pre-annotated GenBank file. A number of other optimizations have also been implemented, including automated algorithms to reduce the size and redundancy of PHASTER's databases, improvements in handling multiple (metagenomic) queries and higher user traffic, along with the ability to perform automated look-ups against 14 000 previously PHAST/PHASTER annotated bacterial genomes (which can lead to complete phage annotations in seconds as opposed to minutes). PHASTER's web interface has also been entirely rewritten. A new graphical genome browser has been added, gene/genome visualization tools have been improved, and the graphical interface is now more modern, robust and user-friendly. PHASTER is available online at
PMCID: PMC4987931  PMID: 27141966
3.  CpGcluster: a distance-based algorithm for CpG-island detection 
BMC Bioinformatics  2006;7:446.
Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.
Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome.
CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
PMCID: PMC1617122  PMID: 17038168
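The distance-based idea in the CpGcluster abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the start-to-start distance convention and the `min_cpgs` parameter are assumptions made for illustration, and the p-value step that ranks clusters is left out. Like the algorithm described above, it uses only integer arithmetic, takes the distance between consecutive CpGs as its main parameter, and yields clusters that start and end with a CpG.

```python
def find_cpg_positions(seq):
    """Return the 0-based start position of every CG dinucleotide in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cluster_cpgs(positions, max_distance, min_cpgs=2):
    """Group consecutive CpGs whose start-to-start distance is <= max_distance.

    Returns (start, end) pairs, 0-based with an exclusive end, so every
    cluster begins at the C of its first CpG and stops after the G of its last.
    min_cpgs is a hypothetical filter, not a CpGcluster parameter.
    """
    clusters = []
    if not positions:
        return clusters
    run = [positions[0]]
    for pos in positions[1:]:
        if pos - run[-1] <= max_distance:
            run.append(pos)          # close enough: extend the current cluster
        else:
            if len(run) >= min_cpgs:
                clusters.append((run[0], run[-1] + 2))
            run = [pos]              # too far: start a new cluster
    if len(run) >= min_cpgs:
        clusters.append((run[0], run[-1] + 2))
    return clusters
```

Calling `cluster_cpgs(find_cpg_positions(sequence), max_distance=5)` returns the coordinates of candidate clusters; assigning statistical significance to them would be a separate step.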
4.  Laboratory Investigation of Salmonella enterica serovar Poona Outbreak in California: Comparison of Pulsed-Field Gel Electrophoresis (PFGE) and Whole Genome Sequencing (WGS) Results 
PLoS Currents  2016;8:ecurrents.outbreaks.1bb3e36e74bd5779bc43ac3a8dae52e6.
Introduction: Recently, Salmonella enterica serovar Poona caused a multistate outbreak, with 245 out of 907 cases occurring in California. We report a comparison of pulsed-field gel electrophoresis (PFGE) results with whole genome sequencing (WGS) for genotyping of Salmonella Poona isolates.
Methods: CA Salmonella Poona isolates, collected from July to August 2015, were genotyped by PFGE using the XbaI restriction enzyme. WGS was done using the Nextera XT library kit with 2x300 bp or 2x250 bp sequencing chemistry on the Illumina MiSeq Sequencer. Reads were mapped to the de novo assembled serovar Poona draft genome (48 contigs, N50 = 223,917) from the outbreak using CLCbio GW 8.0.2. The phylogenetic tree was generated based on hqSNP calling. Genomes were annotated with CGE and PHAST online tools. In silico MLST was performed using the CGE online tool.
Results: Human (14) and cucumber (2) Salmonella Poona isolates exhibited 3 possibly related PFGE patterns (JL6X01.0018 [predominant], JL6X01.0375, JL6X01.0778). All isolates that were related by PFGE also clustered together according to the WGS. One isolate with a divergent PFGE pattern (JL6X01.0776) served as an outlier in the phylogenetic analysis and substantially differed from the outbreak clade by WGS. All outbreak isolates were assigned to MLST sequence type 447. The majority of the outbreak-related isolates possessed the same set of Salmonella Pathogenicity Islands with few variations. One outbreak isolate was sequenced and analyzed independently by CDC and CDPH laboratories; there was 0 SNP difference in results. Two additional isolates were sequenced by CDC and the raw data were processed through CDPH and CDC analysis pipelines. Both data analysis pipelines also generated concordant results.
Discussion: PFGE and WGS results for the recent CA Salmonella enterica serovar Poona outbreak provided concordant assignment of the isolates to the outbreak cluster. WGS allowed more robust determination of genetic relatedness, provided information regarding MLST-type, pathogenicity genes, and bacteriophage content. WGS data obtained independently at two laboratories showed complete agreement.
PMCID: PMC5145817  PMID: 28018748
Outbreak; phylogenetic analysis; pulsed-field gel electrophoresis; Salmonella serovar Poona; whole-genome sequencing
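The Methods above report an N50 for the draft assembly. As a reminder of what that statistic measures, here is a minimal Python sketch using the common definition: the contig length at which the contigs of that length or longer together cover at least half of the assembly. The function name is illustrative and the code is not taken from the study's pipeline.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given the lengths of its contigs."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")
```

A higher N50 for the same total length indicates a less fragmented assembly.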
5.  A Model-Based Analysis of GC-Biased Gene Conversion in the Human and Chimpanzee Genomes 
PLoS Genetics  2013;9(8):e1003684.
GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts about 1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. They are also significantly enriched for disease-associated polymorphisms, suggesting that they contribute to the fixation of deleterious alleles. The gBGC tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages. They supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.
Author Summary
Interpreting patterns of DNA sequence variation in the genomes of closely related species is critically important for understanding the causes and functional effects of nucleotide substitutions. Classical models describe patterns of substitution in terms of the fundamental forces of mutation, recombination, neutral drift, and natural selection. However, an entirely separate force, called GC-biased gene conversion (gBGC), also appears to have an important influence on substitution patterns in many species. gBGC is a recombination-associated evolutionary process that favors the fixation of strong (G/C) over weak (A/T) alleles. In mammals, gBGC is thought to promote variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations. However, its genome-wide influence remains poorly understood, in part because it is difficult to incorporate gBGC into statistical models of evolution. In this paper, we describe a new evolutionary model that jointly describes the effects of selection and gBGC and apply it to the human and chimpanzee genomes. Our genome-wide predictions of gBGC tracts indicate that gBGC has been an important force in recent human evolution. Our publicly available computer program, called phastBias, and our genome-wide predictions will enable other researchers to consider gBGC in their analyses.
PMCID: PMC3744432  PMID: 23966869
6.  PKreport: report generation for checking population pharmacokinetic model assumptions 
Graphics play an important and unique role in population pharmacokinetic (PopPK) model building by exploring hidden structure among data before modeling, evaluating model fit, and validating results after modeling.
The work described in this paper is about a new R package called PKreport, which is able to generate a collection of plots and statistics for testing model assumptions, visualizing data and diagnosing models. The metric system is utilized as the currency for communicating between data sets and the package to generate special-purpose plots. It provides ways to match output from diverse software such as NONMEM, Monolix, R nlme package, etc. The package is implemented with S4 class hierarchy, and offers an efficient way to access the output from NONMEM 7. The final reports take advantage of the web browser as user interface to manage and visualize plots.
PKreport provides 1) a flexible and efficient R class to store and retrieve NONMEM 7 output; 2) automated plots for users to visualize data and models; 3) automatically generated R scripts that are used to create the plots; 4) an archive-oriented management tool for users to store, retrieve and modify figures; and 5) high-quality graphs based on the R packages lattice and ggplot2. The general architecture, running environment and statistical methods can be readily extended through the R class hierarchy. PKreport is free to download at
PMCID: PMC3121579  PMID: 21575245
7.  The Ruby UCSC API: accessing the UCSC genome database using Ruby 
BMC Bioinformatics  2012;13:240.
The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby.
The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast.
The API uses the bin index—if available—when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby).
Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will help biologists query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at under the Ruby license. Feedback and help are provided via the website at
PMCID: PMC3542311  PMID: 22994508
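The bin index mentioned in the abstract refers to the UCSC binning scheme, which assigns every genomic interval to the smallest bin that fully contains it so that interval queries can be restricted to a handful of bins. The Python sketch below follows the standard five-level scheme (128 kb, 1 Mb, 8 Mb, 64 Mb and 512 Mb bins) for coordinates up to 512 Mb. The abstract does not say how the Ruby API computes bins internally, so the function name and structure here are illustrative only.

```python
# Offsets of the first bin at each level, from the smallest (128 kb) bins
# up to the single 512 Mb bin, in the standard UCSC scheme.
BIN_OFFSETS = (585, 73, 9, 1, 0)
FIRST_SHIFT = 17  # 2**17 = 128 kb, the size of the smallest bins
NEXT_SHIFT = 3    # each level up is 8 times larger


def bin_from_range(start, end):
    """Return the bin for a 0-based, half-open interval [start, end)."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            # Both ends fall in the same bin at this level.
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError("interval does not fit in the 512 Mb standard scheme")
```

A range query would then look up the bins at each level that overlap the query interval and filter rows by `bin` before comparing start and end coordinates.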
8.  AFLPsim: an R package to simulate and detect dominant markers under selection in hybridizing populations 
Plant Methods  2014;10:40.
In spite of a large diversity of approaches to investigate loci under selection from a population genetic perspective, very few programs have been specifically designed to date to test selection in hybrids using dominant markers. In addition, simulators of dominant markers are very scarce and they do not usually take into account hybridization.
Here, we present a new, multifunctional, R package for dominant genetic markers, AFLPsim. This package can simulate dominant markers in hybridizing populations and implements genome scan methods for detecting outlier dominant loci in hybrids. In addition, it includes tools for further manipulating the results, plotting them and other tasks. We describe and tabulate the major functions implemented in AFLPsim. In addition, we provide some demonstration of its use and we perform a comparative study with other software. Finally, we conclude by briefly describing the input and output formats.
The AFLPsim R package provides several useful tools in the context of hybridization studies. It can simulate dominant markers in hybridizing populations and predict their demographic evolution. In addition, we implement a new genome scan method for detecting outlier dominant loci in hybrids, which shows a rather high sensitivity and is very conservative in comparison with Gagnaire et al.’s, Bayescan and introgress. The application is downloadable at
PMCID: PMC4413549  PMID: 25926861
Demographic simulation; Dominant markers; Genome scan; Hybridization; Outlier loci; R package
9.  ttime: an R Package for Translating the Timing of Brain Development Across Mammalian Species 
Neuroinformatics  2010;8(3):201-205.
Understanding relationships between the sequence and timing of brain developmental events across a given set of mammalian species can provide information about both neural development and evolution. Yet neuro-developmental event timing data available from the published literature are incomplete, particularly for humans. Experimental documentation of unknown event timings requires considerable effort that can be expensive, time consuming, and for humans, often impossible. Application of suitable statistical models for translating neurodevelopmental event timings across mammalian species is essential. The present study implements an established statistical model and related functions as an open-source R package (ttime, translating time). The model incorporated into ttime allows predictions of unknown neurodevelopmental timings and explorations of phylogenetic relationships. The open-source package will enable transparency and reproducibility while minimizing redundancy. Sustainability and widespread dissemination will be guaranteed by the active CRAN (Comprehensive R Archive Network) community. The package updates the web-service (Clancy et al. 2007b) by permitting predictions based on curated event timing databases which may include species not yet incorporated in the current model. The R package can be integrated into complex workflows that use the event predictions in their analyses. The package ttime is publicly available and can be downloaded from
PMCID: PMC3189701  PMID: 20824390
Open-source; R package; Cross-species modeling; Cross-species comparisons; Neurodevelopment
10.  CGene: an R package for implementation of causal genetic analyses 
European Journal of Human Genetics  2011;19(12):1292-1294.
The excitement over findings from Genome-Wide Association Studies (GWASs) has been tempered by the difficulty in finding the location of the true causal disease susceptibility loci (DSLs), rather than markers that are correlated with the causal variants. In addition, many recent GWASs have studied multiple phenotypes – often highly correlated – making it difficult to understand which associations are causal and which are seemingly causal, induced by phenotypic correlations. In order to identify DSLs, which are required to understand the genetic etiology of the observed associations, statistical methodology has been proposed that distinguishes between a direct effect of a genetic locus on the primary phenotype and an indirect effect induced by the association with the intermediate phenotype that is also correlated with the primary phenotype. However, so far, the application of this important methodology has been challenging, as no user-friendly software implementation exists. The lack of software implementation of this sophisticated methodology has prevented its large-scale use in the genetic community. We have now implemented this statistical approach in a user-friendly and robust R package that has been thoroughly tested. The R package ‘CGene’ is available for download at The R code is also available at
PMCID: PMC3230361  PMID: 21731061
causal modeling; statistical genetics; software
11.  DR-Integrator: a new analytic tool for integrating DNA copy number and gene expression data 
Bioinformatics  2009;26(3):414-416.
Summary: DNA copy number alterations (CNA) frequently underlie gene expression changes by increasing or decreasing gene dosage. However, only a subset of genes with altered dosage exhibit concordant changes in gene expression. This subset is likely to be enriched for oncogenes and tumor suppressor genes, and can be identified by integrating these two layers of genome-scale data. We introduce DNA/RNA-Integrator (DR-Integrator), a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data. DR-Integrator identifies genes with significant correlations between DNA copy number and gene expression, and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes.
Availability: DR-Integrator is freely available for non-commercial use from the Pollack Lab at and can be downloaded as a plug-in application to Microsoft Excel and as a package for the R statistical computing environment. The R package is available under the name ‘DRI’ at An example analysis using DR-Integrator is included as supplemental material.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2815664  PMID: 20031972
12.  Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser 
Bioinformatics  2013;30(7):1003-1005.
Summary: Track data hubs provide an efficient mechanism for visualizing remotely hosted Internet-accessible collections of genome annotations. Hub datasets can be organized, configured and fully integrated into the University of California Santa Cruz (UCSC) Genome Browser and accessed through the familiar browser interface. For the first time, individuals can use the complete browser feature set to view custom datasets without the overhead of setting up and maintaining a mirror.
Availability and implementation: Source code for the BigWig, BigBed and Genome Browser software is freely available for non-commercial use at, implemented in C and supported on Linux. Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at Binary Alignment/Map (BAM) and Variant Call Format (VCF)/tabix utilities are available from and The UCSC Genome Browser is publicly accessible at
PMCID: PMC3967101  PMID: 24227676
13.  adegenet 1.3-1: new tools for the analysis of genome-wide SNP data 
Bioinformatics  2011;27(21):3070-3071.
Summary: While the R software is becoming a standard for the analysis of genetic data, classical population genetics tools are being challenged by the increasing availability of genomic sequences. Dedicated tools are needed for harnessing the large amount of information generated by next-generation sequencing technologies. We introduce new tools implemented in the adegenet 1.3-1 package for handling and analyzing genome-wide single nucleotide polymorphism (SNP) data. Using a bit-level coding scheme for SNP data and parallelized computation, adegenet enables the analysis of large genome-wide SNPs datasets using standard personal computers.
Availability: adegenet 1.3-1 is available from CRAN: Information and support including a dedicated forum of discussion can be found on the adegenet website: adegenet is released with a manual and four tutorials totalling over 300 pages of documentation, and distributed under the GNU General Public Licence (≥2).
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3198581  PMID: 21926124
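The memory saving behind a bit-level coding scheme can be shown with a short Python sketch that packs binary SNP calls eight to a byte. This is only an illustration of the general idea: the function names and the bit order are assumptions, and the sketch does not reproduce adegenet's actual storage layout or its handling of missing data.

```python
def pack_snps(alleles):
    """Pack a sequence of 0/1 allele calls into bytes, eight SNPs per byte."""
    packed = bytearray((len(alleles) + 7) // 8)
    for i, allele in enumerate(alleles):
        if allele not in (0, 1):
            raise ValueError("only binary (0/1) calls are supported in this sketch")
        if allele:
            packed[i // 8] |= 1 << (i % 8)   # set bit i within its byte
    return bytes(packed)


def unpack_snps(packed, n_snps):
    """Recover the first n_snps allele calls from a packed byte string."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(n_snps)]
```

Stored this way, one million binary SNPs for a single individual occupy about 125 kB, compared with several megabytes when each call is held as a separate integer.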
14.  Direction pathway analysis of large-scale proteomics data reveals novel features of the insulin action pathway 
Bioinformatics  2013;30(6):808-814.
Motivation: With the advancement of high-throughput techniques, large-scale profiling of biological systems with multiple experimental perturbations is becoming more prevalent. Pathway analysis incorporates prior biological knowledge to analyze genes/proteins in groups in a biological context. However, the hypotheses under investigation are often confined to a 1D space (i.e. up, down, either or mixed regulation). Here, we develop direction pathway analysis (DPA), which can be applied to test hypotheses in a high-dimensional space for identifying pathways that display distinct responses across multiple perturbations.
Results: Our DPA approach allows for the identification of pathways that display distinct responses across multiple perturbations. To demonstrate the utility and effectiveness, we evaluated DPA under various simulated scenarios and applied it to study insulin action in adipocytes. A major action of insulin in adipocytes is to regulate the movement of proteins from the interior to the cell surface membrane. Quantitative mass spectrometry-based proteomics was used to study this process on a large-scale. The combined dataset comprises four separate treatments. By applying DPA, we identified that several insulin responsive pathways in the plasma membrane trafficking are only partially dependent on the insulin-regulated kinase Akt. We subsequently validated our findings through targeted analysis of key proteins from these pathways using immunoblotting and live cell microscopy. Our results demonstrate that DPA can be applied to dissect pathway networks testing diverse hypotheses and integrating multiple experimental perturbations.
Availability and implementation: The R package ‘directPA’ is distributed from CRAN under GNU General Public License (GPL)-3 and can be downloaded from:
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3957074  PMID: 24167158
15.  A Model-Based Approach to Identify Binding Sites in CLIP-Seq Data 
PLoS ONE  2014;9(4):e93248.
Cross-linking immunoprecipitation coupled with high-throughput sequencing (CLIP-Seq) has made it possible to identify the targeting sites of RNA-binding proteins in various cell culture systems and tissue types on a genome-wide scale. Here we present a novel model-based approach (MiClip) to identify high-confidence protein-RNA binding sites from CLIP-seq datasets. This approach assigns a probability score for each potential binding site to help prioritize subsequent validation experiments. The MiClip algorithm has been tested in both HITS-CLIP and PAR-CLIP datasets. In the HITS-CLIP dataset, the signal/noise ratios of miRNA seed motif enrichment produced by the MiClip approach are between 17% and 301% higher than those by the ad hoc method for the top 10 most enriched miRNAs. In the PAR-CLIP dataset, the MiClip approach can identify ∼50% more validated binding targets than the original ad hoc method and two recently published methods. To facilitate the application of the algorithm, we have released an R package, MiClip (, and a public web-based graphical user interface software ( for customized analysis.
PMCID: PMC3979666  PMID: 24714572
16.  CollapsABEL: an R library for detecting compound heterozygote alleles in genome-wide association studies 
BMC Bioinformatics  2016;17:156.
Compound Heterozygosity (CH) in classical genetics is the presence of two different recessive mutations at a particular gene locus. A relaxed form of CH alleles may account for an essential proportion of the missing heritability, i.e. heritability of phenotypes so far not accounted for by single genetic variants. Methods to detect CH-like effects in genome-wide association studies (GWAS) may facilitate explaining the missing heritability, but to our knowledge no viable software tools for this purpose are currently available.
In this work we present the Generalized Compound Double Heterozygosity (GCDH) test and its implementation in the R package CollapsABEL. Time-consuming procedures are optimized for computational efficiency using Java or C++. Intermediate results are stored either in an SQL database or in a so-called big.matrix file to achieve a reasonable memory footprint. Our large-scale simulation studies show that GCDH is capable of discovering genetic associations due to CH-like interactions with much higher power than a conventional single-SNP approach under various settings, whether the causal genetic variations are available or not. CollapsABEL provides a user-friendly pipeline for genotype collapsing, statistical testing, power estimation, type I error control and graphics generation in the R language.
CollapsABEL provides a computationally efficient solution for screening general forms of CH alleles in densely imputed microarray or whole genome sequencing datasets. The GCDH test provides an improved power over single-SNP based methods in detecting the prevalence of CH in human complex phenotypes, offering an opportunity for tackling the missing heritability problem.
Binary and source packages of CollapsABEL are available on CRAN ( and the website of the GenABEL project (
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1006-9) contains supplementary material, which is available to authorized users.
PMCID: PMC4826552  PMID: 27059780
Genome wide association study; Next generation sequencing; Compound heterozygosity; Missing heritability
17.  SeqFeatR for the Discovery of Feature-Sequence Associations 
PLoS ONE  2016;11(1):e0146409.
Specific selection pressures often lead to specifically mutated genomes. The open source software SeqFeatR has been developed to identify associations between mutation patterns in biological sequences and specific selection pressures (“features”). For instance, SeqFeatR has been used to discover new T cell epitopes in viral protein sequences for hosts of given HLA types. SeqFeatR supports frequentist and Bayesian methods for the discovery of statistical sequence-feature associations. Moreover, it offers novel ways to visualize results of the statistical analyses and to relate them to further properties. In this article we demonstrate various functions of SeqFeatR with real data. The most frequently used set of functions is also provided by a web server. SeqFeatR is implemented as an R package and is freely available from the R archive CRAN ( The package includes a tutorial vignette. The software is distributed under the GNU General Public License (version 3 or later). The web server URL is
PMCID: PMC4701496  PMID: 26731669
18.  PopGenome: An Efficient Swiss Army Knife for Population Genomic Analyses in R 
Molecular Biology and Evolution  2014;31(7):1929-1936.
Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson’s MS and Ewing’s MSMS programs to assess statistical significance based on coalescent simulations. PopGenome’s integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analysis methods into the PopGenome framework. PopGenome and R are freely available from CRAN ( for all major operating systems under the GNU General Public License.
PMCID: PMC4069620  PMID: 24739305
population genomics; software; single-nucleotide polymorphisms
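The PopGenome abstract above notes that analyses can be applied to sliding windows. The following is a minimal Python sketch of that idea, not PopGenome's own R interface: it computes nucleotide diversity (mean pairwise differences per site) in fixed-width windows along a toy alignment. The function names, the window parameters, and the example sequences are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Count positions at which two equal-length aligned sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def nucleotide_diversity(alignment):
    """Mean number of pairwise differences per site over all sequence pairs."""
    pairs = list(combinations(alignment, 2))
    length = len(alignment[0])
    if not pairs or length == 0:
        return 0.0
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * length)


def sliding_windows(alignment, width, step):
    """Yield (start, diversity) for each full window of the given width."""
    length = len(alignment[0])
    for start in range(0, length - width + 1, step):
        window = [seq[start:start + width] for seq in alignment]
        yield start, nucleotide_diversity(window)


# Toy alignment (hypothetical data, not taken from the article).
alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTTC",
]

# Two non-overlapping windows of five sites each.
result = list(sliding_windows(alignment, width=5, step=5))
for start, diversity in result:
    print(start, round(diversity, 4))
```

Using a step smaller than the width would give overlapping windows; the sketch ignores gaps and missing data, which a real analysis would need to handle.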
19.  EasyStrata: evaluation and visualization of stratified genome-wide association meta-analysis data 
Bioinformatics  2014;31(2):259-261.
Summary: The R package EasyStrata facilitates the evaluation and visualization of stratified genome-wide association meta-analysis (GWAMA) results. It provides (i) statistical methods to test and account for between-strata differences as a means to tackle gene–strata interaction effects and (ii) extended graphical features tailored for stratified GWAMA results. The software provides further features also suitable for general GWAMAs, including functions to annotate, exclude or highlight specific loci in plots or to extract independent subsets of loci from genome-wide datasets. It is freely available and includes a user-friendly scripting interface that simplifies data handling and allows for combining statistical and graphical functions in a flexible fashion.
Availability: EasyStrata is available for free (under the GNU General Public License v3) from our Web site and from the CRAN R package repository.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4287944  PMID: 25260699
20.  A high-performance computing toolset for relatedness and principal component analysis of SNP data 
Bioinformatics  2012;28(24):3326-3328.
Summary: Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show the uniprocessor implementations of PCA and identity-by-descent are ∼8–50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs, respectively, and can be sped up to 30–300-fold by using eight cores. SNPRelate can analyse tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55 324 subjects from the ‘Gene-Environment Association Studies’ consortium studies.
Availability and implementation: gdsfmt and SNPRelate are available from R CRAN, including a vignette.
PMCID: PMC3519454  PMID: 23060615
21.  NATsDB: Natural Antisense Transcripts DataBase 
Nucleic Acids Research  2006;35(Database issue):D156-D161.
Natural antisense transcripts (NATs) are reverse complementary at least in part to the sequences of other endogenous sense transcripts. Most NATs are transcribed from opposite strands of their sense partners. They regulate sense genes at multiple levels and are implicated in various diseases. Using an improved whole-genome computational pipeline, we identified abundant cis-encoded exon-overlapping sense–antisense (SA) gene pairs in human (7356), mouse (6806), fly (1554), and eight other eukaryotic species (total 6534). We developed NATsDB (Natural Antisense Transcripts DataBase) to enable efficient browsing, searching and downloading of this currently most comprehensive collection of SA genes, grouped into six classes based on their overlapping patterns. NATsDB also includes non-exon-overlapping bidirectional (NOB) genes and non-bidirectional (NBD) genes. To facilitate the study of functions, regulations and possible pathological implications, NATsDB includes extensive information about gene structures, poly(A) signals and tails, phastCons conservation, homologues in other species, repeat elements, expressed sequence tag (EST) expression profiles and OMIM disease association. NATsDB supports interactive graphical display of the alignment of all supporting EST and mRNA transcripts of the SA and NOB genes to the genomic loci. It supports advanced search by species, gene name, sequence accession number, chromosome location, coding potential, OMIM association and sequence similarity.
PMCID: PMC1635336  PMID: 17082204
22.  Genome sequence of the phage-gene rich marine Phaeobacter arcticus type strain DSM 23566T 
Standards in Genomic Sciences  2013;8(3):450-464.
Phaeobacter arcticus Zhang et al. 2008 belongs to the marine Roseobacter clade whose members are phylogenetically and physiologically diverse. In contrast to the type species of this genus, Phaeobacter gallaeciensis, which is well characterized, relatively little is known about the characteristics of P. arcticus. Here, we describe the features of this organism including the annotated high-quality draft genome sequence and highlight some particular traits. The 5,049,232 bp long genome with its 4,828 protein-coding and 81 RNA genes consists of one chromosome and five extrachromosomal elements. Prophage sequences identified via PHAST constitute nearly 5% of the bacterial chromosome and included a potential Mu-like phage as well as a gene-transfer agent (GTA). In addition, the genome of strain DSM 23566T encodes all of the genes necessary for assimilatory nitrate reduction. Phylogenetic analysis and intergenomic distances indicate that the classification of the species might need to be reconsidered.
PMCID: PMC3910698  PMID: 24501630
aerobic; psychrophilic; motile; high-quality draft; prophage-like structures; extrachromosomal elements; assimilatory nitrate reduction; Alphaproteobacteria; Roseobacter clade
23.  DIME: R-package for identifying differential ChIP-seq based on an ensemble of mixture models 
Bioinformatics  2011;27(11):1569-1570.
Summary: Differential Identification using Mixtures Ensemble (DIME) is a package for identification of biologically significant differential binding sites between two conditions using ChIP-seq data. It considers a collection of finite mixture models combined with a false discovery rate (FDR) criterion to find statistically significant regions. This leads to a more reliable assessment of differential binding sites based on a statistical approach. In addition to ChIP-seq, DIME is also applicable to data from other high-throughput platforms.
Availability and implementation: DIME is implemented as an R-package, which is available for download.
PMCID: PMC3102220  PMID: 21471015
24.  FootPrinter3: phylogenetic footprinting in partially alignable sequences 
Nucleic Acids Research  2006;34(Web Server issue):W617-W620.
FootPrinter3 is a web server for predicting transcription factor binding sites by using phylogenetic footprinting. Until now, phylogenetic footprinting approaches have been based either on multiple alignment analysis (e.g. PhyloVista, PhastCons), or on motif-discovery algorithms (e.g. FootPrinter2). FootPrinter3 integrates these two approaches, making use of local multiple sequence alignment blocks when those are available and reliable, but also allowing motifs to be found in unalignable regions. The result is a set of predictions that combines the advantages of alignment-based methods (good specificity) with those of motif-based methods (good sensitivity, even in the presence of highly diverged species). FootPrinter3 is thus a tool of choice to exploit the wealth of vertebrate genomes being sequenced, as it allows taking full advantage of the sequences of highly diverged species (e.g. chicken, zebrafish), as well as those of more closely related species (e.g. mammals). The FootPrinter3 web server is available online.
PMCID: PMC1538810  PMID: 16845084
25.  RevEcoR: an R package for the reverse ecology analysis of microbiomes 
BMC Bioinformatics  2016;17:294.
All species live in complex ecosystems. The structure and complexity of a microbial community reflects not only diversity and function, but also the environment in which it occurs. However, traditional ecological methods can only be applied on a small scale and for relatively well-understood biological systems. Recently, a graph-theory-based algorithm called the reverse ecology approach has been developed that can analyze the metabolic networks of all the species in a microbial community, and predict the metabolic interface between species and their environment.
Here, we present RevEcoR, an R package and a Shiny Web application that implements the reverse ecology algorithm for determining microbe–microbe interactions in microbial communities. This software allows users to obtain large-scale ecological insights into species’ ecology directly from high-throughput metagenomic data. The software has great potential for facilitating the study of microbiomes.
RevEcoR is open source software for the study of microbial community ecology. The RevEcoR R package is freely available under the GNU General Public License v. 2.0, with a vignette and typical usage examples. The interactive Shiny web application is available online, or can be installed locally from the source code.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1088-4) contains supplementary material, which is available to authorized users.
PMCID: PMC4965897  PMID: 27473172
Metabolic network; Microbiome; Reverse ecology
