Related Articles

1.  PHAST: A Fast Phage Search Tool 
Nucleic Acids Research  2011;39(Web Server issue):W347-W352.
PHAge Search Tool (PHAST) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank-formatted data and rapidly performs a number of database comparisons as well as phage ‘cornerstone’ feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and GenBank files, provide richly annotated tables on prophage features and prophage ‘quality’ and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. PHAST is available at http://phast.wishartlab.com.
doi:10.1093/nar/gkr485
PMCID: PMC3125810  PMID: 21672955
2.  CpGcluster: a distance-based algorithm for CpG-island detection 
BMC Bioinformatics  2006;7:446.
Background
Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.
Results
Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome.
Conclusion
CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
doi:10.1186/1471-2105-7-446
PMCID: PMC1617122  PMID: 17038168
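The distance-based idea behind CpGcluster (entry 2) can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the use of start-to-start distance, and the caller-supplied `max_dist` threshold are assumptions made for illustration, and the p-value step described in the abstract is omitted. It uses only integer arithmetic, and every reported cluster starts and ends with a CpG dinucleotide.

```python
def cpg_positions(seq):
    """Return 0-based start positions of every CpG dinucleotide in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cluster_cpgs(positions, max_dist):
    """Group consecutive CpGs whose start-to-start distance is <= max_dist.

    Returns a list of (start, end, n_cpgs) tuples, where end is exclusive
    and covers the final CpG dinucleotide. Singletons are not reported.
    max_dist is a hypothetical, caller-chosen threshold.
    """
    clusters = []
    current = []
    for pos in positions:
        # A gap larger than the threshold closes the current cluster.
        if current and pos - current[-1] > max_dist:
            if len(current) > 1:
                clusters.append((current[0], current[-1] + 2, len(current)))
            current = []
        current.append(pos)
    # Flush the last open cluster.
    if len(current) > 1:
        clusters.append((current[0], current[-1] + 2, len(current)))
    return clusters
```

Because the only parameter is the inter-CpG distance, no length, G+C content or CpG fraction thresholds appear anywhere in the sketch, which mirrors the design choice described in the abstract.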
3.  A Model-Based Analysis of GC-Biased Gene Conversion in the Human and Chimpanzee Genomes 
PLoS Genetics  2013;9(8):e1003684.
GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts about 1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. They are also significantly enriched for disease-associated polymorphisms, suggesting that they contribute to the fixation of deleterious alleles. The gBGC tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages. They supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.
Author Summary
Interpreting patterns of DNA sequence variation in the genomes of closely related species is critically important for understanding the causes and functional effects of nucleotide substitutions. Classical models describe patterns of substitution in terms of the fundamental forces of mutation, recombination, neutral drift, and natural selection. However, an entirely separate force, called GC-biased gene conversion (gBGC), also appears to have an important influence on substitution patterns in many species. gBGC is a recombination-associated evolutionary process that favors the fixation of strong (G/C) over weak (A/T) alleles. In mammals, gBGC is thought to promote variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations. However, its genome-wide influence remains poorly understood, in part because it is difficult to incorporate gBGC into statistical models of evolution. In this paper, we describe a new evolutionary model that jointly describes the effects of selection and gBGC and apply it to the human and chimpanzee genomes. Our genome-wide predictions of gBGC tracts indicate that gBGC has been an important force in recent human evolution. Our publicly available computer program, called phastBias, and our genome-wide predictions will enable other researchers to consider gBGC in their analyses.
doi:10.1371/journal.pgen.1003684
PMCID: PMC3744432  PMID: 23966869
4.  PKreport: report generation for checking population pharmacokinetic model assumptions 
Background
Graphics play an important and unique role in population pharmacokinetic (PopPK) model building by exploring hidden structure among data before modeling, evaluating model fit, and validating results after modeling.
Results
The work described in this paper is about a new R package called PKreport, which is able to generate a collection of plots and statistics for testing model assumptions, visualizing data and diagnosing models. The metric system is utilized as the currency for communicating between data sets and the package to generate special-purpose plots. It provides ways to match output from diverse software such as NONMEM, Monolix, R nlme package, etc. The package is implemented with S4 class hierarchy, and offers an efficient way to access the output from NONMEM 7. The final reports take advantage of the web browser as user interface to manage and visualize plots.
Conclusions
PKreport provides 1) a flexible and efficient R class to store and retrieve NONMEM 7 output; 2) automated plots for users to visualize data and models; 3) automatically generated R scripts that are used to create the plots; 4) an archive-oriented management tool for users to store, retrieve and modify figures; and 5) high-quality graphs based on the R packages lattice and ggplot2. The general architecture, running environment and statistical methods can be readily extended with R class hierarchy. PKreport is free to download at http://cran.r-project.org/web/packages/PKreport/index.html.
doi:10.1186/1472-6947-11-31
PMCID: PMC3121579  PMID: 21575245
5.  The Ruby UCSC API: accessing the UCSC genome database using Ruby 
BMC Bioinformatics  2012;13:240.
Background
The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby.
Results
The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast.
The API uses the bin index—if available—when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby).
Conclusions
Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will make it easier for biologists to query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help are provided via the website at http://rubyucscapi.userecho.com/.
doi:10.1186/1471-2105-13-240
PMCID: PMC3542311  PMID: 22994508
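The bin index mentioned in entry 5 refers to the UCSC binning scheme for speeding up interval queries. The Python sketch below follows the standard five-level scheme (smallest bins of 128 kb, each level 8 times larger, coordinates up to 512 Mb) as it is commonly described for the UCSC database. The function name `bin_from_range` is illustrative, and this is not code from the Ruby UCSC API itself.

```python
# Offsets of the first bin at each level, from finest (128 kb) to coarsest (512 Mb).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # 2**17 = 128 kb, the size of the smallest bins
BIN_NEXT_SHIFT = 3    # each level up is 2**3 = 8 times larger


def bin_from_range(start, end):
    """Return the smallest bin fully containing the half-open interval [start, end).

    Coordinates are 0-based, as in UCSC tables.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        # Both ends fall in the same bin at this level: that is the answer.
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("interval exceeds the 512 Mb range of the standard scheme")
```

A range query then only needs to inspect the handful of bins that can overlap the region of interest instead of scanning every row for a chromosome, which is why an indexed `bin` column helps when it is present.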
6.  AFLPsim: an R package to simulate and detect dominant markers under selection in hybridizing populations 
Plant Methods  2014;10:40.
Background
In spite of a large diversity of approaches to investigate loci under selection from a population genetic perspective, very few programs have been specifically designed to date to test selection in hybrids using dominant markers. In addition, simulators of dominant markers are very scarce and they do not usually take into account hybridization.
Results
Here, we present a new, multifunctional, R package for dominant genetic markers, AFLPsim. This package can simulate dominant markers in hybridizing populations and implements genome scan methods for detecting outlier dominant loci in hybrids. In addition, it includes tools for further manipulating the results, plotting them and other tasks. We describe and tabulate the major functions implemented in AFLPsim. In addition, we provide some demonstration of its use and we perform a comparative study with other software. Finally, we conclude by briefly describing the input and output formats.
Conclusions
The R package AFLPsim provides several useful tools in the context of hybridization studies. It can simulate dominant markers in hybridizing populations and predict their demographic evolution. In addition, we implement a new genome scan method for detecting outlier dominant loci in hybrids, which shows a rather high sensitivity and is very conservative in comparison with Gagnaire et al.’s, Bayescan and introgress. The package is downloadable at http://cran.r-project.org/web/packages/AFLPsim/.
doi:10.1186/1746-4811-10-40
PMCID: PMC4413549  PMID: 25926861
Demographic simulation; Dominant markers; Genome scan; Hybridization; Outlier loci; R package
7.  ttime: an R Package for Translating the Timing of Brain Development Across Mammalian Species 
Neuroinformatics  2010;8(3):201-205.
Understanding relationships between the sequence and timing of brain developmental events across a given set of mammalian species can provide information about both neural development and evolution. Yet neuro-developmental event timing data available from the published literature are incomplete, particularly for humans. Experimental documentation of unknown event timings requires considerable effort that can be expensive, time consuming, and for humans, often impossible. Application of suitable statistical models for translating neurodevelopmental event timings across mammalian species is essential. The present study implements an established statistical model and related functions as an open-source R package (ttime, translating time). The model incorporated into ttime allows predictions of unknown neurodevelopmental timings and explorations of phylogenetic relationships. The open-source package will enable transparency and reproducibility while minimizing redundancy. Sustainability and widespread dissemination will be guaranteed by the active CRAN (Comprehensive R Archive Network) community. The package updates the web-service (Clancy et al. 2007b) www.translatingtime.net by permitting predictions based on curated event timing databases which may include species not yet incorporated in the current model. The R package can be integrated into complex workflows that use the event predictions in their analyses. The package ttime is publicly available and can be downloaded from http://cran.r-project.org/web/packages/ttime/index.html.
doi:10.1007/s12021-010-9081-y
PMCID: PMC3189701  PMID: 20824390
Open-source; R package; Cross-species modeling; Cross-species comparisons; Neurodevelopment
8.  CGene: an R package for implementation of causal genetic analyses 
European Journal of Human Genetics  2011;19(12):1292-1294.
The excitement over findings from Genome-Wide Association Studies (GWASs) has been tempered by the difficulty in finding the location of the true causal disease susceptibility loci (DSLs), rather than markers that are correlated with the causal variants. In addition, many recent GWASs have studied multiple phenotypes – often highly correlated – making it difficult to understand which associations are causal and which are seemingly causal, induced by phenotypic correlations. In order to identify DSLs, which are required to understand the genetic etiology of the observed associations, statistical methodology has been proposed that distinguishes between a direct effect of a genetic locus on the primary phenotype and an indirect effect induced by the association with the intermediate phenotype that is also correlated with the primary phenotype. However, so far, the application of this important methodology has been challenging, as no user-friendly software implementation exists. The lack of software implementation of this sophisticated methodology has prevented its large-scale use in the genetic community. We have now implemented this statistical approach in a user-friendly and robust R package that has been thoroughly tested. The R package ‘CGene’ is available for download at http://cran.r-project.org/. The R code is also available at http://people.hsph.harvard.edu/~plipman.
doi:10.1038/ejhg.2011.122
PMCID: PMC3230361  PMID: 21731061
causal modeling; statistical genetics; software
9.  DR-Integrator: a new analytic tool for integrating DNA copy number and gene expression data 
Bioinformatics  2009;26(3):414-416.
Summary: DNA copy number alterations (CNA) frequently underlie gene expression changes by increasing or decreasing gene dosage. However, only a subset of genes with altered dosage exhibit concordant changes in gene expression. This subset is likely to be enriched for oncogenes and tumor suppressor genes, and can be identified by integrating these two layers of genome-scale data. We introduce DNA/RNA-Integrator (DR-Integrator), a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data. DR-Integrator identifies genes with significant correlations between DNA copy number and gene expression, and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes.
Availability: DR-Integrator is freely available for non-commercial use from the Pollack Lab at http://pollacklab.stanford.edu/ and can be downloaded as a plug-in application to Microsoft Excel and as a package for the R statistical computing environment. The R package is available under the name ‘DRI’ at http://cran.r-project.org/. An example analysis using DR-Integrator is included as supplemental material.
Contact: ksalari@stanford.edu; pollack1@stanford.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp702
PMCID: PMC2815664  PMID: 20031972
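The per-gene correlation between DNA copy number and expression described in entry 9 can be sketched in a few lines of Python. This is only an illustration of the general idea, assuming a plain Pearson correlation over paired samples; the function names and the dictionary-of-lists input layout are hypothetical, and the significance testing that DR-Integrator performs is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def rank_genes(copy_number, expression):
    """Rank genes by copy-number/expression correlation, highest first.

    Both arguments map a gene name to a list of values, one per paired sample.
    Only genes present in both inputs are scored.
    """
    scores = {
        gene: pearson(copy_number[gene], expression[gene])
        for gene in copy_number
        if gene in expression
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Genes near the top of such a ranking are those whose expression tracks gene dosage across samples; a real analysis would also need a multiple-testing-aware significance assessment.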
10.  Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser 
Bioinformatics  2013;30(7):1003-1005.
Summary: Track data hubs provide an efficient mechanism for visualizing remotely hosted Internet-accessible collections of genome annotations. Hub datasets can be organized, configured and fully integrated into the University of California Santa Cruz (UCSC) Genome Browser and accessed through the familiar browser interface. For the first time, individuals can use the complete browser feature set to view custom datasets without the overhead of setting up and maintaining a mirror.
Availability and implementation: Source code for the BigWig, BigBed and Genome Browser software is freely available for non-commercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, implemented in C and supported on Linux. Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/. Binary Alignment/Map (BAM) and Variant Call Format (VCF)/tabix utilities are available from http://samtools.sourceforge.net/ and http://vcftools.sourceforge.net/. The UCSC Genome Browser is publicly accessible at http://genome.ucsc.edu.
Contact: donnak@soe.ucsc.edu
doi:10.1093/bioinformatics/btt637
PMCID: PMC3967101  PMID: 24227676
11.  adegenet 1.3-1: new tools for the analysis of genome-wide SNP data 
Bioinformatics  2011;27(21):3070-3071.
Summary: While the R software is becoming a standard for the analysis of genetic data, classical population genetics tools are being challenged by the increasing availability of genomic sequences. Dedicated tools are needed for harnessing the large amount of information generated by next-generation sequencing technologies. We introduce new tools implemented in the adegenet 1.3-1 package for handling and analyzing genome-wide single nucleotide polymorphism (SNP) data. Using a bit-level coding scheme for SNP data and parallelized computation, adegenet enables the analysis of large genome-wide SNP datasets using standard personal computers.
Availability: adegenet 1.3-1 is available from CRAN: http://cran.r-project.org/web/packages/adegenet/. Information and support including a dedicated forum of discussion can be found on the adegenet website: http://adegenet.r-forge.r-project.org/. adegenet is released with a manual and four tutorials totalling over 300 pages of documentation, and distributed under the GNU General Public Licence (≥2).
Contact: t.jombart@imperial.ac.uk
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr521
PMCID: PMC3198581  PMID: 21926124
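The memory saving from a bit-level coding scheme, as mentioned in entry 11, can be illustrated with a small Python sketch that packs binary allele calls eight to a byte. The function names and the least-significant-bit-first layout are assumptions for illustration only; this is not adegenet's actual storage format.

```python
def pack_bits(alleles):
    """Pack a sequence of 0/1 allele calls into bytes, 8 calls per byte.

    Call i is stored in bit (i % 8) of byte (i // 8), least significant bit first.
    """
    packed = bytearray((len(alleles) + 7) // 8)
    for i, allele in enumerate(alleles):
        if allele not in (0, 1):
            raise ValueError("only binary allele calls (0 or 1) are supported")
        if allele:
            packed[i >> 3] |= 1 << (i & 7)
    return bytes(packed)


def unpack_bits(packed, n):
    """Recover the first n 0/1 allele calls from a packed byte string."""
    if n > len(packed) * 8:
        raise ValueError("n exceeds the number of stored bits")
    return [(packed[i >> 3] >> (i & 7)) & 1 for i in range(n)]
```

Stored this way, one million binary SNP calls occupy 125,000 bytes, compared with a million or more bytes when each call takes a full byte or integer.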
12.  Direction pathway analysis of large-scale proteomics data reveals novel features of the insulin action pathway 
Bioinformatics  2013;30(6):808-814.
Motivation: With the advancement of high-throughput techniques, large-scale profiling of biological systems with multiple experimental perturbations is becoming more prevalent. Pathway analysis incorporates prior biological knowledge to analyze genes/proteins in groups in a biological context. However, the hypotheses under investigation are often confined to a 1D space (i.e. up, down, either or mixed regulation). Here, we develop direction pathway analysis (DPA), which can be applied to test hypotheses in a high-dimensional space for identifying pathways that display distinct responses across multiple perturbations.
Results: Our DPA approach allows for the identification of pathways that display distinct responses across multiple perturbations. To demonstrate the utility and effectiveness, we evaluated DPA under various simulated scenarios and applied it to study insulin action in adipocytes. A major action of insulin in adipocytes is to regulate the movement of proteins from the interior to the cell surface membrane. Quantitative mass spectrometry-based proteomics was used to study this process on a large scale. The combined dataset comprises four separate treatments. By applying DPA, we identified that several insulin responsive pathways in the plasma membrane trafficking are only partially dependent on the insulin-regulated kinase Akt. We subsequently validated our findings through targeted analysis of key proteins from these pathways using immunoblotting and live cell microscopy. Our results demonstrate that DPA can be applied to dissect pathway networks testing diverse hypotheses and integrating multiple experimental perturbations.
Availability and implementation: The R package ‘directPA’ is distributed from CRAN under GNU General Public License (GPL)-3 and can be downloaded from: http://cran.r-project.org/web/packages/directPA/index.html
Contact: jean.yang@sydney.edu.au
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt616
PMCID: PMC3957074  PMID: 24167158
13.  A Model-Based Approach to Identify Binding Sites in CLIP-Seq Data 
PLoS ONE  2014;9(4):e93248.
Cross-linking immunoprecipitation coupled with high-throughput sequencing (CLIP-Seq) has made it possible to identify the targeting sites of RNA-binding proteins in various cell culture systems and tissue types on a genome-wide scale. Here we present a novel model-based approach (MiClip) to identify high-confidence protein-RNA binding sites from CLIP-seq datasets. This approach assigns a probability score for each potential binding site to help prioritize subsequent validation experiments. The MiClip algorithm has been tested in both HITS-CLIP and PAR-CLIP datasets. In the HITS-CLIP dataset, the signal/noise ratios of miRNA seed motif enrichment produced by the MiClip approach are between 17% and 301% higher than those by the ad hoc method for the top 10 most enriched miRNAs. In the PAR-CLIP dataset, the MiClip approach can identify ∼50% more validated binding targets than the original ad hoc method and two recently published methods. To facilitate the application of the algorithm, we have released an R package, MiClip (http://cran.r-project.org/web/packages/MiClip/index.html), and a public web-based graphical user interface software (http://galaxy.qbrc.org/tool_runner?tool_id=mi_clip) for customized analysis.
doi:10.1371/journal.pone.0093248
PMCID: PMC3979666  PMID: 24714572
14.  PopGenome: An Efficient Swiss Army Knife for Population Genomic Analyses in R 
Molecular Biology and Evolution  2014;31(7):1929-1936.
Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson’s MS and Ewing’s MSMS programs to assess statistical significance based on coalescent simulations. PopGenome’s integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analysis methods into the PopGenome framework. PopGenome and R are freely available from CRAN (http://cran.r-project.org/) for all major operating systems under the GNU General Public License.
doi:10.1093/molbev/msu136
PMCID: PMC4069620  PMID: 24739305
population genomics; software; single-nucleotide polymorphisms
15.  NATsDB: Natural Antisense Transcripts DataBase 
Nucleic Acids Research  2006;35(Database issue):D156-D161.
Natural antisense transcripts (NATs) are reverse complementary at least in part to the sequences of other endogenous sense transcripts. Most NATs are transcribed from opposite strands of their sense partners. They regulate sense genes at multiple levels and are implicated in various diseases. Using an improved whole-genome computational pipeline, we identified abundant cis-encoded exon-overlapping sense–antisense (SA) gene pairs in human (7356), mouse (6806), fly (1554), and eight other eukaryotic species (total 6534). We developed NATsDB (Natural Antisense Transcripts DataBase, ) to enable efficient browsing, searching and downloading of this currently most comprehensive collection of SA genes, grouped into six classes based on their overlapping patterns. NATsDB also includes non-exon-overlapping bidirectional (NOB) genes and non-bidirectional (NBD) genes. To facilitate the study of functions, regulations and possible pathological implications, NATsDB includes extensive information about gene structures, poly(A) signals and tails, phastCons conservation, homologues in other species, repeat elements, expressed sequence tag (EST) expression profiles and OMIM disease association. NATsDB supports interactive graphical display of the alignment of all supporting EST and mRNA transcripts of the SA and NOB genes to the genomic loci. It supports advanced search by species, gene name, sequence accession number, chromosome location, coding potential, OMIM association and sequence similarity.
doi:10.1093/nar/gkl782
PMCID: PMC1635336  PMID: 17082204
16.  A high-performance computing toolset for relatedness and principal component analysis of SNP data 
Bioinformatics  2012;28(24):3326-3328.
Summary: Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show the uniprocessor implementations of PCA and identity-by-descent are ∼8–50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs, respectively, and can be sped up to 30–300-fold by using eight cores. SNPRelate can analyse tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55 324 subjects from the ‘Gene-Environment Association Studies’ consortium studies.
Availability and implementation: gdsfmt and SNPRelate are available from R CRAN (http://cran.r-project.org), including a vignette. A tutorial can be found at https://www.genevastudy.org/Accomplishments/software.
Contact: zhengx@u.washington.edu
doi:10.1093/bioinformatics/bts606
PMCID: PMC3519454  PMID: 23060615
17.  Genome sequence of the phage-gene rich marine Phaeobacter arcticus type strain DSM 23566T 
Standards in Genomic Sciences  2013;8(3):450-464.
Phaeobacter arcticus Zhang et al. 2008 belongs to the marine Roseobacter clade whose members are phylogenetically and physiologically diverse. In contrast to the type species of this genus, Phaeobacter gallaeciensis, which is well characterized, relatively little is known about the characteristics of P. arcticus. Here, we describe the features of this organism including the annotated high-quality draft genome sequence and highlight some particular traits. The 5,049,232 bp long genome with its 4,828 protein-coding and 81 RNA genes consists of one chromosome and five extrachromosomal elements. Prophage sequences identified via PHAST constitute nearly 5% of the bacterial chromosome and included a potential Mu-like phage as well as a gene-transfer agent (GTA). In addition, the genome of strain DSM 23566T encodes all of the genes necessary for assimilatory nitrate reduction. Phylogenetic analysis and intergenomic distances indicate that the classification of the species might need to be reconsidered.
doi:10.4056/sigs.383362
PMCID: PMC3910698  PMID: 24501630
aerobic; psychrophilic; motile; high-quality draft; prophage-like structures; extrachromosomal elements; assimilatory nitrate reduction; Alphaproteobacteria; Roseobacter clade
18.  FootPrinter3: phylogenetic footprinting in partially alignable sequences 
Nucleic Acids Research  2006;34(Web Server issue):W617-W620.
FootPrinter3 is a web server for predicting transcription factor binding sites by using phylogenetic footprinting. Until now, phylogenetic footprinting approaches have been based either on multiple alignment analysis (e.g. PhyloVista, PhastCons), or on motif-discovery algorithms (e.g. FootPrinter2). FootPrinter3 integrates these two approaches, making use of local multiple sequence alignment blocks when those are available and reliable, but also allowing finding motifs in unalignable regions. The result is a set of predictions that joins the advantages of alignment-based methods (good specificity) to those of motif-based methods (good sensitivity, even in the presence of highly diverged species). FootPrinter3 is thus a tool of choice to exploit the wealth of vertebrate genomes being sequenced, as it allows taking full advantage of the sequences of highly diverged species (e.g. chicken, zebrafish), as well as those of more closely related species (e.g. mammals). The FootPrinter3 web server is available at: .
doi:10.1093/nar/gkl123
PMCID: PMC1538810  PMID: 16845084
19.  DIME: R-package for identifying differential ChIP-seq based on an ensemble of mixture models 
Bioinformatics  2011;27(11):1569-1570.
Summary: Differential Identification using Mixtures Ensemble (DIME) is a package for identification of biologically significant differential binding sites between two conditions using ChIP-seq data. It considers a collection of finite mixture models combined with a false discovery rate (FDR) criterion to find statistically significant regions. This leads to a more reliable assessment of differential binding sites based on a statistical approach. In addition to ChIP-seq, DIME is also applicable to data from other high-throughput platforms.
Availability and implementation: DIME is implemented as an R-package, which is available at http://www.stat.osu.edu/~statgen/SOFTWARE/DIME. It may also be downloaded from http://cran.r-project.org/web/packages/DIME/.
Contact: shili@stat.osu.edu
doi:10.1093/bioinformatics/btr165
PMCID: PMC3102220  PMID: 21471015
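
The combination of a mixture model with an FDR criterion mentioned in the DIME abstract can be illustrated with a local false discovery rate. The sketch below is not DIME's implementation: it assumes a hypothetical two-component mixture (a normal null component plus a uniform non-null component) with parameters that are already known, whereas a real analysis would estimate them from the data. The function names, parameter names and the 0.05 cutoff are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, low, high):
    """Density of a uniform distribution on [low, high] at x."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def local_fdr(x, pi0, null_mean, null_sd, low, high):
    """Posterior probability that x comes from the null component.

    Assumed mixture: pi0 * Normal(null_mean, null_sd)
                     + (1 - pi0) * Uniform(low, high).
    """
    null_part = pi0 * normal_pdf(x, null_mean, null_sd)
    alt_part = (1.0 - pi0) * uniform_pdf(x, low, high)
    return null_part / (null_part + alt_part)


def is_differential(x, cutoff=0.05, **mixture):
    """Flag x as differential when its local fdr falls below the cutoff."""
    return local_fdr(x, **mixture) < cutoff


# Hypothetical mixture parameters, chosen only for illustration.
params = dict(pi0=0.9, null_mean=0.0, null_sd=1.0, low=-10.0, high=10.0)
print(is_differential(0.0, **params))  # a score at the null centre
print(is_differential(5.0, **params))  # a score far from the null centre
```

Scores near the centre of the null component receive a local fdr close to 1 and are not flagged, while scores far in the tails are dominated by the non-null component and are flagged.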
20.  svapls: an R package to correct for hidden factors of variability in gene expression studies 
BMC Bioinformatics  2013;14:236.
Background
Hidden variability is a fundamentally important issue in the context of gene expression studies. Collected tissue samples may have a wide variety of hidden effects that may alter their transcriptional landscape significantly. As a result their actual differential expression pattern can be potentially distorted, leading to inaccurate results from a genome-wide testing for the important transcripts.
Results
We present an R package, svapls, that can be used to identify several types of unknown sample-specific sources of heterogeneity in a gene expression study and adjust for them in order to provide a more accurate inference on the original expression pattern of the genes over different varieties of samples. The proposed method implements Partial Least Squares regression to extract the hidden signals of sample-specific heterogeneity in the data and uses them to find the genes that are actually correlated with the phenotype of interest. We also compare our package with three other popular software packages for testing differential gene expression, along with a detailed illustration on the widely used Golub dataset. Results from the sensitivity analyses on simulated data with widely different hidden variation patterns reveal the improved detection power of our R package compared to the other software packages, along with reasonably smaller error rates. Application to the real-life dataset demonstrates the efficacy of the R package in detecting potential batch effects in the dataset.
Conclusions
Overall, our R package provides the user with a simplified framework for analyzing gene expression data with a wide range of hidden variation patterns and delivering a differential gene expression analysis with substantially improved power and accuracy.
The R package svapls is freely available at http://cran.r-project.org/web/packages/svapls/index.html.
doi:10.1186/1471-2105-14-236
PMCID: PMC3733742  PMID: 23883280
21.  Alternative forms for genomic clines 
Ecology and Evolution  2013;3(7):1951-1966.
Understanding factors regulating hybrid fitness and gene exchange is a major research challenge for evolutionary biology. Genomic cline analysis has been used to evaluate alternative patterns of introgression, but only two models have been used widely and the approach has generally lacked a hypothesis testing framework for distinguishing effects of selection and drift. I propose two alternative cline models, implement multivariate outlier detection to identify markers associated with hybrid fitness, and simulate hybrid zone dynamics to evaluate the signatures of different modes of selection. Analysis of simulated data shows that previous approaches are prone to false positives (multinomial regression) or relatively insensitive to outlier loci affected by selection (Barton's concordance). The new, theory-based logit-logistic cline model is generally best at detecting loci affecting hybrid fitness. Although some generalizations can be made about different modes of selection, there is no one-to-one correspondence between pattern and process. These new methods will enhance our ability to extract important information about the genetics of reproductive isolation and hybrid fitness. However, much remains to be done to relate statistical patterns to particular evolutionary processes. The methods described here are implemented in a freely available package “HIest” for the R statistical software (CRAN; http://cran.r-project.org/).
doi:10.1002/ece3.609
PMCID: PMC3728937  PMID: 23919142
Admixture; hybrid zones; introgression; reproductive isolation; speciation
22.  Navigating protected genomics data with UCSC Genome Browser in a Box 
Bioinformatics  2014;31(5):764-766.
Summary: Genome Browser in a Box (GBiB) is a small virtual machine version of the popular University of California Santa Cruz (UCSC) Genome Browser that can be run on a researcher's own computer. Once GBiB is installed, a standard web browser is used to access the virtual server and add personal data files from the local hard disk. Annotation data are loaded on demand through the Internet from UCSC or can be downloaded to the local computer for faster access.
Availability and implementation: Software downloads and installation instructions are freely available for non-commercial use at https://genome-store.ucsc.edu/. GBiB requires the installation of open-source software VirtualBox, available for all major operating systems, and the UCSC Genome Browser, which is open source and free for non-commercial use. Commercial use of GBiB and the Genome Browser requires a license (http://genome.ucsc.edu/license/).
Contact: genome@soe.ucsc.edu
doi:10.1093/bioinformatics/btu712
PMCID: PMC4341066  PMID: 25348212
23.  nCal: an R package for non-linear calibration 
Bioinformatics  2013;29(20):2653-2654.
Summary: Non-linear calibration is a widely used method for quantifying biomarkers wherein concentration-response curves estimated using samples of known concentrations are used to predict the biomarker concentrations in the samples of interest. The R package nCal fills an important gap in the open source, stand-alone software for performing non-linear calibration. For curve fitting, nCal provides a new implementation of a robust, Bayesian hierarchical five-parameter logistic model. nCal supports a simple graphical user interface that can be used by laboratory scientists, and contains functionality for importing data from the multiplex bead array assay instrumentation.
Availability: The R package ‘nCal’ is available from http://cran.r-project.org/web/packages/nCal/ under GPL-2 or later.
Contact: yfong@fhcrc.org
Supplementary information: Supplementary information is available in the form of an R package vignette at the above repository and an FAQ at http://research.fhcrc.org/youyifong/en/resources/ncal.html.
doi:10.1093/bioinformatics/btt456
PMCID: PMC3789552  PMID: 23926226
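
The calibration step described in the nCal abstract (fit a concentration-response curve on standards, then read unknown concentrations off the curve) can be sketched with a five-parameter logistic function and its closed-form inverse. This is only a minimal illustration, not nCal's robust Bayesian hierarchical fit: the parameterization, the function names and the numeric parameter values are assumptions, and the curve parameters are taken as already estimated.

```python
def five_pl(x, a, d, c, b, g):
    """Five-parameter logistic response at concentration x > 0.

    Assumed parameterization (for b > 0):
      a: response as x approaches 0
      d: response as x grows without bound
      c: concentration scale of the mid-range
      b: slope
      g: asymmetry (g = 1 gives the four-parameter logistic)
    """
    return d + (a - d) / (1.0 + (x / c) ** b) ** g


def inverse_five_pl(y, a, d, c, b, g):
    """Concentration whose expected response is y.

    y must lie strictly between the two asymptotes a and d.
    """
    ratio = (a - d) / (y - d)
    return c * (ratio ** (1.0 / g) - 1.0) ** (1.0 / b)


# Hypothetical curve parameters, as if estimated from standards.
curve = dict(a=100.0, d=30000.0, c=200.0, b=1.5, g=0.8)

response = five_pl(50.0, **curve)             # simulated reading at a known concentration
estimate = inverse_five_pl(response, **curve)  # concentration recovered from the reading
print(round(estimate, 6))
```

Readings at or beyond either asymptote have no finite inverse, so a practical workflow would report such samples as outside the quantifiable range rather than return a number.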
24.  'SEEDY' (Simulation of Evolutionary and Epidemiological Dynamics): An R Package to Follow Accumulation of Within-Host Mutation in Pathogens 
PLoS ONE  2015;10(6):e0129745.
Genome sequencing is an increasingly common component of infectious disease outbreak investigations. However, the relationship between pathogen transmission and observed genetic data is complex, and dependent on several uncertain factors. As such, simulation of pathogen dynamics is an important tool for interpreting observed genomic data in an infectious disease outbreak setting, in order to test hypotheses and to explore the range of outcomes consistent with a given set of parameters. We introduce ‘seedy’, an R package for the simulation of evolutionary and epidemiological dynamics (http://cran.r-project.org/web/packages/seedy/). Our software implements stochastic models for the accumulation of mutations within hosts, as well as individual-level disease transmission. By allowing variables such as the transmission bottleneck size, within-host effective population size and population mixing rates to be specified by the user, our package offers a flexible framework to investigate evolutionary dynamics during disease outbreaks. Furthermore, our software provides theoretical pairwise genetic distance distributions to provide a likelihood of person-to-person transmission based on genomic observations, and using this framework, implements transmission route assessment for genomic data collected during an outbreak. Our open source software provides an accessible platform for users to explore pathogen evolution and outbreak dynamics via simulation, and offers tools to assess observed genomic data in this context.
doi:10.1371/journal.pone.0129745
PMCID: PMC4467979  PMID: 26075402
25.  Inferring rare disease risk variants based on exact probabilities of sharing by multiple affected relatives 
Bioinformatics  2014;30(15):2189-2196.
Motivation: Family-based designs are regaining popularity for genomic sequencing studies because they provide a way to test cosegregation with disease of variants that are too rare in the population to be tested individually in a conventional case–control study.
Results: Where only a few affected subjects per family are sequenced, the probability that any variant would be shared by all affected relatives (given it occurred in any one family member) provides evidence against the null hypothesis of a complete absence of linkage and association. A P-value can be obtained as the sum of the probabilities of sharing events as (or more) extreme in one or more families. We generalize an existing closed-form expression for exact sharing probabilities to more than two relatives per family. When pedigree founders are related, we show that an approximation of sharing probabilities based on empirical estimates of kinship among founders obtained from genome-wide marker data is accurate for low levels of kinship. We also propose a more generally applicable approach based on Monte Carlo simulations. We applied this method to a study of 55 multiplex families with apparent non-syndromic forms of oral clefts from four distinct populations, with whole exome sequences available for two or three affected members per family. The rare single nucleotide variant rs149253049 in ADAMTS9 shared by affected relatives in three Indian families achieved significance after correcting for multiple comparisons (p = 2 × 10⁻⁶).
Availability and implementation: Source code and binaries of the R package RVsharing are freely available for download at http://cran.r-project.org/web/packages/RVsharing/index.html.
Contact: alexandre.bureau@msp.ulaval.ca or ingo@jhu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu198
PMCID: PMC4103601  PMID: 24740360
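
The multi-family P-value described in the abstract above (a sum of the probabilities of sharing events as or more extreme than the observed one) can be sketched by enumerating every sharing pattern across families. This is a minimal illustration and not the RVsharing code: the function name is hypothetical, the per-family sharing probabilities are taken as given, and "as or more extreme" is interpreted here as "no more probable under the null than the observed pattern", which is an assumption.

```python
from itertools import product


def sharing_pvalue(share_probs, observed):
    """Exact P-value for a pattern of variant sharing across families.

    share_probs[i]: null probability that all sequenced affected relatives
                    in family i share the variant, given that it was seen
                    in that family.
    observed[i]:    True if the relatives in family i all share it.
    Families are treated as independent.
    """
    def pattern_prob(pattern):
        prob = 1.0
        for p, shared in zip(share_probs, pattern):
            prob *= p if shared else 1.0 - p
        return prob

    p_observed = pattern_prob(observed)
    total = 0.0
    # Enumerate all 2**n sharing patterns; keep those no more probable
    # than the observed one (small tolerance for floating-point ties).
    for pattern in product([False, True], repeat=len(share_probs)):
        prob = pattern_prob(pattern)
        if prob <= p_observed * (1.0 + 1e-12):
            total += prob
    return total


# Three families with an illustrative sharing probability of 1/15 each,
# all of which show sharing.
print(sharing_pvalue([1 / 15] * 3, [True, True, True]))
```

The enumeration grows as 2^n in the number of families, so it suits only a modest number of families carrying the same variant; larger problems would need the pattern probabilities to be accumulated more cleverly.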
