PHAge Search Tool (PHAST) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage ‘cornerstone’ feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and Genbank files, provide richly annotated tables on prophage features and prophage ‘quality’ and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. PHAST is available at (http://phast.wishartlab.com).
GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts about 1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. They are also significantly enriched for disease-associated polymorphisms, suggesting that they contribute to the fixation of deleterious alleles. The gBGC tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages. They supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.
Interpreting patterns of DNA sequence variation in the genomes of closely related species is critically important for understanding the causes and functional effects of nucleotide substitutions. Classical models describe patterns of substitution in terms of the fundamental forces of mutation, recombination, neutral drift, and natural selection. However, an entirely separate force, called GC-biased gene conversion (gBGC), also appears to have an important influence on substitution patterns in many species. gBGC is a recombination-associated evolutionary process that favors the fixation of strong (G/C) over weak (A/T) alleles. In mammals, gBGC is thought to promote variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations. However, its genome-wide influence remains poorly understood, in part because, it is difficult to incorporate gBGC into statistical models of evolution. In this paper, we describe a new evolutionary model that jointly describes the effects of selection and gBGC and apply it to the human and chimpanzee genomes. Our genome-wide predictions of gBGC tracts indicate that gBGC has been an important force in recent human evolution. Our publicly available computer program, called phastBias, and our genome-wide predictions will enable other researchers to consider gBGC in their analyses.
Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.
Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome.
CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
Graphics play an important and unique role in population pharmacokinetic (PopPK) model building by exploring hidden structure among data before modeling, evaluating model fit, and validating results after modeling.
The work described in this paper is about a new R package called PKreport, which is able to generate a collection of plots and statistics for testing model assumptions, visualizing data and diagnosing models. The metric system is utilized as the currency for communicating between data sets and the package to generate special-purpose plots. It provides ways to match output from diverse software such as NONMEM, Monolix, R nlme package, etc. The package is implemented with S4 class hierarchy, and offers an efficient way to access the output from NONMEM 7. The final reports take advantage of the web browser as user interface to manage and visualize plots.
PKreport provides 1) a flexible and efficient R class to store and retrieve NONMEM 7 output, 2) automate plots for users to visualize data and models, 3) automatically generated R scripts that are used to create the plots; 4) an archive-oriented management tool for users to store, retrieve and modify figures, 5) high-quality graphs based on the R packages, lattice and ggplot2. The general architecture, running environment and statistical methods can be readily extended with R class hierarchy. PKreport is free to download at http://cran.r-project.org/web/packages/PKreport/index.html.
Understanding relationships between the sequence and timing of brain developmental events across a given set of mammalian species can provide information about both neural development and evolution. Yet neuro-developmental event timing data available from the published literature are incomplete, particularly for humans. Experimental documentation of unknown event timings requires considerable effort that can be expensive, time consuming, and for humans, often impossible. Application of suitable statistical models for translating neurodevelopmental event timings across mammalian species is essential. The present study implements an established statistical model and related functions as an open-source R package (ttime, translating time). The model incorporated into ttime allows predictions of unknown neurodevelopmental timings and explorations of phylogenetic relationships. The open-source package will enable transparency and reproducibility while minimizing redundancy. Sustainability and widespread dissemination will be guaranteed by the active CRAN (Comprehensive R Archive Network) community. The package updates the web-service (Clancy et al. 2007b) www.translatingtime.net by permitting predictions based on curated event timing databases which may include species not yet incorporated in the current model. The R package can be integrated into complex workflows that use the event predictions in their analyses. The package ttime is publicly available and can be downloaded from http://cran.r-project.org/web/packages/ttime/index.html.
Open-source; R package; Cross-species modeling; Cross-species comparisons; Neurodevelopment
Comparative genomics aims to detect signals of evolutionary conservation as an indicator of functional constraint. Surprisingly, results of the ENCODE project revealed that about half of the experimentally verified functional elements found in non-coding DNA were classified as unconstrained by computational predictions. Following this observation, it has been hypothesized that this may be partly explained by biased estimates on neutral evolutionary rates used by existing sequence conservation metrics. All methods we are aware of rely on a comparison with the neutral rate and conservation is estimated by measuring the deviation of a particular genomic region from this rate. Consequently, it is a reasonable assumption that inaccurate neutral rate estimates may lead to biased conservation and constraint estimates.
We propose a conservation signal that is produced by local Maximum Likelihood estimation of evolutionary parameters using an optimized sliding window and present a Kullback-Leibler projection that allows multiple different estimated parameters to be transformed into a conservation measure. This conservation measure does not rely on assumptions about neutral evolutionary substitution rates and little a priori assumptions on the properties of the conserved regions are imposed. We show the accuracy of our approach (KuLCons) on synthetic data and compare it to the scores generated by state-of-the-art methods (phastCons, GERP, SCONE) in an ENCODE region. We find that KuLCons is most often in agreement with the conservation/constraint signatures detected by GERP and SCONE while qualitatively very different patterns from phastCons are observed. Opposed to standard methods KuLCons can be extended to more complex evolutionary models, e.g. taking insertion and deletion events into account and corresponding results show that scores obtained under this model can diverge significantly from scores using the simpler model.
Our results suggest that discriminating among the different degrees of conservation is possible without making assumptions about neutral rates. We find, however, that it cannot be expected to discover considerably different constraint regions than GERP and SCONE. Consequently, we conclude that the reported discrepancies between experimentally verified functional and computationally identified constraint elements are likely not to be explained by biased neutral rate estimates.
Cross-linking immunoprecipitation coupled with high-throughput sequencing (CLIP-Seq) has made it possible to identify the targeting sites of RNA-binding proteins in various cell culture systems and tissue types on a genome-wide scale. Here we present a novel model-based approach (MiClip) to identify high-confidence protein-RNA binding sites from CLIP-seq datasets. This approach assigns a probability score for each potential binding site to help prioritize subsequent validation experiments. The MiClip algorithm has been tested in both HITS-CLIP and PAR-CLIP datasets. In the HITS-CLIP dataset, the signal/noise ratios of miRNA seed motif enrichment produced by the MiClip approach are between 17% and 301% higher than those by the ad hoc method for the top 10 most enriched miRNAs. In the PAR-CLIP dataset, the MiClip approach can identify ∼50% more validated binding targets than the original ad hoc method and two recently published methods. To facilitate the application of the algorithm, we have released an R package, MiClip (http://cran.r-project.org/web/packages/MiClip/index.html), and a public web-based graphical user interface software (http://galaxy.qbrc.org/tool_runner?tool_id=mi_clip) for customized analysis.
The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby.
The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast.
The API uses the bin index—if available—when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby).
Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will facilitate biologists to query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help is provided via the website at http://rubyucscapi.userecho.com/.
The excitement over findings from Genome-Wide Association Studies (GWASs) has been tempered by the difficulty in finding the location of the true causal disease susceptibility loci (DSLs), rather than markers that are correlated with the causal variants. In addition, many recent GWASs have studied multiple phenotypes – often highly correlated – making it difficult to understand which associations are causal and which are seemingly causal, induced by phenotypic correlations. In order to identify DSLs, which are required to understand the genetic etiology of the observed associations, statistical methodology has been proposed that distinguishes between a direct effect of a genetic locus on the primary phenotype and an indirect effect induced by the association with the intermediate phenotype that is also correlated with the primary phenotype. However, so far, the application of this important methodology has been challenging, as no user-friendly software implementation exists. The lack of software implementation of this sophisticated methodology has prevented its large-scale use in the genetic community. We have now implemented this statistical approach in a user-friendly and robust R package that has been thoroughly tested. The R package ‘CGene' is available for download at http://cran.r-project.org/. The R code is also available at http://people.hsph.harvard.edu/~plipman.
causal modeling; statistical genetics; software
Summary: Differential Identification using Mixtures Ensemble (DIME) is a package for identification of biologically significant differential binding sites between two conditions using ChIP-seq data. It considers a collection of finite mixture models combined with a false discovery rate (FDR) criterion to find statistically significant regions. This leads to a more reliable assessment of differential binding sites based on a statistical approach. In addition to ChIP-seq, DIME is also applicable to data from other high-throughput platforms.
Availability and implementation: DIME is implemented as an R-package, which is available at http://www.stat.osu.edu/~statgen/SOFTWARE/DIME. It may also be downloaded from http://cran.r-project.org/web/packages/DIME/.
Summary: DNA copy number alterations (CNA) frequently underlie gene expression changes by increasing or decreasing gene dosage. However, only a subset of genes with altered dosage exhibit concordant changes in gene expression. This subset is likely to be enriched for oncogenes and tumor suppressor genes, and can be identified by integrating these two layers of genome-scale data. We introduce DNA/RNA-Integrator (DR-Integrator), a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data. DR-Integrator identifies genes with significant correlations between DNA copy number and gene expression, and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes.
Availability: DR-Integrator is freely available for non-commercial use from the Pollack Lab at http://pollacklab.stanford.edu/ and can be downloaded as a plug-in application to Microsoft Excel and as a package for the R statistical computing environment. The R package is available under the name ‘DRI’ at http://cran.r-project.org/. An example analysis using DR-Integrator is included as supplemental material.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Understanding factors regulating hybrid fitness and gene exchange is a major research challenge for evolutionary biology. Genomic cline analysis has been used to evaluate alternative patterns of introgression, but only two models have been used widely and the approach has generally lacked a hypothesis testing framework for distinguishing effects of selection and drift. I propose two alternative cline models, implement multivariate outlier detection to identify markers associated with hybrid fitness, and simulate hybrid zone dynamics to evaluate the signatures of different modes of selection. Analysis of simulated data shows that previous approaches are prone to false positives (multinomial regression) or relatively insensitive to outlier loci affected by selection (Barton's concordance). The new, theory-based logit-logistic cline model is generally best at detecting loci affecting hybrid fitness. Although some generalizations can be made about different modes of selection, there is no one-to-one correspondence between pattern and process. These new methods will enhance our ability to extract important information about the genetics of reproductive isolation and hybrid fitness. However, much remains to be done to relate statistical patterns to particular evolutionary processes. The methods described here are implemented in a freely available package “HIest” for the R statistical software (CRAN; http://cran.r-project.org/).
Admixture; hybrid zones; introgression; reproductive isolation; speciation
Summary: Track data hubs provide an efficient mechanism for visualizing remotely hosted Internet-accessible collections of genome annotations. Hub datasets can be organized, configured and fully integrated into the University of California Santa Cruz (UCSC) Genome Browser and accessed through the familiar browser interface. For the first time, individuals can use the complete browser feature set to view custom datasets without the overhead of setting up and maintaining a mirror.
Availability and implementation: Source code for the BigWig, BigBed and Genome Browser software is freely available for non-commercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, implemented in C and supported on Linux. Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/. Binary Alignment/Map (BAM) and Variant Call Format (VCF)/tabix utilities are available from http://samtools.sourceforge.net/ and http://vcftools.sourceforge.net/. The UCSC Genome Browser is publicly accessible at http://genome.ucsc.edu.
Summary: Non-linear calibration is a widely used method for quantifying biomarkers wherein concentration-response curves estimated using samples of known concentrations are used to predict the biomarker concentrations in the samples of interest. The R package nCal fills an important gap in the open source, stand-alone software for performing non-linear calibration. For curve fitting, nCal provides a new implementation of a robust, Bayesian hierarchical five-parameter logistic model. nCal supports a simple graphical user interface that can be used by laboratory scientists, and contains functionality for importing data from the multiplex bead array assay instrumentation.
Availability: The R package ‘nCal’ is available from http://cran.r-project.org/web/packages/nCal/ under GPL-2 or later.
Supplementary information is available in the form of an R package vignette at the above repository and an FAQ at http://research.fhcrc.org/youyifong/en/resources/ncal.html.
Summary: While the R software is becoming a standard for the analysis of genetic data, classical population genetics tools are being challenged by the increasing availability of genomic sequences. Dedicated tools are needed for harnessing the large amount of information generated by next-generation sequencing technologies. We introduce new tools implemented in the adegenet 1.3-1 package for handling and analyzing genome-wide single nucleotide polymorphism (SNP) data. Using a bit-level coding scheme for SNP data and parallelized computation, adegenet enables the analysis of large genome-wide SNPs datasets using standard personal computers.
Availability: adegenet 1.3-1 is available from CRAN: http://cran.r-project.org/web/packages/adegenet/. Information and support including a dedicated forum of discussion can be found on the adegenet website: http://adegenet.r-forge.r-project.org/. adegenet is released with a manual and four tutorials totalling over 300 pages of documentation, and distributed under the GNU General Public Licence (≥2).
Supplementary Information: Supplementary data are available at Bioinformatics online.
Motivation: Several cancer types consist of multiple genetically and phenotypically distinct subpopulations. The underlying mechanism for this intra-tumoral heterogeneity can be explained by the clonal evolution model, whereby growth advantageous mutations cause the expansion of cancer cell subclones. The recurrent phenotype of many cancers may be a consequence of these coexisting subpopulations responding unequally to therapies. Methods to computationally infer tumor evolution and subpopulation diversity are emerging and they hold the promise to improve the understanding of genetic and molecular determinants of recurrence.
Results: To address cellular subpopulation dynamics within human tumors, we developed a bioinformatic method, EXPANDS. It estimates the proportion of cells harboring specific mutations in a tumor. By modeling cellular frequencies as probability distributions, EXPANDS predicts mutations that accumulate in a cell before its clonal expansion. We assessed the performance of EXPANDS on one whole genome sequenced breast cancer and performed SP analyses on 118 glioblastoma multiforme samples obtained from TCGA. Our results inform about the extent of subclonal diversity in primary glioblastoma, subpopulation dynamics during recurrence and provide a set of candidate genes mutated in the most well-adapted subpopulations. In summary, EXPANDS predicts tumor purity and subclonal composition from sequencing data.
Availability and implementation: EXPANDS is available for download at http://code.google.com/p/expands (matlab version - used in this manuscript) and http://cran.r-project.org/web/packages/expands (R version).
Supplementary information: Supplementary data are available at Bioinformatics online.
We enhance power and accuracy of QTL mapping in multiple related families, by clustering the founders of the families on their local genomic similarity.
MCQTL is a linkage mapping
software application that allows the joint QTL mapping of multiple related families. In its current implementation, QTLs are modeled with one or two parameters for each parent that is a founder of the multi-cross design. The higher the number of parents, the higher the number of model parameters which can impact the power and the accuracy of the mapping. We propose to make use of the availability of denser and denser genotyping information on the founders to lessen the number of MCQTL parameters and thus boost the QTL discovery. We developed clusthaplo, an R package (http://cran.r-project.org/web/packages/clusthaplo/index.html), which aims to cluster haplotypes using a genomic similarity that reflects the probability of sharing the same ancestral allele. Computed in a sliding window along the genome and followed by a clustering method, the genomic similarity allows the local clustering of the parent haplotypes. Our assumption is that the haplotypes belonging to the same class transmit the same ancestral allele. So their putative QTL allelic effects can be modeled with the same parameter, leading to a parsimonious model, that is plugged in MCQTL. Intensive simulations using three maize data sets showed the significant gain in power and in accuracy of the QTL mapping with the ancestral allele model compared to the classical MCQTL model. MCQTL_LD (clusthaplo outputs plug in MCQTL) is a versatile and powerful tool for QTL mapping in multiple related families that makes use of linkage and linkage disequilibrium (web site http://carlit.toulouse.inra.fr/MCQTL/).
Electronic supplementary material
The online version of this article (doi:10.1007/s00122-014-2267-1) contains supplementary material, which is available to authorized users.
Summary: phangorn is a package for phylogenetic reconstruction and analysis in the R language. Previously it was only possible to estimate phylogenetic trees with distance methods in R. phangorn, now offers the possibility of reconstructing phylogenies with distance based methods, maximum parsimony or maximum likelihood (ML) and performing Hadamard conjugation. Extending the general ML framework, this package provides the possibility of estimating mixture and partition models. Furthermore, phangorn offers several functions for comparing trees, phylogenetic models or splits, simulating character data and performing congruence analyses.
Availability: phangorn can be obtained through the CRAN homepage http://cran.r-project.org/web/packages/phangorn/index.html. phangorn is licensed under GPL 2.
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: Oscillations in mRNA and protein of circadian clock components can be continuously monitored in vitro using synchronized cell lines. These rhythms can be highly variable due to culture conditions and are non-stationary due to baseline trends, damping and drift in period length. We present a technique for characterizing the modal frequencies of oscillation using continuous wavelet decomposition to non-parametrically model changes in amplitude and period while removing baseline effects and noise.
Availability: The method has been implemented as the package waveclock for the free statistical software program R and is available for download from http://cran.r-project.org/
Supplementary information: Supplementary figures are available at Bioinformatics online.
Summary: We have developed a novel Bayesian method, PurBayes, to estimate tumor purity and detect intratumor heterogeneity based on next-generation sequencing data of paired tumor-normal tissue samples, which uses finite mixture modeling methods. We demonstrate our approach using simulated data and discuss its performance under varying conditions.
Availability: PurBayes is implemented as an R package, and source code is available for download through CRAN at http://cran.r-project.org/package=PurBayes.
Supplementary data are available online at Bioinformatics online.
Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson’s MS and Ewing’s MSMS programs to assess statistical significance based on coalescent simulations. PopGenome’s integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analyses methods into the PopGenome framework. PopGenome and R are freely available from CRAN (http://cran.r-project.org/) for all major operating systems under the GNU General Public License.
population genomics; software; single-nucleotide polymorphisms
Summary: Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show the uniprocessor implementations of PCA and identity-by-descent are ∼8–50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs, respectively, and can be sped up to 30–300-fold by using eight cores. SNPRelate can analyse tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55 324 subjects from the ‘Gene-Environment Association Studies’ consortium studies.
Availability and implementation: gdsfmt and SNPRelate are available from R CRAN (http://cran.r-project.org), including a vignette. A tutorial can be found at https://www.genevastudy.org/Accomplishments/software.
Natural antisense transcripts (NATs) are reverse complementary at least in part to the sequences of other endogenous sense transcripts. Most NATs are transcribed from opposite strands of their sense partners. They regulate sense genes at multiple levels and are implicated in various diseases. Using an improved whole-genome computational pipeline, we identified abundant cis-encoded exon-overlapping sense–antisense (SA) gene pairs in human (7356), mouse (6806), fly (1554), and eight other eukaryotic species (total 6534). We developed NATsDB (Natural Antisense Transcripts DataBase, ) to enable efficient browsing, searching and downloading of this currently most comprehensive collection of SA genes, grouped into six classes based on their overlapping patterns. NATsDB also includes non-exon-overlapping bidirectional (NOB) genes and non-bidirectional (NBD) genes. To facilitate the study of functions, regulations and possible pathological implications, NATsDB includes extensive information about gene structures, poly(A) signals and tails, phastCons conservation, homologues in other species, repeat elements, expressed sequence tag (EST) expression profiles and OMIM disease association. NATsDB supports interactive graphical display of the alignment of all supporting EST and mRNA transcripts of the SA and NOB genes to the genomic loci. It supports advanced search by species, gene name, sequence accession number, chromosome location, coding potential, OMIM association and sequence similarity.
The rapid identification of genetic markers for multifactorial diseases from genome-wide association studies is fuelling interest in investigating the predictive ability and health care utility of genetic risk models. Various measures are available for the assessment of risk prediction models, each addressing a different aspect of performance and utility. We developed PredictABEL, a package in R that covers descriptive tables, measures and figures that are used in the analysis of risk prediction studies such as measures of model fit, predictive ability and clinical utility, and risk distributions, calibration plot and the receiver operating characteristic plot. Tables and figures are saved as separate files in a user-specified format, which include publication-quality EPS and TIFF formats. All figures are available in a ready-made layout, but they can be customized to the preferences of the user. The package has been developed for the analysis of genetic risk prediction studies, but can also be used for studies that only include non-genetic risk factors. PredictABEL is freely available at the websites of GenABEL (http://www.genabel.org) and CRAN (http://cran.r-project.org/).
Risk prediction; Genetic; Assessment; Measures; Software
Phaeobacter arcticus Zhang et al. 2008 belongs to the marine Roseobacter clade whose members are phylogenetically and physiologically diverse. In contrast to the type species of this genus, Phaeobacter gallaeciensis, which is well characterized, relatively little is known about the characteristics of P. arcticus. Here, we describe the features of this organism including the annotated high-quality draft genome sequence and highlight some particular traits. The 5,049,232 bp long genome with its 4,828 protein-coding and 81 RNA genes consists of one chromosome and five extrachromosomal elements. Prophage sequences identified via PHAST constitute nearly 5% of the bacterial chromosome and included a potential Mu-like phage as well as a gene-transfer agent (GTA). In addition, the genome of strain DSM 23566T encodes all of the genes necessary for assimilatory nitrate reduction. Phylogenetic analysis and intergenomic distances indicate that the classification of the species might need to be reconsidered.
aerobic; psychrophilic; motile; high-quality draft; prophage-like structures; extrachromosomal elements; assimilatory nitrate reduction; Alphaproteobacteria; Roseobacter clade