PHAge Search Tool (PHAST) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage ‘cornerstone’ feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and Genbank files, provide richly annotated tables on prophage features and prophage ‘quality’ and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. PHAST is available at (http://phast.wishartlab.com).
The University of California Santa Cruz (UCSC) Genome Browser Database is an up to date source for genome sequence data integrated with a large collection of related annotations. The database is optimized to support fast interactive performance with the web-based UCSC Genome Browser, a tool built on top of the database for rapid visualization and querying of the data at many levels. The annotations for a given genome are displayed in the browser as a series of tracks aligned with the genomic sequence. Sequence data and annotations may also be viewed in a text-based tabular format or downloaded as tab-delimited flat files. The Genome Browser Database, browsing tools and downloadable data files can all be found on the UCSC Genome Bioinformatics website (http://genome.ucsc.edu), which also contains links to documentation and related technical information.
The Encyclopedia of DNA Elements (ENCODE) project is an international consortium of investigators funded to analyze the human genome with the goal of producing a comprehensive catalog of functional elements. The ENCODE Data Coordination Center at The University of California, Santa Cruz (UCSC) is the primary repository for experimental results generated by ENCODE investigators. These results are captured in the UCSC Genome Bioinformatics database and download server for visualization and data mining via the UCSC Genome Browser and companion tools (Rhead et al. The UCSC Genome Browser Database: update 2010, in this issue). The ENCODE web portal at UCSC (http://encodeproject.org or http://genome.ucsc.edu/ENCODE) provides information about the ENCODE data and convenient links for access.
The University of California Santa Cruz (UCSC) Genome Browser (genome.ucsc.edu) is a popular Web-based tool for quickly displaying a requested portion of a genome at any scale, accompanied by a series of aligned annotation “tracks”. The annotations—generated by the UCSC Genome Bioinformatics Group and external collaborators—display gene predictions, mRNA and expressed sequence tag alignments, simple nucleotide polymorphisms, expression and regulatory data, phenotype and variation data, and pairwise and multiple-species comparative genomics data. All information relevant to a region is presented in one window, facilitating biological analysis and interpretation. The database tables underlying the Genome Browser tracks can be viewed, downloaded, and manipulated using another Web-based application, the UCSC Table Browser. Users can upload data as custom annotation tracks in both browsers for research or educational use. This unit describes how to use the Genome Browser and Table Browser for genome analysis, download the underlying database tables, and create and display custom annotation tracks.
Genome Browser; Table Browser; UCSC; human genome; genome analysis; comparative genomics; human variation; Bioinformatics; Bioinformatics Fundamentals; Biological Databases
The University of California, Santa Cruz, Genome Browser Database (GBD) provides integrated sequence and annotation data for a large collection of vertebrate and model organism genomes. Seventeen new assemblies have been added to the database in the past year, for a total coverage of 19 vertebrate and 21 invertebrate species as of September 2007. For each assembly, the GBD contains a collection of annotation data aligned to the genomic sequence. Highlights of this year's additions include a 28-species human-based vertebrate conservation annotation, an enhanced UCSC Genes set, and more human variation, MGC, and ENCODE data. The database is optimized for fast interactive performance with a set of web-based tools that may be used to view, manipulate, filter and download the annotation data. New toolset features include the Genome Graphs tool for displaying genome-wide data sets, session saving and sharing, better custom track management, expanded Genome Browser configuration options and a Genome Browser wiki site. The downloadable GBD data, the companion Genome Browser toolset and links to documentation and related information can be found at: http://genome.ucsc.edu/.
The University of California Santa Cruz (UCSC) Genome Browser is a popular Web-based tool for quickly displaying a requested portion of a genome at any scale, accompanied by a series of aligned annotation “tracks.” The annotations generated by the UCSC Genome Bioinformatics Group and external collaborators include gene predictions, mRNA and expressed sequence tag alignments, simple nucleotide polymorphisms, expression and regulatory data, phenotype and variation data, and pairwise and multiple-species comparative genomics data. All information relevant to a region is presented in one window, facilitating biological analysis and interpretation. The database tables underlying the Genome Browser tracks can be viewed, downloaded, and manipulated using another Web-based application, the UCSC Table Browser. Users can upload personal datasets in a wide variety of formats as custom annotation tracks in both browsers for research or educational purposes.
Genome Browser; Table Browser; human genome; genome analysis; comparative genomics; human variation; next-gen sequencing; human genetics analysis; biological databases; BAM
The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a wide variety of organisms. The Browser is an integrated tool set for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic datasets. As of September 2012, genomic sequence and a basic set of annotation ‘tracks’ are provided for 63 organisms, including 26 mammals, 13 non-mammal vertebrates, 3 invertebrate deuterostomes, 13 insects, 6 worms, yeast and sea hare. In the past year 19 new genome assemblies have been added, and we anticipate releasing another 28 in early 2013. Further, a large number of annotation tracks have been either added, updated by contributors or remapped to the latest human reference genome. Among these are an updated UCSC Genes track for human and mouse assemblies. We have also introduced several features to improve usability, including new navigation menus. This article provides an update to the UCSC Genome Browser database, which has been previously featured in the Database issue of this journal.
The University of California Santa Cruz Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a wide variety of organisms. The Browser is an integrated tool set for visualizing, comparing, analyzing and sharing both publicly available and user-generated genomic data sets. In the past year, the local database has been updated with four new species assemblies, and we anticipate another four will be released by the end of 2011. Further, a large number of annotation tracks have been either added, updated by contributors, or remapped to the latest human reference genome. Among these are new phenotype and disease annotations, UCSC genes, and a major dbSNP update, which required new visualization methods. Growing beyond the local database, this year we have introduced ‘track data hubs’, which allow the Genome Browser to provide access to remotely located sets of annotations. This feature is designed to significantly extend the number and variety of annotation tracks that are publicly available for visualization and analysis from within our site. We have also introduced several usability features including track search and a context-sensitive menu of options available with a right-click anywhere on the Browser's image.
High-throughput data sets such as genome-wide protein–protein interactions, protein–DNA interactions and gene expression data have been published for several model systems, especially for human cancer samples. The University of California, Santa Cruz (UCSC) Interaction Browser (http://sysbio.soe.ucsc.edu/nets) is an online tool for biologists to view high-throughput data sets simultaneously for the analysis of functional relationships between biological entities. Users can access several public interaction networks and functional genomics data sets through the portal as well as upload their own networks and data sets for analysis. Users can navigate through correlative relationships for focused sets of genes belonging to biological pathways using a standard web browser. Using a new visual modality called the CircleMap, multiple ‘omics’ data sets can be viewed simultaneously within the context of curated, predicted, directed and undirected regulatory interactions. The Interaction Browser provides an integrative viewing of biological networks based on the consensus of many observations about genes and their products, which may provide new insights about normal and disease processes not obvious from any isolated data set.
Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) or ChIP followed by genome tiling array analysis (ChIP-chip) have become standard technologies for genome-wide identification of DNA-binding protein target sites. A number of algorithms have been developed in parallel that allow identification of binding sites from ChIP-seq or ChIP-chip datasets and subsequent visualization in the University of California Santa Cruz (UCSC) Genome Browser as custom annotation tracks. However, summarizing these tracks can be a daunting task, particularly if there are a large number of binding sites or the binding sites are distributed widely across the genome.
We have developed ChIPpeakAnno as a Bioconductor package within the statistical programming environment R to facilitate batch annotation of enriched peaks identified from ChIP-seq, ChIP-chip, cap analysis of gene expression (CAGE) or any experiments resulting in a large number of enriched genomic regions. The binding sites annotated with ChIPpeakAnno can be viewed easily as a table, a pie chart or plotted in histogram form, i.e., the distribution of distances to the nearest genes for each set of peaks. In addition, we have implemented functionalities for determining the significance of overlap between replicates or binding sites among transcription factors within a complex, and for drawing Venn diagrams to visualize the extent of the overlap between replicates. Furthermore, the package includes functionalities to retrieve sequences flanking putative binding sites for PCR amplification, cloning, or motif discovery, and to identify Gene Ontology (GO) terms associated with adjacent genes.
ChIPpeakAnno enables batch annotation of the binding sites identified from ChIP-seq, ChIP-chip, CAGE or any technology that results in a large number of enriched genomic regions within the statistical programming environment R. Allowing users to pass their own annotation data such as a different Chromatin immunoprecipitation (ChIP) preparation and a dataset from literature, or existing annotation packages, such as GenomicFeatures and BSgenome, provides flexibility. Tight integration to the biomaRt package enables up-to-date annotation retrieval from the BioMart database.
GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts about 1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. They are also significantly enriched for disease-associated polymorphisms, suggesting that they contribute to the fixation of deleterious alleles. The gBGC tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages. They supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.
Interpreting patterns of DNA sequence variation in the genomes of closely related species is critically important for understanding the causes and functional effects of nucleotide substitutions. Classical models describe patterns of substitution in terms of the fundamental forces of mutation, recombination, neutral drift, and natural selection. However, an entirely separate force, called GC-biased gene conversion (gBGC), also appears to have an important influence on substitution patterns in many species. gBGC is a recombination-associated evolutionary process that favors the fixation of strong (G/C) over weak (A/T) alleles. In mammals, gBGC is thought to promote variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations. However, its genome-wide influence remains poorly understood, in part because, it is difficult to incorporate gBGC into statistical models of evolution. In this paper, we describe a new evolutionary model that jointly describes the effects of selection and gBGC and apply it to the human and chimpanzee genomes. Our genome-wide predictions of gBGC tracts indicate that gBGC has been an important force in recent human evolution. Our publicly available computer program, called phastBias, and our genome-wide predictions will enable other researchers to consider gBGC in their analyses.
The ENCODE project is an international consortium with a goal of cataloguing all the functional elements in the human genome. The ENCODE Data Coordination Center (DCC) at the University of California, Santa Cruz serves as the central repository for ENCODE data. In this role, the DCC offers a collection of high-throughput, genome-wide data generated with technologies such as ChIP-Seq, RNA-Seq, DNA digestion and others. This data helps illuminate transcription factor-binding sites, histone marks, chromatin accessibility, DNA methylation, RNA expression, RNA binding and other cell-state indicators. It includes sequences with quality scores, alignments, signals calculated from the alignments, and in most cases, element or peak calls calculated from the signal data. Each data set is available for visualization and download via the UCSC Genome Browser (http://genome.ucsc.edu/). ENCODE data can also be retrieved using a metadata system that captures the experimental parameters of each assay. The ENCODE web portal at UCSC (http://encodeproject.org/) provides information about the ENCODE data and links for access.
The University of California, Santa Cruz (UCSC) Genome Browser website (http://genome.ucsc.edu/) provides a large database of publicly available sequence and annotation data along with an integrated tool set for examining and comparing the genomes of organisms, aligning sequence to genomes, and displaying and sharing users’ own annotation data. As of September 2009, genomic sequence and a basic set of annotation ‘tracks’ are provided for 47 organisms, including 14 mammals, 10 non-mammal vertebrates, 3 invertebrate deuterostomes, 13 insects, 6 worms and a yeast. New data highlights this year include an updated human genome browser, a 44-species multiple sequence alignment track, improved variation and phenotype tracks and 16 new genome-wide ENCODE tracks. New features include drag-and-zoom navigation, a Wiki track for user-added annotations, new custom track formats for large datasets (bigBed and bigWig), a new multiple alignment output tool, links to variation and protein structure tools, in silico PCR utility enhancements, and improved track configuration tools.
The sequence of the human genome provides a scaffold on which numerous annotations, such the locations of genes, can be laid. Genome browsers have been created to allow the simultaneous display of multiple annotations within a graphical interface. In addition, they provide the ability to search for markers and sequences, to extract annotations for specific regions or for the whole genome and to act as a central starting point for genomic research. This review describes the basic functionality of genome browsers and compares three of them: the University of California Santa Cruz (UCSC) Genome Browser, the Ensembl Genome Browser and the NCBI MapViewer.
genome browsers; genome annotations; genome databases
The University of California Santa Cruz (UCSC) Table Browser (http://genome.ucsc.edu/cgi-bin/hgText) provides text-based access to a large collection of genome assemblies and annotation data stored in the Genome Browser Database. A flexible alternative to the graphical-based Genome Browser, this tool offers an enhanced level of query support that includes restrictions based on field values, free-form SQL queries and combined queries on multiple tables. Output can be filtered to restrict the fields and lines returned, and may be organized into one of several formats, including a simple tab- delimited file that can be loaded into a spreadsheet or database as well as advanced formats that may be uploaded into the Genome Browser as custom annotation tracks. The Table Browser User’s Guide located on the UCSC website provides instructions and detailed examples for constructing queries and configuring output.
The Encyclopedia of DNA Elements (ENCODE), http://encodeproject.org, has completed its fifth year of scientific collaboration to create a comprehensive catalog of functional elements in the human genome, and its third year of investigations in the mouse genome. Since the last report in this journal, the ENCODE human data repertoire has grown by 898 new experiments (totaling 2886), accompanied by a major integrative analysis. In the mouse genome, results from 404 new experiments became available this year, increasing the total to 583, collected during the course of the project. The University of California, Santa Cruz, makes this data available on the public Genome Browser http://genome.ucsc.edu for visual browsing and data mining. Download of raw and processed data files are all supported. The ENCODE portal provides specialized tools and information about the ENCODE data sets.
The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby.
The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast.
The API uses the bin index—if available—when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby).
Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will facilitate biologists to query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help is provided via the website at http://rubyucscapi.userecho.com/.
The University of California, Santa Cruz Genome Browser Database contains, as of September 2006, sequence and annotation data for the genomes of 13 vertebrate and 19 invertebrate species. The Genome Browser displays a wide variety of annotations at all scales from the single nucleotide level up to a full chromosome and includes assembly data, genes and gene predictions, mRNA and EST alignments, and comparative genomics, regulation, expression and variation data. The database is optimized for fast interactive performance with web tools that provide powerful visualization and querying capabilities for mining the data. In the past year, 22 new assemblies and several new sets of human variation annotation have been released. New features include VisiGene, a fully integrated in situ hybridization image browser; phyloGif, for drawing evolutionary tree diagrams; a redesigned Custom Track feature; an expanded SNP annotation track; and many new display options. The Genome Browser, other tools, downloadable data files and links to documentation and other information can be found at .
The University of California, Santa Cruz Genome Browser (http://genome.ucsc.edu) offers online access to a database of genomic sequence and annotation data for a wide variety of organisms. The Browser also has many tools for visualizing, comparing and analyzing both publicly available and user-generated genomic data sets, aligning sequences and uploading user data. Among the features released this year are a gene search tool and annotation track drag-reorder functionality as well as support for BAM and BigWig/BigBed file formats. New display enhancements include overlay of multiple wiggle tracks through use of transparent coloring, options for displaying transformed wiggle data, a ‘mean+whiskers’ windowing function for display of wiggle data at high zoom levels, and more color schemes for microarray data. New data highlights include seven new genome assemblies, a Neandertal genome data portal, phenotype and disease association data, a human RNA editing track, and a zebrafish Conservation track. We also describe updates to existing tracks.
T2 weighted carotid artery images acquired using the TSE sequence frequently suffers from motion artifacts due to respiration and blood pulsation. The possibility of using HASTE sequence to achieve motion-free carotid images was investigated. The HASTE sequence suffers from severe blurring artifacts due to signal loss in later echoes due to T2 decay. Combining HASTE with parallel acquisition (PHASTE) decreases the number of echoes acquired and thus effectively reduces the blurring artifact caused by T2 relaxation. Further improvement in image sharpness can be achieved by performing T2 decay compensation before reconstructing the PHASTE data. Preliminary results have shown successful suppression of motion artifacts with PHASTE imaging. The image quality was enhanced relative to the original HASTE image, but was still less sharp than a non-motion-corrupted TSE image.
HASTE; K-space inherited parallel acquisition; T2 blurring; Carotid artery imaging; Motion suppression
Regions covering one percent of the genome, selected by ENCODE for extensive analysis, were annotated by the HAVANA/Gencode group with high quality transcripts, thus defining a benchmark. The ENCODE Genome Annotation Assessment Project (EGASP) competition aimed at reproducing Gencode and finding new genes. The organizers evaluated the protein predictions in depth. We present a complementary analysis of the mRNAs, including alternative transcript variants.
We evaluate 25 gene tracks from the University of California Santa Cruz (UCSC) genome browser. We either distinguish or collapse the alternative splice variants, and compare the genomic coordinates of exons, introns and nucleotides. Whole mRNA models, seen as chains of introns, are sorted to find the best matching pairs, and compared so that each mRNA is used only once. At the mRNA level, AceView is by far the closest to Gencode: the vast majority of transcripts of the two methods, including alternative variants, are identical. At the protein level, however, due to a lack of experimental data, our predictions differ: Gencode annotates proteins in only 41% of the mRNAs whereas AceView does so in virtually all. We describe the driving principles of AceView, and how, by performing hand-supervised automatic annotation, we solve the combinatorial splicing problem and summarize all of GenBank, dbEST and RefSeq into a genome-wide non-redundant but comprehensive cDNA-supported transcriptome. AceView accuracy is now validated by Gencode.
Relative to a consensus mRNA catalog constructed from all evidence-based annotations, Gencode and AceView have 81% and 84% sensitivity, and 74% and 73% specificity, respectively. This close agreement validates a richer view of the human transcriptome, with three to five times more transcripts than in UCSC Known Genes (sensitivity 28%), RefSeq (sensitivity 21%) or Ensembl (sensitivity 19%).
FootPrinter3 is a web server for predicting transcription factor binding sites by using phylogenetic footprinting. Until now, phylogenetic footprinting approaches have been based either on multiple alignment analysis (e.g. PhyloVista, PhastCons), or on motif-discovery algorithms (e.g. FootPrinter2). FootPrinter3 integrates these two approaches, making use of local multiple sequence alignment blocks when those are available and reliable, but also allowing finding motifs in unalignable regions. The result is a set of predictions that joins the advantages of alignment-based methods (good specificity) to those of motif-based methods (good sensitivity, even in the presence of highly diverged species). FootPrinter3 is thus a tool of choice to exploit the wealth of vertebrate genomes being sequenced, as it allows taking full advantage of the sequences of highly diverged species (e.g. chicken, zebrafish), as well as those of more closely related species (e.g. mammals). The FootPrinter3 web server is available at: .
A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating the DNA MSA directly from the transition matrices do not exist. Moreover, existing software restricts to the time-reversible models and it is not optimized to generate nonhomogeneous data (i.e. placing distinct substitution rates at different lineages).
We present the first package designed to generate MSAs evolving under discrete-time Markov processes on phylogenetic trees, directly from probability substitution matrices. Based on the input model and a phylogenetic tree in the Newick format (with branch lengths measured as the expected number of substitutions per site), the algorithm produces DNA alignments of desired length. GenNon-h is publicly available for download.
The software presented here is an efficient tool to generate DNA MSAs on a given phylogenetic tree. GenNon-h provides the user with the nonstationary or nonhomogeneous phylogenetic data that is well suited for testing complex biological hypotheses, exploring the limits of the reconstruction algorithms and their robustness to such models.
Genome-wide association studies (GWAS) have successfully identified numerous genetic loci that are associated with phenotypic traits and diseases. GWAS Integrator is a bioinformatics tool that integrates information on these associations from the National Human Genome Research institute (NHGRI) Catalog, SNAP (SNP Annotation and Proxy Search), and the Human Genome Epidemiology (HuGE) Navigator literature database. This tool includes robust search and data mining functionalities that can be used to quickly identify relevant associations from GWAS, as well as proxy single-nucleotide polymorphisms (SNPs) and potential candidate genes. Query-based University of California Santa Cruz (UCSC) Genome Browser custom tracks are generated dynamically on the basis of users' selected GWAS hits or candidate genes from HuGE Navigator literature database (http://www.hugenavigator.net/HuGENavigator/gWAHitStartPage.do). The GWAS Integrator may help enhance inference on potential genetic associations identified from GWAS studies.
genome-wide association studies; database; bioinformatics
The University of California Santa Cruz Genome Browser Database (GBD) contains sequence and annotation data for the genomes of about a dozen vertebrate species and several major model organisms. Genome annotations typically include assembly data, sequence composition, genes and gene predictions, mRNA and expressed sequence tag evidence, comparative genomics, regulation, expression and variation data. The database is optimized to support fast interactive performance with web tools that provide powerful visualization and querying capabilities for mining the data. The Genome Browser displays a wide variety of annotations at all scales from single nucleotide level up to a full chromosome. The Table Browser provides direct access to the database tables and sequence data, enabling complex queries on genome-wide datasets. The Proteome Browser graphically displays protein properties. The Gene Sorter allows filtering and comparison of genes by several metrics including expression data and several gene properties. BLAT and In Silico PCR search for sequences in entire genomes in seconds. These tools are highly integrated and provide many hyperlinks to other databases and websites. The GBD, browsing tools, downloadable data files and links to documentation and other information can be found at .