With the Ebola epidemic raging out of control in West Africa, there has been a flurry of research into the Ebola virus, resulting in the generation of much genomic data.
In response to the clear need for tools that integrate multiple strands of research around molecular sequences, we have created the University of California Santa Cruz (UCSC) Ebola Genome Browser, an adaptation of our popular UCSC Genome Browser web tool, which can be used to view the Ebola virus genome sequence from GenBank and nearly 30 annotation tracks generated by mapping external data to the reference sequence. Significant annotations include a multiple alignment comprising 102 Ebola genomes from the current outbreak, 56 from previous outbreaks, and 2 Marburg genomes as an outgroup; a gene track curated by NCBI; protein annotations curated by UniProt and antibody-binding epitopes curated by IEDB. We have extended the Genome Browser’s multiple alignment color-coding scheme to distinguish mutations resulting from non-synonymous coding changes, synonymous changes, or changes in untranslated regions.
Our Ebola Genome portal at http://genome.ucsc.edu/ebolaPortal/ links to the Ebola virus Genome Browser and an aggregate of useful information, including a collection of Ebola antibodies we are curating.
ebola; ebolavirus; EBOV; genome analysis; genomics
The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a large collection of organisms, primarily vertebrates, with an emphasis on the human and mouse genomes. The Browser’s web-based tools provide an integrated environment for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic data sets. As of September 2013, the database contained genomic sequence and a basic set of annotation ‘tracks’ for ∼90 organisms. Significant new annotations include a 60-species multiple alignment conservation track on the mouse, updated UCSC Genes tracks for human and mouse, and several new sets of variation and ENCODE data. New software tools include a Variant Annotation Integrator that returns predicted functional effects of a set of variants uploaded as a custom track, an extension to UCSC Genes that displays haplotype alleles for protein-coding genes and an expansion of data hubs that includes the capability to display remotely hosted user-provided assembly sequence in addition to annotation data. To improve European access, we have added a Genome Browser mirror (http://genome-euro.ucsc.edu) hosted at Bielefeld University in Germany.
Summary: Track data hubs provide an efficient mechanism for visualizing remotely hosted Internet-accessible collections of genome annotations. Hub datasets can be organized, configured and fully integrated into the University of California Santa Cruz (UCSC) Genome Browser and accessed through the familiar browser interface. For the first time, individuals can use the complete browser feature set to view custom datasets without the overhead of setting up and maintaining a mirror.
Availability and implementation: Source code for the BigWig, BigBed and Genome Browser software is freely available for non-commercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, implemented in C and supported on Linux. Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/. Binary Alignment/Map (BAM) and Variant Call Format (VCF)/tabix utilities are available from http://samtools.sourceforge.net/ and http://vcftools.sourceforge.net/. The UCSC Genome Browser is publicly accessible at http://genome.ucsc.edu.
Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.
GC-AG introns represent 0.7% of total human pre-mRNA introns. To study the function of GC-AG introns in splicing regulation, 196 cDNA-confirmed GC-AG introns were identified in Caenorhabditis elegans. These represent 0.6% of the cDNA- confirmed intron data set for this organism. Eleven of these GC-AG introns are involved in alternative splicing. In a comparison of the genomic sequences of homologous genes between C.elegans and Caenorhabditis briggsae for 26 GC-AG introns, the C at the +2 position is conserved in only five of these introns. A system to experimentally test the function of GC-AG introns in alternative splicing was developed. Results from these experiments indicate that the conserved C at the +2 position of the tenth intron of the let-2 gene is essential for developmentally regulated alternative splicing. This C allows the splice donor to function as a very weak splice site that works in balance with an alternative GT splice donor. A weak GT splice donor can functionally replace the GC splice donor and allow for splicing regulation. These results indicate that while the majority of GC-AG introns appear to be constitutively spliced and have no evolutionary constraints to prevent them from being GT-AG introns, a subset of GC-AG introns is involved in alternative splicing and the C at the +2 position of these introns can have an important role in splicing regulation.
The Encyclopedia of DNA Elements (ENCODE), http://encodeproject.org, has completed its fifth year of scientific collaboration to create a comprehensive catalog of functional elements in the human genome, and its third year of investigations in the mouse genome. Since the last report in this journal, the ENCODE human data repertoire has grown by 898 new experiments (totaling 2886), accompanied by a major integrative analysis. In the mouse genome, results from 404 new experiments became available this year, increasing the total to 583, collected during the course of the project. The University of California, Santa Cruz, makes this data available on the public Genome Browser http://genome.ucsc.edu for visual browsing and data mining. Download of raw and processed data files are all supported. The ENCODE portal provides specialized tools and information about the ENCODE data sets.
The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a wide variety of organisms. The Browser is an integrated tool set for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic datasets. As of September 2012, genomic sequence and a basic set of annotation ‘tracks’ are provided for 63 organisms, including 26 mammals, 13 non-mammal vertebrates, 3 invertebrate deuterostomes, 13 insects, 6 worms, yeast and sea hare. In the past year 19 new genome assemblies have been added, and we anticipate releasing another 28 in early 2013. Further, a large number of annotation tracks have been either added, updated by contributors or remapped to the latest human reference genome. Among these are an updated UCSC Genes track for human and mouse assemblies. We have also introduced several features to improve usability, including new navigation menus. This article provides an update to the UCSC Genome Browser database, which has been previously featured in the Database issue of this journal.
The University of California Santa Cruz (UCSC) Genome Browser is a popular Web-based tool for quickly displaying a requested portion of a genome at any scale, accompanied by a series of aligned annotation “tracks.” The annotations generated by the UCSC Genome Bioinformatics Group and external collaborators include gene predictions, mRNA and expressed sequence tag alignments, simple nucleotide polymorphisms, expression and regulatory data, phenotype and variation data, and pairwise and multiple-species comparative genomics data. All information relevant to a region is presented in one window, facilitating biological analysis and interpretation. The database tables underlying the Genome Browser tracks can be viewed, downloaded, and manipulated using another Web-based application, the UCSC Table Browser. Users can upload personal datasets in a wide variety of formats as custom annotation tracks in both browsers for research or educational purposes.
Genome Browser; Table Browser; human genome; genome analysis; comparative genomics; human variation; next-gen sequencing; human genetics analysis; biological databases; BAM
The UCSC Genome Browser (http://genome.ucsc.edu) is a graphical viewer for genomic data now in its 13th year. Since the early days of the Human Genome Project, it has presented an integrated view of genomic data of many kinds. Now home to assemblies for 58 organisms, the Browser presents visualization of annotations mapped to genomic coordinates. The ability to juxtapose annotations of many types facilitates inquiry-driven data mining. Gene predictions, mRNA alignments, epigenomic data from the ENCODE project, conservation scores from vertebrate whole-genome alignments and variation data may be viewed at any scale from a single base to an entire chromosome. The Browser also includes many other widely used tools, including BLAT, which is useful for alignments from high-throughput sequencing experiments. Private data uploaded as Custom Tracks and Data Hubs in many formats may be displayed alongside the rich compendium of precomputed data in the UCSC database. The Table Browser is a full-featured graphical interface, which allows querying, filtering and intersection of data tables. The Saved Session feature allows users to store and share customized views, enhancing the utility of the system for organizing multiple trains of thought. Binary Alignment/Map (BAM), Variant Call Format and the Personal Genome Single Nucleotide Polymorphisms (SNPs) data formats are useful for visualizing a large sequencing experiment (whole-genome or whole-exome), where the differences between the data set and the reference assembly may be displayed graphically. Support for high-throughput sequencing extends to compact, indexed data formats, such as BAM, bigBed and bigWig, allowing rapid visualization of large datasets from RNA-seq and ChIP-seq experiments via local hosting.
UCSC genome browser; bioinformatics; genetics; human genome; genomics; sequencing
To complement the human Encyclopedia of DNA Elements (ENCODE) project and to enable a broad range of mouse genomics efforts, the Mouse ENCODE Consortium is applying the same experimental pipelines developed for human ENCODE to annotate the mouse genome.
ENCODE Project; mouse genome; DNaseI hypersensitive sites; histone modifications; transcriptome; transcription factor binding sites; comparative genomics; ChIP-seq; RNA-seq
Comparison of related genomes has emerged as a powerful lens for genome interpretation. Here, we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and report constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparison with experimental datasets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events, and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements, and ~1,000 primate- and human-accelerated elements. Overlap with disease-associated variants suggests our findings will be relevant for studies of human biology and health.
The Encyclopedia of DNA Elements (ENCODE) Consortium is entering its 5th year of production-level effort generating high-quality whole-genome functional annotations of the human genome. The past year has brought the ENCODE compendium of functional elements to critical mass, with a diverse set of 27 biochemical assays now covering 200 distinct human cell types. Within the mouse genome, which has been under study by ENCODE groups for the past 2 years, 37 cell types have been assayed. Over 2000 individual experiments have been completed and submitted to the Data Coordination Center for public use. UCSC makes this data available on the quality-reviewed public Genome Browser (http://genome.ucsc.edu) and on an early-access Preview Browser (http://genome-preview.ucsc.edu). Visual browsing, data mining and download of raw and processed data files are all supported. An ENCODE portal (http://encodeproject.org) provides specialized tools and information about the ENCODE data sets.
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
The University of California Santa Cruz (UCSC) Genome Browser (genome.ucsc.edu) is a popular Web-based tool for quickly displaying a requested portion of a genome at any scale, accompanied by a series of aligned annotation “tracks”. The annotations—generated by the UCSC Genome Bioinformatics Group and external collaborators—display gene predictions, mRNA and expressed sequence tag alignments, simple nucleotide polymorphisms, expression and regulatory data, phenotype and variation data, and pairwise and multiple-species comparative genomics data. All information relevant to a region is presented in one window, facilitating biological analysis and interpretation. The database tables underlying the Genome Browser tracks can be viewed, downloaded, and manipulated using another Web-based application, the UCSC Table Browser. Users can upload data as custom annotation tracks in both browsers for research or educational use. This unit describes how to use the Genome Browser and Table Browser for genome analysis, download the underlying database tables, and create and display custom annotation tracks.
Genome Browser; Table Browser; UCSC; human genome; genome analysis; comparative genomics; human variation; Bioinformatics; Bioinformatics Fundamentals; Biological Databases
The UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu) comprises a suite of web-based tools to integrate, visualize and analyze cancer genomics and clinical data. The browser displays whole-genome views of genome-wide experimental measurements for multiple samples alongside their associated clinical information. Multiple data sets can be viewed simultaneously as coordinated ‘heatmap tracks’ to compare across studies or different data modalities. Users can order, filter, aggregate, classify and display data interactively based on any given feature set including clinical features, annotated biological pathways and user-contributed collections of genes. Integrated standard statistical tools provide dynamic quantitative analysis within all available data sets. The browser hosts a growing body of publicly available cancer genomics data from a variety of cancer types, including data generated from the Cancer Genome Atlas project. Multiple consortiums use the browser on confidential prepublication data enabled by private installations. Many new features have been added, including the hgMicroscope tumor image viewer, hgSignature for real-time genomic signature evaluation on any browser track, and ‘PARADIGM’ pathway tracks to display integrative pathway activities. The browser is integrated with the UCSC Genome Browser; thus inheriting and integrating the Genome Browser’s rich set of human biology and genetics data that enhances the interpretability of the cancer genomics data.
The ENCODE project is an international consortium with a goal of cataloguing all the functional elements in the human genome. The ENCODE Data Coordination Center (DCC) at the University of California, Santa Cruz serves as the central repository for ENCODE data. In this role, the DCC offers a collection of high-throughput, genome-wide data generated with technologies such as ChIP-Seq, RNA-Seq, DNA digestion and others. This data helps illuminate transcription factor-binding sites, histone marks, chromatin accessibility, DNA methylation, RNA expression, RNA binding and other cell-state indicators. It includes sequences with quality scores, alignments, signals calculated from the alignments, and in most cases, element or peak calls calculated from the signal data. Each data set is available for visualization and download via the UCSC Genome Browser (http://genome.ucsc.edu/). ENCODE data can also be retrieved using a metadata system that captures the experimental parameters of each assay. The ENCODE web portal at UCSC (http://encodeproject.org/) provides information about the ENCODE data and links for access.
The University of California, Santa Cruz Genome Browser (http://genome.ucsc.edu) offers online access to a database of genomic sequence and annotation data for a wide variety of organisms. The Browser also has many tools for visualizing, comparing and analyzing both publicly available and user-generated genomic data sets, aligning sequences and uploading user data. Among the features released this year are a gene search tool and annotation track drag-reorder functionality as well as support for BAM and BigWig/BigBed file formats. New display enhancements include overlay of multiple wiggle tracks through use of transparent coloring, options for displaying transformed wiggle data, a ‘mean+whiskers’ windowing function for display of wiggle data at high zoom levels, and more color schemes for microarray data. New data highlights include seven new genome assemblies, a Neandertal genome data portal, phenotype and disease association data, a human RNA editing track, and a zebrafish Conservation track. We also describe updates to existing tracks.
The Encyclopedia of DNA Elements (ENCODE) project is an international consortium of investigators funded to analyze the human genome with the goal of producing a comprehensive catalog of functional elements. The ENCODE Data Coordination Center at The University of California, Santa Cruz (UCSC) is the primary repository for experimental results generated by ENCODE investigators. These results are captured in the UCSC Genome Bioinformatics database and download server for visualization and data mining via the UCSC Genome Browser and companion tools (Rhead et al. The UCSC Genome Browser Database: update 2010, in this issue). The ENCODE web portal at UCSC (http://encodeproject.org or http://genome.ucsc.edu/ENCODE) provides information about the ENCODE data and convenient links for access.
The University of California, Santa Cruz (UCSC) Genome Browser website (http://genome.ucsc.edu/) provides a large database of publicly available sequence and annotation data along with an integrated tool set for examining and comparing the genomes of organisms, aligning sequence to genomes, and displaying and sharing users’ own annotation data. As of September 2009, genomic sequence and a basic set of annotation ‘tracks’ are provided for 47 organisms, including 14 mammals, 10 non-mammal vertebrates, 3 invertebrate deuterostomes, 13 insects, 6 worms and a yeast. New data highlights this year include an updated human genome browser, a 44-species multiple sequence alignment track, improved variation and phenotype tracks and 16 new genome-wide ENCODE tracks. New features include drag-and-zoom navigation, a Wiki track for user-added annotations, new custom track formats for large datasets (bigBed and bigWig), a new multiple alignment output tool, links to variation and protein structure tools, in silico PCR utility enhancements, and improved track configuration tools.
Evolution via point mutations is a relatively slow process and is unlikely to completely explain the differences between primates and other mammals. By contrast, 45% of the human genome is composed of retroposed elements, many of which were inserted in the primate lineage. A subset of retroposed mRNAs (retrocopies) shows strong evidence of expression in primates, often yielding functional retrogenes.
To identify and analyze the relatively recently evolved retrogenes, we carried out BLASTZ alignments of all human mRNAs against the human genome and scored a set of features indicative of retroposition. Of over 12,000 putative retrocopy-derived genes that arose mainly in the primate lineage, 726 with strong evidence of transcript expression were examined in detail. These mRNA retroposition events fall into three categories: I) 34 retrocopies and antisense retrocopies that added potential protein coding space and UTRs to existing genes; II) 682 complete retrocopy duplications inserted into new loci; and III) an unexpected set of 13 retrocopies that contributed out-of-frame, or antisense sequences in combination with other types of transposed elements (SINEs, LINEs, LTRs), even unannotated sequence to form potentially novel genes with no homologs outside primates. In addition to their presence in human, several of the gene candidates also had potentially viable ORFs in chimpanzee, orangutan, and rhesus macaque, underscoring their potential of function.
mRNA-derived retrocopies provide raw material for the evolution of genes in a wide variety of ways, duplicating and amending the protein coding region of existing genes as well as generating the potential for new protein coding space, or non-protein coding RNAs, by unexpected contributions out of frame, in reverse orientation, or from previously non-protein coding sequence.
The goal of the Encyclopedia Of DNA Elements (ENCODE) Project is to identify all functional elements in the human genome. The pilot phase is for comparison of existing methods and for the development of new methods to rigorously analyze a defined 1% of the human genome sequence. Experimental datasets are focused on the origin of replication, DNase I hypersensitivity, chromatin immunoprecipitation, promoter function, gene structure, pseudogenes, non-protein-coding RNAs, transcribed RNAs, multiple sequence alignment and evolutionarily constrained elements. The ENCODE project at UCSC website () is the primary portal for the sequence-based data produced as part of the ENCODE project. In the pilot phase of the project, over 30 labs provided experimental results for a total of 56 browser tracks supported by 385 database tables. The site provides researchers with a number of tools that allow them to visualize and analyze the data as well as download data for local analyses. This paper describes the portal to the data, highlights the data that has been made available, and presents the tools that have been developed within the ENCODE project. Access to the data and types of interactive analysis that are possible are illustrated through supplemental examples.
The variation resources within the University of California Santa Cruz Genome Browser include polymorphism data drawn from public collections and analyses of these data, along with their display in the context of other genomic annotations. Primary data from dbSNP is included for many organisms, with added information including genomic alleles and orthologous alleles for closely related organisms. Display filtering and coloring is available by variant type, functional class or other annotations. Annotation of potential errors is highlighted and a genomic alignment of the variant's flanking sequence is displayed. HapMap allele frequencies and linkage disequilibrium (LD) are available for each HapMap population, along with non-human primate alleles. The browsing and analysis tools, downloadable data files and links to documentation and other information can be found at .
The major challenge to identifying natural sense– antisense (SA) transcripts from public databases is how to determine the correct orientation for an expressed sequence, especially an expressed sequence tag sequence. In this study, we established a set of very stringent criteria to identify the correct orientation of each human transcript. We used these orientation-reliable transcripts to create 26 741 transcription clusters in the human genome. Our analysis shows that 22% (5880) of the human transcription clusters form SA pairs, higher than any previous estimates. Our orientation-specific RT–PCR results along with the comparison of experimental data from previous studies confirm that our SA data set is reliable. This study not only demonstrates that our criteria for the prediction of SA transcripts are efficient, but also provides additional convincing data to support the view that antisense transcription is quite pervasive in the human genome. In-depth analyses show that SA transcripts have some significant differences compared with other types of transcripts, with regard to chromosomal distribution and Gene Ontology-annotated categories of physiological roles, functions and spatial localizations of gene products.
This study reports an abundance of primate-specific segmental duplications at the breakpoints of syntenic blocks in the human genome. Segmental duplications are associated with syntenic rearrangements but do not necessarily cause them.
Chromosomal evolution is thought to occur through a random process of breakage and rearrangement that leads to karyotype differences and disruption of gene order. With the availability of both the human and mouse genomic sequences, detailed analysis of the sequence properties underlying these breakpoints is now possible.
We report an abundance of primate-specific segmental duplications at the breakpoints of syntenic blocks in the human genome. Using conservative criteria, we find that 25% (122/461) of all breakpoints contain ≥ 10 kb of duplicated sequence. This association is highly significant (p < 0.0001) when compared to a simulated random-breakage model. The significance is robust under a variety of parameters, multiple sets of conserved synteny data, and for orthologous breakpoints between and within chromosomes. A comparison of mouse lineage-specific breakpoints since the divergence of rat and mouse showed a similar association with regions associated with segmental duplications in the primate genome.
These results indicate that segmental duplications are associated with syntenic rearrangements, even when pericentromeric and subtelomeric regions are excluded. However, segmental duplications are not necessarily the cause of the rearrangements. Rather, our analysis supports a nonrandom model of chromosomal evolution that implicates specific regions within the mammalian genome as having been predisposed to both recurrent small-scale duplication and large-scale evolutionary rearrangements.