The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a wide variety of organisms. The Browser is an integrated tool set for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic datasets. As of September 2012, genomic sequence and a basic set of annotation ‘tracks’ are provided for 63 organisms, including 26 mammals, 13 non-mammal vertebrates, 3 invertebrate deuterostomes, 13 insects, 6 worms, yeast and sea hare. In the past year 19 new genome assemblies have been added, and we anticipate releasing another 28 in early 2013. Further, a large number of annotation tracks have been either added, updated by contributors or remapped to the latest human reference genome. Among these are an updated UCSC Genes track for human and mouse assemblies. We have also introduced several features to improve usability, including new navigation menus. This article provides an update to the UCSC Genome Browser database, which has been previously featured in the Database issue of this journal.
The University of California Santa Cruz (UCSC) Genome Browser (1,2) at http://genome.ucsc.edu is a web-based set of tools providing access to a database of genome sequence and annotations for visualization, comparison and analysis by the scientific, medical and academic communities. Our primary mission is to provide timely and convenient open access to high-quality human genome sequence and annotations in a framework that enables easy exploration from genome-wide down to the base level. Annotation datasets, or ‘tracks’, on the human genome cover conservation and evolutionary comparisons, gene models, regulation, expression, epigenetics and tissue differentiation, variation, phenotype and disease associations. Our mission extends to a number of additional organisms including 6 other primates, 19 additional mammals including 3 marsupials and 1 monotreme, 13 non-mammalian vertebrates and 24 invertebrates, each with varying degrees of genome-specific annotation. Many of the genomes in our database have multiple assembly versions, which support researchers who use annotations mapped using older assemblies.
The Genome Browser locally hosts mapping and sequence annotation tracks that describe assembly, gap and GC content for all organisms in the browser database. Additionally, for most organisms we show alignments from RefSeq genes (3), mRNAs and ESTs from GenBank (4), and other gene or gene prediction tracks such as Ensembl Genes (5). For human and mouse assemblies, we also offer a locally generated UCSC Genes track based upon RefSeq, GenBank, CCDS and UniProt data (6,7). About half of the genomes hosted at UCSC include a multiple sequence alignment (multiz) track (8) and pairwise genomic alignments between assemblies to facilitate comparative and evolutionary investigations. Expression, regulation, variation and phenotype tracks are available for many of the assemblies. Most locally hosted tracks include descriptions with references and links to the original contributors or research upon which the annotations are based.
With the abundance of new vertebrate assemblies available in GenBank, the UCSC Genome Browser team has streamlined its browser release pipeline in an effort to keep pace. We have added 19 new assemblies to the Genome Browser in the past year, including 4 model organisms (Fugu, mouse, worm and yeast), 7 newly sequenced organisms (gibbon, lesser hedgehog tenrec, medium ground finch, naked mole-rat, Tasmanian devil, turkey and western painted turtle) and 8 updated assemblies for previously published organisms (chicken, cow, dog, gorilla, microbat, rat, tammar wallaby and western clawed frog); see Table 1 for details. We anticipate the public release of 28 more genome assemblies in the coming months (Table 2) in support of the new mouse (GRCm38/mm10) 60-way conservation track. For a complete list of the genome assemblies included in this track, refer to the mm10 Conservation track description page on the Genome Browser website.
Many new datasets were added to the Genome Browser this year, and several existing datasets underwent major revisions. A significant portion of these were contributed by the Encyclopedia of DNA Elements (ENCODE) Consortium: we released tracks and downloadable files for more than 2300 experiments as the Data Coordination Center for the ENCODE Project (9,10), described in a companion paper in this issue.
We published a major update of the UCSC Genes track (6) for the human assembly (GRCh37/hg19) that includes more non-coding transcripts based on data from Rfam and from the tRNA Genes track. We anticipate releasing an updated UCSC Genes for mm10 in fall of 2012. Rat Genome Database (RGD) Genes for rat has replaced UCSC Genes as the main gene track for Baylor 3.4/rn4 (11).
We have updated dbSNP for hg19 to version 135, which includes interim phase 1 variant calls from the 1000 Genomes project (12). This new version contains additional annotation data not included in previous dbSNP tracks, with corresponding coloring and filtering options in the Genome Browser. We anticipate having dbSNP version 137 for hg19 available in fall 2012, with Sequence Ontology (13) terms replacing dbSNP's functional annotation terms in the display.
To ensure timely display of data from frequently updated phenotype and disease association databases we have automated loading of the following hg19 tracks: Catalogue Of Somatic Mutations In Cancer (COSMIC), GeneReviews, GWAS Catalog and Online Mendelian Inheritance in Man (OMIM) (14–17).
We have added a Publications track that shows DNA and protein sequences, SNPs, cytogenetic bands and gene symbols which were text-mined from 3 million biomedical articles in Elsevier, PubMed Central and other databases (18). This track is based on the UCSC Genocoding Project, which searches for references to chromosomal locations in scientific articles. The annotations in this track link back to the original article, thus allowing researchers to identify publications relevant to a particular locus (Figure 1).
We have added four public track hubs for hg19 from external data providers (see below for more details on track hubs): the ENCODE Analysis hub contains descriptions of ENCODE data in uniformly processed signal and element representations, as well as genome segmentations (19); the UMassMed ZHub contains H3K4me3 ChIP-seq data for autistic brains (20); the Expression & PolyA Database (xPAD) hub contains a map of polyadenylation sites in cancer tissues and tumor cell lines (21); the miRcode hub contains predicted microRNA target sites in GENCODE transcripts (22).
We made several changes to the interface of the Genome Browser in 2012 based on suggestions from our users. All pages now display a menu bar to make it easier to access features and navigate around the website in a consistent way. We have changed the fonts and background to improve usability. The annotation search and gene suggest box have been combined, and we have added descriptions to the gene suggestion list. We have changed the way users log in when saving sessions; this change simplifies the login procedure and also removes the dependency on MediaWiki, which makes it easier for Genome Browser mirrors to support saved sessions.
We introduced support for the Variant Call Format (VCF) in 2011 (23). This year we improved VCF support with a haplotype sorting display. VCF can optionally represent phased genotypes, i.e. the two alleles of each diploid genotype have been assigned to two haplotypes, one inherited from each parent. For VCF files that contain phased genotypes from multiple samples, we have developed an advanced display to highlight local patterns of genetic linkage between variants. The display features the clustering of independent haplotypes within the viewed region. The goal of the clustering is to visually group co-occurring allele sequences in haplotypes, so local patterns of linkage can be easily discerned. The clustering does not indicate relatedness of individuals, but merely local composition of mostly ancient haplotype blocks. We anticipate adding 1000 Genomes Phase 1 variant calls with phased genotypes for 1092 individuals using this display in fall 2012.
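To make the notion of phased genotypes concrete, the following is a minimal Python sketch of how phased diploid genotype strings can be separated into two haplotypes per sample. It relies only on the VCF convention that `|` separates phased alleles and `/` separates unphased ones. The function names and the input layout (a list of per-variant genotype lists) are hypothetical choices for illustration and do not describe the Genome Browser's own code.

```python
def split_phased_genotype(gt):
    """Return the two allele indices of a phased diploid GT string like '0|1'.

    Returns None when the call is unphased ('/'), missing ('.') or not diploid.
    """
    if "/" in gt or "." in gt:
        return None
    alleles = gt.split("|")
    if len(alleles) != 2:
        return None
    return int(alleles[0]), int(alleles[1])


def haplotypes_from_genotypes(genotypes_by_variant):
    """Convert per-variant genotypes into per-sample haplotype pairs.

    genotypes_by_variant: one list per variant, each holding one GT string
    per sample (hypothetical layout for this sketch).
    Returns a list of haplotypes, two per fully phased sample, each a list
    of allele indices in variant order. Samples with any unphased or missing
    call are skipped entirely.
    """
    if not genotypes_by_variant:
        return []
    n_samples = len(genotypes_by_variant[0])
    haplotypes = []
    for s in range(n_samples):
        first, second = [], []
        fully_phased = True
        for variant in genotypes_by_variant:
            pair = split_phased_genotype(variant[s])
            if pair is None:
                fully_phased = False
                break
            first.append(pair[0])
            second.append(pair[1])
        if fully_phased:
            haplotypes.append(first)
            haplotypes.append(second)
    return haplotypes
```

Skipping samples that have any unphased call is a simplification made here; a real display could instead handle such calls separately.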
In the haplotype sorting display (Figure 2), independent haplotypes are shown horizontally, and variants are vertical bars with reference alleles in white (invisible) and alternate alleles in black. A variant for which most haplotypes have the reference allele will be mostly white (invisible); tick marks at the top and bottom of each variant make such variants easier to see. Haplotypes are clustered by similarity weighted by proximity to a central variant, which is outlined in purple. In order to limit compute time, only a small number of variants are used for clustering; these variants have purple tick marks above and below. The clustering tree is drawn in the left label area, and is used to order the haplotypes from top to bottom. When a rightmost branch in the clustering tree is purple, it means that all haplotypes in the branch are identical, at least in the variants used for clustering.
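The idea of clustering haplotypes by similarity weighted by proximity to a central variant can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used by the Genome Browser: the weight `1 / (1 + distance from the central variant)`, the average-linkage merging rule and all function names are choices made here for the example.

```python
def weighted_distance(h1, h2, center):
    """Hamming-style distance in which mismatches near `center` count more.

    The weighting 1 / (1 + |i - center|) is an assumption of this sketch.
    """
    total = 0.0
    for i, (a, b) in enumerate(zip(h1, h2)):
        if a != b:
            total += 1.0 / (1 + abs(i - center))
    return total


def cluster_order(haplotypes, center):
    """Order haplotypes by simple average-linkage agglomerative clustering.

    haplotypes: list of equal-length allele lists (0 = reference allele).
    center: index of the central variant within each haplotype.
    Returns haplotype indices in top-to-bottom display order.
    """
    if not haplotypes:
        return []
    clusters = [[i] for i in range(len(haplotypes))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        dists = [weighted_distance(haplotypes[i], haplotypes[j], center)
                 for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = clusters[x] + clusters[y]
        # Replace the two closest clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0]
```

Because each merge concatenates the two closest clusters, identical haplotypes end up adjacent in the returned order. The quadratic search in each round is one reason a display of this kind would restrict clustering to a small number of variants.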
In 2011 we introduced support for track data hubs, which are web-accessible directories of genomic data that can be viewed in the UCSC Genome Browser alongside the annotation tracks hosted by UCSC (2). This technology has many advantages: it allows researchers to combine and configure large numbers of datasets for presentation as a single entity, it improves performance by allowing the Genome Browser to retrieve data only when necessary, and it allows researchers to share a collection of data with colleagues as a private data hub. Track hub usage increased greatly in 2012; by September 2012 more than 2000 track hubs were in use. There is also a growing trend in the research community to use track hubs to collect and organize data for presentation in publications. UCSC has extended the documentation (http://genome.ucsc.edu/goldenPath/help/trackDb/trackDbDoc.html) for track hubs on the Genome Browser website to facilitate their use.
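As a rough illustration of what such a web-accessible directory contains, the Python sketch below generates the three small text files that commonly make up a minimal hub: `hub.txt`, `genomes.txt` and a per-assembly `trackDb.txt`. The setting names (`hub`, `genomesFile`, `trackDb`, `bigDataUrl`, `type`) follow the track hub documentation as we understand it and should be checked against the documentation linked above. The hub name, e-mail address, track name and data URL used in the test are hypothetical placeholders.

```python
def stanza(pairs):
    """Format (setting, value) pairs as one 'setting value' line each."""
    return "\n".join(f"{key} {value}" for key, value in pairs) + "\n"


def build_hub_files(hub_name, email, genome, track_name, big_data_url):
    """Return a dict mapping relative file paths to file contents.

    All arguments are caller-supplied; nothing here contacts a server.
    The single track is assumed to be a bigWig signal file.
    """
    hub_txt = stanza([
        ("hub", hub_name),
        ("shortLabel", hub_name),
        ("longLabel", f"{hub_name} example hub"),
        ("genomesFile", "genomes.txt"),
        ("email", email),
    ])
    genomes_txt = stanza([
        ("genome", genome),
        ("trackDb", f"{genome}/trackDb.txt"),
    ])
    track_db_txt = stanza([
        ("track", track_name),
        ("bigDataUrl", big_data_url),
        ("shortLabel", track_name),
        ("longLabel", f"{track_name} signal"),
        ("type", "bigWig"),
    ])
    return {
        "hub.txt": hub_txt,
        "genomes.txt": genomes_txt,
        f"{genome}/trackDb.txt": track_db_txt,
    }
```

The returned files would be written to a directory on a web server, and the Genome Browser would then be pointed at the URL of `hub.txt`. Because the data file is referenced by URL, only the portions needed for the viewed region have to be retrieved.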
We will continue to add new and updated genome assemblies for vertebrate and other selected model organisms as they become available. Only assemblies registered and deposited in NCBI’s GenBank will be considered for hosting at UCSC, as stipulated in the Browser Genome Release Agreement instituted by NCBI, Ensembl and UCSC. Many researchers have expressed interest in using the Genome Browser to visualize and analyse assemblies that are not deposited at NCBI. To assist such research, we intend to develop support for assembly data hubs, which will enable the genomics community to easily extend the Genome Browser to display genome assemblies that we are unable to integrate into our own database. The assembly data hub will be similar in concept to the track data hub: the data provider will store the genome sequence in a compressed, binary, indexed file format and make it available on a remote web server along with a list of tracks that annotate that genome.
We plan to add or update several annotation tracks in the upcoming year, including a coverage/mappability track based on 1000 Genomes project data, updated recombination rate and UCSC Genes tracks for the human genome, an updated ORFeome track for zebrafish, a mouse strain variant track, segmental duplication tracks for several assemblies, and more selected personal genomes in the human Personal Genome Variants track. We will also continue to incorporate selected datasets from the ENCODE project that are of general interest to our users.
We are developing a tool for integrating diverse annotations in our databases with user-provided genomic variants, to assist with analysis and prioritization of variants discovered via sequencing. We will finish support for VCF in track hubs. We also plan to implement a supported mirror in Germany to improve access speed for European users of the Genome Browser.
We have two public, moderated mailing lists for user support: genome@soe.ucsc.edu for general questions about the Genome Browser and genome-mirror@soe.ucsc.edu for questions specific to the setup and maintenance of Genome Browser mirrors. Archives of both lists are searchable from our contacts page at http://genome.ucsc.edu/contacts.html. You may also reach us at firstname.lastname@example.org, the preferred address for inquiring about mirror site licenses and reporting server errors.
National Human Genome Research Institute [P41HG002371 to G.P.B., H.C., M.D., P.A.F., A.S.H., F.H., D.K., V.K., W.J.K., R.M.K., B.T.L., C.H.L., L.R.M., A.P., B.J.R., B.R., G.R. and A.S.Z.; U41HG004568 to M.S.C., T.R.D., M.G., F.H., W.J.K., K.L., V.S.M., B.J.R., K.R.R., C.A.S. and M.W.; and subcontracts from P01HG5062 to G.P.B., W.J.K. and B.R.; U54HG004555 to M.D. and R.A.H.; U41HG004269 to A.S.H. and W.J.K.; U01HG004695 to W.J.K.]; subcontracts from the National Institute of Dental and Craniofacial Research [U01DE20057 to G.P.B. and R.M.K.]; National Institute of Child Health and Human Development [RC2HD064525 to H.C., A.S.H. and R.M.K.]; National Institute of Environmental Health Sciences [U01ES017154 to W.J.K.]. European Molecular Biology Organization Long-Term Fellowship (ALTF 292-2011 to M.H.). Support from Howard Hughes Medical Institute (to D.H.). Funding for open access charge: Howard Hughes Medical Institute.
Conflict of interest statement. G.P.B., H.C., M.D., T.R.D., P.A.F., B.M.G., D.H., R.A.H., A.S.H., D.K., V.K., W.J.K., R.M.K., K.L., C.H.L., V.S.M., L.R.M., A.P., B.R., B.J.R., K.R.R., C.A.S. and A.S.Z. receive royalties from the sale of UCSC Genome Browser source code licenses to commercial entities; W.J.K. works for Kent Informatics.
The authors would like to thank the many data contributors whose work makes the Genome Browser possible, our Scientific Advisory Board for steering our efforts, our users for their consistent support and valuable feedback, and our outstanding team of system administrators: Jorge Garcia, Erich Weiler and Gary Moro.