|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Genome browsers that support fast navigation through vast datasets and provide interactive visual analytics functions can help scientists achieve deeper insight into biological systems. Toward this end, we developed Integrated Genome Browser (IGB), a highly configurable, interactive and fast open source desktop genome browser.
Results: Here we describe multiple updates to IGB, including all-new capabilities to display and interact with data from high-throughput sequencing experiments. To demonstrate, we describe example visualizations and analyses of datasets from RNA-Seq, ChIP-Seq and bisulfite sequencing experiments. Understanding results from genome-scale experiments requires viewing the data in the context of reference genome annotations and other related datasets. To facilitate this, we enhanced IGB’s ability to consume data from diverse sources, including Galaxy, Distributed Annotation and IGB-specific Quickload servers. To support future visualization needs as new genome-scale assays enter wide use, we transformed the IGB codebase into a modular, extensible platform for developers to create and deploy all-new visualizations of genomic data.
Availability and implementation: IGB is open source and is freely available from http://bioviz.org/igb.
Genome browsers are visualization software tools that display genomic data in interactive, graphical formats. Since the 1990s, genome browsers have played an essential role in genomics, first as tools for building, inspecting, and annotating assemblies and later as tools for distributing data to the public (Durbin and Thierry-Mieg, 1991; Harris, 1997; Kent et al., 2002). Later, the rise of genome-scale assays created the need for a new generation of genome browsers that could display user’s experimental data alongside reference sequence data and annotations.
Integrated Genome Browser (IGB), first developed in 2001 at Affymetrix, was among the first of this new breed of tools. IGB was first written to support Affymetrix scientists and collaborators who were using whole genome tiling arrays to probe gene expression and transcription factor binding sites as part of the ENCODE project (Kapranov et al., 2003). As such, IGB was designed from the start to display what we would now call ‘big data’ in bioinformatics—millions of probe intensity values per sample. IGB was one of the first genome browsers to support visual analytics, in which interactive visual interfaces augment our natural ability to notice patterns in data (Keim et al., 2006). The ‘thresholding’ feature described below is an example.
Because early IGB development was publicly funded, Affymetrix released IGB and its companion graphics library, the Genoviz Software Development Kit (Helt et al., 2009), as open source software in 2004. Since then, we have continued to develop IGB, adding new features and incorporating libraries and code from many other open source projects, e.g. Cytoscape (Shannon et al., 2003). Our first article introducing IGB appeared in 2009 (Nicol et al., 2009) and focused on visualization of tiling array data. Here, we describe new visual analytics and data integration features developed for high-throughput sequencing data. We also introduce a new plug-in application programmers interface (API) for developers to add new functionality, transforming IGB into an extensible visual analytics platform for genomics.
IGB is implemented as a stand-alone, rich client desktop program using the Java programming language. To run IGB, users download and run platform-specific installers; these support automatic updates and include the Java virtual machine, making it unnecessary for users to install (and maintain) Java separately.
IGB’s implementation as a local application rather than as a Web app means that IGB can access the full processing power of the user’s local computer. IGB is always present on the user’s desktop, regardless of internet connection status, but IGB can also use the Web to consume data, as described below. Because IGB runs locally, users view their own datasets without uploading them to a server, which can be important when working with confidential data.
On startup, IGB displays a ‘home’ screen featuring a carousel of images linking the latest versions of model organism, crop plant and reference human genome assemblies. More species and genome versions are available via menus in the Current Genome tabbed panel, including more than seventy animal, plant and microbial genomes. Users can also load and visualize their own genome assemblies, called ‘custom’ genomes in IGB, provided they have a reference sequence.
Once a user selects a species and a genome version, IGB automatically loads the reference gene model annotations for that genome, if available. Annotations are loaded from files in various formats hosted on the main IGB Quickload site, described below. For any given genome version, additional annotations or high-throughput datasets may also be available via the Data Access Panel interface. These datasets can come from different sites, and to highlight this, IGB can display a favicon.ico graphic distinguishing these different data sources. This is why IGB is named ‘integrated’—it integrates data from different sources into the same view. To illustrate, Figure 1 shows an example view of integrated datasets from the human genome. Reference gene model annotations and other datasets from IGB Quickload are shown, together with tracks loaded from a DAS1 server hosted by the UCSC Genome Bioinformatics group.
Genomic datasets span many scales, and fast navigation through these different scales is a key feature for a genome browser. Users need to be able to quickly travel between base-level views depicting sequence details like splice sites, gene-level views depicting the exon-intron organization of genes, and chromosome-level views showing larger-scale structures, like centromeres and chromosome bands. Tools that support fast navigation through the data can accelerate the discovery process.
For this reason, IGB implements a visualization technique called one-dimensional, animated semantic zooming, in which objects change their appearance in an animated fashion around a central line, called the ‘zoom focus’ (Loraine and Helt, 2002). Animation helps users stay oriented during zooming and can create the impression of flying through one’s data (Bederson and Boltman, 1999; Cockburn et al., 2009).
To set the zoom focus, users click a location in the display. A vertical, semi-opaque line called the ‘zoom stripe’ indicates the zoom focus position and also serves as a pointer and guideline tool when viewing sequence or exon boundaries. When users zoom, the display appears to contract or expand around this central zoom stripe, which remains in place, thus creating a feeling of stability and control even while the virtual genomic landscape is rapidly changing.
IGB also supports ‘jump zooming,’ in which a request to zoom triggers instantaneous teleportation to a new location. To jump-zoom, users can double-click an item, click-drag a region in the coordinates track, or search using the Advanced Search tab or the Quick Search box at the top left of the display.
Moving without changing the scale (panning) is also important for fast navigation through data. In IGB, clicking arrows in the toolbar moves the display from left to right, and scrollbars offer ways to move more rapidly. Users can click-drag the selection (arrow) cursor into the left or right border of the main display window to activate continuous pan. For finer-scale control, a move tool cursor enables click-dragging the display in any direction.
IGB helps users ask and answer questions about their data by supporting multiple ways for users to interact with what they see. Selecting data display elements (Glyphs) triggers display of meta-data about the selected item and mouse-over causes a tooltip to appear. Right-clicking items within tracks activates a context menu with options to search Google, run a BLAST search at NCBI, or open a sequence viewer (Fig. 1). Right-clicking track labels activates context menus showing a rich suite of visual analytics tools and functions, some of which we discuss in more detail below.
IGB can consume data from local files or URLs and can read more than 30 different file formats popular in genomics. When users open a dataset, a new track is added to the main display. Users then choose how much data to add to the new track by operating data loading controls. Larger datasets, such as RNA-Seq data, should be loaded on a region-by-region basis, while others that are small enough to fit into memory can be loaded in their entirety, e.g. a BED file containing ChIP-Seq peaks. To load data into a region, users zoom and pan to the region of interest and click a button labeled ‘Load Data.’ To load all data in an opened dataset, they can change the dataset’s loading method by selecting a ‘genome’ load mode setting in the Data Access Panel.
This behavior differs from other genome browsers in that most other tools link navigation and data loading. In other tools, a request to navigate to a new location or change the zoom level both redraws the display and also triggers a data loading operation. Although sometimes convenient for users, this limits the types of navigation interactions a tool can support. This trade-off can be seen in the Integrative Genomics Viewer (IGV), a desktop Java application developed after IGB (Thorvaldsdottir et al., 2013). IGV auto-loads data but restricts movement; for example, it lacks panning scrollbars and does not support fast, animated zooming. IGB, by contrast, prioritizes navigation speed and gives users total control over when data load, which can be important when loading data from distant locations over slower internet connections. IGB gives users control over when such delays might occur, thus making waiting more palatable.
IGB aims to support the scientific discovery process by making it easy for users to document and share results. Taking a cue from Web browsers, IGB for many years has supported bookmarking genomic scenes. In IGB, genomic scene bookmarks record the location, genome version, and datasets loaded into the current view. Users can also add free text notes and thus record conclusions about what they see. Selecting a bookmark causes IGB to zoom and pan to the bookmarked location and load all associated datasets, thus enabling users to quickly return to a region of interest. Users can sort, edit, import and export bookmarks using the Bookmarks tab. Exporting bookmarks creates an HTML file which users can re-import into IGB or open in a Web browser. If IGB is running, clicking an IGB bookmark in a Web browser causes IGB to zoom to the bookmarked location and load associated data.
IGB loads bookmarks through a REST-style endpoint (Fielding and Taylor, 2002) implemented within IGB itself, using a port on the user’s computer. This endpoint allows bookmarks to be loaded from HTML hyperlinks embedded in web pages, spreadsheets, or any other document type that supports hyperlinks. IGB was the first desktop genome browser to implement this technique; for many years, Affymetrix used it to display probe set alignments on their NetAffx Web site (Liu et al., 2003).
IGB supports multiple formats and protocols for sharing dataset collections and integrating across data sources. The most lightweight and easy to use of these is the IGB specific ‘Quickload’ format, which consists of a simple directory structure containing plain text, metadata files. The metadata files can reference datasets stored in the same Quickload directory or in other locations, including Web, ftp sites, or cloud storage resources such as Dropbox and iPlant/CyVerse (Goff et al., 2011). This flexibility makes it possible for a Quickload site to aggregate datasets from multiple locations. The metadata for a Quickload site also includes styling and data access directives controlling data loading strategies and appearance.
Sharing a Quickload site is straightforward. Users simply copy the contents to a publicly accessible location and then publicize the URL, which IGB users then add as a new Quickload site to their copy of IGB. Typically, users copy their Quickload sites to the content directories of Web sites, and IGB supports secure access via the HTTP basic authentication protocol, making it easy to keep datasets private if desired. If users do not require password-protection for their data, they can also use Public Dropbox folders to share Quickload sites.
As ultra high-throughput sequencing of cDNA (RNA-Seq) has supplanted microarrays for surveying gene expression (Ozsolak and Milos, 2011), we added new features to IGB to support visualization of RNA-Seq data. RNA-Seq analysis data processing workflows typically produce large read alignment (BAM) files, which IGB can open and display. IGB also implements many new visual analytics functions that operate on BAM file tracks and highlight biologically meaningful patterns in the data.
Right-clicking a BAM track label opens a context menu listing options to create coverage graphs, called ‘depth’ graphs in IGB. At present, IGB supports two depth graph types: ‘depth graph start’ that counts a read’s first mapped base and ‘depth graph all’ that counts the number of reads overlapping a position. The latter is useful for viewing overall expression at a locus, and the former is useful for investigating sequencing bias. Both graph types are implemented using IGB’s track operations API, which developers can use to add all-new graph generation algorithms as plug-ins, described in Section 2.8.
By comparing depth graphs for multiple samples, users can identify differences in transcript levels. Using the Graph tab, users can place multiple graphs on the same scale, and if sequencing depth is similar, peaks that are the same height and shape reflect similar expression. Figure 2a shows ‘depth graph all’ graphs made from RNA-Seq alignment (BAM) tracks; reads were from sequencing lung cancer samples bearing wild-type or mutant copies of the KRAS oncogene (Kalari et al., 2012). One peak is much taller in the mutant sample, indicating higher expression. The gene is AQP3, encoding a water/glycerol-transporting protein (Hara-Chikuma and Verkman, 2005).
Visual analysis of RNA-seq data can also highlight alternative splicing differences between samples. Figure 2b shows PFN2, which produces two isoforms due to alternative splicing. Here, the depth graphs from Figure 2a were merged into a single track. Comparing the peak discontinuities to the gene models shows that the mutant sample favors the shorter isoform. To further aid splicing analysis, we developed FindJunctions, a visual analytics tool that identifies split read alignments in an RNA-Seq track, uses them to identify potential exon-exon junctions, and then creates an all-new track containing junction features annotated with the number of alignments that supported them. Figure 2b shows an example where FindJunction’s quantification of split reads reinforces the finding that PFN2 is differentially spliced between samples. Viewing the read alignments is also informative. Figure 2c shows a zoomed-in view of the alternatively spliced region from Figure 2b. Read alignments show that the shorter isoform predominates in the sample bearing mutant KRAS. Details on FindJunctions are available from the IGB User’s Guide.
Users can interact with reads and other data displayed in the viewer using selection operations. Clicking a single item selects it, and click-dragging over multiple items selects all of them. Pressing keyboard modifiers while clicking an item adds (SHIFT) or removes (CNTRL) it from the pool of selected items. IGB reports the identity or number of selected items in the upper right corner. In this way, users can identify and count items that match a biologically interesting pattern, e.g. spliced reads that support a junction. As an additional visual cue, when a read or annotation is selected, all items with matching boundaries on either the 5′ or 3′ end are highlighted. This technique, called ‘edge matching,’ aids in visualization of read boundaries and identifying alternative donor/acceptor sites.
Knowing where a transcription factor binds DNA in relation to nearby genes is tantamount to understanding its function, as genes whose promoters are bound by a given transcription factor are likely to be regulated by it. Identifying binding sites of DNA-binding proteins is now routinely done using whole-genome ChIP-Seq, in which DNA cross-linked to protein is immunoprecipitated using antibodies against the protein of interest and then sequenced (Park, 2009). Subsequent data analysis typically involves mapping reads onto a reference genome, identifying regions with large numbers of immunoprecipitated reads, and then performing statistical analysis to assess significance of enrichment. Each step requires users to choose analysis parameters whose effects and importance may be hard to predict. To validate these choices, it is important to view the ‘raw’ alignments and statistical analysis results in a genome browser. As an example, we describe using IGB to view results from a ChIP-Seq analysis done using MACS, a widely used tool (Zhang et al., 2008).
MACS produces a file containing peak locations and significance scores indicating which peaks likely contain a binding site. Opening and loading this file in IGB creates a new track with single-span annotations representing the extent of each peak. At first, they look identical, varying only by length, and it is difficult to distinguish them. To highlight the highly scoring peak regions, IGB offers a powerful ‘Color by’ visual analytics feature that can assign colors from a heatmap using quantitative variables associated with features. For this, users right-click a track, select ‘Color by’ from the context menu, and then operate a heatmap editor adopted from Cytoscape to color-code by score. This makes it easy to identify high scoring regions likely to contain a binding site (Fig. 3a).
ChIP-Seq analysis tools also typically produce WIG-format depth-graph files that report where immunoprecipitated sequences have ‘piled up,’ forming peaks. Loading this WIG file into IGB creates a new graph track that shows how these peaks coincide with regions from the BED file. In IGB, users can move tracks to new locations in the display. As shown in Figure 3b, placing the WIG track above the color-coded BED track makes it easy to observe how taller peaks typically have higher significance values.
IGB’s graph thresholding function, originally developed for tiling arrays, is useful for exploring the relationship between coverage depth and peak score. Available from the Graph tab, thresholding identifies regions in a graph track where consecutive y-values exceed a user-defined value. Users can change this threshold value dynamically using a slider and observe in real-time how this affects the number and extent of identified regions. Users can promote regions to new tracks and save them in various formats. As an example, Figure 3b shows four MACS-detected peaks near the Hs6st1 gene, a regulatory target for the SOX9 transcription factor in mouse (Kadaja et al., 2014). Two peaks exceed the threshold, providing a visual cue that these locations may be most important for regulation.
Sometimes the recognition sequence for a transcription factor being studied is known or can be deduced from the data. IGB offers a way for users to visualize instances of binding site motifs. Using the Advanced Search tab, users can search the genomic region in view using regular expressions. For instance, to search for the motif AGCCGYG (where Y can be C or T) a user would enter ‘AGCCG[CT]G’. In this example, the search found several instances of this motif, with one located in each of the two user-defined significant peaks near the Hs6st1 gene (Fig. 3c).
Whole genome bisulfite sequencing (WGBS) refers to bisulfite conversion of unmethylated cytosines to thymines followed by sequencing (Krueger et al., 2012). This technique can identify methylated sites throughout a genome and reveal potential epigenetic regulation of gene expression. Analyzing bisulfite data involves mapping the reads onto the genome using tools that can accommodate reads where many but not all cytosine residues have been converted. Several such tools are available; here we describe using IGB to visualize output of Bismark (Krueger and Andrews, 2011).
Bismark produces a file containing read alignments and a depth-graph file reporting percent methylation calculated from sliding windows along the genome. Figure 4a shows a Bismark depth graph from an experiment investigating methylation in the model plant Arabidopsis thaliana (Yelagandula et al., 2014). All of chromosome one is shown, and peaks indicate regions of high methylation. This whole chromosome view makes it easy to observe that the centromeric region is highly methylated. From here, users can zoom in to examine methylation at a region, gene, or base pair level.
Figure 4b shows a zoomed-in view of one of the tallest peaks visible in Figure 4a. Similar to the ChIP-Seq data analysis described in the previous section, users can apply the thresholding feature to identify regions of high methylation within a gene. In this example, the threshold for percent methylation is set to 70% or greater, highlighting how the first intron, but not the second and third introns, is highly enriched with methylated cytosines.
Closer examination of aligned reads provides further support of methylation. In IGB, nucleotide residues are color-coded, and users can change these colors. In addition, read alignment tracks can be configured so that only mismatched bases are color-coded. Figure 4c shows a view of bisulfite sequencing data in which cytosines are red and thymines are white. White columns in the read track (labeled WGBS) represent unmethylated cytosines, and marks in the coordinates track indicate cytosine residues in the reference. Regions with many cytosines and few white columns are highly methylated.
Developers have created many genome browser tools, each aiming to meet a need not met by preceding tools. And yet, each new tool has faced similar problems, such as how to consume data from files, how to lay out genomic features and graphs into tracks, and how to support zooming through vast differences in scale. As described above, IGB has solved many of these problems and offers a flexible and fast environment for users to explore the genomic landscape. In order for developers to quickly create new visualizations we transformed IGB into a modular, extensible platform for developers to create and deploy all-new visualizations of genomic data.
The IGB software architecture now resembles other popular open source Java-based projects that support adding new functionality via plug-ins, including Eclipse, the Netbeans Rich Client Platform and Cytoscape (Shannon et al., 2003). This similarity comes from our common use of OSGi, a services based architectural framework and community standard for building modular software. By adopting this framework, we increased IGB extensibility, simplified adding new features, and created a new Apps API that empowers community developers to contribute new functionality without needing deep understanding of IGB internal systems. The Apps API is new, but community developers are already using it to create novel visualizations (Céol and Müller, 2015). Documentation and tutorials are available from the IGB developer’s guide.
IGB offers powerful utilities for viewing, analyzing, and interacting with data within an environment that feels fast, flexible and highly interactive. In the future, we plan to make the IGB’s Quickload data sharing system easier to use by providing tools for building Quickload sites from within IGB, along with a Quickload registry for users to publicize their sites. In addition, we will continue developing and improving the IGB APIs, providing documentation and example Apps to demonstrate how developers can use IGB as a platform to support genomics research.
This work was supported by the National Institutes of Health [R01GM103463 to A.L.].
Conflict of Interest: none declared.