We propose a solution called the University of California, Santa Cruz (UCSC) Interaction Browser (IB) that provides several features for viewing biological data sets overlaid on genetic pathways. The IB is a web portal to explore evidence pertaining to the regulatory interactions present in a genetic pathway (Figure 1). The browser takes two main inputs: (i) a network of interactions among gene products and (ii) ‘omics’ data measuring evidence on multiple genes across multiple samples.
Figure 1. Overview of the UCSC IB workflow. Users select or supply (i) a network of interactions (left box) and (ii) a data set for viewing (right box). The data are viewed for a selected set of genes in the main panel using a CircleMap display (center box). Data ...
Interaction data sets can be selected from a backend database of networks or supplied by the user. A genetic network is displayed as a graph with nodes and edges. A user can select a set of genes representing a pathway of interest and a set of networks to explore known functional interactions documented among the selected genes. Molecular entities such as genes, gene products, complexes, families, small molecules and abstract cellular processes can be rendered. Genes are keyed based on HUGO gene symbols to facilitate the communication of protein-level information across a variety of functional networks, genetic pathways, as well as measurement platforms such as microarrays, RNA-seq, SNP-chip copy number and proteomics measurements.
Currently, the IB contains 929 curated pathways available for viewing including those ingested from Reactome (6), BioCarta (http://www.biocarta.com/) and NCI-PID (8) (Supplemental Table S1). In addition, the IB currently contains 59 networks predicted from functional genomics sources such as those derived from protein interaction databases including BioGRID (14) (19 networks) or extracted from the meta-analysis of many gene expression data sets to infer co-expression networks (Supplemental Table S2). The IB provides different visual styles to display multiple types of linkages in networks or pathways. Directed edges allow viewing curated regulatory and signaling interactions. Arrowheads are used for activating actions and lines ending with a small perpendicular line segment (‘T-bars’) are used for inhibitory actions.
The user interface (UI) is geared toward the display of smaller focused pathways, for example, those containing 10–100 genes. However, the IB also provides access to much larger networks that can be drawn upon. For example, the UCSC Superimposed Pathway (‘SuperPathway’) combines the interactions collected from NCI-PID, BioCarta and Reactome into one network currently containing 32 000 directed and undirected interactions among over 7000 proteins (15). These larger pathways can be loaded into the background, allowing smaller subnetworks of interest to be brought into the display using filtering steps. This allows users to query the background network while avoiding the display of massively complex interaction sets that can tax browser responsiveness.
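The filtering step described above can be pictured as extracting the subnetwork induced by a user's gene selection. The following is a minimal sketch under stated assumptions, not the IB's actual implementation: the `Interaction` record, the `induced_subnetwork` helper and the four toy edges are hypothetical names and data chosen only for illustration.

```python
from typing import Iterable, List, NamedTuple


class Interaction(NamedTuple):
    """One edge of a background network (hypothetical record layout)."""
    source: str  # HUGO symbol of the first gene
    target: str  # HUGO symbol of the second gene
    kind: str    # e.g. "activates", "inhibits"


def induced_subnetwork(background: Iterable[Interaction],
                       genes: Iterable[str]) -> List[Interaction]:
    """Keep only interactions whose two endpoints are both selected."""
    keep = set(genes)
    return [e for e in background if e.source in keep and e.target in keep]


# Toy background network; in practice this could hold tens of thousands of edges.
background = [
    Interaction("TP53", "MDM2", "activates"),
    Interaction("MDM2", "TP53", "inhibits"),
    Interaction("TP53", "CDKN1A", "activates"),
    Interaction("MYC", "CDK4", "activates"),
]

# Only the small subnetwork is handed to the display layer.
subnet = induced_subnetwork(background, ["TP53", "MDM2"])
for edge in subnet:
    print(edge.source, edge.kind, edge.target)
```

Because only the filtered edge list reaches the display, the cost of rendering depends on the size of the selection rather than on the size of the background network.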
To enable identifying high-confidence relations, multiple interaction networks can be used for filtering or display simultaneously. Borrowing a convention from the UCSC Genome Browser (16), different networks of interactions are available as overlays onto a set of genes as separate network ‘tracks’. If an interaction between two genes in the current display is present in a selected track, it is rendered using colors that distinguish the database source of the interaction. In this way, several network tracks can be visualized at once to investigate the degree to which any functional linkage between genes is supported by multiple data sets and platforms, thereby raising the confidence level of particular links. For example, regulatory interactions from transcription factors (TFs) to targets (directed) can be displayed alongside physical protein–protein interactions (undirected) detected from yeast two-hybrid or co-immunoprecipitation assays. In this context, the protein–protein links between TFs can help identify putative transcription factor complexes that share common targets, which in turn may be connected by co-expression–derived linkages (undirected).
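One way to make the idea of multi-track support concrete is to count, for each gene pair, how many selected tracks contain a link between the two genes. The sketch below is illustrative only: the track names, the placeholder gene labels and the `edge_support` helper are assumptions, and edge direction is deliberately ignored so that directed and undirected tracks can be compared.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def edge_support(
    tracks: Dict[str, Iterable[Tuple[str, str]]]
) -> Dict[FrozenSet[str], Set[str]]:
    """Map each unordered gene pair to the set of tracks that contain it."""
    support: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for track_name, edges in tracks.items():
        for gene_a, gene_b in edges:
            # frozenset makes (A, B) and (B, A) the same key.
            support[frozenset((gene_a, gene_b))].add(track_name)
    return support


# Hypothetical tracks with placeholder gene labels.
tracks = {
    "tf_targets": [("TF1", "GENE_A"), ("TF2", "GENE_A")],
    "ppi": [("TF1", "TF2")],
    "coexpression": [("GENE_A", "TF1"), ("GENE_A", "GENE_B")],
}

support = edge_support(tracks)

# Links present in two or more tracks are candidates for higher confidence.
multi = {pair: names for pair, names in support.items() if len(names) >= 2}
for pair, names in multi.items():
    print(sorted(pair), sorted(names))
```

In this toy input, only the TF1 and GENE_A pair is supported by more than one track, which is the kind of link a user would see drawn in several source colors at once.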
The IB provides a portal to visualize publicly available genomics data sets, functional genomics data sets, phenotype data on samples and outcome data on patients. Currently, 22 cancer genomics data sets from TCGA are available, where each contains multiple different measurement platforms including copy number abnormalities (CNA), DNA methylation patterns, gene expression levels and protein levels from reverse phase protein arrays. Data summarized at the gene level (Level 3 data) as well as associated clinical information on patient outcomes and relevant subtypes are obtained from the TCGA Data Repository (http://tcga-data.nci.nih.gov/tcga/). TCGA data are also ingested from the Broad Institute’s Firehose pipeline (http://www.broadinstitute.org/cancer/cga/Firehose) that provides higher levels of interpretive results such as significantly mutated genes, focal regions of amplification from GISTIC (17) analysis of CNA and pathway-level inferences from the UCSC-developed PAthway Recognition Algorithm using Data Integration on Genomic Models (PARADIGM) engine. The IB is particularly well suited for viewing inferences from PARADIGM’s integrative analysis. Briefly, PARADIGM was developed to infer the activities of genes in the context of pathways by integrating any number of functional genomic data sets in a patient sample (18). Other cancer genomics data sets are also available through the IB including several breast cancer studies, several lung, blood, skin, brain, ovarian, pancreatic cancer tracks, two COSMIC (19) tracks, NCI60 cell line data, the Connectivity Map (20) data sets, and the Cell Line Encyclopedia (21) data measuring expression and drug response on different cancer cell lines. A description of the collection can be found in Supplemental Table S3.
Users can upload their own data matrices to the IB for viewing. The input file format is a matrix of scores in which each row holds the data for a gene and each column holds the data for a sample. The first column identifies the gene, preferably by its HUGO name. The first row lists the column names, usually unique sample identifiers. Once the data matrix has been uploaded, a custom checkbox can be used to select the data set, which remains private to the user’s current IB session. The IB also allows users to provide their own files for gene sets and networks. CircleMaps are generated through an HTTP GET mechanism, and the IB webpage describes how this API may be accessed programmatically to query the IB data sets. Open source Python scripts are freely available from the IB website for generating CircleMaps outside of the IB, which may be more convenient for users working with their own data set files.
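As an illustration of the matrix layout described above, the following sketch reads a small gene-by-sample score matrix. It is not part of the IB scripts: the tab delimiter, the use of `NA` or an empty cell for missing values, and the `read_score_matrix` helper are all assumptions made for this example, since the text specifies only the row and column roles.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO, Tuple


def read_score_matrix(
    handle: TextIO, delimiter: str = "\t"
) -> Tuple[List[str], Dict[str, List[Optional[float]]]]:
    """Parse a matrix whose first row names samples and first column names genes."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # header[0] is the label of the gene column
    scores: Dict[str, List[Optional[float]]] = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"{gene}: expected {len(samples)} values, got {len(values)}"
            )
        # Assumed convention: "NA" or an empty cell marks a missing score.
        scores[gene] = [None if v in ("", "NA") else float(v) for v in values]
    return samples, scores


# Toy file contents with hypothetical sample identifiers.
example = (
    "gene\tsample_1\tsample_2\tsample_3\n"
    "TP53\t0.5\t-1.2\tNA\n"
    "MYC\t2.0\t0.1\t0.7\n"
)

samples, scores = read_score_matrix(io.StringIO(example))
print(samples)
print(scores["TP53"])
```

Checking each row's length against the header before upload catches the most common formatting problem, a row with a missing or extra cell, early.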
CircleMaps: dynamic, coordinated viewing
Pathways and networks can enrich the viewing of high-throughput data sets by focusing the exploration of data on established or predicted gene regulatory logic. One of the main IB features is the introduction of the ‘CircleMap’ concept for viewing omics data sets. A CircleMap displays multiple data sets as nested rings pivoted around each protein. Each ring represents measurements of a gene property across any number of samples. Rings are composed of a series of colored ‘spokes’, where each spoke represents one sample in the data set (e.g. cell line or tumor sample). All of the data for a particular sample, oriented for a particular gene, can be viewed as one radiating spoke. Importantly, CircleMaps for multiple genes are coordinated so that a sample is located at the same angular position, allowing the results to be easily traced across an entire pathway.
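The geometry behind this coordination can be sketched in a few lines. The example below is a hypothetical layout calculation, not the IB's rendering code: it assumes that samples are given equal-width wedges around the circle and that rings have equal thickness, and the function names and default radii are illustrative.

```python
from typing import List, Tuple


def spoke_angles(n_samples: int) -> List[Tuple[float, float]]:
    """Return the (start, end) angle in degrees of each sample's wedge."""
    width = 360.0 / n_samples
    return [(i * width, (i + 1) * width) for i in range(n_samples)]


def ring_radii(n_rings: int, inner: float = 10.0,
               ring_width: float = 5.0) -> List[Tuple[float, float]]:
    """Return the (inner, outer) radius of each nested ring, innermost first."""
    return [
        (inner + k * ring_width, inner + (k + 1) * ring_width)
        for k in range(n_rings)
    ]


# Four samples and two data sets (rings) per gene.
angles = spoke_angles(4)
radii = ring_radii(2)

# The angles depend only on the sample index, never on the gene, so the
# same sample falls at the same angular position in every CircleMap.
for sample_index, (start, end) in enumerate(angles):
    print(f"sample {sample_index}: {start:.0f} to {end:.0f} degrees")
```

A spoke for one sample is then the stack of wedges that share one angular interval across all rings.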
Figure 2 compares a CircleMap to a heatmap to illustrate the complementary strength added by the CircleMap. The heatmaps for two theoretical data sets on the same samples (columns) are shown, one containing DNA methylation levels for CpG islands near the promoters of a set of genes (left matrix) and another containing gene expression levels for the same set of genes (right matrix). Such multidimensional data sets are becoming common, especially from national and international consortia. The two-dimensional clustered matrix display of the heatmaps allows one to see the dominant patterns in the data. For example, most genes have increasing DNA methylation and decreasing mRNA expression (left to right orientation in the heatmap), an anti-correlation relationship that is expected because DNA methylation tends to silence gene promoters. However, the heatmap view can overlook specific relations between a subset of interacting genes. In this contrived example, gene A’s product inhibits the expression of gene B. While the DNA methylation profiles of the genes are positively correlated, leading to the co-clustering of genes A and B, the mRNA expression levels of the two are anti-correlated and therefore gene B is sorted far from gene A in the mRNA heatmap. Identifying the presence of this confirmed regulatory relationship is therefore problematic in the heatmap because a user’s eye has to relate one or a few gene vectors, embedded in a much larger collection, to other vectors across long visual distances. On the other hand, the CircleMap view readily reveals this kind of relationship using a single sort of the samples based on gene A’s mRNA expression levels.
Figure 2. A toy example illustrating relationship of a CircleMap to a standard heatmap. (A) Left matrix represents DNA methylation data; right matrix mRNA expression data for the same genes (rows) and 10 samples (columns) as for the DNA methylation data. Two genes ...
Several fundamental CircleMap operations including spoke coordination, spoke sorting and spoke aggregation provide simple but powerful view modalities. First, all of the nested rings in the display are coordinated such that every gene maintains the same spoke order for the samples. Second, by selecting any ring, the spokes are sorted in ascending or descending order according to the data values of the ring and this order is propagated to all other rings. Third, samples can be grouped together by disease subtypes, tissues, common phenotypes and patient outcomes. All of the values within a group can be averaged together and displayed as a ring segment. The combination of these operations allows the eye to pick up trends and detect correlations between particular genes that exist within important subsets.
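The three operations can be sketched on plain lists, where each list is one ring and each position is one sample. This is a minimal illustration under stated assumptions rather than the IB's code: the helper names, the toy expression values and the subtype labels are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def sort_order(ring: Sequence[float], descending: bool = False) -> List[int]:
    """Spoke sorting: sample indices ordered by the selected ring's values."""
    return sorted(range(len(ring)), key=lambda i: ring[i], reverse=descending)


def apply_order(ring: Sequence[float], order: Sequence[int]) -> List[float]:
    """Spoke coordination: lay out any ring in the shared sample order."""
    return [ring[i] for i in order]


def aggregate(ring: Sequence[float], groups: Sequence[str]) -> Dict[str, float]:
    """Spoke aggregation: average the values of the samples in each group."""
    buckets: Dict[str, List[float]] = {}
    for value, group in zip(ring, groups):
        buckets.setdefault(group, []).append(value)
    return {group: mean(values) for group, values in buckets.items()}


# Toy rings for two genes measured on the same four samples.
gene_a_expr = [3.0, 1.0, 2.0, 4.0]
gene_b_expr = [1.0, 4.0, 3.0, 0.5]
subtypes = ["basal", "luminal", "luminal", "basal"]  # hypothetical labels

# Sort on gene A's ring, then propagate that order to gene B's ring.
order = sort_order(gene_a_expr)
sorted_a = apply_order(gene_a_expr, order)
sorted_b = apply_order(gene_b_expr, order)

# Collapse gene A's ring into one averaged segment per subtype.
group_means = aggregate(gene_a_expr, subtypes)

print(sorted_a)     # ascending values for gene A
print(sorted_b)     # gene B in the same sample order
print(group_means)
```

With this toy input, sorting on gene A leaves gene B's values in strictly decreasing order, which is how an anti-correlated pair would show up as opposing color gradients around two neighboring CircleMaps.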