Functional genomic and proteomic techniques enable routine measurement of expression profiles and functional interactions from the cells and tissues of many different organisms1–4. These measurements have significant potential to map cellular processes and their dynamics, given the appropriate computer software to filter and interpret the resulting large amount of data. Commonly used expression analysis methods identify active biological processes from expression profiles by finding enriched gene annotation terms in the lists of differentially expressed genes5–8. By combining expression profiles with cellular network information, including protein–protein and protein–DNA interactions, we can begin to explain the control mechanisms underlying the observed changes in activity of a biological process. For instance, we can identify a transcription factor known to regulate a set of affected genes.
An important benefit of integrating expression and network data is that biologically relevant signals supported by both data types are more likely to be correct than those supported by either data source alone. This is important because expression profiles can be noisy and difficult to reproduce when expression levels are low9, while protein interaction assays are known to contain false positives and negatives. For instance, it is estimated that up to 50% of unfiltered yeast two-hybrid data are spurious10, although this is improving as experimental protocols and automated reliability measures that combine multiple data sets of a given type evolve11,12.
Many sources of expression profiles and cellular networks exist. The Gene Expression Omnibus13
are both large public repositories of gene expression profiles. Protein–protein interactions mapped either by focused studies or by high-throughput techniques are increasingly available in public repositories such as IntAct15
(as reviewed in ref. 18). Protein–DNA interactions mapped at the genome scale using ChIP-Chip and ChIP-Seq technology19 provide potential links between transcription factors and their regulated genes. When information is not available in databases but is in the literature, text-mining techniques can extract functional relationships between recognized genes that, while not always accurate, are useful for analysis in aggregate20. In these networks, two genes are linked if they are frequently mentioned in the same sentence21. This link may indicate a biochemical association, such as catalysis, or a genetic, colocalization or coexpression relationship. Literature association networks are also useful as a general literature search tool, since each link is tied to the supporting publication. These public data repositories are growing rapidly as the underlying measurement technology improves. For example, the HPRD repository more than doubled in size between 2003 and 2005 (ref. 22).
A number of software tools are available for network visualization and analysis, including Osprey23. Each tool has a distinct set of features, which are highlighted in the comparison of network analysis platforms below. Here, we describe the application of Cytoscape within a workflow for integration of functional genomics data with biological networks.
Comparison of network analysis platforms.
Cytoscape is freely distributed under the open-source GNU Lesser General Public License, which allows any use of the software, including feature extension by programming (http://www.gnu.org/licenses/lgpl.html). In Cytoscape, nodes representing biological entities, such as proteins or genes, are connected with edges representing pairwise interactions, such as experimentally determined protein–protein interactions. Nodes and edges can have associated data attributes describing properties of the protein or interaction. A key feature of Cytoscape is its ability to set visual aspects of nodes and edges, such as shape, color and size, based on attribute values. This data-to-visual attribute mapping allows biologists to view multiple types of data at a glance in a network context. Additionally, Cytoscape allows users to extend its functionality by creating or downloading additional software modules known as ‘plugins’. These plugins provide additional functionality in areas such as network data query and download services32–35
; network data integration and filtering12; attribute-directed network layout36,37; Gene Ontology (GO) enrichment analysis7; and network motif38,39, functional module40–42, protein complex43 or domain interaction detection44. Links to these plugins can be found at http://www.cytoscape.org. Altogether, Cytoscape and its plugins provide a powerful tool kit designed to help researchers answer specific biological questions using large amounts of cellular network and molecular profiling information.
Figure 1 The Cytoscape Desktop. The Cytoscape canvas displays network data. The toolbar (top) contains the command buttons. The name of each command button is shown when the mouse pointer hovers over it. The Control Panel (left) displays the Network tree viewer, ...
This protocol is modular in its organization, and the five modules can be followed sequentially or as stand-alone protocols (the modular organization is shown schematically in Figure 2). The first module, ‘Obtain network data’, describes methods to build networks for genes of interest by querying protein interaction databases and text-mining data sources. ‘Explore network and generate layout’ introduces basic aspects of Cytoscape operation, including network navigation and layout. ‘Annotate with attribute and expression data’ shows how to link expression profile data to the network for visualization and analysis. This module uses mRNA expression data as an example, but the steps outlined apply to any form of molecular profile, such as protein levels. ‘Analyze network features’ explains how to run analytical methods that identify putative functional or structural modules within the network that may, for instance, highlight protein complexes active under a profiled experimental condition. Finally, ‘Detect enriched gene functions’ illustrates methods to identify enriched gene functions, such as those characteristic of biological processes, in previously identified sets of interesting genes or network regions.
Figure 2 Outline of the protocol. The steps in red are included in this protocol, while analyses listed in black represent useful alternatives that achieve related goals. Dotted lines represent steps that can be done in any order, while solid lines represent steps ...
These analysis steps have proven useful in multiple studies, such as analyses of networks of genetic interactions45–48, gene regulatory events49,50 and protein–protein interactions51,52, studies of cellular network organization53,54 and the determination of pathways involved in atherosclerosis56. A sample protein network and an mRNA expression profile resulting from gene knockouts that perturb galactose metabolism in S. cerevisiae57 are provided to illustrate the protocol.
We now turn to each of the five modules in detail, presenting the rationale for each portion of the protocol and indicating viable alternative techniques.
Obtain network data
This section describes three ways to import network data into Cytoscape.
The first method is to query protein interaction databases such as cPath33 with a list of genes of interest. cPath queries the IntAct15 databases. This is an appropriate method for users who are interested in assessing the connections between genes with significant experimental responses in well-studied organisms such as S. cerevisiae or Homo sapiens. Cytoscape users can interrogate additional protein interaction databases via MiMi34, plugins that are similar in application to cPath. In each of these cases, following the steps in this workflow section yields a network that contains known and putative functional associations between the genes of interest.
The second method is to build a text-mining association network using the Agilent Literature Search plugin21. This method is most appropriate for those who are working in organisms that are not well represented in protein interaction databases, or who want to restrict the network to associations observed only in specific contexts. In Literature Search, the user builds a set of queries by entering terms, such as gene names, and contexts, such as an organism or disease name. The query set is submitted to selected search engines, for example PubMed or OMIM. The resulting documents are fetched, parsed into sentences and analyzed for known interaction terms, such as ‘binding’ or ‘activate’. Agilent Literature Search uses a lexicon set that defines gene names (concepts) and aliases, drawn from Entrez Gene, and interaction terms (verbs) of interest. An association is extracted for every sentence containing at least two concepts and one verb. Associations are then converted into interactions with corresponding sentences and source hyperlinks, and added to a Cytoscape network. Interaction network and text-mining association network data are complementary: protein interaction databases contain experimentally determined interactions, whereas text-mining association networks contain more general association types and offer an alternative network source where interaction data are limited.
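The extraction rule described above (at least two concepts and one verb per sentence) can be illustrated with a minimal Python sketch. This is not the Agilent Literature Search implementation: the tiny lexicon, the naive sentence splitter and the function names are all hypothetical and exist only to show the rule.

```python
import re

# Hypothetical toy lexicon; the plugin itself draws gene names and aliases from Entrez Gene.
CONCEPTS = {"GAL4": "GAL4", "GAL80": "GAL80", "GAL1": "GAL1"}  # alias -> canonical name
VERBS = {"binds", "activates", "represses", "inhibits"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_associations(text):
    """Return (concepts, verb, sentence) for each sentence with >= 2 concepts and >= 1 verb."""
    associations = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        concepts = []
        for token in tokens:
            canonical = CONCEPTS.get(token.upper())
            if canonical and canonical not in concepts:
                concepts.append(canonical)
        verbs = [t.lower() for t in tokens if t.lower() in VERBS]
        if len(concepts) >= 2 and verbs:
            # Keep the sentence so each association stays tied to its evidence.
            associations.append((tuple(concepts), verbs[0], sentence))
    return associations
```

Keeping the source sentence with each association mirrors the point made above that every extracted interaction remains linked to its supporting text.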
The third method for importing network data into Cytoscape is to import a network file, such as a SIF (Simple Interaction Format) file. The SIF file format is detailed in Box 1. These files are straightforward for a user to create with a standard text editor. SIF file import is the most appropriate method for users who want to focus their analysis on network data identified in advance, such as those who are interested in the impact of the experimental conditions on sets of specific interactions or pathways.
CYTOSCAPE INPUT AND OUTPUT FILE FORMATS
Cytoscape can import and export data in a variety of formats, from simple delimited text formats to XML and other sophisticated formats for sharing data with other programs. This box provides a brief overview of these formats. For complete information, please refer to the Cytoscape manual at http://www.cytoscape.org/.
Network import and export
The standard file that Cytoscape opens and saves is the Cytoscape Session File (.cys). This file stores all information in your current session including multiple network layouts, attribute values and setting information. Cytoscape can also import network data in the following formats:
- Simple Interaction File (SIF or .sif)
- Cytoscape Node and Edge Attribute File Format (.noa and .eda)
- Graph Markup Language (GML or .gml)
- eXtensible Graph Markup Language (XGMML or .xgmml)
- Systems Biology Markup Language (SBML)
- Biological PAthways eXchange (BioPAX)
- Proteomics Standards Initiative Molecular Interaction (PSI-MI) Level 1 and 2.5
- Delimited Text Table
- Excel Workbook (.xls)
For network export, Cytoscape can output SIF, GML, XGMML and PSI-MI formats. Users may also create an image file of their network data via File → Export → Network view as graphics. This feature includes many standard image formats, including JPEG, PNG and PDF.
The SIF format is a straightforward format that allows users to define network data with a text editor. Each line that defines an edge contains three or more tokens. The first token is the source node. The second token is the interaction type. This token is an arbitrary text string that describes the interaction between the two nodes. The third token, and all subsequent tokens on the same line, specify the target nodes. A line containing only a single node name defines a node that has no edges. A sample file might look like this:
nodeA interactionType1 nodeB
nodeA interactionType2 nodeC
nodeF
In this network, nodeA has an edge to nodeB labeled interactionType1; nodeA has an edge to nodeC labeled interactionType2; nodeF is defined but has no edges. In practice, the nodes will be the names of proteins or genes in the network, and the labels given to the interaction type will be some tag that defines that relationship, such as ‘protein-protein’, ‘degrades’ or ‘phosphorylates’. Because of its basic text format, a SIF file is easily created either manually by a user (e.g., in Excel) or programmatically by a text-processing script. The Supplementary Data contain a sample SIF file, galFiltered.sif (Supplementary Data 2).
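As an illustration of how a text-processing script might read this format, the following Python sketch parses SIF text into a node set and an edge list. The function name and the error handling for two-token lines are assumptions made for this example; Cytoscape's own reader may behave differently.

```python
SAMPLE_SIF = """nodeA interactionType1 nodeB
nodeA interactionType2 nodeC
nodeF
"""


def parse_sif(text):
    """Parse SIF text into (nodes, edges); edges are (source, interaction, target) tuples."""
    nodes, edges = set(), []
    for line in text.splitlines():
        # Use tabs as the delimiter if the line contains any, otherwise any whitespace.
        raw = line.split("\t") if "\t" in line else line.split()
        tokens = [t.strip() for t in raw if t.strip()]
        if not tokens:
            continue  # skip blank lines
        if len(tokens) == 2:
            # Assumption for this sketch: a source and interaction with no target is an error.
            raise ValueError(f"malformed SIF line: {line!r}")
        source = tokens[0]
        nodes.add(source)
        if len(tokens) >= 3:
            interaction = tokens[1]
            for target in tokens[2:]:  # every further token is a target node
                nodes.add(target)
                edges.append((source, interaction, target))
    return nodes, edges


nodes, edges = parse_sif(SAMPLE_SIF)
```

Applied to the sample, this yields four nodes (nodeA, nodeB, nodeC and the isolated nodeF) and two edges.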
Other network file formats
The GML format stores information on the network connectivity (like SIF) but also preserves the visual layout and appearance of the network in the Cytoscape view. The XGMML format is the XML extension of the GML format and is generally preferable to GML. These formats are not amenable to human editing. However, once a network is loaded into Cytoscape from a (human-edited) SIF file, the attributes and visual style settings can be modified using the GUI interface. Then the network can be stored along with its visual properties and data attributes using the GML or XGMML format. Cytoscape can also read files in the SBML, BioPAX, and PSI-MI data exchange formats, allowing the use of networks created in other programs.
Attribute file formats
Data attributes on nodes and edges can be imported from delimited text files or Excel spreadsheets via Cytoscape’s Table Import functionality. Text files are given the filename suffix .noa for Node Attribute and .eda for Edge Attribute. For a node attribute, a sample .noa file might appear like this:
AttributeName
nodeA = value1
nodeB = value2
nodeC = value1
The first line is the name of the attribute, for example, ‘SubcellularLocation’. Each subsequent line contains the name of the node, an equals sign and the node’s attribute value. Attribute values can be numeric or textual, such as ‘4.12’ or ‘nucleus’. An edge attribute file (.eda) is formatted similarly, where the edge name is specified by the source node, the interaction type in parentheses and the target node. For example:
AttributeName
nodeA (interactionType) nodeB = 0.56
nodeB (interactionType) nodeC = 0.918
nodeB (interactionType) nodeA = 0.3412
Each attribute is stored in a separate file.
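A short Python sketch of reading such attribute files is given here for orientation. It assumes the layout described in this box (attribute name on the first line, then one `name = value` pair per line); the function names are hypothetical and this is not Cytoscape's parser.

```python
import re


def _coerce(value):
    """Return a float when the text is numeric, otherwise the original string."""
    try:
        return float(value)
    except ValueError:
        return value


def parse_attributes(text):
    """Parse .noa/.eda text into (attribute_name, {node_or_edge_name: value})."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    name, values = lines[0], {}
    for ln in lines[1:]:
        key, sep, value = ln.partition("=")  # split on the first equals sign only
        if not sep:
            raise ValueError(f"missing '=' in line: {ln!r}")
        values[key.strip()] = _coerce(value.strip())
    return name, values


def parse_edge_name(edge_name):
    """Split 'source (interaction) target' into its three parts."""
    match = re.fullmatch(r"(\S+)\s+\((.+)\)\s+(\S+)", edge_name)
    if match is None:
        raise ValueError(f"not an edge name: {edge_name!r}")
    return match.groups()
```

For example, `parse_attributes("Score\nnodeA (pp) nodeB = 0.56\n")` gives the attribute name `Score` and a single edge entry whose key can be passed to `parse_edge_name`.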
Expression data file format
Cytoscape reads expression data from tab-delimited text files that can be exported from a spreadsheet program or created by the user in a text editor. These files must be given the extension ‘.mrna’ or ‘.pvals’ to be recognized by Cytoscape; either extension works equally well. The data are organized as a matrix, with each row representing the expression results for one gene/protein in the network. The first row provides column labels. The first column holds the gene/protein identifier, while the second column contains arbitrary text, such as a descriptive annotation. The subsequent columns contain expression data: one experiment per column, with the experiment names provided in the first row. If the expression data consist of one measure per gene per experiment, then there is one column per experiment. If the data also contain P-values or other significance measures, as generated by expression analysis packages such as RMA73, then each experiment is represented with two columns: the first is assumed to contain the expression measure, the second contains the significance measure, and the two columns must have exactly the same label.
In an example row, GeneA identifies a gene, labelA provides a descriptive name, valueA_1 contains its expression level as measured in Experiment1, valueA_2 contains its expression level as measured in Experiment2, and pvalueA_1 and pvalueA_2 contain P-values (or other measures of significance of differential expression) for Experiment1 and Experiment2, respectively. The P-value columns are optional for most Cytoscape functionality. However, to identify active modules with the jActiveModules plugin, the expression data must contain P-values of significance ranging between 0 (most significant) and 1 (least significant). The Supplementary Data contain a sample expression data file, galExpData.pvals (Supplementary Data 1
). illustrates the first lines of this file.
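The matrix layout described in this box can be sketched in Python as follows. The sketch assumes that all expression columns come first and are followed by the significance columns carrying the same labels, which matches the order valueA_1, valueA_2, pvalueA_1, pvalueA_2 given above but should be checked against the sample file. The function name and header labels are hypothetical.

```python
import csv
import io


def parse_expression(text):
    """Parse tab-delimited expression text into (experiment_names, {gene: record})."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    labels = header[2:]  # columns after the identifier and description
    half = len(labels) // 2
    # Assumption: significance columns repeat the expression labels, in the same order.
    has_pvalues = half > 0 and len(labels) % 2 == 0 and labels[:half] == labels[half:]
    experiments = labels[:half] if has_pvalues else labels
    n = len(experiments)
    data = {}
    for row in body:
        values = [float(v) for v in row[2:]]
        data[row[0]] = {
            "description": row[1],
            "expression": dict(zip(experiments, values[:n])),
            "pvalues": dict(zip(experiments, values[n:])) if has_pvalues else {},
        }
    return experiments, data


SAMPLE = (
    "GENE\tDESCRIPTION\tExperiment1\tExperiment2\tExperiment1\tExperiment2\n"
    "GeneA\tlabelA\t0.5\t-1.2\t0.01\t0.2\n"
)
experiments, data = parse_expression(SAMPLE)
```

Files without significance columns are handled by the same function, in which case the `pvalues` entry is left empty.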
In addition to the import methods described in this protocol, Cytoscape users can also import pathways from repositories such as KEGG58 via the PSI-MI, BioPAX or SBML data exchange formats (as reviewed in ref. 60), although such pathway data contain non-pairwise interactions between molecules. For instance, there is a single interaction between multiple substrates and products in a biochemical reaction, which must be mapped to pairwise interactions in Cytoscape.
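One simple way to map a multi-participant reaction to pairwise interactions is to link every substrate to every product. The Python sketch below shows this under that assumption; it is only one possible mapping, not necessarily the one Cytoscape's importers apply, and the function name and interaction label are hypothetical.

```python
from itertools import product


def reaction_to_pairs(substrates, products, interaction="reaction"):
    """Expand one reaction into (substrate, interaction, product) edges."""
    return [(s, interaction, p) for s, p in product(substrates, products)]


# Example: a reaction with two substrates and two products becomes four edges.
pairs = reaction_to_pairs(["glucose", "ATP"], ["glucose-6-phosphate", "ADP"])
```

An alternative design represents the reaction itself as a node connected to each participant, which keeps the grouping of substrates and products visible at the cost of adding nodes that are not molecules.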
After importing interaction data, the user may optionally filter the resulting network to reduce its size by selecting just the types of network information of interest. For instance, the CABIN plugin12 enables the user to merge network data from multiple experimental sources and select the interactions observed multiple times, which are likely more reliable than those observed once.
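The idea of keeping interactions supported by several sources can be sketched in a few lines of Python. This is a generic illustration and does not reproduce CABIN; the function names and the treatment of interactions as undirected pairs are assumptions.

```python
from collections import defaultdict


def merge_sources(sources):
    """Map each undirected node pair to the set of source names that report it.

    `sources` is a dict of {source_name: iterable of (nodeA, nodeB) pairs}.
    """
    support = defaultdict(set)
    for name, interactions in sources.items():
        for a, b in interactions:
            support[frozenset((a, b))].add(name)  # order of a and b is ignored
    return dict(support)


def keep_supported(support, min_sources=2):
    """Keep only the pairs reported by at least `min_sources` distinct sources."""
    return {pair: names for pair, names in support.items() if len(names) >= min_sources}
```

Counting distinct sources rather than raw observations means that a pair listed twice in the same data set still counts only once.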
Explore network and generate layout
This section introduces the user to basic aspects of Cytoscape operation, including network navigation and layout. Beyond basic operation, simple filter features are available to identify genes that may be important by virtue of their high number of connections in the network61. Essential genes are also associated with nodes that occupy central positions in large interaction networks, particularly those frequently found to connect other nodes together62. This section uses the yFiles layout algorithm, although many other layout algorithms exist, some of which support more specialized operations.
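Outside the Cytoscape interface, the same idea of flagging highly connected nodes can be expressed as a short Python sketch. The edge tuples follow the (source, interaction, target) form of a SIF file; the function names and the degree threshold are hypothetical choices for illustration.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count distinct neighbors per node, ignoring self-loops and duplicate edges."""
    neighbors = defaultdict(set)
    for source, _interaction, target in edges:
        if source != target:
            neighbors[source].add(target)
            neighbors[target].add(source)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}


def hubs(edges, min_degree):
    """Return nodes with at least `min_degree` neighbors, most connected first."""
    degrees = node_degrees(edges)
    selected = [n for n, d in degrees.items() if d >= min_degree]
    return sorted(selected, key=lambda n: (-degrees[n], n))
```

Nodes that appear only in self-loops or without any edge are not reported by this sketch, since they have no neighbors to count.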
Annotate with attribute and expression data
Expression profiles can provide powerful insights into cellular state and dynamics when integrated with network data. For example, if two nodes consistently show similar changes in expression levels, or they consistently show changes in opposite directions, then one gene might regulate the production of the other. Also, a group of connected nodes characterized by large fold changes may represent a signal-transduction cascade or protein complex that is repressed or induced under the experimental condition63. This section shows how to apply mRNA expression data to a network as an example, but the steps outlined apply to any form of molecular profile data.
Analyze network features
This section explains how to run analysis methods that identify putative functional or structural modules within the network, which may highlight protein complexes active under a profiled experimental condition. The complex topologies of interaction networks make such modules difficult to identify by eye. The jActiveModules plugin42 automates this analysis and identifies connected sections of the network in which the nodes have significant P-values. Such a section indicates a group of nodes that may be coregulated, suggesting a module whose activity is influenced by the experimental context of the expression data. A complex is a module in which the macromolecules involved form a structure, called a molecular machine, that executes some function. Densely connected regions in protein interaction networks tend to correspond to protein complexes. The MCODE plugin43 identifies putative complexes by finding regions of significant local density. At this stage, the user could also identify over-represented network motifs with the Metabolica or NetMatch plugins39, although these alternatives are not covered here.
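The notion of a densely connected region can be made concrete with the basic graph density of a node set, that is, the fraction of possible pairs that are actually linked. The Python sketch below computes only this quantity; MCODE's own scoring is more elaborate and is not reproduced here, and the function name is hypothetical.

```python
def density(nodes, edges):
    """Fraction of possible undirected pairs among `nodes` that are connected.

    `edges` is an iterable of (nodeA, nodeB) pairs; self-loops, duplicates and
    edges leaving the node set are ignored.
    """
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    present = {
        frozenset((a, b))
        for a, b in edges
        if a != b and a in nodes and b in nodes
    }
    return len(present) / (n * (n - 1) / 2)
```

A fully connected triangle has a density of 1.0, whereas adding a node attached by a single edge lowers the density of the four-node set to 4/6.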
Detect enriched gene functions
The function of interesting gene sets or network regions can be summarized by finding significantly enriched functional annotation terms. This may be used to support module or complex predictions. This section outlines steps required to identify enriched GO processes with the BiNGO plugin7. Users can optionally use the GOlorize plugin36 to refine the network layout according to selected GO classes.
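A hypergeometric test is a common choice for this kind of over-representation analysis. The Python sketch below computes the probability of seeing at least the observed number of annotated genes by chance. Whether this matches the test and settings selected in BiNGO for a given analysis is an assumption, the function name is hypothetical, and the multiple-testing correction that is normally applied across GO terms is not shown.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the reference set, K: reference genes annotated with the term,
    n: genes in the selected set, k: selected genes annotated with the term.
    """
    upper = min(n, K)
    # Sum the probability of every outcome at least as extreme as the one observed.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)
```

For instance, if 3 of 10 reference genes carry a term and all 3 genes in a selected set of 3 carry it, the call `hypergeom_pvalue(3, 3, 3, 10)` evaluates to 1/120.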
Cytoscape can be used in many additional visualization and analysis workflows, including published protocols on the reverse engineering of regulatory64 networks, and can be extended using Java programming to implement new features and analysis methods. Over 40 excellent plugins contributed by many software development groups provide examples. While these topics are not covered here, Cytoscape has a community of software developers and users who can answer questions about potential new Cytoscape uses, as detailed at http://www.cytoscape.org.