In sections 3.1 and 3.2 we provide two examples of pathway topologies converted to gene networks by graphite, while in section 3.3 we show the core functions to retrieve, convert and display graphite networks. In section 3.4 we perform a simulation study to verify the efficacy of our signal propagation strategy in terms of topological analyses. In section 3.5 we run an example of topological gene set analysis using graphite networks and in section 3.6 we critically compare graphite with other available R/Bioconductor packages providing pathway topologies.
3.1 Pathway conversion example 1: Insulin signaling pathway
Figure represent an example in which the simple elimination of compounds leads to an incorrect network topology.
Insulin is an hormone controlling the balance between mobilization and storage of energy molecules. Insulin binds the Insulin Receptor (IR) and through phosphorilation of the IRS adaptors is able to recruit and activate PI3K. PI3K is a kinase that converts PIP2 in PIP3 which is a secondary messenger involved in the regulation of various processes. The conversion between PIP3 into PI(3,4)P2 or PI(4,5)P2 operated by phosphatases like SHIP1/2 or PTEN induce a depletion of PIP3 levels and of consequence a reduced activity on its downstream targets [
29].
PIP3 associates with the inner lipid bilayer of the plasma membrane to promote the recruitment of proteins with pleckstrin homology (PH) domains, like PDPK and AKT, which is a crucial mediator of various cell process, such as apoptosis, cell cycle, protein synthesis, regulation of metabolism [
30].
Among other functions, AKT activates also the cyclic nucleotide phosphodiesterases (PDEs), that is a group of enzymes able to regulate the localization, duration, and amplitude of the cyclic nucleotides. Signaling PDEs are therefore important regulators of signal transduction mediated by these second messenger molecules [
31]. In this pathway, PDE, depleting cAMP, indirectly inhibits the PKC mediated phosphorilation, and the activation of LIPE that is a lipase able to mobilize lipid energy stores. PDE acts, in this way, as a anti-lipolytic agents [
32].
This hormonal mediated signaling cascade, from the insulin receptor to the inhibition of HSL, involves two "second messenger" compounds (PIP3 and cAMP) crucial for the transduction of the signal.
In panel A of Figure we report a part of the insulin signaling pathway of KEGG (hsa4910) that contains three groups OR (PDE3, AKT and PKA), and two compound mediated interactions (through PIP3 and cAMP). This is a clear examples of a signal cascade in which the propagation of the signal through compounds is crucial to keep the whole signaling path. In panel B we report graphite reconstructed signal cascade while in panel C the KEGGgraph partially reconstructed signal.
An extract of the
xml file corresponding to the signal reported in Figure of the main text is present in Additional file
1. From the
xml definition it is evident how entry 2 (SKIP) and entry 3 (SHIP) are linked to compound 15 (PIP3) while there is no direct interaction between compound 15 (PIP3) and entry 62 (PDK1/2). This is why
KEGGgraph misses the signal, while
graphite captures it by splitting the relation between entry 52 (protein complex P13K) and 62 (PDK1/2) through compound 15 (PIP3) into both 52 to 15 and 15 to 62. This dissection allows the reconstruction of the signal, otherwise impossible.
3.2 Pathway conversion example 2: catalysis and cleavage of Notch 1 by Gamma Secretase Complex
We selected the reaction 1784.3 from the Reactome pathway called "A third proteolytic cleavage releases NICD". Gamma secretase is a multi-subunit protease complex, itself an integral membrane protein, that cleaves single-pass transmembrane proteins at residues within the transmembrane domain. Here represented the processing of the Notch 1 protein. The gamma-secretase complex is composed of Presenilin homodimer (PSEN1 variant 1 or 2 or 3 or 4 or 5 and PSEN2 variant 1 or 2), Nicastrin (NCSTN variant 1 or variant 2), APH1 (APH1A or APH1B) and PEN2. Maturation of the Notch receptor involves a cleavage of the protein, the intracellular domain is liberated from the plasma membrane that can enter into the nucleus to engage other DNA-binding proteins regulating gene expression. The cleavage is catalyzed and performed by Gamma Secretase Complex.
Figure shows Reactome representation of the reactions (Panel A), the BioPax information as it is stored in owl model and in Cytoscape plug-in for BioPax (respectively panel B and C) and the graphite final network (panel D). In the graphite network the nodes are annotated using the XRefs informations while edges preserve the type of the reaction annotated the OWL model. Distinction between OR complexes (formed by all the possibile variants of each protein) nested inside the AND complex of the Gamma secretase are topologically preserved in the resulting graph.
3.3 graphite functions
To access the Reactome, KEGG, Biocarta and NCI databases graphite uses respectively the lists reactome, kegg, biocarta and nci. A pathway network can be retrieved from one of the lists using the name of the pathway,
> names(biocarta)[1:3]
[1] "acetylation and deacetylation of rela in nucleus"
[2] "actions of nitric oxide in the heart"
[3] "activation of camp-dependent protein kinase pka"
> biocarta[["ras signaling pathway"]]
"ras signaling pathway" pathway from BioCarta
Number of nodes = 18
Number of edges = 22
Type of identifiers = native
Retrieved on = 2011-05-12
or its position in the list of pathways:
> p <- biocarta [[175]]
> p@title
[1] "ras signaling pathway"
In the network, nodes represent genes and edges functional relationships among them. Nodes can have heterogeneous IDs (according to the pathway original annotation) and edges can be characterized by multiple functional relationships.
The function pathwayGraph builds a graphNEL object from a pathway p:
> g <- pathwayGraph(p)
> g
A graphNEL graph with directed edges
Number of Nodes = 18
Number of Edges = 23
> edgeData(g) [1]
$ 'EntrezGene:10928 | EntrezGene:5879'
$ 'EntrezGene: 10928 | EntrezGene:5879 '$weight
[1] 1
$ 'EntrezGene:10928 | EntrezGene:5879 '$edgeType
[1] "catalysisOut (ACTIVATION)"
The function converterIdentifiers allows the user to map such variety of IDs to a single type (Entrez-Gene or Gene Symbol). For the ID conversion graphite uses the data provided by the Bioconductor package org.Hs.eg.db. This mapping process, however, may lead to the loss of some nodes (not all identifiers may be recognized) and has an impact on the topology of the network (one ID may correspond to multiple IDs in another annotation or vice versa).
> pEntrez <- convertIdentifiers (p, "entrez")
> pEntrez
"ras signaling pathway" pathway from BioCarta
Number of nodes = 20
Number of edges = 20
Type of identifiers = Entrez Gene
Retrieved on = 2011-05-12
> nodes(pEntrez)[1:10]
[1] "10928" "1147" "3265" "387" "4303" "5295" "572" "5879" "5894" "5898"
Several pathways have a huge number of nodes and edges, thus there is the need of an efficient system of visualization. To this end
graphite uses the
Rcytoscape package to export the network to Cytoscape [
27]. Cytoscape is a Java based software specifically built to manage biological network complexity and for this reason it is widely used by the biological community. Run Cytoscape with RPC plugin enabled and type at the R command prompt:
> cytoscapePlot(convertIdentifiers(a$'toll-like receptor pathway', "symbol"))
The network will be automatically loaded into Cytoscape. See Figure for the result of this operation.
3.4 Simulation study: compound propagated signal improves topological analysis
In order to verify our signal propagation strategy we perform a simulation study. Using the insulin signaling pathway of the KEGG database we select as differentially expressed 22 genes lying on the signal paths highlighted in Figure . These genes are connected if propagation is employed, otherwise they are disconnected (see Figure and for propagation and non-propagation respectively). We expect that propagation will lead to better results in terms of topological analyses.
Our simulation is based on the following steps: 1) we randomly generate
μFC ~
U (2, 10); 2) we randomly generate log fold change values (
δi for
i = 1,..., 22) of the differentially expressed genes as
δi ~
N (
μFC, 2) (interactions of the signal paths selected are characterized all by activation, thus, fold changes have the same sign); 3) we run the
SPIA[
5] algorithm on the Insulin signaling pathway with and without signal propagations and we take the p-value of the topological analysis (pPERT); 4) we repeat from step 1 10,000 times.
As shown in Figure the distribution of the topological significance p-values in case of signal propagation is shifted towards lower values with respect to the case of non-propagation. Propagation p-value distribution is not only centered on 0.1 (while the one with non-propagation is centered on 0.3) but is also less variable. As expected the same results are obtained simulating negative fold changes (data not shown). This finding demonstrate that compound mediating signal propagation improves topological analyses giving more reliable results.
3.5 Example of topological analysis: B-lineage Adult Acute Lymphocytic Leukemia
3.5.1 Data
The dataset, recently published by [
33], characterizes gene expression signatures in acute lymphocytic leukemia (ALL) cells associated with known genotypic abnormalities in adult patients. Several distinct genetic mechanisms lead to acute lymphocytic leukemia (ALL) malignant transformations deriving from distinct lymphoid precursor cells that have been committed to either T-lineage or B-lineage differentiation. Chromosome translocations and molecular rearrangements are common events in B-lineage ALL and reflect distinct mechanisms of transformation. The relative frequencies of specific molecular rearrangements differ in children and adults with B-lineage ALL. The BCR breakpoint cluster region and the c-abl oncogene 1 (BCR/ABL) gene rearrangement occurs in about 25% of cases in adult ALL, and much less frequently in pediatric ALL.
Data is available at the Bioconductor site (
http://www.bioconductor.org/help/publications/2003/Chiaretti/chiaretti2/) Expression values, appropriately normalized according to
rma and quantile normalization, derived from Affymetrix single channel technology, consist of 37 observations from one experimental condition (
n1 = 37, BCR; presence of BCR/ABL gene rearrangement) and 41 observations from another experimental condition (
n2 = 41, NEG; absence of rearrangement). Probes platform have been annotate using EntrezGene custom CDF version 14 [
34]. Given the involvement of BCR and ABL genes in the chimera rearrangement, we expect these genes playing a central role in the gene set analysis; thus, most of the pathways containing BCR and/or ABL genes should be found as significant.
3.5.2 Results
We report the results obtained by
SPIA[
5] and
topologyGSA[
6] on the
graphite networks. These statistical tests are based on completely different null hypotheses; while
SPIA needs the list of differentially expressed genes,
topologyGSA performs two statistical tests (to compare the mean and the variance of the pathway between two groups) on the entire list of genes belonging to a pathway. Here, differentially expressed genes required for
SPIA package have been identified using RankProd test [
35] (
F D R <0.01), while the test on the mean has been chosen for
topologyGSA package.
Tables and reports the list of significant pathways identified by the above approaches; pathways marked with
are those containing BCR and/or ABL genes. It is interesting to observe that several pathways containing either BCR and ABL genes were identified as deregulated especially with
topologyGSA. Then, as expected, several additional pathways associated to cancer progression, apoptosis, cell cycle, cell proliferation and inflammation have been selected as significant.
| Table 3Pathway analysis performed using SPIA statistical test on graphite networks. |
| Table 4Pathway analysis performed using topologyGSA statistical test on graphite networks. |
Leaving the comparison between topological analyses aside (because it is out of the scope of the present work), the results testify the feasibility of performing analyses using graphite and the ability to obtain reliable results independently of the chosen analysis method. In addition, for the first time, thanks to graphite all the topological methods gain the access to pathway repositories previously not considered.
Our results highlight that the hierarchical pathway structure and the reduced dimension of the pathways characterizing respectively the Reactome and Biocarta databases jointly with the specialized cancer pathways of the NCI databases allow the user to have deeper insight into the data.
To highlight the usefulness of topological analysis in the context of transcriptomic data interpretation, we report two graphite networks identified as significantly altered in the previous analysis.
Chronic myeloid leukemia pathway includes both genes, BCR and ABL1, and was identified as differentially expressed between BCR/ABL positive and negative patients by topologyGSA. Figure shows the chronic myeloid leukemia graphite network from KEGG database with differentially expressed genes mapped with different colors according to fold change sign. It is interesting to note the presence of several OR groups (e.g. PI3K, AKT, IKK, CBL gene families), single members of which resulted to be differentially expressed. Two clear deregulated paths starting from BCR and ABL1 genes towards apoptosis and NFKB pathways highlight the power of topological analysis to deeper investigate signal cascades within large pathways.
3.6 R/Bioconductor packages
Currently there are two Bioconductor packages that try to convert pathway to the SIF model.
KEGGgraph[
9] that parses KGML format but i) considers all group of genes as groups OR and ii) completely removes compounds without propagation and
NCIgraph that imports BioPax models from Cytoscape [
27] without taking into account groups and compound propagation (compounds become nodes of the network) and uses internal IDs as node labels. The package allows to retain only nodes with EntrezGene IDs loosing all the other nodes. Thus, both of them are not suitable for topological pathway analyses.