As a very simple example, we demonstrate how graph concepts can be used to do an analysis that relates gene expression data to protein interaction data.
Proteins that form a functional complex need to be expressed concurrently, hence we expect that something can be learned from comparing co-expression and protein complex co-membership. In particular, we consider the question is whether genes in a protein complex are more likely to have a similar pattern of gene expression than genes in different complexes.
The analysis that we demonstrate in the following was reported by Balasubramanian et al. [
14] and is based on the work of Geone et al. [
7]. Balasubramanian et al. used two graphs defined on a common set of nodes: the genes present in yeast. The relationship represented by the edges in the first graph is co-membership in a cluster of correlated expression, while the edges in the second graph represent co-membership in a protein complex.
For concreteness, we will show the R programming code to perform this analysis. Figures , , , , are generated from the results of these computations, and the
Sweave source document for this article includes all the R code for analysis and graphics displays. It is available as additional file
1.
We set up the comparison by creating the two graphs as objects in the R language and counting how many edges they have in common. To see whether this number is significantly above what could be expected by chance if there were no relationship between protein complex co-membership and co-expression. There are some subtleties in the definition by what we mean by by chance, as we will discuss below.
The Data
The R package
yeastExpData contains the gene expression data from a yeast cell-cycle time course [
21], including an assignment of the genes into co-expressed clusters in the dataframe ccyclered, and protein-protein interaction (PPI) data extracted from published papers (litG).
> library("yeastExpData")
> data("ccyclered")
> ccyclered [1:2, 1:8]
> table(ccyclered$Cluster)
> data("litG")
> litG
A graphNEL graph with undirected edges
Number of Nodes = 2885
Number of Edges = 315
The code above shows the first two rows (genes) of ccyclered, the sizes of the 30 clusters, and a summary of the graph object litG.
Exploration of the PPI graph
To explore the graph litG, we can employ the functionality of the package RBGL. First, we find the connected components.
> library("RBGL")
> cc = connectedComp(litG)
> table(listLen(cc))
cc is a list of the connected components of litG. There are 2587 singletons (connected components of size 1), and the largest connected component has size 88. Let us plot the two largest components using the Rgraphviz package. We first determine the indices of the ordered components,
> ord = order(listLen(cc), decreasing = TRUE)
select the largest subgraph,
> sG1 = subGraph(cc [[ord[1]]], litG)
lay it out using the function agopen, which is an interface to the graphviz graph layout library, and plot it. There are many options for node color, line color and type, node shape etc., for which we refer to the vignette of the Rgraphviz package.
> lsG1 = agopen(sG1, layoutType = "neato", nodeAttrs = makeNodeAttrs(sG1, + fillcolor = "steelblue2"), name = "sG1")
> plot(lsG1)
The graph is shown in Figure . Similarly, Figure shows the second-largest connected component sG2.
> sG2 = subGraph(cc [[ord[2]]], litG)
Construction of the cluster graph
There is a specialized graph class clusterGraph that can be used to represent clusters. The 30 clusters of the 2885 genes in the ccyclered dataset are represented by 30 subgraphs which are fully connected within themselves and unconnected with each other.
> clusts = with(ccyclered, split(Y. name, Cluster))
> clG = new("clusterGraph", clusters = clusts)
Statistical analysis of the graph overlap
It is now easy to determine how many pairs of genes have both a protein-protein interaction and are found in the same expression cluster. We find the intersection of the cluster-graph and the literature graph using the R function intersection.
> commonG = intersection(clG, litG)
A graphNEL graph with undirected edges
Number of Nodes = 2885
Number of Edges = 42
We find that 42 edges are in common, now we will try to determine whether this number is statistically interesting, i. e. different from what could be expected by chance. We will do this by generating a null distribution via permutation of node labels on the observed graph. The following function implements this.
> nodePerm = function(g, h, B = 500) {
+ n = nodes(g)
+ sapply(1:B, function(i) {
+ nodes(g) <- sample(n)
+ numEdges(intersection(g, h))
+ })
+ }
> nPdist = nodePerm(clG, litG)
Figure shows the histogram of nPdist together with a vertical line at 42, the number of edges of the intersection graph. The largest number of common edges in the permutation distribution is 24. We conclude that the overlap between litG and clG is statistically significant. In the next section, we will do some data exploration to investigate some of the biological significance.
Cohesive subgroups
Let us look at cohesive subgroups of the intersection graph commonG. First, we remove the singleton nodes,
> sel = names(which(degree(commonG) >= 1))
> commonG = subGraph(sel, commonG)
then we use the functions from the RBGL package to identify the different types of cohesive subgroups that were defined above.
> kcliq = kCliques(commonG)
> kcore = kCores(commonG)
> lambd = lambdaSets(commonG)
kcliq, the return value of kCliques is a list whose k-th entry is a list of all the k-cliques in the graph. We can get all the 2-cliques of size >= 4,
> listSel = function(x, n) x[listLen(x) >= n]
> kc2 = listSel(kcliq[[2]], 4)
> kc2
[[1]]
[1] "YBR009C" "YBR010W" "YNL030W" "YNL031C"
[[2]]
[1] "YBL035C" "YJR043C" "YNL102W" "YPR135W"
[[3]]
[1] "YBR088C" "YDL102W" "YJR006W" "YJR043C" "YNL102W"
Remember that a 2-clique is a subgraph in which the distance between each pair of nodes is ≤ 2. Any subgraph of size ≤ 3 satisfies this requirement trivially, hence we consider those with size ≥ 4. They are shown in Figure . Using the gene annotation data provided in the metadata package YEAST, we can look at the names and descriptions of the 4 genes in the second 2-clique.
> library("YEAST")
> mget(kc2[[2]], YEASTGENENAME)
> mget(kc2[[2]], YEASTDESCRIPTION)
YBL035C YJR043C YNL102W YPR135W
"POL12" "POL32" "POL1" "CTF4"
YBL035C
B subunit of DNA polymerase alpha-primase complex, required for initiation of DNA replication during mitotic and premeiotic DNA synthesis; also functions in telomere capping and length regulation
YJR043C
Third subunit of DNA polymerase delta, involved in chromosomal DNA replication; required for error-prone DNA synthesis in the presence of DNA damage and processivity; interacts with Hys2p, PCNA (Pol30p), and Pol1p
YNL102W
Catalytic subunit of the DNA polymerase alpha-primase complex, required for the initiation of DNA replication during mitotic DNA synthesis and premeiotic DNA synthesis
YPR135W
Chromatin-associated protein, required for sister chromatid cohesion; interacts with DNA polymerase alpha (Pol1p) and may link DNA synthesis to sister chromatid cohesion
The first 2-clique is a duplicated pair of histone proteins:
> sapply(kc2[[1]], function(i) YEASTGENENAME [[i]])
YBR009C YBR010W YNL030W YNL031C
"HHF1" "HHT1" "HHF2" "HHT2"
A k-core is a subgraph in which every node is connected to at least k other nodes within the subgraph. The 2-cores of commonG are shown in Figure . lambd represents the λ-sets of commonG. It has 2 elements, the first one is the maximum degree kmax in the graph, the second is a list of length 3 with the λ-sets for k = 0, 1, and 2, respectively.
> lambd[[1]]
[1] 2
> names(lambd[[2]])
[1] "lambda-0 sets" "lambda-1 sets" "lambda-2 sets"
> lambd[[2]] [[3]]
[[1]]
[1] "YBR009C" "YBR010W" "YNL030W" "YNL031C"
[[2]]
[1] "YDL102W" "YJR006W" "YJR043C"
[[3]]
[1] "YDR356W" "YHR172W" "YNL126W"
In this particular example, we note that the λ-sets for k = 2 are the same as the 2-cores in Figure , hence there is no need for an extra figure.