Given a phenotype-expressing organism, the DENSE algorithm (Figure ) tackles the problem of identifying genes that are functionally associated to a set of known phenotype-related proteins by enumerating the "dense and enriched" subgraphs in genome-scale networks of functionally associated or interacting proteins. A "dense" subgraph is defined as one in which every vertex is adjacent to at least some
γ percentage of the other vertices in the subgraph for some value
γ above 50%, which corresponds to a set of genes with many strong pairwise protein functional associations. The researchers' prior knowledge is incorporated by introducing the concept of an "enriched" dense subgraph in which at least
μ percentage of the vertices are contained in the knowledge prior query set. Genes contained in such dense and enriched subgraphs, or
μ-enriched,
γ-dense quasi-cliques, have strong functional relationships with the previously identified genes, and so are likely to perform a related task. Previous approaches to finding such clusters have included fuzzy logic-based approaches [
27] (also, see [
28]), probabilistic approaches [
29,
30], stochastic approaches [
31], and consensus clustering [
32]. The discovery of dense non-clique subgraphs has recently been explored by a number of other researchers [
33-
38], and a number of different formulations for what it means for a subgraph to be "dense" have emerged.
Luo
et al [
39] discuss 3 types of dense subgraphs other than cliques:
k-plexes,
k-cores, and
n-cliques. The
k-plexes [
40] are subgraphs where each vertex is connected to all but
k others. More specifically, Luo
et al [
39] use a
k-plex definition where
k =
n/2. A definition similar to
k-plex has been used by Carter and Johnson [
35]. Meanwhile,
k-cores [
41] are subgraphs where each vertex is connected to at least
k others, and
n-cliques [
42] are subgraphs with diameter at most
n. In this paper we use a more restrictive definition of the
n-clique, i.e, 2-clique with some additional constraints. Abello
et al [
33] use a definition where at least

edges exist in the subgraph, and Bu
et al [
34] use a definition of a dense subgraph based on the eigenvalue decomposition of the adjacency matrix of the graph. Gao and Wong [
36] use a definition based on "clique percolation," meaning that any dense subgraph must satisfy the property that one could reach all of the vertices by taking a clique of size 4 in the subgraph and changing one vertex at a time to form another clique of size 4 until every vertex has been touched. Pei
et al [
37] and Zeng
et al [
38] describe cross-graph quasi-cliques, which use a similar notion of subgraph density as we do, but their work describes techniques for finding subgraphs that meet this density criterion across several graphs at once, whereas we are interested in quasi-cliques that are "enriched" with respect to some knowledge priors. In this paper, we attempt to outline theoretical conditions on dense subgraphs of a network that are enriched with respect to some target set of vertices. An algorithm based on this theory would be able to answer "fuzzy queries" on graph data, identifying dense, possibly overlapping subgraphs in which the "query set" of vertices is overrepresented. By finding these dense, enriched "fuzzy clusters," or enriched quasi-cliques, we hope to achieve superior precision and coverage over conventional hard clustering techniques, which heuristically partition graphs into non-overlapping subgraphs. Further, by limiting the focus to discovering those "quasi-cliques" in which the query labels are overrepresented, the search space for identifying these quasi-cliques may be limited, which has the potential to improve execution time significantly over full quasi-clique enumeration. In this work, we use the following definition for a "dense" subgraph:
Definition 1.1 Given a labeled graph G and a real value γ ??#8712; (0.5, 1], a subgraph S of G is a γ-dense quasi-clique if and only if every vertex of S is adjacent to at least γ(|S| - 1) of the other vertices of S. If γ(|S| - 1) is not a natural number, every vertex would need to be adjacent to ??#8968;γ(|S| -1)??#8969; of the other vertices of S.
There are two advantages of using this definition. First, it corresponds nicely with the typical use of the term "density" in that it forces a certain fraction of the possible edges in the subgraph to exist. The second advantage is that by framing the definition as a condition that each vertex must satisfy, we force the resulting subgraphs to be "uniformly" dense. As an illustration, a graph consisting of an isolated vertex and a subgraph in which every pair of vertices is connected may contain a high overall percentage of the possible edges, but it is unlikely anyone would consider the isolated vertex to be related to the others in any significant sense.
Definition 1.2 Given a labeled graph G, a "query" set of vertices Q, a real value γ ??#8712; (0.5, 1], and a real value μ ??#8712; (0, 1], a γ-dense quasi-clique S is μ-enriched with respect to Q if and only if at least μ|S| vertices of S are contained in Q.
Henceforth, μ-enriched γ-quasi-cliques will hereafter be referred to as μ, γ-quasi-cliques, and the "query" set of vertices will be denoted as Q.
Definition 1.3 Given a labeled graph G, a "query" set of vertices Q, a real value γ ??#8712; (0.5, 1], and a real value μ ??#8712; (0, 1], a γ-dense quasi-clique S is also maximal if no larger supergraph S' of S is a γ-dense quasi clique that is μ-enriched with respect to Q.
The algorithm to enumerate
μ,
γ-quasi-cliques is an agglomerative bottom-up approach with a backtracking paradigm. The basic premise of the algorithm is that we will build the
μ,
γ-quasi-cliques starting with a single query vertex
v0 (
v0 ??#8712;
Q) and backtracking as we find maximal
μ,
γ-quasi-cliques or subgraphs that cannot be contained in a
μ,
γ-quasi-clique. For this section, we use the convention that
S represents the current subgraph under consideration, and
C represents the set of vertices that could extend
S to produce a
μ,
γ-quasi-clique. The number of vertices in
S adjacent to a vertex
v is denoted as
sa(
v) and in
C is denoted as
ca(
v).
Nk(
S) denotes all vertices at distance
k (
k edges) or less from all vertices of
S. To improve the efficiency of the algorithm we use some theoretical results and properties (the detailed proofs are available in Supplement 1). The properties are targeted at three points to improve efficiency (1) reducing the size of
C, i.e., the search space of candidates be added, (2) deciding on when to stop expanding a subgraph
S further, and (3) deciding on when to discard a subgraph
S if it can never be a
μ,
γ-quasi-clique. The first property is based on a result presented by Pei
et al [
37], it states that for
S to be a
μ,
γ-quasi-clique, every pair of vertices has to be at a maximum distance of 2 edges from each other. Using this property, the size of the candidate set
C for any subgraph
S can at the maximum only have |
N2(
S)|
/|
S| entries. The second property based on results drawn from Zeng
et al [
38] states that if for any given vertex
v ??#8712;
V (
S), the number of vertices in
C and
S that are adjacent to
v together do not satisfy the
γ constraint, then no supergraph of
S will ever satisfy the
γ constraint, i.e.,
sa(
v) +
ca(
v)
> γ(|S| - 1 +
ca(
v)) needs to be satisfied to warrant expanding
S further; otherwise, we output
S as the maximal
μ,
γ-quasi-clique. The third property states that for any vertex
v ??#8712;
C,
S ??#8746; {
v} or any supergraph of
S ??#8746; {
v} can satisfy the
γ criterion if and only if
sa(
v) +
ca(
v) ≥
γ (|
S| +
ca(
v)). All vertices in
C that do not satisfy this constraint can be removed from the candidate list, thereby reducing the search space further. The fourth property deals with reducing the size of
C based on the enrichment constraint. The current subgraph
S is
μ-enriched if |
S ??#8745;
Q| ≥
μ|
S|. The condition |
S ??#8745;
Q| + |
C ??#8745;
Q| ≥
μ(|
S| + |
C ??#8745;
Q|) must be met by every
S that can be further extended and still satisfy the
μ criterion. The maximum increase in enrichment occurs when subgraph
S is extended by the addition of all vertices from
C ??#8745;
Q. This maximum enrichment has to be less than the sum of the number of vertices common between
Q and
S, and
Q and
C, to warrant any further expansion of
S. If during the algorithm execution we reach a point where the addition of a vertex
v to the current subgraph
S' results in a subgraph
S that violates the above condition,
v is removed from the candidate list. Additional properties for restricting the search space of potential
μ,
γ-quasi-cliques are available in Supplement 1. We loop through all vertices in the query set
Q and for each vertex
v ??#8712;
Q we enumerate all the
μ,
γ-quasi maximal cliques that contain
v and avoid enumerating the same subgraph twice by keeping track of the ones enumerated earlier. All the above theoretical properties and results are utilized to improve the efficiency of the backtracking algorithm (The detailed pseudocode is available Additional File
3). In order to decide when a
μ,
γ-quasi-clique is maximal, we propose to maintain a bitmap index of the
μ,
γ-quasi-cliques that contains each vertex. As the algorithm identifies
μ,
γ-quasi-cliques, it assigns numbers to them sequentially and adds these values to indices for the vertices contained in the
μ,
γ-quasi-cliques. Then, as we add and remove vertices from set
C, we check these bitmap indices to see if there is an already-discovered
μ,
γ-quasi-clique that contains all vertices of
S ??#8746;
C by performing a bitwise and of the indices associated with the vertices of
S ??#8746;
C. If there is an already-discovered
μ,
γ-quasi-clique that is a superset of
S ??#8746;
C, we may safely backtrack, as no further extensions of
S will be maximal. One drawback of using a bitmap index, however, is that as more
μ,
γ-quasi-cliques are identified, the size of the index will increase. In an effort to avoid checking the entire index for each vertex (in the case where
S ??#8746;
C is maximal), we propose using a hierarchical bitmap index, in which each byte of the index is summarized by a single bit in a higher level index. As we are checking for the existence of a bit that is set in all of the indices related to the vertices of
S ??#8746;
C, we do not need to examine bytes that have no bits set. As such, we summarize zero bytes in the "base level" index with a 0 and nonzero bytes with a 1. As the size of the index grows, we can add more levels, summarizing each byte in the "first level" index with a bit in the "second level" index, each byte in the "second level" index with a bit in the third, and so on. In this way, we can use higher level indices to reduce the number of bytes we need to check on the "base level" index.
Parameter Selection
DENSE requires the user input of two parametes: the enrichment (μ) and the density (γ). The earlier description of these parameters suggests that higher values of γ will produce more connected (clique-like) subgraphs. Similarly, higher values of the enrichment (μ ≥ 0.5) will produce subgraphs that are primarily composed of the "query" vertices, whereas a very low value (μ ≤ 0.001) will result in enumeration of all the subgraphs that satisfy the γ threshold and contain at least one query vertex.
Parameter thresholds depend on the application. In this paper, we are interested in identifying phenotype-related protein functional modules, given a user-defined initial set of phenotype-related proteins as a query. Setting
μ value to 0.001 will result in finding all the modules that could potentially be related to phenotype-expression (e.g., via guilt-by-association). Since a functional module is believed to form a group of highly connected proteins in a protein functional association network [
43], the authors of [
44,
45] suggested that the density of the subgraph that represents a functional module should fall between 0.5 and 1, where the greater the density is, the more likely the subgraph is a true functional module. Based on these observations, setting
γ = 1 will produce those subgraphs that are the most probable functional modules. However, since organismal networks are prone to missing information (edges), the value of
γ = 1 could be too stringent, and the algorithm may miss some of the phenotype-related modules. Hence, we chose a
γ value of 0.75 (midpoint of 0.5 and 1) to identify highly connected (but not fully connected) subgraphs as most probable modules that are functionally associated with phenotype-related query proteins.