Home | About | Journals | Submit | Contact Us | Français |

**|**BMC Bioinformatics**|**v.13; 2012**|**PMC3556311

Formats

Article sections

- Abstract
- Background
- Method
- Results and discussion
- Conclusion
- Availability and requirements
- Competing interests
- Authors’ contributions
- Supplementary Material
- References

Authors

Related links

BMC Bioinformatics. 2012; 13: 250.

Published online 2012 September 28. doi: 10.1186/1471-2105-13-250

PMCID: PMC3556311

Günhan Gülsoy: ude.lfu.esic@yoslugg; Nirmalya Bandhyopadhyay: ude.lfu.esic@aylamrin; Tamer Kahveci: ude.lfu.esic@remat

Received 2012 March 22; Accepted 2012 September 21.

Copyright ©2012 Gülsoy et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

Transcription factors regulate numerous cellular processes by controlling the rate of production of each gene. The regulatory relations are modeled using transcriptional regulatory networks. Recent studies have shown that such networks have an underlying hierarchical organization. We consider the problem of discovering the underlying hierarchy in transcriptional regulatory networks.

We first transform this problem to a mixed integer programming problem. We then use existing tools to solve the resulting problem. For larger networks this strategy does not work due to rapid increase in running time and space usage. We use divide and conquer strategy for such networks. We use our method to analyze the transcriptional regulatory networks of *E. coli*, *H. sapiens* and *S. cerevisiae*.

Our experiments demonstrate that: (i) Our method gives statistically better results than three existing state of the art methods; (ii) Our method is robust against errors in the data and (iii) Our method’s performance is not affected by the different topologies in the data.

Genes are the smallest functional units of an organism. They carry out vital functions in cells by interacting with each other and with other molecules. Biological networks model such interactions among genes. Using biological networks, researchers are able to take a holistic approach on the analysis of cellular functions. Such analysis has shown that biological networks have a number of global properties. One of these properties is their hierarchical organization. Hierarchical organization defines a partial ordering of the underlying genes. Recent studies have shown that directed interactions between transcription factors (TFs) in transcriptional regulatory networks (TRNs) impose a hierarchy on TRNs [1-5]. Analysis of the hierarchies of TRNs helps researchers better understand the flow of controlling signals through the transcription machinery [1,3]. TFs are special types of proteins that control the expression of other genes by binding to specific regions of the DNA [6]. Since each protein is coded by genes, we will use the terms *transcription factor* and *gene* to refer to TFs throughout this paper. One way to model hierarchy in TRNs is to assign levels to the interacting TFs. Figure Figure11 shows a sample network with its level assignments. In this paper, we consider the problem of finding the hierarchical organization of a TRN. The formal definition of our problem in this paper is as follows:

Let us denote a TRN with $\mathcal{G}=(N,E)$. Here, *N* denotes the set of TFs and *E* denotes the set of directed interactions (i.e. edges) between the TFs in *N*. We refer to each TF in *N* as a *node*. We name the *i*th node in *N* as *n*_{i}. We represent an edge from the node *n*_{i}to *n*_{j} with (*n*_{i},*n*_{j}). Also, we denote the maximum possible number of levels in $\mathcal{G}$ with *M*. We denote the hierarchy level assigned to a node *n*_{i }with *t*_{i} where *t*_{i} is an integer in {1,2,,3,…,*M*}. Let *ϕ*(*n*_{i},*n*_{j})→{0,1} be a binary function that describes the key topological relationship between *n*_{i} and *n*_{j}. (We elaborate on the *ϕ* function below.) We compute a penalty score *p*_{ij }for each pair of nodes as follows:

$${p}_{\mathit{\text{ij}}}=\left\{\begin{array}{ll}\varphi ({n}_{i},{n}_{j}),\phantom{\rule{1em}{0ex}}& \text{if}\phantom{\rule{0.3em}{0ex}}{t}_{i}\le {t}_{j}\\ 0,& \text{else}\end{array}\right)$$

Our aim is to find an assignment of hierarchies to the nodes of *N* which minimizes $\sum _{(i,j)}{p}_{\mathit{\text{ij}}}$.

In this paper, we use two different *ϕ *functions describing two key topological properties.

1. **Adjacency.** We define *ϕ*(*n*_{i},*n*_{j}) = 1 if (*n*_{i},*n*_{j}) * E *and *ϕ*(*n*_{i},*n*_{j}) = 0 otherwise.

2. **Reachability.** We define *ϕ*(*n*_{i},*n*_{j}) = 1 if there exists a path from *n*_{i }to *n*_{j }in $\mathcal{G}$ by traversing the edges in *E*. We set *ϕ*(*n*_{i},*n*_{j}) = 0 otherwise.

Depending on the choice of the two *ϕ *functions, we name the resulting distance function adjacency distance or reachability distance respectively. In summary, using adjacency distance we aim to assign levels such that every TF is above the others it directly regulates, and below its every direct regulator. On the other hand, using reachability distance, we consider any direct or indirect regulation relation between two TFs when assigning levels.

There has been attempts to devise methods to reveal the underlying hierarchies of TRNs. Yu and Gerstein developed BFS-level method to carry out this task [1]. This method uses breadth first search to assign hierarchies to TFs in a network. Although their method works for most networks, it fails to assign accurate levels for networks that contain cycles. Jothi *et al.* developed vertex sort method [2]. This method incorporates topological sort algorithm for addressing the network hierarchy problem. Vertex sort method does not have any restrictions on network motifs or cycles. However, rather than a certain hierarchy, it assigns a range of possible levels for the TFs. Hartsperger *et al.* devised an algorithm based on breadth first search method to solve the problem [3]. Their solution improves the BFS-level method, and outputs a hierarchy for every network regardless of its topological features. However all these algorithms fail to minimize the number of edges that violate the hierarchy. We name such edges as *conflicting edges*. We will elaborate on the quality of results calculated by existing methods in Section Comparison with existing hierarchical decomposition methods.

In this paper, we develop a novel approach to tackle the problem of discovering underlying network hierarchy. We first consider the topology of the network as a set of constraints. Then, we define two different objective functions using adjacency and reachability penalty functions. We define the minimization of total penalty as the objective of the problem. Using the above explanations, we transform this problem to a mixed integer programming problem(MIPP) [7]. We solve the resulting problem using existing MIPP solvers. We name our method **HI**erarchical **DE**composition of regulatory **N**etworks (HIDEN). The main advantage of HIDEN is it introduces a sound mathematical formulation to the network hierarchy problem. Our formulation can work with any objective function that is a linear combination of the edges. One drawback of HIDEN is that it does not scale well to very large networks due to the growing size of the MIPP with increasing number of TFs. In order to address this issue we develop a divide and conquer approach.

The rest of this article is organized as follows: In Section Algorithm, we describe the methods we developed in this paper. In Section Results and discussion, we discuss the results of HIDEN in detail. Finally, in Section Conclusion, we briefly conclude the paper.

In this section, we describe the hierarchical decomposition method we developed. Section HIDEN describes our method. Section Example demonstrates HIDEN on a simple example. Section Divide and Conquer method describes divide and conquer method we employ to scale HIDEN to larger networks.

HIDEN transforms the hierarchical network decomposition problem to a MIPP [7]. Given a TRN, HIDEN first constructs a set of linear constraints and a linear optimization function that collectively describe the penalty of the decomposition. Then it uses existing optimizer software to solve the resulting problem. Next, we will explain how we formulate the MIPP.

Let us denote the given network that will be decomposed with $\mathcal{G}$. Let us denote the nodes (i.e., TFs) of this network with *n*_{1}, *n*_{2}, …, *n*_{m}, where *m* is the total number of nodes of $\mathcal{G}$. HIDEN, allows the user to set a limit on the maximum number of allowed levels for hierarchical decomposition. Let us denote this number with *M*. Also, let us represent the level assigned to node *n*_{i} with *t*_{i} for all *i*{1,2,…,*m*}, (i.e. ∀*i*,* t*_{i } {1,2,3,…,*M*}). We aim to find the level assignment *T *= {*t*_{1},*t*_{2},…,*t*_{m}} that minimizes the total penalty resulting from this level assignment. Therefore, the objective of our problem is the sum of individual penalty scores for each pair of nodes:

$$\text{minimize}\phantom{\rule{1em}{0ex}}\sum _{1\le i,j\le m}{p}_{\mathit{\text{ij}}}.$$

(1)

Next, we set a limit on the number of levels in the hierarchy. We do this by limiting the variables *t*_{i} as follows:

$$0\le {t}_{i}<M.$$

(2)

We, then, represent each *p*_{ij} as a linear constraint. Remember that *p*_{ij }is a binary function in the following form:

$${p}_{\mathit{\text{ij}}}=\left\{\begin{array}{ll}\varphi ({n}_{i},{n}_{j}),& \text{if}\phantom{\rule{0.3em}{0ex}}{t}_{i}\le {t}_{j}\\ 0,& \text{else}\end{array}\right)$$

We can rewrite this function as follows:

$${p}_{\mathit{\text{ij}}}=\left\{\begin{array}{ll}1,& \text{if}\phantom{\rule{0.3em}{0ex}}{t}_{i}\le {t}_{j}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\text{and}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\varphi ({n}_{i},{n}_{j})=1\\ 0,& \text{else}\end{array}\right)$$

Let us only consider the cases where *ϕ*(*n*_{i},*n*_{j}) = 1. We can represent the rest of this function using two linear inequalities. The following set of constraints represent the function *p*_{ij}:

$${p}_{\mathit{\text{ij}}}\in \{0,1\}$$

(3)

$${t}_{j}-{t}_{i}-M\times {p}_{\mathit{\text{ij}}}\ge -M$$

(4)

$${t}_{j}-{t}_{i}-M\times {p}_{\mathit{\text{ij}}}\le -1$$

(5)

In order to prove that these inequalities model the function *p*_{ij }correctly, we need to inspect all possible scenarios:

1. if *t*_{i }>* t*_{j }and *p*_{ij }= 0, then −1 ≥* t*_{j }−* t*_{i }≥ −(*M*−1) and *M *×* p*_{ij }= 0. Therefore both (4) and (5) holds.

2. if *t*_{i }≤* t*_{j }and *p*_{ij }= 0, then *t*_{j}−*t*_{i }≥ 0 and *M *×* p*_{ij }= 0. Therefore, (4) holds, however, (5) does not hold.

3. if *t*_{i }>* t*_{j }and *p*_{ij }= 1, then −1 ≥* t*_{j}−*t*_{i }≥ −(*M*−1) and *M *×* p*_{ij }=* M*. Therefore the expression *t*_{j}−*t*_{i }−*M*×*p*_{ij }is smaller than or equal to −*M*−1. This implies that (4) does not hold but (5) holds.

4. if *t*_{i }≤* t*_{j }and *p*_{ij }= 1, then (*M*−1) ≥* t*_{j}−*t*_{i }≥ 0 and *M *×* p*_{ij }=* M*, therefore both (4) and (5) holds.

Therefore, enforcing the constraints (3), (4) and (5) implies:

$$({p}_{\mathit{\text{ij}}}=0\iff {t}_{i}>{t}_{j})\wedge ({p}_{\mathit{\text{ij}}}=1\iff {t}_{i}\le {t}_{j})$$

(6)

This corresponds to the latter definition of the function *p*_{ij} except the condition of *ϕ*(*n*_{i},*n*_{j}) = 1. Since we choose the function *ϕ* prior to the construction of the MIPP, we know the value of *ϕ*(*n*_{i},*n*_{j}) for every pair (*n*_{i},*n*_{j}). Therefore, we can manually ensure this property, by only considering *p*_{ij} where *ϕ*(*n*_{i},*n*_{j}) = 1 and excluding *p*_{ij} completely from our calculations where *ϕ*(*n*_{i},*n*_{j}) = 0.

Based on the constraints above, the MIPP we construct to solve the network hierarchy problem is as follows:

$$\begin{array}{l}\text{minimize}\sum _{i,j\phantom{\rule{1em}{0ex}}\mathrm{s.t.}\phantom{\rule{1em}{0ex}}\varphi ({n}_{i},{n}_{j})=1}{p}_{\mathit{\text{ij}}}\\ \phantom{\rule{1em}{0ex}}\text{where}\\ \phantom{\rule{2em}{0ex}}\forall {n}_{i}\\ \phantom{\rule{3em}{0ex}}0\le {t}_{i}<M\\ \phantom{\rule{3em}{0ex}}{t}_{i}\text{is an integer}\\ \phantom{\rule{1em}{0ex}}\forall {n}_{i},{n}_{j}\text{such that}\phantom{\rule{1em}{0ex}}\varphi ({n}_{i},{n}_{j})=1\\ \phantom{\rule{3em}{0ex}}{p}_{\mathit{\text{ij}}}\in \{0,1\}\\ \phantom{\rule{3em}{0ex}}{t}_{j}-{t}_{i}-M\times {p}_{\mathit{\text{ij}}}\ge -M\\ \phantom{\rule{3em}{0ex}}{t}_{j}-{t}_{i}-M\times {p}_{\mathit{\text{ij}}}\le -1\end{array}$$

In this section, we show the application of HIDEN on the network in Figure Figure1.1. We will use adjacency penalty function in this example. Therefore:

$$\varphi ({n}_{i},{n}_{j})=\left\{\begin{array}{ll}1& \text{if an edge from}\phantom{\rule{.3em}{0ex}}{n}_{i}\phantom{\rule{.3em}{0ex}}\text{to}\phantom{\rule{.3em}{0ex}}{n}_{j}\phantom{\rule{.3em}{0ex}}\text{exists}\\ 0& \text{otherwise}\end{array}\right)$$

Using this *ϕ *function, the objective of the MIPP is to minimize the following function:

$$\sum _{i,j\phantom{\rule{.3em}{0ex}}\mathrm{s.t.}\phantom{\rule{.3em}{0ex}}\varphi ({n}_{i},{n}_{j})\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1}{p}_{\mathit{\text{ij}}}={p}_{12}+{p}_{13}+{p}_{15}+{p}_{24}+{p}_{35}+{p}_{36}+{p}_{67}+{p}_{73}$$

Now we go over to the constraints. First set of constraints limit *t*_{i}:

$$\begin{array}{l}\forall {n}_{i},i\text{from 1 to 7}\\ \phantom{\rule{3em}{0ex}}\phantom{\rule{3em}{0ex}}\phantom{\rule{2.40em}{0ex}}0\le {t}_{i}<M\\ \phantom{\rule{3em}{0ex}}\phantom{\rule{3em}{0ex}}\phantom{\rule{2.40em}{0ex}}{t}_{i}\phantom{\rule{1em}{0ex}}\text{is an integer}\end{array}$$

Then, we write the remaining functions as follows:

$$\begin{array}{l}\forall ({n}_{i},{n}_{j})\in \left\{\right({n}_{1},{n}_{2}),({n}_{1},{n}_{3}),({n}_{1},{n}_{5}),({n}_{2},{n}_{4}),\\ \phantom{\rule{3em}{0ex}}\phantom{\rule{3em}{0ex}}({n}_{3},{n}_{5}),({n}_{3},{n}_{6}),({n}_{6},{n}_{7}),({n}_{7},{n}_{3})\}\\ \phantom{\rule{3em}{0ex}}\phantom{\rule{3em}{0ex}}{p}_{\mathit{\text{ij}}}\in \{0,1\}\\ \phantom{\rule{3em}{0ex}}\phantom{\rule{3em}{0ex}}{t}_{j}-{t}_{i}-M\times {p}_{\mathit{\text{ij}}}\ge -M\\ \phantom{\rule{3em}{0ex}}\phantom{\rule{3em}{0ex}}{t}_{j}-{t}_{i}-M\times {p}_{\mathit{\text{ij}}}\le -1\end{array}$$

In the resulting problem, *M* is left as a user defined parameter. When we run the above problem with *M *= 4, HIDEN returns the following result:

$$\begin{array}{r}{p}_{12}+{p}_{13}+{p}_{15}+{p}_{24}+{p}_{35}+{p}_{36}+{p}_{67}+{p}_{73}=1\\ ({t}_{1},{t}_{2},{t}_{3},{t}_{4},{t}_{5},{t}_{6},{t}_{7})=(4,3,3,2,2,2,1)\end{array}$$

Figure Figure22 shows the result of HIDEN on the given network. Note that HIDEN computes the level decomposition successfully despite the existence of a cycle in the network.

HIDEN works well for networks that have up to 100 nodes. For larger networks, however, it becomes difficult to solve the resulting MIPP using current hardware. This is mainly because the number of integer variables of the MIPP that describe the problem for the given network increases. This increases the memory consumption and the running time significantly.

In order to solve our problem for networks that have more than 100 nodes we adopt a divide and conquer approach. Given a large TRN, we randomly divide this network into fixed size partitions. We do this by first randomly selecting a node from the given network. This node is the seed of the first partition, and thus it is a member of that partition. We then chose the remaining nodes in that partition iteratively by randomly growing the partition one node at a time. More specifically, at each iteration, we randomly select a node that is not selected so far and that is interacting with at least one of the nodes in the partition. We repeat these iterations until the number of nodes in the partition reaches to a predefined threshold or all the nodes in the TRN are assigned to a partition. Then, we use HIDEN to decompose the subnetwork defined by the nodes and the edges in this partition into hierarchical levels. Once we determine the levels of all the nodes in the current partition, we store those values as they will remain unchanged in the rest of our solution. Next we randomly pick another node from the given TRN among those that have not been considered yet as the seed of the next partition. We grow the next partition similarly and use HIDEN to decompose it into hierarchical levels. We repeat these steps until we exhaust all the nodes in the given TRN.

This method greatly reduces the running time of HIDEN on large networks. Since MIPP is NP-hard, depending on the size and the connectivity of the given TRN, the divide and conquer strategy can be orders of magnitude faster than the unpartitioned HIDEN. However, due to random selection of the nodes, it is possible for us to not achieve the optimal result. This is possible if the partition of the network we start with does not intersect with one or more of the levels in its underlying hierarchy. It is worth mentioning that this probability is usually very low. We can explain this as follows. Consider an N node network which contains n nodes belonging to a specific level x. If we select k nodes among these N nodes randomly, the probability that none of the k nodes belong to level x is (N-n choose k)/(N choose k). As k or n increases, this expression quickly converges to zero. In order to reduce this probability further, we repeat the divide and conquer strategy multiple times, each time starting from a randomly selected node. In our implementation, we repeat this process 1000 times for real TRNs. After 1000 iterations, the probability of all the trials starting with an undesired (i.e. does not intersect with all the final levels) partition becomes very small (i.e. if for 1 iteration, the probability is as high as 0.9, after 1000 iterations, the probability becomes 0.9^{1000}~10^{−46}). Since the running time of partitioned HIDEN is orders of magnitude less than that of the unpartitioned HIDEN, 1000 repetitions remains to be practical. It took less than 10 minutes for the largest dataset (*S. cerevisiae*). Our experiments showed that on the average, the results of the divide and conquer method reach its optimum in less than 500 iterations.

In this section, we evaluate HIDEN using a number of computational tests. In our tests, we let the underlying MIPP solver to handle the case of multiple optimal results. We only consider the unique result reported by the solver in our discussions. In the rest of this paper, we will use the term *experiment* to refer to *in silico* experiments for simplicity.

Dataset In our experiments, we used TRNs of *E. coli*, *H. sapiens* and *S. cerevisiae*. We used the previously constructed networks, used by existing methods to test our method [1-3,5]. Earlier studies used existing interaction data to construct these three networks [8-17]. For the experiments that require gene function information, we used the information included in the Gene Ontology Database [18]. We downloaded the list of essential genes for *S. cerevisiae* from the database of essential genes [19].

In the rest of this section, we first compare HIDEN with other existing hierarchical decomposition methods in Section Comparison with existing hierarchical decomposition methods. In Section Biological evaluation of network hierarchies we evaluate the results our method using a number of biological properties of TFs. Finally in Section Effects of input on HIDEN, we analyze the behavior of our algorithm with respect to different quantitative properties of the data.

The objective of hierarchical decomposition is to arrange the TFs of a given network to levels so that the gene that alter the activity of the other appears at a higher level than the other throughout the network as frequently as possible. The two *ϕ*functions described at the beginning of this paper model this relationship in terms of the adjacency and the reachability of the nodes in the given network. In this experiment we evaluate how well our method, HIDEN, compares against three state of the art methods, namely *vertex sort*[2], *HiNO*[3] and *BFS-level*[1], in achieving this objective. To perform this comparison, we compute the penalty values obtained by HIDEN when it is applied on *S. cerevisiae*, *E. coli* and *H. sapiens* networks. We compute the same penalty values for the vertex sort, HiNO and BFS-level methods on the same three datasets for which their hierarchical decompositions are available.

The penalty is a quantitative value that can be used to compare different methods on the same dataset. However, since the size (number of genes and interactions) and the topology of these networks deviate significantly, the resulting penalties will differ significantly across datasets. In order to report a statistically sound value that describes the success of a method independent of the network size and topology, we also compute the Z-scores of the resulting penalty values.

Let us denote the level assignment obtained by a specific method for an *m* node network with *T *= {*t*_{1},*t*_{2},,*t*_{m}}. Let *γ* denote the penalty of *T* according to a specific *ϕ *function. In order to compute the Z-score for *T*, we randomly produce many level assignments using the same level distribution as that of *T*. For each such assignment, we compute the resulting penalty value using the same *ϕ* function. Let *μ* and *σ* denote the mean and standard deviation of the resulting penalty values of all these random level assignments respectively. We calculate the Z-score as follows,

$$z=\frac{\mu -\gamma}{\sigma}$$

(7)

A higher Z-score implies a better level assignment. Typically, a Z-score of four or higher is very significant as they indicate a result which is 4 or more standard deviations more extreme than the mean Table Table11 summarizes the penalties and the corresponding Z-score values. For HIDEN, we reported the results for each of the six values of maximum number of levels (*M *= {3,4,,8}). For other methods the number of levels is not a configurable parameter. Hence, we reported the level that we obtained after execution of that method. We discuss the results for each organism next.

We compared HIDEN with all the three competing methods for this dataset. Our method outperformed all the three methods in terms of both adjacency and reachability penalty values as well as the Z-scores regardless of the number of levels. As the number of levels allowed increases, the penalty incurred by HIDEN monotonically decreases. This, however, is not true for the Z-score as it depends on the distribution of nodes to levels. For instance HIDEN attains the highest Z-score for adjacency penalty at level eight whereas it attains that using only six levels for the reachability penalty. The biggest drop of penalty takes place when the number of allowed levels increases from three to four. We observe further, yet, smaller improvement in the penalty as the number of allowed levels increases beyond four.

Among the competing methods, the vertex sort method of Jothi *et al.* incurs the lowest penalty. It, however, uses significantly more levels than the HiNO and BFS-level methods. Furthermore, although it uses more levels than HIDEN as well, it performs worse than HIDEN in terms of both penalty and Z-score measures. Among the remaining two methods, HiNO and BFS-level, there is no clear winner. BFS-level leads to slightly less penalty at the expense of an additional level. As a result, HiNO produces slightly better Z-scores.

For this dataset, we compared HIDEN with all three existing methods. The penalty values of all the methods for *E.coli* are smaller compared to those of *S. cerevisiae*. This is mainly because *E. coli* network is much smaller. HIDEN performs the best among all methods for four or more levels according to both penalty and Z-score values. We did not observe any improvement for HIDEN beyond seven levels. Vertex sort attains statistically better results than HiNO and BFS-level methods.

We compared HIDEN with vertex sort and BFS-level methods for this dataset. We omitted HiNO in this experiment because we could not run it on this dataset. The results follow a similar pattern as those of the two other datasets. HIDEN outperformed vertex sort and BFS-level even when it used fewer levels. The gap between the Z-scores of HIDEN and the other methods was even more significant than the previous datasets. HIDEN led to the highest drop of penalty of from three to four levels and continued to improve with increased number of levels.

We conclude that, HIDEN performs significantly better than the competing methods for all the three major datasets.

In this section, we analyze HIDEN using biological evidence. First, we check functional properties of genes across different levels. Then, we evaluate the locations of essential genes in the hierarchy.

TRNs regulate the expression of genes that take part in many processes in an organism [13]. Earlier works suggest that the concentration of genes participating in certain functions are closely related to the level in the hierarchy [1]. However, majority of cellular functions in the cell take place through multiple reactions happening in succession. Therefore, we expect a uniform distribution of functions among different levels. In order to confirm this theory, we calculated the functional enrichment score of every single level in the hierarchies we discovered. We first decomposed each of the *H. sapiens*, *E. coli* and *S. cerevisiae* TRNs to each of the three to eight levels. Then, for the resulting 18 combinations (i.e., 3 organisms and 6 levels), we calculated a p-value for each gene ontology term and level pair. We obtain the gene ontology terms from the Gene Ontology database [18]. We calculate these p-values assuming a hypergeometric distribution of the gene ontology term over all the TFs in the network. We observed that the p-values were similar at all levels of the hierarchy (see Figure Figure3).3). This supports our initial theory that majority of the functions the TFs in our network participate are not enriched at any level. One example among many is the *wound healing* function in human network [20]. However, in rare instances, we observed some functions being moderately enriched in some levels. For example, third of the eighth level (the third level when we decompose the network into eight levels) of human TRN is enriched with the *signal transduction* function. However, we could not detect any other levels in any other network enriched with this function.

Each gene in an organism takes part in at least one metabolic function. A gene participating in a large number of reactions is a common phenomena in many organisms. In this experiment, we compare the level of each gene with the number of functions they participate in. By doing so, we aim to discover any existing relation between the two. In order to do this, we use the gene ontology database [18]. The participation of a gene in a reaction is represented using gene ontology annotations in the literature. For each gene in our networks, we first count the number of gene ontology terms it is annotated with. We also decomposed each network into six layers using HIDEN. Then, we visualized the networks using hierarchy information as location and functional information as color of each node. Figures Figures4,4, ,55 and and66 show our results. Our results suggest that there is no strong correlation between the number of functions of each gene and the level of the gene in the hierarchy. However, in all three organisms, we observed that the genes with the highest number of annotations tend to lie at the middle levels (i.e. 2,3 or 4). This result indicates that regulatory hubs in the TRNs are not at the top levels. They are rather at the middle levels of the hierarchy.

The genes which an organism cannot survive without are called essential genes [19]. Such genes take part in vital functions in the cell. Earlier works proposed that the number of essential genes is strongly correlated to its location in the hierarchy [2]. In this experiment, we aim to find out if there exists any such relation. In order to do this, we count the number of essential genes at each level of hierarchy in a five level decomposition of *S. cerevisiae* TRN. We then report the ratio of number of essential genes to total number of genes in a level in the hierarchy. We also calculate P-values for the number of genes observed in each level to show how significant the observations are. Figure Figure77 shows our results for this experiment. We observe that there is a higher density of essential genes at the middle levels of the hierarchy. We also observe that, the P-values we calculated show that the level three (of maximum four) is highly enriched in essential genes. This result, combined with the results of the previous experiment support our theory that regulatory hubs of an organism are often at the middle levels of the hierarchy, rather than the top level. Indeed strong correlation between hubs and essentiality has been observed in the literature that supports our results [21].

Figure Figure88 shows a subnetwork of the human TRN. The highlighted TFs are shown to have abnormalities in many types of cancers. *c-Myc* is a TF which has a key role in cell proliferation [22]. Overexpression of *c-Myc* may result in development of different types of cancers. *TP53* is an essential gene which is regulated by c-Myc. The expression of this gene prevents formation of tumors by activating DNA repair, inhibiting cell growth and finally inducing apoptosis [23]. *TP53* executes apoptosis by activating caspases (i.e. *CASP8*, *CASP3*) [24]. *FLI1* is another protein regulated indirectly by *c-Myc*. The fusion of proteins *EWSR1*/*FLI1* and *EWSR1*/*ERG* due to a mutation creates a master regulator for the development in *Ewing’s Sarcoma*[25,26]. *EWSR1/FLI1* causes tumor formation by upregulating genes that are involved in cell proliferation (i.e. *IGF1*) and downregulating genes that control apoptosis and growth inhibition (i.e. *IGFBP3*, *TGFBR2*) [27]. These small scale observations support our previous justifications that regulatory hubs and essential genes are more likely to be situated in the middle layers of the TRNs.

In this section, we analyze HIDEN by changing the input of the algorithm. In order to do this, we first change the number of layers we decompose the network into. Then, we assume errors and uncertainties in input networks. Using our results, we explain how reliable our method is under different conditions. Finally, we discuss the quality of our results for different subnetworks.

The location of a gene in the hierarchy depends highly on the total number of levels. This leads to the following important question: How much can we rely on the relative levels of genes? One key feature of our method is that it allows the user to specify the number of levels in the hierarchical decomposition of the given network. By exploiting this feature, next, we answer this question. Particularly, we show how the change the number of levels affect the locations of the nodes in the hierarchy. In order to do this, we first calculate the levels of every node for *S. cerevisiae*, *E. coli* and *H. sapiens* networks in a six level hierarchy. We use these calculations to place every node in their respective positions. We then decompose the same networks to five levels. We use the result of the second decomposition to assign colors to each node in the network. Figure Figure99 shows the results of this experiment for *S. cerevisiae*. Our results demonstrate that for the majority of the genes, the relative position between different genes is preserved. In different decompositions, discovering genes in the same relative positions with respect to other genes suggest the accuracy of our method for the relevant genes. However, there exists some genes that violate this relationship. For example, in Figure Figure9,9, nodes *YGL013C*, *YMR280C* and *YKL109W * are at least two levels away from where they were earlier. Therefore, we conclude that the predicted levels of these genes not as reliable as the others. *This approach can be used for evaluating the reliability of our results.* Figures Figures1010 and and1111 present similar results for *E. coli* and *H. sapiens*.

One weakness of all hierarchical decomposition methods arises from the nature of the biological network datasets that they are incomplete and imprecise. As a result, the actual network topology observed can be slightly different than what is given in existing network databases [28]. Furthermore, studies demonstrate that the network topologies can vary due to many other factors such as external perturbations [29] and varying genetic profiles and disorders [30] even within the same species. This raises the question that how much can we rely on a hierarchical decomposition if the topology of the given network is not accurate?

This section evaluates the *robustness* of HIDEN with respect to inaccuracies in the given network. In order to do this, we carry out the following steps. Given a network, we first find its hierarchical decomposition, denoted by *T*. We then create many mutant networks from this network for a given mutation percentage using the *degree preserving edge shuffling* model [31]. This is the state of the art network alteration method that preserves the degree distribution of the network. We elaborate on this method later in this section. Thus, each mutant network denotes a potential precise network that is different than the original network. Using the topology of each mutated network, we compute the penalty of the hierarchical decomposition *T* we found at the first step. Thus, this penalty shows how bad our results are if our network is inaccurate. We repeat this for many mutant networks and report the average of their penalties.

Briefly, we mutate a given network $\mathcal{G}$ as follows. Let (*u*, *v*) and (*s*, *t*) denote two randomly selected edges from this network such that (i) *u* and *v* are different; *s* and *t* are different, and (ii) the edges (*u*, *t*) and (*s*, *v*) do not exist in $\mathcal{G}$. We remove edges (*u*, *v*) and (*s*, *t*) and add edges (*u*, *t*) and (*s*, *v*). Let *η* denote the number of times we repeat this edge shuffling process. Then the mutation percentage of the original network is $\frac{\eta}{\left|E\right|}\times 100\%$ rounded to the nearest integer.

We conducted the experiments on *S. cerevisiae*, *E. coli* and *H. sapiens* and on both penalty metrics adjacency and reachability for different number of levels of hierarchy. Figure Figure1212 summarizes the results for *S. cerevisiae*, *E. coli* and *H. sapiens* using the adjacency and reachability penalties when three, six or eight levels are allowed.

The most important observation that follows from our results is that the Z-score remains high even after we mutate the network by 20%. We observe a slight drop as the mutation rate increases, yet the results remain statistically significant. This observation holds for small (3), medium (6) and large (8) number of allowed hierarchical levels. This result has two major implications. First, HIDEN is extremely robust with respect to network mutations. It was able to identify hierarchical structure using the clues that remain in the topology of the given network after all mutations take place. Thus, even if the original network may be imprecise, the decomposition found by HIDEN will be a true decomposition with a high probability. Second, the degree preserving edge shuffling does not affect the decomposability of the network. The fact that even the original level assignment *T* yielded statistically significant penalties on the mutant network proves that it is possible to find another decomposition *T*’ of the mutant network that is statistically at least as significant as (possibly more significant than) *T*.

So far, we have observed that HIDEN was able to decompose the networks of the given three organisms successfully. This observation along with our last conclusion from the previous section begs the following question: Can HIDEN decompose the mutant networks or was there a bias in topology of these three networks in favor of HIDEN? In other words, how stable is HIDEN with respect to alterations in the network topology?

In order to evaluate the *stability* of HIDEN with respect to network alterations, we carry out the following steps. Given a network $\mathcal{G}$, we create many mutant networks $\mathcal{G}$’ from $\mathcal{G}$ for a given mutation percentage using the degree preserving edge shuffling. We then use HIDEN on each such $\mathcal{G}$’ to find a new hierarchical level assignment *T*’ specifically for that topology. Thus, this penalty shows how much the performance of HIDEN is affected from network alterations. We repeat this for many mutant networks and report the average of their results.

Tables Tables2,2, ,33 summarize the penalties and the corresponding Z-scores for varying mutation percentages as well as varying maximum number of allowed hierarchical levels with according to adjacency and reachability penalties respectively. For all the three organisms, we observe similar patterns in our experiments. The most important observation is that HIDEN achieves very high Z-scores at all mutation rates. Furthermore, these Z-scores are comparable to those of the original network (i.e., mutation percentage = 0%). The adjacency penalty values are also comparable to those for the original network. This coincides with the observation we made in the robustness test in Section Robustness of HIDEN that the degree preserving edge shuffling does not alter the decomposability of the given network. As the mutation percentage increases, Z-score and the adjacency penalties do not show a clear trend to increase or decrease. We, thus, reach to the conclusion that HIDEN is stable with respect to network alterations.

Stability Experiment for increasing mutation percentages: The numbers in parenthesis is the average adjacency penalty

The entire biological network of an organism can be considered as a (possibly overlapping) collection of smaller subnetworks where each subnetwork corresponds to a coherent functional group. For instance, cell cycle network describes the interactions that take place during the division and replication of a cell to produce new cells. Similarly, meiosis network describes a special type of cell division only observed in reproductive cells. These smaller subnetworks may follow a hierarchical structure as well within their local topologies. Clearly, we can use HIDEN on each of these subnetworks to find their hierarchical structure by isolating them from the rest of the network one by one. We call such hierarchical decomposition as *local hierarchy* since it only depends on the topology of the subnetwork. We call the hierarchical decomposition we obtain for a subnetwork from the entire network’s topology as its *global hierarchy*. In this experiment, we evaluate how well the global hierarchy of a subnetwork fits to its local hierarchy.

Let us denote the entire network with $\mathcal{G}$ and a subnetwork of $\mathcal{G}$ with $\mathcal{G}$’. Let us denote the level assignments for the networks $\mathcal{G}$ and $\mathcal{G}$’ by HIDEN with *T* and *T*’ respectively. Let $\widehat{T}\subseteq T$ be the global hierarchy of $\mathcal{G}$’ induced from *T*. We calculate the adjacency penalty and Z-score of $\widehat{T}$ and *T*’ using the topology of $\mathcal{G}$’. Table Table44 summarizes the results for *S. cerevisiae* for two major subnetworks, namely *cell cycle* and *meiosis* taken from the KEGG database [32] with different values of maximum number of allowed levels.

The results demonstrate that the local hierarchy is better than the global one. This is not surprising as the global hierarchy is determined based on the entire network. Thus, the levels are determined with the goal of optimizing all the interactions in the network. On the other hand, local hierarchy is determined only based on the restrictions asserted by the corresponding subnetwork. We observe that the gap between the local and the global hierarchy is small for the cell cycle network. It is, however, significant for the meiosis network. In order to understand the factors that contribute to this gap, we performed a detailed analysis of the topology of the entire *S. cerevisiae* network as well as these two subnetworks. Cell cycle contained 54 genes and 108 interactions. Meiosis was smaller, containing 44 genes and 62 interactions. We define an interaction from a gene that is not in the subnetwork to a gene that is in the subnetwork as an *incoming edge.* If the interaction points the opposite direction, we define it as an *outgoing edge*. We computed the number of incoming and outgoing edges to each subnetwork. The number of incoming edges per node was 1.9 and 3.6 for cell cycle and meiosis respectively. That for the outgoing edges was 20.6 and 18.8 respectively. This suggests that as the number of incoming edges increase, the chance that the global hierarchy deviates from the local one increases. This is indeed intuitive as the incoming edges influence the hierarchy much more than the outgoing edges. For the local hierarchy, we observe that a small number of levels is sufficient to get a good hierarchical decomposition. For instance, HIDEN resolved all conflicts for cell cycle in only five levels. It resolved all but one conflict for meiosis in four levels.

These results demonstrate that the local and global hierarchies can deviate significantly depending on the topological relationship between the subnetwork and the rest of the network. Thus, detailed analysis of both decompositions can yield useful information regarding how the functions of a given subnetwork is depends on the other genes. HIDEN is capable of revealing such information.

In this paper, we took a novel approach to the problem of discovering underlying network hierarchy. We first transformed our problem to a MIPP. Then, we solved this problem using existing optimizers. We named this method **HI**erarchical **DE**composition of gene regulatory **N**etworks. However, due to the growing size of the MIPP with increasing number of genes, we encountered scalability issues. We proposed a divide and conquer approach to tackle such problems. Later, we experimentally showed that our algorithm outperformed existing solutions in terms of minimizing conflicting edges in hierarchy. We also evaluated our method using biological and statistical tools. Then, we discussed the relation between the hierarchy of a gene in a TRN and its location in cell, essentiality and function, based on our experimental results and biological evidence.

The source code for HIDEN can be found in Additional file 1. The code is written in C++. The code requires IBM ILOG CPLEX version 12 or higher to compile and run. Please refer to the documentation of CPLEX for platform specific instructions on how to compile and run applications that use CPLEX libraries. The results of our code using the penalty schemes described in this paper for TRNs of *E. coli*, *H. sapiens* and *S. cerevisiae* can be found in Additional file 2.

The authors declare that they have no competing interests.

GG and TK designed the method. GG implemented the method. GG and NB gathered experimental results. GG and TK wrote the paper. All authors read and approved the final manuscript.

A c++ implementation of the algorithm developed in this paper.

Click here for file^{(494K, tar)}

**The resulting level assignments for the transcriptional regulatory networks of***S. Cerevisiae***,***E. coli***and***H. sapiens***using adjacency and reachability penalties.**

Click here for file^{(126K, xls)}

This work was supported partially by NSF under grants CCF-0829867 and IIS-0845439.

- Yu H, Gerstein M. Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci U S A. 2006;103(40):14724–14731. doi: 10.1073/pnas.0508637103. [ http://dx.doi.org/10.1073/pnas.0508637103] [PubMed] [Cross Ref]
- Jothi R, Balaji S, Wuster A, Grochow JA, Gsponer J, Przytycka TM, Aravind L, Babu MM. Genomic analysis reveals a tight link between transcription factor dynamics and regulatory network architecture. Mol Syst Biol. 2009;5:294. [ http://dx.doi.org/10.1038/msb.2009.52] [PMC free article] [PubMed]
- Hartsperger ML, Strache R, Stumpflen V. HiNO: an approach for inferring hierarchical organization from regulatory networks. PLoS One. 2010;5(11):e13698. doi: 10.1371/journal.pone.0013698. [ http://dx.doi.org/10.1371/journal.pone.0013698] [PMC free article] [PubMed] [Cross Ref]
- Bhardwaj N, Kim PM, Gerstein MB. Rewiring of transcriptional regulatory networks: hierarchy, rather than connectivity, better reflects the importance of regulators. Sci Signal. 2010;3(146):ra79. doi: 10.1126/scisignal.2001014. [ http://dx.doi.org/10.1126/scisignal.2001014] [PubMed] [Cross Ref]
- Bhardwaj N, Yan KK, Gerstein MB. Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels. Proc Natl Acad Sci U S A. 2010;107(15):6841–6846. doi: 10.1073/pnas.0910867107. [ http://dx.doi.org/10.1073/pnas.0910867107] [PubMed] [Cross Ref]
- Latchman SD. Transcription factors: an overview. Int J Biochem and Cell Biol. 1997;29(12):1305–1312. doi: 10.1016/S1357-2725(97)00085-X. [ http://www.sciencedirect.com/science/article/pii/S135727259700085X] [PubMed] [Cross Ref]
- Garfinkel RS, Nemhauser GL. Integer programming. New York: John Wiley & Sons; 1972.
- Balaji S, Babu MM, Iyer LM, Luscombe NM, Aravind L. Comprehensive analysis of combinatorial regulation using the transcriptional regulatory network of yeast. J Mol Biol. 2006;360:213–227. doi: 10.1016/j.jmb.2006.04.029. [ http://dx.doi.org/10.1016/j.jmb.2006.04.029] [PubMed] [Cross Ref]
- Balaji S, Iyer LM, Aravind L, Babu MM. Uncovering a hidden distributed architecture behind scale-free transcriptional regulatory networks. J Mol Biol. 2006;360:204–212. doi: 10.1016/j.jmb.2006.04.026. [ http://dx.doi.org/10.1016/j.jmb.2006.04.026] [PubMed] [Cross Ref]
- Gama-Castro S, Salgado H. et al. RegulonDB version 7.0: transcriptional regulation of escherichia coli K-12 integrated within genetic sensory response units (Gensor Units) Nucleic Acids Res. 2011;39(Database issue):D98—105. [ http://dx.doi.org/10.1093/nar/gkq1110] [PMC free article] [PubMed]
- Harbison CT, Gordon DB. et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431(7004):99–104. doi: 10.1038/nature02800. [ http://dx.doi.org/10.1038/nature02800] [PMC free article] [PubMed] [Cross Ref]
- Horak CE, Luscombe NM, Qian J, Bertone P, Piccirrillo S, Gerstein M, Snyder M. Complex transcriptional circuitry at the G1/S transition in saccharomyces cerevisiae. Genes Dev. 2002;16(23):3017–3033. doi: 10.1101/gad.1039602. [ http://dx.doi.org/10.1101/gad.1039602] [PubMed] [Cross Ref]
- Jiang C, Xuan Z, Zhao F, Zhang MQ. TRED: a transcriptional regulatory element database, new entries and other development. Nucleic Acids Res. 2007;35(Database issue):D137—D140. [ http://dx.doi.org/10.1093/nar/gkl1041] [PMC free article] [PubMed]
- Lee TI, Rinaldi NJ. et al. Trascriptional regulatory networks in saccharomyces cerevisiae. Science. 2002;298(5594):799–804. doi: 10.1126/science.1075090. [ http://dx.doi.org/10.1126/science.1075090] [PubMed] [Cross Ref]
- Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature. 2004;431(7006):308–312. doi: 10.1038/nature02782. [ http://dx.doi.org/10.1038/nature02782] [PubMed] [Cross Ref]
- Svetlov VV, Cooper TG. Review: compilation and characteristics of dedicated transcription factors in saccharomyces cerevisiae. Yeast. 1995;11(15):1439–1484. doi: 10.1002/yea.320111502. [ http://dx.doi.org/10.1002/yea.320111502] [PubMed] [Cross Ref]
- Teichmann SA, Babu MM. Gene regulatory network growth by duplication. Nat Genet. 2004;36(5):492–496. doi: 10.1038/ng1340. [ http://dx.doi.org/10.1038/ng1340] [PubMed] [Cross Ref]
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556. [PMC free article] [PubMed] [Cross Ref]
- Zhang R, Lin Y. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 2009;37(Database issue):D455—D458. [ http://dx.doi.org/10.1093/nar/gkn858] [PMC free article] [PubMed]
- Galko MJ, Krasnow MA. Cellular and genetic analysis of wound healing in drosophila larvae. PLoS Biol. 2004;2(8):E239. doi: 10.1371/journal.pbio.0020239. [ http://dx.doi.org/10.1371/journal.pbio.0020239] [PMC free article] [PubMed] [Cross Ref]
- Jeong H, Mason SP, Barabasi AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411(6833):41–42. doi: 10.1038/35075138. [ http://dx.doi.org/10.1038/35075138] [PubMed] [Cross Ref]
- Dang CV, Resar LM, Emison E, Kim S, Li Q, Prescott JE, Wonsey D, Zeller K. Function of the c-Myc Oncogenic Transcription Factor. Experimental Cell Research. 1999;253:63–77. doi: 10.1006/excr.1999.4686. [ http://www.sciencedirect.com/science/article/pii/S0014482799946864] [PubMed] [Cross Ref]
- Vousden HK. p53: death star. Cell. 2000;103(5):691–694. doi: 10.1016/S0092-8674(00)00171-9. [ http://www.sciencedirect.com/science/article/pii/S0092867400001719] [PubMed] [Cross Ref]
- Schuler M, Green DR. Mechanisms of p53-dependent apoptosis. Biochem Soc Trans. 2001;29(Pt 6):684–688. [PubMed]
- Kauer M, Ban J, Kofler R, Walker B, Davis S, Meltzer P, Kovar H. A molecular function map of Ewing’s Sarcoma. Plos One. 2009;4(4):e5415. doi: 10.1371/journal.pone.0005415. [ http://dx.doi.org/10.1371/journal.pone.0005415] [PMC free article] [PubMed] [Cross Ref]
- Owen LA, Kowalewski AA, Lessnick SL. EWS/FLI mediates transcriptional repression via NKX2.2 during oncogenic transformation in Ewing’s Sarcoma. PLoS ONE. 2008;3(4):e1965. doi: 10.1371/journal.pone.0001965. [ http://dx.plos.org/10.1371] [PMC free article] [PubMed] [Cross Ref]
- Riggi N, Stamenkovic I. The biology of Ewing sarcoma. Cancer Lett. 2007;254:1–10. doi: 10.1016/j.canlet.2006.12.009. [ http://www.sciencedirect.com/science/article/pii/S0304383506006811] [PubMed] [Cross Ref]
- Sun N, Carroll RJ, Zhao H. Bayesian error analysis model for reconstructing transcriptional regulatory networks. Proc Natl Acad Sci U S A. 2006;103(21):7988–7993. doi: 10.1073/pnas.0600164103. [ http://dx.doi.org/10.1073/pnas.0600164103] [PubMed] [Cross Ref]
- Bandyopadhyay N, Somaiya M, Kahveci T, Ranka S. Modeling perturbations using gene networks. Proceedings of Life Sciences Society Computational Systems Bioinformatics (CSB) 2010. pp. 26–37. [ http://www.lifesciencessociety.org/CSB2010/toc/26.2010.html]
- Bandyopadhyay N, Somaiya M, Ranka S, Kahveci T. Identifying differentially regulated genes. Computational Advances in Bio and Medical Sciences (ICCABS), 2011 IEEE 1st International Conference on. 2011. pp. 19–25. [ http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5729877&tag=1]
- Mcsweeney PJ, Ashkenazi M, States D. Random network plugin. Google Summer of Code. 2008. p. 2008. [ https://sites.google.com/site/randomnetworkplugin/Home]
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 1999;27:29–34. doi: 10.1093/nar/27.1.29. [PMC free article] [PubMed] [Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of **BioMed Central**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |