|Home | About | Journals | Submit | Contact Us | Français|
Modularity is one of the most prominent properties of real-world complex networks. Here, we address the issue of module identification in two important classes of networks: bipartite networks and directed unipartite networks. Nodes in bipartite networks are divided into two nonoverlapping sets, and the links must have one end node from each set. Directed unipartite networks only have one type of node, but links have an origin and an end. We show that directed unipartite networks can be conveniently represented as bipartite networks for module identification purposes. We report on an approach especially suited for module detection in bipartite networks, and we define a set of random networks that enable us to validate the approach.
Units in physical, chemical, biological, technological, and social systems interact with each other defining complex networks that are neither fully regular nor fully random [1-3]. Among the most prominent and ubiquitous properties of these networks is their modular structure [2,4], that is, the existence of distinct groups of nodes with an excess of connections to each other and fewer connections to other nodes in the network.
The existence of modular structure is important in several regards. First, modules critically affect the dynamic behavior of the system. The modular structure of the air transportation system , for example, is likely to slow down the spread of viruses at an international scale  and thus somewhat minimize the effects of high-connectivity nodes that may otherwise function as “super-spreaders” [7,8]. Second, different modules in a complex modular network can have different structural properties . Therefore, characterizing the network using only global average properties may result in the misrepresentation of the structure of many, if not all, of the modules. Finally, the modular structure of networks is likely responsible for at least some of the correlations (e.g., degree-degree correlations [10-14]), that have attracted the interest of researchers in recent years .
For the above reasons, considerable attention has been given to the development of algorithms and theoretical frameworks to identify and quantify the modular structure of networks (see , and references therein). However, current research activity has paid little attention, except for a few studies in sociology [16,17], to the problem of identifying modules in a special and important class of networks known as bipartite networks (or graphs). Nodes in bipartite networks are divided into two nonoverlapping sets, and the links must have one end node from each set. Examples of systems that are more suitably represented as bipartite networks include the following:
Another important class of networks for which no sound module identification methods are available are unipartite directed networks. Examples of directed unipartite networks include the following:
The usual approach to identify modules in directed networks is to disregard the directionality of the connections, which will fail when different modules are defined based on incoming and outgoing links.
Here, we address the issue of module identification in complex bipartite networks. We start by reviewing the approaches that are currently used heuristically and aprioristically to solve this problem. We then suggest an approach especially suited for module detection in bipartite networks, and define a set of random networks that permit the evaluation of the accuracy of the different approaches. We then discuss how it is possible to use the same formalism to identify modules in directed unipartite networks. Our method enables one to independently identify groups of nodes with similar outgoing connections and groups of nodes with similar incoming connections.
For simplicity, from now on we denote the two sets of nodes in the bipartite network as the set of actors and the set of teams, respectively. Given a bipartite network, we are interested in identifying groups (modules) of actors that are closely connected to each other through coparticipation in many teams. Of course, one is free to select which set of nodes in a given network is the “actor set” and which one is the “team set,” so one can identify modules in either or both sets of nodes.
We require any module-identification algorithm to fulfill two quite general conditions: (i) the algorithm needs to be network independent; and (ii) given the list of links in the network, the algorithm must determine not only a good partition of the nodes into modules, but also the number of modules and their sizes.
The first condition is somewhat trivial. We just make it explicit to exclude algorithms that are designed to work with a particular network or family of networks, but that will otherwise fail with broad families of networks (for example, large networks or sparse/dense networks).
The second condition is much more substantial, as it makes clear the difference between the module-identification problem and the graph partitioning problem in computer science, in which both the number of groups and the sizes of the groups are fixed. To use a unipartite network analogy, given a set of 120 people attending a wedding and information about who knows whom, the graph partitioning problem is analogous to optimally setting 12 tables with 10 people in each table. In contrast, the module-identification problem is analogous to identifying “natural” groups of people, for example, the different families or distinct groups of friends.
The second condition also excludes algorithms (based, for example, on hierarchical clustering or principal component analysis ) that project network data into some low-dimensional space without specifying the location of the boundaries separating the groups. For example, given a dendogram generated using hierarchical clustering, one still needs to decide where to “cut it” in order to obtain the relevant modules. To be sure, one can propose a combination of algorithms that first project the data into some low-dimensional space and then set the boundaries, and assess the accuracy of the method. In general, however, one cannot evaluate the performance of hierarchical clustering, given that hierarchical clustering does not provide a single solution to module-identification problem. Neither can one test the infinite combinations of dimensionality reduction algorithms with techniques for the actual selection of modules.
Freeman  has recently compiled a collection of 21 algorithms that have been used in the social networks literature to identify modules in bipartite networks. To the best of our understanding none of the algorithms described there satisfies the two conditions above. Among the statistical physics community, on the other hand, the common practice is to project the bipartite network onto a unipartite actors' network, and then identify modules in the projection. In the scientists' projection of a scientific publication network, for example, two scientists are connected if they have coauthored one or more papers. The caveat of this approach is that, even if the projection is weighted (by, for example, the number of papers coauthored by a pair of scientists), some information of the original bipartite network, like the sizes of the teams, is lost in the projection. Here, we suggest an alternative to existing approaches to identify modules in complex bipartite networks.
A widely used and quite successful method for the identification of modules in unipartite networks is the maximization of a modularity function. Although this method has limitations [33-35], it yields the most accurate results reported in the literature for a wide family of random networks with prescribed modular structure [15,36,37].
In the same spirit, here we define a modularity function that, upon optimization, yields a partition of the actors in a bipartite network into modules. By doing this, the module identification problem becomes a combinatorial optimization problem that is analogous to the identification of the ground state of a disordered magnetic system [38,39].
A ubiquitous modularity function for unipartite networks is the Newman-Girvan modularity . The rationale behind this modularity is that, in a modular network, links are not homogeneously distributed. Thus, a partition with high modularity is such that the density of links inside modules is significantly higher than the random expectation for such density. Specifically, the modularity () of a partition of a network into modules is
where NM is the number of modules, L is the number of links in the network, ls is the number of links between nodes in module s, and ds is the sum of the degrees of the nodes in module s. Then ls/L is the fraction of links inside module s, and (ds/2L)2 is an approximation (assuming that self-links and multiple links between nodes are allowed) to the fraction of links one would expect to have inside the module from chance alone.
We define a modularity () that can be applied to identify modules in bipartite networks. We start by considering the expected number of times that actor i belongs to a team comprised of ma actors,
where ti is the total number of teams to which actor i belongs. Similarly, the expected number of times that two actors i and j belong to team a is
Therefore, the average number of teams in which i and j are expected to be together is
where we have used the identity Σama=Σktk. Note that Σama(ma−1) and (Σama)2 are global network properties, which do not depend on the pair of actors considered.
Equation (4) enables us to define the bipartite modularity as the cumulative deviation from the random expectation
where cij is the actual number of teams in which i and j are together. For convenience, we exclude the irrelevant diagonal term i=j from the sums , and normalize the modularity so that B→ 1 when (i) all actors in each team belong to a single module [ΣsΣi≠jscij=Σama(ma−1)], and (ii) the random expectation for pairs of nodes being in the same team is small [ΣsΣi≠jstitj (Σama)2].
As in the derivation of Eq. (1), the null model implicit in Eqs. (2) and (3) is such that one could, in principle, have multiple connections between an actor and a team. In most cases this situation would not make sense, so the null model is only appropriate when ma and ti are much smaller than Σama, for all a and all i.
Ensembles of random networks with prescribed modular structure  enable one to assess algorithm's performance quantitatively, and thus to compare the performance of different algorithms. Here, we introduce an ensemble of random bipartite networks with prescribed modular structure (Fig. 1).
We start by dividing the actors into NM of modules; each module s comprises Ss nodes. For clarity, we use different “colors” for different modules. The network is then created assuming that actors that belong to the same module have a higher probability of being together in a team than actors that belong to different modules . Specifically, we proceed by creating NT teams as follows:
We next investigate the performance of different module identification algorithms in both model networks with pre-defined modular structure, and in a simple real network that shows some interesting features.
We consider three approaches for the identification of modules in bipartite networks. First, we consider the un-weighted projection (UWP) approach. Within this approach, we start by building the projection of the bipartite network into the actors space. Then we consider the projection as a regular unipartite network and use the modularity given in Eq. (1).
Next, we consider the weighted projection (WP) approach. Within this approach, we start by building the weighted projection of the bipartite network. In the weighted projection, actors are connected if they are together in one or more teams, and the weight wij of the link indicates the number of teams in which the two actors are together (thus, wij=cij). We then use the simplest generalization to weighted networks of the modularity in Eq. (1)
where W = Σi≥jwij, is the sum of the weights of the links within module s, and .
Finally, we consider the bipartite (B) approach. Within this approach, we consider the whole bipartite network and use the modularity introduced in Eq. (5).
In all cases, we maximize the modularity using simulated annealing . Several alternatives have been suggested to maximize the modularity including greedy search , extremal optimization , and spectral methods [44,45]. In general, there is a trade-off between accuracy and execution time, with simulated annealing being the most accurate method , but at present too slow to deal properly with networks comprising hundreds of thousands or millions of nodes.
We consider the performance of the different module identification approaches when applied to the model bipartite networks described above. We assess the performance of an algorithm by comparing the partitions it returns to the predefined group structure. Specifically, we use the mutual information IAB  between partitions A and B to quantify the performance of the algorithms
Here, S is the total number of nodes in the network, is the number of modules in partition A, is the number of nodes in module i of partition A, and is the number of nodes that are in module i of partition A and in module j of partition B. The mutual information between partitions A and B is 1 if both partitions are identical, and 0 if they are uncorrelated.
In the simplest version of the model all modules have the same number of nodes, all teams have the same size, and the color of each team is set assuming equal probability for each color. Unless otherwise stated, we build networks with NM=4 modules, each of them comprising 32 actors, and NT=128 teams of size m=14.
We first investigate how team homogeneity p affects algorithm performance. For p=1, all the actors in a team belong to the same module, and any reasonable algorithm must perfectly identify the modular structure of the network; thus I=1. Conversely, for p=0, actors are perfectly mixed in teams, and all algorithms will return random partitions due to small fluctuations ; thus I=0. Any p>0 will provide a signal that an algorithm can, in principle, extract.
As shown in Fig. 2(a), the UWP approach performs systematically and significantly worse than the weighted projection and the bipartite algorithms for all values of p. For the choice of parameters described above, the last two algorithms start to be able to identify the modular structure of the network for p≈0.35. For p≥0.5, one already finds I>0.9. The WP and the B approaches yield indistinguishable results.
Team homogeneity is not the only parameter affecting algorithm performance. For example, the number of teams NT in the network critically affects the amount of information available to an algorithm. Interestingly, the number of teams affects in different ways the UWP approach on the one hand, and the WP and B approaches on the other; Fig. 2(b). For the WP and B algorithms, the larger NT, the larger the amount of information and, therefore, the easier the problem becomes. Indeed, even for very small values of p, the signal-to-noise ratio can become significantly greater than 1 if NT is large enough. On the contrary, as the number of teams increases the UWP becomes denser and denser and eventually becomes a fully connected graph, from which the algorithm cannot extract any useful information. Once more, the performance of the WP and the B approaches are indistinguishable.
In real networks, modules will have (sometimes dramatically) different sizes . Given the sizes of the modules in a network, and assuming that they are ordered so that S1≥S2≥…≥SNM, we define h as the ratio of sizes between consecutive modules (with integer rounding)
Additionally, we select the color of the teams with probabilities proportional to the size of the corresponding module, so that all actors participate, on average, in the same number of teams.
As we show in Fig. 2(c), we again observe that the WP and the B approach perform similarly, and clearly outperform the UWP approach for all values of h.
All the results so far suggest that the WP approach and the B approach yield results that are indistinguishable from each other. We know, however, that differences do exist between both. The distribution of team sizes, in particular, is taken into account in the B approach but disregarded in the WP approach, and “teams” with m=1 are totally disregarded in projection-based approaches, but not in the B approach.
We thus investigate what is the effect of the team size distribution on the performance of the algorithms. Instead of considering that all teams have the same size m, we now consider a distribution p(m) of team sizes. In particular, we consider a (displaced) geometric distribution
which is the discrete counterpart of the exponential distribution. The distribution has mean m=μ.
As we show in Fig. 2(d), some small differences seem to appear between the WP approach and the B approach, although it is difficult to establish conclusively if these differences are significant or not.
In light of this, we investigate in more depth the relationship between the bipartite modularity in Eq. (5) and the weighted extension of the unipartite modularity in Eq. (1). As we show in the Appendix, the bipartite modularity actually reduces to the weighted unipartite modularity (up to an irrelevant additive constant) when all teams in the bipartite network have the same size.
This observation explains why the WP and the B approach differ when teams have unequal sizes . Although our results suggest that each approach outperforms the other in certain cases, we believe that Eq. (5) is, in general, preferable because it explicitly takes into account the distribution of team sizes, while the weighted projection does not.
During the 1930s, ethnographers Davis, Stubbs Davis, St. Clair Drake, Gardner, and Gardner collected data on social stratification in the town of Natchez, Mississippi [32,47]. Part of their field work consisted in collecting data on women's attendance to social events in the town. The researchers later analyzed the resulting women-event bipartite network in light of other social and ethnographic variables. Since then, the dataset has become a de facto standard for discussing bipartite networks in the social sciences .
Here we analyze the modules of both women and events. We start by considering the unweighted projection of the network in the women's space (two women are connected if they co-attended at least one event), and in the events' space (two events are connected if at least one woman was in both events). As we show in Fig. 3(a), the unweighted projection does not capture the true modular structure of the network. The failure of this approach is due to the fact that the projections are very dense. For example, some central events were attended by most women and thus most pairs of women are connected in the projection.
As we show in Fig. 3(b), the weighted projection approach and the bipartite approach yield the exact same results, which do capture the two-module structure of the network. Except for one woman, the partition coincides with the original subjective partition proposed by the ethnographers who collected the data, and is in perfect agreement with some of the supervised algorithms reviewed in Ref. .
Another important class of networks for which no satisfactory module identification algorithm has so far been proposed is directed unipartite networks. In order to tackle this class of networks, we note that directed networks can be conveniently represented as bipartite networks where each node i is represented by two nodes Ai and Bi. A directed link from i to j would be represented in the bipartite network as an edge connecting Ai to Bj.
Consider, for example, a network in which nodes are companies and links represent investments of one company into another. By considering each company as two different objects, one that makes investments and one that receives investments, the directed network can be represented as an undirected bipartite network. Modules in the set of objects that make investments correspond to groups of companies that invest in the same set of companies, that is, groups of companies with a similar investing strategy.
The most widely used approach to identify communities in directed networks is to simply disregard the directionality of the links and identify modules using a method suitable for undirected unipartite networks. This method might work in some situations, but will fail when different modules are defined based on incoming and outgoing links.
Consider, for instance, the simple model network depicted in Figs. 4(a) and 4(b). According to the outgoing links of the nodes this network has two modules: nodes 1–12 and nodes 13–24. According to the incoming links of the nodes the network has also two modules, but they are different: nodes 1–6 and 13–18 on the one hand, and nodes 7–12 and 19–24 on the other. As we show in Fig. 4(c), a layout of the corresponding bipartite network already makes clear the modular structure of the network, and any of the approaches described above (UWP, WP, and B) is able to identify the in-modules and out-modules correctly; Fig. 4(d). Disregarding the direction of the links, however, results in modules that fail to capture the modular structure of the network; Fig. 4(e).
In this work, we have focused on approaches that aim at identifying modules in each of the two sets of nodes in the bipartite network independently. There are two main reasons for this choice. First, methodologically our choice enables comparison with projection-based algorithms, which, by definition, cannot identify modules of actors and teams simultaneously. Second, in most situations it is reasonable to assume that two actors belong to the same module if they coparticipate in many teams, regardless of whether the teams themselves belong to the same module or not. An alternative approach, however, would be to group nodes in both sets at the same time.
Another interesting observation relates to the optimization algorithm used to maximize the modularity. Although we have chosen to use simulated annealing to obtain the best possible accuracy [15,36,37], one can trivially use the modularity introduced in Eq. (5) with faster algorithms such as greedy search  or extremal optimization .
where gis=1 if node i belongs to module s and 0 otherwise, and the elements of the modularity matrix B are defined as
Even more importantly, by sampling all local maxima of the modularity in Eq. (5) one can study, not only the most modular partition of the network, but the hierarchical structure of nested modules and submodules  within each set of nodes in the bipartite network. This is particularly relevant taking into account that the most modular partition of a network may, in some cases, not represent the most “relevant” division of its nodes [33,34].
Finally, a few words are necessary on the comparison between the different approaches. First, we have shown that the (so far “default”) unweighted projection approach is not reliable and can lead, in most situations, to incorrect results. Therefore, we believe that this approach should not be used. As for the weighted projection approach and the bipartite approach, we have shown that their performance is very similar, and that they are actually equivalent when all teams in the bipartite network have the same size. We have also pointed out, however, that they can and do give noticeably different results when team sizes are not uniform. Given this, we believe that the bipartite approach has a more straight-forward interpretation and would be preferable in cases in which the modular structure of the network is unknown.
The authors thank R. D. Malmgren, E. N. Sawardecker, S. M. D. Seaver, D. B. Stouffer, M. J. Stringer, and especially M. E. J. Newman and E. A. Leicht for useful comments and suggestions. One of the authors (L.A.N.A.) gratefully acknowledges the support of a NIH/NIGMS K-25 grant, NSF Grant No. SBE 0624318, the J. S. McDonnell Foundation, and the W. M. Keck Foundation.
Next, we demonstrate that, when all teams in a bipartite network have the same size m, the bipartite modularity is equivalent to the modularity of the weighted projection.
We consider the usual weighted projection, in which each pair of nodes i ≠ j is connected by a link whose weight wij equals the number of times that i and j are together in a team; using our previous notation wij=cij. No self-links are included in the projection.
In this projection, and when all teams have the same number of actors mam, the constant team-size factors in Eq. (5) become
where, as before, W=Σi≥j wij.
Each time an actor is in a team, the total weight of the links in the projected network increases by (m−1). Using this and the identities above, we obtain
Once the summation over modules is carried out, the last term is simply a constant independent of the partition, and is therefore irrelevant. Thus, up to an irrelevant constant, when all teams in a bipartite network have the same size, the bipartite modularity in Eq. (5) is equivalent to the weighted modularity in Eq. (6).