Dynamic network clustering is an increasingly important problem across diverse disciplines. Our algorithm optimizes the likelihood modularity, which is asymptotically consistent (Bickel and Chen, 2009
). Other machine learning and physics approaches are based on probabilistic graphical models such as Latent Dirichlet Allocation (LDA, Airoldi et al., 2008
; Ball et al., 2011
; Blei et al., 2003
). Dynamic extensions have been proposed (Fu et al., 2009
), but prior to our work have been impractical except for very small networks with around 100 vertices and under 10 latent classes. Even efficient variational methods such as VBM (Hofman and Wiggins, 2008
) have scaling that is far worse than a near linear or at least quadratic run time in the number of nodes and edges.
Our DHAC algorithm scales as O
), the same as HAC (Park and Bader, 2011
), with a constant prefactor for the number of time points. This provides an excellent trade-off for genome-scale problems. Networks considered here with 2000 vertices required about 5 min on a single 2 GHz processor. A full-genome network with 10 000 to 100 000 vertices could be analyzed in a day to a week on single processor, but in practice would be much faster because each time point could be run in parallel.
The cluster matching algorithm MATCH-EM is a second contribution that provides a solution to the general problem of tracing the evolution of a set of groups or clusters over time. It generalizes a previous belief propagation method for bipartite matching (Bayati et al., 2006
). The bipartite max-product algorithm is exact with a worst-case run-time of O
) for K
classes. Our generalization has an additional linear factor of the number of time points. While it is no longer guaranteed to converge to the exact solution, for biological networks here it converges rapidly with good results.
Our methods applied to real biological data provide new insight. Many transcription time course experiments reveal waves of correlated gene expression, with no standard methods to parse a large set of correlated genes into well-defined protein complexes. The DHAC method is a general solution to this problem and provides a multi-resolution view of dynamic expression and organization of the proteome. Focusing on specific predicted complexes reveals possible mechanisms of regulation and control. Our analysis of the yeast metabolic cycle identifies protein complexes with asynchronous gene expression, which suggests RSM22 as an RNA methyltransferase whose early expression may be required to assemble and stabilize the mitochondrial ribosome.
Our methods permit proteins to switch between complexes over time, which we see in the dynamics of the nuclear pore. Hierarchical methods like DHAC also provide a natural multi-scale description of complexes, subcomplexes and proteins. A separate challenge is introducing mixed membership, with the same protein serving as a subunit in two distinct protein complexes (Palla et al., 2005
Several improvements to DHAC are possible. Previous work showed that the hierarchical structure inferred from static networks corresponds to levels of biological organization, pathway to complex to subcomplex and the fine structure underneath a collapsed complex can also be used to improve link prediction (Park and Bader, 2011
). In the current work, however, we lacked a method to match the dynamically evolving hierarchical structure across snapshots. Consequently the focus here is on the bottom-level clusters rather than the hierarchical structure.
This work assumes that the population average transcription data is a good representation for the transcriptional state of each cell. In reality, individual cells may differ from the mean. In the yeast metabolic cycle, for example, about half of the cells undergo cell division per metabolic cycle, potentially yielding two distinct cell populations. More advanced methods have been proposed to increase resolution and drive toward single-cell models (Baym et al., 2008
Direct measurements of protein abundance through quantitative mass spectrometry could improve the analysis and would be intriguing to combine with expression data. For transcription data, protein abundance may be better estimated by a transcription-translation model,
, where R
) is the measured transcriptional abundance, P
) is the abundance of the corresponding protein and β and α are production and degradation rates. This model generates exponentially weighted smoothing of protein abundance, similar to the exponential kernel we used for smoothing. Since the exponential smoothing kernel already works well, we anticipate that results should be robust to choices of β and α, with the possibility of using consensus values for most proteins.