The goal of this study was to infer and characterize statistical epistasis networks in a large population-based study of bladder cancer susceptibility. We observed distinctive topologies in the networks assembled from the cancer data, which imply that groups of SNPs may jointly modify the disease outcome. Specifically, the networks Gt had many more high-degree vertices than expected, and their largest connected components emerged earlier and grew faster than expected. These characteristics were most apparent when t = 0.013. The network G0.013 was shown to be approximately scale-free, an important property found in various natural and social networks. This property was no longer observable when t decreased further and edges representing weaker, and possibly less biologically relevant, pairwise interactions were added.
The network G0.013 allows for some interesting observations about the structure of the pairwise interaction space of the genetic data. First, SNPs aggregate to form connected components, which may indicate that multiple SNPs jointly modify disease outcome. In G0.013, SNPs are grouped into 79 connected components of size ranging from 2 to 39. These connected components show various structural patterns, also known as motifs, including lines, crosses, and stars. The largest connected component has a tree-like structure. This may imply the existence of unique interaction patterns among groups of SNPs.
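The way SNPs aggregate into connected components at a given threshold can be illustrated with a minimal sketch. It assumes that each SNP pair carries a single interaction-strength score and that Gt keeps the pairs whose score is at least t; the function names, SNP labels, and scores below are hypothetical and are not taken from the study data.

```python
from collections import defaultdict


def build_threshold_network(pair_scores, t):
    """Return adjacency sets for G_t: keep SNP pairs whose score is >= t."""
    adjacency = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if score >= t:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def connected_components(adjacency):
    """Return the connected components as vertex sets, largest first."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting every vertex reachable from start.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            v = stack.pop()
            component.add(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Made-up interaction scores for illustration only.
toy_scores = {
    ("snp1", "snp2"): 0.020,
    ("snp2", "snp3"): 0.015,
    ("snp4", "snp5"): 0.014,
    ("snp1", "snp4"): 0.009,
    ("snp6", "snp7"): 0.004,
}

for t in (0.013, 0.008):
    comps = connected_components(build_threshold_network(toy_scores, t))
    print(t, [len(c) for c in comps])
```

In this toy example, lowering t from 0.013 to 0.008 adds one weaker edge that merges the two components into one, which mirrors how the largest connected component grows as the threshold decreases.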
Second, the network has an approximately scale-free topology and an ensemble of particularly high-degree vertices, which suggests that it may be exceptionally robust. Scale-free networks permeate natural and social sciences [47-49]. The most well-known scale-free networks are the backbone of the Internet and social networks. In biology, scale-free topologies have been found in metabolic networks [31], protein-protein interaction networks [33], and gene-regulatory networks [34]. These scale-free networks share an intriguing property: the value of γ in the degree distribution p(d) = c × d^(-γ) mostly satisfies 2 ≤ γ ≤ 3 [47], which is also the case for G0.013 (γ = 2.01). As more scale-free networks are being discovered in a variety of fields, a question remains: how can systems as fundamentally different as the cell and the Internet have a similar architecture and obey the same laws [47]? Scale-free networks typically have many vertices with low degrees and a few vertices with high degrees, also known as hubs [30]. This essentially differentiates scale-free networks from random networks, where the majority of vertices have degrees close to the average. The degrees of a random network follow a Poisson distribution, in which the probability p(d) of degree d decreases exponentially as d increases, and thus random networks are very unlikely to have hubs with degrees much larger than the average. The existence of hubs in a scale-free network implies strong robustness against failures. Because random vertex removal is very unlikely to affect hubs, the connectivity of the network most likely remains intact. In biological networks, this robustness translates into the resilience of organisms to intrinsic and environmental perturbations. For instance, in protein-protein interaction networks [33], most proteins interact with only one or two other proteins, but a few interact with a large number. Such hub proteins are rarely affected by mutations, and organisms can remain functional under most perturbations. The simultaneous emergence of scale-free topologies in many biological networks suggests that evolution has favored such a structure in natural systems. Moreover, it suggests that the robustness of natural systems results not only from inherent genetic redundancy but also, and maybe more importantly, from the topological organization of entities and interactions [33]. Although our epistasis network is built from statistical rather than real biochemical interactions, it is interesting to observe similar topologies in biological and statistical networks.
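The contrast between the two degree distributions can be made concrete with a small numerical sketch. It compares a power law p(d) = c × d^(-γ), using γ = 2.01 as reported for G0.013, with a Poisson distribution of the same mean degree. The truncation of the power law at a maximum degree of 100 and the choice of d = 20 as a "hub" degree are assumptions made purely for illustration.

```python
import math


def power_law_pmf(d, gamma, d_max):
    """p(d) = c * d**(-gamma), normalized over degrees 1..d_max."""
    c = 1.0 / sum(k ** -gamma for k in range(1, d_max + 1))
    return c * d ** -gamma


def power_law_mean(gamma, d_max):
    """Mean degree of the truncated power-law distribution."""
    return sum(k * power_law_pmf(k, gamma, d_max) for k in range(1, d_max + 1))


def poisson_pmf(d, mean):
    """Probability of degree d under a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** d / math.factorial(d)


GAMMA = 2.01   # exponent reported for G0.013
D_MAX = 100    # assumed maximum degree, for illustration only
HUB = 20       # assumed "hub" degree, for illustration only

mean_degree = power_law_mean(GAMMA, D_MAX)
p_scale_free = power_law_pmf(HUB, GAMMA, D_MAX)
p_random = poisson_pmf(HUB, mean_degree)

print(f"mean degree: {mean_degree:.2f}")
print(f"P(degree = {HUB}), power law: {p_scale_free:.3e}")
print(f"P(degree = {HUB}), Poisson:   {p_random:.3e}")
```

With matched mean degrees, a vertex of degree 20 is several orders of magnitude more likely under the power law than under the Poisson distribution, which is the sense in which random networks are unlikely to contain hubs.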
Third, the existence of main effects does not necessarily correlate with the occurrence of interactions. This, in turn, suggests that many current methods that prioritize main effects might have overlooked SNPs that contribute to disease susceptibility through their interactions with other SNPs rather than through their main effects. As shown in the graph, SNPs with large main effects are not necessarily involved in strong pairwise interactions, nor do they necessarily interact with many other SNPs. Instead, SNPs involved in potentially important pairwise interactions, such as those located on the central path of the largest connected component, often have relatively small main effects.
The statistical epistasis network approach has many advantages. 1) Networks allow for efficiently visualizing both main and epistatic effects and how they interplay. 2) Networks serve as a very intuitive tool to study pairwise interactions and to characterize the entire epistatic interaction space. Moreover, they may also help identify higher-order interactions by grouping SNPs into connected components. High-order epistasis does not necessarily require detectable pairwise interactions between SNPs. However, given that current computational power allows only for exhaustively enumerating pairwise interactions in moderate-size data sets, pairwise interaction networks may serve as a useful guide to explore higher-order epistasis among SNPs that exhibit lower-order interactions. 3) Our network model is assembled using the entire set of available SNPs without limiting ourselves to only high main-effect ones. This reduces the risk of overlooking candidate SNPs that are involved in important interactions but have low main effects. 4) Network topological analyses are used to systematically determine the best network that captures the genetic architecture of a data set. 5) Network science and graph theory are well-developed fields, and many advanced techniques and analytical tools are likely to benefit future network-based epistasis studies. In particular, additional topological properties such as motif distribution and network diameter [30, 42] are worth investigating.
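As one example of such a topological property, the diameter of a connected component (the longest shortest path between any two of its vertices) can be computed with a breadth-first search from every vertex. The sketch below is a generic illustration on a made-up tree-like component; the adjacency structure and vertex labels are hypothetical and do not represent the study's network.

```python
from collections import deque


def eccentricity(adjacency, source):
    """Longest shortest-path distance (in edges) from source to any reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())


def component_diameter(adjacency, component):
    """Diameter of one connected component: the largest eccentricity among its vertices."""
    return max(eccentricity(adjacency, v) for v in component)


# Made-up tree-like component: a path a-b-c-d with an extra leaf e attached to b.
toy_adjacency = {
    "a": {"b"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"b"},
}

print(component_diameter(toy_adjacency, toy_adjacency.keys()))
```

Running one breadth-first search per vertex is quadratic in the component size, which is acceptable for components of a few dozen vertices.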
Among the limitations of this approach is that it is still under development and no user-friendly interface is available yet. Different data sets may require different analytical tools, and fully automated analysis software may therefore not be appropriate. Moreover, since the approach aims at highlighting pairs of SNPs with strong pairwise interactions, it is likely to overlook SNPs that are only involved in higher-order interactions. As mentioned previously, strong three-way or higher-order interactions may exist despite the absence of pairwise interactions.
The statistical epistasis network approach we used can be extended in the following ways. 1) The network G0.013 will be further studied for bladder cancer association. Through a closer investigation of the 319 SNPs in the network, especially the 39 SNPs in the largest connected component, using resources such as gene ontologies and biological pathways, we expect to prioritize gene categories with high bladder cancer susceptibility and to test whether SNP interactions tend to occur within the same category or across categories. Other possible applications include using the network to train classifiers for predicting bladder cancer risk [50] and to guide data mining methods for identifying high-order genetic interactions [27]. 2) The approach can also be applied to other data sets. We are particularly interested in investigating network topologies in larger data sets or data associated with other diseases. 3) To corroborate the present results, future studies could use metrics other than information-theoretic measures, such as SNP and gene annotation or SURF scores, which are obtained by directly assessing genetic variants according to their phenotype relevance using machine learning techniques [51]. 4) Given the effect of smoking [37] and arsenic exposure [41, 52] on bladder cancer prevalence, an additional next step is to account for gene-environment interactions in our analyses. This can be achieved by adding these environmental factors to our model and investigating how the environmental background in which the genes are expressed modifies the conclusions we draw.