Representing interaction networks
As is common in current genome-scale informatics, the fundamental object for MINER is the "gene", although this can actually refer variously to gene products such as transcripts, proteins, intergenic (promoter) regions, etc. A network may be formalised as a graph G = (V, E), where each vertex in the set V denotes a gene, and each edge in the set E represents some kind of interaction between genes. Edges may be directed or undirected, and may have labels, e.g., to distinguish between different types of interactions.
Using a machine learning toolkit
MINER uses the WEKA machine learning toolkit [20
] for tree learning. The advantage of a general-purpose machine learning toolkit in the exploratory analysis of genome-scale interaction data is the ease and rapidity with which many different forms of data mining can be performed. For example, it is possible to move quickly from simple visualizations of the data and summary statistics to sophisticated methods such as non-linear multi-variate regression or high-dimensionality kernel-based classifier learning.
Since the predominant mode of analysis in MINER is exploratory rather than hypothesis testing, it is necessary to have powerful methods capable of detecting the faint signals present in noisy data such as microarrays. Although these may increase the risk of Type 1 errors (i.e., false positives, suggesting interactions which in fact have no biological basis), it is understood that any detected interaction will be subject to further analysis by different techniques before they can be accepted. There is also a role in this process for integrating potential interactions with other sources of data to increase confidence. On the positive side there are many advantages in reverse engineering networks by interactively tracing out patterns of influence of genes on other genes using the powerful means of signal detection implemented in machine learning methods.
Non-linear regression of multiple genes on a target using model tree learning subsumes techniques such as correlation-based construction of co-expression matrices. This is important since regulatory relationships may be non-linear. In particular, this representation can learn context-dependent (potentially regulatory) relationships: as an example, we could have that given gene A > 1.3 and gene B < -0.9 then the dependence of genes C, D, and E on target F is given by the linear regression equation F = -0.2 C + 2.3 D + 0.1 E + 0.7. Such context sensitivity has the potential to detect regulatory signals in data that could be missed by simply finding the pairwise correlations of genes A, .., E with target gene F.
Tree learning methods also perform attribute (variable) selection during the learning process, finding a subset of genes implicated in potential regulatory relationships with a target, enabling inspection by a biologist, since typically this represents only a small subset of the whole genome. The potential for overfitting can be controlled by user-driven pruning built into the algorithms. Other learning methods such as high-dimensionality kernel methods can be applied to the same data sets; in this case feature selection can be applied by either pre-processing the data, or post-processing the learned model [28
Network construction from multiple trees
Transforming a set of trees (e.g., see Figure ), each of which encodes a disjunction of conjunctive rules on the conditions (gene expression levels) under which a single target gene is expressed, to a network that captures the combination of regulatory dependencies between multiple genes in a user-friendly way is not straightforward. We adopted a level-wise approach (Figure ). At the first level all the trees learned from the expression data are retained, since they capture the details of the regulatory relationships of genes on their targets. A higher-level network is then constructed by combining the trees at the first level and removing some of the detail. Recall that both levels are expanded only as the user explores the space of target genes.
At the network level, the goal is not to provide the detailed logic of combinations of condition-specific gene regulation, but rather to show the general organisation of regulatory gene interactions. To do this we use the structure of the trees. Parents of terminal (leaf) nodes are more closely linked with their target genes and edges are labelled to denote the principal regulatory effect (e.g., up or down). Edges linking non-terminal (internal) nodes are then added without labels to denote an indirect regulatory interaction. Note that functionally these relationships may be just as important. However, this structures the network and reduces clutter in the visualization. Since all details are retained in the trees at the lower level, no information is lost. An example of such a network is shown in Figure .
Integration of heterogeneous data sources
Gene Ontology: MINER uses a distance measure on the GO annotation of pairs of genes [22
] to evaluate their biological relatedness. This is currently implemented at the level of individual trees, but could be easily incorporated into network edges as well.
Kyoto Encyclopedia of Genes and Genomes (KEGG): each gene appearing in an internal node of a decision tree is annotated with a species-specific URL denoting its entry in the KEGG GENES database. This is then included in the SVG file that displays the decision tree graphically in the browser interface. When the user clicks on a node in the tree, the browser executes a query to open the gene's annotation page and display details of its name, sequence, and other annotation using KEGG's DBGET method.
Other sources of expression data: Since tree induction methods are non-parametric they may be applied to other data sources, as long as they are in a similar format to mRNA expression data, such as data from next-generation sequencing data, proteomics or glycomics. This is because data generated in the form of (absolute or relative) abundances, such as from high-throughput mass spectrometry are similar to microarray data in the sense of being an indirect measure of concentration of gene products or other molecules. However, this is left for future work since so far we have only applied MINER to microarray data.
A large number of methods have been proposed to infer whole gene regulatory networks from gene expression data (reviewed in [13
]). These methods all apply a "one-shot" paradigm that can lack transparency for the end user and does not allow the use of the biologist's domain knowledge. MINER differs from most approaches through its interactivity that allows the user to explore the data and generate testable hypotheses in the process.
Other interactive methods fall into two categories: network visualisation tools that can incorporate some network inference algorithm, and interactive data mining applications.
In the first category, SEBINI [29
] is designed to be a framework
to support testing of network inference algorithms using synthetic and other data sets. However, it has a limited number of inference methods incorporated, and cannot support the two level approach we have adopted. It also does not seem to be actively under development. ToPNet [30
] adopts the Petri Net formalism to represent interactions, which is more flexible than simple graphs, particularly for metabolic reactions. However, it does not support any data mining methods for network inference, and it is no longer supported. Cytoscape [31
] is a widely used visualisation and integration package that supports some network inference plug-ins (for example [32
]). All of these plug-ins perform a global network inference based on uni-variate correlation rather than the gene-by-gene approach of MINER that uses more involved multi-variate non-linear tree learning.
In the second category, SysNet [34
] combines visualization and exploratory data analysis, however its network inference is restricted to standard methods of correlation. Unlike MINER, SysNet infers a global network first then allows the user to drill down to inspect properties of individual nodes rather than building the network from individual relationships.