In recent years, high-throughput biotechnologies have made large-scale gene expression surveys a reality. Gene expression data provide an opportunity to directly examine the activities of thousands of genes simultaneously. However, computational methods that can handle the complexity of these biological data (noise, large numbers of variables, high dimensionality, etc.) are often unavailable [1]. Powerful computational methods and data mining tools are needed to draw biologically meaningful inferences from gene expression data.
Cluster analysis has been used to separate genes into groups based on their expression profiles [2], so that genes with similar expression profiles are more likely to fall in the same group. Although cluster analysis gives insight into groups of genes that may share similar functions, inferring the relationships among these groups is beyond what cluster analysis can do.
A variety of continuous or discrete, static or dynamic, quantitative or qualitative models have been proposed for inference of biological networks. These include biochemically driven methods [3], linear models [4, 5], Boolean networks [6], fuzzy logic [7, 8], Bayesian networks [9], and recurrent neural networks [10-12]. Biochemically inspired models are developed on the basis of the reaction kinetics between different components of a network. However, most biochemically relevant reactions involving proteins do not follow linear reaction kinetics, and the full network of regulatory reactions is very complex and hard to unravel in a single step. Linear models attempt to solve for a weight matrix that expresses the expression level of each gene as a linear combination of the expression levels of other genes; this problem is often underdetermined because gene expression data usually contain far fewer samples than genes. In a Boolean network, the interactions between genes are modeled as Boolean functions. Boolean networks assume that genes are either "on" or "off" and attempt to solve for the state transitions of the system. The validity of the assumption that genes are in only one of these two states has been questioned by a number of researchers, particularly in the biological community. In [7], an approach is proposed based on fuzzy rules of a known activator/repressor model of gene interaction. This algorithm transforms expression values into qualitative descriptors that can be evaluated by a set of heuristic rules, and it searches for regulatory triplets consisting of an activator, a repressor, and a target gene. This approach, though logical, is a brute-force technique for finding gene relationships, and its significant computation time restricts its practical usefulness. In [8], we propose the use of clustering as an interface to a fuzzy logic-based method to improve computational efficiency. In a Bayesian network model, each gene is considered a random variable, and the edges between pairs of genes represent the conditional dependencies entailed in the network structure. Bayesian statistics are applied to find a network structure and the corresponding model parameters that maximize the posterior probability of the structure given the data. Unfortunately, this learning task is NP-hard, and it is also underdetermined. The recurrent neural network (RNN) model has received considerable attention because it can capture the nonlinear and dynamic aspects of gene regulatory interactions. Several algorithms have been applied to RNN training in network inference tasks, such as fuzzy logic [11] and genetic algorithms [12]. In [10, 13], we applied the particle swarm optimization (PSO) method to train the RNN for network inference, yielding promising results.
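The underdetermination that affects linear models can be made concrete with a small numerical sketch. The example below is illustrative only: the matrix sizes, the random toy data, and all variable names are assumptions, not taken from any of the cited studies. It fits one gene's expression as a linear combination of all genes and shows that, with fewer samples than genes, two different weight vectors reproduce the data equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 50, 10   # hypothetical sizes: far fewer samples than genes

# Toy expression matrix: one row per sample (e.g., time point), one column per gene.
X = rng.normal(size=(n_samples, n_genes))
# Toy target: expression of a single gene at the following time point.
y = rng.normal(size=n_samples)

# Linear model y ~ X @ w, with one weight per candidate regulator gene.
# For an underdetermined system, lstsq returns the minimum-norm solution.
w_min_norm, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# The rank cannot exceed the number of samples, so n_genes - rank directions
# of w are left unconstrained by the data.
null_dim = n_genes - rank

# Construct a second weight vector by adding a null-space direction of X.
_, _, vt = np.linalg.svd(X)   # full_matrices=True, so vt is n_genes x n_genes
null_vector = vt[-1]          # X @ null_vector is numerically zero
w_alt = w_min_norm + 5.0 * null_vector

print("rank:", rank, "unconstrained directions:", null_dim)
print("both fit the data:", np.allclose(X @ w_min_norm, y), np.allclose(X @ w_alt, y))
```

Because both weight vectors fit the samples, the data alone cannot tell which set of regulatory weights is correct; additional constraints or a reduction in dimensionality are needed.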
As diverse sources of biological data become available, it is both necessary and helpful to infer gene regulatory networks (GRNs) not from a single data source alone, but from the fusion of multiple complementary data sources. A few previous studies combined time course gene expression data with other data sources, such as genomic location data [14] and sequence motifs [15]. Prior knowledge of GRNs helps in understanding gene interactions in important biological processes such as differentiation, the cell cycle, and development. Owing to the specific properties of gene expression data, inferring GRNs involves several challenges: (1) living cells contain thousands of genes (high dimensionality); (2) each gene interacts with one or more other genes, directly or indirectly, through complex dynamic and nonlinear relationships; (3) current technologies generate data with a substantial amount of noise; and (4) because of the cost of large-scale gene expression profiling experiments, the sample size is extremely small compared with the number of genes. In this study, we address these challenges by (1) preprocessing gene expression data (e.g., normalization and missing value imputation) to reduce noise; (2) clustering genes using gene expression data and gene functional category information to find optimal, biologically significant modules and to reduce the problem dimensionality; and (3) modeling GRNs between the modules with the particle swarm optimization – recurrent neural network (PSO-RNN) method to capture their nonlinear and dynamic relationships.
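The preprocessing step named above (normalization and missing value imputation) can be sketched as follows. The text does not specify which normalization or imputation scheme is used, so the choices here are assumptions made purely for illustration: missing values are filled with the per-gene mean, and each gene's profile is then scaled to zero mean and unit variance. The function names `impute_row_mean` and `zscore_rows` and the toy matrix are hypothetical.

```python
import numpy as np

def impute_row_mean(expr):
    """Replace NaNs in each gene's profile (row) with that gene's mean over observed samples."""
    expr = np.array(expr, dtype=float)   # copy, so the caller's array is left untouched
    row_means = np.nanmean(expr, axis=1, keepdims=True)
    missing = np.isnan(expr)
    expr[missing] = np.broadcast_to(row_means, expr.shape)[missing]
    return expr

def zscore_rows(expr):
    """Scale each gene's profile to zero mean and unit variance."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                  # constant profiles map to zeros instead of dividing by 0
    return (expr - mean) / std

# Toy matrix: rows are genes, columns are samples; NaN marks a missing measurement.
raw = np.array([[1.0, 2.0, np.nan, 4.0],
                [0.5, np.nan, 0.7, 0.9],
                [3.0, 3.0, 3.0, 3.0]])
clean = zscore_rows(impute_row_mean(raw))
print(clean)
```

Mean imputation is the simplest possible choice; neighbor-based or model-based imputation may be preferable when many values are missing.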
Our previous studies [10, 13] demonstrate that incorporating known gene functional category information improves the inferential power of our framework. Moreover, instead of using a fully connected RNN model, we propose a network pruning method that selects the statistically significant weights for the final GRN structure using PSO. The hybrid PSO-RNN algorithm is applied to infer networks of interactions from two real-world gene expression data sets. The inferred GRNs are confirmed by previous studies.
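To give a feel for how PSO can train an RNN on time course data, the following is a minimal sketch under stated assumptions, not the implementation used in this study. It assumes a commonly used discrete-time RNN form in which each node's next value blends a sigmoid of the weighted inputs with its current value, a standard global-best PSO with an inertia term, and synthetic data from a three-node toy network. The update rule, the coefficient values, and the names `simulate`, `mse`, and `pso` are all illustrative choices, and the pruning step is not shown.

```python
import numpy as np

def simulate(params, x0, steps, n, dt_over_tau=0.5):
    """Roll out a simple discrete-time RNN; params packs weights W (n*n) then biases b (n)."""
    W = params[:n * n].reshape(n, n)
    b = params[n * n:]
    traj = [x0]
    for _ in range(steps - 1):
        x = traj[-1]
        z = np.clip(W @ x + b, -50.0, 50.0)   # clip to avoid overflow in exp
        activation = 1.0 / (1.0 + np.exp(-z))
        traj.append(dt_over_tau * activation + (1.0 - dt_over_tau) * x)
    return np.array(traj)

def mse(params, target, n):
    """Mean squared error between the simulated and the observed trajectory."""
    pred = simulate(params, target[0], len(target), n)
    return float(np.mean((pred - target) ** 2))

def pso(loss, dim, n_particles=30, iters=200, inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO; returns the best position and the best loss after each iteration."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                       # personal best positions
    best_val = np.array([loss(p) for p in pos])
    g = int(np.argmin(best_val))
    gbest_pos, gbest_val = best_pos[g].copy(), float(best_val[g])
    history = [gbest_val]
    for _ in range(iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = pos + vel
        vals = np.array([loss(p) for p in pos])
        improved = vals < best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = vals[improved]
        g = int(np.argmin(best_val))
        if best_val[g] < gbest_val:
            gbest_pos, gbest_val = best_pos[g].copy(), float(best_val[g])
        history.append(gbest_val)
    return gbest_pos, history

# Synthetic target trajectory from a hypothetical 3-node network.
n = 3
true_params = np.random.default_rng(1).uniform(-2.0, 2.0, size=n * n + n)
target = simulate(true_params, np.array([0.2, 0.5, 0.8]), steps=15, n=n)

best_params, history = pso(lambda p: mse(p, target, n), dim=n * n + n)
print("initial best MSE:", history[0], "final best MSE:", history[-1])
```

A low fitting error on such a short series does not imply that the recovered weights match the generating ones, which is one motivation for reducing dimensionality through modules and for pruning weights that are not statistically significant.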