Network component analysis (NCA)
Network Component Analysis (NCA) is a computational method for inferring the latent factors and connection relationships of a network, given initial topology (connection) information and gene expression measurements. In Fig. , we illustrate the NCA approach with an example from muscle regeneration studies [8]. The mathematical model of NCA can be formulated as
$$E = AT, \quad \text{s.t. } A \in Z_0, \qquad (1)$$
where $E$ ($N \times M$) is the observation, $A$ ($N \times L$) the connection matrix, $T$ ($L \times M$) the latent factors, and $Z_0$ the initial topology of the network. $L$ is the number of latent (hidden) factors, $M$ the number of experiment conditions, and $N$ the number of genes. As illustrated in Fig. , the latent factors are the transcription factors (TFs) such as YY1 and MyoD; the network topology is formed by the connections of the TFs to their target genes. The main objective of the NCA approach is to estimate the transcription factors' activities (TFAs) and their target genes. The NCA optimisation criterion can be simply denoted as [6]:
$$\min_{A,\,T} \; \lVert E - AT \rVert_F^2 \quad \text{s.t. } A \in Z_0.$$
The NCA algorithm was originally developed for gene regulatory network reconstruction. Model (1) can be interpreted as follows: the expression pattern of $N$ genes under $M$ different conditions can be seen as the combined effect of $L$ transcription factors (TFs). Note that it is well accepted that a linear model only holds after the log-ratio transform [6]:
$$\log E^{r} = A \log T^{r},$$
where $E^{r}_{ij} = E_{ij}(t)/E_{ij}(0)$ ($i = 1,\dots,N$; $j = 1,\dots,M$) and $T^{r}_{kl} = T_{kl}(t)/T_{kl}(0)$ ($k = 1,\dots,L$; $l = 1,\dots,M$) are ratios of gene expression values and transcription factor activities (TFAs), respectively. In the original NCA scheme, the topology information $Z_0$ is provided by ChIP-on-chip data [9]. With the ChIP-on-chip data available in yeast, NCA has been successfully applied to yeast stress response and cell cycle experiments. Among the estimated TFAs with an oscillation pattern, 75% correspond to known cell-cycle regulators [7]. However, this NCA scheme is not readily applicable to many other biological studies because the required topology information is lacking. In the next section, we will use motif information as a practical means to obtain the initial topology information for NCA.
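To make the model and its optimisation criterion concrete, the following is a minimal sketch of one common way to minimise the constrained least-squares objective, namely alternating least squares. It is not necessarily the solver used in the original NCA work. The function names `log_ratio` and `nca_als`, the boolean encoding of the topology, and the choice of a reference column for the ratio are all assumptions made for illustration.

```python
import numpy as np


def log_ratio(X, ref_col=0):
    """Log-ratio transform: log of each column divided by a reference column.

    Assumption: the reference condition (e.g. time 0) is stored in `ref_col`.
    """
    X = np.asarray(X, dtype=float)
    return np.log(X / X[:, [ref_col]])


def nca_als(E, Z0, n_iter=200, seed=0):
    """Alternating least squares for min ||E - A T||_F^2 with A restricted to Z0.

    E  : (N, M) log-ratio expression matrix (genes x conditions)
    Z0 : (N, L) boolean mask, True where a TF-gene connection is allowed
    Returns A (N, L) and T (L, M). Illustrative sketch only.
    """
    E = np.asarray(E, dtype=float)
    Z0 = np.asarray(Z0, dtype=bool)
    rng = np.random.default_rng(seed)
    N, L = Z0.shape
    # Random start that already respects the zero pattern of Z0.
    A = np.where(Z0, rng.standard_normal((N, L)), 0.0)
    for _ in range(n_iter):
        # Step 1: with A fixed, T is an ordinary least-squares solution.
        T = np.linalg.lstsq(A, E, rcond=None)[0]
        # Step 2: with T fixed, refit each gene using only its allowed TFs.
        for i in range(N):
            idx = np.flatnonzero(Z0[i])
            if idx.size:
                A[i, idx] = np.linalg.lstsq(T[idx].T, E[i], rcond=None)[0]
    # Final T update so that the returned pair (A, T) is consistent.
    T = np.linalg.lstsq(A, E, rcond=None)[0]
    return A, T
```

Each step solves its subproblem exactly, so the residual cannot increase from one iteration to the next. The estimates remain defined only up to a scaling of each TF's column of A and the matching row of T.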
Motif analysis for initial topology information
A transcription factor (TF) is a protein that regulates the transcription of its target gene by binding to a specific regulatory motif in the DNA of the promoter region(s). Thus, we can use regulatory motif information to establish a putative topological relationship between a TF and a downstream target gene. Below we propose a motif analysis procedure to obtain the initial topology information for network reconstruction.
First, the upstream regions of the genes can be extracted from the database PromoSer [10]. Second, Match™ [11] (or its improved version, P-Match [12]) can be used to search for transcription factor binding sites (TFBSs) in each upstream region; this approach generates both a "core similarity" score and a "matrix similarity" score for each matched motif. Third, Match™ searches for TFBSs using position weight matrices (PWMs), which can be extracted from the TRANSFAC 11.1 Professional Database [13]. Fourth, a motif score can be calculated for each TF-gene pair: the score is the maximum, over the matches found for that pair, of the average of the core similarity and matrix similarity scores. These motif scores provide the initial topology information for the subsequent motif-directed NCA (mNCA) analysis, as detailed in the next section.
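The scoring rule in the fourth step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the input is taken to be a list of `(tf, gene, core_similarity, matrix_similarity)` tuples already parsed from the motif search output, the function names `motif_scores` and `score_matrix` are hypothetical, and pairs with no match are assigned a score of 0. The gene labels in the usage note are made up.

```python
import numpy as np


def motif_scores(matches):
    """Best motif score per TF-gene pair.

    matches: iterable of (tf, gene, core_similarity, matrix_similarity),
             one tuple per matched site (assumed input format).
    The score of a match is the average of its two similarity values;
    the score of a TF-gene pair is the maximum over its matches.
    """
    scores = {}
    for tf, gene, core, matrix in matches:
        avg = (core + matrix) / 2.0
        key = (tf, gene)
        scores[key] = max(scores.get(key, float("-inf")), avg)
    return scores


def score_matrix(scores, genes, tfs):
    """Arrange pair scores as a genes x TFs array (0 where no motif matched)."""
    S = np.zeros((len(genes), len(tfs)))
    for (tf, gene), s in scores.items():
        S[genes.index(gene), tfs.index(tf)] = s
    return S
```

For example, `score_matrix(motif_scores(matches), ["g1", "g2"], ["MyoD", "YY1"])` yields a 2 × 2 array that can later be thresholded to give an initial topology.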
Note that each motif is a relatively short sequence pattern, so the topology derived from motif information is merely a rough estimate and will usually include many false positives and false negatives. While the topology information is often unreliable for any specific TF-gene pair, we can still infer some key transcription factor activities from gene expression and DNA sequence information using the stability analysis procedure developed in the next section.
Stability analysis for motif-directed NCA
Stability analysis was originally proposed to perform model selection for unsupervised learning, where the number of clusters can be correctly estimated [14]. Previously, we developed a stability analysis procedure to estimate the dimension for linear decomposition problems [15]. The basic idea of stability analysis is that, if the same small perturbation is introduced at each candidate model order, the most consistent results occur only when the model correctly fits the underlying structure of the data.
Here we develop a stability analysis procedure to assess the estimation results of mNCA. Since true functional data on TFAs are usually unavailable, we must establish whether an estimated TFA is a reliable estimate or if this prediction has arisen by error or by chance. When the topology information, either from motif analysis or ChIP-on-chip data, contains many false positives/negatives, we must also determine which TFAs are the reliable estimates of underlying transcription factor activities, or whether these are simply random outcomes.
If we intentionally perturb the network topology, each of the estimated TFAs will change. A falsely or poorly estimated TFA tends to be altered easily by small perturbations and will appear unstable. In contrast, a good TFA estimate, reflecting the consistency between microarray expression data and topology knowledge, will tend to keep its activity pattern throughout multiple perturbations. Therefore, random perturbations should be performed multiple times to test the stability of each predicted TFA.
We propose two stability analysis strategies for our motif-directed NCA scheme. Both strategies estimate whether the predicted TFAs are stable or not when we intentionally alter the motif connection information. The perturbation methods are described as follows:
1. A TF-gene connection is deleted if its motif score is below a predetermined cut-off threshold. By setting different cut-off thresholds, we can change the number of connections and so perturb the network topology. The higher the motif score cut-off, the fewer connections are predicted.
2. Regardless of the motif score, a small percentage (e.g., 10%) of each transcription factor's TF-gene connections are randomly altered, either by deleting existing connections or by inserting new ones.
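The two perturbation strategies above can be sketched as follows. The function names are hypothetical, and the text leaves some details of the second strategy open. This sketch assumes that the number of alterations per TF is the given fraction of that TF's existing connections, and that each alteration is a deletion or an insertion with equal probability.

```python
import numpy as np


def threshold_topology(scores, cutoff):
    """Strategy 1: keep a TF-gene connection only if its motif score
    reaches the cut-off. `scores` is a genes x TFs array."""
    return np.asarray(scores) >= cutoff


def randomly_perturb(Z, fraction=0.10, seed=None):
    """Strategy 2: randomly delete or insert connections for each TF.

    Z        : (genes x TFs) boolean topology
    fraction : share of each TF's existing connections to alter (assumed)
    Returns a perturbed copy; Z itself is left unchanged.
    """
    rng = np.random.default_rng(seed)
    Z = np.asarray(Z, dtype=bool)
    Zp = Z.copy()
    for j in range(Z.shape[1]):
        ones = np.flatnonzero(Z[:, j])    # existing connections of TF j
        zeros = np.flatnonzero(~Z[:, j])  # candidate new connections
        n_change = int(round(fraction * ones.size))
        # Split the alterations at random between deletions and insertions.
        n_del = min(int(rng.binomial(n_change, 0.5)), ones.size)
        n_ins = min(n_change - n_del, zeros.size)
        if n_del:
            Zp[rng.choice(ones, size=n_del, replace=False), j] = False
        if n_ins:
            Zp[rng.choice(zeros, size=n_ins, replace=False), j] = True
    return Zp
```

Calling `randomly_perturb` K times with different seeds, and re-estimating the TFAs on each perturbed topology, gives the K estimates used in the stability measurement.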
For $K$ independent connection perturbations and repeated runs, we will obtain $K$ different estimates of the same TFA. The pair-wise absolute correlation is calculated between different runs, and the stability measurement is defined as follows:
$$S_{jk} = \left| \mathrm{CorrCoef}\left( \hat{T}^{(j)}, \hat{T}^{(k)} \right) \right|, \quad j, k = 1,\dots,K, \; j \neq k,$$
where $j$ and $k$ correspond to different perturbations, $\hat{T}^{(j)}$ and $\hat{T}^{(k)}$ are the corresponding estimates of the TFA, and CorrCoef() is the Pearson correlation coefficient function. Once the stability measurements of a specific TFA are obtained, we can use several statistics, including mean and variance estimates, to describe the robustness of a predicted TFA with respect to perturbation. In this paper, we use box plots to visualize the stability measurement, simultaneously depicting its minimum, 25th percentile, median, 75th percentile, and maximum.
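As a minimal sketch of the stability measurement described above, the pair-wise absolute Pearson correlations and the five statistics shown in a box plot can be computed as follows. The function names and the layout of the input (one row per perturbation run, one column per experimental condition) are assumptions for illustration.

```python
from itertools import combinations

import numpy as np


def stability_measurements(tfa_estimates):
    """Pair-wise absolute Pearson correlations between K estimates of one TFA.

    tfa_estimates: (K, M) array, row j being the TFA profile estimated
                   under perturbation j (assumed layout).
    Returns an array with K*(K-1)/2 values in [0, 1].
    """
    tfa = np.asarray(tfa_estimates, dtype=float)
    return np.array([
        abs(np.corrcoef(tfa[j], tfa[k])[0, 1])
        for j, k in combinations(range(tfa.shape[0]), 2)
    ])


def five_number_summary(s):
    """Minimum, 25th percentile, median, 75th percentile and maximum."""
    return np.percentile(s, [0, 25, 50, 75, 100])
```

Taking the absolute value means that two estimates differing only in sign or scale still count as fully consistent, which seems appropriate here because a decomposition of this kind determines each activity profile only up to a scaling factor.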