The Grubb’s test was used to identify and delete outlier gene expression data derived from the Gene Expression Omnibus (GEO). The Grubb’s test was performed on a gene-by-gene basis for each experimental condition, to detect and delete the abnormal numerical data that are included as flags of experimental error. To quantitatively identify the outlier data, a Z value was calculated.
where SD is the standard deviation. If Z was large, then the datum was considered to be an outlier. The SD was calculated from all data including outliers. A two-tailed probability for the Student t-distribution was obtained as follows:
Here, N is the number of values in the sample. The probability that a datum is an outlier can be obtained from the Student t-distribution for T and N − 2 degrees of freedom. The approximate P value for the outlier test, which was calculated by multiplying the obtained probability by N, is represented as the probability of observing an outlier, assuming the data were sampled from a Gaussian distribution. If P was significant, then the outlier was excluded and the Grubb’s test was performed on the next suspected outlier. The procedure was repeated until a P-value <0.05 was achieved.
A multilevel hierarchical regulation system between proteins and DNA is considered to be a good model for inferring protein-DNA interactions. To detect the hierarchical regulation in yeast, I utilized the TRANSFAC Database (http://biobase-international.com/index.php?id=transfac
). TRNSFAC is one of the useful databases that include information about transcriptional regulation.
Among all registered yeast transcription factors in the TRANSFAC database, 108 genes were detected as encoding transcription factors, which regulate 252 other genes. Some of these regulated genes had empirical confirmation for their transcriptional regulation, but other regulated genes were estimated by a computational analysis of their 5′UTR sequences. To compose the hierarchical regulation among the genes, I compiled 864 binomial relationships between the 108 transcription factor genes and the 252 regulated genes. The regulatory network with hierarchy was constructed from these binomial relationships. Among the constructed hierarchical structures, a 3 layered structure composed of 15 genes was the most complicated hierarchy. In this 3 layered hierarchical structure, many parts have been experimentally confirmed, but some parts still remain uncertain. The details of the components in this hierarchical structure are shown in .
Known binding sites and regulated genes of selected genes.
In the TRANSFAC database, the target genes of each transcription factor have been estimated from the binding site motif sequences. In addition, the database provides quality scores ranging from 1 to 6, which reflect the experimental reliability of a particular protein- DNA interaction. lists the regulated genes and their respective quality scores for each transcription factor. Ten genes are known to be targets of Mig1p, but three of them have not been empirically confirmed as targets. Furthermore, for 5 genes among the ten genes, their binding locations and binding site sequences have not been identified. On the other hand, the five genes known to be targeted by Gal4p have experimentally confirmed binding sites. In the last stage of this serial transcriptional regulation, the transcription factor Gal80p targets two genes. Among the 14 regulated genes in , the quality scores of the two genes were 1, which represents experimentally confirmed transcriptional regulation. The quality scores of the putatively regulated genes were 2, meaning that the transcription factor protein has been confirmed to bind to the motif sequences of the regulated genes, but the transcriptional regulation has not been confirmed. Thus, the inference of the transcriptional regulation of these genes is informative for the functional confirmation of transcription factor binding.
I utilized an abundance of gene expression data for the 15 genes to reconstruct and clarify the 3-layered hierarchical structure composed of them. Since the genes within each layer are controlled by one transcription factor, this 3-layered structure is considered to represent serial transcriptional regulation passing through 3 stages, connected by transcription factors.
All gene expression profiles were obtained from S. cerevisiae
and were downloaded from the GEO Database (http://www.ncbi.nlm.nih.gov/geo/
). The expression profiles from 11,923 experiments were obtained as series matrix files, which describe the expression levels as a log2-ratio of the raw expression signals. The data from these matrix files were transformed into Z-scores and compared. The Grubb’s test was performed on these 11,923 profiles, resulting in the analysis of 4,013 S. cerevisiae
A skeleton network structure of serial transcriptional regulations was constructed by using SEM. The framework of the network structure defined transcription factors as latent variables, and their coding genes and their target genes as observed variables. In this study, 15 genes were described as observed variables, but the number of latent variables was unknown. To find this number, factor analyses were used in each of the three stages of the serial transcriptional regulation.
Factor analysis is a statistical method for describing the variability among observed variables in terms of a potentially lower number of latent variables.28
The initial assumption is that any observed variables may be related with any latent variables. Let us assume that there are p
latent variables (proteins) and q
observed variables (genes) x1
, … xq
, with means u1
, … uq
. Note that the number p
of latent variables is always smaller than the number q
of observed variables. Each observed variable is expressed as linear combinations of p
latent variables, as follows.
is the vector of the expression levels of the gene i
is the partial regression weight of the latent variable Fj
, and i
is an independently distributed error term with zero mean and finite variance. In matrix form, equation (3)
is expressed as
If there are n samples in each of the observed variables, then X and U are the (q ×n) matrices composed of the observed data and their means, respectively. The partial regression coefficients of each latent variable are indicated as elements of Λ, the (q ×p) latent interaction matrix. In the matrix Λ, each column corresponds to a factor and each row corresponds to an observed variable, and thus each element of Λ indicates the strength of the regulation from each protein to each gene. The matrix F is the (p ×n) latent variable matrix, and Q is the (q ×n) error matrix.
In the factor analysis model, the error terms
are independent and multivariate normally distributed with a mean of zero, i~N
). If we let Var
)= Ψ i2
, then the covariance matrix of
is expressed as
The following assumptions are imposed on F
- F and are independent.
- E(F) =0, Var(F) =.
- E() =0, Var() =Ψ2.
The variance-covariance matrix between the observed variables ∑ is given by
The covariance matrix of the observation variables is structurized by the parameters. From this structurized matrix, the values of the partial regression weight matrix Λ and the variances of the “errors”
To clarify the possible number of latent variables in the network, Exploratory Factor Analysis (EFA) was performed. EFA is utilized to reveal the latent structure by assuming that the observational data are a synthetic amount of a lower number of latent variables. The number of latent variables was suggested by a principal factor method with varimax rotation, which is a general method for rotating factors to fit a hypothesized structure of latent variables. To extract the number of latent variables as factors of observed variables, an eigen value > 1 and a scree plot were applied, as usual.28
The number of latent variables estimated by EFA may not correctly result in the network with latent variables. To determine the number of latent variables in each stage, I performed a Confirmatory Factor Analysis (CFA) for all numbers smaller than or equal to the number of estimated p
latent variables. CFA seeks to determine whether the number of factors and associations between the factors and observed variables follow the assumption. Suggested latent variables were arranged within a genetic network as indicators, according to the initial assumption.29
As the initial assumption, I utilized the model that includes all possible associations between the factors and the observed variables. The details of the model assumption are described in the following section.
Regulatory network analysis by SEM consists of two parts: parameter fitting and structure fitting. Before parameter fitting, a network structure must be assumed. The framework of the network structure defined transcription factors as latent variables, and their coding genes and their target genes as observed variables. Based on their relationships, the transcription factors were considered as the predictor variables for the target genes, which in turn were represented as the criterion variables. Causality between transcription factors and their coding genes was treated inversely, with coding genes as the predictor variables and transcription factors as the criterion variables. Therefore, this SEM is a particular implementation of the Multiple Indicator Multiple Cause (MIMIC) model, with the latent variables and observed variables arranged alternately throughout the three transcription factor-regulated stages of the network.
Each gene was arranged as a variable in the network, according to the information registered in the TRANSFAC database. The three targeted transcription factors, Mig1p, Gal4p, and Gal80p, are known to be regulators of gene expression in the hierarchical transcription. I defined these proteins as latent variables. However, the number of latent variables in each stage and the details of the regulatory relationships between latent variables and observed variables were not initially known. Thus, I applied a modified four-step procedure, based on work by Mulaik and Millsap, which is known as a method for constructing models when there is no information about latent variables ().29
Figure 1 Modified four-step procedure. Step 1, construction of unrestricted models for each stage of the serial transcription; Step 2, construction of measurement models for each stage; Step 3, construction of structural equation models for each stage; Step 4, (more ...)
Step 1: Construction of the unrestricted model. By the application of EFA to the observed variables, the number of latent variables was determined. EFA is a technique for discovering the base structure of the variable. It is assumed that a relationship exists between the regulators and the regulated variables. The factor loading value is used to find the factor structure of the data by intuition.28–30
Step 2: Construction of the measurement model. The regulatory relationships between the observed and latent variables were solved by CFA in this step. Varimax rotation, which is the most common rotation strategy, was used to seek the factor loadings across variables for each factor. All paths from latent variables to observed variables were tested for their effectiveness in factor loading. Paths with low loading scores were deleted from the models; those with high loading scores were retained. This was done for each stage.
Step 3: Construction of the structural equation model. The relationships between latent variables were established in this step. This step also determined the causal relationships between observed variables and latent variables. The details of this step are described in the following “Structural equation modeling” section.
Step 4: Stepwise modeling connecting different stages of the network. The models for each stage were connected to create one model of the entire network. I tested all network structure possibilities for one observed variable and two latent variables, to identify the most likely ones. The stages were then connected sequentially by SEM. Finally, significant relationships between error terms, as estimated by the Modification Index (MI) defined by AMOS,31 were determined. These relationships among the error terms were used for the calculations but not incorporated into the network, and thus they have been excluded from the figures.
Structural Equation Modeling (SEM)
SEM is a comprehensive statistical model that includes two types of variables: observed and latent. These variables constitute the structural models that consider relationships between latent variables and the measurement models that consider relationships between the observed variables and the latent variables. These relationships can be presented both algebraically, as a system of equations, and graphically, as path diagrams.
In this study, the target genes and the coding genes of the transcription factors in the transcriptional regulation were defined as observed variables, since their expression data were obtained from the GEO. Meanwhile, the transcription factors were defined as latent variables, owing to the lack of measured data in the expression profiles. All variables were classified as one of two types: exogenous variables and endogenous variables. Exogenous variables are those that are not regulated by other variables in the system. Endogenous variables, on the other hand, are. In my model, the mig1 gene is defined as an exogenous variable, while all other genes are defined as endogenous variables. All latent variables are endogenous variables, since they are regulated by coding genes. There are three possible relationships between the variables: relationships between latent variables, causal relationships between observed variables and latent variables, and regulatory relationships between latent variables and observed variables. The model is defined as follows:
is a vector of p
latent variables (number of transcription factors); y
is a vector of q
observed variables (measured gene expression patterns); B is a p
matrix representing relationships between latent variables; Γ is a p
matrix representing causal relationships between observed and latent variables; and Λ is a q
matrix representing regulatory relationships between latent and observed variables. Errors that affect the observed and latent endogenous variables are denoted by ς and
From the above equations, I have
to represent the model. The structural equation modeling is based on a covariance analysis defined as S
), where S
is the covariance matrix calculated from the observed data and ∑( θ
) is a matrix-valued function of the parameter θ
denote the covariance matrices of the error terms ς
, and G
denote a q
×(p + q
) combined matrix of the q
zero matrix and the q
identity matrix. The covariance matrix of model ∑( θ
) is given by
Each element of the covariance matrix model ∑( θ) is expressed as a function of the parameters that appear in the model. The unknown parameters were estimated, in order to minimize the difference between the model covariance matrix ∑( θ) and the sample covariance S.
The SEM software package SPSS AMOS 17.0 (IBM, USA) was used to fit the model to the data. The quality of the fit was estimated by four different model fitting scores: GFI, AGFI, CFI and RMSEA. These scores were considered to be useful to clarify the degree of model fitting in this study, since the model can be evaluated by a general threshold, rather than a huge experimental number.
To make the model covariance matrix ∑( θ) closer to the sample covariance matrix S, the parameters within the model were estimated by the maximum likelihood (ML) method with the fitting function. Since the ML estimators are known to be consistent and asymptotically unbiased, ML is commonly used as a fitting function to estimate SEM parameters:
Here, ∑( θ
) is the estimated covariance matrix; S
is the sample covariance matrix; |∑| is the determinant of matrix ∑; tr
(∑) is the trace of matrix ∑; and q
is the number of observed variables. The fitting function FML
, ∑( θ
)) is a discrepancy function that compares the difference between the estimated covariance matrix ∑( θ
) and the sample covariance matrix S
. The principal objective of SEM is to minimize FML
, ∑( θ
)), which is the objective function that is used to obtain the maximum likelihood. Generally, FML
, ∑( θ
)) is a nonlinear function. Therefore, iterative optimization is required to minimize FML
, ∑( θ
)) and to find the solutions by Fisher’s scoring method in the AMOS software, since the Fisher method is highly effective, and may even converge in a single iteration.32
To avoid a local optimization, I executed iterative optimization procedures with different initial values, and chose the global optimum solutions of the parameters. In the optimization process, the range of parameters was not chosen, but one parameter of the relationships from the latent variable to the observed variable was fixed to one, as a restriction of the model. Furthermore, I used 1E-5 as the convergence criteria. As the limit on the number of iterations, I chose 100 iterations to be performed by AMOS. When this limit was reached and the convergence criteria had not been met, the model was rejected.