The current biological literature makes extensive use of gene mRNA expression data from experimental systems called gene chips or gene microarrays. These data are used to draw genome-level conclusions, for example to infer the response of an organism or cell culture to a treatment or perturbation. Microarrays are valuable as an experimental system because they provide a genome-wide picture (covering all ~30000 genes). Unfortunately, because of the cost of collecting microarray data, the number of samples per treatment is quite small (~2-10).
The data from microarrays are noisy. There are a number of reasons to expect variability in the measured expression of genes in mammalian organisms. These include biological causes and the noise associated with the steps involved in measuring gene expression. Depending on the question being asked, the researcher has to control for many of these sources of variation. This paper is concerned with understanding the variation in the expression data that remains after the known or reported sources of variation have been controlled for.
The biological variability could be due to genetic or non-genetic factors. [1] studies the cis-acting variation that explains expression differences of 15 genes in the human brain. [2] reviews the literature to suggest that allele-specific differences in the rates of transcription are common. Other studies demonstrating the influence of genotype on gene expression include [3] (in human thyroid tissue), [4] (in human lymphoblastoid cell lines), [6] (in human blood leukocytes), [7] (in human liver tissue) and [8] (in liver tissue from mice of different strains).
The non-genetic factors explaining the variation in gene expression include the gender of the organism: [9] (in human skeletal muscle), [10] (in human white blood cells), [11] (in human peripheral blood mononuclear cells), [12] (in human retina), [13] (in human blood) and [14] (in the liver and kidney of mice and rats). Age of the organism was shown to be a significant covariate in [12] (in human retina), [11] (in human peripheral blood mononuclear cells), [9] (in human skeletal muscle) and [13] (in human blood). There is significant variability of gene expression across different tissue or cell types [15]. There is even variation within a given tissue due to the presence of multiple cell types: [11] (in human blood), [9] (in human skeletal muscle) and [17] (in human placenta). Other important factors include the diet and fasting status of the organism [14] and the time of day at which the samples were taken [13]. Another important source of variability is the environmental condition that the organism was under before the tissue sample was taken. For example, mice could be sleep deprived [18], undergoing craniofacial development [19], given oral doses of synthetic triglycerides [20] or fed doses of chemicals that are known to be lung carcinogens [21], or the subjects could be medical students under psychological stress before a major exam [22].
The other classes of non-genetic gene expression variability that have been studied intensively (theoretically and experimentally) have been termed intrinsic, extrinsic and pathway-specific or global noise (see [23] for reviews on this). Intrinsic noise refers to variation arising from the stochastic nature of transcription and translation, which is a consequence of the small number of mRNA and protein molecules involved. Extrinsic noise could be due to changes in the cellular environment. These noises can be demonstrated experimentally by observing the output of two different reporters for the same gene in the same cell and in separate cells. Pathway-specific noise can be viewed as the noise that is transferred along all the genes that sequentially participate in a given cellular or biochemical process.
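The two-reporter idea can be illustrated with a small numerical sketch. The estimators below are the standard dual-reporter formulas from the gene expression noise literature and are not taken from this paper; the function names, the simulated reporter levels and all parameter values are assumptions chosen only for illustration. Variation shared by both reporters within a cell is attributed to extrinsic noise, and the remaining difference between the two reporters is attributed to intrinsic noise.

```python
import random
from statistics import fmean


def noise_decomposition(r1, r2):
    """Return (intrinsic, extrinsic, total) noise as squared coefficients of
    variation, given paired reporter levels r1[i], r2[i] measured in cell i."""
    norm = fmean(r1) * fmean(r2)
    # Intrinsic: how much the two reporters differ inside the same cell.
    intrinsic = fmean((a - b) ** 2 for a, b in zip(r1, r2)) / (2 * norm)
    # Extrinsic: how much the two reporters co-vary from cell to cell.
    extrinsic = (fmean(a * b for a, b in zip(r1, r2)) - norm) / norm
    # Total noise; algebraically equal to intrinsic + extrinsic.
    total = (fmean(a * a + b * b for a, b in zip(r1, r2)) - 2 * norm) / (2 * norm)
    return intrinsic, extrinsic, total


def simulate_cells(n_cells, seed=0):
    """Hypothetical cells: a shared cell-state factor (extrinsic) multiplied
    by an independent factor per reporter (intrinsic)."""
    rng = random.Random(seed)
    r1, r2 = [], []
    for _ in range(n_cells):
        shared = rng.lognormvariate(0.0, 0.3)  # assumed extrinsic spread
        r1.append(100.0 * shared * rng.lognormvariate(0.0, 0.1))
        r2.append(100.0 * shared * rng.lognormvariate(0.0, 0.1))
    return r1, r2


r1, r2 = simulate_cells(5000)
eta_int, eta_ext, eta_tot = noise_decomposition(r1, r2)
print(f"intrinsic={eta_int:.4f} extrinsic={eta_ext:.4f} total={eta_tot:.4f}")
```

In this simulated setting the shared factor is given the larger spread, so the extrinsic component dominates. Microarray data average over many cells, so this decomposition is not directly observable there; the sketch only makes the definitions concrete.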
Variability could arise at the various stages involved in obtaining the output from a microarray, and also after the data are obtained, when the intensity measurements (for the predefined surrogates for different genes, termed probe sets) are normalized and preprocessed to get estimates of gene expression. RNA is isolated from the cells obtained from the tissue sample that has been drawn from the organism. The RNA is then subjected to reverse transcription (RT) to obtain cDNA, which is in turn subjected to in vitro transcription (IVT) to obtain cRNA using polymerases. The cRNA is then hybridized to probes on the microarray platform [25]. The factors that influence the final intensity measurements for different probes include the amount of polymerase used for the IVT process, the amount of time allotted for this process by the experimenter and the binding specificity of the cRNA to the corresponding probe sequence on the microarray platform [26]. Tu et al [25] performed a small controlled analysis of the noise characteristics of gene expression data from microarrays. They provide empirical distributions of gene expression arising solely from the variation introduced by the above process of obtaining intensity values from the microarray, i.e., they analyzed the measurement variability. It does not seem trivial to maintain a consistent microarray experiment protocol: [14] demonstrated that the laboratory where the microarray experiment was performed is a major source of variation, and [27] found that microarray experiments performed in different years had different characteristics of gene expression. Different microarray data preprocessing and normalization algorithms generate expression data with different characteristics [28]. In addition, microarrays as tools for detecting changes in gene expression have been shown to be sensitive and specific within a well-defined range of their output. Low gene expression can be confused with optical noise, while changes at higher levels of gene expression are more difficult to detect because of saturation [29].
Most current journals require microarray data to be deposited in a database (such as the Gene Expression Omnibus, GEO [30]) if the data were used for analysis in a paper. The deposited data take the form of multiple samples for each of multiple conditions (typically one of these conditions corresponds to the normal situation, with no disease or dosed toxin). The data are organized according to the type of microarray chip used. There are different commercial manufacturers of microarrays, Affymetrix and Agilent being the most popular. Each of these manufacturers makes chips for specific organisms (mouse, rat, frog, human, zebrafish, etc.), and there are different varieties of chip for each organism.
One of the main applications of microarray data is the identification of differentially expressed genes across varying treatment conditions. The approaches used can be classified by whether or not they make parametric assumptions about the underlying distribution of gene expression. Kerr et al [31] used an ANOVA model to capture variation of gene expression arising from two-color microarrays. The non-normality of the residuals of this model was noted in that work. Newton et al [32] assume that all gene expressions are gamma distributed, a distribution that is right skewed and whose coefficient of variation decreases with increasing mean. They attempted to validate this using real data and found that the fits captured the primary features of variation but were poor in general. Rocke and Durbin [33] assume an additive-multiplicative error model for gene expression that is additive at low levels of expression and multiplicative at higher levels. Under this model one would expect gene expression to be log-normally distributed at high levels of expression. Then there is a class of approaches called empirical Bayes methods ([34] are examples) that make prior assumptions about the distribution of gene expression and then use the data themselves to obtain the posterior probability of a gene being differentially expressed. They are all based on the observation that the mean and the coefficient of variation of the distribution of gene expression have a definite pattern of variation (e.g., see [37]). The typical microarray setup uses a relatively small number of samples to generate data for tens of thousands of genes. The empirical Bayes approaches essentially attempt to improve the estimate of the variance of a given gene by exploiting the observed relationship between the mean and the coefficient of variation of expression across all the target genes in the system. [34] assume normal priors of gene expression, and [36] assumes a normal prior for the difference in means of the gene expression over the two conditions. Non-parametric methods have also been proposed for the analysis of differential expression (e.g., see [38]). A popular permutation-based method, built on a modified t statistic, is the so-called SAM statistic and its modifications [39]. Bayesian network inference represents another analysis that makes use of gene expression data and typically assumes that the data are normally distributed ([41] and [42]). An additional analysis in which gene expression has been used extensively is the development of a prediction model or classifier, for example a classifier between two different forms of leukemia [43], the prediction of prognosis of patients with breast cancer [44] and the prediction of the potential carcinogenicity of a chemical [45]. Such analyses do not make distributional assumptions about the gene expression data, and the efficacy of the models developed in the above examples was assessed using cross-validation.
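The behaviour of an additive-multiplicative error model of the kind attributed above to Rocke and Durbin can be illustrated with a short simulation. The sketch below assumes the commonly quoted two-component form y = alpha + mu * exp(eta) + eps, with independent normal eta and eps; the function name and all parameter values are hypothetical and are not taken from any of the cited studies. At zero true expression the spread is set by the additive term alone, while at high expression the coefficient of variation settles at the constant value implied by the multiplicative term, which is the log-normal regime.

```python
import math
import random
from statistics import fmean, pstdev

SD_ETA = 0.2   # assumed SD of the multiplicative (log-scale) error
SD_EPS = 20.0  # assumed SD of the additive (background) error


def simulate_intensity(mu, n, rng, alpha=0.0):
    """Draw n measured intensities for a gene with true expression mu."""
    return [
        alpha + mu * math.exp(rng.gauss(0.0, SD_ETA)) + rng.gauss(0.0, SD_EPS)
        for _ in range(n)
    ]


rng = random.Random(1)

# Low end: with mu = 0 only the additive error remains.
low = simulate_intensity(0.0, 20000, rng)
sd_low = pstdev(low)

# High end: the multiplicative error dominates and the CV becomes constant.
high = simulate_intensity(10000.0, 20000, rng)
cv_high = pstdev(high) / fmean(high)

# CV of a log-normal variable with log-scale SD equal to SD_ETA.
cv_lognormal = math.sqrt(math.exp(SD_ETA ** 2) - 1.0)

print(f"SD at mu=0: {sd_low:.2f} (additive SD is {SD_EPS})")
print(f"CV at mu=10000: {cv_high:.4f} (log-normal CV is {cv_lognormal:.4f})")
```

Between these two extremes the measured intensity is a mixture of the two error sources, which is one reason a single parametric family may fit poorly across the whole expression range.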
The above paragraph gives a snapshot of the analyses done using gene expression data; some of them make distributional assumptions and some do not. Since distributional assumptions are made frequently, it appears prudent to validate them. As mentioned above, Newton et al [32] and Kerr et al [31] attempted to verify their distributional assumptions and did not find them adequately supported by the data. Tu et al [25] did not consider the variation due to biological variability. The literature thus lacks an empirical validation, based on a sufficient amount of data, of the distribution of gene expression as measured by microarrays against multiple univariate theoretical distributions.
Since 2002 a significant amount of data has become available from sources like the Gene Expression Omnibus (GEO) [30] and ArrayExpress [46]. These data can now potentially be used to validate the distribution and noise assumptions made in statistical analysis and to develop improved inference methods for analyzing microarray data.
This paper focuses on identifying and validating empirical distribution fits of genome-wide gene expression as measured by microarrays. In addition to the normal distribution, we tested the empirical fit of a number of well-established probability distributions.
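As an illustration of what comparing candidate distributions for one gene can look like, the sketch below fits a normal and a log-normal distribution by maximum likelihood and ranks them by the Akaike information criterion (AIC). This is only one possible procedure and is an assumption made for illustration: the fitting and goodness-of-fit methods actually used are not described in this section, the function names are hypothetical, and the expression values are simulated rather than taken from the data sets analyzed here.

```python
import math
import random
from statistics import fmean, pstdev


def normal_loglik(x):
    """Log-likelihood of x under a normal distribution at its ML estimates."""
    sd = pstdev(x)  # the ML estimate uses the population (1/n) variance
    return -0.5 * len(x) * (math.log(2.0 * math.pi * sd * sd) + 1.0)


def lognormal_loglik(x):
    """Log-likelihood of positive x under a log-normal at its ML estimates."""
    logs = [math.log(v) for v in x]
    # Normal likelihood of log(x) plus the change-of-variable term.
    return normal_loglik(logs) - sum(logs)


def best_fit(x):
    """Return the lower-AIC candidate and the AIC of both candidates."""
    n_params = 2  # both candidates have a location and a scale parameter
    scores = {
        "normal": 2 * n_params - 2 * normal_loglik(x),
        "lognormal": 2 * n_params - 2 * lognormal_loglik(x),
    }
    return min(scores, key=scores.get), scores


rng = random.Random(42)
# Hypothetical expression values for one gene across 500 samples.
skewed_gene = [rng.lognormvariate(7.0, 0.5) for _ in range(500)]
name, scores = best_fit(skewed_gene)
print(name, {k: round(v, 1) for k, v in scores.items()})
```

AIC ranks the candidates relative to each other but does not show that the preferred one fits well in absolute terms, so a goodness-of-fit test or a quantile plot would normally accompany such a comparison.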
We analyzed four microarray data sets from the GEO database [30]. They were all generated using the Affymetrix Mouse 430.2 platform. They comprise data from three tissues (brain, liver and craniofacial tissue) and expression data for the so-called "housekeeping" genes [47] across over 6000 samples collected over a wide range of conditions. These data sets were chosen in part because they each had a relatively large number of samples generated by the same laboratory. Also, in light of the points mentioned above, the samples in two of the data sets (craniofacial and liver) were partially matched for gender, age, mouse strain, tissue source and environmental conditions prior to sample collection. The data set from the brain involved three different strains of mice. The microarray data were all preprocessed and normalized in the same manner. For the two data sets that each used the same strain of mice for all samples, the primary sources of variability in gene expression are therefore expected to be a convolution of intrinsic, extrinsic, pathway-specific or global noise and the noise associated with microarray sample preparation and hybridization to the DNA chip. The brain data set would be expected to show additional genetic variation due to the use of three different strains in generating the samples.