With the advent of microarrays, a vast body of statistical literature has emerged in which the problem of differential expression testing has been tackled with the most sophisticated statistical methodologies available. From a purely statistical point of view, the analysis of microarray data is a new type of problem, largely unfamiliar to traditional mathematical statistics. The core problem has been termed the curse of dimensionality. In statistics, this term denotes the situation in which the number of samples available for the analysis is smaller, or even much smaller, than the number of parameters to be estimated [52].
In the microarray context, the number of mRNA abundances to be analyzed may be as large as 40,000, whereas the number of microarrays available for the analysis is typically in the dozens and only very rarely approaches one hundred. In order to defeat the curse of dimensionality and increase statistical power, the common idea of borrowing strength from the totality of all the available data is employed. Such a paradigm requires strong a priori assumptions regarding the probabilistic properties of the parameters; it also requires the development of special significance scores for the detection of differentially expressed genes. This big problem has opened numerous opportunities for statistical creativity and has inspired the development of a large variety of statistical methods. This multitude of methods, however, is not of much help in the routine work of an experimental biologist unless a professional statistician is a member of the team. Not only does a biologist face the problem of deciding which statistical method is more appropriate for the experimental situation at hand, but, regrettably, statisticians themselves still lack a consensus regarding the comparative merits of the various approaches (see, e.g., the discussion in [53]).
It is a sort of irony that the very statistical method of assigning a significance score to differential expression turns out to be difficult to standardize. However, this part of the overall standardization is crucial because different significance scores generally produce different lists of differentially expressed genes.
A simple example of this kind is the choice between two alternative and very popular methods: one based on the fold change and one based on the p-values of the t-test. This question is discussed in much detail in the work [54] by this author. Understandably, biologists are more inclined to trust what they see and to rely on the fold change as a significance measure. Although it is well appreciated that big fold changes may be spurious and originate from pure noise, fold-change-based estimates may still serve as valuable leads for subsequent experimental verification using more accurate (and usually more expensive) methods such as quantitative RT-PCR or Northern blot. An alternative to the fold-change approach is to compute gene-specific t-tests and rank the significance of differential expression in the reverse order of p-values, so that the smallest p-value ranks first. This would be the preferable choice for a statistician. In contrast, a biologist would be leery of such a criterion because there is always a suspicion that small p-values may originate not from differences in transcription levels in the assay but from ubiquitous small uncontrollable biases. As discussed in the above-mentioned report [6], there are numerous sources from which such biases may originate. An attempt to use both criteria simultaneously usually results in a very meager list of differentially expressed genes, or none at all. (In order to reconcile these two extremes, a combination score, the bio-weight test statistic, has been proposed by this author in [54].) Importantly, the p-value-based score and the fold-change-based score focus on different, largely alternative, properties of the assay, and inevitably produce different lists of differentially expressed genes.
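The divergence between the two rankings can be illustrated with a small simulation. The sketch below is not the analysis of [54]; it assumes a hypothetical experiment with 2,000 genes, four arrays per group, normally distributed log2 intensities, and gene-specific noise levels, all of which are illustrative choices. Because both groups have the same size, every gene has the same degrees of freedom, so ranking by the absolute t statistic is equivalent to ranking by the t-test p-value.

```python
import math
import random
import statistics

def log_fold_change(control, treated):
    """Difference of mean log2 intensities, i.e. the log2 fold change."""
    return statistics.fmean(treated) - statistics.fmean(control)

def t_statistic(control, treated):
    """Pooled-variance two-sample t statistic for equal group sizes."""
    n = len(control)
    pooled_var = (statistics.variance(control) + statistics.variance(treated)) / 2.0
    return log_fold_change(control, treated) / math.sqrt(2.0 * pooled_var / n)

def top_genes(scores, k):
    """Indices of the k genes with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(order[:k])

rng = random.Random(0)   # fixed seed so the run is repeatable
n_arrays = 4             # arrays per group (hypothetical)
genes = []
for g in range(2000):
    sd = rng.uniform(0.1, 1.0)       # gene-specific noise level (assumption)
    shift = 1.0 if g < 50 else 0.0   # only the first 50 genes truly change
    control = [rng.gauss(0.0, sd) for _ in range(n_arrays)]
    treated = [rng.gauss(shift, sd) for _ in range(n_arrays)]
    genes.append((control, treated))

fold = [log_fold_change(c, t) for c, t in genes]
tstat = [t_statistic(c, t) for c, t in genes]

by_fold = top_genes(fold, 100)
by_t = top_genes(tstat, 100)
overlap = len(by_fold & by_t)
print(f"top 100 by fold change vs. top 100 by t statistic: {overlap} genes in common")
```

In this toy setting the fold-change list tends to pick up noisy genes with large chance differences, while the t-based list tends to pick up quiet genes with small but consistent differences, which is the contrast described in the paragraph above.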
It is not always clearly understood that assigning a different significance score to differential expression actually means assigning a different meaning to the very notion of differential expression. (As a crude analogy, it is the same as comparing two groups of subjects by their BMIs or, alternatively, by their shoe sizes.) Some authors even go so far as to propose validation of one statistical method by assessing its agreement with another. Thus, the authors of [55] claim that the pessimistic view of microarrays as a diagnostic tool may originate from the fact that a single statistical practice is used without alternative validation.
In the example discussed above, that would mean that the fold-change-based score should be validated by the p-value-based score. (Or, in the crude analogy above, the criterion based on shoe size should be validated by that of BMI.) Obviously, as long as the choice of statistical methodology is not restricted by some sort of consensus or tradition, it will always be possible to continue the search for an approach that would allow declaring significant any group of genes recognized a priori as significant. Ironically, it may easily happen that the variability associated with selection from the pool of available statistical methodologies will make its own contribution to the poor reproducibility of microarray measurements. Essentially, this is a reflection of the effect, well known in mathematical statistics, of inflation of variance due to model selection, only applied in a different context. In anticipation of the use of microarrays in clinical practice, it is easy to imagine a nightmarish situation in which two statisticians, at the patient's bedside, dispute whose statistical method is more reliable and whose list of genes should be selected as targets, while a physician in the corner of the room waits for the verdict needed to administer a life-saving treatment.
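A related effect of method selection can be sketched numerically. The simulation below is an illustration under simplifying assumptions, not a result from the cited literature: all genes are null (no true difference), noise is standard normal, there are three arrays per group, and each of two scores (log fold change and t statistic) is calibrated by simulation to a 5% false-positive rate. A gene is then declared significant if either score flags it, which mimics choosing the method after seeing the data.

```python
import math
import random
import statistics

def mean_diff(x, y):
    """Log fold change: difference of group means on the log scale."""
    return statistics.fmean(y) - statistics.fmean(x)

def t_stat(x, y):
    """Pooled-variance two-sample t statistic for equal group sizes."""
    pooled = (statistics.variance(x) + statistics.variance(y)) / 2.0
    return mean_diff(x, y) / math.sqrt(2.0 * pooled / len(x))

def null_pair(rng, n):
    """Two groups of n log-intensities with no true difference."""
    return ([rng.gauss(0.0, 1.0) for _ in range(n)],
            [rng.gauss(0.0, 1.0) for _ in range(n)])

def null_quantile(stat, rng, n, alpha, reps):
    """Empirical (1 - alpha) quantile of |stat| under the null."""
    values = sorted(abs(stat(*null_pair(rng, n))) for _ in range(reps))
    return values[min(int((1.0 - alpha) * reps), reps - 1)]

rng = random.Random(1)          # fixed seed so the run is repeatable
n, alpha, reps = 3, 0.05, 5000  # hypothetical design and simulation size
crit_fold = null_quantile(mean_diff, rng, n, alpha, reps)
crit_t = null_quantile(t_stat, rng, n, alpha, reps)

hits_fold = hits_t = hits_either = 0
for _ in range(reps):
    x, y = null_pair(rng, n)
    flagged_fold = abs(mean_diff(x, y)) > crit_fold
    flagged_t = abs(t_stat(x, y)) > crit_t
    hits_fold += flagged_fold
    hits_t += flagged_t
    hits_either += (flagged_fold or flagged_t)

rate_fold = hits_fold / reps
rate_t = hits_t / reps
rate_either = hits_either / reps
print(f"false positives: fold {rate_fold:.3f}, t {rate_t:.3f}, either {rate_either:.3f}")
```

Each score taken alone stays near its nominal rate, whereas the "either" rule exceeds it, because the two scores flag partly different null genes. Freedom to pick the method therefore adds its own source of irreproducibility.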
There are several persistent patterns in statistical thinking that are usually taken for granted but are in fact nothing more than elements of a sort of statistical mythology. Thus, in cluster analysis of microarray data, the genes belonging to the same cluster are thought to be co-regulated. It is worth recalling that in DNA microarray technology, one is not dealing with the genes themselves but only with fluorescent intensities presumably proportional to the mRNA abundances. As discussed above in Sections 2 and 3, the latter may cluster for many reasons, with the mRNA half-lives being a dominant factor. It is, therefore, an unwarranted logical leap over many intermediate steps from the clustering of fluorescent intensities to the co-regulation of the parent genes.
The genes that are up- or down-regulated, as compared to some standard or normal behavior, are often thought to be abnormal, faulty, or perhaps mutated. As discussed in Section 5, each transcription event in a genetic regulatory system is the result of the teamwork of a large number of transcription factors. A deficiency in any of these factors may slow or even shut down the transcription of any particular gene. Figuratively speaking, in addition to the core reason that the parent gene may be faulty, there may be from 30 to 100 other reasons for the abnormal behavior of this gene. Therefore, there is not, and cannot be, any unambiguous relation between anomalies in the fluorescent intensities observed in microarray experiments and the fidelity of the corresponding genetic codes. At best, such anomalies indicate a target for further exploration by alternative, more advanced techniques.
In statistics, there may be many different estimators for the same quantity of interest. Their comparative merits are measured by their asymptotic relative efficiency. In the statistical analysis of microarray data, however, considerations of asymptotic efficiency are not directly applicable. This is because the number of microarrays usually available is miserably small, so small that any extrapolation of asymptotic efficiency to an experiment with just several subjects at hand would be preposterous. What should actually be done to demonstrate the superiority of certain estimates is to perform a simulation in which the sample size is the one actually available. To the best of this author’s knowledge, such an approach is very rare in the statistical literature on microarray data analysis. This stance has been adopted in the above-cited work by this author [54]. It has been shown there by simulation that for sample sizes smaller than ten, the bio-weight test statistic has a higher power than the t-test. For larger sample sizes, the advantages of the bio-weight test statistic disappear. In formal terms, this means that the bio-weight test statistic and the t-test have the same asymptotic efficiency; however, the former is superior for small sample sizes. All this means that considerations of asymptotic relative efficiency cannot be used as an argument in favor of one or another statistical method, as far as microarrays are concerned.
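A simulation at the actual sample size can be organized along the following lines. This is a minimal sketch, not the simulation of [54], and it does not implement the bio-weight test statistic, whose definition is not given here. As stand-ins, it compares the t statistic with the plain mean difference, under the assumptions of standard normal noise shared by all genes, a true shift of 1.5, and critical values calibrated by simulation at the 5% level.

```python
import math
import random
import statistics

def mean_diff(x, y):
    """Difference of group means (log fold change on the log scale)."""
    return statistics.fmean(y) - statistics.fmean(x)

def t_stat(x, y):
    """Pooled-variance two-sample t statistic for equal group sizes."""
    pooled = (statistics.variance(x) + statistics.variance(y)) / 2.0
    return mean_diff(x, y) / math.sqrt(2.0 * pooled / len(x))

def sample(rng, n, shift):
    """Two groups of n observations; the second is shifted by `shift`."""
    return ([rng.gauss(0.0, 1.0) for _ in range(n)],
            [rng.gauss(shift, 1.0) for _ in range(n)])

def critical_value(stat, rng, n, alpha, reps):
    """Empirical (1 - alpha) quantile of |stat| under the null at size n."""
    values = sorted(abs(stat(*sample(rng, n, 0.0))) for _ in range(reps))
    return values[min(int((1.0 - alpha) * reps), reps - 1)]

def power(stat, rng, n, shift, alpha=0.05, reps=4000):
    """Monte Carlo power of `stat` at sample size n for a given true shift."""
    crit = critical_value(stat, rng, n, alpha, reps)
    hits = sum(abs(stat(*sample(rng, n, shift))) > crit for _ in range(reps))
    return hits / reps

rng = random.Random(2)  # fixed seed so the run is repeatable
results = {}
for n in (3, 20):       # arrays per group: very small vs. moderately large
    results[n] = (power(mean_diff, rng, n, 1.5), power(t_stat, rng, n, 1.5))
    print(f"n={n}: mean difference {results[n][0]:.2f}, t statistic {results[n][1]:.2f}")
```

Under this deliberately simple equal-variance model, the mean difference gains power at n = 3 because it does not have to estimate the variance from a handful of points, and the gap closes at n = 20. A different noise model could change or reverse the ordering, which is exactly why the comparison has to be run at the sample size and under the noise assumptions of the experiment in question.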
Some authors report high specificity in the classification of cancers using DNA microarrays (see [56] for a recent example). However, it often happens that efficiency is measured by the specificity in clustering groups known a priori. In clinical settings, the groups are usually not known in advance. Cancer is a highly heterogeneous disease; it cannot always be known a priori whether or not all the conceivable clinical outcomes have been adequately represented in the training of the classifier. Therefore, in principle, successful classification of outcomes known a priori may demonstrate some clinical potential but does not by itself represent a tool for clinical diagnostics.
In summary, the lack of consensus on statistical methodologies leaves wide latitude for different interpretations of precisely the same DNA microarray data. This means that not only ambiguities in biological interpretation but also the very statistical procedures that are supposed to articulate the outcome may contribute to the uncertainties of DNA microarray measurements, thus posing difficulties in clinical settings.