Roughly speaking, the process of molecular signature discovery on the basis of omics data consists of four major stages:
(i) Defining the scientific and clinical context for the molecular signature;
(ii) Procuring the data;
(iii) Performing feature selection and model building; and
(iv) Evaluating the molecular signature on independent datasets.
In the sections that follow, we will discuss each of these stages in turn.
2.1 Stage 1: Defining the scientific and clinical context
We first consider the problem of selecting a suitable omics data type for a molecular signature. A signature intended to distinguish between cancer and normal tissue could be based upon a number of omics data types; for instance, one might base the signature upon gene expression measurements, if it is believed that this type of cancer shows altered expression of some genes relative to normal tissue, or upon DNA sequence data, if samples from this cancer are characterized by particular mutations or copy number changes. However, given a clinical phenotype of interest, certain types of omics data might not form the basis for a sensible molecular signature. For instance, it would not be reasonable to attempt to create a molecular signature to screen for adult onset (type II) diabetes on the basis of DNA sequence data alone because an individual's DNA sequence remains essentially static throughout his or her lifetime, but risk of developing the disease may change.
We now consider the clinical context of the molecular signature. A gene expression-based signature that can distinguish between cancer and normal tissues would be of little practical use if a physician can easily make the same distinction using standard (and less expensive) clinical approaches. Similarly, a signature that can distinguish between two subtypes of cancer is useful only if those two subtypes differ in some clinically relevant way, such as in survival time or response to therapy, since otherwise the information about cancer subtype provided by the molecular signature may not serve a practical purpose. As an example, gastrointestinal stromal tumors (GISTs) and leiomyosarcomas (LMSs) are remarkably similar morphologically and were originally classified as being the same cancer. However, it was found that they respond very differently to distinct therapies, and thus a signature that can distinguish between these two diseases based on gene expression in tissue samples can be useful [3]. An example outside of cancer involves the use of metabolomic information from human serum to noninvasively diagnose and monitor Alzheimer's disease (AD) progression [32–34].
2.2 Stage 2: Data procurement
The development of a molecular signature requires the availability of adequate omics data for which the clinical phenotype of interest is available. In general, there are two ways in which such data can be procured: new data can be collected experimentally for the specific purpose of molecular signature discovery, or else existing data (collected previously for other purposes, and generally publicly available) can be used. There are pros and cons of either approach. Collecting new data has a major advantage, in that all aspects of the experiment can be carefully controlled. On the other hand, data collection is expensive, and given the large sample sizes necessary for successful molecular signature discovery, using existing datasets may be a more feasible approach. There are a number of public data repositories from which omics data and associated clinical phenotypes can be obtained. For instance, a useful source of gene expression data is NCBI Gene Expression Omnibus (GEO), a repository of over 26000 studies that continues to grow at a rapid pace. Other public data repositories include ArrayExpress [35] and Sequence Read Archive [36]. Regardless of how the data are procured, it is crucial that the samples correspond to the scientific and clinical context of interest, as described in the previous section.
In order for a dataset to be suitable for molecular signature discovery, the samples must be collected under appropriate experimental and analytical conditions. As an example, any biological factors (such as gender, age, or ethnicity) that may be associated with the clinical phenotype of interest or with the omics measurements should be taken into consideration in the process of data procurement. In addition, to reduce the prevalence of batch effects, factors such as sample collection and processing procedures, laboratory personnel, study run-dates, reagent sources, measurement instruments, and data processing methods should be carefully controlled [37–39]. Deviations in these protocols can have a surprisingly large effect on the omics measurements obtained, often larger than the effect of the clinical phenotype of interest [40]. Ideally, there should be no association between the clinical phenotype of interest and these factors. For instance, in the case of a molecular signature that classifies tissue samples into tumor versus normal, there should be no difference between the tumor and normal samples in terms of the laboratory personnel who performed the sample preparation, or the sample run-dates. If experimental and analytical procedures are not carefully controlled, they can result in confounding with the clinical phenotype of interest, leading to the development of a classifier that performs very well on the data used in its development, but that will perform poorly on independent test samples.
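One simple way to look for this kind of confounding is to cross-tabulate the phenotype against each recorded processing factor before any modelling is done. The following is a minimal sketch in Python, not a prescribed procedure: the function names, the toy labels, and the choice of Cramér's V as the summary statistic are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def crosstab(phenotype, batch):
    """Count samples for every (phenotype, batch) combination."""
    return Counter(zip(phenotype, batch))

def cramers_v(phenotype, batch):
    """Cramér's V: 0 means no association, 1 means complete confounding."""
    n = len(phenotype)
    counts = crosstab(phenotype, batch)
    row_totals = Counter(phenotype)
    col_totals = Counter(batch)
    chi2 = 0.0
    for p in row_totals:
        for b in col_totals:
            expected = row_totals[p] * col_totals[b] / n
            chi2 += (counts[(p, b)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Toy example (hypothetical labels): eight samples, two run dates.
phenotype = ["tumor"] * 4 + ["normal"] * 4
confounded_dates = ["day1"] * 4 + ["day2"] * 4   # all tumors run on day 1
balanced_dates = ["day1", "day1", "day2", "day2"] * 2

v_confounded = cramers_v(phenotype, confounded_dates)
v_balanced = cramers_v(phenotype, balanced_dates)
```

A value near 1 for any processing factor would suggest that a classifier could be learning that factor instead of the phenotype.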
To the extent that analytical and experimental factors do vary among the samples, these factors should be explicitly included in the model used to develop the classifier. Normalization procedures have been proposed that are intended to reduce the effect of measured and unmeasured external factors on omics data [41]; however, good experimental design remains the best strategy [42]. Exploratory data analysis techniques, such as hierarchical clustering and principal components analysis, can be useful tools to assess the extent to which covariates that are not of primary interest may have affected the data.
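As an illustration of how principal components analysis can expose an unwanted covariate, the sketch below simulates a small dataset in which one batch carries an additive shift, then checks how strongly the first principal component tracks batch membership. The simulated data, the size of the shift, and the helper name `pca_scores` are assumptions made for this example only.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows of X) onto the leading principal components."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
n_per_batch, n_features = 10, 200
X = rng.normal(size=(2 * n_per_batch, n_features))
batch = np.repeat([0, 1], n_per_batch)
X[batch == 1, :100] += 2.0                       # simulated batch effect

scores = pca_scores(X)
# Absolute correlation between the first component and the batch label.
pc1_batch_corr = abs(np.corrcoef(scores[:, 0], batch)[0, 1])
```

In real data one would typically plot the first few components colored by run date, operator, or reagent lot; a strong correlation of this kind is a warning sign and not a formal test.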
When existing data are used for omics-based molecular signature discovery, it is particularly important that sufficient information about the experiment is available to ensure that good experimental design was followed (this will be discussed further in Section 4). For instance, if the run date for each sample is not given, then one cannot be certain that the clinical phenotype of interest is not highly confounded with run date.
Unfortunately, many omics studies have sample sizes substantially smaller than would be required for the successful identification of molecular signatures. A molecular signature that is developed on the basis of a small number of samples is more likely to be sensitive to technical and biological sources of noise and variation, and less likely to capture the aspects of the data that are truly associated with the phenotype of interest. This exacerbates the risk of over-fitting, wherein the signature performs well on the samples used for signature development but fails to correctly predict the clinical phenotype of interest in previously unseen samples. In contrast, global molecular characteristics of a particular phenotype may become more apparent as sample size increases. Therefore, having a large sample size, while by no means a cure-all, will greatly improve the odds that a given attempt at molecular signature discovery will prove fruitful. Integrating across multiple datasets of the same phenotypes from different labs can also help to amplify the primary biological signal of interest relative to noise. Of course, whether a given sample size is “large” or “small” depends on the type of omics data being used for signature discovery, the clinical phenotype of interest, and many other factors.
2.3 Stage 3: Feature selection and model building
Once a scientific and clinical context has been established and one or more datasets have been identified, we can develop a molecular signature through (i) feature selection; and (ii) model building. These two tasks can be performed together or separately.
We first consider the task of feature selection. A typical omics experiment simultaneously measures thousands or even millions of biological features (e.g. single nucleotide polymorphisms, RNA transcripts, protein levels) on each patient sample. However, just because thousands of molecular measurements are obtained does not mean that thousands of molecular measurements should be used in the molecular signature. Financial cost, technical practicality, and measurement robustness are important criteria in choosing a signature. All else being equal, a signature that can ultimately be measured via PCR or Western blot is therefore preferable to one that requires a technique with many more protocol steps, such as those used for omics measurements. In order to reduce the number of features used in molecular signature development, feature selection is performed. Feature selection can be performed in a supervised manner (e.g. the 20% of features that are most associated with the clinical phenotype of interest are selected), or in an unsupervised manner (e.g. the 20% of features with the highest variance are selected). Once a set of features has been selected, only those features are used in the model building process, which is described next.
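The two styles of feature selection can be sketched as follows. This is a minimal illustration on simulated data, assuming a two-class phenotype coded as 0/1; the function names and the use of a Welch t-statistic as the measure of association are choices made for the example, not a recommendation.

```python
import numpy as np

def top_fraction(scores, frac):
    """Indices of the features with the largest scores (at least one feature)."""
    k = max(1, int(round(frac * scores.size)))
    return np.sort(np.argsort(scores)[-k:])

def select_unsupervised(X, frac=0.20):
    """Rank features by variance; the phenotype labels are never consulted."""
    return top_fraction(X.var(axis=0, ddof=1), frac)

def select_supervised(X, y, frac=0.20):
    """Rank features by absolute Welch t-statistic between the two classes."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return top_fraction(np.abs(a.mean(axis=0) - b.mean(axis=0)) / se, frac)

# Simulated data: 40 samples, 10 features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 10))
X[:, 8:] *= 10.0          # features 8 and 9 are noisy but unrelated to y
X[y == 1, :2] += 5.0      # features 0 and 1 differ between the classes

by_variance = select_unsupervised(X)
by_association = select_supervised(X, y)
```

In this toy setting the two criteria pick different features, which shows that high variance alone does not imply association with the phenotype.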
We now consider the task of model building – i.e. the process of developing a specific computational procedure that can be applied to the omics measurements from a future patient sample in order to predict the unknown clinical phenotype of interest for that sample. There are many possible approaches to building such a model, and in particular, the type of model used will depend on the clinical phenotype of interest. For instance, if we wish to develop a molecular signature to predict time to cancer recurrence, then a Cox proportional hazards model might be appropriate. On the other hand, to develop a molecular signature that can distinguish between cancer and normal tissue, one could use a classification approach, such as logistic regression, support vector machines, neural networks, or linear discriminant analysis. Some approaches for model-building involve first performing an unsupervised technique, such as clustering or principal components analysis, followed by a supervised procedure, such as logistic regression.
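To make the idea of "a specific computational procedure" concrete for a two-class phenotype, the sketch below fits a ridge-penalised logistic regression by plain gradient descent on simulated data. The learning rate, iteration count, penalty, and simulated effect size are assumptions chosen for illustration; in practice an established statistical package would normally be used.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500, l2=1.0):
    """Fit ridge-penalised logistic regression by gradient descent (illustrative only)."""
    mu = X.mean(axis=0)            # centring constants become part of the model
    Xc = X - mu
    n, p = Xc.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Xc @ w + b)))
        w -= lr * (Xc.T @ (prob - y) + l2 * w) / n
        b -= lr * np.mean(prob - y)
    return {"mu": mu, "w": w, "b": b}

def predict_proba(model, X_new):
    """Predicted probability of class 1 for each row of X_new."""
    z = (X_new - model["mu"]) @ model["w"] + model["b"]
    return 1.0 / (1.0 + np.exp(-z))

# Simulated data: 60 samples, 5 features, the first two informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 5))
X[y == 1, :2] += 3.0

model = fit_logistic(X, y)
train_accuracy = np.mean((predict_proba(model, X) >= 0.5) == y)
```

The training accuracy is computed only to show that the procedure runs end to end; it is not an estimate of performance on new samples.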
Once we have developed a model, how can we determine whether it is any good? Despite certain drawbacks [43, 44], the most popular approach for evaluating model performance in this context is cross-validation. (Cross-validation is also often used for tuning parameter selection, though that application is outside of the scope of this paper.) Cross-validation involves repeatedly splitting the samples in the dataset into training and test sets, performing all aspects of feature selection and model building on the training set, and evaluating the model's performance on the test set. Cross-validation can also be used to select from among a small number of possible models: the model with the smallest cross-validation error rate should be chosen.
Cross-validation is a simple and intuitive approach to estimating the error rate associated with a model, but it must be performed with care. Most importantly, within each cross-validation fold, no information about the test set can be used in building the model on the training set. For instance, suppose that one performs feature selection by selecting the 10% of features whose t-statistics between cases and controls are largest. One then performs logistic regression, using only these features, to develop a classifier to distinguish between cases and controls. How should the cross-validation error rate be calculated? Consider the following two approaches:
Approach 1 (incorrect): identify the 10% of features that differ most between cases and controls, and use only those features henceforth. Perform cross-validation by repeatedly splitting the samples into training and test sets, fitting a logistic regression model on the training set (using just the 10% of features previously identified), and then evaluating the model's performance on the test set.
Approach 2 (correct): perform cross-validation by repeatedly splitting the samples into a training set and a test set. Within each training set, identify the 10% of features that differ most between cases and controls, and use those features to fit a logistic regression model. Then, evaluate the performance of this model on the test set.
The difference may seem subtle, but it is in fact crucial. Approach 1 will yield a woeful underestimate of the true error rate, because the 10% of features that differ most between cases and controls were identified using all of the samples, including those in the test set, rather than simply the training samples. In effect, if Approach 1 for cross-validation is taken, then near-zero error rates can potentially be obtained even on datasets in which the “case” and “control” labels were assigned randomly! On the other hand, in Approach 2, feature selection is performed using the training set within each cross-validation fold, and so the resulting cross-validation error rate is valid. Unfortunately, the difference between Approaches 1 and 2 is often overlooked, and the literature is rife with papers in which extraordinarily low, but grossly inaccurate, cross-validation error rates are reported because some variant of Approach 1 has been performed. The key principle is that in computing cross-validation error rates, within each cross-validation fold only training observations can be used in any aspect of feature selection or model development. Deviations from this principle, even if seemingly innocuous, may result in dramatic underestimates of error.
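The contrast between the two approaches can be reproduced on pure noise. In the sketch below the labels are assigned at random, so no classifier can truly do better than chance. To keep the example dependency-free, a nearest-centroid classifier is swapped in for logistic regression; that substitution, the sample and feature counts, and the five-fold split are all assumptions of this illustration.

```python
import numpy as np

def abs_t(X, y):
    """Absolute Welch t-statistic of every feature between classes 1 and 0."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / se

def top_features(X, y, frac=0.10):
    """Indices of the fraction of features that differ most between classes."""
    k = int(frac * X.shape[1])
    return np.argsort(abs_t(X, y))[-k:]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closer training centroid."""
    c1 = X_train[y_train == 1].mean(axis=0)
    c0 = X_train[y_train == 0].mean(axis=0)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_error(X, y, folds, select_inside_fold):
    """Cross-validation error under Approach 2 (True) or Approach 1 (False)."""
    if not select_inside_fold:
        keep_all = top_features(X, y)      # Approach 1: test samples leak in here
    errors = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        keep = top_features(X[train], y[train]) if select_inside_fold else keep_all
        pred = nearest_centroid_predict(X[train][:, keep], y[train], X[test_idx][:, keep])
        errors += np.sum(pred != y[test_idx])
    return errors / len(y)

rng = np.random.default_rng(1)
n, p = 60, 2000
X = rng.normal(size=(n, p))
y = rng.permutation(np.repeat([0, 1], n // 2))   # labels unrelated to X
folds = np.array_split(rng.permutation(n), 5)

err_leaky = cv_error(X, y, folds, select_inside_fold=False)    # Approach 1
err_correct = cv_error(X, y, folds, select_inside_fold=True)   # Approach 2
```

With random labels, the error from Approach 2 should sit near 50%, whereas the error from Approach 1 is typically far lower even though there is nothing to learn.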
At the end of the feature selection and model building process, the molecular signature must be locked down – i.e. the precise computational procedure used to convert a new omics sample into a prediction of the clinical phenotype must be completely specified. Only then can the molecular signature be fairly evaluated on independent datasets, as described next.
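One way to picture a locked-down signature is as an immutable record of every quantity needed to turn a new sample into a prediction. The sketch below is a hypothetical illustration: the class name `LockedSignature`, the feature names `GENE_A` and `GENE_B`, the weights, and the use of a hash as a tamper check are all assumptions introduced here.

```python
from dataclasses import dataclass, FrozenInstanceError
import hashlib
import json
import math

@dataclass(frozen=True)
class LockedSignature:
    """Fully specified logistic-type signature; fields cannot be reassigned."""
    features: tuple
    weights: tuple
    intercept: float
    threshold: float = 0.5

    def score(self, sample):
        """Predicted probability of 'case'; raises KeyError if a feature is missing."""
        z = self.intercept + sum(w * sample[f] for f, w in zip(self.features, self.weights))
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, sample):
        return "case" if self.score(sample) >= self.threshold else "control"

    def fingerprint(self):
        """Hash of all parameters, recorded before independent evaluation begins."""
        payload = json.dumps([self.features, self.weights, self.intercept, self.threshold])
        return hashlib.sha256(payload.encode()).hexdigest()

# Hypothetical two-gene signature.
signature = LockedSignature(features=("GENE_A", "GENE_B"), weights=(1.0, -1.0), intercept=0.0)
call_1 = signature.predict({"GENE_A": 2.0, "GENE_B": 0.0})
call_2 = signature.predict({"GENE_A": 0.0, "GENE_B": 2.0})
```

Recording the fingerprint alongside the evaluation results makes it possible to confirm later that the signature was not adjusted after the independent data were seen.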
2.4 Stage 4: Evaluation on independent datasets
Once a promising molecular signature has been identified, its performance needs to be evaluated on completely independent patient samples. Unlike cross-validation, wherein the test set is drawn from the same population as that of the training set, an independent sample is one that is completely separate from the set of samples used for feature selection and model building. In particular, this means that the test set is not simply a random split from a large dataset (even if sequestered and not used in any training sets). If a molecular signature performs well on a truly independent set of samples, then this provides evidence that it will likely generalize to future patient samples. However, the amount of evidence for a molecular signature's performance based on independent data depends critically upon specific characteristics of the independent dataset.
Lower level of evidence. Good performance on an independent dataset collected at the same institution using carefully controlled protocols. This provides evidence that the molecular signature works well in this particular setting, with these protocols, with the patient profile at this institution, etc. However, it may not hold up elsewhere. At the very least, its ability to work in other settings has not been demonstrated.
Higher level of evidence. Good performance on multiple independent datasets collected at multiple institutions. Success in this setting is the best evidence that a molecular signature will perform well on future patient samples. This indicates that the signature is robust to the kinds of things that might change between locations: namely, aspects of the biology of the populations that tend to go to a particular hospital, sample preparation and measurement techniques used, and so forth.
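When several independent datasets are available, reporting performance separately for each one shows whether the signature holds up across institutions, which a single pooled figure could hide. The sketch below assumes a locked prediction rule is already available as a function; the rule `locked_rule`, the cohort names, and the toy samples are hypothetical.

```python
def error_rate(predict, samples, labels):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(predict(s) != t for s, t in zip(samples, labels))
    return wrong / len(labels)

def evaluate_by_cohort(predict, cohorts):
    """cohorts maps a cohort name to a (samples, true_labels) pair."""
    return {name: error_rate(predict, s, t) for name, (s, t) in cohorts.items()}

def locked_rule(sample):
    """Hypothetical locked signature: a fixed cutoff on one measurement."""
    return "case" if sample["GENE_A"] > 1.0 else "control"

cohorts = {
    "institution_1": (
        [{"GENE_A": 2.0}, {"GENE_A": 0.2}],
        ["case", "control"],
    ),
    "institution_2": (
        [{"GENE_A": 1.5}, {"GENE_A": 1.2}, {"GENE_A": 0.1}, {"GENE_A": 0.4}],
        ["case", "control", "control", "control"],
    ),
}

results = evaluate_by_cohort(locked_rule, cohorts)
```

Note that the rule is applied unchanged to every cohort; refitting it on any of these samples would mean they are no longer independent.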
Evaluation of a molecular signature on fully independent patient samples is the gold standard for assessing its performance. Unfortunately, it is often the case that molecular signatures that seem promising in the feature selection and model building stage (i.e. that have very low cross-validation error rates) exhibit poor performance on independent data.