My critique follows the order of results presented in the original paper. A summary of the issues addressed in this correspondence and their consequences is given in Table . In many cases, the authors draw inferences from their analysis that can be explained by chance alone. To demonstrate this, I created a null (or random) dataset that could be analyzed with the approaches used in Chari et al. This "null" dataset was generated by combining the SAGE data from all 24 of their analyzed samples and then randomly re-distributing this "meta-transcriptome" into libraries with sizes equal to those in the original dataset. The organization of these virtual libraries into never, former, and current smoker sub-categories is identical to the original dataset. Thus, the reference dataset retains the effects introduced by having unequal library sizes, unequal group sizes, and a range of total signal for each tag. The most important feature of the null dataset is that any correlation of tag expression values with smoking status has occurred by chance alone.
The relationship between the transcriptomes of never, former, and current smokers
Chari et al. show the gross relationship between sample types in terms of the number of tags preferentially expressed in different subsets of the three smoking status groups (never, former, and current smokers). This is often seen in published SAGE studies, and it serves to orient the reader to the data and provide a preliminary sense of group-specific differences. A Venn diagram, like that found in Chari et al., is a popular way of visualizing the results. However, there are two concerns with the authors' particular application of this method.
First, the criterion for "preferential" expression is unconventional. Specifically, they state "the criteria chosen for preferential expression was a threshold of a raw tag count of ≥ 2 across all samples in a particular set, but not existing in the other sets". This would seem to imply that a tag is preferentially expressed if it appears at the given threshold in all of the libraries of one group, and is below the threshold in all of the libraries of the other two groups. However, an inspection of the authors' supplementary material reveals that "not existing in other sets" can mean that a raw tag count of 0 or 1 appears in any library in both of the competing groups. For example, tag CGGTTTGCAT is considered preferentially expressed in former smokers and has the following counts: current = (1, 16, 12, 5, 17, 14, 16, 8), former = (19, 11, 8, 11, 2, 8, 5, 7, 6, 8, 3, 11), and never = (15, 0, 11, 11). However, the average expression in each group is: current = 100.7 TPM, former = 59.9 TPM, and never = 67.6 TPM. This problem is quite pronounced for a significant fraction of tags preferentially expressed in each of the never, former, and current smoker groups: 57/227 (25%), 33/102 (32%), and 635/2013 (32%), respectively, actually have a higher mean expression in one or both of the competing groups!
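A check of this kind reduces to comparing group means in tags per million (TPM). The helper below is purely illustrative (the group names, counts, and library sizes are hypothetical), but it shows how a tag called "preferential" in one group can be flagged when a competing group actually has higher normalized mean expression:

```python
def tpm(count, library_size):
    """Normalize a raw tag count to tags per million (TPM)."""
    return count / library_size * 1_000_000

def misclassified(counts, sizes, claimed):
    """True if any group other than `claimed` has a higher mean TPM.
    `counts` and `sizes` map group name -> per-library values."""
    means = {
        g: sum(tpm(c, s) for c, s in zip(counts[g], sizes[g])) / len(counts[g])
        for g in counts
    }
    return any(means[g] > means[claimed] for g in means if g != claimed)

# Hypothetical example: a tag called preferential in former smokers,
# but the current smoker group has the higher normalized mean
# (20 TPM versus 110 TPM).
counts = {"former": [2, 2], "current": [1, 10]}
sizes = {"former": [100_000, 100_000], "current": [50_000, 50_000]}
print(misclassified(counts, sizes, "former"))  # → True
```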
Second, even if the above approach were valid, severe biases are introduced by the unequal library and group sizes. The library-size bias is exacerbated because the authors use raw tag counts in their comparisons, so a tag count of 2 from a library of 50,000 tags is treated as equivalent to a tag count of 2 from a library of 200,000 tags. The group-size bias arises because the exclusion criterion (a 0 or 1 in at least one library) is more easily met by groups with more libraries. I applied the same procedure to the "null" dataset, and the resulting diagram shows values similar to those obtained from the real data (Figure 1). In particular, the small number of never smoker libraries results in a large estimate of the number of "preferentially" expressed never smoker tags.
Figure 1 "Venn" diagram using Chari et al. methodology on the null and the actual dataset. The left-hand Venn diagram shows values obtained using the null (randomized) dataset, and the right-hand Venn diagram shows values obtained using the actual dataset.
For these reasons, the results reported by the authors are of little use in determining a relationship between the never, current, and former smoker transcriptomes. An alternative would be to use much higher expression thresholds that apply equally to all libraries within a group, which would reduce the effects of sampling variation and of differences in library and group sizes (e.g. ≥50 TPM versus ≤10 TPM). Principal component analysis and hierarchical clustering of the data are also good alternatives for visualizing the overall relationship between profiles.
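Such a rule might look like the sketch below. The 50/10 TPM thresholds are the illustrative values suggested above, not a validated choice:

```python
def preferential_groups(tpm_by_group, high=50, low=10):
    """Call a tag preferential in group g only when every library in g is
    at or above `high` TPM and every library in every other group is at
    or below `low` TPM, so the thresholds apply equally to all libraries
    regardless of library or group size."""
    hits = []
    for g, values in tpm_by_group.items():
        others = [v for h, vs in tpm_by_group.items() if h != g for v in vs]
        if all(v >= high for v in values) and all(v <= low for v in others):
            hits.append(g)
    return hits

# Hypothetical per-library TPM values for a single tag:
print(preferential_groups({"never": [60, 55, 72],
                           "former": [5, 0, 3],
                           "current": [2, 8]}))  # → ['never']
```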
Multiple testing and the determination of differentially expressed tags
In a review of microarray studies used to explore cancer outcomes, Dupuy and Simon identified the failure to adequately control for false positives as a result of multiple testing as the "most common and serious" flaw they observed in procedures for gene selection [5]. The problem can be grasped intuitively when considering a hypothesis test of 10,000 genes (or, in this case, tags). If one considers a p-value ≤ 0.05 as being significant, then we expect that 500 genes will be identified by chance alone. If 1,500 genes are subsequently identified from the actual data, we can estimate that 500/1500 = 33% of these will be false positives. That this problem remains an issue in published studies is truly unfortunate, since a range of methods is available to overcome it. For instance, a simple Bonferroni correction or the Benjamini-Hochberg method are widely accepted and commonly used [6]. These methods can provide corrected p-values or estimates of the false discovery rate (FDR) for a given threshold p-value. Moreover, powerful resampling approaches (like that used to estimate the FDR in this correspondence) are precise and increasingly common with the rise in available computing power. Regardless of the method used, the objective is to determine a statistical cutoff that results in a reasonable number of false positives. An acceptable adjusted p-value or FDR is somewhat arbitrary, but for the latter metric a value lower than 10–20% is commonly cited [5].
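As an illustration, the Benjamini-Hochberg step-up procedure takes only a few lines (a generic sketch, not code from either paper):

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values). The i-th smallest p-value
    is scaled by m/i, then a running minimum taken from the largest rank
    downward enforces monotonicity; the running minimum starts at 1, so
    adjusted values never exceed 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Tags with adjusted p <= 0.05 would be reported at an FDR of 5%:
print([round(q, 3) for q in benjamini_hochberg([0.01, 0.02, 0.03, 0.5])])
# → [0.04, 0.04, 0.04, 0.5]
```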
In Chari et al., the authors' failure to account for multiple testing or to provide some estimate of the FDR drastically diminishes the value of their findings. The authors use a Mann-Whitney U statistic (a non-parametric test) on each tag to calculate the p-value for the null hypothesis that never smokers are the same as current smokers. The authors restrict their testing to tags with a "mean tag count of ≥20 tags per million (TPM) in at least one of never, former or current smoker SAGE libraries". This introduces a bias by pre-selecting tags for testing based on criteria that include the variable being tested (in this case, smoking status). The bias could reasonably be addressed by instead filtering on a mean expression of 20 TPM across all libraries.
In any event, this filter produces a reduced test set of 8,148 tags to which the statistic is applied. The authors select those tags with p ≤ 0.05 and then apply a fold-change criterion of ≥2. Before more fully exploring the central issue of multiple testing, it should be briefly mentioned that the fold-change criterion is somewhat dubious: it is easier for tags with low counts to meet this threshold than for those with higher counts. For example, assuming libraries consisting of 100,000 tags, a count of 1 versus a count of 3 represents a 3-fold change (p = 0.62, χ2-test), while the more significant difference of a count of 50 versus a count of 75 represents only a 1.5-fold change (p = 0.032, χ2-test). This is a concern when using a non-parametric test like the Mann-Whitney U with SAGE, since the relative rank of expression, rather than the actual difference, is used to provide statistical inference.
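The two p-values quoted above can be reproduced with a Yates-corrected χ2 test on 2 × 2 tables of tag count versus remaining library content, assuming equal libraries of 100,000 tags; the implementation below uses only the standard library:

```python
import math

def yates_chi2_p(a, b, n=100_000):
    """Yates-corrected chi-square p-value (1 df) comparing tag counts a
    and b drawn from two libraries of n tags each."""
    table = [[a, n - a], [b, n - b]]
    col_totals = [a + b, 2 * n - a - b]
    chi2 = 0.0
    for row in table:
        for j, observed in enumerate(row):
            expected = col_totals[j] / 2  # equal library sizes
            chi2 += (abs(observed - expected) - 0.5) ** 2 / expected
    # Survival function of chi-square with 1 df: P(X >= chi2) = erfc(sqrt(chi2/2))
    return math.erfc(math.sqrt(chi2 / 2))

print(round(yates_chi2_p(1, 3), 2))    # 3-fold change on low counts → 0.62
print(round(yates_chi2_p(50, 75), 3))  # 1.5-fold change on higher counts → 0.032
```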
To demonstrate the larger problem of multiple testing, I compared the number of tags that meet a given threshold p-value in the actual dataset against the average number obtained from the null dataset in order to estimate the FDR. The estimate must be based on the number of results found prior to the use of a post hoc filter like fold-change, and I do so here. It is not clear from Chari et al. whether all 8,148 tags were used, or a smaller number representing a minimum of 20 TPM in only the never or current smoker groups. In my estimates of the FDR, I use the more generous assumption of the latter (7,764 tags). Of these, 885 tags have p ≤ 0.05, and 609 of these pass the fold-change filter. When the identical analysis is applied to the null dataset, 7,406 tags are expressed at a mean of 20 TPM in either the never or current smoker groups. Of these, 418 tags have p ≤ 0.05, and 195 of these pass the fold-change filter. We can therefore estimate the FDR as 418/885 = 47.2%. This means that, at a minimum, nearly half of the authors' findings are false positives. The true FDR will be higher due to the initial biased filter of group-specific expression of ≥20 TPM, and higher still in the reduced set of 609 tags because of the bias introduced by the fold-change filter.
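The estimate itself is a simple ratio of null-data positives to real-data positives (with several null resamplings, the null count would be averaged first):

```python
def estimate_fdr(real_pvalues, null_pvalues, cutoff=0.05):
    """Estimate the FDR at `cutoff` as (null positives) / (real positives);
    with several null resamplings, the null count would be an average."""
    real_hits = sum(p <= cutoff for p in real_pvalues)
    null_hits = sum(p <= cutoff for p in null_pvalues)
    return null_hits / real_hits if real_hits else float("nan")

# With the counts reported above, 418 null and 885 real positives at p <= 0.05:
print(round(418 / 885, 3))  # → 0.472, i.e. an estimated FDR of ~47.2%
```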
It is unfortunate that the authors only reported the findings of a test of one null hypothesis (N = C). The other null hypotheses, (N = F) and (F = C), were not discussed. When performing these comparisons using the real and null datasets, a disturbing trend emerges (Table ). Specifically, the estimated FDR is fairly similar amongst all three of the possible comparisons at p ≤ 0.05 (47.2%, 50.2%, and 46.5%). This suggests that there are a similar number of tags with differential expression specific to each of the three groups. This presents a problem, since the authors' hypothesis of "reversible" and "irreversible" gene expression change would lead one to expect that never and current smokers would define the limits of expression, with former smokers falling somewhere within this continuum. To explore this further, I plotted the estimated FDR over a range of cutoff p-values to see if a clearer pattern would emerge with more stringent criteria (Figure 2). As one approaches a more rigorous p-value cutoff, the FDR falls below 20% for both the N = C and F = C null hypotheses, but remains at around 40% for the N = F null hypothesis. One must keep in mind that the number of samples in each group affects the limits of confidence that can be obtained in the corresponding hypothesis test, but it seems clear that the current smoker group is similarly different from either never or former smokers. This can be explored more directly by formulating null hypotheses based on all of the groups. Indeed, this seems to be the most natural way of identifying "reversible" and "irreversible" gene expression differences. Three null hypotheses are possible: (N, F = C), (N = F, C), and (N, C = F). The first two correspond to "reversible" and "irreversible" changes, respectively, while the third does not correspond with the expected biology of smoke exposure and could act as an internal control (i.e. this comparison would be expected to produce comparatively fewer results than the other two). When I performed this analysis using the authors' methods, the concerns alluded to in the two-group comparisons become all too clear (Table ). The only null hypothesis for which a correction for multiple testing can yield an FDR ≤ 20% is (N, F = C), which corresponds to "reversible" changes. (N = F, C), which corresponds to "irreversible" changes, has an abysmal FDR once a cutoff of about p ≤ 0.01 is established (Figure 3). Even the hypothesis test for genes specific to former smokers, (N, C = F), performs better.
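Each three-group null hypothesis still reduces to a two-group rank test once two of the groups are pooled; for example, (N, F = C) pools the never and former libraries against the current ones. A normal-approximation Mann-Whitney test might look like the sketch below (the authors' exact implementation is not described, and an exact test would be preferable for very small groups):

```python
import math

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value via the normal approximation,
    using midranks for tied values and a tie-corrected variance."""
    values = x + y
    n1, n2, n = len(x), len(y), len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    tie_sum = 0.0  # sum of t^3 - t over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and values[order[j]] == values[order[i]]:
            j += 1
        midrank = (i + j + 1) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = midrank
        tie_sum += (j - i) ** 3 - (j - i)
        i = j
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_sum / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

# Pooled never + former TPM values versus current (hypothetical numbers):
print(mann_whitney_p([10, 12, 9, 14, 11, 8], [30, 28, 35, 26]) < 0.05)  # → True
```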
Estimated false discovery rate (FDR) of null hypotheses using different combinations of the three groups
Figure 2 Estimated false discovery rate at different threshold p-values for comparisons of two groups. The x-axis represents different p-value cutoffs (0.00–0.10) used to determine differential expression, and the y-axis represents the estimated false discovery rate.
Figure 3 Estimated false discovery rate at different threshold p-values for comparisons of three groups. The x-axis represents different p-value cutoffs (0.00–0.10) used to determine differential expression, and the y-axis represents the estimated false discovery rate.
There do appear to be changes that correlate with the individual groups, since the estimated FDR is well below 100%. However, the differences are clearly difficult to distinguish from random variation in the dataset and most certainly do not support the irreversible/reversible dichotomy imposed by the authors. Most likely, there are other explanatory variables that would better explain the structure of the data. If such covariates were even weakly correlated with smoking status, they may well produce the results shown in this correspondence. In fact, several enticing variables are listed in the authors' own sample descriptions: age, pack-years, lung function, and years since quitting.
Supervised clustering of never, former, and current smoker profiles
Chari et al. use a technique they refer to as "supervised clustering". Dupuy and Simon describe this approach as the "most common and serious flaw" when performing class discovery [5]. Specifically, one cannot state that clusters corresponding with a variable (i.e. smoking status) are meaningful if the input genes were selected for a correlation with the same variable. In Chari et al., the authors cluster the 609 tags selected based on their differential expression between never and current smokers. Since the tags were selected based on this very property, it is not surprising that distinct clusters emerge; however, these clusters are not meaningful.
The authors go a step further by adding the expression data for the former smokers and then claiming that the clustering of all three groups is meaningful (authors' Figure ). If one were to generate random data for three equal-sized groups, select the genes that are most differentially expressed between two of these groups, and then cluster the data for all three, one would expect a pattern similar to that observed by the authors to emerge. However, in this case, the dataset has unequal library and group sizes, so the null clustering pattern will be affected. I was unable to reproduce the exact hierarchical clustering pattern in Chari et al. for their 609 differentially expressed tags, despite using the Genesis software to perform single-link clustering with Euclidean distance (as described in the paper) [8]. After exhaustive trials with different program options, I found that row normalization of the data followed by single-link clustering with a Pearson correlation distance metric yielded the most similar results. When I clustered the 195 tags from a representative resampling of the null dataset that were differentially expressed between never and current smokers, two distinct clusters emerged (Figure 4). The never smokers clustered tightly together, whereas the current and former smokers formed a distinct, but more diffuse, cluster. The clustering pattern observed for the authors' real data actually argues that former smokers are more similar to never smokers than expected by chance. Specifically, in contrast to the null clustering pattern, the former smoker libraries are distinct from current smokers and are more similar to the never smoker cluster.
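The normalize-then-cluster recipe can be sketched with a short agglomerative routine using Pearson correlation distance and single linkage. Genesis was used for the actual figures, so this is only an independent approximation with toy data:

```python
def pearson_distance(a, b):
    """1 minus the Pearson correlation of two expression vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return 1 - cov / (sa * sb)

def row_normalize(row):
    """Scale a row to zero mean and unit variance (Genesis-style row normalization)."""
    n = len(row)
    m = sum(row) / n
    sd = (sum((x - m) ** 2 for x in row) / n) ** 0.5
    return [(x - m) / sd for x in row]

def single_link(vectors):
    """Agglomerative single-linkage clustering. Returns the merge order as
    (cluster_a, cluster_b, distance) tuples over frozensets of indices."""
    clusters = [frozenset([i]) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(pearson_distance(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merges[-1][0] | merges[-1][1])
    return merges

# Three hypothetical profiles: the first two co-vary, the third is inverted.
profiles = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1]]
merges = single_link([row_normalize(p) for p in profiles])
```

On these toy profiles, the first merge joins the two co-varying vectors, with the inverted vector joining last.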
Figure 4 Hierarchical clustering using Chari et al. methodology on the null and the actual dataset. Counts for SAGE tags that met a threshold p-value cutoff of 0.05 for the null hypothesis of never = current smokers. The counts were row normalized and underwent single-link clustering with a Pearson correlation distance metric.
For these reasons, this section of the authors' results is almost meaningless. The little inference that can be made actually indicates a situation at odds with the authors' stated conclusions. Hierarchical clustering and principal component analysis are powerful techniques in class discovery and, as alluded to previously, would have been better applied at the beginning of the authors' paper in an unsupervised manner on an unbiased set of data that does not consider smoking status.
Although the key concerns have been described above, a number of additional issues with the authors' selection of irreversible and reversible genes, as well as with the RT-PCR validation efforts, were omitted for brevity. These are included as a supplementary document (see Additional File 1).