Several approaches have been undertaken to discover biomarkers for COPD that may be useful for early diagnosis, prevention, therapeutic intervention, and prognosis. The first COPD biomarker was described by Eriksson, who observed that patients lacking α1-antitrypsin, the principal inhibitor of neutrophil elastase, developed early-onset emphysema (29). Subsequent genetic studies have identified regions of the genome, and lists of gene variants, associated with COPD phenotypes (30). DNA microarrays have proven to be a major contributor to the discovery of biomarkers for various diseases. Microarray technology allows simultaneous comparison of the expression of thousands of genes (32). Numerous studies on the use of DNA microarrays have supported the effectiveness of gene expression patterns for clustering diseased tissues apart from each other and from normal tissues. However, comparison of the observed gene expression data often reveals significant biases in classification schemes.
Recently, gene expression microarray analysis of human lung tissue has been used in an effort to identify biomarkers, distinguish disease subtypes, and generate candidates for further genetic and biological studies. Spira and colleagues reported genome-wide expression profiling of subjects with severe emphysema undergoing lung volume reduction surgery (13). These studies identified gene expression markers for severe emphysema as well as positive response to surgery. Golpon and coworkers used a similar approach and identified gene expression biomarkers distinguishing patients with α1-antitrypsin deficiency (10). As with most disease-focused microarray studies, there has been a general lack of consistency in the identification of COPD gene expression biomarkers. One notable exception is EGR1. EGR1 was identified in a microarray study as a gene overexpressed in subjects with emphysema by Zhang and colleagues (14). Subsequently, Ning and coworkers, using a combined microarray/SAGE approach, validated EGR1 induction associated with COPD severity (12). Ning and colleagues went on to show that EGR1 appears to contribute to disease pathogenesis, as it can regulate matrix remodeling potential through fibroblast protease production. Interestingly, we find no evidence of differential expression for EGR1 in our population with regard to either discrete or quantitative phenotypes.
We have recently used an integrated genomics approach to identify SERPINE2 as a candidate COPD susceptibility gene (15). These data indicated that SERPINE2 expression was significantly correlated with quantitative COPD phenotypes in the data set of Spira and coworkers (13). No probe sets for SERPINE2 passed the repeated criteria used in this current study to be defined as gene expression biomarkers. However, we did find significant association for each of the three SERPINE2 probe sets for individual quantitative traits (227487_s_at: r = −0.36833 with FEV1%predicted, P = 0.0061; 212190_at: r = −0.28406 with FEV1%predicted, P = 0.037; 236599_at: r = −0.28908 with FEV1/FVC, P = 0.034). These data are consistent with our previous observations, revealing robust and consistent increases in SERPINE2 gene expression in the lungs of subjects with airflow obstruction.
In the studies described here, we report the identification of a molecular signature for discrete and quantitative COPD phenotypes through the generation and analysis of microarray data from human lung tissue. We used a repeated approach for data analysis: gene expression level (signal intensity) values were extracted from raw data files using both nonnormalized (MAS5) and normalized (RMA) approaches; frequentist (SAM) and Bayesian (BADGE) statistical methods were used to test for significant associations between gene expression and discrete phenotypic variables (e.g., disease versus control); and linear (Pearson) and rank (Spearman) correlations were used to test for significance with continuous phenotypic variables (e.g., FEV1%predicted, FEV1/FVC). All analysis methods were repeated for each probe set and signal intensity data set. Results were summarized where data consistently implicated an association between gene expression and the disease variables. Initially, we identified genes differentially expressed between cases and controls, as has been performed in previous studies. In principle, differentially expressed genes (i.e., genes that are expressed more in one group than another) should provide the highest predictive power, yet methods developed to date fall short in their ability to predict the status of known samples. The identification of genes differentially expressed in the presence or absence of COPD in our data set appeared to be driven by a subset of the subjects and was potentially biased by the small sample size (only 33 of 56 subjects were used) and by phenotypic heterogeneity. In addition, we performed an assessment of gene expression changes associated with quantitative changes in lung function. This allowed us to use the entire data set and control for phenotypic heterogeneity as defined by FEV1%predicted or FEV1/FVC.
We suggest that the combined set of genes identified in these studies represents a robust molecular signature for discrete and quantitative COPD phenotypes.
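The consistency requirement described above, in which an association is retained only when every signal extraction method and every correlation type agrees, can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the toy values, the function names, and the cutoff `min_abs_r` are all hypothetical, and the cutoff on |r| merely stands in for the significance tests that were actually applied.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def consistent_association(signal_sets, phenotype, min_abs_r=0.25):
    """Return (flag, coefficients) for one probe set.

    flag is True only if every (signal data set, correlation type) pair
    gives a coefficient of the same sign with |r| >= min_abs_r.
    min_abs_r is a hypothetical stand-in for a significance test.
    """
    rs = [corr(signal, phenotype)
          for signal in signal_sets
          for corr in (pearson, spearman)]
    same_sign = all(r < 0 for r in rs) or all(r > 0 for r in rs)
    strong = all(abs(r) >= min_abs_r for r in rs)
    return same_sign and strong, rs


# Toy data (invented): one probe set measured in nine subjects.
fev1_pct = [35, 48, 52, 61, 70, 78, 85, 92, 101]
mas5_like = [9.1, 8.7, 8.9, 8.2, 7.9, 8.0, 7.4, 7.1, 7.3]  # MAS5-style signal
rma_like = [8.8, 8.9, 8.5, 8.1, 8.0, 7.7, 7.6, 7.0, 7.2]   # RMA-style signal

flag, rs = consistent_association([mas5_like, rma_like], fev1_pct)
print(flag, [round(r, 2) for r in rs])
```

Requiring agreement across all four combinations trades sensitivity for robustness: a probe set that is significant under only one extraction method or one correlation type is dropped.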
Finally, we assessed the utility of our methods and results to predict COPD in a separate data set. Biomarkers were developed using our heterogeneous subject population, containing individuals with wide-ranging levels of airflow obstruction. We tested these biomarkers in a more homogeneous population composed of subjects with severe emphysema (13). Using the 254 informative probe sets identified in our subjects, 84 of which were available in the data set of Spira and colleagues, we had 97% predictive accuracy and 100% sensitivity. This represents the first gene expression array biomarker for COPD validated in an independent population. In addition, we discovered a group of 40 of these probe sets (representing 38 genes) with 100% predictive accuracy.
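For readers less familiar with the performance measures quoted above, the following sketch shows how predictive accuracy and sensitivity are conventionally computed from true and predicted class labels. The labels are invented toy data, and the function name and the `positive` parameter are hypothetical; no values from the validation data set are reproduced here.

```python
def classification_metrics(actual, predicted, positive="COPD"):
    """Accuracy, sensitivity, and specificity for a two-class prediction."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    tn = sum(a != positive and p != positive for a, p in pairs)  # true negatives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy example (invented): 6 cases and 4 controls, one control misclassified.
actual = ["COPD"] * 6 + ["control"] * 4
predicted = ["COPD"] * 6 + ["COPD"] + ["control"] * 3

metrics = classification_metrics(actual, predicted)
print(metrics)
```

In a two-class setting, 100% sensitivity combined with less than 100% accuracy means that every misclassified sample is a control predicted as a case, as in the toy example.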
Even though the establishment of a validated gene expression biomarker for COPD is a significant achievement, the current study has limitations. Due to the varying distribution of airflow obstruction in our study cohort, we chose to design analyses based on quantitative spirometry measures as opposed to GOLD criteria, as recently reported by others (12). We classified cases on the basis of general criteria for significant airflow obstruction characteristic of COPD, including FEV1 < 70% predicted and FEV1/FVC < 0.7, while controls showed no evidence of significant airflow obstruction (FEV1 > 80% predicted, FEV1/FVC > 0.7). Of course, one must consider variability in the measurement of quantitative traits such as lung function that may contribute to reliability of marker detection. Further, the presence of emphysema by radiology/surgical pathology was not thoroughly assessed in a majority of our subjects. The phenotypic heterogeneity of COPD may be the cause of limited replication of previous results in the current study, and in previous studies in general. Other confounding factors that limit the reliability of these types of studies include tissue sample heterogeneity and small number of samples relative to the number of genes tested. The effect of phenotypic heterogeneity upon marker identification, at least in theory, can be minimized by assessing quantitative variables of disease severity. We applied such an approach here to both offset the obvious disease heterogeneity in our subjects and to substantially increase our sample size (n = 33 versus 56). A sample size of 56 subjects makes this the largest gene expression biomarker study of COPD published to date. In addition, cigarette smoke can have broad and significant effects on gene expression. The genome-wide response to cigarette smoke exposure in airway epithelial cells has been reported (33). It will be of great interest to examine the relationships between gene expression changes resulting from cigarette smoke exposure and those consistently associated with COPD phenotypes as defined in the current study. Those genes that are responsive to smoke and differentially expressed in diseased individuals may represent true susceptibility factors.
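The spirometric case and control definitions given earlier in this paragraph can be written as a simple rule, which also makes clear why some subjects fall into neither group and therefore contribute only to the quantitative analyses. This is a sketch under stated assumptions: the function name, the "unclassified" label, and the subject values are hypothetical and are not taken from the study data.

```python
def classify_subject(fev1_pct_predicted, fev1_fvc_ratio):
    """Assign a discrete phenotype from two spirometry measures.

    Thresholds follow the text: cases have FEV1 < 70% predicted and
    FEV1/FVC < 0.7; controls have FEV1 > 80% predicted and FEV1/FVC > 0.7.
    Anything else is left out of the discrete comparison (the label
    "unclassified" is an assumption made for this illustration).
    """
    if fev1_pct_predicted < 70 and fev1_fvc_ratio < 0.7:
        return "case"
    if fev1_pct_predicted > 80 and fev1_fvc_ratio > 0.7:
        return "control"
    return "unclassified"


# Invented subjects: (FEV1 % predicted, FEV1/FVC ratio)
subjects = {
    "S1": (45, 0.52),  # meets both case criteria
    "S2": (95, 0.81),  # meets both control criteria
    "S3": (75, 0.68),  # FEV1 between the two thresholds
    "S4": (85, 0.65),  # normal FEV1 but reduced ratio
}

labels = {sid: classify_subject(*values) for sid, values in subjects.items()}
print(labels)
```

Subjects such as S3 and S4 illustrate the intermediate phenotypes that a discrete case versus control design discards, whereas a correlation against FEV1%predicted or FEV1/FVC can use them.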
Another potential limitation of the current study is the diagnosis of tumors in most subjects. Lung cancer and COPD are both typically found in smokers, and the diagnosis of lung cancer can serve as a predictor for COPD independent of smoking history. In this study, the presence of malignant, or even benign, tumors may result in significant effects on gene expression in the distant, histologically normal lung tissue used for our gene expression studies (see ). The vast majority of our subjects (80%) were diagnosed with either squamous cell carcinoma (34%) or adenocarcinoma (46%). We tested for and found no consistent differences in gene expression between tumor types within cases, within controls, or independent of lung function. Further, COPD biomarkers were not significantly differentially expressed between tumor types in any independent test. Finally, the potential influences of tumor upon gene expression did not limit the ability of our biomarkers to serve as successful class predictors in tumor-free patients with COPD. These data suggest that any effects of the tumor upon gene expression in distant, histologically normal tissue were not consistent or robust.
While there is no indication that the genes that we identified are etiological or causative in COPD pathology, an analysis of biomarker function using ontological assessment identifies an overrepresentation of genes involved in DNA binding and transcription factor activity. This was unanticipated and is independently observed for biomarkers of either discrete or quantitative COPD phenotypes. Historically, there has been only modest investigation of the involvement of transcriptional regulators in COPD pathogenesis. Notable exceptions include the previously identified and validated COPD expression biomarker gene EGR1 (12), and the recent identification of Nrf2 as a genetic susceptibility factor for experimental emphysema in mice (35). Interestingly, histone deacetylase activity (HDAC2) has recently been implicated in gene dysregulation in human patients with COPD (36). The identity and regulatory function of individual biomarker genes identified in this study are not clear, but include a number of zinc finger–binding domain containing proteins.
We used a rigorous analytical approach for these studies to identify the most robust and consistent set of biomarkers for discrete and quantitative COPD phenotypes. This strategy used multiple, independent microarray data extraction methods and repeated statistical testing. It was prompted by the limitations of any single analytical method when applied to complex, disease tissue–associated microarray data sets, and it is supported by our successful validation using an independent COPD lung tissue data set. The genes we identified and validated have no previously described roles in processes relevant to disease pathogenesis, so they are more likely to be true markers than etiological factors. The identification of these markers may help the development of noninvasive methods (such as genetic tests) that facilitate diagnosis and classification of disease subtypes and/or provide a means to define response to therapeutic intervention. Further studies will be required to determine whether any of these biomarker genes play a role in human COPD susceptibility or pathogenesis.