In this study, we carried out transcriptome sequencing on airway epithelial cells from healthy, never and current smoker volunteers and from smokers with a diagnosis of lung cancer or an alternative benign disease of the chest. The goals of this study were to compare several methods for conducting airway transcriptome sequencing and to evaluate the potential of this technology to provide new biological insights into the altered gene expression in the airway epithelium in response to tobacco smoke and lung cancer.
The design of this study () made it possible to compare different RNA-Seq protocols, read lengths, and sequencing types (SE or PE) across the same set of pooled samples. The samples were processed using 2 different, but complementary, protocols: the standard Illumina protocol, which selects polyadenylated RNA from total RNA prior to cDNA synthesis, and a prototype NuGEN Ovation RNA-Seq protocol, which uses a combination of oligo-d(T) primers and random hexamers to synthesize cDNA from total RNA. The samples processed using the NuGEN protocol had a lower percentage of uniquely aligned reads than samples processed using the Illumina protocol. This difference may be partly due to higher-quality reads from the samples prepared with Illumina protocol using a new Illumina sequencer (greater number of reads aligning with zero mismatches, ) and to a higher content of repetitive RNA in the samples prepared using the NuGEN protocol. The NuGEN protocol had a much higher percentage of reads aligning to mitochondrial RNA and rRNA (39% vs. 12%) due to the fact that it captures both polyadenylated and non-polyadenylated RNAs and that the protocol may not have been fully optimized (a prototype version of the protocol was used). We were able to assess the effects of read length and sequencing type, using the Illumina-protocol processed samples, by trimming the reads and considering each read of each pair separately. Increased read length (36 to 75 nt) results in 10% more reads aligning to a unique location in the genome and 8% more reads that span splice junctions. PE versus SE sequencing gives an additional 6% increase in the number of reads that align uniquely as a pair. The results show that there are clear advantages to longer sequencing length and PE sequencing versus SE sequencing. In addition, the standard Illumina protocol results in higher coverage of polyadenylated genes (mostly protein coding), but it fails to capture a subset of non-polyadenylated transcripts. In this study, the subset of genes detected only in samples processed using the NuGEN protocol is small; however, it is likely that an optimized protocol (incorporating methods to reduce repetitive and highly abundant RNA) combined with higher-quality PE 75 nt (or greater) reads would yield additional transcripts. Our data suggest that the optimal approach for sequencing the airway transcriptome would be to use an optimized protocol that captures both poly-adenylated and non-polyadenylated transcripts or a combination of protocols followed by 75 nt (or longer) PE sequencing at a depth of coverage greater than 30 million reads.
The union set of genes measured by both the Illumina and NuGEN protocols was used to define the compendium of genes expressed in the airway transcriptome. We believe this is the first comprehensive catalogue of genes expressed in the bronchial airway epithelium. Despite the small sample size, the numbers of genes (even the conservative airway transcriptome definition, Supplementary Table S3
) exceeds the number of genes determined to be expressed in the airway using microarrays (2
). The majority of genes were detected when the samples were processed using the Illumina protocol, resulting in higher read coverage of annotated genes (19,786 vs. 10,926 genes for samples processed using NuGEN, ), because less reads were lost to alignments to mitochondrial RNA, rRNA, and non-polyadenylated transcripts. The fact that the expression of some genes could only be detected when the samples were processed using the NuGEN protocol was therefore probably not the result of coverage differences but rather because of differences between the protocols. The genes whose expression is detected only by the samples processed using the NuGEN protocol (n
= 787) are more likely to belong to gene biotypes other than protein coding () and to the set of Ensembl genes not categorized as "known." Despite the differences between the 2 protocols, the read counts obtained are highly correlated (r
= 0.59, P
< 0.001) within the set of genes detected by both protocols (n
= 10,139). There is a group of genes (among the 10,139 genes) with markedly higher counts when using the NuGEN protocol versus the Illumina protocol () that predominantly belong to gene biotypes other than protein coding. One interesting example is the lincRNA MALAT1
, which has a mean RPKM of 14,915 (NuGEN) versus 2352 (Illumina). Reads aligning to MALAT1
from samples processed using the Illumina protocol are distributed along the entire length of the gene, whereas reads from samples processed using the NuGEN protocol are concentrated in a shorter, approximately 300-bp region of the transcript, which may represent an additional non-polyadenylated processed RNA transcribed from this locus (Supplementary Fig. S1
). Defining the airway transcriptome by combining the strengths of both RNA-Seq library preparation protocols is an important step in fully understanding the biology of the airway field of injury.
Genes in the airway transcriptome were classified as differentially expressed between the S and NS samples or the C and NC samples to find enriched biological pathways and functions. Surprisingly, genes involved in ion channel activity were enriched among the differentially expressed genes detected only in samples processed using the NuGEN protocol (Supplementary Table S4
). It is unclear if this finding is because of biases in the library preparation protocols, polyadenylated tail length, or the presence of non-polyadenylated isoforms. The presence of non-polyadenylated isoforms of transcripts measured in samples processed using the NuGEN protocol, which may have important regulatory functions, needs to be confirmed with further studies. Among genes detected in samples processed using the Illumina protocol (n
= 19,786), there were several smoking-related pathways such as oxidoreductase activity, retinol metabolism, and metabolism of xenobiotics by cytochrome P450 that were enriched among genes differentially expressed between the NS and S samples. Cell adhesion molecules, cytokine–cytokine receptor interaction, and chemokine activity were enriched among genes differentially expressed between the NC and C samples (Supplementary Table S5
). The RNA-Seq data appear to find gene expression alterations that are smoking and cancer related despite the small sample size.
RNA-Seq–derived fold changes between the NS and S samples and between the NC and C samples are significantly correlated with changes measured by microarrays among genes interrogated by both platforms. There is, however, a weaker correlation among the samples with and without cancer () that can be partially explained by the fact that the an older microarray platform was used and that only a subset of the cancer samples used in the pool had microarray data. In addition, the cancer signal is weaker than the smoking signal [less genes have an absolute log2 fold change > log2(1.5)], and therefore, the correlations of C/NC fold change between data generated using the Illumina and the NuGEN protocols or between RNA-Seq and microarray are weaker.
An advantage of RNA-Seq over microarray technology lies in the number of genes that are significantly differentially expressed by RNA-Seq but are not interrogated on the microarray as exemplified by the genes listed in . One example is a MUC5AC
(mucin 5AC)-processed transcript located within an intron of MUC5B
() that is upregulated in current smokers compared with never smokers and downregulated in smokers with lung cancer compared with smokers without lung cancer. MUC5AC
is a mucin gene expressed in the respiratory tract and is found in patients with asthma, cystic fibrosis, and COPD (33
). We also found that ANG
(angiogenin, ribonuclease, RNase A family, 5), a gene that is not measured by these microarrays but which is involved in lung adenocarcinoma cell proliferation and angiogenesis (34
), was up-regulated in smokers. Another gene, SCGB3A1
(secretoglobin, family 3A, member 1), which is not interrogated by the microarrays used in this study but was found to be upregulated in the normal airway of lung cancer patients using RNA-Seq, has been linked to poor prognosis in non-small-cell lung cancer (35
). also includes non-coding RNAs (lincRNAs, pseudogenes, and processed transcripts) whose biological functions have not been well described but which may have important gene regulatory functions in lung carcinogenesis. Identification of their cancer-associated differential expression by sequencing provides a rationale for studying their expression using targeted assays in larger cohorts of samples.
In addition to examining concordance with microarrays, we also used qRT-PCR to validate the concordance of fold change direction among genes detected only as differentially expressed log2
(1.5) by sequencing. The expression of the genes S100A8
, which are known to be involved in the inflammatory response in the lung (36
), and CYP4F2
, a member of the cytochrome P450 family of enzymes that play a role in xenobiotic pathways (39
), were found to be up-regulated in smokers by both RNA-Seq and qRT-PCR. Similarly, the expression of the genes CCL20, IL8, NFKB1A
, and SCGB3A1
was found to be up-regulated in the normal airway of patients with lung cancer versus those with benign disease using both RNA-Seq and qRT-PCR (). One gene, SCGB1A1
, however, was not concordant between RNA-Seq and microarrays. We also validated that the expression of select non-coding RNAs, which may have an important role in gene regulation (41
), changed in the same directions as measured by either RNA-Seq or qRT-PCR (). The correlation between qRT-PCR fold change and RNA-Seq fold change was significant (r
= 0.888, P<
0.001; data not shown). The concordance between qRT-PCR and RNA-Seq has been confirmed across a small set of genes, suggesting that RNA-Seq is a good method for assaying genes important in epithelial cell response to smoke and lung cancer.
In summary, we have established that transcriptome sequencing has the potential to provide new insights into the biology of the smoking- and cancer-related airway field of injury. While much of the airway transcriptome is captured with RNA-Seq of libraries enriched in polyadenylated transcripts, library preparation protocols that measure the expression of non-polyadenylated RNAs are needed to completely characterize the transcriptome, as are longer read lengths and PE sequencing strategies, both of which yield a higher percentage of mapped reads. Our results suggest that the expression of both protein-coding and non-protein-coding RNAs are impacted by exposure to tobacco smoke and the presence of lung cancer, and that long non-protein-coding RNAs that we have begun to characterize in this study may be important in the response to tobacco smoke in airway epithelial cells. Larger sample sizes are needed to characterize the RNAs uncovered in this study and to confidently assess transcript splicing patterns and the presence of novel transcripts. Novel coding and non-coding transcripts uncovered by RNA-Seq provide amore complete portrait of the smoking- and cancer-related airway field of injury have the potential to help elucidate mechanisms of response to tobacco smoke and to function as additional biomarkers of disease risk or novel targets for chemoprevention.