Differential gene expression analysis
Read counts were obtained for 22,316 genes from 15 lung adenocarcinoma samples with and without KRAS
mutation. The read count data were normalized using mode normalization and log2 transformation as described in the Section “Materials and Methods.”
Genes with a median read count <16 in lung adeoncarcinoma samples with and without KRAS
mutation were eliminated from further analysis. The remaining 15,092 transcripts were analyzed using ANOVA to identify 374 transcripts that were differentially expressed genes in KRAS
mutation samples versus KRAS
-wild-type samples with p
-value <0.05 and twofold-changes in gene counts. The association between the KRAS
genotype and the top six differentially expressed genes is shown in Figure A. Hierarchical clustering of the 374 differentially expressed genes is shown in Figure B, in which total gene counts were standardized by the mean value among the samples.
Figure 4 Results from differential gene expression analysis. (A) Top six significantly differentially expressed genes between KRAS-mutant genotype (GT) and KRAS-wild-type genotype (GG). (B) Hierarchical clustering of 374 genes that are differentially expressed (more ...)
Ingenuity pathway analysis was used to determine biological relationships among the 374 differentially expressed genes. The top five networks, based on Fisher’s exact test, were associated with immunological disease, cell signaling, cell death, nervous system development, and function and cellular development pathways. The key nodes of the five networks are NFκB, ERK1/2, AKT, MAPK, PI3K complex, IL12, and JNK. The top four networks (gene–gene relationships) from IPA analysis are shown in Figures A–D. IPA also determines the number of subgroups of genes that are associated with a known or canonical pathway. The top 3 canonical pathways from the 374 gene list are cell cycle regulation by B-cell translocation gene (BTG) family proteins, glioblastoma multiforme signaling, and Wnt/β-catenin signaling.
Figure 5 Network analysis. (A–D) Top networks identified with IPA software for the 374 genes that are differentially expressed between KRAS-mutant and wild-type lung adenocarcinoma samples. The pink or red color nodes in networks indicate a gene that is (more ...)
Alternative splicing analysis
We used CASPER, an R package, to obtain the expression of isoforms for all the refseq genes or transcripts. The UCSC hg19 assembly consists of 31,599 known transcripts. Genes with only one known isoform were removed from the analysis (13,633 transcripts). CASPER estimates relative expression levels, i.e., proportion of transcripts for a given gene originating from each variant. Transcript expression values from individual samples were organized into a single file and transcripts with no reads were set to zero. Data from CASPER files contained 17,966 transcripts for further analysis. There were 314 transcripts corresponding to 259 genes with a p-value <0.05 and minimum twofold-change in median of multiple ratios between KRAS-mutant and wild-type samples. An example of output from CASPER package is shown in Figure . The estimated abundance of alternatively spliced NM_00268 and NM_053024 transcripts is 66.79 and 33.21 for the profiling two gene PFN2 in KRAS mutated samples (Figure A), whereas the abundance ratios for NM_00268 and NM_053024 transcripts are estimated as 82.28 and 17.72% respectively for KRAS-wild-type samples (Figure C). Figures B,D show the reads for PFN2 transcripts for the same KRAS mutated and wild-type lung adenocarcinoma samples. We also analyzed data using junction data obtained from TopHat to identify isoforms. Our in silico analysis of junction data for PFN2 gene from TopHat is similar to CASPER data and indicates that the short isoform (NM_002628) is abundantly expressed in KRAS-mutant samples compared to KRAS WT group.
Figure 6 Output of CASPER from splicing analysis. (A) CASPER output for alternate splice forms in KRAS-mutant sample. The data indicates that NM_002628 transcript is highly abundant compared to NM_053024 transcript of PFN2 gene. (B) Observed reads for PFN2 gene. (more ...)
For pathway analysis using IPA, we have used the 259 genes corresponding to 314 transcripts. The top five networks from IPA analysis were associated with cellular growth and proliferation, cell-to-cell signaling and interaction, nervous system development and function, molecular transport, and infection mechanisms. Similar to the pathway analysis of differentially expressed genes, the key nodes in the top five networks were NFκB, ERK1/2, SNCA, AKT, PKC, MAPK, and Insulin.
The SNVs that are ±5
kb from a refseq gene were obtained and filtered based on several criteria as described in Section “Materials and Methods.”
From the SNVMix output, genotypes ranging from 60 to 120
million per sample were investigated to call an SNV.
Nucleotide sequence at every genomic coordinate was evaluated when we had sufficient depth of sequence. We evaluated each nucleotide position for read depth, and for reads supporting the reference and alternate alleles for an SNV. An SNV was called when the genotype of one or more samples deviated from the reference genome genotype. In the current analysis, a SNV was not investigated if the genotype consists of multiple alleles, if there was no variation in genotype, or if the variation is present in less than two samples. We also eliminated SNVs from further analysis if the read depth was <3. After the application of the filters described above, a final set of 73,717 unique single nucleotide variants remained for further investigation. Since we have considered the regions that are close to a gene for SNV investigation, the majority of the SNVs are present in exonic (34,411) and 3′ UTR (25,736) regions. Of the 34,411 exonic SNVs, there are 11,949 synonymous SNVs that do not change an amino acid. Hence, we ignored these SNVs and estimated alternate allele frequencies from genotype data obtained from SNVMix software between the lung adenocarcinomas groups in 23,987 non-synonymous, exonic SNVs. For additional investigation of SNVs, we also obtained the 1000 genome and dbSNP132 data for all the 23,987 SNVs which include non-synonymous, stop gain, and stop loss mutations.
Fisher’s exact test, called from PLINK software, was used to calculate differences in alternate allele frequencies. Eighty-four SNVs corresponding to 74 genes had a p-value <0.05 between tumor samples with and without KRAS mutation. Table shows details of 13 SNVs with a p-value <0.01. Table consists of 18 columns with a variety of information for 13 SNVs.
Top 13 SNVs that have different allelic frequencies in KRAS mutation group compared to the KRAS-wild-type samples.
Table consists of details of SNV (chr, chromosome; Position, chromosome location; Gene, gene corresponding to SNV; Ref, reference allele; Alt, alternate allele; KRAS Mut, frequency of alternate allele observed in KRAS mutation group; KRAS WT, frequency of alternate allele observed in group without KRAS mutation; MAF, minor allele frequency; P, Fishers exact p-value; 1KGenome, frequency of alternate allele observed in 1000 genome samples; dbSNP132, dbSNP ID based on chromosome location if it exists; SIFT, SIFT score that predicts amino acid changes that may be affected by protein function; MAF Body Map, frequency of alternate allele in Illumina body map 16 tissues that are sequenced and analyzed similarly for SNVs; Illumina Lung, genotype of Illumina lung sample from body map data and lung epithelial genotypes for NormNonSmoker – normal non-smoker; NormSmoker, normal smoker; SmokerNon-Cancer, smoker non-cancer; and SmokerLungCancer, smoker lung cancer samples respectively). “?” represents data not available in Table .
As shown in Table , the top significant SNV located on chromosome 12 corresponded to the KRAS
gene. Thus, our SNV analysis confirms our in silico
validation of sample classification based on KRAS
mutational status. In addition to KRAS
mutation there are an additional 79 SNVs that are present in the 1000 genome dataset with a minor allele frequency (maf) ranging from 0.95 to 0.02 (median maf
0.35) and four more SNVs that are not present in the 1000 genome or dbSNP132 datasets, but which are differentially observed in the KRAS
-mutant and wild-type samples with Fisher’s exact p
To test if any of the four SNVs, corresponding to WDTC1 (chr1:27630115), GLS (chr2:191819311), SRSF3 (chr6:36564670), and RBM23 (chr14:23370943) are novel, we performed TopHat alignment and individually examined the data in nucleotide sequence pileup files created from TopHat bam files. SNVs corresponding to RBM23 and GLS are located at exonic splicing junction. Hence the alignments for these two variants were also examined carefully in integrative genomic browser (IGV). We found that the reads supporting the alternate allele for these two variants were present at the 5′ ends of the sequence reads, and therefore likely to represent sequence and/or alignment errors. Hence we dropped these two SNVs from our analysis. The SNV corresponding to WDTC1 was not called as a variant during TopHat alignment. There were no reads supporting the alternate allele, hence we discarded that as a novel variant. Finally, we investigated the variant corresponding to SRSF3 from TopHat alignment and pileup data. This variant had an average read depth of 184; however, we found there is a strand bias of reads for the alternate allele. Thus, reads supporting the alternate allele C in the forward strand were 21, compared to 565 reads from the negative strand, such that this SNV was also eliminated on this basis.
We then filtered the remaining differentially detected SNVs to eliminate those that are at a splice junction, have alternate alleles that are exclusively found at the 5′ end of the reads, or have strand bias for alternate reads. These rules plus a limit for maximum read depth to call an SNV and genotype confidently using SNVMix were applied to remove false positives. This led us to 72 SNVs corresponding to 65 genes that have different allelic frequencies between the tumors with and without KRAS mutation.
An IPA analysis was carried out on the 65 genes corresponding to SNVs that have a preferential alternate allele present in samples with or without KRAS mutation. The top five networks were associated with tumor morphology, cellular growth and proliferation, cell death, cellular function and maintenance of cancer, neurological disease, cellular development, cell cycle, and cell morphology. The most significant disease associated with these SNVs was cancer and the key nodes in the top five networks were NFκB, ERK1/2, AKT, TNF, PI3K, ESR1, beta estradiol, and TGFB1.
Genes from differential expression (374 genes), alternate splicing (259 genes), and SNV (65 genes) analyses were used for integration analyses. Genes existing in multiple features were removed such that the final gene set for integration analysis consisted of 659 unique genes. Chromosomal mapping of these 659 genes identified several lung adenocarcinoma-specific clusters (Figure ). Individual evaluation of chromosomes suggest that there may be clusters of genes on chromosomes 1, 3, 6, and 11 that are associated with KRAS lung adenocarcinomas (Figure ).
Figure 7 Genomic view diagram. Chromosomal view of all the genes obtained from multi-feature analysis of lung adenocarcinomas with and without KRAS mutation. Chromosomes 1, 3, 6, and 11 consists of abundant gene clusters associated with KRAS mutation. Arrow in (more ...)
NFκB, ERK1/2, and AKT canonical pathways were observed in all three independent feature analyses and also in the integrated analysis of 659 genes. The NFκB pathway has previously been shown to be essential for the development of tumors with KRAS
mutation in a mouse model of lung adenocarcinoma (Meylan et al., 2009
). Previous studies have also shown the involvement of AKT (Okudela et al., 2004
) and ERK1/2 canonical pathways (Blasco et al., 2011
) in mutant KRAS
lung adenocarcinomas. In independent feature analyses, the MAPK pathway is targeted by both differential gene expression and alternate splicing but not with SNVs. Similarly, PI3K has been shown to target through differential gene expression and SNV features but not by alternative splicing. These data indicate pathways involved in mutant KRAS
tumors are targeted by multiple genetic mechanisms in these tumors.
RT-PCR gene validation
We randomly selected 6/15 samples for qPCR validation of four differentially expressed genes identified in our RNA-Seq analysis. Significant correlations were observed ranging from 0.69 to 0.84, when qRT-PCR results were compared with RNA-seq expression data (data not shown). We also randomly selected two small genes with multiple splice variants for functional validation.
Two genes, one with two splice variants and another with more than two splice variants were used for functional validation of our splicing analysis. As shown in Figure , qPCR data of samples agree with CASPER analysis for PFN2. KRAS-mutant samples preferentially express the NM_002628 variant of PFN2 when compared to the NM_053024 variants (Figure ). The two most significant SNVs (KRAS and GSTZ1) showing differences in frequency in KRAS-mutant versus wild-type tumors were validated using Sanger sequencing of genomic DNA. The presence of the reference and alternate allele (A) is shown in IGV Browser (Figure B). Sanger sequencing validation of the SNV is also shown in Figure C.
Figure 8 The qPCR validation and SNV validation results. (A) Validation of CASPER estimations of relative abundances for PFN2 gene in two mutant and wild-type samples using qPCR (B) Visualization of KRAS-mutant SNV using IGV Browser. Coverage file and bam files (more ...)
A 659 gene set obtained from our integration analysis was used to build lung adenocarcinoma networks. Reactome and protein–protein interactome databases were used to build networks using Cytoscape. Linker genes (genes not present in our list but known to interact with genes in our list based on the published literature or protein-protein interaction databases) were also included to construct networks using Cytoscape. Thus, our lung adenocarcinoma network consists of linker genes and the set of genes obtained from different genomic features of RNA sequencing analyses. The functionality and directionality of connections between the nodes are also indicated in Figure . Topological parameters of the directed network were obtained using network analyzer. Table consists of the top 50 genes from this analysis along with edge count, degree, and neighborhood connectivity parameters. As shown in the Table , MAPK14 is a linker gene consisting of 63 edge counts, 12 outgoing edges, and 51 in degree edges, with a high neighborhood connectivity score of 23.71. Most of top genes from network analysis are linker genes, due to the fact that most of these genes are well established regulators of our candidate genes derived from different features of tumors. In summary, the most significant connected genes (hubs) from our analyses are MAPK14 (linker gene), and CCND1 from differential gene expression, LAMA4 from SNV, and RPS27A from alternate splicing analysis.
Figure 9 Lung adenocarcinoma interaction network. In the diagram white squares are linker genes. Dark pink are differentially expressed genes, light blue are splicing variants, purple are genes that are differentially expressed and alternately spliced. Green are (more ...)
List of top 50 nodes from lung adeonocarcinoma network.
To understand the connections between these genes in terms of known druggable targets and pathways we used IPA software. We obtained 387 genes that had at least two or more edges from the lung adenocarcinoma network developed previously (Figure ) and submitted these data to IPA analysis. The 387 gene set consisted of 171 linker genes and 216 genes that were obtained from the multi-feature analysis of lung adenocarcinomas with and without KRAS mutation. In this way, we identified canonical pathways that are prevalent in the 387 dataset. Figure represents the top canonical pathways identified in our IPA analysis. The most significant canonical pathway consisted of 52/377 genes associated with molecular mechanisms of cancer (Figure ).
Figure 10 Most significant canonical pathway from this study. Molecular mechanisms of cancer is the most significant known pathway obtained from multi-feature analysis of lung adenocarcinoma network. In the diagram green symbols represent linker genes and red or (more ...)
Identification of key druggable nodes from KRAS lung adenocarcinoma network
To specifically explore the connections of the lung adenocarcinoma network with the KRAS gene, we obtained the upstream and downstream connections that are specific to KRAS gene from our 659 integrated multi-featured gene list. Figure shows the direct and indirect connections to KRAS. As shown in Figure , there are 14 genes (LOC100506736, FBLN2, CCND1, AGRN, FBN1, MYCN, NQO1, SCNN1, ST5, TNFRSF10B, PPARG, GAS1, PLCB3, and IGF1) that are indirectly connected with KRAS from our 659 unique gene list obtained from three different genomic features. To build and expand the KRAS sub-specific network, we also used the grow feature in IPA to obtain direct gene-gene interactions from the Ingenuity Knowledge Base. This analysis gave us an additional 12 genes (RAF1, RALGDS, RASSF1, ICMT, PTBP1, ELAVL1, WT1, ESX1, Ras, FEZF, RASGRF2, and IGF2BP1) with direct connections to KRAS. The IPA overlay feature for known drugs was also used to determine if there are known FDA-approved drugs or clinical drug candidates within the 27 gene KRAS sub network. Three of the 27 genes are the target of known FDA-approved drugs (RAF1, PPARG, and TNFRSF10B).
Figure 11 KRAS sub network. The network connections are obtained from our multi-feature gene list and Ingenuity knowledge base (IKB). Direct and indirect connections to KRAS gene are obtained from multi-feature list and only direct connections are obtained from (more ...)
In the 27 gene KRAS
sub network we identified four known biomarkers for NSCLC (CCND1, KRAS, PPARG
, and Ras
). From the Ingenuity knowledge base, it is known that human CCND1 protein and PPARG
are useful biomarkers for measuring the efficacy of the PPARγ agonist pioglitazone hydrochloride in the treatment of NSCLC. Similarly KRAS
has been used as a biomarker for cetuximab, pazopanib, carboplatin, and erlotinib treatment of NSCLC. Interestingly, three therapeutic trials using PPARG agonists in NSCLC are currently being conducted9
. Given our current analysis, it will be of interest to determine whether a correlation between clinical benefit and KRAS
mutational status emerges from these trials.