|Home | About | Journals | Submit | Contact Us | Français|
Whole-genome resequencing is still a costly method to detect genetic mutations that lead to altered forms of proteins and may be associated with disease development. Since the majority of disease-related single nucleotide variations (SNVs) are found in protein-coding regions, we propose to identify SNVs in expressed exons of the human genome using the recently developed RNA-Seq technique. We identify 12 176 and 10 621 SNVs, respectively, in Jurkat T cells and CD4+ T cells from a healthy donor. Interestingly, our data show that one copy of the TAL-1 proto-oncogene has a point mutation in 3′ UTR and only the mutant allele is expressed in Jurkat cells. We provide a comprehensive dataset for further understanding the cancer biology of Jurkat cells. Our results indicate that this is a cost-effective and efficient strategy to systematically identify SNVs in the expressed regions of the human genome.
Mutations in DNA sequence can alter or disrupt cellular function and lead to disease development through various pathways depending on their genomic location. Although mutations in noncoding regions may disrupt functional cis-regulatory elements that control transcription and lead to altered transcript levels, the majority of genetic diseases identified so far are linked to mutations in protein coding regions of the genome that lead to altered forms of proteins. Whole-genome resequencing of individual human genome at a high-enough coverage should be able to determine most of single nucleotide polymorphisms (SNPs) as well as somatic point mutations in coding and noncoding regions (1). However, it is very costly to do so at present. Since the majority of human genome mutations associated with diseases are either missense/nonsense or affecting splicing in only about 2% of the human genome (2), it would significantly reduce cost and increase efficiency by resequencing the exonic regions. We propose to identify point mutations in the expressed coding regions of the human genome by sequencing cDNAs using RNA-seq. This strategy can provide both the gene expression and single nucleotide variation (SNV) information at the same time. Since expressed regions constitute only a minor fraction of the human genome, sufficient coverage of these regions can be achieved at low sequencing cost using the next-generation sequencing technologies. Our data demonstrate that this approach efficiently identifies SNVs that alter protein sequence.
mRNAs isolated from Jurkat T cells, derived from acute lymphoblastic leukemia (ALL), and CD4+ T cells from a healthy donor, were converted to cDNA using standard protocols. The cDNAs were fragmented to 100–200 bp using sonication, followed by end repair and Solexa adaptor ligation. The products were sequenced on an Illumina GAII system according to established procedures (3–6).
The short reads were analyzed as outlined in Figure 1. The quality-filtered 30-bp short sequence reads were aligned to the reference sequence consisting of hg18 (NCBI Build 36) human genome plus a library of synthetic exon junction sequences using ELAND (Efficient Local Alignment of Nucleotide Data) software, allowing up to two mismatches with the reference sequence. There are three types of uniquely aligned reads: U0 reads that align perfectly to the reference sequence, U1 reads that have one mismatch with the reference sequence and U2 reads that have two mismatches with the reference sequence. About 25% of reads in our samples are U1 reads and 13% are U2 reads. The library of exon junction sequences was created as follows. Human exon sequences were retrieved from Ensembl database (release 50). We joined all possible pairs of exons that belong to the same transcript such that the genomic order of exons is respected. A junction sequence consists of last 26 bp of 5′ exon and first 26 bp of 3′ exon. Redundant sequences were removed from the resulting set of 52-bp exon junction sequences.
In order to remove possible PCR amplification artifacts and to reduce confounding effects of systematically bad sequencing cycles in short sequence reads, the following two consecutive filters were applied to the set of uniquely aligned reads (Figure 2A).
As a result of application of filters 1 and 2, at most three reads are retained at any given genomic location (Figure 3A). However, the stringent procedure of randomly selecting at most one read from U1 and U2 reads that map to the same location should not significantly reduce the statistical power to detect SNVs. As can be seen from the example in Figure 3A, there can be plenty of overlapping but noncoincident reads that cover an SNV. In fact, for reads of length 30 bp there can be as many as 3 × 30 = 90 reads that cover an SNV. Since 90-fold coverage is the upper-bound for coverage possible by filtered reads, the number of reads in very highly expressed exons will not correspond to actual expression levels. However, it should not be a concern because the purpose of the filtering procedure is to reduce false positive rate of SNV detection and 90-fold coverage is a very significant coverage. By restricting the number of reads that can map to the same genomic location, we allow the evidence for presence of SNV come mainly from overlapping but noncoincident reads.
In order to further justify our use of Filter 1 + Filter2, we compared the numbers of known and unknown SNVs detected using the following four different filtering procedures:
We applied these four filters to unique reads from CD4+ sample. The results shown in Figure 3B suggest that Filters B–D lead to significant increase in the number of false positives while only marginally increasing number of true positives. Let us argue that the last statement is indeed true. Exons expressed in CD4+ cells contain SNVs. Some of these SNVs are known SNVs, i.e. previously annotated in dbSNP database, while some SNVs are novel, i.e. previously unknown SNVs. Let the ratio of number of unknown to known SNVs in expressed exons be U:K. An SNV-detection algorithm blind to the distinction between known and unknown SNVs will detect a certain number of SNVs. Let Kd and Ud be numbers of detected known and unknown SNVs, respectively. We expect that the ratio Ud:Kd is comparable to U:K. The fact that Ud:Kd ratio for SNVs detected following the filters B, C or D is significantly higher than the corresponding ratio for filter A implies that a large fraction of novel SNVs detected following the filters B–D are actually false positives. Note that false-positive rate among detected known SNVs is low since the probability of sequencing errors occurring at precisely dbSNP SNP locations with the right alternative allele is small. This is supported by the fact that the number of detected known SNVs is not very sensitive to the filtering procedure used (see Figure 3B).
The overall error rate of sequencing for each sample is estimated as the frequency of nucleotide mismatches. Using numbers of matches and mismatches from Table 1, the sequencing error rates for Jurkat and CD4+ samples are estimated to be 0.017 and 0.019, respectively. We use a simple binomial distribution to model null distribution of number of mismatches at a genomic locus with N uniquely aligned reads (Figure 2B). This model assumes that error rate is independent of the position in the read. It should be straightforward to extend our model to a model with position-dependent error rate.
Since there are a large number of candidate variant sites, nominal P-values calculated using the binomial null distribution need to be adjusted for multiple statistical testing. We used a very conservative multiple-testing procedure—Bonferroni method–to adjust nominal P-values. With around 6 million candidate variant sites, Bonferroni-adjusted P-value 0.01 corresponds to the nominal P-value 0.01/6 000 000 = 1.7 × 10–9. In this paper, we called a variant site significant if the nominal P-value is <1.0 × 10–9. There are less conservative multiple-testing procedures available such as Benjamini–Hochberg method. These less conservative methods should be used to test statistical significance of candidate SNV sites with low reads coverage. In addition, the position dependence of error rates in short reads cannot be ignored for SNV sites with low reads coverage.
The theoretical upper bound for the number of SNVs detectable using cDNA of expressed exons in a tissue is given by the number M of all SNVs contained in the expressed exons of the tissue. At a given sequencing depth D, a fraction F of these M SNVs will be detected. We estimate the curve F(D) as follows. Let N be the total number of nonredundant unique 30-bp sequenced reads.
In order to avoid the problem of double counting in regions where exons overlap, we merge all Ensembl exons and consider merged exonic regions. For each merged exonic region Ek, we compute reads per kilobase of exon model per million mapped reads (3) RPKM value Rk of its expression as Rk = (109 × Nk)/(N × Lk), where Nk is the number of nonredundant unique reads in Ek and Lk is the length of Ek. At the sequencing depth we achieved (400 Mbp ~13 million nonredundant unique 30-bp genomic reads in CD4+ sample), we compute Rk for all merged exonic regions. We then assume that Rk value is independent of sequencing depth. We validated the latter assumption by comparison of Rk values computed using full CD4+ data set and 50% of the dataset (= four lanes of Solexa data). The Pearson correlation coefficient for this comparison is found to be >0.99.
Given the sequencing depth D, exon expression level Rk can be used to estimate coverage Ck of the exon by sequence reads. At the sequencing depth D, the expected number of 30-bp nonredundant unique reads in the exon is Nk = (Rk × D × Lk)/(30 × 109). Thus, the expected coverage at sequencing depth D is given by Ck = (30 × Nk)/Lk = (Rk × D)/109.
The stringent P-value cutoff 10–9 used in this paper is related to fold coverage C as follows. The P-value for a heterozygous SNV with wt to mutant reads ratio 7:7 is ~10–9 when sequencing error rate is q = 0.019 (see example in Figure 2B). Thus, our P-value cutoff 10–9 corresponds to the fold coverage threshold C = 14. For sequencing depth that we achieved, D = 400 Mbp, coverage fold C = 14 corresponds to RPKM value ~35. Thus, heterozygous SNVs in exonic regions with RPKM <35 are undetectable at the sequencing depth D = 400 Mbp and P-value cutoff 10–9. It is easier to detect homozygous SNVs. At the sequencing error rate q = 0.019, homozygous SNV with wt to mutant reads ratio 0:5 has P-value ~10–9. The coverage C = 5 corresponds to RPKM value ~13.
With a large enough sequencing depth, most SNVs in expressed exons will eventually be detected. It has been shown in (3) that RPKM value 1 corresponds approximately to one transcript per cell. We thus assume that an exonic region Ek is expressed if its RPKM value Rk ≥ 1. The total size of merged exonic regions with RPKM ≥ 1 is ~35 Mbp which is about 50% of all merged exonic regions (~69 Mbp) in the human genome. Since SNVs should be more or less uniformly distributed in exonic regions, we introduce the density ρ of SNVs per base pair of exonic region. The expected number of SNVs in the exonic region Rk is thus equal to ρ × Lk. Given a SNV detection fold coverage threshold C, we assume that all SNVs in an exonic region Rk are detected once the fold coverage Ck of the region exceeds the threshold C. As the sequencing depth increases, more and more exonic regions pass coverage threshold. The expected fraction F(D) of expressed SNVs detected at the sequencing depth D is given by the equation:
where Ck = (Rk × D)/109, Z = Σk Lk × Θ(Rk ≥ 1) is the total size (in bp) of expressed merged exonic regions and the function Θ of the Boolean variable is given by Θ(true) = 1, Θ(false) = 0. In this manner, we derived the SNV detection cost curve 100 × F(D) shown in Figure 4B.
The coverage analysis results shown in Figure 4A were obtained as follows. In order to avoid double-counting sequencing tags in the regions where exons overlap, we merged all exons and considered merged exonic regions. For each contiguous exonic region Ek we compute sequencing coverage as Ck = (30 × Nk)/Lk, where Nk is the number of unique and non-redundant 30-bp sequencing reads detected in the region and Lk is the length of the region. The fraction P of exonic regions passing sequencing coverage threshold C is defined as the ratio of the total size of exonic regions covered at least C-fold to the total size of all exonic regions: P(C) = (1/L)Σk Θ(Ck ≥ C) × Lk, where L =Σ k Lk.
Genome sequence files, and Ensembl and RefSeq gene annotations tables for hg18 human genome were downloaded from UCSC Genome Bioinformatics website: http://genome.ucsc.edu. We determined synonymous/nonsynonymous status of SNVs based on the codon changes resulting from nucleotide substitutions. An SNV was annotated to be a splice-site SNV if it lies in the first or last 2 bp of an intron.
We implemented Redundant Reads Filter and Point Mutation Analyzer in UNIX shell, Perl and C++. The entire suite of software is available at: http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/pma. On the Dell Precision T7400 Linux workstation, the analysis of 27 million unique reads from CD4+ sample took only 12 min: 8 min at filtering step using Redundant Reads Filter and 4 min for assembly of reads, SNV detection and assigning significance P-values to SNVs using Point Mutation Analyzer. As an input, the algorithm uses ELAND results files from Illumina Genome Analyzer. We deposited raw and processed data set for Jurkat and CD4+ T cells to Gene Expression Omnibus (GEO) database under accession number GSE16190.
In order to detect SNVs in the protein-coding and untranslated regions of the human genome using the next generation sequencing techniques, we designed a strategy as outlined in Figure 1. cDNAs synthesized from mRNAs were fragmented to 100–200 bp by sonication and sequenced using Illumina Genome Analyzer II. The short reads of 30 bp were mapped to the reference consisting of hg18 human genome plus a collection of synthetic exon junctions using ELAND software, allowing up to two mismatches with the reference (see ‘Materials and methods’ section).
The mismatches with the reference sequence can occur due to sequencing errors or point mutations present in the sample. In order to distinguish between these two possibilities and hence filter noise from signal, we applied the following two-step procedure to the set of uniquely mapped reads (see ‘Materials and methods’ section). Multiple identical copies of a read can be present as an artifact of PCR amplification procedure and this can provide false evidence for variant site discovery. Therefore, in the first step, we retained only a single copy of each read (Figure 2A). This filter can also reduce confounding effects of systematically bad sequencing cycles within a read. In the second step, if multiple reads map to the same genomic position, we randomly selected only one read from each of the categories U0, U1 and U2 (Figure 2A). Thus, there can be at most three reads that map to the same genomic position (Figure 3A). The application of the above two filters (named together as ‘Redundant Reads Filter’ in Figure 1) should reduce false-positive rate of SNV discovery. Since there can be only a small number of unique and nonredundant genomic reads at the exon edges, we generated a library containing exon junctions to detect potential SNVs in these genomic regions, which increased the power of SNV detection at the exon edges. We found that about 6% of all significant SNVs are detected due to exon-junction reads. The nonredundant reads were analyzed by our point mutation analyzer. A very small probability of observing multiple overlapping but noncoincident short sequence reads agreeing at a given mismatched genomic location by random chance is taken as the evidence in favor of the presence of a genuine SNV at that location (Figure 2B and ‘Materials and methods'section).
The number of reads that align uniquely to the genome and exon junctions is shown in Table 1. We obtained about 27 million uniquely mapped 30-bp sequence reads for each sample. The resulting mean coverage of exonic regions is ~11×. Since gene expression varies dramatically, we examined the distribution of coverage for all exonic sequences (Figure 4A). Our data indicate that with 26 million uniquely mapped non-redundant short sequence reads, about 40% of exonic regions were covered ≥5 times.
We performed sequencing cost analysis for SNV detection (see ‘Materials and methods’ section). We show that at the stringency we use to call SNV (P-value = 10–9), fold coverage of C = 5 and C = 14 are needed to detect homozygous and heterozygous SNVs, respectively. At the sequencing depth we achieved (around 13 million 30-bp unique nonredundant reads), these fold coverages correspond to RPKM values 13 and 35, respectively. Thus, we estimate that about 40% of homozygous and 14% of heterozygous expressed SNVs were detected in this work. Our analysis demonstrates that about 80% of homozygous and 55% of heterozygous SNVs in expressed exons can be detected using 67 million 30-bp nonredundant unique reads (Figure 4B). However, our hypothesis is that mutation of a highly expressed gene may have more functional consequence than a gene expressed at low level or not expressed; therefore, it may not be necessary to do much deeper sequencing than what we have achieved in this study.
At a very stringent significance threshold (P-value < 1.0 × 10–9), we detected 12176 and 10621 SNV in Jurkat and CD4+ T cells, respectively. Many of detected sites overlap with known single nucleotide polymorphism sites (dbSNP build 126): 7473 for Jurkat and 7669 for CD4+ T cells (Figure 5A). Interestingly, more nonsynonymous SNVs in Jurkat cells as compared to CD4+ T cells (Figure 5B and Tables 1, S1 and S2 for further details), which could be related with the disease or generated during in vitro culture.
To validate the genetic mutations detected using RNA-Seq, we randomly selected five nonsynonymous SNVs that are also present in dbSNP and four SNVs that are novel in Jurkat cells (Table 2). The genomic regions containing these SNVs were amplified using PCR and sequenced using Sanger sequencing method. Our results indicate that all the nine SNVs were confirmed (Figure S1). Interestingly, the SNV identification indicated existence of only the mutated allele in the TAL1 gene that is implicated in T-cell acute leukaemia (7). However, the Sanger sequencing revealed that both the wild-type and mutated alleles were present, suggesting that only one parental copy is mutated and it is the mutated allele but not the wild-type allele that is expressed in Jurkat cells.
Among all the 12 176 SNVs identified in Jurkat cells, 4703 are novel and 7473 are known (Figure 5B). Among these, we detected 3206 nonsynonymous and 47 nonsense mutations. Further analysis of the 47 nonsense SNVs indicates that 41 are novel. Interestingly, all the 20 Jurkat-specific nonsense SNVs are single-allele changes (Table 3). We were able to PCR amplify genomic regions containing 18 of these 20 SNVs and obtained their sequences using Sanger sequencing method. Our results indicate that 16 SNVs were confirmed (Figure S1). Interestingly, we found that one of the two SNVs not confirmed by sequencing of genomic DNA was in fact present in mRNA as revealed by Sanger sequencing of cDNA (Figure S2). The SNV is located in the last exon of TAF6 gene. These results suggest that the SNV may be introduced by RNA-editing.
Mutations affecting splicing events are linked to a large number of genetic diseases (2). RNA-seq can be used to detect SNVs overlapping with potential splicing sites if the corresponding intron sequence is present at a sufficiently high level in cDNA. This can happen if (1) the gene is highly expressed so that there is a sufficient amount of pre-spliced RNA present, and/or (2) as a consequence of the splice-site mutation itself the intron is retained due to aberrant splicing. We identified 66 and 31 SNVs in Jurkat and CD4 cells, respectively, which overlapped with splice sites and may lead to aberrant splicing process (Figure 5B). However, using exon body and junction reads, we did not observe any significant Jurkat-specific aberrant splicing events such as exon-skipping and intron retention at the sites of splicing variants (data not shown).
Even though whole-genome resequencing can provide unbiased information on SNVs in the human genome, it is still very expensive and time consuming. Additionally, the vast majority of genetic mutations linked to human diseases have been found in protein-coding regions, which make less than 2% of the human genome. Therefore, sequencing methods that are selective for protein-coding regions may significantly increase the efficiency and reduce the cost of identifying disease-related mutations. Since a genetic disease may be related with increased or decreased expression of a protein or a change in protein structure, sequencing the mRNAs may be a satisfying strategy because it can reveal the information of both gene expression and DNA sequence changes in the protein-coding regions of the human genome. In this report, we have demonstrated that analyzing mRNAs using the next-generation sequencing techniques is an efficient and cost-effective method to identify SNVs in the protein-coding regions of the human genome, in addition to providing the expression information of the genome. Hence, it can be applied to identifying DNA changes in genetic diseases associated with expressed aberrant protein products resulting from inherited or somatic point mutations.
Supplementary Data are available at NAR Online.
The Intramural Research Program of the NIH, National Heart, Lung and Blood Institute. Funding for open access charge: The Intramural Research Program of the National Institutes of Health.
Conflict of interest statement. None declared.