Genomic sequence comparisons between individuals are usually restricted to the analysis of single nucleotide polymorphisms (SNPs). While the interrogation of SNPs is efficient, they are not the only form of divergence between genomes. In this report, we expand the scope of polymorphism detection by investigating the occurrence of double nucleotide polymorphisms (DNPs) and triple nucleotide polymorphisms (TNPs), in which two or three consecutive nucleotides are altered compared to the reference sequence. We have found such DNPs and TNPs throughout two complete genomes and eight exomes. Within exons, these novel polymorphisms are over-represented amongst protein-altering variants; nearly all DNPs and TNPs result in a change in amino acid sequence and, in some cases, two adjacent amino acids are changed. DNPs and TNPs represent a potentially important new source of genetic variation which may underlie human disease and they should be included in future medical genetics studies. As a confirmation of the damaging nature of xNPs, we have identified changes in the exome of a glioblastoma cell line that are important in glioblastoma pathogenesis. We have found a TNP causing a single amino acid change in LAMC2 and a TNP causing a truncation of HUWE1.
While all human genomes are extremely similar to one another, there is variability that allows for the uniqueness of each individual. This variability can take the form of copy number variation, chromosomal rearrangements, or nucleotide polymorphisms. The overwhelming majority of recent studies of human variability have utilized microarray technology because of its relatively low cost and ready availability. For example, Genome Wide Association Studies (GWAS) have been performed for numerous diseases (1), with varying levels of success. In a GWAS, numerous individuals with a specific disease or trait and ethnically matched controls are profiled using a microarray for single nucleotide polymorphisms (SNPs). SNP alleles that are more prevalent in the affected individuals relative to the controls are considered to be associated with illness. For a few diseases, such as age related macular degeneration (2), common SNPs with large effects on risk have been identified. In other cases, even when large sample sizes were utilized, the risk alleles identified by GWAS were only able to explain a small percentage of disease heritability (3).
One potential reason for the disappointing results of GWAS is their limitation to SNPs. The microarray platforms are designed to robustly identify single nucleotide variations (4), but they are not effective in detecting variations involving more than one consecutive nucleotide. If two sequences are identical except for two adjacent nucleotides being altered (e.g. one sequence has AC and the other sequence has GT), this cannot be effectively measured using a microarray. Additionally, considerations regarding DNA melting temperature and the exclusion of repetitive sequences restrict the probes that can be used on a microarray (5).
Recently, high-throughput DNA sequencing techniques (6,7) have been developed and have begun to replace microarrays for genome analysis studies (8). These sequencing techniques are free of the single nucleotide mismatch and melting temperature restrictions of microarrays. In addition, sequencing can produce a more comprehensive picture of a genome than the particular features included on a given microarray. By analyzing raw sequencing reads, multiple nucleotide polymorphisms can be studied just as easily as SNPs.
We have used the raw sequencing data from two complete genomes, the Venter/HuRef Genome (9) and the Chinese Genome (10), and eight complete exomes (11), to analyze nucleotide polymorphism beyond the single nucleotide level. We have aligned the sequencing reads to the human reference genome and identified thousands of loci with polymorphisms of 2 or 3 nt. These polymorphisms are denoted as double nucleotide polymorphisms (DNPs) and triple nucleotide polymorphisms (TNPs) (see Figure 1 for examples). For simplicity, as a group, SNPs, DNPs and TNPs are identified as xNPs. These xNPs do not include indels where nucleotides are found to be inserted or deleted in one sequence relative to another sequence. We focus on xNPs where the sequence length remains the same, but one, two or three nucleotides are changed. Indels in human genomes and exomes have been extensively characterized in (12) and (11).
While SNPs are certainly an important source of variation between human genomes, there are a few reasons why DNPs and TNPs have a greater propensity to be involved in disease causing mutations. First, SNPs have a strong propensity to be synonymous (13), whereby they change the nucleotide sequence but do not alter the amino acid sequence due to the wobble allowed by the genetic code. These synonymous changes are usually silent and do not affect the phenotype, but there are notable exceptions (14,15). In contrast, a DNP or a TNP would affect multiple positions in a codon. Secondly, a SNP can at most result in the change of one amino acid, whereas a DNP or a TNP can change the residue at two adjacent positions and cause a more dramatic change.
Before looking for the xNPs in genomic sequence, we first computationally determined their predicted effects on amino acid sequence under an assumption of randomness (Table 1). For example, given all possible permutations of nucleotides in a codon, 24% of SNPs would be expected to result in a synonymous mutation due to nonspecificity in the genetic code. On the other hand, a DNP or a TNP would randomly produce a synonymous mutation only 9 or 0.4% of the time, respectively. The rare possibility of a synonymous TNP can only occur when it overlaps two codons and changes both of them in a way that they still code for the same amino acid. As also displayed in Table 1, when a DNP causes an amino acid change, it is much more likely to be a single change rather than a double. Similarly, a TNP has a greater chance of causing one amino acid change than two changes, but the ratio is smaller. These theoretical results support the premise that DNPs and TNPs can be important sources of genomic variation, and our analysis of real data will be compared against these predicted results.
We have found that in the human genome there is a considerable amount of variation with regard to DNPs and TNPs. For the two complete genomes that we analyzed, we found tens of thousands of DNPs and thousands of TNPs throughout the genome. As with all genomic variation, the majority of this variation was found outside of coding regions. Even so, a substantial number of xNPs are found within coding exons and they have a strong potential to be involved in disease pathology. In order to test this hypothesis, we have applied our technique to the analysis of an exome from a glioblastoma cell line. In this exome, we have found xNPs causing amino acid changes and a truncated protein in genes whose mis-expression has been previously found in glioblastoma.
For SNPs, each codon was iterated through, and each position in the codon was changed to each of the three possible different nucleotides. The percentages of changes that caused an amino acid change or no change were tallied. For DNPs and TNPs, two adjacent codons were used and every possible set of two (for DNPs) or three (for TNPs) adjacent changes was performed. In order to allow for the querying of each position in each codon, the last positions of the second codon were wrapped onto the first codon. To illustrate, if the six positions in the two codons from 5′ to 3′ are listed as integers from 1 to 6, the list of TNPs is: 123, 234, 345, 456, 561, 612.
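The SNP part of this enumeration can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `tally_snp_effects`, the category labels, the use of the standard genetic code, and the choice to count stop-to-stop changes as synonymous are all assumptions made here.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def tally_snp_effects():
    """Apply every single-nucleotide change to every codon and tally the effect."""
    counts = Counter()
    for codon, ref_aa in CODON_TABLE.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a change
                alt_codon = codon[:pos] + base + codon[pos + 1:]
                alt_aa = CODON_TABLE[alt_codon]
                if alt_aa == ref_aa:
                    # Assumption: stop-to-stop changes are counted here as well.
                    counts["synonymous"] += 1
                elif ref_aa == "*":
                    counts["read_through"] += 1
                elif alt_aa == "*":
                    counts["premature_stop"] += 1
                else:
                    counts["one_aa_change"] += 1
    return counts


if __name__ == "__main__":
    counts = tally_snp_effects()
    total = sum(counts.values())
    for label, n in sorted(counts.items()):
        print(f"{label}: {n} ({100 * n / total:.0f}%)")
```

Under these assumptions the tally covers 576 substitutions (64 codons, 3 positions, 3 alternative bases), of which 138 are synonymous and 392 change a single amino acid, i.e. about 24% and 68%. The DNP and TNP cases would extend the same loop to two adjacent codons with the wrap-around described above.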
The Chinese genome data were obtained from (10) as raw FASTQ sequencing reads, and only those paired-end reads that were 36 bp in length were used in the analysis. The Venter/HuRef raw sequencing reads were obtained from (9). Since these sequencing reads were from an Applied Biosystems 3730xl, they were much longer than 36-bp Illumina reads. To allow for comparison, the long reads were cut into non-overlapping 36-bp reads. The eight exome sequences were from (11). The glioblastoma exome sequence was from (16).
The sequences were aligned using the Bowtie (17) alignment program and three mismatches were allowed. The reference genome used was hg18. The consensus repeat elements were taken from the UCSC genome browser annotations (18) and the genes used were the CCDS gene set (19).
After the reads were aligned to the genome, any single base mismatch was counted as a putative SNP and two or three consecutive mismatches within reads were marked as putative DNPs or TNPs respectively. For each putative xNP, the number of sequencing reads coding for the xNP or the reference sequence were tallied. The following criteria were used for calling an xNP: if there were no reads matching the reference at that position, there needed to be a minimum of three reads supporting the xNP, and the xNP would be called as homozygous. If there were reads matching the reference, two requirements needed to be met to call a heterozygous xNP: First, there needed to be at least three xNP supporting reads at that location. Second, a binomial distribution was computed at each genomic position, based on the total number of reads at that location and a 50% allele probability. A heterozygous xNP was called if the number of xNP-reads was at least half of the total number of reads at that location minus twice the standard deviation of the binomial distribution. This threshold allowed for <5% false negative rate of calling heterozygotes.
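The calling rule described above can be written out as a short function. This is a sketch under stated assumptions rather than the pipeline used in the study: the name `call_xnp` is hypothetical, and the total read count at a position is assumed to be the sum of xNP-supporting and reference-matching reads.

```python
import math

MIN_SUPPORTING_READS = 3  # minimum number of reads carrying the xNP
ALLELE_PROBABILITY = 0.5  # expected fraction of xNP reads at a heterozygous site


def call_xnp(xnp_reads, ref_reads):
    """Return 'homozygous', 'heterozygous', or None if no xNP is called."""
    if xnp_reads < MIN_SUPPORTING_READS:
        return None
    if ref_reads == 0:
        return "homozygous"
    # Assumption: only xNP-supporting and reference-matching reads are counted.
    total = xnp_reads + ref_reads
    # Standard deviation of a binomial(total, 0.5) distribution.
    sd = math.sqrt(total * ALLELE_PROBABILITY * (1 - ALLELE_PROBABILITY))
    threshold = total / 2 - 2 * sd
    return "heterozygous" if xnp_reads >= threshold else None
```

For example, with 10 xNP reads and 10 reference reads the threshold is about 5.5 reads, so the site is called heterozygous, whereas 3 xNP reads against 27 reference reads fall below the threshold of about 9.5 and no call is made.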
The functional categorization of genes with xNPs was performed using DAVID (20). The lethality analysis of the xNPs was performed using Polyphen version 1.1.7 (21) and SIFT version 4.0.3 (22). Additionally, we analyzed the xNPs using PANTHER version 6.1 (23). For a substantial number of the polymorphisms, PANTHER was not able to give a prediction of the probability of it being deleterious. This is because the amino acid substitution occurred in a part of the protein that was not covered by the multi-sequence alignments underlying the predictions. This is a known shortcoming of PANTHER (23). Overall, the percentages of polymorphisms predicted to be deleterious by PANTHER were much lower than the percentages from both Polyphen and SIFT. A strong cause of this was polymorphisms not being scored and therefore not having a possibility of being predicted to be damaging. We therefore have not reported the PANTHER predictions.
In order to analyze xNPs in complete human genomes, we selected the Venter (9) and the Chinese (10) genomes as examples for our analysis. The sequencing reads for each of the genomes were aligned to the genome reference hg18 (‘Materials and methods’ section). For each alignment, up to three mismatches were allowed in order to capture SNPs, DNPs or TNPs. Any two adjacent mismatches were marked as a DNP, while three adjacent mismatches indicated a TNP.
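The marking of adjacent mismatches can be illustrated as follows. This is a simplified sketch that compares an already aligned read with the reference segment it aligns to; the function name `find_xnps` and the string-based input are assumptions, and a real pipeline would work from the aligner's output instead.

```python
MAX_MISMATCHES = 3
LABELS = {1: "SNP", 2: "DNP", 3: "TNP"}


def find_xnps(read, reference):
    """Return (offset, label) for each run of consecutive mismatches in a read."""
    if len(read) != len(reference):
        raise ValueError("read and reference segment must have the same length")
    mismatches = [i for i, (r, g) in enumerate(zip(read, reference)) if r != g]
    if len(mismatches) > MAX_MISMATCHES:
        return []  # such a read would not have been aligned
    # Group mismatch positions into runs of adjacent positions.
    runs = []
    for i in mismatches:
        if runs and i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return [(run[0], LABELS[len(run)]) for run in runs]
```

A read carrying GT where the reference has AC, as in the example given earlier, is reported as a single putative DNP rather than as two SNPs: `find_xnps("TTGTTT", "TTACTT")` returns `[(2, "DNP")]`.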
The number of xNPs found throughout the genome and their locations are shown in Table 2. For each genome and type of xNP, the total number of xNPs and the number that are homozygous are listed. Since the SNPs for these two genomes have been previously determined, we compared our results to the published counts. For the Venter genome, 3.2 million SNPs were reported (9), as compared to our finding of 2.89 million SNPs. The Chinese genome was reported to have 3.07 million SNPs (10), while our method yielded 3.69 million. It should be noted that these differences in SNP counts are in the expected directions, given the different alignment and SNP calling techniques and thresholds that were utilized (‘Discussion’ section).
Overall, the numbers of DNPs and TNPs are greatly reduced relative to the numbers of SNPs. This is expected because the production of a DNP or a TNP requires the mutation of two or three adjacent nucleotides whereas a SNP only requires one change. For all of the xNPs, the greatest percentage occurs in intergenic regions, followed by introns, both of which are non-coding and are expected to have relatively lower levels of consistency across individuals. In contrast, far less than 1% of xNPs occur in coding exons, which are under selective pressure to prevent amino acid mutations. TNPs are almost completely absent from coding exons and there are only three coding TNPs from the Chinese genome and six from the Venter genome. Approximately half of all xNPs were observed in portions of the genome that are defined as repeats by RepeatMasker (24); this is expected since such regions cover 45% of the human genome (25).
Since coding exons are important regions of the genome for protein production, we focused on the analysis of xNPs in these regions. The xNPs were classified according to whether they caused no amino acid change (neutral/synonymous), caused one amino acid change, caused two amino acid changes, changed from a stop codon to a coding codon (read-through), or changed from a coding amino acid to a stop codon (premature stop). These results are shown in Table 3, and a complete list of each gene that had any xNPs along with the change produced by each type of xNP is shown in Supplementary Table S1.
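The five categories can be expressed as a small classifier that translates the reference and altered codons and compares them. This is an illustrative sketch only: the function names, the precedence given to stop-codon changes over amino acid changes, and the example codons are assumptions, not taken from the study.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate an in-frame sequence whose length is a multiple of three."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def classify_effect(ref_codons, alt_codons):
    """Classify an xNP from the one or two in-frame codons that it overlaps."""
    changed = [
        (r, a) for r, a in zip(translate(ref_codons), translate(alt_codons)) if r != a
    ]
    if not changed:
        return "synonymous"
    # Assumption: a change involving a stop codon takes precedence.
    if any(r == "*" for r, _ in changed):
        return "read-through"
    if any(a == "*" for _, a in changed):
        return "premature stop"
    return "one amino acid change" if len(changed) == 1 else "two amino acid changes"
```

As a hypothetical example, a TNP changing CTT to GAC replaces leucine with aspartate and is classified as one amino acid change, while a TNP changing GCTGCT to GGAACT spans two codons and is classified as two amino acid changes.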
We first compared the results with the theoretical calculations from Table 1. For the SNPs, a lower percentage resulted in amino acid changes than would be predicted at random. Theoretically, 68% of SNPs should change one amino acid, whereas this was found for 58 and 47% of the SNPs in the Chinese genome and the Venter genome respectively. This decrease was caused by a greater than predicted number of synonymous SNPs. We predicted that there would be 24% synonymous SNPs and we found that 41 and 53% of the Chinese and Venter genome SNPs, respectively, were synonymous. For both genomes, the synonymous to non-synonymous SNP ratio is around 50–50, as has been previously found for the genomes of multiple species (26–28).
In contrast to the bias from the calculations towards synonymous SNPs, we found a strong bias towards non-synonymous DNPs. For both genomes almost all of the exonic DNPs resulted in an amino acid change. The theoretical calculations predicted 86% of the DNPs causing amino acid changes and we found 98% of the DNPs for each genome causing a change. There were only a small number of exonic TNPs, but these were completely non-synonymous.
For all types of xNPs, the occurrence of both premature stop codons and stop codon read-through was less than predicted. For example, while 7% of DNPs were predicted to change a stop codon to a coding codon and result in stop codon read-through, neither genome had any DNPs producing read-through. Premature stop codons were predicted to result from 4 to 6% of SNPs and DNPs, but they were only found in 2% or less of all such events. These findings are presumably due to selective pressure against the potentially catastrophic results of either a protein truncation or elongation.
We found 225 DNPs located within 200 genes in the Venter and Chinese genomes. Sixty-six (29.3%) of these DNPs are found in both genomes, and may reflect the presence of variants or errors in the reference genome. In both the Venter and Chinese genomes, over 90% of the DNPs resulted in a single amino acid change. For the Venter genome, 30% of them were predicted to be damaging by both Polyphen (21) and SIFT (22). For the Chinese genome, 35% were predicted by Polyphen to be damaging while 34% were predicted to be deleterious by SIFT. We then used DAVID (20) to functionally annotate the genes containing DNPs. For the Chinese genome, the top Gene Ontology category describing the genes containing DNPs was the MHC protein complex (Benjamini corrected P = 0.032); for the Venter genome, no categories were found to be significant.
Both genomes had a small number of TNPs within their exons, with the Venter genome containing only six exonic TNPs and the Chinese genome containing only three. None of these TNPs are shared between the two genomes. For the Venter exonic TNPs, only one is homozygous, but they all cause a single amino acid change. For these TNPs, four are predicted by Polyphen to be damaging to the protein structure, while two are predicted to be deleterious by SIFT. For the Chinese TNPs, only one is homozygous, but all cause two amino acid changes. Since Polyphen and SIFT only look at single amino acid changes, we were unable to evaluate the double amino acid changes for their damaging potential. These TNPs were not concentrated in proteins of one type and include proteins that are structural (RDX, KRTAP10-1), signal regulatory (SIRPA), RNA binding (GPATC4), an olfactory receptor (OR5L2), bind to protein kinase A (AKAP3), a metalloprotease (ADAMTS9) and an antigen presenter (HLA-DRB5).
To further investigate the occurrence of xNPs in human genes and their consistency, we utilized complete exome sequencing data from eight individuals (11). Since only the exomes of these individuals were sequenced, we were not able to quantify xNPs outside of genes. The SNPs, DNPs and TNPs in each exome were determined using the same criteria that were used for our initial two genomes (Venter and Chinese) and the results are shown in Table 4. We found an average of 17 164 exonic SNPs per exome, which is very close to the reported count (11) of 17 272. We found a smaller number of DNPs and TNPs, with an average of 164 DNPs per exome and five TNPs per exome. Notably, the average number of exonic DNPs was identical to that observed in the Chinese genome, which was sequenced on the same platform (Illumina). We then looked at the pervasiveness of each xNP among the eight exomes (Supplementary Table S2). On average, the same SNP, DNP or TNP was found in 2.5, 1.8 and 1.4 exomes, respectively. Despite this low average, there were SNPs and DNPs that were found across all eight exomes. For these loci, the occurrence of a SNP or a DNP relative to the reference genome in all eight samples indicates that the reference genome probably does not contain the most common sequence. This is the case for 1450 SNPs (8.4%) and 18 DNPs (11%). These low percentages of pervasive xNPs indicate that the majority of xNPs are true variations between genomes rather than reflecting inaccuracies in the reference genome.
The most common TNP was found in four exomes in the KRTAP10-1 gene, and it was determined by Polyphen and SIFT to be a benign change. As with the TNPs found in the two full genomes, a substantial number of them result in the change of two adjacent amino acids, which cannot be easily evaluated.
To further confirm the findings of xNPs in the exomes, we compared our findings for one exome (NA19240) to the complete genome of that individual, which has recently been sequenced using the Complete Genomics technology (29). In our analysis of the data from the exome sequencing (Table 4), we identified 180 DNPs and two TNPs in coding regions, while using the Complete Genomics data, we identified 155 coding DNPs and five coding TNPs. Seventy of the DNPs and one of the TNPs were found using both techniques. Thus, the results of xNP analysis could to some extent be dependent upon the sequencing platform. At the same time, the overall abundance of DNPs and TNPs observed in the human genome and exome appears to be relatively consistent across various sequencing technologies.
We then looked at the positions within codons where xNPs occur. SNPs should preferentially occur in the third codon position, as has been previously found (26,30). This is because of the wobble nature of the genetic code whereby a change in this position is often silent. DNPs and TNPs have not been previously profiled, but it is expected that a DNP would preferentially occur in either positions 1 and 2 or 2 and 3 of a codon so as not to overlap two codons. Similarly, TNPs would be expected to completely overlap an individual codon. A plot of each type of xNP and the percentage that begin in each position in a codon is shown in Figure 2. As expected, the largest percentage of SNPs occur in the third codon position, but this was not an overwhelming majority (43%). For DNPs, unexpectedly, there appears to be little preference for any codon position. For TNPs, there is a bias towards their beginning in the first codon position (54%) and covering a single codon rather than overlapping two adjacent codons.
Nucleotide substitutions can be categorized as either transitions or transversions depending upon the two nucleotides involved. There is generally considered to be a strong bias towards transitions as compared to transversions in metazoan genomes (31). This was confirmed by our findings for SNPs in the combined set of eight exomes, in which 37 457 transitions and 14 792 transversions were observed. For DNPs and TNPs, the terms transition and transversion do not directly apply since they are associated with individual nucleotides. Nevertheless, we were able to classify each position within a DNP or a TNP as a transition or a transversion (Table 5). For DNPs, the first position was dominated (66%) by transitions, while there was much less preference at the second position. In contrast, there was a preference among TNPs (46%) for three transversions in a row.
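The per-position classification used for DNPs and TNPs can be sketched as below. The function names are hypothetical; the only facts relied on are that purine-to-purine (A, G) and pyrimidine-to-pyrimidine (C, T) substitutions are transitions and all other substitutions are transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Classify a single-nucleotide substitution as a transition or transversion."""
    if ref == alt:
        raise ValueError("reference and alternative nucleotides are identical")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def xnp_pattern(ref, alt):
    """Classify each position of an xNP, e.g. both positions of a DNP."""
    if len(ref) != len(alt):
        raise ValueError("an xNP does not change the sequence length")
    return tuple(substitution_type(r, a) for r, a in zip(ref, alt))


if __name__ == "__main__":
    print(xnp_pattern("AC", "GT"))
    # Transition-to-transversion ratio for the SNP counts reported in the text.
    print(round(37457 / 14792, 2))
```

For the DNP used as an example earlier (AC in one sequence, GT in the other), both positions are transitions. The SNP counts reported for the eight exomes correspond to a transition-to-transversion ratio of roughly 2.5.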
Finally, we applied our analysis to the sequenced exome of the U87 glioblastoma cell line (16). Rather than sequencing a complete exome, this study only sequenced the exons of 5253 cancer associated genes. Using our analysis, we found 53 DNPs and eight TNPs. For the DNPs, four caused a double amino acid change. Of the 49 that caused a single amino acid change, 37% were predicted by Polyphen to be damaging while SIFT determined that 31% would be deleterious. For the TNPs, four caused a double amino acid change and three caused a single amino acid change. All of the single amino acid changes were predicted by both Polyphen and SIFT to be damaging. In addition, one TNP caused a premature stop codon in this exome. A summary of the mutations found in each gene is shown in Table 6.
In order to determine whether any of these xNPs were in genes that have been previously found to be related to glioblastoma, we conducted Pubmed searches for each of the genes that was found to have a damaging xNP, or an xNP changing the location of a stop codon. We found a TNP in the LAMC2 gene that results in a single amino acid change, L952D, which is predicted to be damaging. This gene has been found to be amplified in glioblastomas (32) as well as other cancers (33,34). A TNP in the HUWE1 gene causes a truncation of the protein by introducing a premature stop codon, reducing its length from 4374 residues to 1668 residues. The HUWE1 gene has been found to be important in brain development, and its deletion has been found to be important in malignant brain tumors (35). In the case of this cell line, HUWE1 is not deleted, but it is truncated and therefore most likely not functional.
We have characterized a novel source of genomic variation, DNPs and TNPs, which occur with a frequency of ~1% of the total number of SNPs. In the two genomes we examined, we found tens of thousands of DNPs and thousands of TNPs. While only a small percentage of these changes are found in coding sequence and directly affect the translated protein, the non-exonic xNPs could of course be located in regulatory regions. Although not directly examined in this report, alteration of sequence in a promoter or an enhancer could change the expression dynamics of the associated gene (36).
The coding xNPs, while small in number, could be very important clinically. In order to cause a disease, only a single amino acid change in the genome may be required. Since DNPs and TNPs cause an amino acid change in at least 90% of instances, they could very easily be a cause of a disease. Those DNPs and TNPs that cause two amino acid changes are especially intriguing since they are likely to have a pronounced effect on the protein structure and function. Exonic DNPs and TNPs are approximately 3-fold over-represented amongst amino acid-changing polymorphisms and produce greater than 100 such changes in each normal human genome.
Based upon the average frequency of SNPs and simple probability, DNPs and TNPs should be much rarer than what we found. Assuming the occurrence of a SNP to be ~1 in 1000 bp (3 million SNPs in a 3 billion base genome) and assuming independence of all SNPs, there should be one DNP every 1 million base pairs (1000²), which would total 3000 DNPs in the entire human genome. There should also be one TNP every 1 billion base pairs (1000³), which would total three TNPs for the entire human genome. These numbers are vastly lower than the numbers that we observed, supporting the idea that the mutations in a DNP or a TNP are not independent. It has been found that SNPs tend to cluster along the genome rather than being evenly distributed; certain regions of the genome have large numbers of SNPs and other regions of the genome are devoid of SNPs (37). Given a region of the genome with a large number of SNPs, it is statistically more likely that a DNP or TNP would occur. The occurrence of DNPs and TNPs (as well as clusters of nearby, though non-adjacent SNPs) can be explained by the results of polymerase mis-incorporation experiments. In these assays, it was found that if a polymerase incorporates the incorrect nucleotide at a particular location, it increases the likelihood that another nearby nucleotide will be incorporated incorrectly (38,39).
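The expected counts under independence follow from a short calculation, written out here for reference. The symbols p (per-base SNP frequency), G (genome length) and N (expected count) are notation introduced for this illustration only.

```latex
\begin{align*}
p &\approx \frac{3 \times 10^{6}\ \text{SNPs}}{3 \times 10^{9}\ \text{bp}} = 10^{-3}, \qquad G = 3 \times 10^{9}\ \text{bp} \\
N_{\mathrm{DNP}} &\approx G \, p^{2} = 3 \times 10^{9} \times 10^{-6} = 3000 \\
N_{\mathrm{TNP}} &\approx G \, p^{3} = 3 \times 10^{9} \times 10^{-9} = 3
\end{align*}
```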
In considering our results, it is important to recognize that the number of variants identified by sequencing can vary as a function of numerous factors, including sequencing platform, read length, depth of coverage, and read alignment parameters (including quality control filters). The present study utilized different alignment parameters from prior whole genome sequencing studies, in order to permit detection of DNPs and TNPs. For example, the Chinese genome (10) was aligned to the reference using the SOAP (40) tool and the paired-end reads were aligned together allowing for two mismatches in each read. In our analysis, we aligned all of the reads using Bowtie (17), because it is very fast at aligning reads and this speed can be further increased by allowing it to use multiple threads on a multiprocessor machine (41). Moreover, we permitted up to three mismatches in each read in order to be able to detect TNPs; if we had used the standard cutoff of two mismatches, any read providing evidence for a TNP would have been discarded as an unmappable read. Given this less stringent filter, it is not surprising that we identified a somewhat larger number of SNPs in the Chinese genome than originally reported (10). By contrast, the eight exomes were originally aligned using Maq (Li et al. 2008a), which does not have an explicit cutoff for the number of mismatches allowed; our alignment procedures resulted in an average number of SNPs that was nearly identical to the original report (11). Finally, the original Venter genome analysis (9,42) was based upon the traditional Sanger sequencing and assembly of the Venter genome (43). As such, the SNPs between this genome and the human reference genome were determined by comparing the two genome assemblies de novo.
Our analysis of the Venter/HuRef genome was completely different in that we utilized their raw sequencing reads, which were truncated into non-overlapping 36-bp reads to simulate Illumina sequencing reads (‘Materials and methods’ section). This procedure resulted in a slight under-estimate of the total number of SNPs compared to the Venter Institute report, presumably due to some variation in regions with low depth of coverage failing to meet our criteria for SNP calling.
As an application of our technique to a real disease, we investigated the U87 glioblastoma cell line. We found two TNPs that cause pathogenic changes in genes that have already been implicated in the disease. Besides these mutations, it is very likely that a further understanding of glioblastoma could be gained from an analysis of the xNPs that were found in genes that have not already been suspected of involvement in glioblastoma.
In conclusion, the detection of DNPs and TNPs has not been previously studied, to our knowledge, and would be impractical using microarrays. With the recent advent of high-throughput sequencing and the possibility of sequencing complete exomes (11) and genomes (29), the investigation of DNPs and TNPs should be relatively straightforward. Their identification could be accomplished computationally in a manner similar to that used to call SNPs in sequenced genomes. It is hoped that the investigation of DNPs and TNPs in genomes will lead to the identification of causative mutations for genetic diseases that have thus far eluded SNP-based studies.
Supplementary Data are available at NAR Online.
National Institutes of Health (R01MH084098 to T.L. and P50MH080173 to A.K.M.). Funding for open access charge: R01MH084098.
Conflict of interest statement. None declared.
The authors would like to thank Rob DeSalle for helpful suggestions concerning the analysis and the writing of the article. The authors would also like to thank John Cholewa for technical assistance.