Over the past few years, large-scale sequencing efforts have provided a greater understanding of the variability of the human genome. Notably, whole genome sequencing studies have shown that each individual harbors 2.7–4.2 million single nucleotide variants (SNVs) that differ from the human reference genome [1], whereas exome sequencing typically identifies 18,000–24,000 coding-region SNVs per individual [2]. With regard to SNVs in coding regions, the findings generated by whole genome or exome sequencing parallel those observed when sequencing individual genes. First, a majority of identified variants are present in dbSNP [8] and therefore represent more common variation. For example, exome and genome sequencing reports have shown that 88–99% of observed SNVs reside in dbSNP [1]. Second, the number of SNVs depends on individual genetic variability, ethnicity, and the reference sequence to which results are aligned and compared. At present, most studies align to the human genome reference sequence (hg18/NCBI36 or hg19/GRCh37), which is most similar to the genomes of Caucasian individuals of Northern European ancestry. As a consequence, the number of SNVs observed can vary considerably depending on the ethnic background of the samples. Third, at the individual gene level, variants may have already been described and classified in a gene-specific database. Not infrequently, however, novel SNVs are identified, even in genes that have been extensively studied through clinical research or diagnostic testing. While guidelines exist to assist in SNV annotation and functional prediction [11], many novel variants continue to be classified as variants of uncertain significance (VUS). As a greater number of exomes and genomes are sequenced, a more comprehensive catalogue of human genetic variation will facilitate variant classification in individual genes.
In the context of the above observations, we proposed that analysis of a dataset of variants identified in a single gene would yield insights into what large-scale sequencing studies will reveal going forward. In our referral laboratory setting, we chose to study the cystic fibrosis transmembrane conductance regulator (CFTR) gene, which is the target of a high-volume full gene sequencing diagnostic assay. It should be noted, however, that at ARUP Laboratories most cases have already been tested with a 32-mutation panel that identifies the most common disease-causing alleles before they are sequenced. Thus, sequencing results are enriched for rare CFTR mutations. CFTR (NM_000492) is located at 7q31.2 and consists of 27 exons coding for a 1480 amino acid protein, which is a member of the ATP-binding cassette (ABC) transporter superfamily. Mutations in CFTR are known to result in multiple conditions, ranging from classic cystic fibrosis (CF) to monosymptomatic diseases such as congenital absence of the vas deferens, pancreatitis, or chronic bronchiectasis.
Classic CF, a recessively inherited genetic disorder, has an incidence of one in 2500–3200 in Caucasians, making it one of the most common lethal genetic disorders [15]. CF occurs with different frequencies in different ethnic groups, with estimated carrier rates of one in 28, 29, 46, 65, and 90 in Caucasians, Ashkenazi Jews, Hispanics, African Americans, and Asians, respectively [16]. The American College of Medical Genetics recommends carrier screening for CF in expectant individuals or those planning a pregnancy by testing for 23 known disease-causing mutations [18]; between 48% and 84% of clinically diagnosed CF patients have at least one of these mutations [19]. The most common CFTR gene mutation is a three base pair deletion, p.Phe508del (prevalence of 24%–88% depending on ethnic background [17]), which is associated with a more severe phenotype when present in a homozygous state [21]. Similarly, other variants have variable frequency in different populations [22]. In all ethnic groups, the majority of CFTR variants are of unknown clinical significance [22]. Several databases have reported variants in CFTR, including dbSNP [8], the Cystic Fibrosis Mutation Database (CFMDB) [24], and the Human Gene Mutation Database (HGMD) [25]. Variants in HGMD are assumed to be disease causing, but there are exceptions. Variants in dbSNP, on the other hand, are often assumed to be benign; however, that is not always the case. The CFMDB contains both disease causing and benign variants.
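The carrier rates quoted above can be related to an expected disease incidence with a short back-of-the-envelope calculation. The following is a minimal sketch, not part of the study's methods: it assumes Hardy-Weinberg equilibrium (random mating, no selection) and pools all disease-causing CFTR alleles into a single allele class with frequency q, so that the carrier frequency is 2q(1 - q) and the expected incidence is q². The function names are hypothetical and chosen for illustration only.

```python
import math

# Carrier rates quoted in the text, expressed as "one in N".
CARRIER_ONE_IN = {
    "Caucasian": 28,
    "Ashkenazi Jewish": 29,
    "Hispanic": 46,
    "African American": 65,
    "Asian": 90,
}


def disease_allele_frequency(carrier_one_in: float) -> float:
    """Solve 2q(1 - q) = 1/N for the smaller root q.

    Assumption: Hardy-Weinberg equilibrium with all disease-causing
    alleles pooled into one class of frequency q.
    """
    carrier_freq = 1.0 / carrier_one_in
    return (1.0 - math.sqrt(1.0 - 2.0 * carrier_freq)) / 2.0


def expected_incidence_one_in(carrier_one_in: float) -> float:
    """Return M such that the expected incidence of affected births is 1 in M."""
    q = disease_allele_frequency(carrier_one_in)
    return 1.0 / (q * q)


if __name__ == "__main__":
    for group, n in CARRIER_ONE_IN.items():
        m = expected_incidence_one_in(n)
        print(f"{group}: carriers 1 in {n}, expected incidence about 1 in {m:.0f}")
```

Under these assumptions, a carrier rate of one in 28 corresponds to an expected incidence that falls inside the one in 2500–3200 range cited for Caucasians. The estimate is only as good as the equilibrium assumption and should be read as a consistency check rather than a prediction.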
Herein we present results, including 21 novel variants, from a six-year period of CFTR diagnostic testing during which samples from 1407 individuals were referred to ARUP Laboratories for full gene CFTR sequencing. We focus on the need to develop a more complete understanding of variants in non-Caucasian ethnic groups, evaluate the usefulness and completeness of databases for clinical testing, and report the novel variants observed at ARUP together with their ethnicity and clinical classifications.