Correspondence to: Hakon Hakonarson, MD, PhD, Director, Center for Applied Genomics, Children’s Hospital of Philadelphia, Abramson Research Center Suite 1216, 3615 Civic Center Blvd, Philadelphia, PA 19104, United States. hakonarson/at/chop.edu
Telephone: +1-267-4266047 Fax: +1-267-4260363
Approaches to understanding the genetic contribution to inflammatory bowel disease (IBD) have continuously evolved from family- and population-based epidemiology, to linkage analysis, and most recently, to genome-wide association studies (GWAS). The next stage in this evolution seems to be the sequencing of the exome, that is, the regions of the human genome which encode proteins. The GWAS approach has been very fruitful in identifying at least 163 loci as being associated with IBD, and now, exome sequencing promises to take our genetic understanding to the next level. In this review we will discuss the possible contributions that can be made by an exome sequencing approach both at the individual patient level to aid with disease diagnosis and future therapies, as well as in advancing knowledge of the pathogenesis of IBD.
Core tip: The genetic understanding of inflammatory bowel disease (IBD) has progressed over the last twenty years as new technologies and analytic techniques have become available. The nascent revolution in next-generation sequencing will enable us to sequence the exome - all the protein coding genes in the genome - in thousands of individuals. This review discusses the implications of this new approach for diagnosis in very early onset IBD and as a tool to gain understanding of the hereditary basis of the common polygenic form of the disease at the population level.
The inflammatory bowel diseases (IBDs) consist of two main types of pathology: Crohn’s disease and ulcerative colitis. Over the preceding decades, genetic epidemiology of twins and families indicated that these diseases have a strong genetic component, but that they do not segregate according to a Mendelian pattern of inheritance such as autosomal dominant, autosomal recessive, or X-linked. Twin studies of Crohn’s disease have shown a concordance of 20%-50% for monozygotic twins and 0%-7% for dizygotic twins. For ulcerative colitis the concordance is 14%-19% for monozygotic and 0%-7% for dizygotic twins. The fact that the monozygotic concordance is well below 100% shows that there are strong environmental contributions and that there is incomplete penetrance of the genetic susceptibility loci. At the same time, the risk for a co-twin is considerably elevated compared to the general population. Supported by the results of recent genome-wide association studies, the most commonly accepted model of IBD susceptibility is a multifactorial model in which polygenic inheritance at hundreds of genetic loci, each with small effects, contributes along with non-genetic factors, such as diet and microbiome composition.
One of the first successful approaches to identifying specific risk genes was family-based linkage analysis. This approach seeks to identify chromosomal regions containing causative genes on the basis of recombinations within a family between a microsatellite marker and the trait of interest. Six loci were identified using linkage analysis, including the IBD3 locus containing the human leukocyte antigen complex on chromosome 6, and the IBD1 locus, the single largest genetic risk factor for Crohn’s, which contains the nucleotide-binding oligomerization domain protein 2 (NOD2) gene on chromosome 16[3-5].
The next technology to make a major impact in IBD genetics has been genome-wide association studies (GWAS). These studies involve genotyping hundreds of thousands of single nucleotide polymorphisms (SNPs) throughout the entire genome in order to find direct association between a specific polymorphism and the case/control status. The first successful study found an association between the interleukin (IL)23R locus and Crohn’s disease in addition to replicating the NOD2 association. Expanding the number of cases and controls in the cohort as genotyping prices dropped resulted in identification of ATG16L1, IRGM, MST1, NKX2-3, and PTPN2[8,9]. The first IBD GWAS in a pediatric cohort were reported by our group, highlighting associations with TNFRSF6B and IL27[10,11]. As studies have become better powered through increased cohort sizes, genotype imputation techniques, and international collaboration through the IBD Genetics Consortium, the tally of associated loci for Crohn’s and ulcerative colitis has risen to 163 in the latest meta-analysis, demonstrating unequivocally the polygenic nature of IBD inheritance. Notably, the distribution of SNPs genotyped in a GWAS covers intergenic as well as exonic and intronic regions, so that polymorphisms which predominantly affect the regulation of gene expression through transcriptional control can be assessed. Analysis of data from the ENCODE consortium has advanced the notion that much of the heritability of complex disorders originates in these non-coding regulatory regions of the genome. There is no assumption in GWAS that the susceptibility or protective variants are confined to amino acid substitutions in proteins, the type of variation that would be found in exome sequencing. However, a major disadvantage of GWAS is that they are much more attuned to detecting common variation, that is, SNPs with a minor allele frequency greater than 5%.
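The per-SNP case/control comparison at the heart of a GWAS can be illustrated with a minimal sketch. It assumes a simple allelic test (a Pearson chi-square on a 2×2 table of allele counts, one degree of freedom, no continuity correction); real studies typically use regression models with covariates, and the counts below are invented purely for illustration.

```python
import math

def allelic_chi_square(case_minor, case_major, ctrl_minor, ctrl_major):
    """Return (odds_ratio, chi2, p_value) for a 2x2 table of allele counts."""
    a, b, c, d = case_minor, case_major, ctrl_minor, ctrl_major
    n = a + b + c + d
    # Pearson chi-square statistic for a 2x2 table (1 degree of freedom)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, chi2, p_value

# Hypothetical counts: 1000 cases and 1000 controls, two alleles each
odds_ratio, chi2, p_value = allelic_chi_square(300, 1700, 200, 1800)
print(f"OR={odds_ratio:.2f} chi2={chi2:.2f} p={p_value:.2e}")
```

In a genome-wide scan this test would be repeated for every genotyped SNP, which is why a stringent significance threshold is needed before an association is reported.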
It is worthy of note that IBD has generated a greater number of associations than any form of pathology studied genetically to date, leading some to suggest that evolutionary selective pressures for variants in the genes underlying the immune response drove autoimmune-risk alleles to relatively high frequencies, a phenomenon known as balancing selection. The greater sensitivity of GWAS towards common variants is one reason among many that GWAS have only been able to account for a fraction of the heritability of polygenic diseases such as IBD. It is becoming increasingly clear that there is more to the story than the common disease-common variant hypothesis, and that rare variants, detectable only through sequencing, must also play a role[18,19]. Moreover, rare coding variants are more likely to have high odds ratios, greater penetrance, and to be amenable to follow-up by functional experimentation. Figure 1 illustrates the relationship between variant frequency and the phenotypic impact of the variant. Highly disruptive mutations will not rise to high frequency due to purifying selection. Exome sequencing is an ideal technology to fill in the intermediate frequency range of variants which may have stronger impacts than the weak associations detected by common GWAS variants.
With current technology, sequencing the whole 3 billion-base pair genome at the required depth of coverage to make rare variant calls is an expensive process, making it impractical to use on the scale required to implicate less-common variants in IBD. Statistically validating less common variants with phenotypic impact would require GWAS-sized datasets comprising thousands of cases and controls. A more practical alternative that has arisen since the emergence of next generation sequencing technology is to sequence the exome, the 1% of the genome that encodes protein. It has been estimated that 85% of monogenic, Mendelian disorders are the result of alterations in protein amino acid sequence, supporting the idea that exon-focused sequencing will yield the most functionally interesting variants.
The most common way to fractionate the genome for exome sequencing is in-solution hybridization. This can be accomplished by shearing the DNA into small 200-300 bp fragments by ultrasonic or enzymatic methods followed by ligation of common adapter sequences to the 3' and 5' ends of the fragments so that the sequencing primer can anneal. This whole-genome library is captured by hybridization in solution with 50-120 nucleotide-long "baits" that are complementary to the exon sequence being targeted. The library-bound baits are bound to magnetic beads and the non-coding DNA is washed out. The captured fragments are eluted and amplified by PCR. Next generation sequencing instruments that utilize the exome library include the HiSeq and MiSeq (Illumina, Inc.) and Ion Torrent Proton (Life Technologies). The instruments sequence by synthesis with a DNA polymerase, analyzing the incorporation of the next nucleotide by fluorescence imaging with modified nucleotides (Illumina) or by electrical measurement of the protons produced by the incorporation of nucleotides (Ion Torrent). This generates a short “read” typically 100-200 bp in length, significantly shorter than the 700-bp reads produced by traditional Sanger sequencing. The reads are furnished as a list of sequences accompanied by quality metrics, known as a FASTQ file. One of these instruments can generate 20-60 gigabases of sequence per day.
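To make the FASTQ format concrete, the following is a minimal sketch of a reader. It assumes the common four-lines-per-record layout with Phred+33 quality encoding and no line wrapping; the function name `read_fastq` and the example record are illustrative only, and production pipelines would normally rely on an established parser.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: quality score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# A single made-up record: header, bases, separator, per-base qualities
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

Each quality score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so a score of 20 means roughly a 1% chance that the base is wrong.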
The FASTQ file is analyzed by a read-mapping program, such as the popular Burrows-Wheeler aligner, which matches these short reads with a reference genome. The alignment is stored in a common file format called BAM which is interpretable by a variety of analysis tools for visualization and variant identification. When enough independent reads have been aligned at the same nucleotide location in the genome, usually at least 20 reads, a variant calling application, for instance, Genome Analysis Toolkit[23,24], the variant caller for the 1000 Genomes Project, can be used to decide if the site matches the reference sequence or contains an alternate nucleotide. The variant calls can be collected in a variety of formats, typically the Variant Call Format (VCF). A range of statistical analyses can be performed on the VCF files for each exome, including annotating them for function (missense, indel, synonymous) and likely impact of the variant (damaging, tolerated) using tools such as ANNOVAR, Sorting Intolerant From Tolerant[26,27], and PolyPhen. These tools use evolutionary conservation of the gene across diverse species as well as the chemistry of the amino acid substitution to generate a predictive score of each variant’s potential impact. The software tools also integrate information about the frequency of the variants in the general population using databases such as the National Heart, Lung, and Blood Institute Exome Sequencing Project and dbSNP, since it is most likely that a damaging and impactful mutation would be quite rare due to purifying evolutionary selection.
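The role of read depth in deciding whether a site differs from the reference can be shown with a deliberately naive sketch. This is not the statistical model used by Genome Analysis Toolkit or any other real caller; it is a toy rule that assumes a fixed minimum depth of 20 reads and a hypothetical allele-fraction cutoff (`het_fraction`) chosen only for illustration.

```python
from collections import Counter

MIN_DEPTH = 20  # rule-of-thumb minimum number of reads covering the site

def naive_call(ref_base, pileup_bases, min_depth=MIN_DEPTH, het_fraction=0.2):
    """Return (call, alt_base) where call is 'no_call', 'hom_ref', 'het' or 'hom_alt'.

    pileup_bases holds the base observed at this site in each aligned read.
    """
    depth = len(pileup_bases)
    if depth < min_depth:
        return "no_call", None  # too few reads to decide
    counts = Counter(pileup_bases)
    alt_counts = {base: n for base, n in counts.items() if base != ref_base}
    if not alt_counts:
        return "hom_ref", None
    # Consider only the most frequently observed non-reference base
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    fraction = alt_n / depth
    if fraction < het_fraction:
        return "hom_ref", None  # treat sparse mismatches as sequencing error
    if fraction > 1 - het_fraction:
        return "hom_alt", alt_base
    return "het", alt_base

print(naive_call("A", "A" * 12 + "G" * 12))  # balanced reads at adequate depth
print(naive_call("A", "A" * 5 + "G" * 5))    # same balance, insufficient depth
```

The second call shows why depth matters: the same allele balance that supports a heterozygous call at 24 reads is left uncalled at 10 reads.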
In large case/control studies, the coding variants will be rarer than the polymorphisms identified through GWAS, so that any individual rare variant is unlikely to achieve a threshold of statistical significance. Therefore, a variety of groups have developed methods to aggregate all of the rare variants in a gene and test them collectively in order to identify a rare variant burden in cases compared with controls or to detect an unusual distribution of variant frequencies between cases and controls for a given gene. A number of these tests have the feature of being able to detect association in the presence of a mixture of risk, protective, and neutral variation.
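The idea of aggregating rare variants within a gene can be sketched with the simplest collapsing approach: each individual is scored as a carrier or non-carrier of any rare variant in the gene, and carrier counts are compared between cases and controls with a one-sided Fisher exact test. This is only one of many published strategies and, unlike some of the tests mentioned above, it does not handle a mixture of risk and protective variants. The genotype matrix is invented for illustration.

```python
from math import comb

def carrier_counts(genotypes, is_case):
    """Collapse per-variant rare-allele counts into a 2x2 carrier table.

    genotypes: one list per individual of rare-allele counts across the gene's variants.
    Returns (case_carriers, case_noncarriers, control_carriers, control_noncarriers).
    """
    a = b = c = d = 0
    for alleles, case in zip(genotypes, is_case):
        carrier = any(n > 0 for n in alleles)
        if case and carrier:
            a += 1
        elif case:
            b += 1
        elif carrier:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value: probability of at least `a` carriers among cases."""
    cases, controls, carriers = a + b, c + d, a + c
    total = comb(cases + controls, carriers)
    tail = sum(comb(cases, k) * comb(controls, carriers - k)
               for k in range(a, min(cases, carriers) + 1))
    return tail / total

# Hypothetical gene with three rare variant sites, four cases and four controls
genotypes = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1],
             [0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
is_case = [True, True, True, True, False, False, False, False]
table = carrier_counts(genotypes, is_case)
p_burden = fisher_exact_greater(*table)
print(table, round(p_burden, 4))
```

With so few individuals the p-value is far from significant, which mirrors the point made above: gene-level aggregation helps, but large cohorts are still needed.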
In attempting to identify a role for exome sequencing in inflammatory bowel disease we can appreciate two scenarios where it might be used. The first scenario is that of an individual patient or family with an atypical clinical presentation whose diagnosis or therapeutic decision may be influenced by genetic information. This can be seen in the very young children who present with clinical symptoms of IBD, known as very early onset IBD (VEO-IBD). These children frequently present with a more severe disease and often with a phenotype that is distinct from older children and adults, including extensive colonic disease unresponsive to standard therapy. These findings suggest distinct etiopathogenic pathways. In one well-known case, a 15-mo-old child presented with failure to thrive and perianal fistulae that were refractory to medical care. His disease progressed to pancolonic involvement; however, the terminal ileum and upper tract were spared. This early age of onset and severity suggested a severe perturbation of the immune system. He underwent numerous surgical procedures and treatment with immunosuppressive drugs, as well as targeted genetic and immunologic testing that did not yield a recognizable diagnosis or remission of symptoms. Sequencing of the child’s exome revealed that this patient had an exceedingly rare mutation on the X chromosome in the XIAP gene, a potent regulator of the inflammatory response. He was treated by bone marrow transplant, resulting in resolution of his disease. In our own IBD center at the Children’s Hospital of Philadelphia, we encountered a 5-mo-old patient with colonic inflammatory bowel disease. She presented with severe disease that was unresponsive to medical therapy. Her course was complicated by frequent episodes of dehydration and she became transfusion dependent despite various treatments. Exome sequencing in this patient revealed a mutation in the MEFV gene, resulting in a diagnosis of familial Mediterranean fever (FMF). The patient was referred to a pediatric rheumatology specialist and is being successfully treated for FMF with colchicine.
These successes highlight the critical role of exome sequencing in carefully selected patients by providing diagnoses that can guide treatment. Factors that suggest a patient may have a rare genetic perturbation that might be elucidated by exome sequencing would include early onset of disease, unusual severity, familial pattern of transmission, and a refractory response to standard therapies. In these cases, collecting DNA samples from parents so that exome sequencing in a trio setting can be performed is of high value. This will allow the identification of de novo variants as well as aiding in the elimination of the numerous false positive variant calls that exome sequencing generates by checking for non-Mendelian transmission of mutations. If a Mendelian inheritance model can be specified, as in the case of a consanguineous family in which the disease is likely to be autosomal recessive, such information can be of great help in narrowing down the causal variant. Homozygosity mapping in two consanguineous families was successfully used to identify mutations in the IL-10 receptor genes that resulted in severe VEO-IBD unresponsive to therapy. Following this discovery, the disease was resolved by bone marrow transplant. This critical finding has been replicated in larger cohorts of patients with VEO-IBD and has shed light on an important pathway in the development of VEO-IBD. A further appeal of applying exome sequencing in a family setting is in identifying novel monogenic causes of IBD that might yield an unexpected insight into the biology of disease, thereby directing interest towards novel targets for therapeutic development. An example would be the development of monoclonal antibodies that dramatically lower low-density lipoprotein (LDL) cholesterol by inhibiting proprotein convertase subtilisin kexin 9, a protein that was found to be deficient in a small number of individuals with genetically very low LDL.
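The trio check for Mendelian consistency can be sketched as follows. The sketch assumes an autosomal, biallelic site with genotypes coded as the number of alternate alleles (0, 1 or 2); X-linked genes such as XIAP would need separate handling, and the function name `classify_trio` is illustrative.

```python
def classify_trio(child, mother, father):
    """Classify a child's genotype at one site given both parents' genotypes.

    Genotypes are alternate-allele counts (0, 1 or 2) at an autosomal biallelic site.
    Returns 'mendelian', 'de_novo_candidate' or 'mendelian_error'.
    """
    def transmissible(parent):
        # Alleles a parent can pass on: 0 = reference, 1 = alternate
        return {0: {0}, 1: {0, 1}, 2: {1}}[parent]

    possible = {m + f for m in transmissible(mother) for f in transmissible(father)}
    if child in possible:
        return "mendelian"
    if child == 1 and mother == 0 and father == 0:
        # Heterozygous child with two homozygous-reference parents
        return "de_novo_candidate"
    # Any other inconsistency is more likely a genotyping or sample error
    return "mendelian_error"

print(classify_trio(1, 0, 0))  # de_novo_candidate
print(classify_trio(2, 1, 1))  # mendelian
print(classify_trio(0, 2, 2))  # mendelian_error
```

In practice a de novo candidate flagged this way would still need adequate read depth in all three family members and independent confirmation, since a missed heterozygous call in a parent produces the same pattern.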
An area where exome sequencing has been impactful is in the sequencing of cancer tissue exomes in comparison with the patient’s inherited exome. Some studies have been successful in identifying somatic “driver mutations” which are essential for the growth of the tumor, which can spur the development of chemotherapeutic interventions that will target the cancer specifically[37,38]. Great interest has sprung up around the promise of personalized, or precision, medicine for cancer driven by the somatic genomics of tumors. Whether sequencing of intestinal biopsies in IBD presents an avenue to identify somatic mutations that may be critically important for microbiome interaction is yet to be determined, but studies are underway that are addressing this possibility.
The second scenario in which exome sequencing can be impactful is as a research tool to augment GWAS in uncovering novel susceptibility loci and specific coding variants in the typical polygenic form of Crohn’s and ulcerative colitis. Whether exome sequencing will succeed in this role to the same degree as GWAS is still controversial. It is clear that identifying genes carrying a burden of exonic rare variants in a disease with the highly polygenic architecture of IBD will require GWAS-sized cohorts, that is, ones consisting of tens of thousands of cases and controls. The high cost and labor intensity of such an effort currently put these studies out of reach of all but the most resource-rich groups. Nevertheless, some groups have succeeded in finding rare variant associations through sequencing at the phenotypic extremes of several complex traits in carefully selected candidate genes such as ANGPTL4 and ANGPTL5 or LPL in triglycerides, SLC12A1 in blood pressure, and IFIH1 in type 1 diabetes. Targeted next-generation sequencing in IBD has even produced some rare variant associations by following up GWAS hits, such as coding mutations that reduce signaling through the IL-23 receptor. Targeted next-generation sequencing by Rivas et al identified additional NOD2 and IL23R coding variants, as well as novel coding variants in CARD9, IL18RAP, CUL2, C1orf106, PTPN22 and MUC19. Our group has also recently identified rare nonsynonymous variants in the TNFRSF6B gene in IBD patients with pediatric onset disease, suggesting that this could be true for other GWAS loci as well. Table 1 summarizes genes that have been shown to have nonsynonymous variants with disease relevance in IBD.
Despite the success of these candidate gene efforts, doubts remain about how practical rare variant studies will be when applied to the entire exome. Most of the rare variant associations identified so far in candidate gene sequencing would not meet the stringent Bonferroni correction for multiple testing on an exome scale, estimated to be a P < 2.5 × 10⁻⁶. Investigators must also consider that supporting novel rare variant associations requires replication in additional cohorts since rare variants are often population-specific and frequencies can vary in very inhomogeneous ways in spatially structured populations. This type of confounding, known as population stratification, can lead to spurious associations. Therefore, replication would likely require additional sequencing of large cohorts since genotyping of specific variants would likely not be useful in a different geography or ethnicity, although the replication sequencing might be limited only to genes of interest in the discovery cohort.
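The exome-scale threshold quoted above is what a Bonferroni correction yields if one assumes a family-wise error rate of 0.05 and roughly 20000 gene-level tests. Both inputs are assumptions made here for illustration; the sketch simply works through the arithmetic.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# Assumed inputs: 5% family-wise error rate, about 20000 protein-coding genes
threshold = bonferroni_threshold(0.05, 20_000)
print(f"{threshold:.1e}")

# A hypothetical candidate-gene result that looks strong in isolation
candidate_p = 1e-4
print(candidate_p < threshold)
```

A p-value of 1e-4 would be convincing for a single pre-specified gene but falls short of the exome-wide threshold by more than an order of magnitude, which is the difficulty described above.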
Concerns about the likelihood of uncovering a substantial amount of heritability in common autoimmune diseases were raised by a recent report by Hunt et al. This effort selected 25 risk genes that were identified in GWAS of at least two different common autoimmune diseases. The exons of these 25 genes were sequenced with excellent coverage in a cohort of 24892 subjects with six autoimmune disease phenotypes and 17019 controls. They found that the great majority of variants uncovered occurred in a single subject. Five aggregating gene-based tests (rather than individual variant-based tests) were used to identify rare-variant enrichment for any of the genes but none were found to be statistically significant. The authors concluded that there was little support for large-scale whole-exome sequencing projects in common autoimmune diseases. While this report may portend that the impact from rare coding variants is negligible, there are some limitations to the study that leave open the possibility for meaningful discovery. The study considered only 25 genes, while an exome-based approach would survey all 20000 human protein-coding genes. It is likely that the risk conferred by these 25 GWAS genes is carried by non-coding variation, while the risk at some subset of loci in the genome could be carried by rare coding variants that are not captured by GWAS SNPs. This is particularly true for variants in the intermediate 0.5%-5% frequency range which could be impactful in aggregate while escaping detection in GWAS due to the weak linkage disequilibrium for variants in this frequency range with common variants. Indeed, Hunt et al did identify three risk mutations at approximately the 5% minor allele frequency.
Furthermore, the abundant singleton mutations could still identify IBD risk genes through the use of statistical tests which weight mutations more or less heavily depending on their frequency, as the very rare variants are the most likely to be impactful functionally. Several methods such as adaptive sum tests, Sequence Kernel Association Test, and Variable Threshold tests have been developed specifically for sustaining a high statistical power with rare variants. Finally, the six autoimmune disease phenotypes were quite heterogeneous in their pathologic nature, ranging from IBD to autoimmune thyroid disease to multiple sclerosis. These diseases have distinct mechanisms with different rare variants underlying them, possibly in none of the 25 genes sequenced. Therefore, combining such diverse diseases arguably dilutes the power to detect rare variant associations.
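Frequency-dependent weighting can be illustrated with a small sketch. It is not an implementation of any of the named tests; it assumes one simple and commonly used style of weight, the inverse of the binomial standard deviation term sqrt(q(1 − q)) for a variant with minor allele frequency q, so that rarer variants contribute more to an individual's gene-level score. The frequencies and genotypes are invented.

```python
import math

def frequency_weight(maf):
    """Weight that grows as the minor allele frequency shrinks: 1 / sqrt(q * (1 - q))."""
    return 1.0 / math.sqrt(maf * (1.0 - maf))

def weighted_burden(allele_counts, mafs):
    """Gene-level score for one individual: weighted sum of rare-allele counts."""
    return sum(n * frequency_weight(q) for n, q in zip(allele_counts, mafs))

# Hypothetical gene with three variants at 5%, 1% and 0.1% frequency
mafs = [0.05, 0.01, 0.001]
common_carrier = weighted_burden([1, 0, 0], mafs)  # carries the 5% variant
rare_carrier = weighted_burden([0, 0, 1], mafs)    # carries the 0.1% variant
print(round(common_carrier, 2), round(rare_carrier, 2))
```

Per-individual scores of this kind would then be compared between cases and controls, for example by regression or a permutation test.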
A recent study by Ellinghaus et al utilized exome sequencing to identify a role for missense variants in PRDM1 and NDP52 in Crohn’s disease. Variants in these two genes were discovered in a cohort of 42 whole-exome sequenced individuals, with discovered variants being prioritized by functional impact scores and presence within GWAS-delineated loci. Over 20000 combined Crohn’s and ulcerative colitis cases and controls were genotyped to establish that two variants, p.Ser354Asn in PRDM1 and p.Val248Ala in NDP52 were associated with IBD. Functional studies showed that the PRDM1 mutant increased T cell proliferation and cytokine secretion while the NDP52 mutant impaired the ability of the protein to downregulate nuclear factor kappa B signaling in toll-like receptor signaling pathways. This paper provides an example of how exome sequencing, even in a modest cohort, can refine GWAS signals and uncover less common risk variants, especially when coupled with functional validation.
Our group recently developed a machine-learning approach to predicting risk for IBD using data from the International IBD Genetics Consortium’s ImmunoChip project. The ImmunoChip assays 200000 SNPs with very dense coverage in genomic regions that have been associated with autoimmune disease through genome-wide association studies. Due to the ImmunoChip’s wide spectrum of variants and the large number of cases and controls genotyped in the project, it was possible to use a penalized logistic regression model to predict risk for IBD with area under the curve of 0.86 for Crohn’s disease and 0.83 for ulcerative colitis. With the coming availability of large-scale whole exome data we expect that risk prediction can be improved further and may achieve clinically-useful levels with the comprehensive catalog of variation that would be produced through eventual whole-genome sequencing.
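The area under the curve reported for such a classifier has a direct interpretation: it is the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The sketch below computes it by pairwise comparison from a set of invented risk scores; it illustrates the metric only and does not reproduce the penalized logistic regression model itself.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count one half).

    scores: predicted risk per individual; labels: 1 for case, 0 for control.
    """
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical predicted risks for two cases and three controls
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
example_auc = auc(scores, labels)
print(round(example_auc, 3))
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation, which gives context for the values of 0.86 and 0.83 cited above.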
We can predict with some confidence that exome sequencing will have a place in IBD in patient- or family-based settings where features of the clinical presentation suggest a likely monogenic, Mendelian basis for the disease. Personalized medicine based on the patient’s genome in these carefully selected cases is no longer a far-off dream but a nascent reality. More uncertain are the prospects of large-scale exome sequencing projects for discovery of population-scale heritability for such a common and highly polygenic disease. Theoretical arguments can be made to support either position, but the debate can only be resolved by experimental testing of the common disease-rare variant hypothesis. Exome sequencing of rare variants may not collectively yield much explanation of the population attributable risk of disease, but it has great potential to highlight the key players in the pathogenesis of disease along with variants amenable to functional study and thereby influence the development of potent new therapeutics.
Supported by A Senior Research Award from the Crohn’s to Cardinale CJ; Colitis Foundation of America to Hakonarson H; and a special purpose fund from the Edmunds Family Foundation for Ulcerative Colitis Studies to Baldassano RN
P- Reviewers Decorti G, Fitzpatrick LR, Gazouli M, Yamamoto S S- Editor Gou SX L- Editor A E- Editor Liu XM