Recently, SNP rs13361189 was found to be strongly associated (P = 2.1 × 10⁻¹⁰) with Crohn’s disease (CD) in a genome-wide association scan and independent replication study (refs. 1,2). rs13361189 lies immediately upstream of IRGM, a gene previously shown to be essential for autophagy (ref. 3). Because the most strongly associated SNPs span the 5′ end of IRGM, and because CD risk is also associated with the autophagy gene ATG16L1 (refs. 4,5), the association signal seems to arise from IRGM (ref. 2). However, coding-sequence variation in IRGM has been excluded as the source of the association signal: resequencing IRGM exons in 248 individuals revealed only three coding-sequence variants, of which two were uncorrelated with CD risk and the third was a synonymous exonic SNP that did not affect IRGM protein sequence or splice sites (ref. 2).
HapMap SNPs upstream of IRGM showed a pattern of assay failure (multiple SNPs yielding null genotypes in the same 34 samples) that is characteristic of structural polymorphisms (ref. 6). To directly assess whether structural polymorphisms reside near IRGM, we examined experimental data in which DNA from the 270 HapMap samples was analyzed using a hybrid array of SNP and copy-number probes (SNP 6.0 array; S.A.M., F.G. Kuruvilla, J.M. Korn, M.J.D. and D.A., unpublished data). Six copy-number probes spanning the 13-kb region from 150.186 Mb to 150.199 Mb (which contains the failing HapMap SNPs) showed correlated variation in intensity across samples, suggesting the existence of a common copy-number polymorphism upstream of IRGM.
Quantitative PCR assays across the identified region revealed that individuals have 0, 1 or 2 copies of the region per diploid genome, indicating that the structural polymorphism is an insertion/deletion (Supplementary Fig. 1 online). The insertion/deletion was in perfect linkage disequilibrium with rs13361189 (r² = 1.0) in all HapMap analysis panels, suggesting that it arose as a single ancestral mutation and making it a candidate to explain the association signal at rs13361189.
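The r² statistic quoted above can be illustrated with a minimal sketch. It assumes phased haplotypes coded as 0/1 allele indicators, which is a simplification; the source does not state how r² was computed, and the function name `ld_r2` and the example haplotypes are hypothetical.

```python
from typing import Sequence


def ld_r2(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared correlation (r^2) between two biallelic markers.

    a, b: per-haplotype allele indicators (1 = minor allele, 0 = major
    allele), listed in the same haplotype order. Assumes phased data.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    p_a = sum(a) / n                             # minor-allele frequency, marker A
    p_b = sum(b) / n                             # minor-allele frequency, marker B
    p_ab = sum(x * y for x, y in zip(a, b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                         # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    return d * d / denom


# Hypothetical haplotypes: deletion status and the rs13361189 minor allele.
deletion = [1, 0, 0, 1, 0, 0, 0, 1]
snp = [1, 0, 0, 1, 0, 0, 0, 1]
print(ld_r2(deletion, snp))
```

When the two markers always occur on the same haplotypes, as in this toy input, r² equals 1, which is the situation reported for the deletion and rs13361189.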
To determine the physical extent and molecular nature of the deletion polymorphism, we used PCR assays to map its breakpoints. PCR capture and sequencing of the deletion breakpoints revealed that the deletion removes 20,103 nucleotides, replacing them with 7 nucleotides. Identical lesions were identified in 6/6 individuals tested, reinforcing the linkage disequilibrium data in suggesting that this insertion/deletion represents a single ancestral mutation. The 20-kb affected sequence was observed at the same genomic location in the chimpanzee genome, indicating that the insertion allele is the ancestral state. The right breakpoint of the deletion was 123 bp from the CD-associated SNP rs13361189 and 2.7 kb upstream of the reported IRGM transcription start site (ref. 7).
We then sought to determine whether this deletion polymorphism showed CD association consistent with it being the causal allele at this locus. First, to directly confirm that the deletion was associated with risk of inflammatory bowel disease (IBD) and CD, we typed the polymorphism in a North American case-control collection of 685 individuals. Relative to its frequency in unaffected individuals (10%), the deletion allele showed an elevated frequency in individuals with IBD (15%, odds ratio (OR) = 1.5, P < 0.01), including association with CD (allele frequency 15%, OR = 1.6, P < 0.01) and ulcerative colitis (allele frequency 14%, OR = 1.4, P < 0.05). These data contained 150 copies of the deletion allele and showed a perfect (r² = 1.0) correlation between the deletion and the CD-associated SNP rs13361189, further indicating that rs13361189 is a proxy for the deletion. We further confirmed this relationship by evaluating regional probe-intensity information and rs13361189 genotypes from newly generated data on 990 additional extended HapMap samples run on the SNP 6.0 array (Supplementary Fig. 1). In total, combining IBD, HapMap and extended HapMap data, we observed perfect correlation of rs13361189 and the deletion polymorphism across 933 instances of the minor allele of each in a sample comprising individuals of various ancestries. These results indicate equivalence of rs13361189 and the structural polymorphism for the purposes of association.
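As a rough check on how the allele frequencies and odds ratios in the case-control comparison relate, an allelic odds ratio can be computed directly from two allele frequencies. This is a minimal sketch; the function name `allelic_odds_ratio` is hypothetical. Because the frequencies are reported as rounded whole percentages, the values it returns only approximate the published odds ratios, which were presumably derived from the underlying allele counts.

```python
def allelic_odds_ratio(freq_cases: float, freq_controls: float) -> float:
    """Odds of carrying the allele in cases divided by the odds in controls."""
    for f in (freq_cases, freq_controls):
        if not 0.0 < f < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# Rounded deletion-allele frequencies quoted in the text
# (controls 10%, IBD and CD 15%, ulcerative colitis 14%).
for label, freq in [("IBD", 0.15), ("CD", 0.15), ("UC", 0.14)]:
    print(label, round(allelic_odds_ratio(freq, 0.10), 2))
```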
To compare the association signal at rs13361189 and the deletion with that of other SNPs in the region, we used additional SNP data from the National Institute of Diabetes and Digestive and Kidney Diseases Inflammatory Bowel Disease Genetics Consortium (NIDDK IBDGC) genome scan (refs. 5,8). As in the combined Wellcome Trust Case Control Consortium (WTCCC) and replication study (ref. 2), rs13361189 and its perfectly correlated neighbors showed the strongest CD association (P = 3.0 × 10⁻⁴) of all SNPs in the region. A second set of SNPs at IRGM (rs4958847 and its perfectly correlated neighbors), also reported in the WTCCC replication study (ref. 2) and partially correlated with rs13361189 and the deletion, was more modestly associated with CD (P = 0.003). In combination with the WTCCC results, these SNPs showed association more than two orders of magnitude less significant than rs13361189 (P = 3.8 × 10⁻¹⁰ versus P = 2.1 × 10⁻¹²) and therefore did not seem to explain the association. Because the earlier HapMap CEU data suggested the existence of a large block of linkage disequilibrium that extended across the nearby gene ZNF300, we also examined linkage disequilibrium and CD association in the genes near IRGM. The extended HapMap sample and IBDGC CD cohort indicated that SNPs in other genes were only partially correlated with rs13361189. Notably, in the CD cohort, rs13361189 remained associated (P < 0.05) conditional on genotypes at all SNPs beyond the boundaries of IRGM, but no SNPs showed association conditional on genotype at rs13361189. Thus, rs13361189 and its strongly correlated neighbors at the 5′ end of IRGM, including the 20-kb deletion polymorphism, are the primary polymorphisms that can explain the CD association in this region.
Given the nature and location of these potential causal polymorphisms, we next assessed whether the IRGM haplotypes differ in their regulation of IRGM expression and whether IRGM expression levels have physiological consequences. To assess whether the deletion (CD risk) and reference (CD protective) haplotypes of IRGM differ in their ability to activate IRGM expression, we measured the relative abundance of IRGM transcripts derived from the two haplotypes in cell lines that were heterozygous for the two haplotypes. Comparing the relative expression of two alleles in heterozygous cells allows the analysis of cis-acting variation in a way that controls for trans effects and environmental influences (refs. 9,10). This approach was facilitated by the existence of an exonic synonymous SNP (rs10065172) in IRGM that was in strong linkage disequilibrium (r² = 1.0 in samples tested) with both rs13361189 and the deletion polymorphism, such that transcripts arising from the risk (deletion) haplotype carry the T allele of rs10065172, and transcripts arising from the protective (reference) haplotype carry the C allele.
The two IRGM haplotypes showed different patterns of expression across a panel of heterozygous cell lines. cDNA from HeLa cells, whose genomic DNA was heterozygous for the two IRGM haplotypes, almost exclusively contained the C allele arising from the protective (reference) haplotype; this result was consistent across multiple HeLa isolates. Similarly, the hepatocellular carcinoma cell line SNU182 expressed the C allele 4–6 times more strongly than the T allele, and lymphoblastoid cell lines from ten heterozygous individuals all expressed the C allele more strongly than the T allele. In cells derived from some other tissues, however, we observed much stronger expression of IRGM from the deletion haplotype: both the colon carcinoma cell line HCT116 and primary smooth muscle cells from human bronchus expressed the T allele approximately six times more strongly than the C allele. These results indicate that the CD risk (deletion) and CD protective (reference) haplotypes activate IRGM expression in different cellular contexts.
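The allelic comparisons above reduce to a ratio of allele-specific signals in cDNA from heterozygous cells. The sketch below shows one common way to express such a ratio, normalizing the cDNA C:T ratio by the C:T ratio measured in genomic DNA from the same cells (expected to be 1:1 in a heterozygote). The normalization step, the function name `allelic_expression_ratio` and all signal values are illustrative assumptions and are not taken from the source.

```python
def allelic_expression_ratio(cdna_c: float, cdna_t: float,
                             gdna_c: float = 1.0, gdna_t: float = 1.0) -> float:
    """C:T expression ratio in cDNA, normalized by the C:T ratio in genomic DNA.

    Values above 1 indicate stronger expression from the C (reference)
    haplotype; values below 1 indicate stronger expression from the
    T (deletion) haplotype.
    """
    if min(cdna_c, cdna_t, gdna_c, gdna_t) <= 0:
        raise ValueError("all allele signals must be positive")
    return (cdna_c / cdna_t) / (gdna_c / gdna_t)


# Hypothetical allele signals (arbitrary units), not measured values.
print(allelic_expression_ratio(500.0, 100.0, 200.0, 200.0))  # C-biased line
print(allelic_expression_ratio(100.0, 600.0, 200.0, 200.0))  # T-biased line
```

Normalizing by the genomic DNA ratio is intended to correct for any allele-specific bias in the assay itself, so that a departure from 1 reflects a difference in transcript abundance.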
We then asked whether IRGM expression is related to cellular autophagy in a manner that could plausibly be linked to CD. To address an emerging connection between CD and autophagic processing of internalized bacteria (ref. 5), we manipulated IRGM expression in HeLa cells infected with Salmonella typhimurium, and then assayed the ability of the infected cells to form autophagic vesicles around the infecting bacteria. Reductions in IRGM expression, using siRNA constructs that reduced IRGM mRNA expression by six- to eightfold, significantly compromised the efficiency of anti-bacterial autophagy. Together with existing data on cellular control of Mycobacterium tuberculosis (ref. 3), these data support a role for IRGM in anti-bacterial autophagy. To test a further hypothesis that expression of IRGM, a GTPase with putative signaling function, can actually regulate rates of autophagy, we next overexpressed IRGM in HeLa cells. Modest overexpression of IRGM enhanced autophagy of Salmonella, indicating that endogenous cellular levels of IRGM limit autophagic efficiency. These results indicate that the expression level of IRGM can regulate the efficiency of the anti-bacterial autophagic response.
Together, these results establish that the risk and protective alleles of IRGM differ strongly in the extent to which they are expressed in different cell types, and that expression levels of IRGM regulate the efficiency of anti-bacterial autophagy; they also identify a large deletion polymorphism upstream of IRGM resulting in population segregation of IRGM with two distinct upstream sequences, which we propose as a candidate explanation for the observed difference in expression patterns and association to CD.
The study of autophagy has to date relied upon knockout or siRNA ablation of gene products, revealing little of how the regulation and signaling involved in initiation of autophagy are affected by expression levels. Although components of the autophagic core apparatus may be required in only catalytic amounts (ref. 11), it is likely that the signaling molecules that initiate autophagy must exceed a threshold before initiation takes place (ref. 12); in addition, active signaling molecules may be quickly sequestered by the local autophagy machinery. The hypothesis that the degree of expression of such signaling molecules limits rates of autophagy is supported by our data indicating that IRGM overexpression enhances the anti-bacterial autophagic efficiency of HeLa cells.
The CD risk and protective haplotypes, which carry different genomic sequences upstream of IRGM, showed different patterns of tissue-specific expression of IRGM. The extent to which human gene expression variation is tissue specific is not yet known, as large-scale surveys of the genetic basis of human gene expression variation have primarily used a single cell type (lymphoblastoid cell lines). A recent study of allelic imbalance in the liver, spleen and brain of F1 mice suggests that tissue specificity of expression variation is common: one-third (11/33) of genes with allelic imbalance showed differences in allelic imbalance between tissues, and several (3/33) showed strongly opposite allelic effects in different tissues (ref. 10), analogous to our observations for IRGM. The replacement of the upstream sequence of a gene by genomic polymorphism may also increase the prior likelihood of a complex pattern of expression differences such as that observed at IRGM.
IRGM seems to have arrived at its primate genomic location as a small translocation or retroposition of an ancestral gene that was encoded elsewhere in the genome; the genomic region upstream of IRGM at this new locus has subsequently undergone intense evolutionary change, with heavy modification by retroposons along the primate lineage (Supplementary Note online). Although the reproducible cellular phenotype of multiple IRGM siRNAs (ref. 3) indicates that IRGM is expressed at a level sufficient to be functional, we found no conserved transcription factor binding sequences at IRGM, and its expression in most tested cell lines was low. One or more unknown genomic features seem to be able to activate the transcription of IRGM at a low but functionally relevant level. The most likely candidates may be among the 33 subfamilies of retroposon sequences that have populated the region upstream of IRGM during primate evolution (Supplementary Note); such sequences are increasingly observed to have tissue-specific enhancer properties, although the association of specific sequences with specific expression patterns is at an early stage (refs. 13–18).
The extent to which linkage disequilibrium–based approaches will be able to identify associations between genome structural polymorphisms and disease risk is the subject of intense debate (refs. 6,19–23). Here we have identified such an association by combining SNP association data with linkage disequilibrium analysis of a common structural polymorphism we found in the associated region. Genome-wide maps of human structural polymorphisms and the SNP haplotypes on which they segregate, together with data from genome-wide SNP association studies, could in principle enable large-scale investigation of the relationships between structural polymorphisms and human disease.