While only a small proportion of the genome codes for proteins and regulatory RNAs, cis-regulatory elements (CREs), the DNA sequences controlling the expression of the coding segments, are located in the vast non-coding portion of the genome
1. It is therefore not surprising that genome-wide association (GWA) studies are linking an increasing number of human diseases to non-coding DNA, most likely with regulatory function (reviewed in
2,3). However, in these cases, assigning the candidate disease gene may not be straightforward: CREs can act over long distances, and their target gene may not be the one closest to the CRE (see, for example,
4). Therefore, methods for predicting which gene or genes are regulated by particular non-coding genome segments should help identify the candidate disease gene in cases where the lesion lies in a non-coding region.
Research from many laboratories has shown that the 11-zinc-finger nuclear protein CCCTC-binding factor (CTCF) contributes to the regulation of gene expression and to the higher-order organization of the genome
5. CTCF is evolutionarily conserved and widely distributed along the vertebrate and
Drosophila genomes
6–9. Although at present the primary function(s) of CTCF cannot be directly derived from its genomic distribution, some CTCF-bound sites are well known to function as regulatory boundaries, confining the range of action of CREs to genes within those boundaries (reviewed in
5,10). Several cofactors interact with CTCF, including the SNF2-like chromodomain helicase CHD8 and, as reported more recently, the DEAD-box RNA helicase p68
11,12. CTCF also binds to the cohesin complex at a large number of genomic sites
13–15. Indeed, at several loci, the cohesin complex seems to regulate the insulator activity of CTCF
13–15. Constitutive CTCF-bound sites are more likely to serve this function, while more labile sites may be involved in tissue-specific regulation of gene expression. In fact, a proportion of CTCF sites have been shown to be constitutively occupied in several human cell types and even to be conserved between human and mouse cell types
7,16. This conservation might extend even further evolutionarily, since the development of the shared body plan of vertebrates is controlled by a likewise shared set of transcription factors and signaling molecules deployed in similar patterns
17. However, genome-wide CTCF distribution has not yet been examined outside mammals. If CTCF-bound sites are found at syntenic positions in different vertebrates, these evolutionarily conserved boundaries could be used to resolve ambiguous assignments of the target genes affected by mutations in non-coding regions in human diseases, as is the case for Multiple Sclerosis and the
GFI1 and
EVI5 genes.
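The reasoning above, that a non-coding variant is more plausibly assigned to a gene sharing its boundary-delimited domain than to the nearest gene, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the function names, the placeholder gene names, and the toy coordinates are hypothetical and do not correspond to real CTCF sites, to the loci discussed here, or to the method used in this study.

```python
from bisect import bisect_right

def domain_of(position, boundaries):
    """Return the (left, right) boundary pair enclosing `position`.

    `boundaries` must be a sorted list of boundary coordinates on one
    chromosome. None marks an open end beyond the outermost boundary.
    """
    i = bisect_right(boundaries, position)
    left = boundaries[i - 1] if i > 0 else None
    right = boundaries[i] if i < len(boundaries) else None
    return left, right

def genes_in_same_domain(variant_pos, boundaries, gene_tss):
    """List genes whose start site lies in the same domain as the variant."""
    variant_domain = domain_of(variant_pos, boundaries)
    return sorted(
        gene for gene, tss in gene_tss.items()
        if domain_of(tss, boundaries) == variant_domain
    )

def nearest_gene(variant_pos, gene_tss):
    """Naive assignment: the gene whose start site is closest to the variant."""
    return min(gene_tss, key=lambda gene: abs(gene_tss[gene] - variant_pos))

# Toy, hypothetical data: three boundary sites and two placeholder genes.
toy_boundaries = [100, 500, 900]
toy_gene_tss = {"GENE_A": 300, "GENE_B": 520}
toy_variant = 490  # lies just left of the boundary at 500

by_distance = nearest_gene(toy_variant, toy_gene_tss)
by_domain = genes_in_same_domain(toy_variant, toy_boundaries, toy_gene_tss)
print("nearest gene:", by_distance)
print("same-domain genes:", by_domain)
```

In this toy case the two assignments disagree: the closest gene lies across a boundary, whereas the more distant gene shares the variant's domain. A real analysis would additionally need experimentally determined binding sites and evidence that those sites actually act as boundaries.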
Multiple Sclerosis (MS, [MIM 126200]) is the most common progressive and disabling neurological condition affecting young adults in the world today. The overall prevalence of MS ranges from 2 to 150 per 100,000 individuals. Pathogenetically, MS is considered an autoimmune disease leading to the demyelination of central nervous system axons
18. From a genetic point of view, MS is considered a complex disorder resulting from a combination of genetic and non-genetic factors
19. Apart from the human leukocyte antigen (HLA) locus, which is recognized as the strongest locus for MS in most populations, other genetic factors involved in MS remained elusive until the arrival of Genome-Wide Association Studies (GWAS) (The MSGene Database.
http://www.msgene.org/.). To date, seven GWAS have been performed for MS; even though study design and results vary substantially between experiments, some new susceptibility genes have been identified and replicated using this approach
20. However, even after convincing replications, the location of the causal variant(s) at most of these loci remains to be determined. Several GWAS found a set of MS-associated polymorphisms belonging to the same linkage disequilibrium block, located in a region containing the
GFI1 (growth factor-independent 1),
EVI5 (ecotropic viral integration site 5),
RPL5 (ribosomal protein L5) and
FAM69 (family with sequence similarity 69) genes
21,22,23. Fine mapping of this genomic region pointed to polymorphisms located within the 17th intron of the
EVI5 gene as the most probable causal variants of the association
24. However, these findings did not clarify the functional role of this
EVI5 risk region. Our analysis of the CTCF sites within this genetic block indicates that the 17th intron of the
EVI5 gene likely belongs to the
GFI1, and not the
EVI5, regulatory domain. We further demonstrate that this intron indeed contains CREs that contact the
GFI1, but not the
EVI5, gene. We finally show that increased
GFI1, but not
EVI5, expression is associated with the MS risk haplotype. We therefore conclude that
GFI1, and not
EVI5, is the causal gene associated with MS.