The claudin-1 gene (CLDN1) is a member of a family of genes that encodes proteins found in tight junctions, and it has recently been implicated as one of several receptors for late-stage binding of hepatitis C virus (HCV). Exploration of the population genetics of this gene could be informative, especially in the investigation of a possible genetic contribution to HCV infection. Comparison to a highly similar gene, claudin-7 (CLDN7), could provide insight into the recent molecular evolution of CLDN1. Mean interspecies conservation score was 0.11 (SD 0.28) for CLDN1 and 0.31 (SD 0.43) for CLDN7. Re-sequence analysis was performed across all exons and evolutionarily conserved regions in CLDN1 (13 kb in total) and CLDN7 (2 kb in total) in 204 chromosomes drawn from the SNP500Cancer resource of four self-described ethnic groups in the US. For CLDN1, 133 SNPs were identified as well as 8 indels and an AC repeat length polymorphism. For CLDN7, 5 SNPs were identified. Assessment of nucleotide diversity (including Fst, θ and π statistics) did not show evidence for recent positive or negative selection in either gene. The pattern of linkage disequilibrium was determined for each group, and there are substantial differences for common SNPs (MAF >5%) between populations as well as between genes, further supporting the absence of signatures of recent selection.
The claudins are a family of highly related proteins that are important in tight junction formation and function. Claudin-1 was initially found in the liver of chickens. Since that discovery, 21 members of this family of proteins have been identified in humans. CLDN gene expression is frequently altered in human cancers and claudin-1 expression has been found to be downregulated in various cancers. Loss of function germline mutations in CLDN1 were found in patients with neonatal ichthyosis-sclerosing cholangitis (NISCH) syndrome [3, 4].
Hepatitis C virus (HCV) is a blood-borne infection with a worldwide distribution that varies markedly based on the risk of exposure to contaminated blood. It is estimated that about 130–170 million people have been infected with HCV. After initial infection with HCV, about 75–80% of people develop chronic infection, which is an important cause of cirrhosis, end-stage liver disease and hepatocellular carcinoma. The age of HCV is not known. Determining when a virus originated is generally speculative and this is especially true for HCV, which was discovered only 20 years ago. On the basis of the rate of sequence change, HCV genotypes are estimated to have originated 500–2,000 years ago; however, this relatively short time period seems inconsistent with the widespread epidemiological distribution of HCV in human populations.
Recently it was shown that claudin-1, which is highly expressed in the liver, is required for HCV cell entry. Residues within the first extracellular loop of claudin-1 are critical for viral entry and mutations in this region render cell lines less permissive to HCV infection. Based on protein sequences, claudin-7 is the closest related member to claudin-1. It has been reported that claudin-7 may play a role in infection of CD4-negative cells by human immunodeficiency virus type 1 (HIV-1).
Studies of CCR5 Δ32, a loss of function mutation in the gene that encodes the primary HIV-1 co-receptor, have shed light on HIV-1 pathogenesis, and novel therapies that mimic this allele by blocking CCR5 are effective against HIV-1 infection [11, 12]. Genetic variation in CLDN1, which encodes an HCV receptor, could be associated with susceptibility, outcomes and treatment in HCV infection, but further studies are needed.
The requirement of claudin-1 for cell entry by HCV and the high worldwide prevalence of HCV infection raise the possibility that genetic variation in this gene could be under selective pressure of HCV or a related virus. The current tools of analyses of genetic variation in populations with different histories are well suited for determination of recent positive selection. With re-sequence data, it is also possible to look for indications of purifying selection in one or more populations.
We investigated common genetic variation in CLDN1 in four self-described ethnically distinct U.S. populations to look for genetic variation and signatures of selection. Furthermore, we compared CLDN1 with CLDN7, which is the most similar of the genes in the claudin family.
Re-sequencing analysis was performed on genomic DNA from the SNP500Cancer population, which consists of 102 unrelated individuals of self-described heritage: 24 of African/African American heritage (AA), 31 of Caucasian heritage (CA), 23 of Hispanic heritage (HI) and 24 of Pacific Rim heritage (PR) (http://snp500cancer.nci.nih.gov). With 204 chromosomes bi-directionally sequenced, the detection rate for SNPs with a minor allele frequency (MAF) >5% and >1% is 99% and 87%, respectively.
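The detection rates quoted above can be approximated with a simple binomial argument: a variant is detected if its minor allele appears at least once among the sampled chromosomes. The sketch below assumes independent sampling of chromosomes and error-free base calling; the function name is hypothetical and the calculation is not necessarily the one used in the cited reference.

```python
def detection_probability(maf: float, n_chromosomes: int) -> float:
    """Probability that the minor allele is seen at least once.

    Assumes each chromosome independently carries the minor allele
    with probability `maf` (a simplification, not the paper's method).
    """
    return 1.0 - (1.0 - maf) ** n_chromosomes


# 204 chromosomes = 102 diploid individuals, as in the study sample.
for maf in (0.05, 0.01):
    p = detection_probability(maf, 204)
    print(f"MAF {maf:.0%}: detection probability {p:.3f}")
```

Under these assumptions the probability for a 1% variant is about 0.87, in line with the figure quoted in the text, and the probability for a 5% variant exceeds 0.99.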
Primers were designed using the web-based version of the Primer3 software (http://fokker.wi.mit.edu) to cover the exons and evolutionarily conserved regions within the gene. Sequence analysis was performed on amplified fragments generated by standard PCR techniques. 20 ng of genomic DNA was used per reaction and amplification conditions were as follows: initial 95°C for 10 min; 35 cycles at 94°C for 30 s, an assay-specific optimized temperature between 54°C and 69°C for 45 s, 72°C for 45 s and a final extension at 72°C for 10 min. Bi-directional sequence analysis was performed using the Dye Terminator method (ABI Perkin-Elmer, Foster City, Calif., USA) and was analyzed on ABI Perkin-Elmer platforms (model 3730) with Sequence Analysis 3.7 software and Seqscape software version 2.5 (Applied Biosystems).
The SNP frequencies were determined and haplotypes were estimated using PHASE version 2.1. DnaSP version 4.0 was used to estimate nucleotide diversity, as well as to identify synonymous and nonsynonymous substitutions. Using this program we also calculated Tajima's D test statistic proposed by Tajima (1989), equation 38, for testing the hypothesis that all mutations are selectively neutral. The D test is based on the difference between the number of segregating sites and the average number of nucleotide differences. The confidence limits of D (two-tailed test) are obtained assuming that D follows the beta distribution (Tajima 1989, equation 47), i.e. the confidence limits given in table 2 of Tajima. Fst was calculated as described by Hudson et al. Haploview version 3.31 (http://www.broad.mit.edu/mpg/haploview/index.php) was used to determine the degree of linkage disequilibrium (LD) and generate plots based on r² values. Haplotype blocks were determined using the confidence intervals (CIs) set by Gabriel et al. (upper CI = 0.98, lower CI = 0.7). HapMap Phase II data were used to look at LD in the regions flanking the genes, namely about 25 kb on each side. The data were analyzed comparing the African, European and the combined Japanese and Chinese populations with each other. All SNPs and their frequencies in these populations can be found in Genewindow (http://genewindow.nci.nih.gov).
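To make the relationship between π, θ and Tajima's D concrete, the following is a minimal sketch of the standard textbook formulas (Watterson's estimator and Tajima's 1989 normalization). It is an illustration only, not the DnaSP implementation: the function name and the input format (a list of aligned, phased haplotype strings without gaps or missing data) are assumptions.

```python
from collections import Counter
from math import sqrt


def diversity_stats(haplotypes):
    """Return per-site pi, per-site Watterson's theta and Tajima's D.

    haplotypes: equal-length strings, one per sampled chromosome
    (hypothetical input format; gaps and missing data are not handled).
    """
    n = len(haplotypes)
    if n < 2 or len({len(h) for h in haplotypes}) != 1:
        raise ValueError("need at least two aligned haplotypes of equal length")
    length = len(haplotypes[0])
    n_pairs = n * (n - 1) / 2

    seg_sites = 0          # S: number of segregating sites
    pairwise_diffs = 0.0   # summed over sites and over all pairs
    for column in zip(*haplotypes):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            identical = sum(c * (c - 1) / 2 for c in counts.values())
            pairwise_diffs += n_pairs - identical

    k = pairwise_diffs / n_pairs            # mean pairwise differences
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = seg_sites / a1                # Watterson's estimator

    # Variance terms from Tajima (1989)
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)

    d = (k - theta_w) / sqrt(variance) if variance > 0 else None
    return {"pi": k / length, "theta": theta_w / length, "tajima_d": d}


# Toy example with four chromosomes and three segregating sites.
stats = diversity_stats(["AAAA", "AAAT", "AATT", "ATTT"])
print(stats)
```

A negative D indicates an excess of rare variants relative to the neutral expectation and a positive D an excess of intermediate-frequency variants; as noted later in the text, demography and recombination also shift D, so the sign alone is not evidence of selection.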
Evolutionary conservation was estimated comparing 28 vertebrates (Human, Armadillo, Bushbaby, Cat, Chicken, Chimpanzee, Cow, Dog, Elephant, Frog, Fugu, Guinea pig, Hedgehog, Horse, Lizard, Medaka, Mouse, Opossum, Platypus, Rabbit, Rat, Rhesus, Shrew, Stickleback, Tenrec, Tetraodon, Tree shrew and Zebrafish) with the use of a phylogenetic hidden Markov model, phastCons, in the UCSC genome browser (http://genome.ucsc.edu/cgi-bin/hgGateway). This method calculates the posterior probability that each site was generated by the conserved state (Conservation score). Evolutionary conservation was also examined using the subset of 17 of these species that are placental mammals and, therefore, reflect more recent evolution. Evolutionarily conserved regions were determined using a cut-off of 0.7 for the conserved areas.
Twenty-one human claudin gene DNA sequences were downloaded from Ensembl release 45, June 2007 (http://www.ensembl.org). ClustalW software was used to produce a multiple sequence alignment and Jalview software was used to visualize the results. ClustalW phylogenetic calculations were based on the neighbor-joining method of Saitou and Nei.
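As an illustration of the thresholding step, the sketch below takes a list of per-base conservation scores and returns the runs of consecutive positions scoring above the cut-off, together with the mean and standard deviation of the scores. The function name, the input format and the use of a strict ">" comparison are assumptions; the text does not specify a minimum region length, so none is applied here.

```python
from statistics import mean, pstdev


def conserved_regions(scores, cutoff=0.7):
    """Return half-open (start, end) index ranges where score > cutoff."""
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score > cutoff and start is None:
            start = i                      # a conserved run begins
        elif score <= cutoff and start is not None:
            regions.append((start, i))     # the run ended at i - 1
            start = None
    if start is not None:                  # run extends to the last base
        regions.append((start, len(scores)))
    return regions


# Hypothetical per-base scores, for illustration only.
example = [0.1, 0.8, 0.9, 0.2, 0.75, 0.95]
print(conserved_regions(example))
print(round(mean(example), 2), round(pstdev(example), 2))
```

With real data the scores would be read from a phastCons track export; the positions returned here are indices into the score list, not genomic coordinates.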
In the resequence analysis of CLDN1, which spans 16,723 bp on chromosome 3 (chr3: 191,506,187–191,522,909) and has 4 exons, 13,458 bp were analyzed: 12,602 bp (75%) of the genic region, plus 756 bp in the 5′ untranslated region (UTR) and 100 bp in the 3′ UTR. A total of 133 SNPs were identified, yielding an average of 1 SNP per 95 bp or 10.6 SNPs per kb. Of these SNPs, 57 had a MAF >5%. Six coding variants were observed of which two nonsynonymous SNPs, A124T and A135I, were each observed only once among the 102 individuals. In addition, there were 8 indels, 5 in intron 1 (MAFs, 1–11%), 1 in intron 3 (MAF, 1%) and 1 in a non-coding region of exon 4 (MAF, 1%). In the 5′ UTR region an AC repeat length polymorphism was found 578 bp upstream from the start of exon 1.
The resequence analysis of CLDN7 (chr17: 7,104,182–7,106,513), which has 60% sequence similarity to CLDN1 and includes 2,331 bp, targeted 1,989 bp across 61% of the gene including the four exons, 468 bp of the 5′ UTR and 103 bp of the 3′ UTR. Only five SNPs were found in this region. The density of SNPs across CLDN7 is 1 in 398 bp, which is substantially lower than that of CLDN1. One nonsynonymous SNP, A197V (MAF, 30% across all four populations) was observed. Additionally, one synonymous SNP and three SNPs outside the coding regions were detected. Two indels were found, one in the promoter region of the gene (MAF, 10.3%) and one in the non-coding region of exon 1 (MAF, 40%).
Table 1 shows the nucleotide diversity, which was calculated using the average number of nucleotide differences per site between two sequences (π) and the population mutation parameter (θ) for each gene within each subpopulation (African American, Caucasian, Hispanic and Pacific Rim), as well as overall (n = 102). For CLDN1, π was highest in African Americans (1.5 × 10⁻³), but similar in the other subpopulations (1.2 × 10⁻³ to 1.4 × 10⁻³). There was no difference in θ between the populations. Tajima's D statistic, which is based on the number of segregating sites, was calculated for CLDN1 (table 1A). Tajima's D statistic was not significant for the four groups or overall.
The results for CLDN7 are shown in table 1B. π was highest in the Pacific Rim and lowest in the African American group. θ did not show any difference between the populations and neither did Tajima's D.
We calculated the Fst statistic, a measure of population differentiation, to compare the ethnic groups pairwise. For CLDN1 the highest degree of differentiation (Fst = 0.05) was observed between Caucasian and African American populations and the lowest (Fst = 0.005) between Caucasians and Hispanics (table 2A). These data suggest little or no differentiation between the 4 populations. In comparison, the highest degree of differentiation for CLDN7 was observed between African American and Pacific Rim populations (Fst = 0.10) and the lowest (Fst = −0.02) was between the Caucasian and Hispanic populations (table 2B).
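For readers unfamiliar with sequence-based Fst, the following sketch implements the commonly used form of the estimator of Hudson et al., Fst = 1 − Hw/Hb, where Hw is the mean number of pairwise differences within populations and Hb the mean number between populations. The function names and the haplotype-string input are assumptions, and Hw is taken here as the unweighted average of the two within-population values, which may differ from the weighting used in the original analysis.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_within(pop):
    """Mean pairwise differences among sequences of one population."""
    pairs = list(combinations(pop, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def mean_between(pop1, pop2):
    """Mean pairwise differences between sequences of two populations."""
    pairs = list(product(pop1, pop2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(pop1, pop2):
    """Fst = 1 - Hw / Hb for two populations of aligned haplotypes."""
    hw = (mean_within(pop1) + mean_within(pop2)) / 2
    hb = mean_between(pop1, pop2)
    if hb == 0:
        raise ValueError("no differences between populations; Fst undefined")
    return 1 - hw / hb


# Toy haplotypes, for illustration only.
print(hudson_fst(["AA", "AT"], ["TT", "TA"]))
```

Because Hw can exceed Hb by chance in small samples, this estimator can return slightly negative values, which is how a value such as −0.02 arises; such values are usually read as no detectable differentiation.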
LD was determined by group and, when possible, haplotype blocks were estimated (fig. 1). Interestingly, neither gene displayed a significant degree of LD across the coding region of the gene. In fact, there is minimal LD across CLDN1, which has only one small block in the 3′ end of the gene. Using the HapMap Phase II data we looked at LD across the wider gene region. An area of 50 kb with the gene in the middle was chosen (fig. 2). CLDN1 lies in an area between two blocks of strong LD. Both the 5′ UTR and the 3′ UTR show blocks of LD in all three HapMap populations (fig. 2A). CLDN7 is found in a block of strong LD in the European population. LD disintegrates over a shorter distance in the two Asian populations combined, as well as in the African population (fig. 2B).
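The r² measure underlying the LD plots can be computed directly from phased haplotypes at two biallelic sites. The sketch below uses the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)) with D = pAB − pA·pB; it is not Haploview's code, and the function name and input format are assumptions.

```python
def r_squared(haplotypes, i, j):
    """r^2 between biallelic sites i and j of phased haplotype strings."""
    n = len(haplotypes)
    # Use the alleles on the first haplotype as the reference alleles;
    # r^2 does not depend on which allele is chosen at a biallelic site.
    ref_i, ref_j = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == ref_i for h in haplotypes) / n
    p_b = sum(h[j] == ref_j for h in haplotypes) / n
    p_ab = sum(h[i] == ref_i and h[j] == ref_j for h in haplotypes) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b               # coefficient of disequilibrium
    return d * d / denom


# Two toy cases: complete LD and no LD.
print(r_squared(["AC", "AC", "GT", "GT"], 0, 1))
print(r_squared(["AC", "AT", "GC", "GT"], 0, 1))
```

Values near 1 mean that the allele at one site predicts the allele at the other almost perfectly, whereas values near 0 correspond to the minimal LD described for most of CLDN1.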
The mean conservation score for CLDN1 was 0.11 (SD: 0.28) for 28 vertebrate species and 0.13 (SD: 0.31) for 17 placental species. CLDN7 had a mean conservation score of 0.31 (SD: 0.43) for the vertebrate species and 0.27 (SD: 0.42) for the placental species.
Apart from the coding regions, areas of CLDN1 with conservation >0.7 were found in intron 1 and in the intronic areas around exons 3 and 4 for both the 28 vertebrate species (fig. 3A) and the 17 placental species (fig. 3B). This similarity suggests the preservation of sequences that could have functional activities. CLDN7 shows similar conservation in intron 1, but not in the intronic areas around exons 3 and 4 (fig. 3C). As in CLDN1, most conservation in CLDN7 seems to be old because there is little difference between all 28 vertebrate species and the subset of 17 placental species (fig. 3D).
Previously it was shown that the protein sequences of claudin-1 and claudin-7 cluster together within the protein family of claudins. We compared the DNA sequences of the family of genes using ClustalW and found that CLDN1 and CLDN7 cluster together along with CLDN19 (fig. 4).
In a comparative re-sequence analysis of the two most closely similar genes in the claudin family, CLDN1 and CLDN7, we observed no evidence for recent positive or negative selection. As expected, our re-sequence analysis revealed that the pattern of genetic diversity differs for these two genes, particularly in distinct populations, which we estimated by analyzing the four self-described populations of the SNP500Cancer set. Moreover, neither gene shows any signatures of selection, and our sample size was sufficiently large to detect small signatures. Although several rare nonsynonymous variants are found in these genes, given the sample size, it is not possible to make a case for purifying selection. Like most genes studied to date, the analysis does not support a recent signature of positive selection and we conclude that CLDN1, which is a critical receptor for HCV, appears to be neutral with regard to selection.
Previously, loss of function mutations in CLDN1 have been reported in children presenting with NISCH syndrome [3, 4]. In a Swiss girl with NISCH syndrome, a homozygous deletion in exon 2 caused a premature stop codon and a truncated protein. Four individuals with NISCH syndrome from 2 inbred Moroccan families shared a homozygous deletion in exon 1, which was absent in 52 unrelated Moroccan controls. We found neither of these variants in our study population. When we examined the catalogue of genetic variants across CLDN1, we noted that the first extracellular loop of the protein, which appears to be crucial for HCV binding, did not reveal nonsynonymous SNPs. Therefore, we observed no evidence for a common loss of function variant that could restrict disease, as has been established in HIV-1 infection for the CCR5 Δ32 allele [10, 24, 25]. The population genetic parameters, estimated using our sequencing data for both CLDN1 and CLDN7, appear to be consistent with selective neutrality, which argues against the selective pressure of HCV, a viral disease that has recently become common worldwide.
The nucleotide diversities for CLDN1 (1.4 × 10⁻³) and CLDN7 (0.3 × 10⁻³) fall well within the range found in other genes by direct sequencing. In a resequencing study of the T cell receptor, Mackelprang et al. found a nucleotide diversity (π) of 12.7 × 10⁻⁴. In a study of 213 environmental response genes, π ranged from 0.72 × 10⁻⁴ to 19.3 × 10⁻⁴ with the overall π = 6.7 × 10⁻⁴. And in a study using 180 SeattleSNPs genes comparing African American and European American populations, the former group was found to have a greater nucleotide diversity (9.01 × 10⁻⁴ and 6.96 × 10⁻⁴, respectively). The difference in π between CLDN1 and CLDN7 might be explained by the larger non-coding region of CLDN1, or likely reflects the normal variance of genes, in other words, within the range for selectively neutral genes.
In a genome-wide genotyping study of East Asian, African-American, and European-American individuals, the average Fst for autosomal SNPs was 0.107 in coding regions, 0.118 in intronic regions and 0.123 overall. However, these estimates were based on selected genotyping of SNPs rather than resequencing as was done in our study. In our analysis of CLDN1, the highest Fst was 0.05 (for the comparison of African American and Caucasian populations). Because this is below what was found in genome-wide analysis, it suggests no difference in selection between any of the populations studied. We found that CLDN7 had a similar Fst with the highest value (0.1) for the comparison of the African American and Pacific Rim populations.
Various studies report on the estimates of Tajima's D. The mean value for Tajima's D in resequenced genes was estimated to be −0.54 for people of African descent and 0.26 for people of European descent. This was substantially different from the 0.94 and 1.25 for the same groups as estimated from a dataset of dense SNP genotyping. In 180 SeattleSNPs genes the average Tajima's D was found to be −0.51 in African Americans and 0.18 in European Americans. In our study, Tajima's D for the 4 populations combined was −0.6 for both CLDN1 and CLDN7. As was shown in a resequencing study of 132 genes, this does not represent evidence for recent selective pressure.
The DNA sequences of CLDN1 and CLDN7 cluster together with CLDN19. These 3 genes also cluster together based on protein sequences, as reported in the past. CLDN1 is about 7.3 times larger than CLDN7, although both genes encode a protein of 211 amino acids. Despite these differences in intronic regions, these genes cluster together in the ClustalW analyses of both protein and DNA sequences within the claudin family of genes, suggesting a common ancestor gene.
For both CLDN1 and CLDN7, a high degree of conserved sequence in intron 1 was observed among 28 different vertebrate species, which is expected because of the putative regulatory sequences in the first intron. The genes also show a high degree of similarity in protein and DNA sequence, as expected in a family of genes with overlapping functional attributes in tight junction formation.
Recently it was shown that alignment analysis has uncertainties and that different programs can lead to different conclusions. Interpretation of our data should be done with this caution in mind. The same is true for summary statistics such as Tajima's D, as population demography and the recombination rate can influence the outcome of this test. In this study we resequenced the 2 genes in 102 individuals. With this approach, we estimate a 99% probability of finding a segregating site that in a larger data set would be estimated to have a MAF over 5%. However, less common SNPs have a lower chance of being found, which might influence our results. The use of HapMap data also has its limitations, especially the use of haplotype estimates.
In conclusion, we conducted a resequence analysis of claudin genes implicated in HCV entry and found no evidence of positive selection, thus suggesting that the genes are most likely selectively neutral. In this regard, functional mutations were not found in otherwise healthy individuals, suggesting a degree of constraint on these two genes, of which one is a late stage receptor for HCV infection.
We would like to thank Renee Chen, Nicholas Orr, Hye Kim and Jun Fang for their technical assistance, as well as Charles Rice and Thomas von Hahn for helpful conversations about claudin-1.