The von Willebrand factor (VWF) gene is highly polymorphic, with variants correlated with VWF antigen levels, adhesion activity, clearance, and factor VIII binding. VWF mutations are detected in patients with von Willebrand disease (VWD), whereas polymorphic variants could be associated with thrombosis. However, information on the ethnic diversity of VWF variants and their association with diseases is limited.
To characterize novel VWF variants from different ethnicities in the general population.
We analyzed samples from 1,092 subjects of 14 ethnicities available in the 1000 Genomes database for VWF variants and their potential functional impacts.
We identified 2,728 SNPs and 91 insertions and deletions that had a high level of ethnic diversity, with Africans having the highest number of variants. The highest level of diversity was found in the D′ and D2 domains. Among 94 non-synonymous variants, 31 were predicted to be deleterious and 19 had previously been associated with VWD. Most of these “VWD variants” had allele frequencies consistent with disease incidence in European subjects, but some had a significantly higher frequency in other ethnicities. The mutations R2185Q, H817Q, and M740I, which are associated with type 1 and type 2N VWD, were present in more than 13% of African subjects.
These results highlight the complexity of VWF variations in different ethnic groups and emphasize the importance of interrogating variations on multiple ethnic backgrounds for associations with bleeding and thrombosis.
Von Willebrand factor (VWF) affixed to the subendothelium mediates the initial tethering of platelets at sites of vessel injury. VWF is initially synthesized as a peptide precursor of 2,813 amino acids with well-defined domains in the order of D1-D2-D′-D3-A1-A2-A3-D4-B1-B2-C1-C2-CK. The N-terminal D domains are critical for VWF multimerization (D1-2 and D′) and contain the binding site for coagulation factor VIII (D′ and D3 domains). Binding sites for the platelet GP Ib-IX-V complex and integrin αIIbβ3 are located in the A and C domains, respectively. The absence or a low level of VWF and/or adhesive activity is associated with bleeding found in patients with von Willebrand disease (VWD) [2, 3], whereas an elevated level of VWF and/or enhanced adhesion activity is a well-established risk factor for thrombotic diseases such as myocardial infarction and stroke [4–7]. VWF also contributes to the development of atherosclerosis, especially at sites of vascular bifurcations, where blood flow is turbulent. This bidirectional activity suggests that VWF is tightly regulated in order to achieve efficient hemostasis without promoting thrombosis.
The VWF gene spans ~180 kilobases on Chromosome 12 [9, 10] with 52 exons. Changes in the VWF gene could alter VWF biosynthesis, secretion, clearance, and adhesion activity. The ISTH-SSC VWF Online Database (http://www.vwf.group.shef.ac.uk/vwd.html) lists ~500 mutations and polymorphic variants that were associated with VWD. Single nucleotide polymorphisms (SNPs) in exons, the 5′ regulatory region, and introns are also reported to influence levels of VWF antigen and FVIII activity in healthy subjects [12–15]. Some of these VWF SNPs are associated with an elevated risk for thrombosis [5, 16].
Previous studies have found a large number of polymorphisms at the VWF locus [17, 18], primarily in individuals of European ancestry. A recent study by Bellissimo et al. showed that mutations previously considered to be causative for VWD in European subjects have allele frequencies of up to 20% in African Americans. These variants may be false positives in VWD associations from the initial scans, but could also be pathogenic in Europeans while non-pathogenic in other ethnicities. Studying genetic variants in multiple populations is a powerful approach to enrich the number of polymorphisms (particularly rare alleles with a frequency of less than 1%). Ethnicity has been increasingly recognized as having a confounding effect on VWF expression, making genotypic and phenotypic association more complex [5–7]. Genetic diversity shaped by population demographics and environmental covariates should be taken into consideration in the identification and interpretation of pathogenic mutations. With the rapid dissemination of next generation sequencing (NGS) technologies, interrogating genetic polymorphisms in a large number of samples from diverse ethnic backgrounds has become feasible [20–22]. Using this technology, the ongoing 1000 Genomes Project (1000G) examined genomic variations among 14 ethnicities from four continents: Africa, America, Asia, and Europe [20–22]. More than 38 million genetic variants were detected in the genomes of 1,092 subjects of 14 ethnicities around the world in the Phase 1 Project, a significant enrichment over the previous public databases. Our study presents the characterization of allelic diversity at the VWF gene. This study can detect SNPs at 0.1% frequency with a power of 90% in the exome and nearly 70% across the genome, which allows us to uncover rare ethnic-specific variants and provides new insights into the ethnic diversity of the VWF gene.
We obtained VWF variants from the April 2012 Integrated Variant Set release of the 1000 Genomes Project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz) for 1,092 subjects of 14 ethnicities originating from four continents (Table S1, http://1000genomes.org). The 14 ethnicities are: Yoruba in Ibadan, Nigeria (YRI); African Americans in Southwest US (ASW); Luhya in Webuye, Kenya (LWK); Mexican Ancestry in Los Angeles (MXL); Colombian in Medellin, Colombia (CLM); Puerto Rican in Puerto Rico (PUR); Han Chinese in Beijing (CHB); Han Chinese in South China (CHS); Japanese in Tokyo (JPT); Utah residents with Northern and Western European ancestry (CEU); Finnish from Finland (FIN); English from Great Britain (GBR); Iberian in Spain (IBS); and Toscani in Italy (TSI). The exonic regions of the genome were captured and sequenced at a high coverage rate (average > 50X). The whole genome was shotgun sequenced at a low coverage rate (~2–6X). The false discovery rate (FDR) was estimated to be 1.6% for exonic SNPs, 1.8% for non-coding SNPs, and <5% for indels.
A region spanning positions 17,161,397 to 17,185,967 on Chromosome 22 (Build 37) contains the VWF pseudogene, VWFP, and a portion of the TPTEP1 gene. This region has 97% sequence homology with exons 23–34 of the VWF gene. To remove the potential influence of this homologous sequence on variations in the VWF gene, we first aligned the homologous sequences from the two regions on chromosome 12 and chromosome 22 using CLC Sequence Viewer 6.0.2. We then aligned variants in the VWF gene to the corresponding positions on the VWF pseudogene. We removed any variant whose alternate allele matched the reference allele of the VWF pseudogene at the corresponding position, or whose origin could not be identified because the flanking sequences of the VWF gene and the VWF pseudogene were identical. A total of 104 variants that met these criteria were considered to be potentially derived from the VWF pseudogene and were removed from further analysis. The VWF variant call file, with these possible false-positive variants removed, can be accessed at http://www.hgsc.bcm.tmc.edu/ftp-archive/VWFVariantsStudy/.
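The pseudogene filter described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the `Variant` type, the `pseudogene_ref` lookup (chromosome 12 position mapped to the aligned pseudogene reference base, with `None` marking positions whose origin is ambiguous), and all positions in the example are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Variant(NamedTuple):
    pos: int   # position on chromosome 12 (hypothetical values below)
    ref: str   # reference allele in the VWF gene
    alt: str   # alternate (variant) allele


def filter_pseudogene_variants(
    variants: List[Variant],
    pseudogene_ref: Dict[int, Optional[str]],
) -> List[Variant]:
    """Drop variants that may derive from the pseudogene.

    pseudogene_ref maps a chr12 position inside the homologous region to
    the pseudogene reference base aligned to it, or to None when the
    origin cannot be resolved (identical flanking sequences). Positions
    absent from the map are treated as outside the homologous region.
    """
    kept = []
    for v in variants:
        if v.pos in pseudogene_ref:
            base = pseudogene_ref[v.pos]
            # Remove if ambiguous, or if the variant allele is simply the
            # pseudogene's reference base at the aligned position.
            if base is None or base == v.alt:
                continue
        kept.append(v)
    return kept


# Toy example with made-up coordinates.
calls = [Variant(100, "A", "G"), Variant(200, "C", "T"),
         Variant(300, "G", "A"), Variant(400, "T", "C")]
aligned = {200: "T", 300: None, 400: "T"}
print([v.pos for v in filter_pseudogene_variants(calls, aligned)])
```

In this toy input, the variant at position 200 is dropped because its alternate allele equals the pseudogene base, and the one at 300 is dropped as ambiguous.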
We applied ANNOVAR to annotate VWF SNPs and insertions/deletions (indels). The reference genome we used was NCBI human genome build 37. The VWF gene spans positions 6,058,040 to 6,233,836 on Chromosome 12. The 5′ and 3′ untranslated regions are 255 bp and 141 bp, respectively. Variants that were not present in dbSNP129 (i.e., before the entry of SNPs from the 1000 Genomes pilot project and phase 1 project) were reported as novel.
Principal Component Analysis (PCA), which summarizes high-dimensional genetic variation data to infer population structure, was applied to characterize the distribution of VWF variations among different ethnic groups. The frequency of each variant was calculated in each ethnicity and used as the input. The top two principal components (PCs) were extracted for analysis.
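As a rough illustration of this step, the sketch below runs a PCA on a small population-by-variant frequency matrix and returns the top two PCs with their share of the total variance. It is a minimal, dependency-free version (power iteration on the centered Gram matrix), not the software used in the study; the toy frequencies and population labels are invented for the example.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def center_columns(matrix: Matrix) -> Matrix:
    """Subtract each variant's mean frequency across populations."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def mat_vec(g: Matrix, vec: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, vec)) for row in g]


def top_eigenpair(g: Matrix, iterations: int = 500) -> Tuple[float, List[float]]:
    """Leading eigenvalue/eigenvector of a symmetric matrix by power iteration."""
    vec = [1.0 + 0.1 * i for i in range(len(g))]  # deterministic start
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = mat_vec(g, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-15:          # nothing left to explain
            return 0.0, vec
        vec = [x / norm for x in nxt]
    value = sum(v * w for v, w in zip(vec, mat_vec(g, vec)))
    return value, vec


def pca_top2(freqs: Matrix) -> Tuple[List[List[float]], List[float]]:
    """Return ([PC1 scores, PC2 scores], [explained fraction 1, 2])."""
    centered = center_columns(freqs)
    # Gram matrix between populations (rows); small when populations are few.
    g = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    n = len(g)
    total = sum(g[i][i] for i in range(n))
    scores, explained = [], []
    for _ in range(2):
        value, vec = top_eigenpair(g)
        scores.append([v * math.sqrt(max(value, 0.0)) for v in vec])
        explained.append(value / total if total > 0 else 0.0)
        # Deflate so the next pass finds the following component.
        g = [[g[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return scores, explained


# Hypothetical frequencies: rows are populations, columns are variants.
toy = [
    [0.20, 0.14, 0.01],  # "AFR-like" population 1
    [0.19, 0.13, 0.02],  # "AFR-like" population 2
    [0.01, 0.01, 0.06],  # "EUR-like" population 1
    [0.00, 0.02, 0.05],  # "EUR-like" population 2
]
pcs, fractions = pca_top2(toy)
print(pcs[0], fractions)
```

The sign of each PC is arbitrary, so only the relative placement of populations along a component is meaningful.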
DnaSP software was used to determine the nucleotide diversity (pi, π) for different VWF domains in different populations [26, 27]:

$$\pi = \sum_{ij} x_i x_j \pi_{ij}$$

where $x_i$ and $x_j$ are the respective frequencies of the $i$th and $j$th sequences, and $\pi_{ij}$ is the number of nucleotide differences per nucleotide site between the $i$th and $j$th sequences.
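To make the definition concrete, the following sketch computes nucleotide diversity as the frequency-weighted sum of pairwise per-site differences over all ordered pairs of distinct sequences. It is a plain reading of the definition given here, not DnaSP's implementation (which may apply sample-size corrections); the sequences and frequencies are invented.

```python
from typing import Dict


def nucleotide_diversity(haplotypes: Dict[str, float]) -> float:
    """Sum over ordered pairs (i, j) of x_i * x_j * pi_ij.

    haplotypes maps each distinct, equal-length sequence to its
    population frequency (frequencies are assumed to sum to 1).
    """
    seqs = list(haplotypes)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    total = 0.0
    for si in seqs:
        for sj in seqs:
            # pi_ij: differences per nucleotide site between the two sequences
            pi_ij = sum(a != b for a, b in zip(si, sj)) / length
            total += haplotypes[si] * haplotypes[sj] * pi_ij
    return total


# Two hypothetical 4-bp haplotypes at equal frequency, differing at one site:
# 2 ordered pairs * 0.5 * 0.5 * (1/4) = 0.125
print(nucleotide_diversity({"ACGT": 0.5, "ACGA": 0.5}))
```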
Conservation scores of nucleotides in the VWF gene were from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/placentalMammals/. An average conservation score was calculated in order to evaluate the degree of conservation for different domains. A Chi-squared test was used to compare the distribution of allele frequencies.
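A Chi-squared comparison of allele frequencies can be sketched as a test on a table of allele counts (rows are populations, columns are variant and reference allele counts). The function below computes the Pearson statistic for any table with non-zero expected counts, and the p-value only for the one-degree-of-freedom case, where it has a closed form. The counts in the example are hypothetical and are not taken from the study.

```python
import math
from typing import List, Tuple


def chi_squared(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-squared statistic and degrees of freedom.

    Assumes every expected count is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def p_value_df1(stat: float) -> float:
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical allele counts: [variant alleles, reference alleles]
counts = [
    [96, 396],   # population A
    [4, 754],    # population B
]
stat, dof = chi_squared(counts)
print(stat, dof, p_value_df1(stat))
```

For tables with more than two populations, the statistic and degrees of freedom are still valid, but the p-value would need a general chi-squared distribution function from a statistics library.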
The phased 1000G data VCF files (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz) were converted into an input format suitable for Haploview. Haploview was applied to visualize linkage disequilibrium (LD) patterns of the VWF gene in subjects of different ethnicities. LD reflects the tendency of neighboring alleles to be transmitted together as a set of linked alleles. LD patterns were visualized using settings that (1) ignored the pairwise comparison of markers that were more than 500 kb apart, (2) examined only alleles with a frequency of 5% or more in the population, and (3) removed markers that deviated from Hardy-Weinberg equilibrium (Fisher’s exact test, P<0.001). LD blocks were defined using a confidence interval method.
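For readers unfamiliar with LD measures, the sketch below computes the two standard pairwise statistics, D′ and r², from phased haplotypes at two biallelic sites. It illustrates the quantities that LD plots summarize; it does not reproduce Haploview's confidence-interval block definition, and the haplotypes are invented.

```python
from typing import List, Tuple


def pairwise_ld(haplotypes: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (D', r^2) for two biallelic sites.

    Each haplotype is (allele at site A, allele at site B), coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("LD is undefined for a monomorphic site")
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies.
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r_squared


# Perfectly linked alleles: both measures equal 1.
linked = [(1, 1)] * 4 + [(0, 0)] * 6
print(pairwise_ld(linked))

# Alleles combined at random: both measures equal 0.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(pairwise_ld(independent))
```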
We identified 2,728 SNPs and 91 indels in the VWF gene after removal of variants that are potentially from the VWF pseudogene. All of the 91 indels were in intronic regions. Among the VWF SNPs, 2,573 were intronic (94.3%), 146 exonic (5.4%), 2 in the 5′ untranslated regions (UTR) (0.07%), and 7 at splice sites (0.26%, 6 are within coding regions) (Table 1). The majority (75.1%) was defined as novel (95.9% and 64.1% of variants in the MAF<0.01 and MAF=0.01–0.1, respectively, Figure 1). These variants have not been interrogated by previous VWF related functional studies, demonstrating the value of using the 1000G for enriching the polymorphism knowledge of particular genes of interest.
When data were stratified into four continental groups, we identified 1,986, 1,545, 1,011 and 1,244 variants in subjects from Africa, America, Asia, and Europe, respectively (Table 1). As expected, Africans harbor more genetic polymorphisms than non-Africans across all annotation categories. The proportion of variants shared among populations varied by annotation category, ranging from 4.4% (4/91) for missense SNPs to 21.0% (541/2573) for intronic SNPs (Table 1). Most of the shared variants had an allele frequency higher than 1% (Table 1). In contrast to SNPs, 57.1% of indels were shared among populations. This high sharing rate could be attributed to a more stringent method that filtered out indels with a minor allele frequency of < 0.5%.
Overall, 45.7% (1,288/2,819) of the variants were specific to one of the four populations. Africa and Asia had the highest proportions of missense SNPs that were specific to a single continent: 63.8% and 73.7%, respectively.
Ninety-one of the 152 exonic SNPs were non-synonymous (59.9%), 55 were synonymous (36.2%), and 6 were at a splicing site (3.9%) (Table 1; see Table S2 for a detailed list). We further analyzed their potential for functional impact using two widely used in silico approaches: SIFT and PolyPhen-2. The former identifies critical amino acids by their conservation in a specific protein family, whereas the latter compares the difference in protein structure between the wild type and the variant. Of the 91 non-synonymous SNPs and three non-synonymous SNPs at coding splicing sites, 36 were predicted to be deleterious by SIFT, 44 by PolyPhen-2, and 31 by both models (Table 2; see Table S2 for the full list of predictions on the 152 variants). When closely examined, the 31 variants that were predicted to be deleterious by both programs were distributed in all domains except the A1 and CK domains. After correction for domain sizes, no particular clustering pattern was found (data not shown).
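The "predicted by both models" step amounts to intersecting the two tools' calls. The sketch below shows one way to do this, assuming a SIFT score cutoff of 0.05 and treating both PolyPhen-2 "damaging" classes as deleterious; these cutoffs are assumptions for illustration, since the thresholds used in the study are not stated here. The variant names and scores are placeholders, not real predictions.

```python
from typing import Dict, NamedTuple, Set


class Prediction(NamedTuple):
    sift_score: float     # lower scores indicate less tolerated substitutions
    polyphen_class: str   # categorical PolyPhen-2 call


# Assumed cutoffs, not taken from the study.
SIFT_CUTOFF = 0.05
POLYPHEN_DAMAGING = {"probably damaging", "possibly damaging"}


def consensus_deleterious(predictions: Dict[str, Prediction]) -> Set[str]:
    """Variants called deleterious by both tools."""
    sift_hits = {name for name, p in predictions.items()
                 if p.sift_score <= SIFT_CUTOFF}
    polyphen_hits = {name for name, p in predictions.items()
                     if p.polyphen_class in POLYPHEN_DAMAGING}
    return sift_hits & polyphen_hits


# Placeholder variants with made-up scores.
example = {
    "var_a": Prediction(0.01, "probably damaging"),   # both tools
    "var_b": Prediction(0.02, "benign"),              # SIFT only
    "var_c": Prediction(0.40, "possibly damaging"),   # PolyPhen-2 only
    "var_d": Prediction(0.90, "benign"),              # neither
}
print(sorted(consensus_deleterious(example)))
```

With the counts reported above (36 by SIFT, 44 by PolyPhen-2, 31 by both), inclusion-exclusion gives 36 + 44 − 31 = 49 variants flagged by at least one tool.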
The functional impact of these VWF variants was also examined by their association with VWD (http://www.vwf.group.shef.ac.uk/). We identified 19 variants previously associated with or determined to be causative for VWD (Table 3, Tables S3–S4) [31–34]. Seven of them (36.8%), namely L129M, M576I, M740I, H817Q, R924Q, R2185Q, and R2287W, had an allele frequency of > 1% in one population, but not in others (Table 3). M740I, H817Q, and R2185Q were found in 19.5%, 13.8%, and 20.7% of AFR subjects, respectively, whereas their allele frequencies in non-AFR subjects were consistent with previous reports [31, 33, 35]. M740I was found by screening families of Italian ancestry with type 2M VWD and in 3 additional cases in the European study [32, 33]. It consistently co-segregated with R1205H in the European study of VWD patients, but none of the individuals in the 1000G had the R1205H variant. H817Q was found to affect the binding of mature VWF to coagulation factor VIII in a European patient with type 1 VWD, who was also compound heterozygous for R782W. R2185Q was found in one European index patient, and no functional study was performed.
Among the 19 VWD variants, 7 were also computationally predicted to be deleterious (annotated with an asterisk in Table 2). When deleterious variants were further annotated with 1000G allele frequencies, 4 (L129M, T346I, R2287W, and G2705R) had a MAF greater than 1% in one of the four continents. This is higher than what we would expect given the prevalence of VWD (estimates ranging from 0.01% to 0.8%) [36, 37]. L129M, T346I, and R2287W were found only in AFR subjects (Table 2). L129M and R2287W were initially identified as type 1 VWD mutations by screening patients with European ancestry [31, 32]. A likely explanation is that their VWD propensity is specific to non-AFR populations, but that this propensity is modified by environmental and/or genetic factors in AFR subjects. G2705R segregated with MAFs of 5.5% and 7.2% in EUR and AMR subjects, respectively (Table 2). Because AMR consists of a number of admixed ethnicities (MXL, CLM, and PUR), a proportion of its alleles is of European origin. It is therefore likely that G2705R in AMR originated from European founders. Given its relatively high population frequency, it may also be a false-positive prediction by both SIFT and PolyPhen-2.
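The comparison between allele frequency and disease prevalence can be made explicit with a short calculation. Assuming Hardy-Weinberg proportions and, for the sake of argument, that every carrier of a variant would be affected (both are simplifying assumptions, not claims from the study), the fraction of individuals carrying at least one copy of an allele with frequency q is 1 − (1 − q)².

```python
def carrier_frequency(maf: float) -> float:
    """Fraction of individuals carrying at least one copy of the allele,
    assuming Hardy-Weinberg proportions."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return 1.0 - (1.0 - maf) ** 2


# Upper prevalence estimate quoted in the text (0.8%).
UPPER_PREVALENCE = 0.008

carriers = carrier_frequency(0.01)   # allele at the 1% MAF threshold
print(f"carriers: {carriers:.4f}, exceeds prevalence: {carriers > UPPER_PREVALENCE}")
```

Under these assumptions, a single variant at 1% MAF would already be carried by about 2% of individuals, more than the highest prevalence estimate for all VWD combined, which is why such a frequency is hard to reconcile with a fully penetrant causative role.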
We next quantified the nucleotide diversity among exonic VWF domains by calculating pi. The D′ domain had the highest nucleotide diversity, followed by the D2 and D4 domains (Figure 2). Alleles with MAF>10% in the three domains may have accounted for this higher nucleotide diversity (Table S3). Many of the variants (60%) found at high frequency in these three domains were synonymous (Table S3). The conservation scores, calculated by comparing human sequences against 46 species, showed that D2 and D′ had a level of sequence conservation similar to that of other VWF domains (1.0 and 1.1 versus 0.9, 0.9 and 1.0 for the B1, B2 and CK domains), suggesting that the D2 and D′ domains are not targeted for positive selection.
The nucleotide diversity of VWF domains also varied among four continents. As expected, AFR subjects were more polymorphic than others (Table S4). Some of the domains had too few variants (B1, B2, CK, Table S2) to have statistical power for comparison of nucleotide diversity. This limitation could in part account for the low nucleotide diversity in these domains.
The diversity of intronic SNPs was more evenly distributed throughout the VWF gene, with those in introns 29–32 (flanking exons that encode the A3 domain) being the least variable (Figure S1). The ascertainment complexity in the 1000G could account for this low diversity because intronic SNPs were detected only by the low pass sequencing (average read depth of coverage = ~2–6X/subject), whereas exonic SNPs were detected by both low pass and exome sequencing.
We performed PCA to study the genetic relationship among 13 ethnicities (14 subjects from Iberian in Spain (IBS) were excluded because of an insufficient sample size). The 1st and 2nd PCs accounted for 96.8% of the total variance. When the 1st and 2nd principal components were examined, subjects could be divided into three groups: ASN, AMR/EUR, and AFR. The ASW was between AFR and AMR/EUR, but close to AFR (Figure S2), indicating its closer relationship with Africans. This finding is consistent with the genome-wide expectation and the historical record that ~80% of African American ancestry is closely related to western Africans.
Haplotypes of VWF SNPs among subjects from the four continents showed distinct LD patterns. LD blocks from Africans had a much shorter range than those from other continents (Figure S4). Africans had 83 haplotype blocks in the VWF gene, with the longest being 4 kb, whereas Asian, American, and European subjects had 29, 42, and 40 LD blocks, with the longest blocks being 21 kb, 13 kb, and 19 kb, respectively. The four European ethnicities had a similar LD structure that was different from that of the three African ethnicities (Figures S4 and S5).
We have examined the genetic variation and ethnic diversity of the VWF gene using the 1000G dataset (Table 1 and Figure 1). As expected, there were more variations in Africans than in other ethnic groups (Table 1). Bellissimo et al. have recently reported 14 variants that were previously associated with VWD but have a minimal influence on plasma VWF antigen. We identified 11 of these 14 variants in the 1000G dataset. Of the rest, R1342C was not detected in 1000G, and V1229G and N1231T were removed because of possible contamination from the VWF pseudogene. By working with a much larger sample size and a more diversified ethnic background, we extended the study of Bellissimo et al. by identifying a larger number of non-synonymous SNPs (69 out of 94 were not in dbSNP129), including 19 listed in the ISTH VWF database (Table 3 and Table S3).
The allelic diversity of VWF variants was further interrogated by PCA (Figure S2) and analysis for linkage disequilibrium (Figure S3–S5). The four European ethnicities had very similar LD patterns that differed from the African ethnicities. The three Asian ethnicities (CHS, CHB, and JPT) can be grouped together on the PCA panel, indicating their similar genetic structures. Together, these data demonstrate that the VWF gene is ethnically diversified at a level that has not been reported before. This global analysis of diversity in the VWF gene provides necessary background information for understanding the presence of “VWD variants” in disease and non-disease populations. We further showed that the nucleotide diversity of exonic VWF variants was highest in the D′ and D2 domains (Figure 2). The functional impacts of these variants have not been examined experimentally, but one could speculate that this higher genetic diversity in the D′ and D2 domains could result in variations in VWF multimer patterns among subjects from different ethnicities.
A diversified presence of the “VWD mutations” in four continents (Table 3) raises three critically important and related questions. First, do these variants alter cellular and biochemical properties of VWF consistent with a VWD phenotype? The answer is yes, at least for some, as demonstrated in vitro for recombinant S1731T, Y1584C, and R273W variants that result in intracellular retention and the lack of high molecular weight multimers [39–41]. However, the question remains as to whether these genetic variations cause a mild phenotype that requires additional mutations that co-segregate (i.e. in strong LD) or environmental triggers to present a bleeding phenotype. For the former, 57.9% of the VWD variants found in the 1000G cohort are in the D domains and are likely to generate a mild phenotype as compared to those found in the A1 domain, where the binding site for the platelet GP Ib-IX-V complex is located. For the latter, M740I was found to co-segregate with R1205H in three European patients with type 1/2M VWD [32, 33]. This mutation has an allele frequency of 19.5% in Africans, but none of the carriers has the co-segregating R1205H.
Second, can these mutations cause VWD in an ethnic-specific manner? The question is raised because very few VWF variants, especially non-synonymous SNPs, are shared among subjects from different continents (Tables 1 and 3). M740I, H817Q, and R2185Q, which were originally reported in European patients with VWD, have minor allele frequencies of >10% in Africans. There may be several reasons for this ethnic diversity in disease association: 1) these mutations result in low VWF antigen that is partially compensated by the high baseline VWF antigen level found in Africans and African Americans; 2) they are VWD-causing only when they are confined to ethnic-specific haplotypes; and 3) they arose recently and have escaped negative selection pressures in a more diverse gene pool such as that of Africans [20, 22]. These possibilities can be further investigated in larger cohorts of healthy subjects and VWD patients with available VWF antigen and adhesive activity.
Third, in addition to the 19 VWF variants that were previously associated with VWD, 31 non-synonymous variants were considered to be deleterious by two computational programs that evaluate the potential of a specific variant to affect function, based on structural differences between the wild type and the variant and on conservation in a protein family. The biological impact of these deleterious variants remains to be verified experimentally, but one can speculate that they induce mild-to-moderate alterations in VWF structure or folding that are insufficient to cause a catastrophic phenotype. This is especially possible for variations in the D domains, where changes in multimer patterns and interaction with FVIII may not affect VWF interaction with the platelet GP Ib-IX-V complex as severely or directly as the A domain mutations found in most type 2 VWD. However, these mild-to-moderate variants can be additive in influencing VWF structure and function when they reside in specific haplotypes together with other variants.
Compared with SNPs, which have been extensively studied in the past, indels are new variants whose functional significance remains to be investigated. We identified 91 indels, all in introns, so that no frame-shift in the coding sequence is expected. The biological impacts of these 91 indels are unknown, but they could potentially result in changes in the rate of VWF gene splicing and in microRNA binding. Some may also serve as markers for other genetic variations.
In summary, in this comprehensive study of VWF variants in different ethnicities, we have identified 2,728 VWF SNPs and 91 indels in 1000G subjects, with 75.1% being novel. The D′ and D2 domains had the highest level of nucleotide diversity. Furthermore, 19 non-synonymous SNPs that have previously been associated with VWD in Europeans were detected in 1000G subjects, and some have different allele frequencies among the four populations. Results from this study demonstrate that the VWF gene is ethnically diverse, but this ethnic complexity and its contribution to diseases require further study in large cohorts. Molecular diagnostic panels for VWD may also benefit from considering ethnic diversity when linking a specific VWF variant to a bleeding phenotype.
The authors thank the participants of the 1000 Genomes Project for their work and contributions.
This work is supported by grants HG003273, HG005211-01, MH089175, HL71895 and HL085769 from the National Institutes of Health. Q.Y. Wang is a recipient of a scholarship from the Chinese Scholarship Council.
Q.Y. Wang: designed the study, analyzed the data, and wrote the manuscript;
J. Song: analyzed the data;
R.A. Gibbs and E. Boerwinkle: designed the study;
F.L. Yu and J.F. Dong: developed the hypothesis, designed the study, analyzed the data and wrote the manuscript.
Disclosure of Conflict of Interest:
The authors declare no relevant financial interests.