Related Articles

1.  Efficient haplotype block recognition of very long and dense genetic sequences 
BMC Bioinformatics  2014;15:10.
Background
New sequencing technologies make it possible to scan very long and dense genetic sequences, producing datasets of genetic markers that are an order of magnitude larger than those previously available. Such genetic sequences are characterized by common alleles interspersed with multiple rarer alleles. This situation has renewed interest in identifying haplotypes that carry rare risk alleles. However, large-scale exploration of the linkage disequilibrium (LD) pattern to identify haplotype blocks is not easy to perform, because traditional algorithms have at least Θ(n²) time and memory complexity.
Results
We derived three incremental optimizations of the widely used haplotype block recognition algorithm proposed by Gabriel et al. in 2002. Our most efficient solution, called MIG ++, has only Θ(n) memory complexity and, on a genome-wide scale, omits >80% of the calculations, which makes it an order of magnitude faster than the original algorithm. Unlike existing software, MIG ++ analyzes the LD between SNPs at any distance, avoiding restrictions on the maximal block length. The haplotype block partition of the entire HapMap II CEPH dataset was obtained in 457 hours. Replacing the standard likelihood-based D′ variance estimator with an approximate estimator further improved the runtime. While producing a coarser partition, the approximate method made it possible to obtain the full-genome haplotype block partition of the entire 1000 Genomes Project CEPH dataset in 44 hours, with no restrictions on allele frequency or long-range correlations. These experiments showed that LD-based haplotype blocks can span more than one million base pairs in both the HapMap II and 1000 Genomes datasets. An application to the North American Rheumatoid Arthritis Consortium (NARAC) dataset shows how MIG ++ can support genome-wide haplotype association studies.
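As a rough illustration of the quantities involved, the sketch below computes the point estimate of |D′| for one pair of biallelic SNPs and counts the SNP pairs that an exhaustive LD scan would visit, which is where the quadratic cost comes from. It is not the MIG ++ algorithm and it does not compute the D′ confidence intervals on which the Gabriel et al. block definition relies; the function names and the example frequencies are assumptions made for illustration.

```python
def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Absolute normalized LD coefficient |D'| for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return abs(d) / d_max


def pair_count(n_snps: int) -> int:
    """Number of SNP pairs examined by an exhaustive pairwise LD scan."""
    return n_snps * (n_snps - 1) // 2


# Hypothetical frequencies: alleles A and B always travel together.
complete_ld = d_prime(p_ab=0.5, p_a=0.5, p_b=0.5)
# Hypothetical frequencies: the two loci are independent.
no_ld = d_prime(p_ab=0.25, p_a=0.5, p_b=0.5)
```

Because `pair_count` grows with the square of the number of SNPs, any method that stores or evaluates every pair scales poorly on dense sequences, which is the bottleneck the abstract describes.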
Conclusions
MIG ++ enables LD-based haplotype block recognition on genetic sequences of any length and density. In the era of next-generation sequencing, this can help identify haplotypes that carry rare variants of interest. The low computational requirements make it possible to include the haplotype block structure in genome-wide association scans, downstream analyses, and visual interfaces for online genome browsers.
doi:10.1186/1471-2105-15-10
PMCID: PMC3898000  PMID: 24423111
2.  ParallABEL: an R library for generalized parallelization of genome-wide association studies 
BMC Bioinformatics  2010;11:217.
Background
Genome-Wide Association (GWA) analysis is a powerful method for identifying loci associated with complex traits and drug response. Parts of GWA analyses, especially those involving thousands of individuals and taking hours to months to run, will benefit from parallel computation. However, acquiring the programming skills needed to correctly partition and distribute data, control and monitor tasks on clustered computers, and merge output files is arduous.
Results
Most components of GWA analysis can be divided into four groups based on the types of input data and statistical outputs. The first group contains statistics computed for a particular Single Nucleotide Polymorphism (SNP), or trait, such as SNP characterization statistics or association test statistics. The input data of this group includes the SNPs/traits. The second group concerns statistics characterizing an individual in a study, for example, the summary statistics of genotype quality for each sample. The input data of this group includes individuals. The third group consists of pair-wise statistics derived from analyses between each pair of individuals in the study, for example genome-wide identity-by-state or genomic kinship analyses. The input data of this group includes pairs of SNPs/traits. The final group concerns pair-wise statistics derived for pairs of SNPs, such as the linkage disequilibrium characterisation. The input data of this group includes pairs of individuals. We developed the ParallABEL library, which utilizes the Rmpi library, to parallelize these four types of computations. The ParallABEL library is not only aimed at GenABEL, but may also be employed to parallelize various GWA packages in R. The data set from the North American Rheumatoid Arthritis Consortium (NARAC), which includes 2,062 individuals genotyped at 545,080 SNPs, was used to measure ParallABEL performance. Almost perfect speed-up was achieved for many types of analyses. For example, the computing time for the identity-by-state matrix was linearly reduced from approximately eight hours to one hour when ParallABEL employed eight processors.
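The pair-wise identity-by-state example can be sketched as follows. This is a minimal Python illustration, not ParallABEL's R/Rmpi code: the 0/1/2 genotype coding, the function names, and the round-robin way of dealing pairs out to workers are all assumptions made for the example.

```python
from itertools import combinations


def ibs(g1, g2):
    """Mean identity-by-state between two individuals.

    Genotypes are assumed to be coded 0/1/2 (copies of one allele), so each
    SNP contributes 2, 1, or 0 shared alleles.
    """
    shared = [2 - abs(a - b) for a, b in zip(g1, g2)]
    return sum(shared) / (2 * len(shared))


def split_pairs(n_individuals, n_workers):
    """Deal all individual pairs out to workers in round-robin order."""
    pairs = list(combinations(range(n_individuals), 2))
    return [pairs[w::n_workers] for w in range(n_workers)]


# Hypothetical toy data: 4 individuals, 3 SNPs each.
genotypes = [[0, 1, 2], [0, 1, 2], [2, 1, 0], [1, 1, 1]]
chunks = split_pairs(len(genotypes), n_workers=2)
# Each worker would compute IBS only for the pairs in its own chunk.
results = [{(i, j): ibs(genotypes[i], genotypes[j]) for i, j in chunk}
           for chunk in chunks]
```

Since every pair is independent of the others, the chunks can be processed without communication, which is consistent with the near-linear speed-up reported for the identity-by-state matrix.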
Conclusions
Executing genome-wide association analysis using the ParallABEL library on a computer cluster is an effective way to boost performance, and simplify the parallelization of GWA studies. ParallABEL is a user-friendly parallelization of GenABEL.
doi:10.1186/1471-2105-11-217
PMCID: PMC2879286  PMID: 20429914
3.  A Candidate Gene Approach Identifies the TRAF1/C5 Region as a Risk Factor for Rheumatoid Arthritis 
PLoS Medicine  2007;4(9):e278.
Background
Rheumatoid arthritis (RA) is a chronic autoimmune disorder affecting ∼1% of the population. The disease results from the interplay between an individual's genetic background and unknown environmental triggers. Although human leukocyte antigens (HLAs) account for ∼30% of the heritable risk, the identities of non-HLA genes explaining the remainder of the genetic component are largely unknown. Based on functional data in mice, we hypothesized that the immune-related genes complement component 5 (C5) and/or TNF receptor-associated factor 1 (TRAF1), located on Chromosome 9q33–34, would represent relevant candidate genes for RA. We therefore aimed to investigate whether this locus would play a role in RA.
Methods and Findings
We performed a multitiered case-control study using 40 single-nucleotide polymorphisms (SNPs) from the TRAF1 and C5 (TRAF1/C5) region in a set of 290 RA patients and 254 unaffected participants (controls) of Dutch origin. Stepwise replication of significant SNPs was performed in three independent sample sets from the Netherlands (cases/controls = 454/270), Sweden (cases/controls = 1,500/1,000) and the US (cases/controls = 475/475). We observed a significant association (p < 0.05) of SNPs located in a haplotype block that encompasses a 65 kb region including the 3′ end of C5 as well as TRAF1. A sliding window analysis revealed an association peak at an intergenic region located ∼10 kb from both C5 and TRAF1. This peak, defined by SNP14/rs10818488, was confirmed in a total of 2,719 RA patients and 1,999 controls (common odds ratio = 1.28, 95% confidence interval 1.17–1.39, combined p = 1.40 × 10⁻⁸) with a population-attributable risk of 6.1%. The A (minor susceptibility) allele of this SNP also significantly correlates with increased disease progression as determined by radiographic damage over time in RA patients (p = 0.008).
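For readers unfamiliar with the statistic, an odds ratio and its 95% confidence interval can be computed from a single 2×2 allele-count table as sketched below. The counts are hypothetical, and the log-scale (Woolf) interval shown here is a standard textbook choice; the abstract reports a common odds ratio pooled across cohorts, which this single-table sketch does not reproduce.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a log-scale (Woolf) confidence interval.

    a: case alleles carrying the risk allele, b: case alleles without it
    c: control alleles carrying it,          d: control alleles without it
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
estimate, lower, upper = odds_ratio_ci(a=20, b=10, c=10, d=20)
```

An interval that excludes 1, as the reported 1.17–1.39 does, indicates that the association is unlikely to be due to sampling variation alone.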
Conclusions
Using a candidate-gene approach we have identified a novel genetic risk factor for RA. Our findings indicate that a polymorphism in the TRAF1/C5 region increases the susceptibility to and severity of RA, possibly by influencing the structure, function, and/or expression levels of TRAF1 and/or C5.
Using a candidate-gene approach, Rene Toes and colleagues identified a novel genetic risk factor for rheumatoid arthritis in the TRAF1/C5 region.
Editors' Summary
Background.
Rheumatoid arthritis is a very common chronic illness that affects around 1% of people in developed countries. It is caused by an abnormal immune reaction to various tissues within the body; as well as affecting joints and causing an inflammatory arthritis, it can also affect many other organs of the body. Severe rheumatoid arthritis can be life-threatening, but even mild forms of the disease cause substantial illness and disability. Current treatments aim to give symptomatic relief with the use of simple analgesics, or anti-inflammatory drugs. In addition, most patients are also treated with what are known as disease-modifying agents, which aim to prevent joint damage. Rheumatoid arthritis is known to have a genetic component. For example, an association has been shown with the part of the genome that contains the human leukocyte antigens (HLAs), which are involved in the immune response. Information on other genes involved would be helpful both for understanding the underlying cause of the disease and possibly for the discovery of new treatments.
Why Was This Study Done?
Previous work in mice that have a disease similar to human rheumatoid arthritis has identified a number of possible candidate genes. One of these genes, complement component 5 (C5) is involved in the complement system—a primitive system within the body that is involved in the defense against foreign molecules. In humans the gene for C5 is located on Chromosome 9 close to another gene involved in the inflammatory response, TNF receptor-associated factor 1 (TRAF1). A preliminary study in humans of this region had shown some evidence, albeit weak, to suggest that this region might be associated with rheumatoid arthritis. The authors set out to look in more detail, and in a larger group of individuals, to see if they could prove this association.
What Did the Researchers Do and Find?
The researchers took 40 genetic markers, known as single-nucleotide polymorphisms (SNPs), from across the region that included the C5 and TRAF1 genes. SNPs have each been assigned a unique reference number that specifies a point in the human genome, and each is present in alternate forms so can be differentiated. They compared which of the alternate forms were present in 290 patients with rheumatoid arthritis and 254 unaffected participants of Dutch origin. They then repeated the study in three other groups of patients and controls of Dutch, Swedish, and US origin. They found a consistent association with rheumatoid arthritis of one region of 65 kilobases (a small distance in genetic terms) that included one end of the C5 gene as well as the TRAF1 gene. They could refine the area of interest to a piece marked by one particular SNP that lay between the genes. They went on to show that the genetic region in which these genes are located may be involved in the binding of a protein that modifies the transcription of genes, thus providing a possible explanation for the association. Furthermore, they showed that one of the alternate versions of the marker in this region was associated with more aggressive disease.
What Do These Findings Mean?
The finding of a genetic association is the first step in identifying a genetic component of a disease. The strength of this study is that a novel genetic susceptibility factor for RA has been identified and that the overall result is consistent in four different populations as well as being associated with disease severity. Further work will need to be done to confirm the association in other populations and then to identify the precise genetic change involved. Hopefully this work will lead to new avenues of investigation for therapy.
Additional Information.
Please access these Web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.0040278.
• Medline Plus, the health information site for patients from the US National Library of Medicine, has a page of resources on rheumatoid arthritis
• The UK's National Health Service online information site has information on rheumatoid arthritis
• The Arthritis Research Campaign, a UK charity that funds research on all types of arthritis, has a booklet with information for patients on rheumatoid arthritis
• Reumafonds, a Dutch arthritis foundation, gives information on rheumatoid arthritis (in Dutch)
• Autocure is an initiative whose objective is to transform knowledge obtained from molecular research into a cure for an increasing number of patients suffering from inflammatory rheumatic diseases
• The European League Against Rheumatism, an organisation which represents the patient, health professionals, and scientific societies of rheumatology of all European nations
doi:10.1371/journal.pmed.0040278
PMCID: PMC1976626  PMID: 17880261
4.  Performance of Single Nucleotide Polymorphisms versus Haplotypes for Genome-Wide Association Analysis in Barley 
PLoS ONE  2010;5(11):e14079.
Genome-wide association studies (GWAS) may benefit from utilizing haplotype information for making marker-phenotype associations. Several rationales for grouping single nucleotide polymorphisms (SNPs) into haplotype blocks exist, but any advantage may depend on such factors as genetic architecture of traits, patterns of linkage disequilibrium in the study population, and marker density. The objective of this study was to explore the utility of haplotypes for GWAS in barley (Hordeum vulgare) to offer a first detailed look at this approach for identifying agronomically important genes in crops. To accomplish this, we used genotype and phenotype data from the Barley Coordinated Agricultural Project and constructed haplotypes using three different methods. Marker-trait associations were tested by the efficient mixed-model association algorithm (EMMA). When QTL were simulated using single SNPs dropped from the marker dataset, a simple sliding window performed as well or better than single SNPs or the more sophisticated methods of blocking SNPs into haplotypes. Moreover, the haplotype analyses performed better 1) when QTL were simulated as polymorphisms that arose subsequent to marker variants, and 2) in analysis of empirical heading date data. These results demonstrate that the information content of haplotypes is dependent on the particular mutational and recombinational history of the QTL and nearby markers. Analysis of the empirical data also confirmed our intuition that the distribution of QTL alleles in nature is often unlike the distribution of marker variants, and hence utilizing haplotype information could capture associations that would elude single SNPs. We recommend routine use of both single SNP and haplotype markers for GWAS to take advantage of the full information content of the genotype data.
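A simple sliding window, the approach that performed well in the simulations above, can be sketched as follows. The example assumes phased haplotypes given as strings with one character per SNP; the window width, the function name, and the toy data are illustrative assumptions, since the abstract does not state them.

```python
from collections import Counter


def sliding_window_haplotypes(phased, width):
    """Count the haplotype alleles seen in each window of `width` adjacent SNPs.

    phased: list of equal-length strings, one character per SNP
    returns: one Counter per window, in genomic order
    """
    n_snps = len(phased[0])
    windows = []
    for start in range(n_snps - width + 1):
        windows.append(Counter(h[start:start + width] for h in phased))
    return windows


# Hypothetical phased haplotypes over four SNPs.
phased = ["ACGT", "ACGA", "TCGT"]
windows = sliding_window_haplotypes(phased, width=2)
```

Each window then acts as a multi-allelic marker whose haplotype alleles can be tested against the phenotype in place of the individual SNPs.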
doi:10.1371/journal.pone.0014079
PMCID: PMC2989918  PMID: 21124933
5.  Haplotype-Based Analysis: A Summary of GAW16 Group 4 Analysis 
Genetic epidemiology  2009;33(Suppl 1):S24-S28.
In this summary paper, we describe the contributions included in the haplotype-based analysis group (Group 4) at the Genetic Analysis Workshop 16, which was held September 17-20, 2008. Our group applied a large number of haplotype-based methods in the context of genome-wide association studies. Two general approaches were applied: a two-stage approach that selected significant single-nucleotide polymorphisms and then created haplotypes, and a genome-wide analysis of smaller sets of single-nucleotide polymorphisms selected by sliding windows or by estimating haplotype blocks. Genome-wide haplotype analyses performed in these ways were feasible. The very strong chromosome 6 association in the North American Rheumatoid Arthritis Consortium data was detected by every method, and additional analyses attempted to control for this strong result to allow detection of additional haplotype associations.
doi:10.1002/gepi.20468
PMCID: PMC2916652  PMID: 19924718
population stratification; multiple comparisons
6.  Multi-locus stepwise regression: a haplotype-based algorithm for finding genetic associations applied to atopic dermatitis 
BMC Medical Genetics  2012;13:8.
Background
Genome-wide association studies (GWAS) provide an increasing number of single nucleotide polymorphisms (SNPs) associated with diseases. Our aim is to exploit those closely spaced SNPs in candidate regions for a deeper analysis of association beyond single SNP analysis, combining the classical stepwise regression approach with haplotype analysis to identify risk haplotypes for complex diseases.
Methods
Our proposed multi-locus stepwise regression starts with an evaluation of all pair-wise SNP combinations and then extends each SNP combination stepwise by one SNP from the region, carrying out haplotype regression in each step. The best associated haplotype patterns are kept for the next step and must be corrected for multiple testing at the end. These haplotypes should also be replicated in an independent data set. We applied the method to a region of 259 SNPs from the epidermal differentiation complex (EDC) on chromosome 1q21 of a German GWAS using a case control set (1,914 individuals) and to 268 families with at least two affected children as replication.
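The search strategy described above (start from all SNP pairs, extend each kept combination by one SNP, keep the best patterns for the next step) can be sketched as a small beam search. This is only an outline under stated assumptions: the haplotype regression is replaced by a caller-supplied `score` function where lower is better, and the names `stepwise_extend`, `keep`, and `max_size` are hypothetical.

```python
from itertools import combinations


def stepwise_extend(snps, score, max_size, keep):
    """Grow SNP combinations one SNP at a time, keeping the `keep` best per step.

    snps: sortable SNP identifiers from the candidate region
    score: callable taking a tuple of SNPs, lower is better (stand-in for the
           p-value of a haplotype regression)
    """
    def rank(combo):
        return (score(combo), combo)  # tie-break on the tuple for determinism

    current = sorted(combinations(sorted(snps), 2), key=rank)[:keep]
    best = list(current)
    for _ in range(max_size - 2):
        candidates = {tuple(sorted(combo + (snp,)))
                      for combo in current for snp in snps if snp not in combo}
        if not candidates:
            break
        current = sorted(candidates, key=rank)[:keep]
        best.extend(current)
    return sorted(best, key=rank)


# Toy score: distance from a hypothetical "true" risk pattern {1, 3, 4}.
target = {1, 3, 4}
ranked = stepwise_extend([1, 2, 3, 4, 5],
                         score=lambda combo: len(set(combo) ^ target),
                         max_size=3, keep=2)
```

Any pattern returned this way has been selected from many candidates, which is why the method requires multiple-testing correction at the end and replication in an independent data set.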
Results
A 4-SNP haplotype pattern with high statistical significance in the case control set (p = 4.13 × 10⁻⁷ after Bonferroni correction) could be identified which remained significant in the family set after Bonferroni correction (p = 0.0398). Further analysis revealed that this pattern reflects mainly the effect of the well-known FLG gene; however, a FLG-independent haplotype in case control set (OR = 1.71, 95% CI: 1.32-2.23, p = 5.6 × 10⁻⁵) and family set (OR = 1.68, 95% CI: 1.18-2.38, p = 2.19 × 10⁻³) could be found in addition.
Conclusion
Our approach is a useful tool for finding allele combinations associated with diseases beyond single SNP analysis in chromosomal candidate regions.
doi:10.1186/1471-2350-13-8
PMCID: PMC3398269  PMID: 22284537
7.  Finding type 2 diabetes causal single nucleotide polymorphism combinations and functional modules from genome-wide association data 
Background
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
Methods
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA.
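The LD pruning step mentioned above can be illustrated with a greedy sketch: a SNP is kept only if its squared genotype correlation with every SNP already kept stays at or below a threshold. The r² measure on 0/1/2 genotype codes, the threshold value, and the function names are assumptions for illustration; the abstract does not specify how pruning was carried out.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP is uncorrelated by convention here
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, threshold):
    """Greedy pruning: keep a SNP if r^2 <= threshold with all kept SNPs.

    genotypes: dict mapping SNP id to its genotype vector, in priority order
    """
    kept = []
    for snp, vector in genotypes.items():
        if all(r_squared(vector, genotypes[k]) <= threshold for k in kept):
            kept.append(snp)
    return kept


# Hypothetical genotypes for three SNPs in four individuals.
toy = {"s1": [0, 1, 2, 0], "s2": [0, 1, 2, 0], "s3": [2, 0, 1, 1]}
kept = ld_prune(toy, threshold=0.5)
```

Here `s2` duplicates `s1` exactly and is dropped, while `s3` is only weakly correlated with `s1` and is retained.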
Results
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration do not differ significantly from the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
Conclusions
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
doi:10.1186/1472-6947-13-S1-S3
PMCID: PMC3618247  PMID: 23566118
8.  Detection of disease-associated deletions in case–control studies using SNP genotypes with application to rheumatoid arthritis 
Human genetics  2009;126(2):303-315.
Genomic deletions have long been known to play a causative role in microdeletion syndromes. Recent whole-genome genetic studies have shown that deletions can increase the risk for several psychiatric disorders, suggesting that genomic deletions play an important role in the genetic basis of complex traits. However, the association between genomic deletions and common, complex diseases has not yet been systematically investigated in gene mapping studies. Likelihood-based statistical methods for identifying disease-associated deletions have recently been developed for familial studies of parent-offspring trios. The purpose of this study is to develop statistical approaches for detecting genomic deletions associated with complex disease in case–control studies. Our methods are designed to be used with dense single nucleotide polymorphism (SNP) genotypes to detect deletions in large-scale or whole-genome genetic studies. As more and more SNP genotype data for genome-wide association studies become available, development of sophisticated statistical approaches will be needed that use these data. Our proposed statistical methods are designed to be used in SNP-by-SNP analyses and in cluster analyses based on combined evidence from multiple SNPs. We found that these methods are useful for detecting disease-associated deletions and are robust in the presence of linkage disequilibrium using simulated SNP data sets. Furthermore, we applied the proposed statistical methods to SNP genotype data of chromosome 6p for 868 rheumatoid arthritis patients and 1,197 controls from the North American Rheumatoid Arthritis Consortium. We detected disease-associated deletions within the region of human leukocyte antigen in which genomic deletions were previously discovered in rheumatoid arthritis patients.
doi:10.1007/s00439-009-0672-3
PMCID: PMC2992885  PMID: 19415332
9.  Suggestive evidence for association between L-type voltage-gated calcium channel (CACNA1C) gene haplotypes and bipolar disorder in Latinos: a family-based association study 
Bipolar disorders  2013;15(2):206-214.
Objectives
Through recent genome-wide association studies (GWAS), several groups have reported significant association between variants in the alpha 1C subunit of the L-type voltage-gated calcium channel (CACNA1C) and bipolar disorder (BP) in European and European-American cohorts. We performed a family-based association study to determine whether CACNA1C is associated with BP in the Latino population.
Methods
This study consisted of 913 individuals from 215 Latino pedigrees recruited from the United States, Mexico, Guatemala, and Costa Rica. The Illumina GoldenGate Genotyping Assay was used to genotype 58 single-nucleotide polymorphisms (SNPs) that spanned a 602.9 kb region encompassing the CACNA1C gene including two SNPs (rs7297582 and rs1006737) previously shown to associate with BP. Individual SNP and haplotype association analyses were performed using Family-Based Association Test (version 2.0.3) and Haploview (version 4.2) software.
Results
An eight-locus haplotype block that included these two markers showed significant association with BP (global marker permuted p = 0.0018) in the Latino population. For individual SNPs, this sample had insufficient power (10%) to detect associations with SNPs with minor effect (odds ratio = 1.15).
Conclusions
Although we were not able to replicate findings of association between individual CACNA1C SNPs rs7297582 and rs1006737 and BP, we were able to replicate the GWAS signal reported for CACNA1C through a haplotype analysis that encompassed these previously reported significant SNPs. These results provide additional evidence that CACNA1C is associated with BP and provide the first evidence that variations in this gene might play a role in the pathogenesis of this disorder in the Latino population.
doi:10.1111/bdi.12041
PMCID: PMC3781018  PMID: 23437964
bipolar disorder; calcium channels; genetic association studies; haplotypes; Hispanic Americans; L-type; pedigree; polymorphism; single nucleotide
10.  Association mapping of susceptibility loci for rheumatoid arthritis 
BMC Proceedings  2007;1(Suppl 1):S15.
We analyzed a case-control data set for chromosome 18q from the Genetic Analysis Workshop 15 to detect susceptibility loci for rheumatoid arthritis (RA). A total of 460 cases and 460 unaffected controls were genotyped at 2,300 single-nucleotide polymorphisms (SNPs) by the North American Rheumatoid Arthritis Consortium. Using a multimarker approach for association mapping under the framework of the Malecot model and composite likelihood, we identified a region showing significant association with RA (p < 0.002); the predicted disease locus was at a genomic location of 53,306 kb with a 95% confidence interval (CI) of 53,295–53,331 kb. A common haplotype in this region was protective against RA (p = 0.002). In another region showing nominally significant association (51,585 kb, 95% CI: 51,541–51,628 kb, p = 0.037), a haplotype was also protective (p = 0.002). We further demonstrated that reducing SNP density decreased the power and accuracy of association mapping. SNP selection based on equal linkage disequilibrium (LD) distance generally produced higher accuracy than that based on equal kilobase distance or tagging.
PMCID: PMC2367513  PMID: 18466494
11.  The Association Between Genetic Variants in SORL1 and Alzheimer’s Disease in an Urban, Multiethnic, Community-Based Cohort 
Archives of neurology  2007;64(4):501-506.
Context
Variants in 3′ and 5′ regions of SORL1, the neuronal sorting protein-related receptor, were recently found to be associated with late onset familial and sporadic Alzheimer’s disease in several datasets that were selected for familial aggregation or were ethnically diverse or clinic-based selected series.
Objective
To investigate the association between Alzheimer’s disease and variant alleles in SORL1 using a series of single nucleotide polymorphisms (SNPs) in an urban, multiethnic community-based population.
Design & Setting
We used a nested case-control analysis in a population-based, prospective study of aging and dementia in Medicare recipients, 65 years and older, residing in northern Manhattan.
Participants
There were 296 patients with probable Alzheimer’s disease and 428 healthy elderly controls. The participants were African American (34%), Caribbean Hispanic (51%), or non-Hispanic white (15%).
Main Outcome Measures
We genotyped all 29 SNPs in SORL1 that were examined in the earlier report. We assessed allelic association with AD using standard case-control methods which included APOE genotype as a covariate.
Results
Several individual SNPs and SNP haplotypes were significantly associated with AD in this prospectively collected community-based cohort, confirming the previously reported positive association of SORL1 with Alzheimer’s disease. SNP 12 near the 5′ region was associated with AD in African-Americans and Hispanics. Two SNPs in the 3′ region were also associated with AD in African-Americans (SNP 26) and Whites (SNP 20). A single haplotype in the 3′ region was associated with AD in Hispanics. However, several different haplotypes were associated with AD in the African-Americans and Whites, including the “TTC” haplotype at SNPs 23–25 (p=0.035) that was significantly associated with AD in the North European Whites in the previous report.
Conclusions
This study confirms the association between genetic variants in SORL1 and AD. While the associations observed in these datasets overlap with those previously reported, the finding of novel SNP and haplotype associations suggests that there may be extensive allelic heterogeneity in SORL1. Broad regions of the SORL1 gene will therefore need to be scrutinized for functional pathogenic variants.
doi:10.1001/archneur.64.4.501
PMCID: PMC2639214  PMID: 17420311
SORL1; Alzheimer’s disease; sporadic; African American; Caribbean Hispanic
12.  A Genome-Wide Association Study Confirms VKORC1, CYP2C9, and CYP4F2 as Principal Genetic Determinants of Warfarin Dose 
PLoS Genetics  2009;5(3):e1000433.
We report the first genome-wide association study (GWAS) whose sample size (1,053 Swedish subjects) is sufficiently powered to detect genome-wide significance (p < 1.5 × 10⁻⁷) for polymorphisms that modestly alter therapeutic warfarin dose. The anticoagulant drug warfarin is widely prescribed for reducing the risk of stroke, thrombosis, pulmonary embolism, and coronary malfunction. However, Caucasians vary widely (20-fold) in the dose needed for therapeutic anticoagulation, and hence prescribed doses may be too low (risking serious illness) or too high (risking severe bleeding). Prior work established that ∼30% of the dose variance is explained by single nucleotide polymorphisms (SNPs) in the warfarin drug target VKORC1 and another ∼12% by two non-synonymous SNPs (*2, *3) in the cytochrome P450 warfarin-metabolizing gene CYP2C9. We initially tested each of 325,997 GWAS SNPs for association with warfarin dose by univariate regression and found the strongest statistical signals (p < 10⁻⁷⁸) at SNPs clustering near VKORC1 and the second lowest p-values (p < 10⁻³¹) emanating from CYP2C9. No other SNPs approached genome-wide significance. To enhance detection of weaker effects, we conducted multiple regression adjusting for known influences on warfarin dose (VKORC1, CYP2C9, age, gender) and identified a single SNP (rs2108622) with genome-wide significance (p = 8.3 × 10⁻¹⁰) that alters protein coding of the CYP4F2 gene. We confirmed this result in 588 additional Swedish patients (p < 0.0029) and, during our investigation, a second group provided independent confirmation from a scan of warfarin-metabolizing genes. We also thoroughly investigated copy number variations, haplotypes, and imputed SNPs, but found no additional highly significant warfarin associations. We present a power analysis of our GWAS that is generalizable to other studies, and conclude we had 80% power to detect genome-wide significance for common causative variants or markers explaining at least 1.5% of dose variance. These GWAS results provide further impetus for conducting large-scale trials assessing patient benefit from genotype-based forecasting of warfarin dose.
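The genome-wide significance threshold quoted in this abstract is numerically consistent with a Bonferroni correction of a 0.05 error rate over the 325,997 SNPs tested. The abstract does not say how the threshold was derived, so the short check below treats that as an assumption; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# 325,997 is the number of GWAS SNPs stated in the abstract;
# alpha = 0.05 is an assumed family-wise error rate.
threshold = bonferroni_threshold(0.05, 325_997)
print(f"{threshold:.3e}")
```

Under that assumption the result is about 1.5 × 10⁻⁷, matching the reported cutoff.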
Author Summary
Recently, geneticists have begun assaying hundreds of thousands of genetic markers covering the entire human genome to systematically search for and identify genes that cause disease. We have extended this “genome-wide association study” (GWAS) method by assaying ∼326,000 markers in 1,053 Swedish patients in order to identify genes that alter response to the anticoagulant drug warfarin. Warfarin is widely prescribed to reduce blood clotting in order to protect high-risk patients from stroke, thrombosis, and heart attack. But patients vary widely (20-fold) in the warfarin dose needed for proper blood thinning, which means that initial doses in some patients are too high (risking severe bleeding) or too low (risking serious illness). Our GWAS detected two genes (VKORC1, CYP2C9) already known to cause ∼40% of the variability in warfarin dose and discovered a new gene (CYP4F2) contributing 1%–2% of the variability. Since our GWAS searched the entire genome, additional genes having a major influence on warfarin dose might not exist or be found in the near-term. Hence, clinical trials assessing patient benefit from individualized dose forecasting based on a patient's genetic makeup at VKORC1, CYP2C9 and possibly CYP4F2 could provide state-of-the-art clinical benchmarks for warfarin use during the foreseeable future.
doi:10.1371/journal.pgen.1000433
PMCID: PMC2652833  PMID: 19300499
13.  Rare variation at the TNFAIP3 locus and susceptibility to rheumatoid arthritis 
Human Genetics  2010;128(6):627-633.
Genome-wide association studies (GWAS) conducted using commercial single nucleotide polymorphism (SNP) arrays have proven to be a powerful tool for the detection of common disease susceptibility variants. However, their utility for the detection of lower frequency variants is yet to be practically investigated. Here we describe the application of a rare variant collapsing method to a large genome-wide SNP dataset, the Wellcome Trust Case Control Consortium rheumatoid arthritis (RA) GWAS. We partitioned the data into gene-centric bins and collapsed genotypes of low frequency variants (defined here as MAF ≤0.05) into a single count coupled with univariate analysis. We then prioritised gene regions for further investigation in an independent cohort of 3,355 cases and 2,427 controls based on rare variant signal p value and prior evidence to support involvement in RA. A total of 14,536 gene bins were investigated in the primary analysis and signals mapping to the TNFAIP3 and chr17q24 loci were selected for further investigation. We detected replicating association to low frequency variants in the TNFAIP3 gene (combined p = 6.6 × 10⁻⁶). Even though rare variants are not well-represented and can be difficult to genotype in GWAS, our study supports the application of low frequency variant collapsing methods to genome-wide SNP datasets as a means of exploiting data that are routinely ignored.
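The collapsing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0/1/2 genotype coding, and the dictionary layout for a gene bin are all assumptions. Only the MAF ≤0.05 threshold comes from the abstract.

```python
from typing import Dict, List


def collapse_rare_variants(
    bin_genotypes: Dict[str, List[int]], maf_threshold: float = 0.05
) -> List[int]:
    """Collapse low-frequency variants in one gene bin into a single count.

    bin_genotypes maps a SNP identifier to per-individual genotypes coded
    as 0, 1 or 2 copies of a reference-counted allele (assumed coding).
    Returns, for each individual, the number of minor alleles carried
    across all SNPs in the bin whose MAF is <= maf_threshold.
    """
    n_individuals = len(next(iter(bin_genotypes.values())))
    counts = [0] * n_individuals
    for genotypes in bin_genotypes.values():
        freq = sum(genotypes) / (2 * len(genotypes))
        # If the counted allele is the major one, flip so we count the minor allele.
        flip = freq > 0.5
        maf = 1.0 - freq if flip else freq
        if maf > maf_threshold:
            continue  # common variant: left out of the collapsed count
        for i, g in enumerate(genotypes):
            counts[i] += (2 - g) if flip else g
    return counts


if __name__ == "__main__":
    gene_bin = {
        "rare1": [1] + [0] * 19,        # MAF 0.025, collapsed
        "rare2": [0, 1] + [0] * 18,     # MAF 0.025, collapsed
        "common": [1] * 10 + [0] * 10,  # MAF 0.25, ignored
    }
    print(collapse_rare_variants(gene_bin))
```

The resulting per-individual count can then be used as a single predictor in a univariate case-control test for the bin.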
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-010-0889-1) contains supplementary material, which is available to authorized users.
doi:10.1007/s00439-010-0889-1
PMCID: PMC2978888  PMID: 20852893
14.  Regional replication of association with refractive error on 15q14 and 15q25 in the Age-Related Eye Disease Study cohort 
Molecular Vision  2013;19:2173-2186.
Purpose
Refractive error is a complex trait with multiple genetic and environmental risk factors, and is the most common cause of preventable blindness worldwide. The common nature of the trait suggests the presence of many genetic factors that individually may have modest effects. To achieve an adequate sample size to detect these common variants, large, international collaborations have formed. These consortia typically use meta-analysis to combine multiple studies from many different populations. This approach is robust to differences between populations; however, it does not compensate for the different haplotypes in each genetic background evidenced by different alleles in linkage disequilibrium with the causative variant. We used the Age-Related Eye Disease Study (AREDS) cohort to replicate published significant associations at two loci on chromosome 15 from two genome-wide association studies (GWASs). The single nucleotide polymorphisms (SNPs) that exhibited association on chromosome 15 in the original studies did not show evidence of association with refractive error in the AREDS cohort. This paper seeks to determine whether the non-replication in this AREDS sample may be due to the limited number of SNPs chosen for replication.
Methods
We selected all SNPs genotyped on the Illumina Omni2.5v1_B array or custom TaqMan assays or imputed from the GWAS data, in the region surrounding the SNPs from the Consortium for Refractive Error and Myopia study. We analyzed the SNPs for association with refractive error using standard regression methods in PLINK. The effective number of tests was calculated using the Genetic Type I Error Calculator.
Results
Although the same SNPs used in the Consortium for Refractive Error and Myopia study did not show any evidence of association with refractive error in this AREDS sample, other SNPs within the candidate regions demonstrated an association with refractive error. Significant evidence of association was found using the hyperopia categorical trait, with the most significant SNPs rs1357179 on 15q14 (p=1.69×10⁻³) and rs7164400 on 15q25 (p=8.39×10⁻⁴), which passed the replication thresholds.
Conclusions
This study adds to the growing body of evidence that the most significant SNPs found in one population may not be significant in another population due to differences in the linkage disequilibrium structure and/or allele frequency. This suggests that replication studies should include less significant SNPs in an associated region rather than only a few selected SNPs chosen by a significance threshold.
PMCID: PMC3826323  PMID: 24227913
15.  Single Nucleotide Polymorphism (SNP)-Strings: An Alternative Method for Assessing Genetic Associations 
PLoS ONE  2014;9(4):e90034.
Background
Genome-wide association studies (GWAS) identify disease-associations for single-nucleotide-polymorphisms (SNPs) from scattered genomic-locations. However, SNPs frequently reside on several different SNP-haplotypes, only some of which may be disease-associated. This circumstance lowers the observed odds-ratio for disease-association.
Methodology/Principal Findings
Here we develop a method to identify the two SNP-haplotypes, which combine to produce each person’s SNP-genotype over specified chromosomal segments. Two multiple sclerosis (MS)-associated genetic regions were modeled: DRB1 (a Class II molecule of the major histocompatibility complex) and MMEL1 (an endopeptidase that degrades both neuropeptides and β-amyloid). For each locus, we considered sets of eleven adjacent SNPs, surrounding the putative disease-associated gene and spanning ∼200 kb of DNA. The SNP-information was converted into an ordered-set of eleven-numbers (subject-vectors) based on whether a person had zero, one, or two copies of a particular SNP-variant at each sequential SNP-location. SNP-strings were defined as those ordered-combinations of eleven-numbers (0 or 1), representing a haplotype, two of which combined to form the observed subject-vector. Subject-vectors were resolved using probabilistic methods. In both regions, only a small number of SNP-strings were present. We compared our method to the SHAPEIT-2 phasing-algorithm. When the SNP-information spanning 200 kb was used, SHAPEIT-2 was inaccurate. When the SHAPEIT-2 window was increased to 2,000 kb, the concordance between the two methods, in both of these eleven-SNP regions, was over 99%, suggesting that, in these regions, both methods were quite accurate. Nevertheless, correspondence was not uniformly high over the entire DNA-span but, rather, was characterized by alternating peaks and valleys of concordance. Moreover, in the valleys of poor-correspondence, SHAPEIT-2 was also inconsistent with itself, suggesting that the SNP-string method is more accurate across the entire region.
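To make the relationship between subject-vectors and SNP-strings concrete, the sketch below enumerates every unordered pair of 0/1 strings that sums to a given 0/1/2 subject-vector. It covers only this combinatorial step; the probabilistic resolution among candidate pairs described in the abstract is not attempted, and the function name `compatible_string_pairs` is hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

SnpString = Tuple[int, ...]


def compatible_string_pairs(
    subject_vector: Sequence[int],
) -> List[Tuple[SnpString, SnpString]]:
    """List all unordered pairs of 0/1 strings that add up to subject_vector.

    subject_vector holds 0, 1 or 2 copies of a variant at each SNP location.
    Homozygous sites (0 or 2) fix both strings; each heterozygous site (1)
    can be split either way, so m heterozygous sites give 2**(m-1) pairs
    (or a single pair when m == 0).
    """
    if any(g not in (0, 1, 2) for g in subject_vector):
        raise ValueError("subject-vector entries must be 0, 1 or 2")
    het_sites = [i for i, g in enumerate(subject_vector) if g == 1]
    base = [1 if g == 2 else 0 for g in subject_vector]
    pairs = set()
    for assignment in product((0, 1), repeat=len(het_sites)):
        first, second = list(base), list(base)
        for site, allele in zip(het_sites, assignment):
            first[site] = allele
            second[site] = 1 - allele
        # Sort within the pair so (a, b) and (b, a) are counted once.
        pairs.add(tuple(sorted((tuple(first), tuple(second)))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in compatible_string_pairs((2, 1, 0, 1)):
        print(pair)
```

Because the number of candidate pairs doubles with each additional heterozygous site, some population-level information is needed to choose among them, which is where a probabilistic method comes in.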
Conclusions/Significance
Accurate haplotype identification will enhance the detection of genetic-associations. The SNP-string method provides a simple means to accomplish this and can be extended to cover larger genomic regions, thereby improving a GWAS’s power, even for those published previously.
doi:10.1371/journal.pone.0090034
PMCID: PMC3984082  PMID: 24727690
16.  Short Reads Phasing to Construct Haplotypes in Genomic Regions That Are Associated with Body Mass Index in Korean Individuals 
Genomics & Informatics  2014;12(4):165-170.
Genome-wide association (GWA) studies have found many important genetic variants that affect various traits. Since these studies are useful to investigate untyped but causal variants using linkage disequilibrium (LD), it would be useful to explore the haplotypes of single-nucleotide polymorphisms (SNPs) within the same LD block of significant associations based on high-density variants from population references. Here, we tried to make a haplotype catalog affecting body mass index (BMI) through an integrative analysis of previously published whole-genome next-generation sequencing (NGS) data of 7 representative Korean individuals and previously known Korean GWA signals. We selected 435 SNPs that were significantly associated with BMI from the GWA analysis and searched 53 LD ranges near those SNPs. With the NGS data, the haplotypes were phased within these LD ranges. A total of 44 possible haplotype blocks for Korean BMI were cataloged. Although the current result constitutes little data, this study provides new insights that may help to identify important haplotypes for traits and low-frequency variants near significant SNPs. Furthermore, we can build a more comprehensive catalog as a larger dataset becomes available.
doi:10.5808/GI.2014.12.4.165
PMCID: PMC4330250
genome-wide association study; haplotypes; Korea; NGS; phasing; single-nucleotide polymorphism
17.  Identification of polymorphic inversions from genotypes 
BMC Bioinformatics  2012;13:28.
Background
Polymorphic inversions are a source of genetic variability with a direct impact on recombination frequencies. Given the difficulty of their experimental study, computational methods have been developed to infer their existence in a large number of individuals using genome-wide data of nucleotide variation. Methods based on haplotype tagging of known inversions attempt to classify individuals as having a normal or inverted allele. Other methods that measure differences between linkage disequilibrium attempt to identify regions with inversions but are unable to classify subjects accurately, an essential requirement for association studies.
Results
We present a novel method to both identify polymorphic inversions from genome-wide genotype data and classify individuals as containing a normal or inverted allele. Our method, a generalization of a published method for haplotype data [1], utilizes linkage between groups of SNPs to partition a set of individuals into normal and inverted subpopulations. We employ a sliding window scan to identify regions likely to have an inversion, and accumulation of evidence from neighboring SNPs is used to accurately determine the inversion status of each subject. Further, our approach detects inversions directly from genotype data, thus increasing its usability to current genome-wide association studies (GWAS).
Conclusions
We demonstrate the accuracy of our method to detect inversions and classify individuals on principled-simulated genotypes, produced by the evolution of an inversion event within a coalescent model [2]. We applied our method to real genotype data from HapMap Phase III to characterize the inversion status of two known inversions within the regions 17q21 and 8p23 across 1184 individuals. Finally, we scan the full genomes of the European Origin (CEU) and Yoruba (YRI) HapMap samples. We find population-based evidence for 9 out of 15 well-established autosomal inversions, and for 52 regions previously predicted by independent experimental methods in ten (9+1) individuals [3,4]. We provide efficient implementations of both genotype and haplotype methods as a unified R package inveRsion.
doi:10.1186/1471-2105-13-28
PMCID: PMC3296650  PMID: 22321652
18.  Inference of disease associations with unmeasured genetic variants by combining results from genome-wide association studies with linkage disequilibrium patterns in a reference data set 
BMC Proceedings  2009;3(Suppl 7):S55.
Results from whole-genome association studies of many common diseases are now available. Increasingly, these are being incorporated into meta-analyses to increase the power to detect weak associations with measured single-nucleotide polymorphisms (SNPs). Imputation of genotypes at unmeasured loci has been widely applied using patterns of linkage disequilibrium (LD) observed in the HapMap panels, but there is a need for alternative methods that can utilize the pooled effect estimates from meta-analyses and explore possible associations with SNPs and haplotypes that are not included in HapMap.
By a weighted average technique, we show that association results for common SNPs in an observed data set can be scaled and combined to infer the effect of a genetic variant that has been measured only in an independent reference data set. We show that the ratio p(R−1)/[1 + p(R−1)], where R is the relative risk associated with a measured or unmeasured allele of frequency p, is appropriately scaled by 1/D′ and weighted in proportion to r², both common measures of LD being derived from the reference data set.
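The sketch below shows one possible reading of this weighted-average description. The abstract gives only the ratio p(R−1)/[1 + p(R−1)], the 1/D′ scaling, and the r² weighting; how these are combined here (an r²-weighted mean of the scaled ratios), the function names, and the input layout are assumptions made for illustration and may differ from the authors' actual estimator.

```python
from typing import Iterable, Tuple


def risk_ratio_term(p: float, relative_risk: float) -> float:
    """The ratio p(R-1) / [1 + p(R-1)] for an allele of frequency p."""
    excess = p * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def relative_risk_from_term(p: float, term: float) -> float:
    """Invert risk_ratio_term: recover R from the ratio and allele frequency."""
    return 1.0 + term / (p * (1.0 - term))


def inferred_term_for_unmeasured_variant(
    measured: Iterable[Tuple[float, float, float, float]],
) -> float:
    """r2-weighted average of ratio terms, each scaled by 1/D'.

    Each element of `measured` is (p, R, d_prime, r2) for one measured SNP,
    where d_prime and r2 are its LD with the unmeasured variant in the
    reference data set. ASSUMED combination rule, not taken from the paper.
    """
    numerator = 0.0
    total_weight = 0.0
    for p, relative_risk, d_prime, r2 in measured:
        numerator += r2 * risk_ratio_term(p, relative_risk) / d_prime
        total_weight += r2
    if total_weight == 0.0:
        raise ValueError("at least one measured SNP must have r2 > 0")
    return numerator / total_weight


if __name__ == "__main__":
    snps = [(0.30, 1.20, 0.90, 0.60), (0.25, 1.15, 0.80, 0.30)]
    term = inferred_term_for_unmeasured_variant(snps)
    print(term, relative_risk_from_term(0.20, term))
```

The inverse function is included so that an inferred ratio can be converted back to a relative risk for an allele of known frequency.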
We illustrate this computationally simple method by combining the results of a genome-wide association screen from the North American Rheumatoid Arthritis Consortium with LD measures from the British 1958 Birth Cohort, and explore the validity of underlying assumptions about the generalizability of LD from one population to another, and from healthy subjects to subjects with clinical disease.
PMCID: PMC2795955  PMID: 20018048
19.  Racial or ethnic differences in allele frequencies of single‐nucleotide polymorphisms in the methylenetetrahydrofolate reductase gene and their influence on response to methotrexate in rheumatoid arthritis 
Annals of the Rheumatic Diseases  2006;65(9):1213-1218.
Background
The anti‐folate drug methotrexate (MTX) is commonly used to treat rheumatoid arthritis.
Objective
To determine the allele frequencies of five common coding single‐nucleotide polymorphisms (SNPs) in the methylenetetrahydrofolate reductase (MTHFR) gene in African‐Americans and Caucasians with rheumatoid arthritis and controls to assess whether there are differences in allele frequencies among these ethnic or racial groups and whether these SNPs differentially affect the efficacy or toxicity of MTX.
Methods
Allele frequencies in the 677, 1298 and 3 additional SNPs in the MTHFR coding region in 223 (193 Caucasians and 30 African‐Americans) patients with rheumatoid arthritis who previously participated in one of two prospective clinical trials were characterised, and genotypes were correlated with the efficacy and toxicity of MTX. Another 308 subjects with rheumatoid arthritis who participated in observational studies, one group predominantly Caucasian and the other African‐American, as well as 103 normal controls (53 African‐Americans and 50 Caucasians) were used to characterise allele frequencies of these SNPs and their associated haplotypes.
Results
Significantly different allele frequencies were seen in three of the five SNPs and haplotype frequencies between Caucasians and African‐Americans. Allele frequencies were similar between patients with rheumatoid arthritis and controls of the same racial or ethnic group. Frequencies of the rs4846051C, 677T and 1298C alleles were 0.33, 0.11 and 0.13, respectively, among African‐Americans with rheumatoid arthritis. Among Caucasians with rheumatoid arthritis, these allele frequencies were 0.08 (p<0.001 compared with African‐Americans with rheumatoid arthritis), 0.30 (p = 0.002) and 0.34 (p<0.001), respectively. There was no association between SNP alleles or haplotypes and response to MTX as measured by the mean change in the 28‐joint Disease Activity Score from baseline values. In Caucasians, the 1298 A (major) allele was associated with a significant increase in MTX‐related adverse events characteristic of a recessive genetic effect (odds ratio 15.86, 95% confidence interval 1.51 to 167.01; p = 0.021), confirming previous reports. There was an association between scores of MTX toxicity and the rs4846051 C allele, and haplotypes containing this allele, in African‐Americans, but not in Caucasians.
Conclusions
These results, although preliminary, highlight racial or ethnic differences in frequencies of common MTHFR SNPs. The MTHFR 1298 A and the rs4846051 C alleles were associated with MTX‐related adverse events in Caucasians and African‐Americans, respectively, but these findings should be replicated in larger studies. The rs4846051 SNP, which is far more common in African‐Americans than in Caucasians, may also prove to be a useful ancestry informative marker in future studies on genetic admixture.
doi:10.1136/ard.2005.046797
PMCID: PMC1798268  PMID: 16439441
20.  MixSIH: a mixture model for single individual haplotyping 
BMC Genomics  2013;14(Suppl 2):S5.
Background
Haplotype information is useful for various genetic analyses, including genome-wide association studies. Determining haplotypes experimentally is difficult and there are several computational approaches that infer haplotypes from genomic data. Among such approaches, single individual haplotyping or haplotype assembly, which infers two haplotypes of an individual from aligned sequence fragments, has been attracting considerable attention. To avoid incorrect results in downstream analyses, it is important not only to assemble haplotypes as long as possible but also to provide means to extract highly reliable haplotype regions. Although there are several efficient algorithms for solving haplotype assembly, there is no efficient method that allows for extracting the regions assembled with high confidence.
Results
We develop a probabilistic model, called MixSIH, for solving the haplotype assembly problem. The model has two mixture components representing two haplotypes. Based on the optimized model, a quality score is defined, which we call the 'minimum connectivity' (MC) score, for each segment in the haplotype assembly. Because existing accuracy measures for haplotype assembly are designed to compare the efficiency between the algorithms and are not suitable for evaluating the quality of the set of partially assembled haplotype segments, we develop an accuracy measure based on the pairwise consistency and evaluate the accuracy on the simulation and real data. By using the MC scores, our algorithm can extract highly accurate haplotype segments. We also show evidence that an existing experimental dataset contains chimeric read fragments derived from different haplotypes, which significantly degrade the quality of assembled haplotypes.
Conclusions
We develop a novel method for solving the haplotype assembly problem. We also define a quality score that is based on our model and indicates the accuracy of the haplotype segments. In our evaluation, MixSIH has successfully extracted reliable haplotype segments. The C++ source code of MixSIH is available at https://sites.google.com/site/hmatsu1226/software/mixsih.
doi:10.1186/1471-2164-14-S2-S5
PMCID: PMC3582441  PMID: 23445519
21.  Genetic variants associated with idiopathic pulmonary fibrosis susceptibility and mortality: a genome-wide association study 
The lancet. Respiratory medicine  2013;1(4):309-317.
Summary
Background
Idiopathic pulmonary fibrosis (IPF) is a devastating disease that probably involves several genetic loci. Several rare genetic variants and one common single nucleotide polymorphism (SNP) of MUC5B have been associated with the disease. Our aim was to identify additional common variants associated with susceptibility and ultimately mortality in IPF.
Methods
First, we did a three-stage genome-wide association study (GWAS): stage one was a discovery GWAS; and stages two and three were independent case-control studies. DNA samples from European-American patients with IPF meeting standard criteria were obtained from several US centres for each stage. Data for European-American control individuals for stage one were gathered from the database of genotypes and phenotypes; additional control individuals were recruited at the University of Pittsburgh to increase the number. For controls in stages two and three, we gathered data for additional sex-matched European-American control individuals who had been recruited in another study. DNA samples from patients and from control individuals were genotyped to identify SNPs associated with IPF. SNPs identified in stage one were carried forward to stage two, and those that achieved genome-wide significance (p<5 × 10⁻⁸) in a meta-analysis were carried forward to stage three. Three case series with follow-up data were selected from stages one and two of the GWAS using samples with follow-up data. Mortality analyses were done in these case series to assess the SNPs associated with IPF that had achieved genome-wide significance in the meta-analysis of stages one and two. Finally, we obtained gene-expression profiling data for lungs of patients with IPF from the Lung Genomics Research Consortium and analysed correlation with SNP genotypes.
Findings
In stage one of the GWAS (542 patients with IPF, 542 control individuals matched one-by-one to cases by genetic ancestry estimates), we identified 20 loci. Six SNPs reached genome-wide significance in stage two (544 patients, 687 control individuals): three TOLLIP SNPs (rs111521887, rs5743894, rs5743890) and one MUC5B SNP (rs35705950) at 11p15.5; one MDGA2 SNP (rs7144383) at 14q21.3; and one SPPL2C SNP (rs17690703) at 17q21.31. Stage three (324 patients, 702 control individuals) confirmed the associations for all these SNPs, except for rs7144383. Linkage disequilibrium between the MUC5B SNP (rs35705950) and TOLLIP SNPs (rs111521887 [r²=0.07], rs5743894 [r²=0.16], and rs5743890 [r²=0.01]) was low. 683 patients from the GWAS were included in the mortality analysis. Individuals who developed IPF despite having the protective TOLLIP minor allele of rs5743890 carried an increased mortality risk (meta-analysis with fixed-effect model: hazard ratio 1.72 [95% CI 1.24–2.38]; p=0.0012). TOLLIP expression was decreased by 20% in individuals carrying the minor allele of rs5743890 (p=0.097), 40% in those with the minor allele of rs111521887 (p=3.0 × 10⁻⁴), and 50% in those with the minor allele of rs5743894 (p=2.93 × 10⁻⁵) compared with homozygous carriers of common alleles for these SNPs.
Interpretation
Novel variants in TOLLIP and SPPL2C are associated with IPF susceptibility. One novel variant of TOLLIP, rs5743890, is also associated with mortality. These associations and the reduced expression of TOLLIP in patients with IPF who carry TOLLIP SNPs emphasise the importance of this gene in the disease.
Funding
National Institutes of Health; National Heart, Lung, and Blood Institute; Pulmonary Fibrosis Foundation; Coalition for Pulmonary Fibrosis; and Instituto de Salud Carlos III.
doi:10.1016/S2213-2600(13)70045-6
PMCID: PMC3894577  PMID: 24429156
22.  Analysis of genome-wide association data by large-scale Bayesian logistic regression 
BMC Proceedings  2009;3(Suppl 7):S16.
Single-locus analysis is often used to analyze genome-wide association (GWA) data, but such analysis is subject to severe multiple comparisons adjustment. Multivariate logistic regression is proposed to fit a multi-locus model for case-control data. However, when the sample size is much smaller than the number of single-nucleotide polymorphisms (SNPs) or when correlation among SNPs is high, traditional multivariate logistic regression breaks down. To accommodate the scale of data from a GWA while controlling for collinearity and overfitting in a high dimensional predictor space, we propose a variable selection procedure using Bayesian logistic regression. We explored a connection between Bayesian regression with certain priors and L1 and L2 penalized logistic regression. After analyzing a large number of SNPs simultaneously in a Bayesian regression, we selected important SNPs for further consideration. With far fewer SNPs of interest, problems of multiple comparisons and collinearity are less severe. We conducted simulation studies to examine the probability of correctly selecting disease-contributing SNPs and applied the developed methods to analyze Genetic Analysis Workshop 16 North American Rheumatoid Arthritis Consortium data.
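The connection between priors and penalties mentioned above can be written out for the standard textbook case. The abstract does not name the priors used, so the Laplace and Gaussian choices below are assumptions. Here ℓ(β) denotes the logistic log-likelihood and the posterior mode is taken over the regression coefficients β_j:

```latex
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\Big[\ell(\beta) + \sum_{j}\log \pi(\beta_j)\Big]

% Laplace (double-exponential) prior gives an L1 penalty:
\pi(\beta_j) = \tfrac{\lambda}{2}\, e^{-\lambda|\beta_j|}
  \;\Longrightarrow\;
  \hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\Big[\ell(\beta) - \lambda \sum_{j}|\beta_j|\Big]

% Gaussian prior N(0, \sigma^2) gives an L2 penalty:
\pi(\beta_j) = \tfrac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\beta_j^{2}/(2\sigma^{2})}
  \;\Longrightarrow\;
  \hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\Big[\ell(\beta) - \tfrac{1}{2\sigma^{2}} \sum_{j}\beta_j^{2}\Big]
```

In both cases the additive constants from the prior normalisation drop out of the maximisation, which is why the posterior mode coincides with the penalized maximum-likelihood estimate.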
PMCID: PMC2795912  PMID: 20018005
23.  Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies 
PLoS Genetics  2008;4(7):e1000130.
Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.
Author Summary
Tests of association with disease status are normally conducted one SNP at a time, ignoring the effects of all other genotyped SNPs. We developed a computationally efficient method to simultaneously analyse all SNPs, either in a genome-wide association (GWA) study, or a fine-mapping study based on re-sequencing and/or imputation. The method selects a subset of SNPs that best predicts disease status, while controlling the type-I error of the selected SNPs. This brings many advantages over standard single-SNP approaches, because the signal from a particular SNP can be more clearly assessed when other SNPs associated with disease status are already included in the model. Thus, in comparison with single-SNP analyses, power is increased and the false positive rate is reduced because of reduced residual variation. Localisation is also greatly improved. We demonstrate these advantages over the widely used single-SNP Armitage Trend Test using GWA simulation studies, a real GWA dataset, and a sequence-based fine-mapping simulation study.
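For reference, the single-SNP Armitage trend test used as the comparison baseline can be sketched as below. This is the standard Cochran-Armitage statistic with additive weights (0, 1, 2), not code from the paper; the function name and argument layout are illustrative assumptions.

```python
from math import erfc, sqrt
from typing import Sequence, Tuple


def armitage_trend_test(
    case_counts: Sequence[int],
    control_counts: Sequence[int],
    weights: Sequence[float] = (0, 1, 2),
) -> Tuple[float, float]:
    """Cochran-Armitage trend test for one SNP in a case-control sample.

    case_counts / control_counts give the number of individuals with
    0, 1 and 2 copies of the tested allele. Returns (chi-square, p-value),
    where the statistic has 1 degree of freedom under the null.
    """
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n_cases, n_controls = sum(case_counts), sum(control_counts)
    n = n_cases + n_controls
    sum_wr = sum(w * r for w, r in zip(weights, case_counts))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    # Trend statistic T and the pieces of its null variance.
    t_stat = n * sum_wr - n_cases * sum_wn
    denom = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    if denom == 0:
        raise ValueError("test undefined: no variation in genotype or status")
    chi2 = n * t_stat ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    print(armitage_trend_test((10, 20, 30), (30, 20, 10)))
```

Running this once per SNP and ignoring all others is the one-at-a-time strategy that the simultaneous-analysis method is contrasted with.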
doi:10.1371/journal.pgen.1000130
PMCID: PMC2464715  PMID: 18654633
24.  Combined genotype and haplotype tests for region-based association studies 
BMC Genomics  2013;14:569.
Background
Although single-SNP analysis has proven to be useful in identifying many disease-associated loci, region-based analysis has several advantages. Empirically, it has been shown that region-based genotype and haplotype approaches may possess much higher power than single-SNP statistical tests. Both high quality haplotypes and genotypes may be available for analysis given the development of next generation sequencing technologies and haplotype assembly algorithms.
Results
As generally it is unknown whether genotypes or haplotypes are more relevant for identifying an association, we propose to use both of them with the purpose of preserving high power under both genotype and haplotype disease scenarios. We suggest two approaches for a combined association test and investigate the performance of these two approaches based on a theoretical model, population genetics simulations and analysis of a real data set.
Conclusions
Based on a theoretical model, population genetics simulations and analysis of a central corneal thickness (CCT) Genome Wide Association Study (GWAS) data set, we have shown that a combined genotype and haplotype approach has a high potential utility for applications in association studies.
doi:10.1186/1471-2164-14-569
PMCID: PMC3852120  PMID: 23964661
Genotype-based tests; Haplotype-based tests; Association analysis; Test statistic combination
25.  Haplotype Block Structure Is Conserved across Mammals 
PLoS Genetics  2006;2(7):e121.
Genetic variation in genomes is organized in haplotype blocks, and species-specific block structure is defined by differential contribution of population history effects in combination with mutation and recombination events. Haplotype maps characterize the common patterns of linkage disequilibrium in populations and have important applications in the design and interpretation of genetic experiments. Although evolutionary processes are known to drive the selection of individual polymorphisms, their effect on haplotype block structure dynamics has not been shown. Here, we present a high-resolution haplotype map for a 5-megabase genomic region in the rat and compare it with the orthologous human and mouse segments. Although the size and fine structure of haplotype blocks are species dependent, there is a significant interspecies overlap in structure and a tendency for blocks to encompass complete genes. Extending these findings to the complete human genome using haplotype map phase I data reveals that linkage disequilibrium values are significantly higher for equally spaced positions in genic regions, including promoters, as compared to intergenic regions, indicating that a selective mechanism exists to maintain combinations of alleles within potentially interacting coding and regulatory regions. Although this characteristic may complicate the identification of causal polymorphisms underlying phenotypic traits, conservation of haplotype structure may be employed for the identification and characterization of functionally important genomic regions.
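The linkage disequilibrium values referred to above are usually summarised as D′ and r². The sketch below computes both for a pair of biallelic sites from haplotype and allele frequencies, using the standard definitions. It is illustrative only (the function name and inputs are assumptions) and does not reproduce the paper's pipeline.

```python
from typing import Tuple


def ld_measures(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float, float]:
    """Return (D, |D'|, r^2) for two biallelic sites.

    p_ab -- frequency of the haplotype carrying allele A and allele B
    p_a  -- frequency of allele A at the first site (0 < p_a < 1)
    p_b  -- frequency of allele B at the second site (0 < p_b < 1)
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b
    # D is normalised by its maximum attainable magnitude given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2


if __name__ == "__main__":
    print(ld_measures(0.5, 0.5, 0.5))    # alleles always co-occur
    print(ld_measures(0.25, 0.5, 0.5))   # independent sites
    print(ld_measures(0.10, 0.10, 0.40)) # rare allele nested in a common one
```

The third call illustrates why the two measures can disagree: when a rare allele occurs only on haplotypes carrying a common allele, D′ is maximal while r² stays well below 1.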
Synopsis
Differences at the DNA level are the major contributor underlying the phenotypic diversity between individuals in a population. The most common type of this genetic variation is the single nucleotide polymorphism (SNP). Although the majority of SNPs do not have a functional effect, others may affect chromosome organization, gene expression, or protein function. SNPs and their individual states (alleles) are not randomly distributed throughout the genome and within a population. Recombination and mutation events, in combination with selection processes and population history, have resulted in common block-like structures in genomes. These structures are characterized by a common combination of SNP alleles, a so-called haplotype. Selection for specific haplotypes within a population is primarily driven by the advantageous effect of an individual polymorphism in the haplotype block.
By comparing the orthologous rat, mouse, and human haplotype structure of a 5-megabase region from rat Chromosome 1, the authors now show that haplotype block structure is conserved across mammals, most prominently in genic regions, suggesting the existence of an evolutionary selection process that drives the conservation of long-range allele combinations. Indeed, genome-wide gene-centric analysis of human HapMap data revealed that equally spaced polymorphic positions in genic regions and their upstream regulatory regions are genetically more tightly linked than in non-genic regions.
These findings may complicate the identification of causal polymorphisms underlying phenotypic traits, because in regions where haplotype structure is conserved, not a single polymorphism, but rather combinations of tightly linked polymorphisms could contribute to the phenotypic difference. On the other hand, conservation of haplotype structure may be employed for the identification and characterization of functionally important genomic regions.
doi:10.1371/journal.pgen.0020121
PMCID: PMC1523234  PMID: 16895449
