Genome-wide association studies (GWAS) are a standard approach for studying the genetics of natural variation. A major concern in GWAS is the need to account for the complicated dependence structure of the data, both between loci and between individuals. Mixed models have emerged as a general and flexible approach for correcting for population structure in GWAS. Here we extend this linear mixed model approach to carry out GWAS of correlated phenotypes, deriving a fully parameterized multi-trait mixed model (MTMM) that considers both the within-trait and between-trait variance components simultaneously for multiple traits. We apply this to human cohort data for correlated blood lipid traits from the Northern Finland Birth Cohort 1966, and demonstrate greatly increased power to detect pleiotropic loci that affect more than one blood lipid trait. We also apply this to an Arabidopsis dataset for flowering measurements in two different locations, identifying loci whose effect depends on the environment.
Population structure causes genome-wide linkage disequilibrium between unlinked loci, leading to statistical confounding in genome-wide association studies. Mixed models have been shown to handle the confounding effects of a diffuse background of large numbers of loci of small effect well, but do not always account for loci of larger effect. Here we propose a multi-locus mixed model as a general method for mapping complex traits in structured populations. Simulations suggest that our method outperforms existing methods, in terms of power as well as false discovery rate. We apply our method to human and Arabidopsis thaliana data, identifying novel associations in known candidates as well as evidence for allelic heterogeneity. We also demonstrate how a priori knowledge from an A. thaliana linkage mapping study can be integrated into our method using a Bayesian approach. Our implementation is computationally efficient, making the analysis of large datasets (n > 10000) practicable.
Understanding the mechanism of cadmium (Cd) accumulation in plants is important to help reduce its potential toxicity to both plants and humans through dietary and environmental exposure. Here, we report on a study to uncover the genetic basis underlying natural variation in Cd accumulation in a world-wide collection of 349 wild collected Arabidopsis thaliana accessions. We identified a 4-fold variation (0.5–2 µg Cd g−1 dry weight) in leaf Cd accumulation when these accessions were grown in a controlled common garden. By combining genome-wide association mapping, linkage mapping in an experimental F2 population, and transgenic complementation, we reveal that HMA3 is the sole major locus responsible for the variation in leaf Cd accumulation we observe in this diverse population of A. thaliana accessions. Analysis of the predicted amino acid sequence of HMA3 from 149 A. thaliana accessions reveals the existence of 10 major natural protein haplotypes. Association of these haplotypes with leaf Cd accumulation and genetic complementation experiments indicate that 5 of these haplotypes are active and 5 are inactive, and that elevated leaf Cd accumulation is associated with the reduced function of HMA3 caused by a nonsense mutation and polymorphisms that change two specific amino acids.
Cadmium (Cd) is a potentially toxic metal pollutant that threatens food quality and human health in many regions of the world. Plants have evolved mechanisms for the acquisition of essential metals such as zinc and iron from the soil. Though often quite specific, such mechanisms can also lead to the accumulation of Cd by plants. Understanding natural variation in the processes that contribute to Cd accumulation in food crops could help minimize the human health risk posed. We have discovered that DNA sequence changes at a single gene, which encodes the Heavy Metal ATPase 3 (HMA3), drive the variation in Cd accumulation we observe in a world-wide sample of Arabidopsis thaliana. We identified 10 major HMA3 protein variants, of which five contribute to reduced Cd accumulation in leaves of A. thaliana.
Arabidopsis thaliana is native to Eurasia and naturalized across the world due to human disturbance. Its easy propagation and immense phenotypic variability make it an ideal model system for functional, ecological and evolutionary genetics. To date, analyses of its natural variation have involved small numbers of individuals or genetic markers. Here we genotype 1,307 world-wide accessions, including several regional samples, at 250K SNPs, enabling us to describe the global pattern of genetic variation with high resolution. Three complementary tests applied to these data reveal novel targets of selection. Furthermore, we characterize the pattern of historical recombination and observe an enrichment of hotspots in intergenic regions and repetitive DNA, consistent with the pattern observed for humans but strikingly different from other plant species. We are making seeds for this Regional Mapping (RegMap) panel publicly available; they comprise the largest genomic mapping resource available for a naturally occurring, non-human, species.
We present the 207 Mb genome sequence of the outcrosser Arabidopsis lyrata, which diverged from the self-fertilizing species A. thaliana about 10 million years ago. It is generally assumed that the much smaller A. thaliana genome, which is only 125 Mb, constitutes the derived state for the family. Apparent genome reduction in this genus can be partially attributed to the loss of DNA from large-scale rearrangements, but the main cause lies in the hundreds of thousands of small deletions found throughout the genome. These occurred primarily in non-coding DNA and transposons, but protein-coding multi-gene families are smaller in A. thaliana as well. Analysis of deletions and insertions still segregating in A. thaliana indicates that the process of DNA loss is ongoing, suggesting pervasive selection for a smaller genome.
Studies of the model plant Arabidopsis thaliana may seem to have little impact on advances in medical research, yet a survey of the scientific literature shows that this is a misconception. Many discoveries with direct relevance to human health and disease have been elaborated using Arabidopsis, and several processes important to human biology are more easily studied in this versatile model plant.
Genomic imprinting is an epigenetic phenomenon leading to parent-of-origin specific differential expression of maternally and paternally inherited alleles. In plants, genomic imprinting has mainly been observed in the endosperm, an ephemeral triploid tissue derived after fertilization of the diploid central cell with a haploid sperm cell. In an effort to identify novel imprinted genes in Arabidopsis thaliana, we generated deep sequencing RNA profiles of F1 hybrid seeds derived after reciprocal crosses of Arabidopsis Col-0 and Bur-0 accessions. Using polymorphic sites to quantify allele-specific expression levels, we could identify more than 60 genes with potential parent-of-origin specific expression. By analyzing the distribution of DNA methylation and epigenetic marks established by Polycomb group (PcG) proteins using publicly available datasets, we suggest that for maternally expressed genes (MEGs) repression of the paternally inherited alleles largely depends on DNA methylation or PcG-mediated repression, whereas repression of the maternal alleles of paternally expressed genes (PEGs) predominantly depends on PcG proteins. While maternal alleles of MEGs are also targeted by PcG proteins, such targeting does not cause complete repression. Candidate MEGs and PEGs are enriched for cis-proximal transposons, suggesting that transposons might be a driving force for the evolution of imprinted genes in Arabidopsis. In addition, we find that MEGs and PEGs are significantly faster evolving when compared to other genes in the genome. In contrast to the predominant location of mammalian imprinted genes in clusters, cluster formation was only detected for few MEGs and PEGs, suggesting that clustering is not a major requirement for imprinted gene regulation in Arabidopsis.
Genomic imprinting violates the Mendelian rules of inheritance, which state functional equality of maternally and paternally inherited alleles. Imprinted genes are expressed dependent on their parent-of-origin, implicating an epigenetic asymmetry of maternal and paternal alleles. Genomic imprinting occurs in mammals and flowering plants. In both groups of organisms, nourishing of the progeny depends on ephemeral tissues, the placenta and the endosperm, respectively. In plants, genomic imprinting predominantly occurs in the endosperm, which is derived after fertilization of the diploid central cell with a haploid sperm cell. In this study we identify more than 60 potentially imprinted genes and show that there are different epigenetic mechanisms causing maternal and paternal-specific gene expression. We show that maternally expressed genes are regulated by DNA methylation or Polycomb group (PcG)-mediated repression, while paternally expressed genes are predominantly regulated by PcG proteins. From an evolutionary perspective, we also show that imprinted genes are associated with transposons and are more rapidly evolving than other genes in the genome. Many MEGs and PEGs encode transcriptional regulators, implicating important functional roles of imprinted genes for endosperm and seed development.
We have explored the genetic basis of variation in vernalization requirement and response in Arabidopsis accessions, selected on the basis of their phenotypic distinctiveness. Phenotyping of F2 populations in different environments, plus fine mapping, indicated possible causative genes. Our data support the identification of FRI and FLC as candidates for the major-effect QTL underlying variation in vernalization response, and identify a weak FLC allele, caused by a Mutator-like transposon, contributing to flowering time variation in two N. American accessions. They also reveal a number of additional QTL that contribute to flowering time variation after saturating vernalization. One of these was the result of expression variation at the FT locus. Overall, our data suggest that distinct phenotypic variation in the vernalization and flowering response of Arabidopsis accessions is accounted for by variation that has arisen independently at relatively few major-effect loci.
Plants can defend themselves against a wide array of enemies, yet one of the most striking observations is the variability in the effectiveness of such defences, both within and between species. Some of this variation can be explained by conflicting pressures from pathogens with different modes of attack [1]. A second explanation comes from an evolutionary tug of war, in which pathogens adapt to evade detection, until the plant has evolved new recognition capabilities for pathogen invasion [2-5]. If selection is, however, sufficiently strong, susceptible hosts should remain rare. That this is not the case is best justified by costs incurred from constitutive defences in a pest free environment [6-11]. Using a combination of forward genetics and genome-wide association analyses, we demonstrate that allelic diversity at a single locus, ACCELERATED CELL DEATH 6 (ACD6) [12,13], underpins dramatic pleiotropic differences in both vegetative growth and resistance to microbial infection and herbivory among natural Arabidopsis thaliana strains. A hyperactive ACD6 allele, compared to the reference allele, strongly enhances resistance to a broad range of pathogens from different phyla, but at the same time slows the production of new leaves and greatly reduces the biomass of mature leaves. This allele segregates at intermediate frequency both throughout the worldwide range of A. thaliana and within local populations, consistent with this allele providing substantial fitness benefits despite its drastic impact on growth.
Although pioneered by human geneticists as a potential solution to the challenging problem of finding the genetic basis of common human diseases [1,2], advances in genotyping and sequencing technology have made genome-wide association (GWA) studies an obvious general approach for studying the genetics of natural variation and traits of agricultural importance. They are particularly useful when inbred lines are available because once these lines have been genotyped, they can be phenotyped multiple times, making it possible (as well as extremely cost-effective) to study many different traits in many different environments, while replicating the phenotypic measurements to reduce environmental noise. Here we demonstrate the power of this approach by carrying out a GWA study of 107 phenotypes in Arabidopsis thaliana, a widely distributed, predominantly selfing model plant, known to harbor considerable genetic variation for many adaptively important traits [3]. Our results are dramatically different from those of human GWA studies in that we identify many common alleles with major effect, but they are also, in many cases, harder to interpret because confounding by complex genetics and population structure makes it difficult to distinguish true from false associations. However, a priori candidates are significantly overrepresented among these associations as well, making many of them excellent candidates for follow-up experiments by the Arabidopsis community. Our study clearly demonstrates the feasibility of GWA studies in A. thaliana, and suggests that the approach will be appropriate for many other organisms.
With the advance of next-generation sequencing (NGS) technologies, increasingly ambitious applications are becoming feasible. A particularly powerful one is the sequencing of polymorphic, pooled samples. The pool can be naturally occurring, as in the case of multiple pathogen strains in a blood sample, multiple types of cells in a cancerous tissue sample, or multiple isoforms of mRNA in a cell. In these cases, it is difficult or impossible to partition the subtypes experimentally before sequencing, and those subtype frequencies must hence be inferred. In addition, investigators may occasionally want to artificially pool the sample of a large number of individuals for reasons of cost-efficiency, e.g., when carrying out genetic mapping using bulked segregant analysis. Here we describe PoolHap, a computational tool for inferring haplotype frequencies from pooled samples when haplotypes are known. The key insight into why PoolHap works is that the large number of SNPs that come with genome-wide coverage can compensate for the uneven coverage across the genome. The performance of PoolHap is illustrated and discussed using simulated and real data. We show that PoolHap is able to accurately estimate the proportions of haplotypes with less than 2% error for 34-strain mixtures with 2X total coverage Arabidopsis thaliana whole genome polymorphism data. This method should facilitate greater biological insight into heterogeneous samples that are difficult or impossible to isolate experimentally. The software and user's manual are freely available at http://arabidopsis.gmi.oeaw.ac.at/quan/poolhap/.
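To make the inference problem concrete, the following is a minimal sketch of how haplotype frequencies might be estimated from pooled read counts when the haplotypes are known. It is not PoolHap's algorithm, which the abstract does not describe; it swaps in a plain expectation-maximization (EM) update on a simple mixture model and assumes error-free reads. The function name `em_haplotype_frequencies`, the 0/1 allele coding, and the toy haplotypes and counts are all hypothetical.

```python
def em_haplotype_frequencies(haplotypes, alt_counts, ref_counts, n_iter=200):
    """Estimate mixture proportions of known haplotypes from pooled read counts.

    haplotypes: list of lists, haplotypes[h][j] is the allele (0 = ref, 1 = alt)
                carried by haplotype h at SNP j.
    alt_counts, ref_counts: reads supporting each allele at each SNP in the pool.
    Assumption (not from the source): reads are error-free and sampled
    independently in proportion to haplotype frequency.
    """
    k = len(haplotypes)
    freqs = [1.0 / k] * k  # start from a uniform guess
    for _ in range(n_iter):
        expected = [0.0] * k
        for j, (alt, ref) in enumerate(zip(alt_counts, ref_counts)):
            for allele, count in ((1, alt), (0, ref)):
                if count == 0:
                    continue
                # E-step: share the reads among haplotypes carrying this allele
                weights = [freqs[h] if haplotypes[h][j] == allele else 0.0
                           for h in range(k)]
                total = sum(weights)
                if total == 0.0:
                    continue  # no known haplotype explains these reads
                for h in range(k):
                    expected[h] += count * weights[h] / total
        assigned = sum(expected)
        if assigned == 0.0:
            break
        # M-step: new frequency is each haplotype's share of assigned reads
        freqs = [e / assigned for e in expected]
    return freqs


# Hypothetical pool: three haplotypes, each tagged by one private SNP,
# mixed at 50%, 30% and 20% and sequenced to 100 reads per SNP.
known_haplotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
estimated = em_haplotype_frequencies(known_haplotypes,
                                     alt_counts=[50, 30, 20],
                                     ref_counts=[50, 70, 80])
print([round(f, 3) for f in estimated])
```

In this toy setting every SNP carries information about the mixture, which loosely illustrates the abstract's point that many SNPs together can compensate for low or uneven coverage at any single site.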
The genetic model plant Arabidopsis thaliana, like many plant species, experiences a range of edaphic conditions across its natural habitat. Such heterogeneity may drive local adaptation, though the molecular genetic basis remains elusive. Here, we describe a study in which we used genome-wide association mapping, genetic complementation, and gene expression studies to identify cis-regulatory expression level polymorphisms at the AtHKT1;1 locus, encoding a known sodium (Na+) transporter, as being a major factor controlling natural variation in leaf Na+ accumulation capacity across the global A. thaliana population. A weak allele of AtHKT1;1 that drives elevated leaf Na+ in this population has been previously linked to elevated salinity tolerance. Inspection of the geographical distribution of this allele revealed its significant enrichment in populations associated with the coast and saline soils in Europe. The fixation of this weak AtHKT1;1 allele in these populations is genetic evidence supporting local adaptation to these potentially saline impacted environments.
The unusual geographical distribution of certain animal and plant species has for generations posed puzzling questions to the scientific community regarding the interrelationship of evolutionary and geographic histories. With DNA sequencing, such puzzles have now extended to the geographical distribution of genetic variation within a species. Here, we explain one such puzzle in the European population of Arabidopsis thaliana, where we find that a version of a gene encoding a sodium-transporter with reduced function is almost uniquely found in populations of this plant growing close to the coast or on known saline soils. This version of the gene has previously been linked with elevated salinity tolerance, and its unusual distribution in populations of plants growing in coastal regions and on saline soils suggests that it is playing a role in adapting these plants to the elevated salinity of their local environment.
Flowering time is a key life-history trait in the plant life cycle. Most studies to unravel the genetics of flowering time in Arabidopsis thaliana have been performed under greenhouse conditions. Here, we describe a study about the genetics of flowering time that differs from previous studies in two important ways: first, we measure flowering time in a more complex and ecologically realistic environment; and, second, we combine the advantages of genome-wide association (GWA) and traditional linkage (QTL) mapping. Our experiments involved phenotyping nearly 20,000 plants over 2 winters under field conditions, including 184 worldwide natural accessions genotyped for 216,509 SNPs and 4,366 RILs derived from 13 independent crosses chosen to maximize genetic and phenotypic diversity. Based on a photothermal time model, the flowering time variation scored in our field experiment was poorly correlated with the flowering time variation previously obtained under greenhouse conditions, reinforcing previous demonstrations of the importance of genotype by environment interactions in A. thaliana and the need to study adaptive variation under natural conditions. The use of 4,366 RILs provides great power for dissecting the genetic architecture of flowering time in A. thaliana under our specific field conditions. We describe more than 60 additive QTLs, all with relatively small to medium effects and organized in 5 major clusters. We show that QTL mapping increases our power to distinguish true from false associations in GWA mapping. QTL mapping also permits the identification of false negatives, that is, causative SNPs that are lost when applying GWA methods that control for population structure. Major genes underpinning flowering time in the greenhouse were not associated with flowering time in this study. Instead, we found a prevalence of genes involved in the regulation of the plant circadian clock. Furthermore, we identified new genomic regions lacking obvious candidate genes.
Dissecting the genetic bases of adaptive traits is of primary importance in evolutionary biology. In this study, we combined a genome-wide association (GWA) study with traditional linkage mapping in order to detect the genetic bases underlying natural variation in flowering time in ecologically realistic conditions in the plant Arabidopsis thaliana. Our study involved phenotyping nearly 20,000 plants over 2 winters under field conditions in a temperate climate. We show that combined linkage and association mapping clearly outperforms each method alone when it comes to identifying true associations. This highlights the utility of combining different methods to localize genes involved in complex trait natural variation. Most candidate genes found in this study are involved in the regulation of the plant circadian clock and, surprisingly, were not associated with flowering time scored under greenhouse conditions. While rapid advances have been made in high-throughput genotyping and sequencing, high-throughput phenotyping of complex traits under natural conditions will be the next challenge for dissecting the genetic bases of adaptive variation in “laboratory” model organisms.
The population structure of an organism reflects its evolutionary history and influences its evolutionary trajectory. It constrains the combination of genetic diversity and reveals patterns of past gene flow. Understanding it is a prerequisite for detecting genomic regions under selection, predicting the effect of population disturbances, or modeling gene flow. This paper examines the detailed global population structure of Arabidopsis thaliana. Using a set of 5,707 plants collected from around the globe and genotyped at 149 SNPs, we show that while A. thaliana as a species self-fertilizes 97% of the time, there is considerable variation among local groups. This level of outcrossing greatly limits observed heterozygosity but is sufficient to generate considerable local haplotypic diversity. We also find that in its native Eurasian range A. thaliana exhibits continuous isolation by distance at every geographic scale without natural breaks corresponding to classical notions of populations. By contrast, in North America, where it exists as an exotic species, A. thaliana exhibits little or no population structure at a continental scale but local isolation by distance that extends hundreds of km. This suggests a pattern for the development of isolation by distance that can establish itself shortly after an organism fills a new habitat range. It also raises questions about the general applicability of many standard population genetics models. Any model based on discrete clusters of interchangeable individuals will be an uneasy fit to organisms like A. thaliana which exhibit continuous isolation by distance on many scales.
Much of the modern field of population genetics is premised on particular models of what an organism's population structure is and how it behaves. The classic models generally start with the idea of a single randomly mating population that has reached an evolutionary equilibrium. Many models relax some of these assumptions, allowing for phenomena such as assortative mating, discrete sub-populations with migration, self-fertilization, and sex-ratio distortion. Virtually all models, however, have as their core premise the notion that there exist classes of exchangeable individuals each of which represents an identical, independent sample from that class' distribution. For certain organisms, such as Drosophila melanogaster, these models do an excellent job of describing how populations work. For other organisms, such as humans, these models can be reasonable approximations but require a great deal of care in assembling samples and can begin to break down as sampling becomes locally dense. For the vast majority of organisms the applicability of these models has never been investigated.
Aquilegia formosa and A. pubescens are two closely related species belonging to the columbine genus. Despite their morphological and ecological differences, previous studies have revealed a large degree of intercompatibility, as well as little sequence divergence between these two taxa. We compared the inter- and intraspecific patterns of variation for 9 nuclear loci, and found that the two species were practically indistinguishable at the level of DNA sequence polymorphism, indicating either very recent speciation or continued gene flow. As a comparison, we also analyzed variation at two loci across 30 other Aquilegia taxa; this revealed slightly more differentiation among taxa, which seemed best explained by geographic distance. By contrast, we found no evidence for isolation by distance on a more local geographic scale. We conclude that the extremely low levels of genetic differentiation between A. formosa and A. pubescens at neutral loci will facilitate future genome-wide scans for speciation genes.
The domestic dog exhibits greater diversity in body size than any other terrestrial vertebrate. We used a strategy that exploits the breed structure of dogs to investigate the genetic basis of size. First, through a genome-wide scan, we identified a major quantitative trait locus (QTL) on chromosome 15 influencing size variation within a single breed. Second, we examined genetic variation in the 15-megabase interval surrounding the QTL in small and giant breeds and found marked evidence for a selective sweep spanning a single gene (IGF1), encoding insulin-like growth factor 1. A single IGF1 single-nucleotide polymorphism haplotype is common to all small breeds and nearly absent from giant breeds, suggesting that the same causal sequence variant is a major contributor to body size in all small dogs.
Studies of nucleotide diversity have found an excess of low-frequency amino acid polymorphisms segregating in Arabidopsis thaliana, suggesting a predominance of weak purifying selection acting on amino acid polymorphism in this inbreeding species. Here, we investigate levels of diversity and divergence at synonymous and nonsynonymous sites in 6 circumpolar populations of the outbreeding Arabidopsis lyrata and compare these results with A. thaliana, to test for differences in mutation and selection parameters across genes, populations, and species. We find that A. lyrata shows an excess of low-frequency nonsynonymous polymorphisms both within populations and species wide, consistent with weak purifying selection similar to the patterns observed in A. thaliana. Furthermore, nonsynonymous polymorphisms tend to be more restricted in their population distribution in A. lyrata, consistent with purifying selection preventing their geographic spread. Highly expressed genes show a reduced ratio of amino acid to synonymous change for both polymorphism and fixed differences, suggesting a general pattern of stronger purifying selection on high-expression proteins.
McDonald–Kreitman test; site-frequency spectrum; Arabidopsis; inbreeding; nonsynonymous; synonymous
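As a brief illustration of the McDonald–Kreitman comparison named in the keywords, the sketch below contrasts the ratio of nonsynonymous to synonymous changes within species (polymorphism) with the same ratio between species (fixed differences). The counts are hypothetical and are not taken from the study; the function name `mk_summary` is likewise an assumption. The neutrality index and the quantity alpha = 1 − NI are standard summaries of this 2×2 table.

```python
def mk_summary(dn, ds, pn, ps):
    """Summarize a McDonald-Kreitman 2x2 table.

    dn, ds: fixed nonsynonymous / synonymous differences between species.
    pn, ps: nonsynonymous / synonymous polymorphisms within a species.
    Returns (neutrality_index, alpha), where alpha = 1 - neutrality_index.
    """
    neutrality_index = (pn / ps) / (dn / ds)
    alpha = 1.0 - neutrality_index
    return neutrality_index, alpha


# Hypothetical counts, chosen only to show an excess of nonsynonymous
# polymorphism relative to divergence.
ni, alpha = mk_summary(dn=20, ds=40, pn=30, ps=40)
print(ni, alpha)
```

A neutrality index above 1 means amino acid variants are relatively more common as polymorphisms than as fixed differences, the kind of pattern that is usually read as weak purifying selection.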
Previously, a candidate gene linkage approach on brother pairs affected with prostate cancer identified a locus of prostate cancer susceptibility at D3S1234 within the fragile histidine triad gene (FHIT), a tumor suppressor that induces apoptosis. Subsequent association tests on 16 SNPs spanning approximately 381 kb surrounding D3S1234 in Americans of European descent revealed significant evidence of association for a single SNP within intron 5 of FHIT. In the current study, re-sequencing and genotyping within a 28.5 kb region surrounding this SNP further delineated the association with prostate cancer risk to a 15 kb region. Multiple SNPs in sequences under evolutionary constraint within intron 5 of FHIT defined several related haplotypes with an increased risk of prostate cancer in European-Americans. Strong associations were detected for a risk haplotype defined by SNPs 138543, 142413, and 152494 in all cases (Pearson's χ2 = 12.34, df 1, P = 0.00045) and for the homozygous risk haplotype defined by SNPs 144716, 142413, and 148444 in cases that shared 2 alleles identical by descent with their affected brothers (Pearson's χ2 = 11.50, df 1, P = 0.00070). In addition to highly conserved sequences encompassing SNPs 148444 and 152413, population studies revealed strong signatures of natural selection for a 1 kb window covering the SNP 144716 in two human populations, the European American (π = 0.0072, Tajima's D = 3.31, 14 SNPs) and the Japanese (π = 0.0049, Fay & Wu's H = 8.05, 14 SNPs), as well as in chimpanzees (Fay & Wu's H = 8.62, 12 SNPs). These results strongly support the involvement of the FHIT intronic region in an increased risk of prostate cancer.
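The two reported P values can be checked against their Pearson χ2 statistics. The sketch below uses the fact that a χ2 variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(√(χ2/2)). The function name `chi2_sf_df1` is an assumption introduced here; only the test statistics come from the abstract.

```python
import math


def chi2_sf_df1(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # With df = 1, P(X > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Statistics reported in the abstract for the two risk haplotypes.
p_all_cases = chi2_sf_df1(12.34)
p_ibd2_cases = chi2_sf_df1(11.50)
print(f"{p_all_cases:.5f} {p_ibd2_cases:.5f}")
```

Both values should land close to the reported P = 0.00045 and P = 0.00070; small differences would reflect rounding of the published statistics.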
A central question in genomic imprinting is how a specific sequence is recognized as the target for epigenetic marking. In both mammals and plants, imprinted genes are often associated with tandem repeats and transposon-related sequences, but the role of these elements in epigenetic gene silencing remains elusive. FWA is an imprinted gene in Arabidopsis thaliana expressed specifically in the female gametophyte and endosperm. Tissue-specific and imprinted expression of FWA depends on DNA methylation in the FWA promoter, which is comprised of two direct repeats containing a sequence related to a SINE retroelement. Methylation of this element causes epigenetic silencing, but it is not known whether the methylation is targeted to the SINE-related sequence itself or the direct repeat structure is also necessary. Here we show that the repeat structure in the FWA promoter is highly diverse in species within the genus Arabidopsis. Four independent tandem repeat formation events were found in three closely related species. Another related species, A. halleri, did not have a tandem repeat in the FWA promoter. Unexpectedly, even in this species, FWA expression was imprinted and the FWA promoter was methylated. In addition, our expression analysis of the FWA gene in vegetative tissues revealed a high frequency of intra-specific variation in the expression level. In conclusion, we show that the tandem repeat structure is dispensable for the epigenetic silencing of the FWA gene. Rather, the SINE-related sequence is sufficient for imprinting, vegetative silencing, and targeting of DNA methylation. Frequent independent tandem repeat formation events in the FWA promoter led us to propose that they may be a consequence, rather than cause, of the epigenetic control. The possible significance of epigenetic variation in reproductive strategies during evolution is also discussed.
Genomic imprinting, mono-allelic gene expression depending on the parent-of-origin, is an epigenetic process known in mammals and flowering plants. A central question in genomic imprinting is how a specific sequence is recognized as the target for epigenetic marking. In both mammals and plants, imprinted genes are often associated with tandem repeats and transposon-related sequences, but the role of these elements in epigenetic gene silencing remains elusive. FWA is an imprinted gene in Arabidopsis thaliana expressed specifically in the female gametophyte and endosperm. The FWA promoter is comprised of two direct repeats containing a sequence related to a SINE retroelement. Methylation of this element causes epigenetic silencing, but it is not known whether the methylation is targeted to the SINE-related sequence itself or the direct repeat structure is necessary. Here we show that the direct repeat structure is highly diverse in species within the genus Arabidopsis. Unexpectedly, we found that the direct repeat structure is dispensable for the epigenetic silencing and methylation of the FWA promoter. Rather, the SINE-related promoter sequence is sufficient for these features. Frequent independent formation of the tandem repeats suggests that they may be a consequence of the epigenetically controlled system.
We apply an analysis based upon mixed models to the Genetic Analysis Workshop 15, Problem 3 simulated data. Such models are commonly used to mitigate the tendency for population structure, or cryptic relatedness, to inflate the false-positive rate of test statistics. They also allow for explicit modeling of varying degrees of relatedness in samples in which some individuals are related by (possibly unknown) pedigree, whereas others are not. Furthermore, the implementation of the method we describe here is quick enough to be used effectively on genome-wide data. We present an analysis of the data for Genetic Analysis Workshop 15, Problem 3, in which we show that these methods can effectively find signals in these data. Somewhat disappointingly, the false-positive rate does not appear to be reduced, but this is largely because the method used to simulate the data appears not to have encompassed effects, such as population stratification, that might have led to inflation of p-values.
A potentially serious disadvantage of association mapping is the fact that marker-trait associations may arise from confounding population structure as well as from linkage to causative polymorphisms. Using genome-wide marker data, we have previously demonstrated that the problem can be severe in a global sample of 95 Arabidopsis thaliana accessions, and that established methods for controlling for population structure are generally insufficient. Here, we use the same sample together with a number of flowering-related phenotypes and data-perturbation simulations to evaluate a wider range of methods for controlling for population structure. We find that, in terms of reducing the false-positive rate while maintaining statistical power, a recently introduced mixed-model approach that takes genome-wide differences in relatedness into account via estimated pairwise kinship coefficients generally performs best. By combining the association results with results from linkage mapping in F2 crosses, we identify one previously known true positive and several promising new associations, but also demonstrate the existence of both false positives and false negatives. Our results illustrate the potential of genome-wide association scans as a tool for dissecting the genetics of natural variation, while at the same time highlighting the pitfalls. The importance of study design is clear; our study is severely under-powered both in terms of sample size and marker density. Our results also provide a striking demonstration of confounding by population structure. While statistical methods can be used to ameliorate this problem, they cannot always be effective and are certainly not a substitute for independent evidence, such as that obtained via crosses or transgenic experiments. Ultimately, association mapping is a powerful tool for identifying a list of candidates that is short enough to permit further genetic study.
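To make the idea of pairwise kinship coefficients concrete, the following is a minimal sketch that computes a kinship matrix as the fraction of markers at which two accessions carry the same allele (identity by state). The abstract does not say which kinship estimator was used, so the estimator, the function name `ibs_kinship`, and the 0/1 allele coding for inbred lines are all assumptions made for illustration.

```python
def ibs_kinship(genotypes):
    """Return an n x n matrix of allele-sharing proportions.

    genotypes: list of n accessions, each a list of m alleles coded 0/1
    (hypothetical coding; assumes fully inbred lines and no missing data).
    """
    if not genotypes or not genotypes[0]:
        raise ValueError("need at least one accession and one marker")
    n = len(genotypes)
    m = len(genotypes[0])
    if any(len(g) != m for g in genotypes):
        raise ValueError("all accessions must have the same number of markers")

    kinship = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Count markers where both accessions carry the same allele.
            shared = sum(1 for a, b in zip(genotypes[i], genotypes[j]) if a == b)
            kinship[i][j] = kinship[j][i] = shared / m
    return kinship


if __name__ == "__main__":
    toy = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]]
    for row in ibs_kinship(toy):
        print(row)
```

In a mixed model, a matrix of this kind would typically define the covariance of a random genetic effect, so that phenotypic similarity explained by genome-wide relatedness is not attributed to the marker being tested.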
There is currently tremendous interest in using association mapping to find the genes responsible for natural variation, particularly for human disease. In association mapping, researchers seek to identify regions of the genome where individuals who are phenotypically similar (e.g., they all have the same disease) are also unusually closely related. A potentially serious problem is that spurious correlations may arise if the population is structured so that members of a subgroup tend to be much more closely related. We have previously demonstrated that this problem can be severe in Arabidopsis thaliana, and that established statistical methods for controlling for population structure are insufficient. Here, we evaluate a broader range of methods. We find that a recently introduced mixed-model approach generally performs best. By combining the association results with results from linkage mapping in F2 crosses, we identify one previously known true positive and several promising new associations, but also demonstrate the existence of both false positives and false negatives. Our results illustrate the potential of genome-wide association scans as a tool for dissecting the genetics of natural variation, while at the same time highlighting the pitfalls.
The transition to flowering is an important event in the plant life cycle and is modulated by several environmental factors including photoperiod, light quality, vernalization, and growth temperature, as well as biotic and abiotic stresses. In contrast to light and vernalization, little is known about the pathways that mediate the responses to other environmental variables. A mild increase in growth temperature, from 23 °C to 27 °C, is equally efficient in inducing flowering of Arabidopsis plants grown in 8-h short days as is transfer to 16-h long days. There is extensive natural variation in this response, and we identify strains with contrasting thermal reaction norms. Exploiting this natural variation, we show that FLOWERING LOCUS C potently suppresses thermal induction, and that the closely related floral repressor FLOWERING LOCUS M is a major-effect quantitative trait locus modulating thermosensitivity. Thermal induction does not require the photoperiod effector CONSTANS, acts upstream of the floral integrator FLOWERING LOCUS T, and depends on the hormone gibberellin. Analysis of mutants defective in salicylic acid biosynthesis suggests that thermal induction is independent of previously identified stress-signaling pathways. Microarray analyses confirm that the genomic responses to floral induction by photoperiod and temperature differ. Furthermore, we report that gene products that participate in RNA splicing are specifically affected by thermal induction. Above a critical threshold, even small changes in temperature can act as cues for the induction of flowering. This response has a genetic basis that is distinct from the known genetic pathways of floral transition, and appears to correlate with changes in RNA processing.
When to flower is an important decision in the life cycle of a plant, as it determines the plant's reproductive success. Not surprisingly, plants closely monitor the state of their life cycle along with the external environment in order to determine the onset of flowering. Several factors including light, temperature, and abiotic stress are known to affect the timing of flowering. The authors show that growth temperatures above a finely tuned threshold can rapidly trigger flowering, bypassing the need for other inductive stimuli such as day length. Exploiting a combination of Mendelian genetics, natural variation, and genomics, they show thermal induction of flowering to have a unique genetic basis. Genomic responses to temperature and light during floral induction differ, and temperature-specific changes include alterations in RNA processing.
The detection of footprints of natural selection in genetic polymorphism data is fundamental to understanding the genetic basis of adaptation, and has important implications for human health. The standard approach has been to reject neutrality in favor of selection if the pattern of variation at a candidate locus was significantly different from the predictions of the standard neutral model. The problem is that the standard neutral model assumes more than just neutrality, and it is almost always possible to explain the data using an alternative neutral model with more complex demography. Today's wealth of genomic polymorphism data, however, makes it possible to dispense with models altogether by simply comparing the pattern observed at a candidate locus to the genomic pattern, and rejecting neutrality if the pattern is extreme. Here, we utilize this approach on a truly genomic scale, comparing a candidate locus to thousands of alleles throughout the Arabidopsis thaliana genome. We demonstrate that selection has acted to increase the frequency of early-flowering alleles at the vernalization requirement locus FRIGIDA. Selection seems to have occurred during the last several thousand years, possibly in response to the spread of agriculture. We introduce a novel test statistic based on haplotype sharing that embraces the problem of population structure, and so should be widely applicable.
A nonparametric approach to detecting selection, based on haplotype sharing throughout the genome, provides strong evidence for selection in Arabidopsis at the FRIGIDA locus responsible for early flowering.
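The model-free comparison described above, in which a candidate locus is ranked against the genome-wide pattern, can be sketched as an empirical p-value. This is only an illustration of the general idea: the function name `empirical_p_value`, the one-sided "larger is more extreme" convention, and the add-one correction are assumptions, and the haplotype-sharing statistic itself is not reproduced here.

```python
def empirical_p_value(candidate_stat, genome_stats):
    """Proportion of genome-wide statistics at least as extreme as the candidate.

    candidate_stat: statistic observed at the candidate locus.
    genome_stats: the same statistic computed at many other loci or alleles.
    Larger values are treated as more extreme (an assumption of this sketch).
    """
    if not genome_stats:
        raise ValueError("genome_stats must not be empty")
    n_extreme = sum(1 for s in genome_stats if s >= candidate_stat)
    # The add-one correction counts the candidate itself, so the result
    # is never exactly zero.
    return (n_extreme + 1) / (len(genome_stats) + 1)


if __name__ == "__main__":
    background = [0.8, 1.1, 0.3, 2.0, 1.5, 0.9, 0.4, 1.2, 0.7]
    print(empirical_p_value(2.5, background))
```

Because the reference distribution comes from the genome itself, demographic history and population structure affect the candidate and the background alike, which is the motivation for this kind of comparison.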
There is currently tremendous interest in the possibility of using genome-wide association mapping to identify genes responsible for natural variation, particularly for human disease susceptibility. The model plant Arabidopsis thaliana is in many ways an ideal candidate for such studies, because it is a highly selfing hermaphrodite. As a result, the species largely exists as a collection of naturally occurring inbred lines, or accessions, which can be genotyped once and phenotyped repeatedly. Furthermore, linkage disequilibrium in such a species will be much more extensive than in a comparable outcrossing species. We tested the feasibility of genome-wide association mapping in A. thaliana by searching for associations with flowering time and pathogen resistance in a sample of 95 accessions for which genome-wide polymorphism data were available. In spite of an extremely high rate of false positives due to population structure, we were able to identify known major genes for all phenotypes tested, thus demonstrating the potential of genome-wide association mapping in A. thaliana and other species with similar patterns of variation. The rate of false positives differed strongly between traits, with more clinal traits showing the highest rate. However, the false positive rates were always substantial regardless of the trait, highlighting the necessity of an appropriate genomic control in association studies.
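One common way to quantify and correct inflated association statistics is the inflation-factor form of genomic control, sketched below. Whether the abstract means "genomic control" in exactly this sense is an assumption. The function names are hypothetical, and the sketch assumes each marker yields a test statistic that follows a chi-square distribution with one degree of freedom under the null.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Ratio of the observed median statistic to its expected null median."""
    if not chi2_stats:
        raise ValueError("chi2_stats must not be empty")
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide every statistic by the inflation factor (never by less than 1)."""
    lam = max(1.0, inflation_factor(chi2_stats))
    return [s / lam for s in chi2_stats]


if __name__ == "__main__":
    observed = [0.2, 0.9, 1.4, 3.1, 7.5]
    print(inflation_factor(observed))
    print(genomic_control(observed))
```

An inflation factor well above 1 indicates that the statistics are inflated genome-wide, as expected under strong population structure. A single scaling factor cannot capture inflation that differs between markers, which is one reason uniform corrections of this kind can be insufficient.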
There is currently tremendous interest in using association mapping to find the genes responsible for natural variation, particularly for human disease. In association mapping, researchers seek to identify regions of the genome where individuals that are phenotypically similar (for example, they all have the same disease) are also unusually closely related. A potentially serious problem is that spurious correlations may arise if the population is structured so that members of a subgroup tend to be much more closely related. Because few genome-wide association studies have been carried out, it is not yet known how important this problem will be in practice.
In one of the first genome-wide association studies to date, this paper considers the model plant Arabidopsis thaliana. A very large number of spurious genotype–phenotype correlations are found, especially for traits that vary geographically. For example, plants from northern latitudes flower later; however, in addition to sharing genetic variants that make them flower late, they also tend to share variants across the genome, making it difficult to determine which genes are responsible for flowering. This notwithstanding, several previously known genes were successfully identified in this study, and the researchers are optimistic about the prospects for association mapping in this species.