Despite advances in sequencing, the goal of obtaining a comprehensive view of genetic variation in populations is still far from reached. We sequenced 180 lines of A. thaliana from Sweden to obtain as complete a picture as possible of variation in a single region. Whereas simple polymorphisms in the unique portion of the genome are readily identified, other polymorphisms are not. The massive variation in genome size identified by flow cytometry seems largely to be due to 45S rDNA copy number variation, with lines from northern Sweden having particularly large numbers of copies. Strong selection is evident in the form of long-range linkage disequilibrium (LD), as well as in LD between nearby compensatory mutations. Many footprints of selective sweeps were found in lines from northern Sweden, and a massive global sweep was shown to have involved a 700-kb transposition.
Life-history traits controlling the duration and timing of developmental phases in the life cycle jointly determine fitness. Therefore, life-history traits studied in isolation provide an incomplete view of the relevance of life-cycle variation for adaptation. In this study, we examine genetic variation in traits covering the major life history events of the annual species Arabidopsis thaliana: seed dormancy, vegetative growth rate and flowering time. In a sample of 112 genotypes collected throughout the European range of the species, both seed dormancy and flowering time follow a latitudinal gradient independent of the major population structure gradient. This finding confirms previous studies reporting the adaptive evolution of these two traits. Here, however, we further analyze patterns of co-variation among traits. We observe that co-variation between primary dormancy, vegetative growth rate and flowering time also follows a latitudinal cline. At higher latitudes, vegetative growth rate is positively correlated with primary dormancy and negatively with flowering time. In the South, this trend disappears. Patterns of trait co-variation change, presumably because major environmental gradients shift with latitude. This pattern appears unrelated to population structure, suggesting that changes in the coordinated evolution of major life history traits are adaptive. Our data suggest that A. thaliana provides a good model for the evolution of trade-offs and their genetic basis.
Variation in human skin and eye color is substantial and especially apparent in admixed populations, yet the underlying genetic architecture is poorly understood because most genome-wide studies are based on individuals of European ancestry. We study pigmentary variation in 699 individuals from Cape Verde, where extensive West African/European admixture has given rise to a broad range in trait values and genomic ancestry proportions. We develop and apply a new approach for measuring eye color, and identify two major loci (HERC2[OCA2] P = 2.3×10⁻⁶², SLC24A5 P = 9.6×10⁻⁹) that account for both blue versus brown eye color and varying intensities of brown eye color. We identify four major loci (SLC24A5 P = 5.4×10⁻²⁷, TYR P = 1.1×10⁻⁹, APBA2[OCA2] P = 1.5×10⁻⁸, SLC45A2 P = 6×10⁻⁹) for skin color that together account for 35% of the total variance, but the genetic component with the largest effect (∼44%) is average genomic ancestry. Our results suggest that adjacent cis-acting regulatory loci for OCA2 explain the relationship between skin and eye color, and point to an underlying genetic architecture in which several genes of moderate effect act together with many genes of small effect to explain ∼70% of the estimated heritability.
Differences in skin and eye color are some of the most obvious traits that underlie human diversity, yet most of our knowledge regarding the genetic basis for these traits is based on the limited range of variation represented by individuals of European ancestry. We have studied a unique population in Cape Verde, an archipelago located off the West African coast, in which extensive mixing between individuals of Portuguese and West African ancestry has given rise to a broad range of phenotypes and ancestral genome proportions. Our results help to explain how genes work together to control the full range of pigmentary phenotypic diversity, provide new insight into the evolution of these traits, and provide a model for understanding other types of quantitative variation in admixed populations.
Genome-wide association studies (GWAS) are a standard approach for studying the genetics of natural variation. A major concern in GWAS is the need to account for the complicated dependence structure of the data, both between loci and between individuals. Mixed models have emerged as a general and flexible approach for correcting for population structure in GWAS. Here we extend this linear mixed model approach to carry out GWAS of correlated phenotypes, deriving a fully parameterized multi-trait mixed model (MTMM) that considers both the within-trait and between-trait variance components simultaneously for multiple traits. We apply this to human cohort data for correlated blood lipid traits from the Northern Finland Birth Cohort 1966, and demonstrate greatly increased power to detect pleiotropic loci that affect more than one blood lipid trait. We also apply this to an Arabidopsis dataset for flowering measurements in two different locations, identifying loci whose effect depends on the environment.
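The idea of modelling within-trait and between-trait variance components jointly can be written down compactly. The block below is a generic two-trait mixed model, offered as a sketch rather than as the exact parameterization used in the paper. The symbols are assumptions introduced here for illustration: both traits are taken to be measured on the same n individuals, K is a kinship matrix, x is the genotype vector at the tested marker, β is a marker effect shared by both traits, and α is a trait-specific (or environment-specific) deviation.

```latex
% Generic two-trait mixed model (illustrative sketch; symbols are assumptions)
\begin{aligned}
\mathbf{y}
  &= \begin{pmatrix}\mathbf{y}_1\\ \mathbf{y}_2\end{pmatrix}
   = \begin{pmatrix}\mathbf{1}\mu_1\\ \mathbf{1}\mu_2\end{pmatrix}
   + \begin{pmatrix}\mathbf{x}\\ \mathbf{x}\end{pmatrix}\beta
   + \begin{pmatrix}\mathbf{0}\\ \mathbf{x}\end{pmatrix}\alpha
   + \mathbf{g} + \mathbf{e}, \\[4pt]
\operatorname{Var}(\mathbf{g}) &= \mathbf{G}\otimes\mathbf{K},
\qquad
\mathbf{G} = \begin{pmatrix}
  \sigma_{g1}^{2} & \rho_g\,\sigma_{g1}\sigma_{g2}\\
  \rho_g\,\sigma_{g1}\sigma_{g2} & \sigma_{g2}^{2}
\end{pmatrix}, \\[4pt]
\operatorname{Var}(\mathbf{e}) &= \mathbf{E}\otimes\mathbf{I}_n,
\qquad
\mathbf{E} = \begin{pmatrix}
  \sigma_{e1}^{2} & \rho_e\,\sigma_{e1}\sigma_{e2}\\
  \rho_e\,\sigma_{e1}\sigma_{e2} & \sigma_{e2}^{2}
\end{pmatrix}.
\end{aligned}
```

Under this formulation, testing β = 0 asks whether the marker has an effect common to both traits, while testing α = 0 asks whether its effect differs between traits or environments.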
Population structure causes genome-wide linkage disequilibrium between unlinked loci, leading to statistical confounding in genome-wide association studies. Mixed models have been shown to handle the confounding effects of a diffuse background of large numbers of loci of small effect well, but do not always account for loci of larger effect. Here we propose a multi-locus mixed model as a general method for mapping complex traits in structured populations. Simulations suggest that our method outperforms existing methods, in terms of power as well as false discovery rate. We apply our method to human and Arabidopsis thaliana data, identifying novel associations in known candidates as well as evidence for allelic heterogeneity. We also demonstrate how a priori knowledge from an A. thaliana linkage mapping study can be integrated into our method using a Bayesian approach. Our implementation is computationally efficient, making the analysis of large datasets (n > 10000) practicable.
Understanding the mechanism of cadmium (Cd) accumulation in plants is important to help reduce its potential toxicity to both plants and humans through dietary and environmental exposure. Here, we report on a study to uncover the genetic basis underlying natural variation in Cd accumulation in a world-wide collection of 349 wild-collected Arabidopsis thaliana accessions. We identified a 4-fold variation (0.5–2 µg Cd g⁻¹ dry weight) in leaf Cd accumulation when these accessions were grown in a controlled common garden. By combining genome-wide association mapping, linkage mapping in an experimental F2 population, and transgenic complementation, we reveal that HMA3 is the sole major locus responsible for the variation in leaf Cd accumulation we observe in this diverse population of A. thaliana accessions. Analysis of the predicted amino acid sequence of HMA3 from 149 A. thaliana accessions reveals the existence of 10 major natural protein haplotypes. Association of these haplotypes with leaf Cd accumulation and genetic complementation experiments indicate that 5 of these haplotypes are active and 5 are inactive, and that elevated leaf Cd accumulation is associated with the reduced function of HMA3 caused by a nonsense mutation and polymorphisms that change two specific amino acids.
Cadmium (Cd) is a potentially toxic metal pollutant that threatens food quality and human health in many regions of the world. Plants have evolved mechanisms for the acquisition of essential metals such as zinc and iron from the soil. Though often quite specific, such mechanisms can also lead to the accumulation of Cd by plants. Understanding natural variation in the processes that contribute to Cd accumulation in food crops could help minimize the human health risk posed. We have discovered that DNA sequence changes at a single gene, which encodes the Heavy Metal ATPase 3 (HMA3), drive the variation in Cd accumulation we observe in a world-wide sample of Arabidopsis thaliana. We identified 10 major HMA3 protein variants, of which five contribute to reduced Cd accumulation in leaves of A. thaliana.
Arabidopsis thaliana is native to Eurasia and naturalized across the world due to human disturbance. Its easy propagation and immense phenotypic variability make it an ideal model system for functional, ecological and evolutionary genetics. To date, analyses of its natural variation have involved small numbers of individuals or genetic markers. Here we genotype 1,307 world-wide accessions, including several regional samples, at 250K SNPs, enabling us to describe the global pattern of genetic variation with high resolution. Three complementary tests applied to these data reveal novel targets of selection. Furthermore, we characterize the pattern of historical recombination and observe an enrichment of hotspots in intergenic regions and repetitive DNA, consistent with the pattern observed for humans but strikingly different from other plant species. We are making seeds for this Regional Mapping (RegMap) panel publicly available; they comprise the largest genomic mapping resource available for a naturally occurring, non-human, species.
We present the 207 Mb genome sequence of the outcrosser Arabidopsis lyrata, which diverged from the self-fertilizing species A. thaliana about 10 million years ago. It is generally assumed that the much smaller A. thaliana genome, which is only 125 Mb, constitutes the derived state for the family. Apparent genome reduction in this genus can be partially attributed to the loss of DNA from large-scale rearrangements, but the main cause lies in the hundreds of thousands of small deletions found throughout the genome. These occurred primarily in non-coding DNA and transposons, but protein-coding multi-gene families are smaller in A. thaliana as well. Analysis of deletions and insertions still segregating in A. thaliana indicates that the process of DNA loss is ongoing, suggesting pervasive selection for a smaller genome.
Studies of the model plant Arabidopsis thaliana may seem to have little impact on advances in medical research, yet a survey of the scientific literature shows that this is a misconception. Many discoveries with direct relevance to human health and disease have been elaborated using Arabidopsis, and several processes important to human biology are more easily studied in this versatile model plant.
Genomic imprinting is an epigenetic phenomenon leading to parent-of-origin specific differential expression of maternally and paternally inherited alleles. In plants, genomic imprinting has mainly been observed in the endosperm, an ephemeral triploid tissue derived after fertilization of the diploid central cell with a haploid sperm cell. In an effort to identify novel imprinted genes in Arabidopsis thaliana, we generated deep sequencing RNA profiles of F1 hybrid seeds derived after reciprocal crosses of Arabidopsis Col-0 and Bur-0 accessions. Using polymorphic sites to quantify allele-specific expression levels, we could identify more than 60 genes with potential parent-of-origin specific expression. By analyzing the distribution of DNA methylation and epigenetic marks established by Polycomb group (PcG) proteins using publicly available datasets, we suggest that for maternally expressed genes (MEGs) repression of the paternally inherited alleles largely depends on DNA methylation or PcG-mediated repression, whereas repression of the maternal alleles of paternally expressed genes (PEGs) predominantly depends on PcG proteins. While maternal alleles of MEGs are also targeted by PcG proteins, such targeting does not cause complete repression. Candidate MEGs and PEGs are enriched for cis-proximal transposons, suggesting that transposons might be a driving force for the evolution of imprinted genes in Arabidopsis. In addition, we find that MEGs and PEGs are significantly faster evolving when compared to other genes in the genome. In contrast to the predominant location of mammalian imprinted genes in clusters, cluster formation was only detected for few MEGs and PEGs, suggesting that clustering is not a major requirement for imprinted gene regulation in Arabidopsis.
Genomic imprinting violates the Mendelian rules of inheritance, which state functional equality of maternally and paternally inherited alleles. Imprinted genes are expressed depending on their parent of origin, implying an epigenetic asymmetry of maternal and paternal alleles. Genomic imprinting occurs in mammals and flowering plants. In both groups of organisms, nourishing of the progeny depends on ephemeral tissues, the placenta and the endosperm, respectively. In plants, genomic imprinting predominantly occurs in the endosperm, which is derived after fertilization of the diploid central cell with a haploid sperm cell. In this study we identify more than 60 potentially imprinted genes and show that there are different epigenetic mechanisms causing maternal and paternal-specific gene expression. We show that maternally expressed genes are regulated by DNA methylation or Polycomb group (PcG)-mediated repression, while paternally expressed genes are predominantly regulated by PcG proteins. From an evolutionary perspective, we also show that imprinted genes are associated with transposons and are more rapidly evolving than other genes in the genome. Many MEGs and PEGs encode transcriptional regulators, implying important functional roles of imprinted genes for endosperm and seed development.
We have explored the genetic basis of variation in vernalization requirement and response in Arabidopsis accessions, selected on the basis of their phenotypic distinctiveness. Phenotyping of F2 populations in different environments, plus fine mapping, indicated possible causative genes. Our data support the identification of FRI and FLC as candidates for the major-effect QTL underlying variation in vernalization response, and identify a weak FLC allele, caused by a Mutator-like transposon, contributing to flowering time variation in two N. American accessions. They also reveal a number of additional QTL that contribute to flowering time variation after saturating vernalization. One of these was the result of expression variation at the FT locus. Overall, our data suggest that distinct phenotypic variation in the vernalization and flowering response of Arabidopsis accessions is accounted for by variation that has arisen independently at relatively few major-effect loci.
Plants can defend themselves against a wide array of enemies, yet one of the most striking observations is the variability in the effectiveness of such defences, both within and between species. Some of this variation can be explained by conflicting pressures from pathogens with different modes of attack [1]. A second explanation comes from an evolutionary tug of war, in which pathogens adapt to evade detection, until the plant has evolved new recognition capabilities for pathogen invasion [2-5]. If selection is, however, sufficiently strong, susceptible hosts should remain rare. That this is not the case is best justified by costs incurred from constitutive defences in a pest free environment [6-11]. Using a combination of forward genetics and genome-wide association analyses, we demonstrate that allelic diversity at a single locus, ACCELERATED CELL DEATH 6 (ACD6) [12,13], underpins dramatic pleiotropic differences in both vegetative growth and resistance to microbial infection and herbivory among natural Arabidopsis thaliana strains. A hyperactive ACD6 allele, compared to the reference allele, strongly enhances resistance to a broad range of pathogens from different phyla, but at the same time slows the production of new leaves and greatly reduces the biomass of mature leaves. This allele segregates at intermediate frequency both throughout the worldwide range of A. thaliana and within local populations, consistent with this allele providing substantial fitness benefits despite its drastic impact on growth.
Although pioneered by human geneticists as a potential solution to the challenging problem of finding the genetic basis of common human diseases [1,2], advances in genotyping and sequencing technology have made genome-wide association (GWA) studies an obvious general approach for studying the genetics of natural variation and traits of agricultural importance. They are particularly useful when inbred lines are available because once these lines have been genotyped, they can be phenotyped multiple times, making it possible (as well as extremely cost-effective) to study many different traits in many different environments, while replicating the phenotypic measurements to reduce environmental noise. Here we demonstrate the power of this approach by carrying out a GWA study of 107 phenotypes in Arabidopsis thaliana, a widely distributed, predominantly selfing model plant, known to harbor considerable genetic variation for many adaptively important traits [3]. Our results are dramatically different from those of human GWA studies in that we identify many common alleles with major effect, but they are also, in many cases, harder to interpret because confounding by complex genetics and population structure make it difficult to distinguish true from false associations. However, a priori candidates are significantly overrepresented among these associations as well, making many of them excellent candidates for follow-up experiments by the Arabidopsis community. Our study clearly demonstrates the feasibility of GWA studies in A. thaliana, and suggests that the approach will be appropriate for many other organisms.
With the advance of next-generation sequencing (NGS) technologies, increasingly ambitious applications are becoming feasible. A particularly powerful one is the sequencing of polymorphic, pooled samples. The pool can be naturally occurring, as in the case of multiple pathogen strains in a blood sample, multiple types of cells in a cancerous tissue sample, or multiple isoforms of mRNA in a cell. In these cases, it is difficult or impossible to partition the subtypes experimentally before sequencing, and those subtype frequencies must hence be inferred. In addition, investigators may occasionally want to artificially pool the samples of a large number of individuals for reasons of cost-efficiency, e.g., when carrying out genetic mapping using bulked segregant analysis. Here we describe PoolHap, a computational tool for inferring haplotype frequencies from pooled samples when haplotypes are known. The key insight into why PoolHap works is that the large number of SNPs that come with genome-wide coverage can compensate for the uneven coverage across the genome. The performance of PoolHap is illustrated and discussed using simulated and real data. We show that PoolHap is able to accurately estimate the proportions of haplotypes with less than 2% error for 34-strain mixtures with 2X total coverage Arabidopsis thaliana whole genome polymorphism data. This method should facilitate greater biological insight into heterogeneous samples that are difficult or impossible to isolate experimentally. Software and user's manual are freely available at http://arabidopsis.gmi.oeaw.ac.at/quan/poolhap/.
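The underlying inference problem can be illustrated with a toy example. When the haplotypes are known, the expected pooled allele frequency at each SNP is a frequency-weighted sum of the haplotype alleles, so the haplotype frequencies can be recovered by regression across many SNPs. The sketch below uses plain least squares on a tiny hypothetical panel; it is not PoolHap's actual algorithm, and all function names and data are assumptions made up for illustration.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def estimate_haplotype_frequencies(haplotypes, pooled_freqs):
    """Least-squares estimate of haplotype frequencies in a pool.

    haplotypes[k][j] is the allele (0 or 1) of known haplotype k at SNP j;
    pooled_freqs[j] is the observed frequency of allele 1 at SNP j in the pool.
    Negative estimates are clipped to zero and the result is renormalized.
    """
    k = len(haplotypes)
    n_snps = len(pooled_freqs)
    # Normal equations (H H^T) f = H p, with H the k x n_snps haplotype matrix.
    gram = [[sum(haplotypes[i][j] * haplotypes[l][j] for j in range(n_snps))
             for l in range(k)] for i in range(k)]
    rhs = [sum(haplotypes[i][j] * pooled_freqs[j] for j in range(n_snps))
           for i in range(k)]
    raw = solve_linear(gram, rhs)
    clipped = [max(0.0, v) for v in raw]
    total = sum(clipped)
    return [v / total for v in clipped]


# Hypothetical panel: three known haplotypes typed at six SNPs.
panel = [
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
true_freqs = [0.5, 0.3, 0.2]
# Noise-free pooled allele frequencies implied by the true mixture.
pooled = [sum(true_freqs[k] * panel[k][j] for k in range(3)) for j in range(6)]
estimated = estimate_haplotype_frequencies(panel, pooled)
```

With real sequencing data the pooled frequencies are noisy read-count ratios, which is where having many SNPs helps: each SNP adds another equation constraining the same small set of unknown frequencies.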
The genetic model plant Arabidopsis thaliana, like many plant species, experiences a range of edaphic conditions across its natural habitat. Such heterogeneity may drive local adaptation, though the molecular genetic basis remains elusive. Here, we describe a study in which we used genome-wide association mapping, genetic complementation, and gene expression studies to identify cis-regulatory expression level polymorphisms at the AtHKT1;1 locus, encoding a known sodium (Na+) transporter, as being a major factor controlling natural variation in leaf Na+ accumulation capacity across the global A. thaliana population. A weak allele of AtHKT1;1 that drives elevated leaf Na+ in this population has been previously linked to elevated salinity tolerance. Inspection of the geographical distribution of this allele revealed its significant enrichment in populations associated with the coast and saline soils in Europe. The fixation of this weak AtHKT1;1 allele in these populations is genetic evidence supporting local adaptation to these potentially saline impacted environments.
The unusual geographical distribution of certain animal and plant species has provided puzzling questions to the scientific community regarding the interrelationship of evolutionary and geographic histories for generations. With DNA sequencing, such puzzles have now extended to the geographical distribution of genetic variation within a species. Here, we explain one such puzzle in the European population of Arabidopsis thaliana, where we find that a version of a gene encoding a sodium transporter with reduced function is almost uniquely found in populations of this plant growing close to the coast or on known saline soils. This version of the gene has previously been linked with elevated salinity tolerance, and its unusual distribution in populations of plants growing in coastal regions and on saline soils suggests that it is playing a role in adapting these plants to the elevated salinity of their local environment.
Flowering time is a key life-history trait in the plant life cycle. Most studies to unravel the genetics of flowering time in Arabidopsis thaliana have been performed under greenhouse conditions. Here, we describe a study about the genetics of flowering time that differs from previous studies in two important ways: first, we measure flowering time in a more complex and ecologically realistic environment; and, second, we combine the advantages of genome-wide association (GWA) and traditional linkage (QTL) mapping. Our experiments involved phenotyping nearly 20,000 plants over 2 winters under field conditions, including 184 worldwide natural accessions genotyped for 216,509 SNPs and 4,366 RILs derived from 13 independent crosses chosen to maximize genetic and phenotypic diversity. Based on a photothermal time model, the flowering time variation scored in our field experiment was poorly correlated with the flowering time variation previously obtained under greenhouse conditions, reinforcing previous demonstrations of the importance of genotype by environment interactions in A. thaliana and the need to study adaptive variation under natural conditions. The use of 4,366 RILs provides great power for dissecting the genetic architecture of flowering time in A. thaliana under our specific field conditions. We describe more than 60 additive QTLs, all with relatively small to medium effects and organized in 5 major clusters. We show that QTL mapping increases our power to distinguish true from false associations in GWA mapping. QTL mapping also permits the identification of false negatives, that is, causative SNPs that are lost when applying GWA methods that control for population structure. Major genes underpinning flowering time in the greenhouse were not associated with flowering time in this study. Instead, we found a prevalence of genes involved in the regulation of the plant circadian clock. Furthermore, we identified new genomic regions lacking obvious candidate genes.
Dissecting the genetic bases of adaptive traits is of primary importance in evolutionary biology. In this study, we combined a genome-wide association (GWA) study with traditional linkage mapping in order to detect the genetic bases underlying natural variation in flowering time in ecologically realistic conditions in the plant Arabidopsis thaliana. Our study involved phenotyping nearly 20,000 plants over 2 winters under field conditions in a temperate climate. We show that combined linkage and association mapping clearly outperforms each method alone when it comes to identifying true associations. This highlights the utility of combining different methods to localize genes involved in complex trait natural variation. Most candidate genes found in this study are involved in the regulation of the plant circadian clock and, surprisingly, were not associated with flowering time scored under greenhouse conditions. While rapid advances have been made in high-throughput genotyping and sequencing, high-throughput phenotyping of complex traits under natural conditions will be the next challenge for dissecting the genetic bases of adaptive variation in “laboratory” model organisms.
The population structure of an organism reflects its evolutionary history and influences its evolutionary trajectory. It constrains the combination of genetic diversity and reveals patterns of past gene flow. Understanding it is a prerequisite for detecting genomic regions under selection, predicting the effect of population disturbances, or modeling gene flow. This paper examines the detailed global population structure of Arabidopsis thaliana. Using a set of 5,707 plants collected from around the globe and genotyped at 149 SNPs, we show that while A. thaliana as a species self-fertilizes 97% of the time, there is considerable variation among local groups. This level of outcrossing greatly limits observed heterozygosity but is sufficient to generate considerable local haplotypic diversity. We also find that in its native Eurasian range A. thaliana exhibits continuous isolation by distance at every geographic scale without natural breaks corresponding to classical notions of populations. By contrast, in North America, where it exists as an exotic species, A. thaliana exhibits little or no population structure at a continental scale but local isolation by distance that extends hundreds of km. This suggests a pattern for the development of isolation by distance that can establish itself shortly after an organism fills a new habitat range. It also raises questions about the general applicability of many standard population genetics models. Any model based on discrete clusters of interchangeable individuals will be an uneasy fit to organisms like A. thaliana which exhibit continuous isolation by distance on many scales.
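The link between a 97% selfing rate and strongly reduced heterozygosity can be made concrete with the standard equilibrium relation between the selfing rate s and the inbreeding coefficient F, F = s / (2 − s). The sketch below applies that textbook relation; whether the paper estimated selfing in exactly this way is not stated in the abstract, and the function names are assumptions chosen for illustration.

```python
def inbreeding_from_selfing(s):
    """Equilibrium inbreeding coefficient F for a selfing rate s in [0, 1]."""
    return s / (2.0 - s)


def selfing_from_inbreeding(f):
    """Inverse relation: selfing rate implied by an inbreeding coefficient F."""
    return 2.0 * f / (1.0 + f)


def expected_heterozygosity(p, s):
    """Expected heterozygote frequency at a biallelic locus.

    p is the frequency of one allele; the random-mating expectation 2p(1-p)
    is scaled down by (1 - F) under partial selfing.
    """
    return 2.0 * p * (1.0 - p) * (1.0 - inbreeding_from_selfing(s))


f_97 = inbreeding_from_selfing(0.97)
het_97 = expected_heterozygosity(0.5, 0.97)
```

At s = 0.97 this gives F of about 0.94, so expected heterozygosity is only about 6% of its random-mating value, while the remaining 3% outcrossing still allows haplotypes to recombine.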
Much of the modern field of population genetics is premised on particular models of what an organism's population structure is and how it behaves. The classic models generally start with the idea of a single randomly mating population that has reached an evolutionary equilibrium. Many models relax some of these assumptions, allowing for phenomena such as assortative mating, discrete sub-populations with migration, self-fertilization, and sex-ratio distortion. Virtually all models, however, have as their core premise the notion that there exist classes of exchangeable individuals each of which represents an identical, independent sample from that class' distribution. For certain organisms, such as Drosophila melanogaster, these models do an excellent job of describing how populations work. For other organisms, such as humans, these models can be reasonable approximations but require a great deal of care in assembling samples and can begin to break down as sampling becomes locally dense. For the vast majority of organisms the applicability of these models has never been investigated.
Aquilegia formosa and A. pubescens are two closely related species belonging to the columbine genus. Despite their morphological and ecological differences, previous studies have revealed a large degree of intercompatibility, as well as little sequence divergence between these two taxa. We compared the inter- and intraspecific patterns of variation for 9 nuclear loci, and found that the two species were practically indistinguishable at the level of DNA sequence polymorphism, indicating either very recent speciation or continued gene flow. As a comparison, we also analyzed variation at two loci across 30 other Aquilegia taxa; this revealed slightly more differentiation among taxa, which seemed best explained by geographic distance. By contrast, we found no evidence for isolation by distance on a more local geographic scale. We conclude that the extremely low levels of genetic differentiation between A. formosa and A. pubescens at neutral loci will facilitate future genome-wide scans for speciation genes.
The domestic dog exhibits greater diversity in body size than any other terrestrial vertebrate. We used a strategy that exploits the breed structure of dogs to investigate the genetic basis of size. First, through a genome-wide scan, we identified a major quantitative trait locus (QTL) on chromosome 15 influencing size variation within a single breed. Second, we examined genetic variation in the 15-megabase interval surrounding the QTL in small and giant breeds and found marked evidence for a selective sweep spanning a single gene (IGF1), encoding insulin-like growth factor 1. A single IGF1 single-nucleotide polymorphism haplotype is common to all small breeds and nearly absent from giant breeds, suggesting that the same causal sequence variant is a major contributor to body size in all small dogs.
Studies of nucleotide diversity have found an excess of low-frequency amino acid polymorphisms segregating in Arabidopsis thaliana, suggesting a predominance of weak purifying selection acting on amino acid polymorphism in this inbreeding species. Here, we investigate levels of diversity and divergence at synonymous and nonsynonymous sites in 6 circumpolar populations of the outbreeding Arabidopsis lyrata and compare these results with A. thaliana, to test for differences in mutation and selection parameters across genes, populations, and species. We find that A. lyrata shows an excess of low-frequency nonsynonymous polymorphisms both within populations and species wide, consistent with weak purifying selection similar to the patterns observed in A. thaliana. Furthermore, nonsynonymous polymorphisms tend to be more restricted in their population distribution in A. lyrata, consistent with purifying selection preventing their geographic spread. Highly expressed genes show a reduced ratio of amino acid to synonymous change for both polymorphism and fixed differences, suggesting a general pattern of stronger purifying selection on high-expression proteins.
McDonald–Kreitman test; site-frequency spectrum; Arabidopsis; inbreeding; nonsynonymous; synonymous
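The McDonald–Kreitman test named in the keywords compares the ratio of nonsynonymous to synonymous changes within species (polymorphism) with the same ratio between species (fixed differences). A minimal sketch of the calculation follows, using the neutrality index and a two-sided Fisher's exact test on the 2×2 table. The counts are hypothetical and are not taken from the study; the function names are likewise illustrative.

```python
from math import comb


def neutrality_index(pn, ps, dn, ds):
    """Neutrality index NI = (Pn / Ps) / (Dn / Ds).

    pn, ps: nonsynonymous and synonymous polymorphisms;
    dn, ds: nonsynonymous and synonymous fixed differences.
    NI > 1 indicates an excess of nonsynonymous polymorphism,
    the pattern expected under weak purifying selection.
    """
    return (pn / ps) / (dn / ds)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts, for illustration only.
pn, ps, dn, ds = 20, 10, 10, 20
ni = neutrality_index(pn, ps, dn, ds)
# Rows: nonsynonymous, synonymous; columns: polymorphic, fixed.
p_value = fisher_exact_two_sided(pn, dn, ps, ds)
```

The exact test is used here because per-gene counts are often small; with large counts a chi-square or G-test on the same table is a common alternative.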
Previously, a candidate gene linkage approach on brother pairs affected with prostate cancer identified a locus of prostate cancer susceptibility at D3S1234 within the fragile histidine triad gene (FHIT), a tumor suppressor that induces apoptosis. Subsequent association tests on 16 SNPs spanning approximately 381 kb surrounding D3S1234 in Americans of European descent revealed significant evidence of association for a single SNP within intron 5 of FHIT. In the current study, re-sequencing and genotyping within a 28.5 kb region surrounding this SNP further delineated the association with prostate cancer risk to a 15 kb region. Multiple SNPs in sequences under evolutionary constraint within intron 5 of FHIT defined several related haplotypes with an increased risk of prostate cancer in European-Americans. Strong associations were detected for a risk haplotype defined by SNPs 138543, 142413, and 152494 in all cases (Pearson's χ² = 12.34, df = 1, P = 0.00045) and for the homozygous risk haplotype defined by SNPs 144716, 142413, and 148444 in cases that shared 2 alleles identical by descent with their affected brothers (Pearson's χ² = 11.50, df = 1, P = 0.00070). In addition to highly conserved sequences encompassing SNPs 148444 and 152413, population studies revealed strong signatures of natural selection for a 1 kb window covering the SNP 144716 in two human populations, the European American (π = 0.0072, Tajima's D = 3.31, 14 SNPs) and the Japanese (π = 0.0049, Fay & Wu's H = 8.05, 14 SNPs), as well as in chimpanzees (Fay & Wu's H = 8.62, 12 SNPs). These results strongly support the involvement of the FHIT intronic region in an increased risk of prostate cancer.
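The P values quoted for the two haplotype associations follow from the chi-square statistics with one degree of freedom. As a small worked check, the upper-tail probability for one degree of freedom can be computed from the complementary error function, since a chi-square variable with one degree of freedom is the square of a standard normal. The function name below is an assumption chosen for illustration.

```python
from math import erfc, sqrt


def chi2_sf_df1(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df.

    Uses P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return erfc(sqrt(x / 2.0))


# Statistics reported in the abstract above.
p_all_cases = chi2_sf_df1(12.34)
p_ibd2_cases = chi2_sf_df1(11.50)
```

Both values come out close to the reported P = 0.00045 and P = 0.00070, allowing for rounding of the published statistics.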
A central question in genomic imprinting is how a specific sequence is recognized as the target for epigenetic marking. In both mammals and plants, imprinted genes are often associated with tandem repeats and transposon-related sequences, but the role of these elements in epigenetic gene silencing remains elusive. FWA is an imprinted gene in Arabidopsis thaliana expressed specifically in the female gametophyte and endosperm. Tissue-specific and imprinted expression of FWA depends on DNA methylation in the FWA promoter, which comprises two direct repeats containing a sequence related to a SINE retroelement. Methylation of this element causes epigenetic silencing, but it is not known whether the methylation is targeted to the SINE-related sequence itself or whether the direct repeat structure is also necessary. Here we show that the repeat structure in the FWA promoter is highly diverse in species within the genus Arabidopsis. Four independent tandem repeat formation events were found in three closely related species. Another related species, A. halleri, did not have a tandem repeat in the FWA promoter. Unexpectedly, even in this species, FWA expression was imprinted and the FWA promoter was methylated. In addition, our expression analysis of the FWA gene in vegetative tissues revealed a high frequency of intra-specific variation in expression level. In conclusion, we show that the tandem repeat structure is dispensable for the epigenetic silencing of the FWA gene. Rather, the SINE-related sequence is sufficient for imprinting, vegetative silencing, and targeting of DNA methylation. Frequent independent tandem repeat formation events in the FWA promoter led us to propose that they may be a consequence, rather than a cause, of the epigenetic control. The possible significance of epigenetic variation in reproductive strategies during evolution is also discussed.
Genomic imprinting, mono-allelic gene expression depending on the parent-of-origin, is an epigenetic process known in mammals and flowering plants. A central question in genomic imprinting is how a specific sequence is recognized as the target for epigenetic marking. In both mammals and plants, imprinted genes are often associated with tandem repeats and transposon-related sequences, but the role of these elements in epigenetic gene silencing remains elusive. FWA is an imprinted gene in Arabidopsis thaliana expressed specifically in the female gametophyte and endosperm. The FWA promoter comprises two direct repeats containing a sequence related to a SINE retroelement. Methylation of this element causes epigenetic silencing, but it is not known whether the methylation is targeted to the SINE-related sequence itself or whether the direct repeat structure is also necessary. Here we show that the direct repeat structure is highly diverse in species within the genus Arabidopsis. Unexpectedly, we found that the direct repeat structure is dispensable for the epigenetic silencing and methylation of the FWA promoter. Rather, the SINE-related promoter sequence is sufficient for these features. Frequent independent formation of the tandem repeats suggests that they may be a consequence of the epigenetically controlled system.
We apply an analysis based on mixed models to the Genetic Analysis Workshop 15, Problem 3 simulated data. Such models are commonly used to mitigate the tendency for population structure, or cryptic relatedness, to inflate the false-positive rate of test statistics. They also allow for explicit modeling of varying degrees of relatedness in samples in which some individuals are related by (possibly unknown) pedigree, whereas others are not. Furthermore, the implementation of the method we describe here is quick enough to be used effectively on genome-wide data. We present an analysis of the data for Genetic Analysis Workshop 15, Problem 3, in which we show that these methods can effectively find signals in these data. Somewhat disappointingly, the false-positive rate does not appear to be reduced, but this is largely because the method used to simulate the data appears not to have encompassed effects, such as population stratification, that might have led to inflation of test statistics.
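One common way to quantify the kind of inflation discussed above is the genomic inflation factor: the median of the observed 1-df association statistics divided by the median of the χ2 distribution with one degree of freedom (about 0.455). The sketch below is an illustration under that assumption and is not taken from the study; the function name and the input values are hypothetical. Values near 1 suggest little inflation, and values well above 1 suggest confounding such as population structure or cryptic relatedness.

```python
from statistics import median

# Median of the chi-square distribution with one degree of freedom.
CHI2_DF1_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Genomic inflation factor: median observed statistic / expected median.

    chi2_stats: iterable of 1-df chi-square association statistics,
    one per marker.
    """
    return median(chi2_stats) / CHI2_DF1_MEDIAN


# Hypothetical test statistics, for illustration only.
stats = [0.02, 0.31, 0.46, 0.90, 1.75, 3.10, 0.12]
print(f"lambda = {genomic_inflation(stats):.3f}")
```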
A potentially serious disadvantage of association mapping is the fact that marker-trait associations may arise from confounding population structure as well as from linkage to causative polymorphisms. Using genome-wide marker data, we have previously demonstrated that the problem can be severe in a global sample of 95 Arabidopsis thaliana accessions, and that established methods for controlling for population structure are generally insufficient. Here, we use the same sample together with a number of flowering-related phenotypes and data-perturbation simulations to evaluate a wider range of methods for controlling for population structure. We find that, in terms of reducing the false-positive rate while maintaining statistical power, a recently introduced mixed-model approach that takes genome-wide differences in relatedness into account via estimated pairwise kinship coefficients generally performs best. By combining the association results with results from linkage mapping in F2 crosses, we identify one previously known true positive and several promising new associations, but also demonstrate the existence of both false positives and false negatives. Our results illustrate the potential of genome-wide association scans as a tool for dissecting the genetics of natural variation, while at the same time highlighting the pitfalls. The importance of study design is clear; our study is severely under-powered both in terms of sample size and marker density. Our results also provide a striking demonstration of confounding by population structure. While statistical methods can be used to ameliorate this problem, they cannot always be effective and are certainly not a substitute for independent evidence, such as that obtained via crosses or transgenic experiments. Ultimately, association mapping is a powerful tool for identifying a list of candidates that is short enough to permit further genetic study.
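To make the idea of pairwise kinship coefficients concrete, the sketch below computes a simple identity-by-state sharing matrix for inbred lines, each coded as one allele per marker. This is a deliberately simplified stand-in and an assumption on our part: the study's actual kinship estimator is not specified here, and the function name and genotypes are hypothetical. In a mixed model, a matrix of this kind defines the covariance of the random genetic background effect, so that closely related accessions are not treated as independent observations.

```python
def ibs_kinship(genotypes):
    """Pairwise identity-by-state sharing between inbred lines.

    genotypes: list of equal-length allele lists, one list per line,
    with one allele call per marker (no missing data assumed).
    Returns a symmetric matrix K where K[i][j] is the fraction of
    markers at which lines i and j carry the same allele.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    kinship = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(genotypes[i], genotypes[j]) if a == b)
            kinship[i][j] = kinship[j][i] = shared / m
    return kinship


# Hypothetical genotypes for three lines at four markers.
lines = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
for row in ibs_kinship(lines):
    print(row)
```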
There is currently tremendous interest in using association mapping to find the genes responsible for natural variation, particularly for human disease. In association mapping, researchers seek to identify regions of the genome where individuals who are phenotypically similar (e.g., they all have the same disease) are also unusually closely related. A potentially serious problem is that spurious correlations may arise if the population is structured so that members of a subgroup tend to be much more closely related. We have previously demonstrated that this problem can be severe in Arabidopsis thaliana, and that established statistical methods for controlling for population structure are insufficient. Here, we evaluate a broader range of methods. We find that a recently introduced mixed-model approach generally performs best. By combining the association results with results from linkage mapping in F2 crosses, we identify one previously known true positive and several promising new associations, but also demonstrate the existence of both false positives and false negatives. Our results illustrate the potential of genome-wide association scans as a tool for dissecting the genetics of natural variation, while at the same time highlighting the pitfalls.