Visual refractive errors (REs) are complex genetic traits with a largely unknown etiology. To date, genome-wide association studies (GWASs) of moderate size have identified several novel risk markers for RE, measured here as mean spherical equivalent (MSE). We performed a GWAS using a total of 7280 samples from five cohorts: the Age-Related Eye Disease Study (AREDS); the KORA study (‘Cooperative Health Research in the Region of Augsburg’); the Framingham Eye Study (FES); the Ogliastra Genetic Park-Talana (OGP-Talana) Study and the Multiethnic Study of Atherosclerosis (MESA). Genotyping was performed on Illumina and Affymetrix platforms with additional markers imputed to the HapMap II reference panel. We identified a new genome-wide significant locus on chromosome 16 (rs10500355, P = 3.9 × 10−9) in a combined discovery and replication set (26 953 samples). This single nucleotide polymorphism (SNP) is located within the RBFOX1 gene, which encodes a neuron-specific splicing factor regulating a wide range of alternative splicing events implicated in neuronal development and maturation, including transcription factors, other splicing factors and synaptic proteins.
Coronary artery disease (CAD) is a complex disease driven by myriad interactions of genetics and environmental factors. Traditionally, studies have analyzed only 1 disease factor at a time, providing useful but limited understanding of the underlying etiology. Recent advances in cost-effective and high-throughput technologies, such as single nucleotide polymorphism (SNP) genotyping, exome/genome/RNA sequencing, gene expression microarrays, and metabolomics assays have enabled the collection of millions of data points in many thousands of individuals. In order to make sense of such 'omics' data, effective analytical methods are needed. We review and highlight some of the main results in this area, focusing on integrative approaches that consider multiple modalities simultaneously. Such analyses have the potential to uncover the genetic basis of CAD, produce genomic risk scores (GRS) for disease prediction, disentangle the complex interactions underlying disease, and predict response to treatment.
Coronary artery disease; Coronary heart disease; Genomics; Systems biology; Mendelian randomization; Metabolites; Network analysis; Molecular systems model
Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
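The abstract above does not spell out flashpca's algorithm, so the following is only a minimal sketch of the basic building block that randomized PCA methods generalise to blocks of vectors: power iteration from a random starting vector, here recovering the leading principal axis of a small centred matrix. The function name `top_pc`, the toy data and the iteration count are all assumptions made for illustration.

```python
import math
import random


def top_pc(X, n_iter=50, seed=0):
    """Leading principal axis of X (rows = samples) by power iteration.

    Illustrative only: a single random start vector is repeatedly
    multiplied by Z^T Z, where Z is the column-centred data matrix.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Z = [[row[j] - means[j] for j in range(p)] for row in X]

    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]  # random start vector

    for _ in range(n_iter):
        # scores = Z v (one value per sample)
        scores = [sum(z[j] * v[j] for j in range(p)) for z in Z]
        # v = Z^T scores = Z^T Z v, then renormalise to unit length
        v = [sum(Z[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return v


# Toy data: all variation lies along the (1, 1) direction.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
axis = top_pc(X)
print(axis)  # a unit vector along (1, 1), up to sign
```

In a SNP setting the columns would typically be standardised genotypes and the per-sample scores `Z v` would be the quantities inspected for population structure; a practical randomized PCA would iterate on several vectors at once and orthogonalise them.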
Practical application of genomic-based risk stratification to clinical diagnosis is appealing yet performance varies widely depending on the disease and genomic risk score (GRS) method. Celiac disease (CD), a common immune-mediated illness, is strongly genetically determined and requires specific HLA haplotypes. HLA testing can exclude diagnosis but has low specificity, providing little information suitable for clinical risk stratification. Using six European cohorts, we provide a proof-of-concept that statistical learning approaches which simultaneously model all SNPs can generate robust and highly accurate predictive models of CD based on genome-wide SNP profiles. The high predictive capacity replicated both in cross-validation within each cohort (AUC of 0.87–0.89) and in independent replication across cohorts (AUC of 0.86–0.9), despite differences in ethnicity. The models explained 30–35% of disease variance and up to ∼43% of heritability. The GRS's utility was assessed in different clinically relevant settings. Comparable to HLA typing, the GRS can be used to identify individuals without CD with ≥99.6% negative predictive value; however, unlike HLA typing, fine-scale stratification of individuals into categories of higher risk for CD can identify those who would benefit from more invasive and costly definitive testing. The GRS is flexible and its performance can be adapted to the clinical situation by adjusting the threshold cut-off. Despite explaining a minority of disease heritability, our findings indicate that a genomic risk score provides clinically relevant information to improve upon current diagnostic pathways for CD and support further studies evaluating the clinical utility of this approach in CD and other complex diseases.
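The statement that a risk score's performance can be adapted by moving the threshold cut-off can be illustrated with a small sketch. The scores, disease labels and the function name `screen_metrics` below are invented for illustration and are not data or code from the study; in practice, predictive values also depend on disease prevalence in the tested population.

```python
def screen_metrics(scores, has_disease, threshold):
    """Confusion-matrix summaries when score >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for s, d in zip(scores, has_disease):
        positive = s >= threshold
        if positive and d:
            tp += 1
        elif positive:
            fp += 1
        elif d:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Predictive values are undefined if nobody falls on that side.
        "npv": tn / (tn + fn) if (tn + fn) else None,
        "ppv": tp / (tp + fp) if (tp + fp) else None,
    }


# Hypothetical risk scores and true disease status (1 = affected).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
status = [0, 0, 0, 1, 0, 1, 1, 1]

for cutoff in (0.5, 0.35):
    print(cutoff, screen_metrics(scores, status, cutoff))
```

Lowering the cut-off in this toy example removes the single false negative, raising the negative predictive value at the cost of calling more individuals positive, which is the kind of trade-off a clinical setting would tune.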
Celiac disease (CD) is a common immune-mediated illness, affecting approximately 1% of the population in Western countries but the diagnostic process remains sub-optimal. The development of CD is strongly dependent on specific human leukocyte antigen (HLA) genes, and HLA testing to identify CD susceptibility is now commonly undertaken in clinical practice. The clinical utility of HLA typing is to exclude CD when the CD susceptibility HLA types are absent, but notably, most people who possess HLA types imparting susceptibility for CD never develop CD. Therefore, while genetic testing in CD can overcome several limitations of the current diagnostic tools, the utility of HLA typing to identify those individuals at increased-risk of CD is limited. Using large datasets assaying single nucleotide polymorphisms (SNPs), we have developed genomic risk scores (GRS) based on multiple SNPs that can more accurately predict CD risk across several populations in “real world” clinical settings. The GRS can generate predictions that optimize CD risk stratification and diagnosis, potentially reducing the number of unnecessary follow-up investigations. The medical and economic impact of improving CD diagnosis is likely to be significant, and our findings support further studies into the role of personalized GRS's for other strongly heritable human diseases.
Genetic studies might provide new insights into the biological
mechanisms underlying lipid metabolism and risk of CAD. We therefore
conducted a genome-wide association study to identify novel genetic
determinants of LDL-c, HDL-c and triglycerides.
Methods and results
We combined genome-wide association data from eight studies,
comprising up to 17,723 participants with information on circulating lipid
concentrations. We did independent replication studies in up to 37,774
participants from eight populations and also in a population of Indian Asian
descent. We also assessed the association between SNPs at lipid loci and
risk of CAD in up to 9,633 cases and 38,684 controls.
We identified four novel genetic loci that showed reproducible
associations with lipids (P values 1.6 × 10−8 to
3.1 × 10−10). These include a potentially
functional SNP in the SLC39A8 gene for HDL-c, a SNP near
the MYLIP/GMPR and PPP1R3B genes for LDL-c
and at the AFF1 gene for triglycerides. SNPs showing strong
statistical association with one or more lipid traits at the
APOE-C1-C4-C2 cluster, LPL,
ZNF259-APOA5-A4-C3-A1 cluster and
TRIB1 loci were also associated with CAD risk (P values
1.1 × 10−3 to 1.2 ×
We have identified four novel loci associated with circulating
lipids. We also show that in addition to those that are largely associated
with LDL-c, genetic loci mainly associated with circulating triglycerides
and HDL-c are also associated with risk of CAD. These findings potentially
provide new insights into the biological mechanisms underlying lipid
metabolism and CAD risk.
lipids; lipoproteins; genetics; epidemiology
Narrow arterioles in the retina have been shown to predict hypertension as well as other vascular diseases, likely through an increase in the peripheral resistance of the microcirculatory flow. In this study, we performed a genome-wide association study in 18,722 unrelated individuals of European ancestry from the Cohorts for Heart and Aging Research in Genomic Epidemiology consortium and the Blue Mountain Eye Study, to identify genetic determinants associated with variations in retinal arteriolar caliber. Retinal vascular calibers were measured on digitized retinal photographs using a standardized protocol. One variant (rs2194025 on chromosome 5q14 near the myocyte enhancer factor 2C (MEF2C) gene) was associated with retinal arteriolar caliber in the meta-analysis of the discovery cohorts at genome-wide significance of P-value <5×10−8. This variant was replicated in an additional 3,939 individuals of European ancestry from the Australian Twins Study and Multi-Ethnic Study of Atherosclerosis (rs2194025, P-value = 2.11×10−12 in combined meta-analysis of discovery and replication cohorts). In independent studies of modest sample sizes, no significant association was found between this variant and clinical outcomes including coronary artery disease, stroke, myocardial infarction or hypertension. In conclusion, we found one novel locus that underlies genetic variation in the microvasculature and may be relevant to vascular disease. The relevance of these findings to clinical outcomes remains to be determined.
It has been hypothesized that multivariate analysis and systematic detection of epistatic interactions between explanatory genotyping variables may help resolve the problem of "missing heritability" currently observed in genome-wide association studies (GWAS). However, even the simplest bivariate analysis is still held back by significant statistical and computational challenges that are often addressed by reducing the set of analysed markers. Theoretically, it has been shown that combinations of loci may exist that show weak or no effects individually, but show significant (even complete) explanatory power over phenotype when combined. Reducing the set of analysed SNPs before bivariate analysis could easily omit such critical loci.
We have developed an exhaustive bivariate GWAS analysis methodology that yields a manageable subset of candidate marker pairs for subsequent analysis using other, often more computationally expensive techniques. Our model-free filtering approach is based on classification using ROC curve analysis, an alternative to much slower regression-based modelling techniques. Exhaustive analysis of studies containing approximately 450,000 SNPs and 5,000 samples requires only 2 hours using a desktop CPU or 13 minutes using a GPU (Graphics Processing Unit). We validate our methodology with analysis of simulated datasets as well as the seven Wellcome Trust Case-Control Consortium datasets that represent a wide range of real life GWAS challenges. We have identified SNP pairs that have considerably stronger association with disease than their individual component SNPs that often show negligible effect univariately. When compared against previously reported results in the literature, our methods re-detect most significant SNP-pairs and additionally detect many pairs absent from the literature that show strong association with disease. The high overlap suggests that our fast analysis could substitute for some slower alternatives.
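The idea that a SNP pair can discriminate cases from controls when neither SNP does alone can be made concrete with a toy ROC calculation. This is a hedged sketch, not the authors' filter: it assumes each joint-genotype cell is scored by its in-sample case fraction and computes the resulting AUC, and the function name `cell_auc` and the data are hypothetical.

```python
from collections import Counter


def cell_auc(cells, labels):
    """In-sample AUC when every genotype cell is scored by its case fraction.

    cells  : one hashable genotype (or genotype pair) per individual
    labels : 1 for case, 0 for control
    """
    cases = Counter(c for c, y in zip(cells, labels) if y == 1)
    ctrls = Counter(c for c, y in zip(cells, labels) if y == 0)
    n_case, n_ctrl = sum(cases.values()), sum(ctrls.values())
    score = {c: cases[c] / (cases[c] + ctrls[c]) for c in set(cells)}

    # Mann-Whitney form of the AUC: count case/control pairs in which
    # the case has the higher score, with ties counted as one half.
    wins = 0.0
    for a in score:
        for b in score:
            if score[a] > score[b]:
                wins += cases[a] * ctrls[b]
            elif score[a] == score[b]:
                wins += 0.5 * cases[a] * ctrls[b]
    return wins / (n_case * n_ctrl)


# Toy XOR-like example: disease occurs only when exactly one SNP carries
# the variant genotype, so neither SNP is informative on its own.
snp1 = [0, 0, 1, 1]
snp2 = [0, 1, 0, 1]
labels = [0, 1, 1, 0]

print(cell_auc(snp1, labels))                   # 0.5
print(cell_auc(snp2, labels))                   # 0.5
print(cell_auc(list(zip(snp1, snp2)), labels))  # 1.0
```

Because the score only requires counting genotype cells, it avoids fitting a regression model per pair, which is the property that makes this style of filter attractive for exhaustive scans.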
We demonstrate that the proposed methodology is robust, fast and capable of exhaustive search for epistatic interactions using a standard desktop computer. First, our implementation is significantly faster than timings for comparable algorithms reported in the literature, especially as our method allows simultaneous use of multiple statistical filters with low computing time overhead. Second, for some diseases, we have identified hundreds of SNP pairs that pass formal multiple test (Bonferroni) correction and could form a rich source of hypotheses for follow-up analysis.
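To give a sense of the multiple-testing burden behind the Bonferroni correction mentioned above, the arithmetic for an exhaustive pairwise scan can be worked through. The figure of 450,000 SNPs is the approximate study size quoted earlier; the family-wise error rate of α = 0.05 is an assumption made for illustration and is not stated in the text.

```latex
\[
  m = \binom{450{,}000}{2}
    = \frac{450{,}000 \times 449{,}999}{2}
    = 101{,}249{,}775{,}000 \approx 1.01 \times 10^{11}
\]
\[
  P_{\text{Bonferroni}} = \frac{\alpha}{m}
    = \frac{0.05}{101{,}249{,}775{,}000}
    \approx 4.9 \times 10^{-13}
\]
```

Under these assumptions a SNP pair would need a P value below roughly 4.9 × 10−13 to survive correction, several orders of magnitude stricter than the conventional single-SNP threshold of 5 × 10−8.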
A web-based version of the software used for this analysis is available at http://bioinformatics.research.nicta.com.au/gwis.
Nuclear magnetic resonance assays allow for measurement of a wide range of metabolic phenotypes. We report here the results of a GWAS on 8,330 Finnish individuals genotyped and imputed at 7.7 million SNPs for a range of 216 serum metabolic phenotypes assessed by NMR of serum samples. We identified significant associations (P < 2.31 × 10−10) at 31 loci, including 11 for which there have not been previous reports of associations to a metabolic trait or disorder. Analyses of Finnish twin pairs suggested that the metabolic measures reported here show higher heritability than comparable conventional metabolic phenotypes. In accordance with our expectations, SNPs at the 31 loci associated with individual metabolites account for a greater proportion of the genetic component of trait variance (up to 40%) than is typically observed for conventional serum metabolic phenotypes. The identification of such associations may provide substantial insight into cardiometabolic disorders.
Recent advances in our understanding of the genomics of the human metabolome have shed light on the pathways involved in metabolic and cardiovascular disease. Such studies crucially depend on the interpretation of complex molecular spectra. A recent study by Suhre and colleagues provides a way to identify potentially clinically relevant biomarkers without a priori information, such as reference spectra, thus aiding the discovery of additional spectral features and corresponding genomic loci associated with metabolism and disease.
Genetic factors explain a majority of risk variance for age-related macular degeneration (AMD). While genome-wide association studies (GWAS) for late AMD implicate genes in complement, inflammatory and lipid pathways, the genetic architecture of early AMD has been relatively understudied. We conducted a GWAS meta-analysis of early AMD, including 4,089 individuals with prevalent signs of early AMD (soft drusen and/or retinal pigment epithelial changes) and 20,453 individuals without these signs. For various published late AMD risk loci, we also compared effect sizes between early and late AMD using an additional 484 individuals with prevalent late AMD. GWAS meta-analysis confirmed previously reported association of variants at the complement factor H (CFH) (peak P = 1.5×10−31) and age-related maculopathy susceptibility 2 (ARMS2) (P = 4.3×10−24) loci, and suggested Apolipoprotein E (ApoE) polymorphisms (rs2075650; P = 1.1×10−6) associated with early AMD. Other possible loci that did not reach GWAS significance included variants in the zinc finger protein gene GLI3 (rs2049622; P = 8.9×10−6) and upstream of GLI2 (rs6721654; P = 6.5×10−6), encoding retinal Sonic hedgehog signalling regulators, and in the tyrosinase (TYR) gene (rs621313; P = 3.5×10−6), involved in melanin biosynthesis. For a range of published, late AMD risk loci, estimated effect sizes were significantly lower for early than late AMD. This study confirms the involvement of multiple established AMD risk variants in early AMD, but suggests weaker genetic effects on the risk of early AMD relative to late AMD. Several biological processes were suggested to be potentially specific for early AMD, including pathways regulating RPE cell melanin content and signalling pathways potentially involved in retinal regeneration, generating hypotheses for further investigation.
To identify previously unknown genetic loci associated with fasting glucose concentrations, we examined the leading association signals in ten genome-wide association scans involving a total of 36,610 individuals of European descent. Variants in the gene encoding melatonin receptor 1B (MTNR1B) were consistently associated with fasting glucose across all ten studies. The strongest signal was observed at rs10830963, where each G allele (frequency 0.30 in HapMap CEU) was associated with an increase of 0.07 (95% CI = 0.06-0.08) mmol/l in fasting glucose levels (P = 3.2 × 10−50) and reduced beta-cell function as measured by homeostasis model assessment (HOMA-B, P = 1.1 × 10−15). The same allele was associated with an increased risk of type 2 diabetes (odds ratio = 1.09 (1.05-1.12), per G allele P = 3.3 × 10−7) in a meta-analysis of 13 case-control studies totaling 18,236 cases and 64,453 controls. Our analyses also confirm previous associations of fasting glucose with variants at the G6PC2 (rs560887, P = 1.1 × 10−57) and GCK (rs4607517, P = 1.0 × 10−25) loci.
Association testing of multiple correlated phenotypes offers better power than univariate analysis of single traits. We analyzed 6,600 individuals from two population-based cohorts with both genome-wide SNP data and serum metabolomic profiles. From the observed correlation structure of 130 metabolites measured by nuclear magnetic resonance, we identified 11 metabolic networks and performed a multivariate genome-wide association analysis. We identified 34 genomic loci at genome-wide significance, of which 7 are novel. In comparison to univariate tests, multivariate association analysis identified nearly twice as many significant associations in total. Multi-tissue gene expression studies identified variants in our top loci, SERPINA1 and AQP9, as eQTLs and showed that SERPINA1 and AQP9 expression in human blood was associated with metabolites from their corresponding metabolic networks. Finally, liver expression of AQP9 was associated with atherosclerotic lesion area in mice, and in human arterial tissue both SERPINA1 and AQP9 were shown to be upregulated (6.3-fold and 4.6-fold, respectively) in atherosclerotic plaques. Our study illustrates the power of multi-phenotype GWAS and highlights candidate genes for atherosclerosis.
In this study, we aim to identify novel genetic variants for metabolism, characterize their effects on nearby genes, and show that the nearby genes are associated with metabolism and atherosclerosis. To discover new genetic variants, we use an alternative approach to traditional genome-wide association studies: we leverage the information in phenotype covariance to increase our statistical power. We identify variants at seven novel loci and then show that our top signals drive expression of nearby genes AQP9 and SERPINA1 in multiple tissues. We demonstrate that AQP9 and SERPINA1 gene expression, in turn, is associated with metabolite levels. Finally, we show that the genes are associated with atherosclerosis using mouse atherosclerotic lesion size (AQP9) as well as tissue from healthy human arteries and atherosclerotic plaques (AQP9 and SERPINA1). This study illustrates that multivariate analysis of correlated metabolites can boost power for gene discovery substantially. Further functional work will need to be performed to elucidate the biological role of SERPINA1 and AQP9 in atherosclerosis.
Multi-locus sequence typing (MLST) has become the gold standard for population analyses of bacterial pathogens. This method focuses on the sequences of a small number of loci (usually seven) to divide the population and is simple, robust and facilitates comparison of results between laboratories and over time. Over the last decade, researchers and population health specialists have invested substantial effort in building up public MLST databases for nearly 100 different bacterial species, and these databases contain a wealth of important information linked to MLST sequence types such as time and place of isolation, host or niche, serotype and even clinical or drug resistance profiles. Recent advances in sequencing technology mean it is increasingly feasible to perform bacterial population analysis at the whole genome level. This offers massive gains in resolving power and genetic profiling compared to MLST, and will eventually replace MLST for bacterial typing and population analysis. However given the wealth of data currently available in MLST databases, it is crucial to maintain backwards compatibility with MLST schemes so that new genome analyses can be understood in their proper historical context.
We present a software tool, SRST, for quick and accurate retrieval of sequence types from short read sets, using inputs easily downloaded from public databases. SRST uses read mapping and an allele assignment score incorporating sequence coverage and variability, to determine the most likely allele at each MLST locus. Analysis of over 3,500 loci in more than 500 publicly accessible Illumina read sets showed SRST to be highly accurate at allele assignment. SRST output is compatible with common analysis tools such as eBURST, Clonal Frame or PhyloViz, allowing easy comparison between novel genome data and MLST data. Alignment, fastq and pileup files can also be generated for novel alleles.
SRST is a novel software tool for accurate assignment of sequence types using short read data. Several uses for the tool are demonstrated, including quality control for high-throughput sequencing projects, plasmid MLST and analysis of genomic data during outbreak investigation. SRST is open-source, requires Python, BWA and SamTools, and is available from http://srst.sourceforge.net.
MLST; Short read; Illumina; Sequence analysis; Plasmid; Chromosome; Microbiology; Bacteria; Population analysis; Outbreak
A central goal of genomics is to predict phenotypic variation from genetic variation. Fitting predictive models to genome-wide and whole genome single nucleotide polymorphism (SNP) profiles allows us to estimate the predictive power of the SNPs and potentially develop diagnostic models for disease. However, many current datasets cannot be analysed with standard tools due to their large size.
We introduce SparSNP, a tool for fitting lasso linear models for massive SNP datasets quickly and with very low memory requirements. In analysis on a large celiac disease case/control dataset, we show that SparSNP runs substantially faster than four other state-of-the-art tools for fitting large scale penalised models. SparSNP was one of only two tools that could successfully fit models to the entire celiac disease dataset, and it did so with superior performance. Compared with the other tools, the models generated by SparSNP had equal or better predictive performance in cross-validation.
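For readers unfamiliar with lasso-penalised linear models, the sketch below fits one by cyclic coordinate descent with soft-thresholding on a tiny dense matrix. It is a generic textbook formulation with a squared-error loss, offered only as an illustration under those assumptions; it does not reproduce SparSNP's implementation, loss function or memory-saving techniques, and the names `soft_threshold` and `lasso_cd` are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta initially zero
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue  # constant-zero column carries no information
            # Covariance of column j with the partial residual, i.e. the
            # residual with column j's current contribution added back.
            rho = sum(col[i] * resid[i] for i in range(n)) / n + norm * beta[j]
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= col[i] * delta
                beta[j] = new
    return beta


# Toy data: the outcome depends on the first predictor only.
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [2.0, -2.0, 0.0, 0.0]

print(lasso_cd(X, y, lam=0.0))  # unpenalised fit
print(lasso_cd(X, y, lam=0.2))  # first coefficient shrunk, second stays 0
print(lasso_cd(X, y, lam=1.5))  # penalty large enough to zero everything
```

In a genomic setting each column would hold SNP genotypes for all individuals, and the penalty would be chosen by cross-validation; the sparsity induced by the penalty is what keeps the number of SNPs in the final model small.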
Genomic datasets are rapidly increasing in size, rendering existing approaches to model fitting impractical due to their prohibitive time or memory requirements. This study shows that SparSNP is an essential addition to the genomic analysis toolkit.
SparSNP is available at http://www.genomics.csse.unimelb.edu.au/SparSNP
Migraine is a common episodic neurological disorder, typically presenting with recurrent attacks of severe headache and autonomic dysfunction. Apart from rare monogenic subtypes, no genetic or molecular markers for migraine have been convincingly established. We identified the minor allele of rs1835740 on chromosome 8q22.1 to be associated with migraine (p=5.12 × 10−9, OR 1.23 [1.150-1.324]) in a genome-wide association study of 2,748 migraineurs from three European headache clinics and 10,747 population-matched controls. The association was replicated in 3,202 cases and 40,062 controls for an overall meta-analysis p-value of 1.60 × 10−11 (OR 1.18 [1.127 – 1.244]). rs1835740 is located between the astrocyte elevated gene 1 (MTDH/AEG-1) and plasma glutamate carboxypeptidase (PGCP). In an expression quantitative trait study in lymphoblastoid cell lines transcript levels of the MTDH/AEG-1 were found to have a significant correlation to rs1835740. Our data establish rs1835740 as the first genetic risk factor for migraine.
The lipid–leukocyte (LL) module is associated with, and reactive to, a wide variety of serum metabolites. The LL module appears to be a link between metabolism, adiposity, and inflammation. Serum metabolite concentrations themselves determine the connectedness of the LL module.
Comprehensive characterization of human tissues promises novel insights into the biological architecture of human diseases and traits. We assessed metabonomic, transcriptomic, and genomic variation for a large population-based cohort from the capital region of Finland. Network analyses identified a set of highly correlated genes, the lipid–leukocyte (LL) module, as having a prominent role in over 80 serum metabolites (of 134 measures quantified), including lipoprotein subclasses, lipids, and amino acids. Concurrent association with immune response markers suggested the LL module as a possible link between inflammation, metabolism, and adiposity. Further, genomic variation was used to generate a directed network and infer LL module's largely reactive nature to metabolites. Finally, gene co-expression in circulating leukocytes was shown to be dependent on serum metabolite concentrations, providing evidence for the hypothesis that the coherence of molecular networks themselves is conditional on environmental factors. These findings show the importance and opportunity of systematic molecular investigation of human population samples. To facilitate and encourage this investigation, the metabonomic, transcriptomic, and genomic data used in this study have been made available as a resource for the research community.
bioinformatics; biological networks; integrative genomics; metabonomics; transcriptomics
We report a genome-wide association (GWA) study of severe malaria in The Gambia. The initial GWA scan included 2,500 children genotyped on the Affymetrix 500K GeneChip, and a replication study included 3,400 children. We used this to examine the performance of GWA methods in Africa. We found considerable population stratification, and also that signals of association at known malaria resistance loci were greatly attenuated owing to weak linkage disequilibrium (LD). To investigate possible solutions to the problem of low LD, we focused on the HbS locus, sequencing this region of the genome in 62 Gambian individuals and then using these data to conduct multipoint imputation in the GWA samples. This increased the signal of association, from P = 4 × 10−7 to P = 4 × 10−14, with the peak of the signal located precisely at the HbS causal variant. Our findings provide proof of principle that fine-resolution multipoint imputation, based on population-specific sequencing data, can substantially boost authentic GWA signals and enable fine mapping of causal variants in African populations.
While recent scans for genetic variation associated with human disease have been immensely successful in uncovering large numbers of loci, far fewer studies have focused on the underlying pathways of disease pathogenesis. Many loci which are associated with disease and complex phenotypes map to non-coding, regulatory regions of the genome, indicating that modulation of gene transcription plays a key role. Thus, this study generated genome-wide profiles of both genetic and transcriptional variation from the total blood extracts of over 500 randomly-selected, unrelated individuals. Using measurements of blood lipids, key players in the progression of atherosclerosis, three levels of biological information are integrated in order to investigate the interactions between circulating leukocytes and proximal lipid compounds. Pair-wise correlations between gene expression and lipid concentration indicate a prominent role for basophil granulocytes and mast cells, cell types central to powerful allergic and inflammatory responses. Network analysis of gene co-expression showed that the top associations function as part of a single, previously unknown gene module, the Lipid Leukocyte (LL) module. This module replicated in T cells from an independent cohort while also displaying potential tissue specificity. Further, genetic variation driving LL module expression included the single nucleotide polymorphism (SNP) most strongly associated with serum immunoglobulin E (IgE) levels, a key antibody in allergy. Structural Equation Modeling (SEM) indicated that LL module is at least partially reactive to blood lipid levels. Taken together, this study uncovers a gene network linking blood lipids and circulating cell types and offers insight into the hypothesis that the inflammatory response plays a prominent role in metabolism and the potential control of atherogenesis.
Circulating lipid concentrations are important predictors of coronary artery disease. The main pathology of coronary artery disease is atherosclerosis, a cycle of lipid adherence to the walls of arteries and an inflammatory response resulting in more adhesion. To investigate the link between lipids and immune cells in circulation, we have generated both genomic and whole blood gene expression profiles for a population-based collection of individuals from the capital region of Finland. Key mediators of inflammation and allergy were shown to be correlated with lipid levels. Further, the expressions of these genes operated in such a highly coordinated fashion that they appeared to function as part of a single pathway, which itself was both highly correlated with and reactive to lipid levels. Our findings offer insight into how lipids activate circulating immune cells, potentially contributing to the pathogenesis of coronary artery disease.
We describe a novel approach for evaluating SNP genotypes of a genome-wide association scan to identify “ethnic outlier” subjects whose ethnicity is different or admixed compared to most other subjects in the genotyped sample set. Each ethnic outlier is detected by counting a genomic excess of “rare” heterozygotes and/or homozygotes whose frequencies are low (<1%) within genotypes of the sample set being evaluated. This method also enables simple and striking visualization of non-Caucasian chromosomal DNA segments interspersed within the chromosomes of ethnically admixed individuals. We show that this visualization of the mosaic structure of admixed human chromosomes gives results similar to another visualization method (SABER) but with much less computational time and burden. We also show that other methods for detecting ethnic outliers are enhanced by evaluating only genomic regions of visualized admixture rather than diluting outlier ancestry by evaluating the entire genome considered in aggregate. We have validated our method in the Wellcome Trust Case Control Consortium (WTCCC) study of 17,000 subjects as well as in HapMap subjects and simulated outliers of known ethnicity and admixture. The method's ability to precisely delineate chromosomal segments of non-Caucasian ethnicity has enabled us to demonstrate previously unreported non-Caucasian admixture in two HapMap Caucasian parents and in a number of WTCCC subjects. Its sensitive detection of ethnic outliers and simple visual discrimination of discrete chromosomal segments of different ethnicity implies that this method of rare heterozygotes and homozygotes (RHH) is likely to have diverse and important applications in humans and other species.
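The counting step described above can be sketched in a few lines. This is a simplified reading of the description, not the authors' software: it assumes genotypes coded as 0, 1 or 2 copies of an allele, treats any genotype class carried by fewer than 1% of subjects at a SNP as "rare", and tallies rare genotypes per subject. The function name `rare_genotype_counts` and the toy data are hypothetical.

```python
from collections import Counter


def rare_genotype_counts(genotypes, max_freq=0.01):
    """Per-subject count of genotypes that are rare within the sample set.

    genotypes : list of per-subject lists, one genotype code per SNP
    max_freq  : a genotype class is 'rare' if its sample frequency
                at that SNP is below this value
    """
    n_subj = len(genotypes)
    n_snps = len(genotypes[0])
    counts = [0] * n_subj
    for j in range(n_snps):
        column = [subject[j] for subject in genotypes]
        freq = Counter(column)
        for i, g in enumerate(column):
            if freq[g] / n_subj < max_freq:
                counts[i] += 1
    return counts


# Toy sample: 199 subjects homozygous for the common allele at 5 SNPs,
# plus one subject who is heterozygous at every SNP.
sample = [[0] * 5 for _ in range(199)] + [[1] * 5]
counts = rare_genotype_counts(sample)
print(max(counts), counts.index(max(counts)))  # 5 199
```

A subject with a marked excess of rare genotypes relative to the rest of the sample would be a candidate outlier; plotting where along each chromosome those rare genotypes cluster corresponds to the visualization of admixed segments that the abstract describes.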
To identify novel genetic loci associated with fasting glucose concentrations, we examined the leading association signals in 10 genome-wide association scans involving a total of 36,610 individuals of European descent. Variants in the gene encoding the melatonin receptor 1B (MTNR1B) were consistently associated with fasting glucose across all ten studies. The strongest signal was observed at rs10830963, where each G-allele (frequency 0.30 in HapMap CEU) was associated with an increase of 0.07 (95%CI 0.06–0.08) mmol/L in fasting glucose levels (P=3.2×10−50) and reduced beta-cell function as measured by homeostasis model assessment (HOMA-B, P=1.1×10−15). The same allele was associated with an increased risk of type 2 diabetes (odds ratio = 1.09 (1.05–1.12), per G allele P=3.3×10−7) in a meta-analysis of thirteen case-control studies totalling 18,236 cases and 64,453 controls. Our analyses also confirm previous associations of fasting glucose with variants at the G6PC2 (rs560887, P=1.1×10−57) and GCK (rs4607517, P=1.0×10−25) loci.
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.
Adult height is a model polygenic trait, but there has been limited success in identifying the genes underlying its normal variation. To identify genetic variants influencing adult human height, we used genome-wide association data from 13,665 individuals and genotyped 39 variants in an additional 16,482 samples. We identified 20 variants associated with adult height (P < 5 × 10−7, with 10 reaching P < 1 × 10−10). Combined, the 20 SNPs explain ~3% of height variation, with a ~5 cm difference between the 6.2% of people with 17 or fewer ‘tall’ alleles compared to the 5.5% with 27 or more ‘tall’ alleles. The loci we identified implicate genes in Hedgehog signaling (IHH, HHIP, PTCH1), extracellular matrix (EFEMP1, ADAMTSL3, ACAN) and cancer (CDK6, HMGA2, DLEU7) pathways, and provide new insights into human growth and developmental processes. Finally, our results provide insights into the genetic architecture of a classic quantitative trait.
To identify common variants influencing body mass index (BMI), we analyzed genome-wide association data from 16,876 individuals of European descent. After previously reported variants in FTO, the strongest association signal (rs17782313, P = 2.9 × 10−6) mapped 188 kb downstream of MC4R (melanocortin-4 receptor), mutations of which are the leading cause of monogenic severe childhood-onset obesity. We confirmed the BMI association in 60,352 adults (per-allele effect = 0.05 Z-score units; P = 2.8 × 10−15) and 5,988 children aged 7–11 (0.13 Z-score units; P = 1.5 × 10−8). In case-control analyses (n = 10,583), the odds for severe childhood obesity reached 1.30 (P = 8.0 × 10−11). Furthermore, we observed overtransmission of the risk allele to obese offspring in 660 families (P (pedigree disequilibrium test average; PDT-avg) = 2.4 × 10−4). The SNP location and patterns of phenotypic associations are consistent with effects mediated through altered MC4R function. Our findings establish that common variants near MC4R influence fat mass, weight and obesity risk at the population level and reinforce the need for large-scale data integration to identify variants influencing continuous biomedical traits.