1.  SRST2: Rapid genomic surveillance for public health and hospital microbiology labs 
Genome Medicine  2014;6(11):90.
Rapid molecular typing of bacterial pathogens is critical for public health epidemiology, surveillance and infection control, yet routine use of whole genome sequencing (WGS) for these purposes poses significant challenges. Here we present SRST2, a read mapping-based tool for fast and accurate detection of genes, alleles and multi-locus sequence types (MLST) from WGS data. Using >900 genomes from common pathogens, we show SRST2 is highly accurate and outperforms assembly-based methods in terms of both gene detection and allele assignment. We include validation of SRST2 within a public health laboratory, and demonstrate its use for microbial genome surveillance in the hospital setting. In the face of rising threats of antimicrobial resistance and emerging virulence among bacterial pathogens, SRST2 represents a powerful tool for rapidly extracting clinically useful information from raw WGS data.
Source code is available from
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-014-0090-6) contains supplementary material, which is available to authorized users.
PMCID: PMC4237778  PMID: 25422674
2.  Distribution and Medical Impact of Loss-of-Function Variants in the Finnish Founder Population 
PLoS Genetics  2014;10(7):e1004494.
Exome sequencing studies in complex diseases are challenged by the allelic heterogeneity, large number and modest effect sizes of associated variants on disease risk and the presence of large numbers of neutral variants, even in phenotypically relevant genes. Isolated populations with recent bottlenecks offer advantages for studying rare variants in complex diseases as they have deleterious variants that are present at higher frequencies as well as a substantial reduction in rare neutral variation. To explore the potential of the Finnish founder population for studying low-frequency (0.5–5%) variants in complex diseases, we compared exome sequence data on 3,000 Finns to the same number of non-Finnish Europeans and discovered that, despite having fewer variable sites overall, the average Finn has more low-frequency loss-of-function variants and complete gene knockouts. We then used several well-characterized Finnish population cohorts to study the phenotypic effects of 83 enriched loss-of-function variants across 60 phenotypes in 36,262 Finns. Using a deep set of quantitative traits collected on these cohorts, we show 5 associations (p<5×10−8) including splice variants in LPA that lowered plasma lipoprotein(a) levels (P = 1.5×10−117). Through accessing the national medical records of these participants, we evaluate the LPA finding via Mendelian randomization and confirm that these splice variants confer protection from cardiovascular disease (OR = 0.84, P = 3×10−4), demonstrating for the first time the correlation between very low levels of LPA in humans with potential therapeutic implications for cardiovascular diseases. More generally, this study articulates substantial advantages for studying the role of rare variation in complex phenotypes in founder populations like the Finns and by combining a unique population genetic history with data from large population cohorts and centralized research access to National Health Registers.
Author Summary
We explored the coding regions of 3,000 Finnish individuals with 3,000 non-Finnish Europeans (NFEs) using whole-exome sequence data, in order to understand how an individual from a bottlenecked population might differ from an individual from an out-bred population. We provide empirical evidence that there are more rare and low-frequency deleterious alleles in Finns compared to NFEs, such that an average Finn has almost twice as many low-frequency complete knockouts of a gene. As such, we hypothesized that some of these low-frequency loss-of-function variants might have important medical consequences in humans and genotyped 83 of these variants in 36,000 Finns. In doing so, we discovered that completely knocking out the TSFM gene might result in inviability or a very severe phenotype in humans and that knocking out the LPA gene might confer protection against coronary heart diseases, suggesting that LPA is likely to be a good potential therapeutic target.
PMCID: PMC4117444  PMID: 25078778
3.  Meta-analysis of genome-wide association studies in five cohorts reveals common variants in RBFOX1, a regulator of tissue-specific splicing, associated with refractive error 
Human Molecular Genetics  2013;22(13):2754-2764.
Visual refractive errors (REs) are complex genetic traits with a largely unknown etiology. To date, genome-wide association studies (GWASs) of moderate size have identified several novel risk markers for RE, measured here as mean spherical equivalent (MSE). We performed a GWAS using a total of 7280 samples from five cohorts: the Age-Related Eye Disease Study (AREDS); the KORA study (‘Cooperative Health Research in the Region of Augsburg’); the Framingham Eye Study (FES); the Ogliastra Genetic Park-Talana (OGP-Talana) Study and the Multiethnic Study of Atherosclerosis (MESA). Genotyping was performed on Illumina and Affymetrix platforms with additional markers imputed to the HapMap II reference panel. We identified a new genome-wide significant locus on chromosome 16 (rs10500355, P = 3.9 × 10−9) in a combined discovery and replication set (26 953 samples). This single nucleotide polymorphism (SNP) is located within the RBFOX1 gene which is a neuron-specific splicing factor regulating a wide range of alternative splicing events implicated in neuronal development and maturation, including transcription factors, other splicing factors and synaptic proteins.
PMCID: PMC3674806  PMID: 23474815
4.  Towards a Molecular Systems Model of Coronary Artery Disease 
Current Cardiology Reports  2014;16(6):488.
Coronary artery disease (CAD) is a complex disease driven by myriad interactions of genetics and environmental factors. Traditionally, studies have analyzed only 1 disease factor at a time, providing useful but limited understanding of the underlying etiology. Recent advances in cost-effective and high-throughput technologies, such as single nucleotide polymorphism (SNP) genotyping, exome/genome/RNA sequencing, gene expression microarrays, and metabolomics assays have enabled the collection of millions of data points in many thousands of individuals. In order to make sense of such 'omics' data, effective analytical methods are needed. We review and highlight some of the main results in this area, focusing on integrative approaches that consider multiple modalities simultaneously. Such analyses have the potential to uncover the genetic basis of CAD, produce genomic risk scores (GRS) for disease prediction, disentangle the complex interactions underlying disease, and predict response to treatment.
PMCID: PMC4050311  PMID: 24743898
Coronary artery disease; Coronary heart disease; Genomics; Systems biology; Mendelian randomization; Metabolites; Network analysis; Molecular systems model
5.  Fast Principal Component Analysis of Large-Scale Genome-Wide Data 
PLoS ONE  2014;9(4):e93766.
Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
PMCID: PMC3981753  PMID: 24718290
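The flashpca abstract above says only that the tool is "based on randomized algorithms". As a rough illustration of that family of methods, the sketch below uses one common randomized scheme: a random projection with a few power iterations, followed by an exact SVD of the small projected matrix. It assumes NumPy is available. The function name, parameters and defaults are hypothetical, and nothing here is taken from the flashpca source code.

```python
import numpy as np


def randomized_pca(X, n_components, n_oversample=5, n_power_iter=2, seed=0):
    """Approximate the top principal components of X (samples x SNPs).

    Generic randomized range-finder sketch; NOT flashpca's implementation.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                      # centre each column (SNP)
    n_samples, n_features = Xc.shape
    k = min(n_components + n_oversample, n_samples, n_features)

    # Project onto k random directions, then orthonormalise the result.
    omega = rng.standard_normal((n_features, k))
    Q, _ = np.linalg.qr(Xc @ omega)

    # Power iterations sharpen the captured subspace towards the top PCs.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)

    # Exact SVD of the small (k x n_features) projected matrix.
    B = Q.T @ Xc
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small

    scores = U[:, :n_components] * s[:n_components]   # sample coordinates
    return scores, s[:n_components], Vt[:n_components]
```

The cost is dominated by a handful of products between the data matrix and thin matrices with k columns, which is why schemes of this kind tend to scale better than a full eigen-decomposition when only the first few components are needed.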
6.  Accurate and Robust Genomic Prediction of Celiac Disease Using Statistical Learning 
PLoS Genetics  2014;10(2):e1004137.
Practical application of genomic-based risk stratification to clinical diagnosis is appealing, yet performance varies widely depending on the disease and genomic risk score (GRS) method. Celiac disease (CD), a common immune-mediated illness, is strongly genetically determined and requires specific HLA haplotypes. HLA testing can exclude diagnosis but has low specificity, providing little information suitable for clinical risk stratification. Using six European cohorts, we provide a proof-of-concept that statistical learning approaches which simultaneously model all SNPs can generate robust and highly accurate predictive models of CD based on genome-wide SNP profiles. The high predictive capacity replicated both in cross-validation within each cohort (AUC of 0.87–0.89) and in independent replication across cohorts (AUC of 0.86–0.9), despite differences in ethnicity. The models explained 30–35% of disease variance and up to ∼43% of heritability. The GRS's utility was assessed in different clinically relevant settings. Comparable to HLA typing, the GRS can be used to identify individuals without CD with ≥99.6% negative predictive value; however, unlike HLA typing, fine-scale stratification of individuals into categories of higher risk for CD can identify those that would benefit from more invasive and costly definitive testing. The GRS is flexible and its performance can be adapted to the clinical situation by adjusting the threshold cut-off. Despite explaining a minority of disease heritability, our findings indicate a genomic risk score provides clinically relevant information to improve upon current diagnostic pathways for CD and support further studies evaluating the clinical utility of this approach in CD and other complex diseases.
Author Summary
Celiac disease (CD) is a common immune-mediated illness, affecting approximately 1% of the population in Western countries but the diagnostic process remains sub-optimal. The development of CD is strongly dependent on specific human leukocyte antigen (HLA) genes, and HLA testing to identify CD susceptibility is now commonly undertaken in clinical practice. The clinical utility of HLA typing is to exclude CD when the CD susceptibility HLA types are absent, but notably, most people who possess HLA types imparting susceptibility for CD never develop CD. Therefore, while genetic testing in CD can overcome several limitations of the current diagnostic tools, the utility of HLA typing to identify those individuals at increased-risk of CD is limited. Using large datasets assaying single nucleotide polymorphisms (SNPs), we have developed genomic risk scores (GRS) based on multiple SNPs that can more accurately predict CD risk across several populations in “real world” clinical settings. The GRS can generate predictions that optimize CD risk stratification and diagnosis, potentially reducing the number of unnecessary follow-up investigations. The medical and economic impact of improving CD diagnosis is likely to be significant, and our findings support further studies into the role of personalized GRS's for other strongly heritable human diseases.
PMCID: PMC3923679  PMID: 24550740
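The abstract above notes that GRS performance can be adapted by adjusting the threshold cut-off. The sketch below shows, on made-up scores and labels, how negative predictive value and sensitivity change with that cut-off. The function names and the toy data are assumptions for illustration only and are unrelated to the study's cohorts.

```python
def confusion_counts(scores, labels, threshold):
    """Return (tp, fp, tn, fn), calling score >= threshold 'higher risk'."""
    tp = fp = tn = fn = 0
    for score, is_case in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and is_case:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_case:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def negative_predictive_value(scores, labels, threshold):
    """Fraction of individuals called 'lower risk' who are truly unaffected."""
    _, _, tn, fn = confusion_counts(scores, labels, threshold)
    called_negative = tn + fn
    return tn / called_negative if called_negative else float("nan")


def sensitivity(scores, labels, threshold):
    """Fraction of true cases that are called 'higher risk'."""
    tp, _, _, fn = confusion_counts(scores, labels, threshold)
    n_cases = tp + fn
    return tp / n_cases if n_cases else float("nan")


# Hypothetical risk scores and case labels (1 = case, 0 = control).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
```

On this toy data, lowering the cut-off from 0.5 to 0.35 raises the negative predictive value from 0.75 to 1.0, at the price of calling more individuals "higher risk".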
7.  Genetic variants influencing circulating lipid levels and risk of coronary artery disease 
Genetic studies might provide new insights into the biological mechanisms underlying lipid metabolism and risk of CAD. We therefore conducted a genome-wide association study to identify novel genetic determinants of LDL-c, HDL-c and triglycerides.
Methods and results
We combined genome-wide association data from eight studies, comprising up to 17,723 participants with information on circulating lipid concentrations. We did independent replication studies in up to 37,774 participants from eight populations and also in a population of Indian Asian descent. We also assessed the association between SNPs at lipid loci and risk of CAD in up to 9,633 cases and 38,684 controls.
We identified four novel genetic loci that showed reproducible associations with lipids (P values 1.6 × 10−8 to 3.1 × 10−10). These include a potentially functional SNP in the SLC39A8 gene for HDL-c, a SNP near the MYLIP/GMPR and PPP1R3B genes for LDL-c and at the AFF1 gene for triglycerides. SNPs showing strong statistical association with one or more lipid traits at the CELSR2, APOB, APOE-C1-C4-C2 cluster, LPL, ZNF259-APOA5-A4-C3-A1 cluster and TRIB1 loci were also associated with CAD risk (P values 1.1 × 10−3 to 1.2 × 10−9).
We have identified four novel loci associated with circulating lipids. We also show that in addition to those that are largely associated with LDL-c, genetic loci mainly associated with circulating triglycerides and HDL-c are also associated with risk of CAD. These findings potentially provide new insights into the biological mechanisms underlying lipid metabolism and CAD risk.
PMCID: PMC3891568  PMID: 20864672
lipids; lipoproteins; genetics; epidemiology
8.  Elucidation of Pathways Driving Asthma Pathogenesis: Development of a Systems-Level Analytic Strategy 
Asthma is a genetically complex, chronic lung disease defined clinically as episodic airflow limitation and breathlessness that is at least partially reversible, either spontaneously or in response to therapy. Whereas asthma was rare in the late 1800s and early 1900s, the marked increase in its incidence and prevalence since the 1960s points to substantial gene × environment interactions occurring over a period of years, but these interactions are very poorly understood (1–6). It is widely believed that the majority of asthma begins during childhood and manifests first as intermittent wheeze. However, wheeze is also very common in infancy and only a subset of wheezy children progress to persistent asthma for reasons that are largely obscure. Here, we review the current literature regarding causal pathways leading to early asthma development and chronicity. Given the complex interactions of many risk factors over time eventually leading to apparently multiple asthma phenotypes, we suggest that deeply phenotyped cohort studies combined with sophisticated network models will be required to derive the next generation of biological and clinical insights in asthma pathogenesis.
PMCID: PMC4172064  PMID: 25295037
allergy; asthma; systems biology; virus infection; birth cohort; childhood; immune function; epidemiology
9.  Genetic Loci for Retinal Arteriolar Microcirculation 
PLoS ONE  2013;8(6):e65804.
Narrow arterioles in the retina have been shown to predict hypertension as well as other vascular diseases, likely through an increase in the peripheral resistance of the microcirculatory flow. In this study, we performed a genome-wide association study in 18,722 unrelated individuals of European ancestry from the Cohorts for Heart and Aging Research in Genomic Epidemiology consortium and the Blue Mountain Eye Study, to identify genetic determinants associated with variations in retinal arteriolar caliber. Retinal vascular calibers were measured on digitized retinal photographs using a standardized protocol. One variant (rs2194025 on chromosome 5q14 near the myocyte enhancer factor 2C [MEF2C] gene) was associated with retinal arteriolar caliber in the meta-analysis of the discovery cohorts at genome-wide significance of P-value <5×10−8. This variant was replicated in an additional 3,939 individuals of European ancestry from the Australian Twins Study and Multi-Ethnic Study of Atherosclerosis (rs2194025, P-value = 2.11×10−12 in combined meta-analysis of discovery and replication cohorts). In independent studies of modest sample sizes, no significant association was found between this variant and clinical outcomes including coronary artery disease, stroke, myocardial infarction or hypertension. In conclusion, we found one novel locus that underlies genetic variation in microvasculature and may be relevant to vascular disease. The relevance of these findings to clinical outcomes remains to be determined.
PMCID: PMC3680438  PMID: 23776548
10.  GWIS - model-free, fast and exhaustive search for epistatic interactions in case-control GWAS 
BMC Genomics  2013;14(Suppl 3):S10.
It has been hypothesized that multivariate analysis and systematic detection of epistatic interactions between explanatory genotyping variables may help resolve the problem of "missing heritability" currently observed in genome-wide association studies (GWAS). However, even the simplest bivariate analysis is still held back by significant statistical and computational challenges that are often addressed by reducing the set of analysed markers. Theoretically, it has been shown that combinations of loci may exist that show weak or no effects individually, but show significant (even complete) explanatory power over phenotype when combined. Reducing the set of analysed SNPs before bivariate analysis could easily omit such critical loci.
We have developed an exhaustive bivariate GWAS analysis methodology that yields a manageable subset of candidate marker pairs for subsequent analysis using other, often more computationally expensive techniques. Our model-free filtering approach is based on classification using ROC curve analysis, an alternative to much slower regression-based modelling techniques. Exhaustive analysis of studies containing approximately 450,000 SNPs and 5,000 samples requires only 2 hours using a desktop CPU or 13 minutes using a GPU (Graphics Processing Unit). We validate our methodology with analysis of simulated datasets as well as the seven Wellcome Trust Case-Control Consortium datasets that represent a wide range of real life GWAS challenges. We have identified SNP pairs that have considerably stronger association with disease than their individual component SNPs that often show negligible effect univariately. When compared against previously reported results in the literature, our methods re-detect most significant SNP-pairs and additionally detect many pairs absent from the literature that show strong association with disease. The high overlap suggests that our fast analysis could substitute for some slower alternatives.
We demonstrate that the proposed methodology is robust, fast and capable of exhaustive search for epistatic interactions using a standard desktop computer. First, our implementation is significantly faster than timings for comparable algorithms reported in the literature, especially as our method allows simultaneous use of multiple statistical filters with low computing time overhead. Second, for some diseases, we have identified hundreds of SNP pairs that pass formal multiple test (Bonferroni) correction and could form a rich source of hypotheses for follow-up analysis.
A web-based version of the software used for this analysis is available at
PMCID: PMC3665501  PMID: 23819779
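The GWIS abstract above describes filtering SNP pairs with ROC curve analysis rather than regression. The sketch below is one simple way to score a pair in that spirit: treat the nine joint genotype cells as a classifier ranked by case fraction and compute the area under the resulting ROC curve. This is an assumed, simplified statistic and not necessarily the one GWIS uses. The function names and toy genotypes are hypothetical.

```python
from itertools import product
from math import comb


def pair_auc(snp_a, snp_b, is_case):
    """AUC obtained by ranking the 3x3 joint genotype cells by case fraction.

    Genotypes are coded 0, 1 or 2 (minor-allele counts).
    """
    cells = list(product(range(3), repeat=2))
    cases = {cell: 0 for cell in cells}
    controls = {cell: 0 for cell in cells}
    for a, b, y in zip(snp_a, snp_b, is_case):
        if y:
            cases[(a, b)] += 1
        else:
            controls[(a, b)] += 1
    n_cases = sum(cases.values())
    n_controls = sum(controls.values())
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")

    def case_fraction(cell):
        return cases[cell] / (cases[cell] + controls[cell])

    occupied = [c for c in cells if cases[c] + controls[c] > 0]
    ordered = sorted(occupied, key=case_fraction, reverse=True)

    # Walk along the ROC curve cell by cell, summing trapezoid areas.
    auc = tpr = fpr = 0.0
    for cell in ordered:
        new_tpr = tpr + cases[cell] / n_cases
        new_fpr = fpr + controls[cell] / n_controls
        auc += (new_fpr - fpr) * (tpr + new_tpr) / 2
        tpr, fpr = new_tpr, new_fpr
    return auc


def n_snp_pairs(n_snps):
    """Number of unordered SNP pairs in an exhaustive bivariate scan."""
    return comb(n_snps, 2)


# Toy 'pure epistasis' data: disease occurs only when the genotypes differ.
toy_a = [0, 0, 2, 2]
toy_b = [0, 2, 0, 2]
toy_y = [0, 1, 1, 0]
```

On the toy data the pair separates cases from controls perfectly (AUC 1.0), while either SNP on its own is uninformative (AUC 0.5), which is the situation the abstract warns about when markers are pre-filtered individually. For roughly 450,000 SNPs, `n_snp_pairs` gives about 1.0 × 10^11 pairs, which indicates how stringent a Bonferroni correction over all pairs must be.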
11.  Genome-wide association study identifies multiple loci influencing human serum metabolite levels 
Nature genetics  2012;44(3):269-276.
Nuclear magnetic resonance assays allow for measurement of a wide range of metabolic phenotypes. We report here the results of a GWAS on 8,330 Finnish individuals genotyped and imputed at 7.7 million SNPs for a range of 216 serum metabolic phenotypes assessed by NMR of serum samples. We identified significant associations (P < 2.31 × 10−10) at 31 loci, including 11 for which there have not been previous reports of associations to a metabolic trait or disorder. Analyses of Finnish twin pairs suggested that the metabolic measures reported here show higher heritability than comparable conventional metabolic phenotypes. In accordance with our expectations, SNPs at the 31 loci associated with individual metabolites account for a greater proportion of the genetic component of trait variance (up to 40%) than is typically observed for conventional serum metabolic phenotypes. The identification of such associations may provide substantial insight into cardiometabolic disorders.
PMCID: PMC3605033  PMID: 22286219
12.  Look, no hands! Spectral biomarkers from genetic association studies 
Genome Medicine  2013;5(2):14.
Recent advances in our understanding of the genomics of the human metabolome have shed light on the pathways involved in metabolic and cardiovascular disease. Such studies crucially depend on the interpretation of complex molecular spectra. A recent study by Suhre and colleagues provides a way to identify potentially clinically relevant biomarkers without a priori information, such as reference spectra, thus aiding the discovery of additional spectral features and corresponding genomic loci associated with metabolism and disease.
PMCID: PMC3706812  PMID: 23510086
13.  Insights into the Genetic Architecture of Early Stage Age-Related Macular Degeneration: A Genome-Wide Association Study Meta-Analysis 
PLoS ONE  2013;8(1):e53830.
Genetic factors explain a majority of risk variance for age-related macular degeneration (AMD). While genome-wide association studies (GWAS) for late AMD implicate genes in complement, inflammatory and lipid pathways, the genetic architecture of early AMD has been relatively under studied. We conducted a GWAS meta-analysis of early AMD, including 4,089 individuals with prevalent signs of early AMD (soft drusen and/or retinal pigment epithelial changes) and 20,453 individuals without these signs. For various published late AMD risk loci, we also compared effect sizes between early and late AMD using an additional 484 individuals with prevalent late AMD. GWAS meta-analysis confirmed previously reported association of variants at the complement factor H (CFH) (peak P = 1.5×10−31) and age-related maculopathy susceptibility 2 (ARMS2) (P = 4.3×10−24) loci, and suggested Apolipoprotein E (ApoE) polymorphisms (rs2075650; P = 1.1×10−6) associated with early AMD. Other possible loci that did not reach GWAS significance included variants in the zinc finger protein gene GLI3 (rs2049622; P = 8.9×10−6) and upstream of GLI2 (rs6721654; P = 6.5×10−6), encoding retinal Sonic hedgehog signalling regulators, and in the tyrosinase (TYR) gene (rs621313; P = 3.5×10−6), involved in melanin biosynthesis. For a range of published, late AMD risk loci, estimated effect sizes were significantly lower for early than late AMD. This study confirms the involvement of multiple established AMD risk variants in early AMD, but suggests weaker genetic effects on the risk of early AMD relative to late AMD. Several biological processes were suggested to be potentially specific for early AMD, including pathways regulating RPE cell melanin content and signalling pathways potentially involved in retinal regeneration, generating hypotheses for further investigation.
PMCID: PMC3543264  PMID: 23326517
14.  Variants in MTNR1B influence fasting glucose levels 
Prokopenko, Inga | Langenberg, Claudia | Florez, Jose C | Saxena, Richa | Soranzo, Nicole | Thorleifsson, Gudmar | Loos, Ruth J F | Manning, Alisa K | Jackson, Anne U | Aulchenko, Yurii | Potter, Simon C | Erdos, Michael R | Sanna, Serena | Hottenga, Jouke-Jan | Wheeler, Eleanor | Kaakinen, Marika | Lyssenko, Valeriya | Chen, Wei-Min | Ahmadi, Kourosh | Beckmann, Jacques S | Bergman, Richard N | Bochud, Murielle | Bonnycastle, Lori L | Buchanan, Thomas A | Cao, Antonio | Cervino, Alessandra | Coin, Lachlan | Collins, Francis S | Crisponi, Laura | de Geus, Eco J C | Dehghan, Abbas | Deloukas, Panos | Doney, Alex S F | Elliott, Paul | Freimer, Nelson | Gateva, Vesela | Herder, Christian | Hofman, Albert | Hughes, Thomas E | Hunt, Sarah | Illig, Thomas | Inouye, Michael | Isomaa, Bo | Johnson, Toby | Kong, Augustine | Krestyaninova, Maria | Kuusisto, Johanna | Laakso, Markku | Lim, Noha | Lindblad, Ulf | Lindgren, Cecilia M | McCann, Owen T | Mohlke, Karen L | Morris, Andrew D | Naitza, Silvia | Orrù, Marco | Palmer, Colin N A | Pouta, Anneli | Randall, Joshua | Rathmann, Wolfgang | Saramies, Jouko | Scheet, Paul | Scott, Laura J | Scuteri, Angelo | Sharp, Stephen | Sijbrands, Eric | Smit, Jan H | Song, Kijoung | Steinthorsdottir, Valgerdur | Stringham, Heather M | Tuomi, Tiinamaija | Tuomilehto, Jaakko | Uitterlinden, André G | Voight, Benjamin F | Waterworth, Dawn | Wichmann, H-Erich | Willemsen, Gonneke | Witteman, Jacqueline C M | Yuan, Xin | Zhao, Jing Hua | Zeggini, Eleftheria | Schlessinger, David | Sandhu, Manjinder | Boomsma, Dorret I | Uda, Manuela | Spector, Tim D | Penninx, Brenda WJH | Altshuler, David | Vollenweider, Peter | Jarvelin, Marjo Riitta | Lakatta, Edward | Waeber, Gerard | Fox, Caroline S | Peltonen, Leena | Groop, Leif C | Mooser, Vincent | Cupples, L Adrienne | Thorsteinsdottir, Unnur | Boehnke, Michael | Barroso, Inês | Van Duijn, Cornelia | Dupuis, Josée | Watanabe, Richard M | Stefansson, Kari | McCarthy, Mark I | Wareham, Nicholas J | 
Meigs, James B | Abecasis, Gonçalo R
Nature genetics  2008;41(1):77-81.
To identify previously unknown genetic loci associated with fasting glucose concentrations, we examined the leading association signals in ten genome-wide association scans involving a total of 36,610 individuals of European descent. Variants in the gene encoding melatonin receptor 1B (MTNR1B) were consistently associated with fasting glucose across all ten studies. The strongest signal was observed at rs10830963, where each G allele (frequency 0.30 in HapMap CEU) was associated with an increase of 0.07 (95% CI = 0.06-0.08) mmol/l in fasting glucose levels (P = 3.2 × 10−50) and reduced beta-cell function as measured by homeostasis model assessment (HOMA-B, P = 1.1 × 10−15). The same allele was associated with an increased risk of type 2 diabetes (odds ratio = 1.09 (1.05-1.12), per G allele P = 3.3 × 10−7) in a meta-analysis of 13 case-control studies totaling 18,236 cases and 64,453 controls. Our analyses also confirm previous associations of fasting glucose with variants at the G6PC2 (rs560887, P = 1.1 × 10−57) and GCK (rs4607517, P = 1.0 × 10−25) loci.
PMCID: PMC2682768  PMID: 19060907
15.  Novel Loci for Metabolic Networks and Multi-Tissue Expression Studies Reveal Genes for Atherosclerosis 
PLoS Genetics  2012;8(8):e1002907.
Association testing of multiple correlated phenotypes offers better power than univariate analysis of single traits. We analyzed 6,600 individuals from two population-based cohorts with both genome-wide SNP data and serum metabolomic profiles. From the observed correlation structure of 130 metabolites measured by nuclear magnetic resonance, we identified 11 metabolic networks and performed a multivariate genome-wide association analysis. We identified 34 genomic loci at genome-wide significance, of which 7 are novel. In comparison to univariate tests, multivariate association analysis identified nearly twice as many significant associations in total. Multi-tissue gene expression studies identified variants in our top loci, SERPINA1 and AQP9, as eQTLs and showed that SERPINA1 and AQP9 expression in human blood was associated with metabolites from their corresponding metabolic networks. Finally, liver expression of AQP9 was associated with atherosclerotic lesion area in mice, and in human arterial tissue both SERPINA1 and AQP9 were shown to be upregulated (6.3-fold and 4.6-fold, respectively) in atherosclerotic plaques. Our study illustrates the power of multi-phenotype GWAS and highlights candidate genes for atherosclerosis.
Author Summary
In this study, we aim to identify novel genetic variants for metabolism, characterize their effects on nearby genes, and show that the nearby genes are associated with metabolism and atherosclerosis. To discover new genetic variants, we use an alternative approach to traditional genome-wide association studies: we leverage the information in phenotype covariance to increase our statistical power. We identify variants at seven novel loci and then show that our top signals drive expression of nearby genes AQP9 and SERPINA1 in multiple tissues. We demonstrate that AQP9 and SERPINA1 gene expression, in turn, is associated with metabolite levels. Finally, we show that the genes are associated with atherosclerosis using mouse atherosclerotic lesion size (AQP9) as well as tissue from healthy human arteries and atherosclerotic plaques (AQP9 and SERPINA1). This study illustrates that multivariate analysis of correlated metabolites can boost power for gene discovery substantially. Further functional work will need to be performed to elucidate the biological role of SERPINA1 and AQP9 in atherosclerosis.
PMCID: PMC3420921  PMID: 22916037
16.  Short read sequence typing (SRST): multi-locus sequence types from short reads 
BMC Genomics  2012;13:338.
Multi-locus sequence typing (MLST) has become the gold standard for population analyses of bacterial pathogens. This method focuses on the sequences of a small number of loci (usually seven) to divide the population and is simple, robust and facilitates comparison of results between laboratories and over time. Over the last decade, researchers and population health specialists have invested substantial effort in building up public MLST databases for nearly 100 different bacterial species, and these databases contain a wealth of important information linked to MLST sequence types such as time and place of isolation, host or niche, serotype and even clinical or drug resistance profiles. Recent advances in sequencing technology mean it is increasingly feasible to perform bacterial population analysis at the whole genome level. This offers massive gains in resolving power and genetic profiling compared to MLST, and will eventually replace MLST for bacterial typing and population analysis. However given the wealth of data currently available in MLST databases, it is crucial to maintain backwards compatibility with MLST schemes so that new genome analyses can be understood in their proper historical context.
We present a software tool, SRST, for quick and accurate retrieval of sequence types from short read sets, using inputs easily downloaded from public databases. SRST uses read mapping and an allele assignment score incorporating sequence coverage and variability, to determine the most likely allele at each MLST locus. Analysis of over 3,500 loci in more than 500 publicly accessible Illumina read sets showed SRST to be highly accurate at allele assignment. SRST output is compatible with common analysis tools such as eBURST, Clonal Frame or PhyloViz, allowing easy comparison between novel genome data and MLST data. Alignment, fastq and pileup files can also be generated for novel alleles.
SRST is a novel software tool for accurate assignment of sequence types using short read data. Several uses for the tool are demonstrated, including quality control for high-throughput sequencing projects, plasmid MLST and analysis of genomic data during outbreak investigation.
PMCID: PMC3460743  PMID: 22827703
MLST; Short read; Illumina; Sequence analysis; Plasmid; Chromosome; Microbiology; Bacteria; Population analysis; Outbreak
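In MLST, a sequence type (ST) is defined by the combination of allele numbers at the scheme's loci, so once an allele has been called at each locus the ST is a table lookup. The sketch below shows only that final lookup step. The locus names, allele numbers and STs are invented for illustration, and the read mapping and allele scoring that SRST performs are not reproduced here.

```python
# Hypothetical seven-locus scheme; real schemes use named housekeeping genes.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

# Hypothetical profile table: allele numbers in LOCI order -> sequence type.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 3, 1, 1): 2,
    (4, 2, 5, 1, 3, 2, 6): 3,
}


def assign_sequence_type(called_alleles, profiles=PROFILES, loci=LOCI):
    """Return the ST for a dict of locus -> allele number.

    Returns None when the allele combination is not in the profile table
    (a novel combination), and raises KeyError if a locus is missing.
    """
    profile = tuple(called_alleles[locus] for locus in loci)
    return profiles.get(profile)


example_calls = {
    "locus1": 1, "locus2": 2, "locus3": 1, "locus4": 1,
    "locus5": 3, "locus6": 1, "locus7": 1,
}
```

Keeping the lookup keyed on an ordered tuple makes a single differing allele yield a different ST, or no ST at all if that combination has not been recorded in the table.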
17.  SparSNP: Fast and memory-efficient analysis of all SNPs for phenotype prediction 
BMC Bioinformatics  2012;13:88.
A central goal of genomics is to predict phenotypic variation from genetic variation. Fitting predictive models to genome-wide and whole genome single nucleotide polymorphism (SNP) profiles allows us to estimate the predictive power of the SNPs and potentially develop diagnostic models for disease. However, many current datasets cannot be analysed with standard tools due to their large size.
We introduce SparSNP, a tool for fitting lasso linear models for massive SNP datasets quickly and with very low memory requirements. In analysis on a large celiac disease case/control dataset, we show that SparSNP runs substantially faster than four other state-of-the-art tools for fitting large scale penalised models. SparSNP was one of only two tools that could successfully fit models to the entire celiac disease dataset, and it did so with superior performance. Compared with the other tools, the models generated by SparSNP had better or equal predictive performance in cross-validation.
Genomic datasets are rapidly increasing in size, rendering existing approaches to model fitting impractical due to their prohibitive time or memory requirements. This study shows that SparSNP is an essential addition to the genomic analysis toolkit.
SparSNP is available at
PMCID: PMC3483007  PMID: 22574887
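The SparSNP entry above refers to fitting lasso linear models to SNP data. As an illustration of what a lasso fit involves, the following is a minimal sketch of cyclic coordinate descent with soft-thresholding for a squared-error loss. It is a generic textbook formulation, not SparSNP's implementation; the function names, the centred genotype coding and the toy data are all assumptions made for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows (samples) of SNP predictors, y a list of phenotypes.
    No intercept is fitted, so inputs are assumed to be centred (an assumption).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # residual = y - X @ beta, with beta starting at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # monomorphic SNP: nothing to fit
            # Correlation of SNP j with the partial residual (SNP j's own effect added back).
            rho = sum(col[i] * (residual[i] + col[i] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_beta
    return beta


# Toy data (hypothetical): two orthogonal centred SNPs, only the first affects y.
X_toy = [[1, 0], [-1, 0], [0, 1], [0, -1]]
y_toy = [2.0, -2.0, 0.0, 0.0]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
```

The penalty both shrinks the first coefficient below its least-squares value of 2 and sets the second exactly to zero, which is the sparsity property that makes the lasso attractive when most SNPs are expected to have no effect.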
19.  Genome-wide association study of migraine implicates a common susceptibility variant on 8q22.1 
Nature genetics  2010;42(10):869-873.
Migraine is a common episodic neurological disorder, typically presenting with recurrent attacks of severe headache and autonomic dysfunction. Apart from rare monogenic subtypes, no genetic or molecular markers for migraine have been convincingly established. We identified the minor allele of rs1835740 on chromosome 8q22.1 to be associated with migraine (p=5.12 × 10−9, OR 1.23 [1.150-1.324]) in a genome-wide association study of 2,748 migraineurs from three European headache clinics and 10,747 population-matched controls. The association was replicated in 3,202 cases and 40,062 controls for an overall meta-analysis p-value of 1.60 × 10−11 (OR 1.18 [1.127 – 1.244]). rs1835740 is located between the astrocyte elevated gene 1 (MTDH/AEG-1) and plasma glutamate carboxypeptidase (PGCP). In an expression quantitative trait study in lymphoblastoid cell lines, transcript levels of MTDH/AEG-1 were found to have a significant correlation to rs1835740. Our data establish rs1835740 as the first genetic risk factor for migraine.
PMCID: PMC2948563  PMID: 20802479
20.  Metabonomic, transcriptomic, and genomic variation of a population cohort 
The lipid–leukocyte (LL) module is associated with, and reactive to, a wide variety of serum metabolites. The LL module appears to be a link between metabolism, adiposity, and inflammation. Serum metabolite concentrations themselves determine the connectedness of the LL module.
Comprehensive characterization of human tissues promises novel insights into the biological architecture of human diseases and traits. We assessed metabonomic, transcriptomic, and genomic variation for a large population-based cohort from the capital region of Finland. Network analyses identified a set of highly correlated genes, the lipid–leukocyte (LL) module, as having a prominent role in over 80 serum metabolites (of 134 measures quantified), including lipoprotein subclasses, lipids, and amino acids. Concurrent association with immune response markers suggested the LL module as a possible link between inflammation, metabolism, and adiposity. Further, genomic variation was used to generate a directed network and infer LL module's largely reactive nature to metabolites. Finally, gene co-expression in circulating leukocytes was shown to be dependent on serum metabolite concentrations, providing evidence for the hypothesis that the coherence of molecular networks themselves is conditional on environmental factors. These findings show the importance and opportunity of systematic molecular investigation of human population samples. To facilitate and encourage this investigation, the metabonomic, transcriptomic, and genomic data used in this study have been made available as a resource for the research community.
PMCID: PMC3018170  PMID: 21179014
bioinformatics; biological networks; integrative genomics; metabonomics; transcriptomics
21.  Genome-wide and fine-resolution association analysis of malaria in West Africa 
Jallow, Muminatou | Teo, Yik Ying | Small, Kerrin S | Rockett, Kirk A | Deloukas, Panos | Clark, Taane G | Kivinen, Katja | Bojang, Kalifa A | Conway, David J | Pinder, Margaret | Sirugo, Giorgio | Sisay-Joof, Fatou | Usen, Stanley | Auburn, Sarah | Bumpstead, Suzannah J | Campino, Susana | Coffey, Alison | Dunham, Andrew | Fry, Andrew E | Green, Angela | Gwilliam, Rhian | Hunt, Sarah E | Inouye, Michael | Jeffreys, Anna E | Mendy, Alieu | Palotie, Aarno | Potter, Simon | Ragoussis, Jiannis | Rogers, Jane | Rowlands, Kate | Somaskantharajah, Elilan | Whittaker, Pamela | Widden, Claire | Donnelly, Peter | Howie, Bryan | Marchini, Jonathan | Morris, Andrew | SanJoaquin, Miguel | Achidi, Eric Akum | Agbenyega, Tsiri | Allen, Angela | Amodu, Olukemi | Corran, Patrick | Djimde, Abdoulaye | Dolo, Amagana | Doumbo, Ogobara K | Drakeley, Chris | Dunstan, Sarah | Evans, Jennifer | Farrar, Jeremy | Fernando, Deepika | Hien, Tran Tinh | Horstmann, Rolf D | Ibrahim, Muntaser | Karunaweera, Nadira | Kokwaro, Gilbert | Koram, Kwadwo A | Lemnge, Martha | Makani, Julie | Marsh, Kevin | Michon, Pascal | Modiano, David | Molyneux, Malcolm E | Mueller, Ivo | Parker, Michael | Peshu, Norbert | Plowe, Christopher V | Puijalon, Odile | Reeder, John | Reyburn, Hugh | Riley, Eleanor M | Sakuntabhai, Anavaj | Singhasivanon, Pratap | Sirima, Sodiomon | Tall, Adama | Taylor, Terrie E | Thera, Mahamadou | Troye-Blomberg, Marita | Williams, Thomas N | Wilson, Michael | Kwiatkowski, Dominic P
Nature genetics  2009;41(6):657-665.
We report a genome-wide association (GWA) study of severe malaria in The Gambia. The initial GWA scan included 2,500 children genotyped on the Affymetrix 500K GeneChip, and a replication study included 3,400 children. We used this to examine the performance of GWA methods in Africa. We found considerable population stratification, and also that signals of association at known malaria resistance loci were greatly attenuated owing to weak linkage disequilibrium (LD). To investigate possible solutions to the problem of low LD, we focused on the HbS locus, sequencing this region of the genome in 62 Gambian individuals and then using these data to conduct multipoint imputation in the GWA samples. This increased the signal of association, from P = 4 × 10−7 to P = 4 × 10−14, with the peak of the signal located precisely at the HbS causal variant. Our findings provide proof of principle that fine-resolution multipoint imputation, based on population-specific sequencing data, can substantially boost authentic GWA signals and enable fine mapping of causal variants in African populations.
PMCID: PMC2889040  PMID: 19465909
22.  An Immune Response Network Associated with Blood Lipid Levels 
PLoS Genetics  2010;6(9):e1001113.
While recent scans for genetic variation associated with human disease have been immensely successful in uncovering large numbers of loci, far fewer studies have focused on the underlying pathways of disease pathogenesis. Many loci which are associated with disease and complex phenotypes map to non-coding, regulatory regions of the genome, indicating that modulation of gene transcription plays a key role. Thus, this study generated genome-wide profiles of both genetic and transcriptional variation from the total blood extracts of over 500 randomly-selected, unrelated individuals. Using measurements of blood lipids, key players in the progression of atherosclerosis, three levels of biological information are integrated in order to investigate the interactions between circulating leukocytes and proximal lipid compounds. Pair-wise correlations between gene expression and lipid concentration indicate a prominent role for basophil granulocytes and mast cells, cell types central to powerful allergic and inflammatory responses. Network analysis of gene co-expression showed that the top associations function as part of a single, previously unknown gene module, the Lipid Leukocyte (LL) module. This module replicated in T cells from an independent cohort while also displaying potential tissue specificity. Further, genetic variation driving LL module expression included the single nucleotide polymorphism (SNP) most strongly associated with serum immunoglobulin E (IgE) levels, a key antibody in allergy. Structural Equation Modeling (SEM) indicated that LL module is at least partially reactive to blood lipid levels. Taken together, this study uncovers a gene network linking blood lipids and circulating cell types and offers insight into the hypothesis that the inflammatory response plays a prominent role in metabolism and the potential control of atherogenesis.
Author Summary
Circulating lipid concentrations are important predictors of coronary artery disease. The main pathology of coronary artery disease is atherosclerosis, a cycle of lipid adherence to the walls of arteries and an inflammatory response resulting in more adhesion. To investigate the link between lipids and immune cells in circulation, we have generated both genomic and whole blood gene expression profiles for a population-based collection of individuals from the capital region of Finland. Key mediators of inflammation and allergy were shown to be correlated with lipid levels. Further, the expressions of these genes operated in such a highly coordinated fashion that they appeared to function as part of a single pathway, which itself was both highly correlated with and reactive to lipid levels. Our findings offer insight into how lipids activate circulating immune cells, potentially contributing to the pathogenesis of coronary artery disease.
PMCID: PMC2936545  PMID: 20844574
23.  Visualizing Chromosome Mosaicism and Detecting Ethnic Outliers by the Method of “Rare” Heterozygotes and Homozygotes (RHH) 
Human Molecular Genetics  2010;19(13):2539-2553.
We describe a novel approach for evaluating SNP genotypes of a genome-wide association scan to identify “ethnic outlier” subjects whose ethnicity is different or admixed compared to most other subjects in the genotyped sample set. Each ethnic outlier is detected by counting a genomic excess of “rare” heterozygotes and/or homozygotes whose frequencies are low (<1%) within genotypes of the sample set being evaluated. This method also enables simple and striking visualization of non-Caucasian chromosomal DNA segments interspersed within the chromosomes of ethnically admixed individuals. We show that this visualization of the mosaic structure of admixed human chromosomes gives results similar to another visualization method (SABER) but with much less computational time and burden. We also show that other methods for detecting ethnic outliers are enhanced by evaluating only genomic regions of visualized admixture rather than diluting outlier ancestry by evaluating the entire genome considered in aggregate. We have validated our method in the Wellcome Trust Case Control Consortium (WTCCC) study of 17,000 subjects as well as in HapMap subjects and simulated outliers of known ethnicity and admixture. The method's ability to precisely delineate chromosomal segments of non-Caucasian ethnicity has enabled us to demonstrate previously unreported non-Caucasian admixture in two HapMap Caucasian parents and in a number of WTCCC subjects. Its sensitive detection of ethnic outliers and simple visual discrimination of discrete chromosomal segments of different ethnicity imply that this method of rare heterozygotes and homozygotes (RHH) is likely to have diverse and important applications in humans and other species.
PMCID: PMC2883336  PMID: 20211853
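The counting step described in the RHH entry above can be sketched roughly as follows: for each SNP, find the genotypes whose frequency in the sample set is below 1%, then count how many such rare genotypes each subject carries. This is only an illustrative reading of the abstract, not the authors' software; the function name, the 0/1/2 genotype coding, the use of `None` for missing calls and the toy data are assumptions, and the abstract does not say how large a count qualifies as an "excess".

```python
from collections import Counter


def rare_genotype_counts(genotypes, max_freq=0.01):
    """Count, per subject, the genotypes that are rare within the sample set.

    genotypes maps a subject ID to a list of genotype codes, one per SNP
    (hypothetical coding: 0, 1, 2 = copies of one allele; None = missing call).
    A genotype is "rare" at a SNP if its frequency among called genotypes
    at that SNP is below max_freq.
    """
    subjects = list(genotypes)
    n_snps = len(genotypes[subjects[0]])
    counts = {subject: 0 for subject in subjects}
    for j in range(n_snps):
        calls = [genotypes[subject][j] for subject in subjects]
        observed = [g for g in calls if g is not None]
        if not observed:
            continue  # no calls at this SNP
        freq = Counter(observed)
        rare = {g for g, c in freq.items() if c / len(observed) < max_freq}
        for subject, g in zip(subjects, calls):
            if g in rare:
                counts[subject] += 1
    return counts


# Toy sample set (hypothetical): 199 identical subjects and one who differs at two SNPs.
sample = {f"s{i}": [0, 0, 0] for i in range(199)}
sample["outlier"] = [1, 2, 0]
rhh_counts = rare_genotype_counts(sample)
```

In a real scan the per-subject counts would be compared across the whole sample set, and subjects with unusually high counts would be examined as candidate outliers; restricting the count to windows along a chromosome is one way the segment-level visualization mentioned in the abstract might be approached.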
24.  Variants in the melatonin receptor 1B gene (MTNR1B) influence fasting glucose levels 
Prokopenko, Inga | Langenberg, Claudia | Florez, Jose C. | Saxena, Richa | Soranzo, Nicole | Thorleifsson, Gudmar | Loos, Ruth J.F. | Manning, Alisa K. | Jackson, Anne U. | Aulchenko, Yurii | Potter, Simon C. | Erdos, Michael R. | Sanna, Serena | Hottenga, Jouke-Jan | Wheeler, Eleanor | Kaakinen, Marika | Lyssenko, Valeriya | Chen, Wei-Min | Ahmadi, Kourosh | Beckmann, Jacques S. | Bergman, Richard N. | Bochud, Murielle | Bonnycastle, Lori L. | Buchanan, Thomas A. | Cao, Antonio | Cervino, Alessandra | Coin, Lachlan | Collins, Francis S. | Crisponi, Laura | de Geus, Eco JC | Dehghan, Abbas | Deloukas, Panos | Doney, Alex S F | Elliott, Paul | Freimer, Nelson | Gateva, Vesela | Herder, Christian | Hofman, Albert | Hughes, Thomas E. | Hunt, Sarah | Illig, Thomas | Inouye, Michael | Isomaa, Bo | Johnson, Toby | Kong, Augustine | Krestyaninova, Maria | Kuusisto, Johanna | Laakso, Markku | Lim, Noha | Lindblad, Ulf | Lindgren, Cecilia M. | McCann, Owen T. | Mohlke, Karen L. | Morris, Andrew D | Naitza, Silvia | Orrù, Marco | Palmer, Colin N A | Pouta, Anneli | Randall, Joshua | Rathmann, Wolfgang | Saramies, Jouko | Scheet, Paul | Scott, Laura J. | Scuteri, Angelo | Sharp, Stephen | Sijbrands, Eric | Smit, Jan H. | Song, Kijoung | Steinthorsdottir, Valgerdur | Stringham, Heather M. | Tuomi, Tiinamaija | Tuomilehto, Jaakko | Uitterlinden, André G. | Voight, Benjamin F. | Waterworth, Dawn | Wichmann, H.-Erich | Willemsen, Gonneke | Witteman, Jacqueline CM | Yuan, Xin | Zhao, Jing Hua | Zeggini, Eleftheria | Schlessinger, David | Sandhu, Manjinder | Boomsma, Dorret I | Uda, Manuela | Spector, Tim D. | Penninx, Brenda WJH | Altshuler, David | Vollenweider, Peter | Jarvelin, Marjo Riitta | Lakatta, Edward | Waeber, Gerard | Fox, Caroline S. | Peltonen, Leena | Groop, Leif C. | Mooser, Vincent | Cupples, L. Adrienne | Thorsteinsdottir, Unnur | Boehnke, Michael | Barroso, Inês | Van Duijn, Cornelia | Dupuis, Josée | Watanabe, Richard M. | Stefansson, Kari | McCarthy, Mark I. | Wareham, Nicholas J. | Meigs, James B. | Abecasis, Goncalo R.
Nature genetics  2008;41(1):77-81.
To identify novel genetic loci associated with fasting glucose concentrations, we examined the leading association signals in 10 genome-wide association scans involving a total of 36,610 individuals of European descent. Variants in the gene encoding the melatonin receptor 1B (MTNR1B) were consistently associated with fasting glucose across all ten studies. The strongest signal was observed at rs10830963, where each G-allele (frequency 0.30 in HapMap CEU) was associated with an increase of 0.07 (95%CI 0.06–0.08) mmol/L in fasting glucose levels (P=3.2×10−50) and reduced beta-cell function as measured by homeostasis model assessment (HOMA-B, P=1.1×10−15). The same allele was associated with an increased risk of type 2 diabetes (odds ratio = 1.09 (1.05–1.12), per G allele P=3.3×10−7) in a meta-analysis of thirteen case-control studies totalling 18,236 cases and 64,453 controls. Our analyses also confirm previous associations of fasting glucose with variants at the G6PC2 (rs560887, P=1.1×10−57) and GCK (rs4607517, P=1.0×10−25) loci.
PMCID: PMC2682768  PMID: 19060907
25.  The diploid genome sequence of an Asian individual 
Nature  2008;456(7218):60-65.
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.
PMCID: PMC2716080  PMID: 18987735

