Large consortia have revealed hundreds of genetic loci associated with anthropometric traits, one trait at a time. We examined whether genetic variants affect body shape as a composite phenotype that is represented by a combination of anthropometric traits. We developed an approach that calculates averaged PCs (AvPCs) representing body shape derived from six anthropometric traits (body mass index, height, weight, waist and hip circumference, waist-to-hip ratio). The first four AvPCs explain >99% of the variability, are heritable, and associate with cardiometabolic outcomes. We performed genome-wide association analyses for each body shape composite phenotype across 65 studies and meta-analysed summary statistics. We identify six novel loci: LEMD2 and CD47 for AvPC1, RPS6KA5/C14orf159 and GANAB for AvPC3, and ARL15 and ANP32 for AvPC4. Our findings highlight the value of using multiple traits to define complex phenotypes for discovery, which are not captured by single-trait analyses, and may shed light onto new pathways.
Past genome-wide association studies have identified hundreds of genetic loci that influence body size and shape when examined one trait at a time. Here, Jeff and colleagues develop an aggregate score of various body traits, and use meta-analysis to find new loci linked to body shape.
Serum hepcidin concentration is regulated by iron status, inflammation, erythropoiesis and numerous other factors, but underlying processes are incompletely understood. We studied the association of common and rare single nucleotide variants (SNVs) with serum hepcidin in one Italian study and two large Dutch population-based studies. We genotyped common SNVs with genome-wide association study (GWAS) arrays and subsequently performed imputation using the 1000 Genomes reference panel. Cohort-specific GWAS were performed for log-transformed serum hepcidin, adjusted for age and gender, and results were combined in a fixed-effects meta-analysis (total N 6,096). Six top SNVs (p<5×10⁻⁶) were genotyped in 3,821 additional samples, but associations were not replicated. Furthermore, we meta-analyzed cohort-specific exome array association results of rare SNVs with serum hepcidin that were available for two of the three cohorts (total N 3,226), but no exome-wide significant signal (p<1.4×10⁻⁶) was identified. Gene-based meta-analyses revealed 19 genes that showed significant association with hepcidin. Our results suggest the absence of common SNVs and rare exonic SNVs explaining a large proportion of phenotypic variation in serum hepcidin. We recommend extension of our study once additional substantial cohorts with hepcidin measurements, GWAS and/or exome array data become available in order to increase power to identify variants that explain a smaller proportion of hepcidin variation. In addition, we encourage follow-up of the potentially interesting genes that resulted from the gene-based analysis of low-frequency and rare variants.
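For readers unfamiliar with the fixed-effects meta-analysis mentioned above, the following is a minimal sketch of the standard inverse-variance-weighted estimator, assuming each cohort contributes an effect estimate and its standard error for a given variant. The function name and the example numbers are hypothetical and are not taken from the study.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted fixed-effects meta-analysis.

    betas: per-cohort effect estimates for one variant (hypothetical input).
    ses:   matching per-cohort standard errors.
    Returns the pooled effect, its standard error, the z-score and a
    two-sided p-value under a normal approximation.
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / se ** 2 for se in ses]            # w_i = 1 / SE_i^2
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))             # two-sided normal p-value
    return beta, se, z, p


# Illustrative values only: two cohorts with equal precision.
pooled_beta, pooled_se, z_score, p_value = fixed_effects_meta([0.10, 0.20], [0.05, 0.05])
```

With equal standard errors the pooled effect is simply the mean of the cohort effects, and the pooled standard error shrinks by a factor of the square root of the number of cohorts.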
Effective immunity requires a complex network of cellular and humoral components that interact with each other and are influenced by different environmental and host factors. We used a systems biology approach to comprehensively assess the impact of environmental and genetic factors on immune cell populations in peripheral blood, including associations with immunoglobulin concentrations, from ∼500 healthy volunteers from the Human Functional Genomics Project. Genetic heritability estimation showed that variations in T cell numbers are more strongly driven by genetic factors, while B cell counts are more environmentally influenced. Quantitative trait loci (QTL) mapping identified eight independent genomic loci associated with leukocyte count variation, including four associations with T and B cell subtypes. The QTLs identified were enriched among genome-wide association study (GWAS) SNPs reported to increase susceptibility to immune-mediated diseases. Our systems approach provides insights into cellular and humoral immune trait variability in humans.
• Understanding inter-individual variation of immune cells and immunoglobulin levels
• Season and gender influence B cell subpopulation abundance
• Identification of genetic loci that might regulate B cell levels in blood
• Cell count QTLs overlap with risk SNPs for (auto)immune/inflammatory disease
As part of the Human Functional Genomics Project, this study by Aguirre-Gamboa et al. maps the contribution of genetics and non-heritable factors onto immune-cell counts and immunoglobulin levels. They find that season and gender influence the abundance of most B cell subpopulations.
Structural variation (SV) represents a major source of differences between individual human genomes and has been linked to disease phenotypes. However, the majority of studies neither provide a global view of the full spectrum of these variants nor integrate them into reference panels of genetic variation. Here, we analyse whole genome sequencing data of 769 individuals from 250 Dutch families, and provide a haplotype-resolved map of 1.9 million genome variants across 9 different variant classes, including novel forms of complex indels, and retrotransposition-mediated insertions of mobile elements and processed RNAs. A large proportion are previously underreported variants sized between 21 and 100 bp. We detect 4 megabases of novel sequence, encoding 11 new transcripts. Finally, we show 191 known, trait-associated SNPs to be in strong linkage disequilibrium with SVs and demonstrate that our panel facilitates accurate imputation of SVs in unrelated individuals.
Structural variants (SVs) are prevalent in genomes of the general population. Here, Guryev and The Genome of the Netherlands Consortium describe the reference panel of haplotype-resolved SVs from 769 individuals from 250 Dutch families and show its utility for studying heritable traits.
Epigenetic change is a hallmark of ageing but its link to ageing mechanisms in humans remains poorly understood. While DNA methylation at many CpG sites closely tracks chronological age, DNA methylation changes relevant to biological age are expected to gradually dissociate from chronological age, mirroring the increased heterogeneity in health status at older ages.
Here, we report on the large-scale identification of 6366 age-related variably methylated positions (aVMPs) identified in 3295 whole blood DNA methylation profiles, 2044 of which have a matching RNA-seq gene expression profile. aVMPs are enriched at polycomb repressed regions and, accordingly, methylation at those positions is associated with the expression of genes encoding components of polycomb repressive complex 2 (PRC2) in trans. Further analysis revealed trans-associations for 1816 aVMPs with an additional 854 genes. These trans-associated aVMPs are characterized by either an age-related gain of methylation at CpG islands marked by PRC2 or a loss of methylation at enhancers. This distinct pattern extends to other tissues and multiple cancer types. Finally, genes associated with aVMPs in trans whose expression is variably upregulated with age (733 genes) play a key role in DNA repair and apoptosis, whereas downregulated aVMP-associated genes (121 genes) are mapped to defined pathways in cellular metabolism.
Our results link age-related changes in DNA methylation to fundamental mechanisms that are thought to drive human ageing.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-016-1053-6) contains supplementary material, which is available to authorized users.
DNA methylation; Ageing; 450k; DNA damage; Variability
The range of commercially available array platforms and analysis software packages is expanding and their utility is improving, making reliable detection of copy-number variants (CNVs) relatively straightforward. Reliable interpretation of CNV data, however, is often difficult and requires expertise. With our knowledge of the human genome growing rapidly, applications for array testing continuously broadening, and the resolution of CNV detection increasing, this leads to great complexity in interpreting what can be daunting data. Correct CNV interpretation and optimal use of the genotype information provided by single-nucleotide polymorphism probes on an array depends largely on knowledge present in various resources. In addition to the availability of host laboratories’ own datasets and national registries, there are several public databases and Internet resources with genotype and phenotype information that can be used for array data interpretation. With so many resources now available, it is important to know which are fit-for-purpose in a diagnostic setting. We summarize the characteristics of the most commonly used Internet databases and resources, and propose a general data interpretation strategy that can be used for comparative hybridization, comparative intensity, and genotype-based array data.
array; classification; CNV; database; data interpretation; diagnostic; genome wide
Next‐generation sequencing in clinical diagnostics is providing valuable genomic variant data, which can be used to support healthcare decisions. In silico tools to predict pathogenicity are crucial to assess such variants and we have evaluated a new tool, Combined Annotation Dependent Depletion (CADD), and its classification of gene variants in Lynch syndrome by using a set of 2,210 DNA mismatch repair gene variants. These had already been classified by experts from InSiGHT's Variant Interpretation Committee. Overall, we found CADD scores do predict pathogenicity (Spearman's ρ = 0.595, P < 0.001). However, we discovered 31 major discrepancies between the InSiGHT classification and the CADD scores; these were explained in favor of the expert classification using population allele frequencies, cosegregation analyses, disease association studies, or a second‐tier test. Of 751 variants that could not be clinically classified by InSiGHT, CADD indicated that 47 variants were worth further study to confirm their putative pathogenicity. We demonstrate CADD is valuable in prioritizing variants in clinically relevant genes for further assessment by expert classification teams.
Lynch syndrome; variant classification; pathogenicity prediction; cumulative link model
So far, more than 170 loci have been associated with circulating lipid levels through genome-wide association studies (GWAS). These associations are largely driven by common variants, their function is often not known, and many are likely to be markers for the causal variants. In this study we aimed to identify more new rare and low-frequency functional variants associated with circulating lipid levels.
We used the 1000 Genomes Project as a reference panel for the imputations of GWAS data from ∼60 000 individuals in the discovery stage and ∼90 000 samples in the replication stage.
Our study resulted in the identification of five new associations with circulating lipid levels at four loci. All four loci are within genes that can be linked biologically to lipid metabolism. One of the variants, rs116843064, is a damaging missense variant within the ANGPTL4 gene.
This study illustrates that GWAS with high-scale imputation may still help us unravel the biological mechanism behind circulating lipid levels.
Complex traits; Epidemiology; Genetics; Genome-wide; circulating lipid levels
Biological markers that measure gut health and diagnose functional gastro-intestinal (GI) disorders, such as irritable bowel syndrome (IBS), are lacking. The objective was to identify and validate a biomarker panel associated with the pathophysiology of IBS that discriminates IBS from healthy controls (HC), and correlates with GI symptom severity. In a case-control design, various plasma and fecal markers were measured in a cohort of 196 clinical IBS patients and 160 HC without GI symptoms. A combination of biomarkers that best discriminates between IBS and HC was identified and validated in an independent internal validation set and by permutation testing. The correlation between the biomarker panel and GI symptom severity was tested in IBS patients and in a general population cohort of 958 subjects. A panel of 8 biomarkers was identified to discriminate IBS from HC with high sensitivity (88.1%) and specificity (86.5%). The results for the IBS subtypes were comparable. Moreover, a moderate correlation was found between the biomarker panel and GI symptom scores in the IBS (r = 0.59, p < 0.001) and the general population cohorts (r = 0.51, p = 0.003). A novel multi-domain biomarker panel has been identified and validated, which correlated moderately with GI symptom severity in IBS and general population subjects.
Motivation: While the size and number of biobanks, patient registries and other data collections are increasing, biomedical researchers still often need to pool data for statistical power, a task that requires time-intensive retrospective integration.
Results: To address this challenge, we developed MOLGENIS/connect, a semi-automatic system to find, match and pool data from different sources. The system shortlists relevant source attributes from thousands of candidates using ontology-based query expansion to overcome variations in terminology. Then it generates algorithms that transform source attributes to a common target DataSchema. These include unit conversion, categorical value matching and complex conversion patterns (e.g. calculation of BMI). In comparison to human experts, MOLGENIS/connect was able to auto-generate 27% of the algorithms perfectly, with an additional 46% needing only minor editing, representing a reduction in the human effort and expertise needed to pool data.
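The three kinds of transformation named above (unit conversion, categorical value matching and a derived value such as BMI) can be illustrated with a short sketch. This is not the code that MOLGENIS/connect generates; the source attribute names (`height_cm`, `weight_kg`, `sex`) and the category mapping are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from locally coded source values to target categories.
SEX_CATEGORIES = {"m": "male", "1": "male", "male": "male",
                  "f": "female", "2": "female", "female": "female"}


def cm_to_m(height_cm):
    """Unit conversion: centimetres to metres."""
    return height_cm / 100.0


def body_mass_index(weight_kg, height_cm):
    """Derived value: BMI = weight (kg) / height (m) squared."""
    height_m = cm_to_m(height_cm)
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def harmonize(record):
    """Transform one source record (dict) into the assumed target schema."""
    return {
        "height_m": cm_to_m(record["height_cm"]),
        "bmi": body_mass_index(record["weight_kg"], record["height_cm"]),
        # Categorical value matching; unknown codes become None for manual review.
        "sex": SEX_CATEGORIES.get(str(record["sex"]).strip().lower()),
    }


target = harmonize({"height_cm": 175, "weight_kg": 70, "sex": "F"})
```

Returning `None` for unmatched categories keeps the human expert in the loop, in the spirit of the semi-automatic workflow described in the abstract.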
Availability and Implementation: Source code, binaries and documentation are available as open-source under LGPLv3 from http://github.com/molgenis/molgenis and www.molgenis.org/connect.
Supplementary data are available at Bioinformatics online.
There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.
Research data; Publication characteristics
Clinical and genetic heterogeneity in monogenetic disorders represents a major diagnostic challenge. Although the presence of particular clinical features may aid in identifying a specific cause in some cases, the majority of patients remain undiagnosed.
Here, we investigated the utility of whole-exome sequencing as a diagnostic approach for establishing a molecular diagnosis in a highly heterogeneous group of patients with varied intellectual disability and microcephaly.
Whole-exome sequencing was performed in 38 patients, including three sib-pairs, in addition to or in parallel with genetic analyses that were performed during the diagnostic work-up of the study participants.
In ten out of these 35 families (29 %), we found mutations in genes already known to be related to a disorder in which microcephaly is a main feature. Two unrelated patients had mutations in the ASPM gene. In seven other patients we found mutations in RAB3GAP1, RNASEH2B, KIF11, ERCC8, CASK, DYRK1A and BRCA2. In one of the sib-pairs, mutations were found in the RTTN gene. Mutations were present in seven out of our ten families with an established etiological diagnosis with recessive inheritance.
We demonstrate that whole-exome sequencing is a powerful tool for the diagnostic evaluation of patients with highly heterogeneous neurodevelopmental disorders such as intellectual disability with microcephaly. Our results confirm that autosomal recessive disorders are highly prevalent among patients with microcephaly.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-016-0167-8) contains supplementary material, which is available to authorized users.
Autosomal recessive inheritance; ASPM; BRCA2; CASK; DYRK1A; ERCC8; KIF11; Microcephaly; RAB3GAP1; RNASEH2B; RTTN; Whole-exome sequencing
Objective Pooling data across biobanks is necessary to increase statistical power, reveal more subtle associations, and synergize the value of data sources. However, searching for desired data elements among the thousands of available elements and harmonizing differences in terminology, data collection, and structure, is arduous and time consuming.
Materials and methods To speed up biobank data pooling we developed BiobankConnect, a system to semi-automatically match desired data elements to available elements by: (1) annotating the desired elements with ontology terms using BioPortal; (2) automatically expanding the query for these elements with synonyms and subclass information using OntoCAT; (3) automatically searching available elements for these expanded terms using Lucene lexical matching; and (4) shortlisting relevant matches sorted by matching score.
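The four steps above can be sketched in a few lines. This is a simplified illustration, not BiobankConnect itself: a small hypothetical synonym dictionary stands in for the BioPortal/OntoCAT ontology lookup, and a token-overlap (Jaccard) score stands in for Lucene lexical matching. All names and example labels are assumptions.

```python
import re

# Hypothetical stand-in for ontology annotation and synonym expansion.
SYNONYMS = {"hypertension": ["high blood pressure", "raised blood pressure"]}


def tokens(text):
    """Lowercased alphanumeric tokens of a label."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def expand_query(term):
    """Steps 1-2: expand the desired element with its synonyms."""
    return [term] + SYNONYMS.get(term.lower(), [])


def overlap_score(query, label):
    """Step 3 (simplified): Jaccard token overlap instead of Lucene scoring."""
    q, l = tokens(query), tokens(label)
    if not q or not l:
        return 0.0
    return len(q & l) / len(q | l)


def shortlist(term, available, top_n=10):
    """Step 4: rank available elements by their best score over all expansions."""
    queries = expand_query(term)
    scored = []
    for label in available:
        best = max(overlap_score(q, label) for q in queries)
        if best > 0:
            scored.append((best, label))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(label, round(score, 3)) for score, label in scored[:top_n]]


matches = shortlist("Hypertension", [
    "History of high blood pressure",
    "Body weight (kg)",
    "Hypertension medication use",
])
```

In this toy example the element that never mentions "hypertension" ranks first, because the synonym expansion supplies the matching tokens; that is the point of expanding the query before lexical matching.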
Results We evaluated BiobankConnect using human curated matches from EU-BioSHaRE, searching for 32 desired data elements in 7461 available elements from six biobanks. We found 0.75 precision at rank 1 and 0.74 recall at rank 10 compared to a manually curated set of relevant matches. In addition, best matches chosen by BioSHaRE experts ranked first in 63.0% and in the top 10 in 98.4% of cases, indicating that our system has the potential to significantly reduce manual matching work.
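As a reading aid for the metrics reported above, here is a minimal sketch of precision and recall at a rank cut-off for a single query, using the usual information-retrieval definitions. How the study aggregated these values over its 32 desired elements is not stated here, so any averaging step is left out; the element identifiers are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant) / len(top)


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(set(relevant))


# Hypothetical ranked shortlist and curated relevant set for one query.
ranked_elements = ["a", "b", "c", "d"]
curated_matches = {"a", "c", "e"}
p_at_1 = precision_at_k(ranked_elements, curated_matches, 1)
r_at_3 = recall_at_k(ranked_elements, curated_matches, 3)
```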
Conclusions BiobankConnect provides an easy user interface to significantly speed up the biobank harmonization process. It may also prove useful for other forms of biomedical data integration. All the software can be downloaded as a MOLGENIS open source app from http://www.github.com/molgenis, with a demo available at http://www.biobankconnect.org.
Biobank; Harmonization; Data integration; Search
Mutations create variation in the population, fuel evolution, and cause genetic diseases. Current knowledge about de novo mutations is incomplete and mostly indirect [1–10]. Here, we analyze 11,020 de novo mutations from whole-genomes of 250 families. We show that de novo mutations in offspring of older fathers are not only more numerous [11–13] but also occur more frequently in early-replicating, genic regions. Functional regions exhibit higher mutation rates due to CpG dinucleotides and reveal signatures of transcription-coupled repair, while mutation clusters with a unique signature point to a novel mutational mechanism. Mutation and recombination rates independently associate with nucleotide diversity, and regional variation in human-chimpanzee divergence is only partly explained by mutation rate heterogeneity. Finally, we provide a genome-wide mutation rate map for medical and population genetics applications. Our results reveal novel insights and refine long-standing hypotheses about human mutagenesis.
Today researchers can choose from many bioinformatics protocols for all types of life sciences research, computational environments and coding languages. Although the majority of these are open source, few of them possess all virtues to maximize reuse and promote reproducible science. Wikipedia has proven a great tool to disseminate information and enhance collaboration between users with varying expertise and background to author qualitative content via crowdsourcing. However, it remains an open question whether the wiki paradigm can be applied to bioinformatics protocols.
We piloted PyPedia, a wiki where each article is both implementation and documentation of a bioinformatics computational protocol in the Python language. Hyperlinks within the wiki can be used to compose complex workflows and induce reuse. A RESTful API enables code execution outside the wiki. Initial content of PyPedia contains articles for population statistics, bioinformatics format conversions and genotype imputation. Use of the easy-to-learn wiki syntax effectively lowers the barriers to bringing expert programmers and less computer-savvy researchers onto the same page.
PyPedia demonstrates how wiki can provide a collaborative development, sharing and even execution environment for biologists and bioinformaticians that complement existing resources, useful for local and multi-center research teams.
PyPedia is available online at: http://www.pypedia.com. The source code and installation instructions are available at: https://github.com/kantale/PyPedia_server. The PyPedia Python library is available at: https://github.com/kantale/pypedia. PyPedia is open-source, available under the BSD 2-Clause License.
Electronic supplementary material
The online version of this article (doi:10.1186/s13029-015-0042-6) contains supplementary material, which is available to authorized users.
Wiki; Web services; Open science; Crowdsourcing; Python
Genome-wide association studies (GWAS) have identified more than 100 genetic variants contributing to BMI, a measure of body size, or waist-to-hip ratio (adjusted for BMI, WHRadjBMI), a measure of body shape. Body size and shape change as people grow older and these changes differ substantially between men and women. To systematically screen for age- and/or sex-specific effects of genetic variants on BMI and WHRadjBMI, we performed meta-analyses of 114 studies (up to 320,485 individuals of European descent) with genome-wide chip and/or Metabochip data by the Genetic Investigation of Anthropometric Traits (GIANT) Consortium. Each study tested the association of up to ~2.8M SNPs with BMI and WHRadjBMI in four strata (men ≤50y, men >50y, women ≤50y, women >50y) and summary statistics were combined in stratum-specific meta-analyses. We then screened for variants that showed age-specific effects (G x AGE), sex-specific effects (G x SEX) or age-specific effects that differed between men and women (G x AGE x SEX). For BMI, we identified 15 loci (11 previously established for main effects, four novel) that showed significant (FDR<5%) age-specific effects, of which 11 had larger effects in younger (<50y) than in older adults (≥50y). No sex-dependent effects were identified for BMI. For WHRadjBMI, we identified 44 loci (27 previously established for main effects, 17 novel) with sex-specific effects, of which 28 showed larger effects in women than in men, five showed larger effects in men than in women, and 11 showed opposite effects between sexes. No age-dependent effects were identified for WHRadjBMI. This is the first genome-wide interaction meta-analysis to report convincing evidence of age-dependent genetic effects on BMI. In addition, we confirm the sex-specificity of genetic effects on WHRadjBMI. These results may provide further insights into the biology that underlies weight change with age or the sexual dimorphism of body shape.
Adult body size and body shape differ substantially between men and women and change over time. More than 100 genetic variants that influence body mass index (measure of body size) or waist-to-hip ratio (measure of body shape) have been identified. While there is evidence that some genetic loci affect body shape differently in men than in women, little is known about whether genetic effects differ in older compared to younger adults, and whether such changes differ between men and women. Therefore, we conducted a systematic genome-wide search, including 114 studies (>320,000 individuals), to specifically identify genetic loci with age- and/or sex-dependent effects on body size and shape. We identified 15 loci of which the effect on BMI was different in older compared to younger adults, whereas we found no evidence for loci with different effects in men compared to women. The opposite was seen for body shape as we identified 44 loci of which the effect on waist-to-hip ratio differed between men and women, but no differences between younger and older adults were observed. Our observations may provide new insights into the biology that underlies weight change with age or the sexual dimorphism of body shape.
There is an urgent need to standardize the semantics of biomedical data values, such as phenotypes, to enable comparative and integrative analyses. However, it is unlikely that all studies will use the same data collection protocols. As a result, retrospective standardization is often required, which involves matching of original (unstructured or locally coded) data to widely used coding or ontology systems such as SNOMED CT (clinical terms), ICD-10 (International Classification of Disease) and HPO (Human Phenotype Ontology). This data curation is usually a time-consuming process performed by a human expert. To help mechanize this process, we have developed SORTA, a computer-aided system for rapidly encoding free text or locally coded values to a formal coding system or ontology. SORTA matches original data values (uploaded in semicolon delimited format) to a target coding system (uploaded in Excel spreadsheet, OWL ontology web language or OBO open biomedical ontologies format). It then semi-automatically shortlists candidate codes for each data value using Lucene and n-gram based matching algorithms, and can also learn from matches chosen by human experts. We evaluated SORTA’s applicability in two use cases. For the LifeLines biobank, we used SORTA to recode 90 000 free text values (including 5211 unique values) about physical exercise to MET (Metabolic Equivalent of Task) codes. For the CINEAS clinical symptom coding system, we used SORTA to map to HPO, enriching HPO when necessary (315 terms matched so far). Out of the shortlists at rank 1, we found a precision/recall of 0.97/0.98 in LifeLines and of 0.58/0.45 in CINEAS. More importantly, users found the tool both a major time saver and a quality improvement because SORTA reduced the chances of human mistakes. Thus, SORTA can dramatically ease data (re)coding tasks and we believe it will prove useful for many more projects.
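To give a feel for n-gram based matching of free text to a code list, the sketch below scores candidates with a character-bigram Dice coefficient. This is one common n-gram similarity and is used here as an assumption; the exact scoring SORTA applies is not described in this abstract. The code identifiers and labels are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Multiset of character n-grams of the lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def dice_similarity(a, b, n=2):
    """Dice coefficient on character n-grams: 2 * |A ∩ B| / (|A| + |B|)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total


def shortlist_codes(value, codes, top_n=3):
    """Rank candidate codes (dict of code -> label) for one free-text value."""
    scored = [(dice_similarity(value, label), code, label) for code, label in codes.items()]
    scored.sort(key=lambda row: (-row[0], row[1]))
    return scored[:top_n]


# Hypothetical target coding system and a misspelled free-text value.
candidates = shortlist_codes("runing", {"CODE_A": "running", "CODE_B": "swimming", "CODE_C": "cycling"})
```

Character n-grams tolerate spelling errors such as "runing", which is why they suit free-text questionnaire answers; a human expert still confirms the final choice from the shortlist.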
Database URL: http://molgenis.org/sorta or as an open source download from http://www.molgenis.org/wiki/SORTA
There is a critical need for population-based prospective cohort studies because they follow individuals before the onset of disease, allowing for studies that can identify biomarkers and disease-modifying effects, and thereby contributing to systems epidemiology.
This paper describes the design and baseline characteristics of an intensively examined subpopulation of the LifeLines cohort in the Netherlands. In this unique subcohort, LifeLines DEEP, we included 1539 participants aged 18 years and older.
Findings to date
We collected additional blood (n=1387), exhaled air (n=1425) and faecal samples (n=1248), and elicited responses to gastrointestinal health questionnaires (n=1176) for analysis of the genome, epigenome, transcriptome, microbiome, metabolome and other biological levels. Here, we provide an overview of the different data layers in LifeLines DEEP and present baseline characteristics of the study population including food intake and quality of life. We also describe how the LifeLines DEEP cohort allows for the detailed investigation of genetic, genomic and metabolic variation for a wide range of phenotypic outcomes. Finally, we examine the determinants of gastrointestinal health, an area of particular interest to us that can be addressed by LifeLines DEEP.
We have established a cohort of which multiple data levels allow for the integrative analysis of populations for translation of this information into biomarkers for disease, and which will offer new insights into disease mechanisms and prevention.
EPIDEMIOLOGY; PUBLIC HEALTH; GENETICS
Genotype imputation is an important procedure in current genomic analysis such as genome-wide association studies, meta-analyses and fine mapping. Although high quality tools are available that perform the steps of this process, considerable effort and expertise is required to set up and run a best practice imputation pipeline, particularly for larger genotype datasets, where imputation has to scale out in parallel on computer clusters.
Here we present MOLGENIS-impute, an ‘imputation in a box’ solution that seamlessly and transparently automates the set up and running of all the steps of the imputation process. These steps include genome build liftover (liftovering), genotype phasing with SHAPEIT2, quality control, sample and chromosomal chunking/merging, and imputation with IMPUTE2. MOLGENIS-impute builds on MOLGENIS-compute, a simple pipeline management platform for submission and monitoring of bioinformatics tasks in High Performance Computing (HPC) environments like local/cloud servers, clusters and grids. All the required tools, data and scripts are downloaded and installed in a single step. Researchers with diverse backgrounds and expertise have tested MOLGENIS-impute at different locations and imputed over 30,000 samples so far using the 1000 Genomes Project and new Genome of the Netherlands data as the imputation reference. The tests have been performed on PBS/SGE clusters, cloud VMs and in a grid HPC environment.
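The chunking step mentioned above is what lets imputation scale out in parallel: each chromosome is cut into windows and the samples into batches, and every (window, batch) pair becomes an independent job. The sketch below shows only this idea; the function names, the 5 Mb window and the batch size are assumptions for illustration, not the defaults of MOLGENIS-impute.

```python
def chromosome_chunks(chrom_length, chunk_size=5_000_000):
    """Split positions 1..chrom_length into consecutive inclusive (start, end) windows."""
    if chrom_length <= 0 or chunk_size <= 0:
        raise ValueError("chrom_length and chunk_size must be positive")
    return [(start, min(start + chunk_size - 1, chrom_length))
            for start in range(1, chrom_length + 1, chunk_size)]


def sample_batches(sample_ids, batch_size):
    """Split the sample list into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [sample_ids[i:i + batch_size] for i in range(0, len(sample_ids), batch_size)]


def imputation_jobs(chrom, chrom_length, sample_ids, chunk_size, batch_size):
    """One job per (genomic window, sample batch) pair; results are merged afterwards."""
    return [(chrom, start, end, batch)
            for start, end in chromosome_chunks(chrom_length, chunk_size)
            for batch in sample_batches(sample_ids, batch_size)]


# Toy example: a 12-position "chromosome" and three samples.
jobs = imputation_jobs("chr_toy", 12, ["s1", "s2", "s3"], chunk_size=5, batch_size=2)
```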
MOLGENIS-impute gives priority to the ease of setting up, configuring and running an imputation. It has minimal dependencies and wraps the pipeline in a simple command line interface, without sacrificing flexibility to adapt or limiting the options of underlying imputation tools. It does not require knowledge of a workflow system or programming, and is targeted at researchers who just want to apply best practices in imputation via simple commands. It is built on the MOLGENIS compute workflow framework to enable customization with additional computational steps or it can be included in other bioinformatics pipelines. It is available as open source from: https://github.com/molgenis/molgenis-imputation.
Electronic supplementary material
The online version of this article (doi:10.1186/s13104-015-1309-3) contains supplementary material, which is available to authorized users.
Imputation; Genotyping; GWAS
Body fat distribution is a heritable trait and a well-established predictor of adverse metabolic outcomes, independent of overall adiposity. To increase our understanding of the genetic basis of body fat distribution and its molecular links to cardiometabolic traits, we conducted genome-wide association meta-analyses of waist and hip circumference-related traits in up to 224,459 individuals. We identified 49 loci (33 new) associated with waist-to-hip ratio adjusted for body mass index (WHRadjBMI) and an additional 19 loci newly associated with related waist and hip circumference measures (P<5×10⁻⁸). Twenty of the 49 WHRadjBMI loci showed significant sexual dimorphism, 19 of which displayed a stronger effect in women. The identified loci were enriched for genes expressed in adipose tissue and for putative regulatory elements in adipocytes. Pathway analyses implicated adipogenesis, angiogenesis, transcriptional regulation, and insulin resistance as processes affecting fat distribution, providing insight into potential pathophysiological mechanisms.
Combining genotype data across cohorts increases power to estimate the heritability due to common single nucleotide polymorphisms (SNPs), based on analyzing a Genetic Relationship Matrix (GRM). However, the combination of SNP data across multiple cohorts may lead to stratification when, for example, different genotyping platforms are used. In the current study, we address issues of combining SNP data from different cohorts, the Netherlands Twin Register (NTR) and the Generation R (GENR) study. Both cohorts include children of Northern European Dutch background (N = 3102 + 2826, respectively) who were genotyped on different platforms. We explore imputation and phasing as a tool and compare three GRM-building strategies, when data from two cohorts are (1) just combined, (2) pre-combined and cross-platform imputed and (3) cross-platform imputed and post-combined. We test these three strategies with data on childhood height for unrelated individuals (N = 3124, average age 6.7 years) to explore their effect on SNP-heritability estimates and compare results to those obtained from the independent studies. All combination strategies result in SNP-heritability estimates with a standard error smaller than those of the independent studies. We did not observe a significant difference in estimates of SNP-heritability based on various cross-platform imputed GRMs. SNP-heritability of childhood height was on average estimated as 0.50 (SE = 0.10). Introducing cohort as a covariate resulted in a ≈2% drop. Principal components (PCs) adjustment resulted in SNP-heritability estimates of about 0.39 (SE = 0.11). Strikingly, we did not find a significant difference between cross-platform imputed and combined GRMs. All estimates were significant regardless of the use of PCs adjustment. Based on these analyses we conclude that imputation with a reference set helps to increase power to estimate SNP-heritability by combining cohorts of the same ethnicity genotyped on different platforms.
However, important factors should be taken into account such as remaining cohort stratification after imputation and/or phenotypic heterogeneity between and within cohorts. Whether one should use imputation, or just combine the genotype data, depends on the number of overlapping SNPs in relation to the total number of genotyped SNPs for both cohorts, and their ability to tag all the genetic variance related to the specific trait of interest.
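For orientation, a Genetic Relationship Matrix of the kind analyzed above can be computed from allele counts as the average, over SNPs, of the product of standardized genotypes. The sketch below follows the commonly used GCTA-style definition and assumes complete genotypes coded 0, 1 or 2; it is a pure-Python illustration on toy data, not the pipeline used in the study.

```python
def genetic_relationship_matrix(genotypes):
    """GRM from a list of per-individual allele-count lists (0, 1, 2; no missing data).

    A[j][k] = (1/m) * sum_i (x_ij - 2 p_i)(x_ik - 2 p_i) / (2 p_i (1 - p_i)),
    where p_i is the sample allele frequency of SNP i and m counts the
    polymorphic SNPs (those with 0 < p_i < 1).
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    matrix = [[0.0] * n for _ in range(n)]
    used = 0
    for i in range(n_snps):
        column = [individual[i] for individual in genotypes]
        p = sum(column) / (2.0 * n)
        variance = 2.0 * p * (1.0 - p)
        if variance == 0.0:
            continue                      # skip monomorphic SNPs
        used += 1
        centered = [x - 2.0 * p for x in column]
        for j in range(n):
            for k in range(j, n):
                matrix[j][k] += centered[j] * centered[k] / variance
    if used == 0:
        raise ValueError("no polymorphic SNPs available")
    for j in range(n):
        for k in range(j, n):
            matrix[j][k] /= used
            matrix[k][j] = matrix[j][k]   # fill the symmetric lower triangle
    return matrix


# Toy data: three individuals, three SNPs.
grm = genetic_relationship_matrix([[0, 1, 2], [2, 1, 0], [1, 1, 1]])
```

Because the allele frequencies enter every term, computing them on a combined sample rather than per cohort changes the matrix, which is one reason the order of combining and imputing matters in the comparison described above.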
Genotyping platform; Heterogeneity; Imputation; GCTA; SNP-heritability; Height
Owners of biobanks are in an unfortunate position: on the one hand, they need to protect the privacy of their participants, whereas on the other, their usefulness relies on the disclosure of the data they hold. Existing methods for Statistical Disclosure Control attempt to find a balance between utility and confidentiality, but come at a cost for the analysts of the data. We outline an alternative perspective to the balance between confidentiality and utility. By combining the generation of synthetic data with the automated execution of data analyses, biobank owners can guarantee the privacy of their participants, yet allow the analysts to work in an unrestricted manner.