To facilitate broad and convenient integrative visualization of and access to GWAS data, we have created the GWAS Central resource (http://www.gwascentral.org). This database seeks to provide a comprehensive collection of summary-level genetic association data, structured both for maximal utility and for safe open access (i.e., non-directional signals to fully preclude research subject identification). The resource emphasizes advanced tools that allow comparison and discovery of relevant data sets from the perspective of genes, genome regions, phenotypes or traits. Tested markers and relevant genomic features can be visually interrogated across up to 16 association data sets in a single view, starting at a chromosome-wide view and increasing in resolution down to individual bases. In addition, users can privately upload and view their own data as temporary files. Search and display utility is further enhanced by exploiting phenotype ontology annotations to allow genetic variants associated with phenotypes and traits of interest to be precisely identified, across all studies. Data submissions are accepted from individual researchers, groups and consortia, and we also actively gather data sets from various public sources. As a result, the resource now provides over 67 million P-values for over 1600 studies, making it the world's largest openly accessible online collection of summary-level GWAS association information.
GWAS; Genotype; Phenotype; SNP; Database
The 11th International Meeting on Human Genome Variation and Complex Genome Analysis (HGV2009: Tallinn, Estonia, 11th–13th September 2009) provided a stimulating workshop environment where diverse academics and industry representatives explored the latest progress, challenges, and opportunities in relating genome variation to evolution, technology, health, and disease. Key themes included Genome-Wide Association Studies (GWAS), progress beyond GWAS, sequencing developments, and bioinformatics approaches to large-scale datasets.
HGV2009; SNP; variation; GWAS; CNV
The 13th International Meeting on Human Genome Variation and Complex Genome Analysis (HGV2012: Shanghai, China, 6th–8th September 2012) was a stimulating workshop where researchers from academia and industry explored the latest progress, challenges, and opportunities in genome variation research. Key themes included advancements in next-generation sequencing (NGS) technology, investigation of common and rare diseases, employing NGS in the clinic, utilizing large datasets that leverage biobanks and population-specific cohorts, and exploration of genomic features.
variation; SNP; GWAS; next generation sequencing; NGS; inherited disease
The Centre for Applied Genomics of the Hospital for Sick Children and the University of Toronto hosted the 10th Human Genome Variation (HGV) Meeting in Toronto, Canada, in October 2008, welcoming about 240 registrants from 34 countries. During the 3 days of plenary workshops, keynote address, and poster sessions, a strong cross-disciplinary trend was evident, integrating expertise from technology and computation, through biology and medicine, to ethics and law. Single nucleotide polymorphisms (SNPs) as well as the larger copy number variants (CNVs) are recognized by ever-improving array and next-generation sequencing technologies, and the data are being incorporated into studies that are increasingly genome-wide as well as global in scope. A greater challenge is to convert data to information, through databases, and to use the information for greater understanding of human variation. In the wake of publications of the first individual genome sequences, an inaugural public forum provided the opportunity to debate whether we are ready for personalized medicine through direct-to-consumer testing. The HGV meetings foster collaboration, and fruits of the interactions from 2008 are anticipated for the 11th annual meeting in September 2009.
SNP; CNV; GWAS; personalized medicine
The 12th International Meeting on Human Genome Variation and Complex Genome Analysis (HGV2011: Berkeley, California, USA, 8th–10th September 2011) was a stimulating workshop where researchers from academia and industry explored the latest progress, challenges, and opportunities in genome variation research. Key themes included progress beyond GWAS, variation in human populations, use of sequence data in medical settings, large-scale sequencing data analysis, and bioinformatics approaches to large datasets.
human variation; GWAS; SNP; medical genomics
In this review, we discuss the latest targeted enrichment methods and aspects of their utilization along with second-generation sequencing for complex genome analysis. In doing so, we provide an overview of issues involved in detecting genetic variation, for which targeted enrichment has become a powerful tool. We explain how targeted enrichment for next-generation sequencing has made great progress in terms of methodology, ease of use and applicability, but emphasize the remaining challenges such as the lack of even coverage across targeted regions. Costs are also considered versus the alternative of whole-genome sequencing which is becoming ever more affordable. We conclude that targeted enrichment is likely to be the most economical option for many years to come in a range of settings.
targeted enrichment; next-generation sequencing; genome partitioning; exome; genetic variation
Motivation: Genomic copy number variation (CNV) can influence susceptibility to common diseases. High-throughput measurement of gene copy number on large numbers of samples is a challenging, yet critical, stage in confirming observations from sequencing or array Comparative Genome Hybridization (CGH). The paralogue ratio test (PRT) is a simple, cost-effective method of accurately determining copy number by quantifying the amplification ratio between a target and reference amplicon. PRT has been successfully applied to several studies analyzing common CNV. However, its use has not been widespread because of difficulties in assay design.
Results: We present PRTPrimer (www.prtprimer.org) software for automated PRT assay design. In addition to stand-alone software, the web site includes a database of pre-designed assays for the human genome at an average spacing of 6 kb and a web interface for custom assay design. Other reference genomes can also be analyzed through local installation of the software. The usefulness of PRTPrimer was tested within known CNV, and showed reproducible quantification. This software and database provide assays that can rapidly genotype CNV, cost-effectively, on a large number of samples and will enable the widespread adoption of PRT.
Availability: PRTPrimer is available in two forms: a Perl script (version 5.14 and higher) that can be run from the command line on Linux systems and as a service on the PRTPrimer web site (www.prtprimer.org).
Supplementary data are available at Bioinformatics online.
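The core arithmetic of the paralogue ratio test described above can be sketched in a few lines. This is only an illustration, not PRTPrimer code: the function name `prt_copy_number`, its parameters, and the assumptions that the reference amplicon is present at a fixed diploid copy number of 2 and that both amplicons amplify with equal efficiency are all introduced here for the example.

```python
def prt_copy_number(target_signal: float,
                    reference_signal: float,
                    reference_copies: int = 2) -> float:
    """Estimate target copy number from a target/reference amplification ratio.

    Assumes (hypothetically) that the reference amplicon is present at
    `reference_copies` copies per diploid genome and that target and
    reference products amplify with equal efficiency.
    """
    if reference_signal <= 0:
        raise ValueError("reference signal must be positive")
    ratio = target_signal / reference_signal
    return ratio * reference_copies


def call_integer_copy_number(estimate: float) -> int:
    """Round a raw estimate to the nearest non-negative integer copy number."""
    return max(0, round(estimate))


if __name__ == "__main__":
    # Illustrative signal values only; not measurements from the paper.
    raw = prt_copy_number(target_signal=1500.0, reference_signal=1000.0)
    print(raw, call_integer_copy_number(raw))
```

In practice, raw ratios would normally be calibrated against samples of known copy number before rounding; that step is omitted from this sketch.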
The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central – a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data.
A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications.
We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.
Ontology; Phenotype; GWAS; RDF
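The idea of retrieving all annotated data by querying a single ontology graph can be sketched as a descendant expansion over a term hierarchy. The hierarchy, term names, and study identifiers below are toy placeholders invented for this example; they are not real MeSH or HPO content and do not reflect how GWAS Central implements its search.

```python
from collections import deque
from typing import Dict, List, Set


def descendants(children: Dict[str, List[str]], root: str) -> Set[str]:
    """Return the root term plus every term reachable below it."""
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def studies_for_term(children: Dict[str, List[str]],
                     annotations: Dict[str, str],
                     root: str) -> List[str]:
    """List studies annotated to the query term or to any of its descendants."""
    terms = descendants(children, root)
    return sorted(study for study, term in annotations.items() if term in terms)


# Toy hierarchy and annotations (hypothetical placeholders).
TOY_CHILDREN = {
    "trait": ["disease", "sign"],
    "disease": ["disease_a", "disease_b"],
}
TOY_ANNOTATIONS = {
    "study1": "disease_a",
    "study2": "sign",
    "study3": "disease_b",
}

if __name__ == "__main__":
    print(studies_for_term(TOY_CHILDREN, TOY_ANNOTATIONS, "disease"))
```

Querying a broad term returns every study annotated at any finer level beneath it, which is why annotating each study at the most granular level available loses nothing at query time.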
Biobanks can have a pivotal role in elucidating disease etiology, translation, and advancing public health. However, meeting these challenges hinges on a critical shift in the way science is conducted and requires biobank harmonization. There is growing recognition that a common strategy is imperative to develop biobanking globally and effectively. To help guide this strategy, we articulate key principles, goals, and priorities underpinning a roadmap for global biobanking to accelerate health science, patient care, and public health. The need to manage and share very large amounts of data has driven innovations on many fronts. Although technological solutions are allowing biobanks to reach new levels of integration, increasingly powerful data-collection tools, analytical techniques, and the results they generate raise new ethical and legal issues and challenges, necessitating a reconsideration of previous policies, practices, and ethical norms. These manifold advances and the investments that support them are also fueling opportunities for biobanks to ultimately become integral parts of health-care systems in many countries. International harmonization to increase interoperability and sustainability are two strategic priorities for biobanking. Tackling these issues requires an environment favorably inclined toward scientific funding and equipped to address socio-ethical challenges. Cooperation and collaboration must extend beyond systems to enable the exchange of data and samples to strategic alliances between many organizations, including governmental bodies, funding agencies, public and private science enterprises, and other stakeholders, including patients. A common vision is required and we articulate the essential basis of such a vision herein.
Sharing of data about variation and the associated phenotypes is a critical need, yet variant information can be arbitrarily complex, making a single standard vocabulary elusive and re-formatting difficult. Complex standards have proven too time-consuming to implement.
The GEN2PHEN project addressed these difficulties by developing a comprehensive data model for capturing biomedical observations, Observ-OM, and building the VarioML format around it. VarioML pairs a simplified open specification for describing variants, with a toolkit for adapting the specification into one's own research workflow. Straightforward variant data can be captured, federated, and exchanged with no overhead; more complex data can be described, without loss of compatibility. The open specification enables push-button submission to gene variant databases (LSDBs) e.g., the Leiden Open Variation Database, using the Cafe Variome data publishing service, while VarioML bidirectionally transforms data between XML and web-application code formats, opening up new possibilities for open source web applications building on shared data. A Java implementation toolkit makes VarioML easily integrated into biomedical applications. VarioML is designed primarily for LSDB data submission and transfer scenarios, but can also be used as a standard variation data format for JSON and XML document databases and user interface components.
VarioML is a set of tools and practices improving the availability, quality, and comprehensibility of human variation information. It enables researchers, diagnostic laboratories, and clinics to share that information with ease, clarity, and without ambiguity.
LSDB; Variation database curation; Data collection; Distribution
For many analytical methods the efficiency of DNA amplification varies across the genome and between samples. The most affected genome regions tend to correlate with high C + G content; however, this relationship is complex and does not explain why the direction and magnitude of effects vary considerably between samples.
Here, we provide evidence that sequence elements that are particularly high in C + G content can remain annealed even when aggressive melting conditions are applied. In turn, this behavior creates broader ‘Thermodynamically Ultra-Fastened’ (TUF) regions characterized by incomplete denaturation of the two DNA strands, so reducing amplification efficiency throughout these domains.
This model provides a mechanistic explanation for why some genome regions are particularly difficult to amplify and assay in many procedures, and importantly it also explains inter-sample variability of this behavior. That is, DNA samples of varying quality will carry more or fewer nicks and breaks, and hence their intact TUF regions will have different lengths and so be differentially affected by this amplification suppression mechanism – with ‘higher’ quality DNAs being the most vulnerable. A major practical consequence of this is that inter-region and inter-sample variability can be largely overcome by employing routine fragmentation methods (e.g. sonication or restriction enzyme digestion) prior to sample amplification.
DNA amplification; DNA denaturation; C + G; Illumina infinium
We propose an innovative, integrated, cost-effective health system to combat major non-communicable diseases (NCDs), including cardiovascular, chronic respiratory, metabolic, rheumatologic and neurologic disorders and cancers, which together are the predominant health problem of the 21st century. This proposed holistic strategy involves comprehensive patient-centered integrated care and multi-scale, multi-modal and multi-level systems approaches to tackle NCDs as a common group of diseases. Rather than studying each disease individually, it will take into account their intertwined gene-environment, socio-economic interactions and co-morbidities that lead to individual-specific complex phenotypes. It will implement a road map for predictive, preventive, personalized and participatory (P4) medicine based on a robust and extensive knowledge management infrastructure that contains individual patient information. It will be supported by strategic partnerships involving all stakeholders, including general practitioners associated with patient-centered care. This systems medicine strategy, which will take a holistic approach to disease, is designed to allow the results to be used globally, taking into account the needs and specificities of local economies and health systems.
We describe a copy-number variant (CNV) for which deletion alleles confer a protective effect against rheumatoid arthritis (RA). This CNV reflects net unit deletions and expansions to a normal two-unit tandem duplication located on human chr12p13.31, a region with conserved synteny to the rat RA susceptibility quantitative trait locus Oia2. Genotyping, using the paralogue ratio test and SNP intensity data, in Swedish samples (2,403 cases, 1,269 controls) showed that the frequency of deletion variants is significantly lower in cases (P = 0.0012, OR = 0.442 [95%CI 0.258–0.755]). Reduced frequencies of deletion variants were also seen in replication materials comprising 9,201 UK samples (1,846 cases, 7,355 controls) and 2,963 US samples (906 controls, 1,967 cases) (Mantel–Haenszel P = 0.036, OR = 0.559 [95%CI 0.323–0.966]). Combining the three datasets produces a Mantel–Haenszel OR of 0.497 (P < 0.0002). The deletion variant lacks 129 kb of DNA containing SLC2A3, NANOGP1, and SLC2A14. SLC2A3 encodes a high-affinity glucose transporter important in the immune response and chondrocyte metabolism, both key aspects of RA pathogenesis. The large effect size of this association, its potential relevance to other diseases in which SLC2A3 is implicated, and the possibility of targeting drugs to inhibit SLC2A3, argue for further examination of the genetics and the biology of this CNV.
association; rheumatoid arthritis; SLC2A3; GLUT3; CNV
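A pooled Mantel–Haenszel odds ratio of the kind reported above combines one 2×2 table per cohort. The sketch below shows the standard estimator only; the function name and the carrier counts are hypothetical placeholders, since the abstract reports cohort sizes but not the per-cohort counts of deletion carriers, so the output is not expected to reproduce the published values.

```python
from typing import Iterable, Tuple

# One stratum per cohort, as counts of
# (case carriers, case non-carriers, control carriers, control non-carriers).
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_or(strata: Iterable[Stratum]) -> float:
    """Pooled odds ratio: sum(a*d/n) / sum(b*c/n) over all strata."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these counts")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts for illustration only; NOT data from the study.
    example_strata = [(10, 90, 20, 80), (5, 95, 10, 90)]
    print(round(mantel_haenszel_or(example_strata), 3))
```

A pooled value below 1 indicates that carriers are under-represented among cases across the strata, which is the direction of effect described for the deletion variant.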
There is a huge demand on bioinformaticians to provide their biologists with user-friendly and scalable software infrastructures to capture, exchange, and exploit the unprecedented amounts of new *omics data. Here we present MOLGENIS, a generic, open source software toolkit to quickly produce the bespoke MOLecular GENetics Information Systems needed.
The MOLGENIS toolkit provides bioinformaticians with a simple language to model biological data structures and user interfaces. At the push of a button, MOLGENIS’ generator suite automatically translates these models into a feature-rich, ready-to-use web application including database, user interfaces, exchange formats, and scriptable interfaces. Each generator is a template of SQL, JAVA, R, or HTML code that would require much effort to write by hand. This ‘model-driven’ method ensures reuse of best practices and improves quality because the modeling language and generators are shared between all MOLGENIS applications, so that errors are found quickly and improvements are shared easily by a re-generation. A plug-in mechanism ensures that both the generator suite and generated product can be customized just as much as hand-written software.
In recent years we have successfully evaluated the MOLGENIS toolkit for the rapid prototyping of many types of biomedical applications, including next-generation sequencing, GWAS, QTL, proteomics and biobanking. Writing 500 lines of model XML typically replaces 15,000 lines of hand-written programming code, which allows for quick adaptation if the information system is not yet to the biologist’s satisfaction. Each application generated with MOLGENIS comes with an optimized database back-end, user interfaces for biologists to manage and exploit their data, programming interfaces for bioinformaticians to script analysis tools in R, Java, SOAP, REST/JSON and RDF, a tab-delimited file format to ease upload and exchange of data, and detailed technical documentation. Existing databases can be quickly enhanced with MOLGENIS generated interfaces using the ‘ExtractModel’ procedure.
The MOLGENIS toolkit provides bioinformaticians with a simple model to quickly generate flexible web platforms for all possible genomic, molecular and phenotypic experiments with a richness of interfaces not provided by other tools. All the software and manuals are available free as LGPLv3 open source at http://www.molgenis.org.
The recent explosion of biological data and the concomitant proliferation of distributed databases make it challenging for biologists and bioinformaticians to discover the best data resources for their needs, and the most efficient way to access and use them. Despite a rapid acceleration in uptake of syntactic and semantic standards for interoperability, it is still difficult for users to find which databases support the standards and interfaces that they need. To solve these problems, several groups are developing registries of databases that capture key metadata describing the biological scope, utility, accessibility, ease-of-use and existence of web services allowing interoperability between resources. Here, we describe some of these initiatives including a novel formalism, the Database Description Framework, for describing database operations and functionality and encouraging good database practise. We expect such approaches will result in improved discovery, uptake and utilization of data resources.
Database URL: http://www.casimir.org.uk/casimir_ddf
As our knowledge of the complexity of gene architecture grows, and we increase our understanding of the subtleties of gene expression, the process of accurately describing disease-causing gene variants has become increasingly problematic. In part, this is due to current reference DNA sequence formats that do not fully meet present needs. Here we present the Locus Reference Genomic (LRG) sequence format, which has been designed for the specific purpose of gene variant reporting. The format builds on the successful National Center for Biotechnology Information (NCBI) RefSeqGene project and provides a single-file record containing a uniquely stable reference DNA sequence along with all relevant transcript and protein sequences essential to the description of gene variants. In principle, LRGs can be created for any organism, not just human. In addition, we recognize the need to respect legacy numbering systems for exons and amino acids and the LRG format takes account of these. We hope that widespread adoption of LRGs - which will be created and maintained by the NCBI and the European Bioinformatics Institute (EBI) - along with consistent use of the Human Genome Variation Society (HGVS)-approved variant nomenclature will reduce errors in the reporting of variants in the literature and improve communication about variants affecting human health. Further information can be found on the LRG web site: http://www.lrg-sequence.org.
An ambitious plan to collect, curate, and make accessible information on genetic variations affecting human health is beginning to be realized.
Copy number variants (CNVs) occupy a significant portion of the human genome and may have important roles in meiotic recombination, human genome evolution and gene expression. Many genetic diseases may be underlain by CNVs. However, because of the presence of their multiple copies, variability in copy numbers and the diploidy of the human genome, detailed genetic structure of CNVs cannot be readily studied by available techniques.
Single sperm samples were used as the primary subjects for the study so that CNV haplotypes in the sperm donors could be studied individually. Forty-eight CNVs characterized in a previous study were analyzed using a microarray-based high-throughput genotyping method after multiplex amplification. Seventeen single nucleotide polymorphisms (SNPs) were also included as controls. Two single-base variants, either allelic or paralogous, could be discriminated for all markers. Microarray data were used to resolve SNP alleles and CNV haplotypes, to quantitatively assess the numbers and compositions of the paralogous segments in each CNV haplotype.
This is the first study of the genetic structure of CNVs on a large scale. Resulting information may help understand evolution of the human genome, gain insight into many genetic processes, and discriminate between CNVs and SNPs. The highly sensitive high-throughput experimental system with haploid sperm samples as subjects may be used to facilitate detailed large-scale CNV analysis.
The Human Genome Variation database of Genotype to Phenotype information (HGVbaseG2P) is a new central database for summary-level findings produced by human genetic association studies, both large and small. Such a database is needed so that researchers have an easy way to access all the available association study data relevant to their genes, genome regions or diseases of interest. Such a depository will allow true positive signals to be more readily distinguished from false positives (type I error) that fail to consistently replicate. In this paper we describe how HGVbaseG2P has been constructed, and how its data are gathered and organized. We present a range of user-friendly but powerful website tools for searching, browsing and visualizing G2P study findings. HGVbaseG2P is available at http://www.hgvbaseg2p.org.
Asthma; polymorphisms; glioblastoma multiforme; GBM glioblastoma multiforme; IL interleukin; COX-2 cyclooxygenase 2; OR odds ratio; CI confidence interval; SNP single nucleotide polymorphism; CRP C-reactive protein
We have surveyed, compiled and annotated nucleotide variations in 338 human 7-transmembrane receptors (G-protein coupled receptors). In a sample of 32 chromosomes from a Nordic population, we attempted to determine the allele frequencies of 80 non-synonymous SNPs, and found 20 novel polymorphic markers. GPCRs of physiological and clinical importance were prioritized for statistical analysis. Natural variation and rare mutation information were merged and presented online in the Human GPCR-DB database.
The average number of SNPs per 1000 bases of exonic sequence was found to be twice the average number of SNPs per kilobase of intronic regions (2.2 versus 1.0). Of the 338 genes, 111 were single exon genes, that is, were intronless. The average number of exonic-SNPs per single-exon gene was 3.5 (n = 395) while that for multi-exon genes was 0.8 (n = 1176). The average number of variations within the different protein domains (N-terminus, internal- and external-loops, trans-membrane region, C-terminus) indicates a lower rate of variation in the trans-membrane region of Monoamine GPCRs, as compared to Chemokine- and Peptide-receptor sub-classes of GPCRs.
Single-exon GPCRs on average have approximately three times the number of SNPs as compared to GPCRs with introns. Among various functional classes of GPCRs, Monoamine GPCRs have a lower number of natural variations within the trans-membrane domain, indicating evolutionary selection against non-synonymous changes within the membrane-localizing domain of this sub-class of GPCRs.
Human genome polymorphism is expected to play a key role in defining the etiologic basis of phenotypic differences between individuals in aspects such as drug responses and common disease predisposition. Relevant functional DNA changes will probably be located in or near to transcribed sequences, and include many single nucleotide polymorphisms. To aid the future analysis of such genome variation, HGBASE (Human Genic Bi-Allelic SEquences) was constructed as a means to gather human gene-linked polymorphisms from all possible public sources, and show these as a non-redundant set of records in a standardized and user-friendly database endowed with text and sequence based search facilities. After 1 year of presence on the WWW, the HGBASE project has compiled data for over 22 000 records, and this number continues to triple every 6–12 months with data harvested or submitted from all major public genome databases and published literature from the previous decade. Extensive annotation enhancement, internal consistency checking and manual review of every record is undertaken to address potential errors and deficiencies sometimes present in the original source data. The fully polished and comprehensive database is made freely available to all at http://hgbase.cgr.ki.se
The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.
An international team has systematically validated and annotated just over 21,000 human genes using full-length cDNA, thereby providing a valuable new resource for the human genetics community.