HGVbaseG2P has evolved out of the polymorphism database HGVbase (6). In addition to extending its predecessor's scope to include phenotypes (in patient and control groups) and genotype–phenotype relationships (in the form of association study findings), HGVbaseG2P combines the best features of a database and a scientific journal, i.e. free access to structured and comprehensive result sets, along with summary-level presentation of high-quality information with full author accreditation. At the very least, the project aims to reduce hurdles to data publication and so minimize the problems of publication bias (7), and it is designed to bring together many different data sets for joint analysis, thereby helping researchers identify genotyping artefacts and elucidate association signals that are population-specific or shared across related traits.
The data model underpinning HGVbaseG2P closely follows that of the new PaGE object model standard (http://www.pageom.org/). A detailed representation of the HGVbaseG2P model is available at the database website, and its principal elements are illustrated in Figure 1. As shown, summary-level association data are layered on top of a foundational list of DNA variation markers. This ‘Marker’ layer currently comprises core information on all the markers presently in NCBI's dbSNP database (8), along with a direct link back to the source for each marker. Additional marker lists will be incorporated from other public databases, in particular copy-number variants, which are becoming widely used in disease studies (9). Markers retain their source database identifiers and are also assigned stable HGVbaseG2P IDs. Changes to marker information in the source database are tracked and updated in HGVbaseG2P via purpose-built software that handles marker and allele content alterations, mergers and deletions in a way that preserves the integrity of any existing connections to frequency or association record items.
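The idea of keeping stable IDs valid through source-database mergers and deletions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the purpose-built software referred to above: the class names, the `M000001`-style ID format, the `rs111`/`rs222` identifiers and the example P-value are all invented here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Marker:
    stable_id: str                      # illustrative stable ID; the format is invented here
    source_id: str                      # identifier in the source database
    status: str = "active"              # "active", "merged" or "deleted"
    merged_into: Optional[str] = None   # stable ID of the surviving marker, if merged


@dataclass
class MarkerRegistry:
    markers: Dict[str, Marker] = field(default_factory=dict)    # stable_id -> Marker
    by_source: Dict[str, str] = field(default_factory=dict)     # source_id -> stable_id
    counter: int = 0

    def add(self, source_id: str) -> str:
        """Register a new source marker and assign it a stable ID."""
        self.counter += 1
        stable_id = f"M{self.counter:06d}"
        self.markers[stable_id] = Marker(stable_id, source_id)
        self.by_source[source_id] = stable_id
        return stable_id

    def merge(self, old_source_id: str, surviving_source_id: str) -> None:
        """The source folded one marker into another: keep the old record, add a pointer."""
        old = self.markers[self.by_source[old_source_id]]
        old.status = "merged"
        old.merged_into = self.by_source[surviving_source_id]

    def delete(self, source_id: str) -> None:
        """The source withdrew a marker: flag it, but never drop the record."""
        self.markers[self.by_source[source_id]].status = "deleted"

    def resolve(self, stable_id: str) -> Marker:
        """Follow merge pointers to the marker that currently represents this ID."""
        marker = self.markers[stable_id]
        while marker.merged_into is not None:
            marker = self.markers[marker.merged_into]
        return marker


registry = MarkerRegistry()
a = registry.add("rs111")          # placeholder source identifiers
b = registry.add("rs222")
association_records = {a: 0.003}   # an existing record keyed by stable ID (invented value)

registry.merge("rs111", "rs222")
current = registry.resolve(a)      # the old stable ID still resolves after the merge
```

Because records are keyed by the stable ID and markers are flagged rather than removed, an association record created before a merger or deletion still points at a resolvable marker afterwards.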
Figure 1. Overview data model of HGVbaseG2P. The main subdivisions of data that make up an association report are shown, with further explanation of each component provided in the body text. These components are illustrated sitting on top of a comprehensive ‘Marker’ ...
The main association data layer of HGVbaseG2P comprises four principal components that emulate the main concepts used in standard literature reporting of genetic association studies, namely ‘Study’, ‘Sample’, ‘Phenotype’ and ‘Experiment’ entities.
A ‘Study’ entity sits atop, and thereby integrates, the three other main data entities that make up a single submission (‘Sample’, ‘Phenotype’ and ‘Experiment’). Each ‘Study’ entry could potentially include information items such as ‘Authors’, ‘Title’, ‘Abstract’, ‘Objectives’ and ‘Conclusions’, plus various details relating to the study design, and each will provide links to its original data source so that further information and individual level data can be requested.
HGVbaseG2P uses the ‘Sample Panel’ concept to represent a named collection of individuals that are employed in a ‘Study’—such as disease cases, or matched controls. Typically, individuals in a ‘Sample Panel’ are affected by one or more similar disease phenotypes, or have some other key metric in common (e.g. age, gender, ethnicity). Data generated by performing genotyping experiments are reported in terms of an ‘Assayed Panel’. This is a group of test subjects derived by splitting and/or merging one or more sample panels to create new collections, on the basis of some explicit criteria such as severity or subclass of disease or some environmental criteria.
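The relationship between the two panel types can be made concrete with a short Python sketch. It assumes hypothetical names throughout (`Individual`, `SamplePanel`, `AssayedPanel`, `derive_assayed_panel`, the `severity` attribute), and the individual-level records exist only to make the derivation visible; they are not a description of what the database itself stores.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Individual:
    ident: str
    severity: str       # hypothetical attribute used as a selection criterion


@dataclass
class SamplePanel:
    name: str
    members: List[Individual]


@dataclass
class AssayedPanel:
    name: str
    members: List[Individual]
    criterion: str      # human-readable statement of the explicit criterion


def derive_assayed_panel(
    name: str,
    panels: Iterable[SamplePanel],
    keep: Callable[[Individual], bool],
    criterion: str,
) -> AssayedPanel:
    """Merge the given sample panels, then split off the members that satisfy `keep`."""
    seen = set()
    selected = []
    for panel in panels:
        for person in panel.members:
            # An individual listed in several panels is counted once.
            if person.ident not in seen and keep(person):
                seen.add(person.ident)
                selected.append(person)
    return AssayedPanel(name, selected, criterion)


# Two invented case panels, merged and then restricted to severe disease.
clinic_a = SamplePanel("Cases, clinic A", [Individual("a1", "severe"), Individual("a2", "mild")])
clinic_b = SamplePanel("Cases, clinic B", [Individual("b1", "severe"), Individual("a1", "severe")])

severe_cases = derive_assayed_panel(
    "Severe cases",
    [clinic_a, clinic_b],
    keep=lambda person: person.severity == "severe",
    criterion="disease severity = severe",
)
```

Storing the criterion alongside the derived panel keeps the selection explicit, which is the property the text requires of an ‘Assayed Panel’.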
Phenotypes are stored in a flexible but straightforward data structure. Whereas other databases typically use unstructured free-text descriptions to hold phenotype information, HGVbaseG2P splits phenotype information into three sub-components: (i) the ‘Phenotype Property’, which represents the character or trait investigated (e.g. nose size); (ii) the ‘Phenotype Method’, which describes how the trait was measured (e.g. nasal septum measured in centimetres to the first decimal place with a ruler); and (iii) the ‘Phenotype Value’, which is a particular result produced by measuring the trait using the described method (e.g. size of nose = 1.7 cm). Identical ordinal or nominal values in groups of individuals are thereby easily represented, as are categories of disease affection status. For quantitative traits in patient groups, statistical values that describe a distribution (e.g. median, standard deviation, maximum, minimum) are stored as a series of ‘Phenotype Values’. The same data model allows phenotype thresholds to be specified and used as criteria for ‘Assayed Panel’ selection (e.g. weight greater than ‘Phenotype Value = 120 kg’).
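A minimal Python sketch of this three-part structure follows. The class and field names, the `kind` label, the description of the weighing method and all numbers other than the 120 kg threshold are assumptions made for illustration, not the database schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PhenotypeProperty:
    name: str                   # the character or trait investigated


@dataclass(frozen=True)
class PhenotypeMethod:
    prop: PhenotypeProperty
    description: str            # how the trait was measured
    unit: str


@dataclass(frozen=True)
class PhenotypeValue:
    method: PhenotypeMethod
    value: float
    kind: str = "observed"      # or "median", "sd", "min", "max", "threshold"


weight = PhenotypeProperty("body weight")
scale = PhenotypeMethod(weight, "weighed on a scale (invented method text)", "kg")

# A distribution in a patient group, stored as a series of values (invented numbers).
summary: List[PhenotypeValue] = [
    PhenotypeValue(scale, 84.0, "median"),
    PhenotypeValue(scale, 12.5, "sd"),
    PhenotypeValue(scale, 51.0, "min"),
    PhenotypeValue(scale, 142.0, "max"),
]

# The same structure expresses a selection threshold for an assayed panel.
threshold = PhenotypeValue(scale, 120.0, "threshold")


def exceeds(observed: PhenotypeValue, limit: PhenotypeValue) -> bool:
    """True if `observed` is strictly greater than `limit`, measured by the same method."""
    if observed.method != limit.method:
        raise ValueError("values obtained by different methods are not comparable")
    return observed.value > limit.value
```

Tying every value to its method is what makes the comparison safe: a threshold in kilograms is only applied to values measured the same way.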
‘Experiments’ are individual sections of a study submission, and each is constrained to a consideration of at most one phenotype examined in a specific set of ‘Assayed Panels’ (e.g. one case and one control ‘Assayed Panel’ used to explore the role of a gene or a region or a haplotype block in causing one phenotype). Subtypes include ‘Genotyping Experiments’ (holding marker allele and genotype frequency data) and ‘Analysis Experiments’ (holding marker-to-trait association P-value data). The ‘Genotyping Experiments’ will become increasingly useful as a reference of marker frequencies in a range of populations as these data accumulate. Presently, however, we only permit access to aggregate allele or genotype frequency data one marker at a time, to avoid any risk of individual identification (10). The ‘Analysis Experiments’ are most central to the purpose of HGVbaseG2P, and the data for each distinct experiment may include more than one package of information based upon different statistical tests, such as an allelic trend test, a genotypic test, etc. This information will usually be initially generated as an output file from software such as PLINK (11