The core of HGBASE is a list of non-redundant polymorphism records, implemented as follows. Four categories of variation are defined: (i) single base differences, (ii) insertion–deletion variants, (iii) simple tandem repeat polymorphisms, and (iv) ‘generic’ (or complex) changes involving alterations not described by the preceding three categories. Polymorphisms submitted for inclusion in HGBASE are considered equivalent, and are therefore combined into a single record, if they are of the same category and affect the same base(s) in the same gene, regardless of the precise allele details. For example, a newly submitted SNP involving a T–C change at base ‘N’ of gene ‘XYZ’ would be merged with an existing record of an SNP involving a T–G change at base ‘N’ of the ‘XYZ’ gene, producing one HGBASE record with three alleles (T, C and G). Details of the two distinct information sources and any other submitted or newly acquired data would be presented together within this record, along with a unique and permanent HGBASE accession number by which the underlying polymorphism can always be referenced. This accession number is structured as a progressively increasing number with a three-letter prefix (SNP, IND, MIC, GEN) that indicates the category to which the polymorphism belongs (a record property that is also given in a separate information field). In this way, a concise and non-redundant catalog is maintained, simplifying data extraction and subsequent experimental planning by users of the database. The HGBASE accession number is therefore a suitable reference number for use in research communications to identify specific polymorphisms represented in HGBASE. Accordingly, HGBASE accession numbers are allocated immediately to all newly received submissions and passed directly back to the data submitter for possible use in manuscript preparation.
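The merging rule described above can be illustrated with a minimal Python sketch. It is not the HGBASE implementation. The class and field names are hypothetical, the pairing of prefixes with categories is assumed from the order in which both are listed, and the nine-digit zero padding and the single counter shared across categories are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

# Assumed pairing of category and prefix, taken from the order of listing in the text.
PREFIXES = {
    "single_base": "SNP",
    "indel": "IND",
    "tandem_repeat": "MIC",
    "generic": "GEN",
}


@dataclass
class Record:
    accession: str
    category: str
    gene: str
    bases: tuple  # (first, last) affected base; hypothetical representation
    alleles: set = field(default_factory=set)
    sources: list = field(default_factory=list)


class Catalog:
    """Non-redundant catalog: one record per (category, gene, affected bases)."""

    def __init__(self):
        self._records = {}
        self._counter = count(1)  # assumption: one counter for all categories

    def submit(self, category, gene, bases, alleles, source):
        key = (category, gene, bases)
        record = self._records.get(key)
        if record is None:
            # New polymorphism: allocate a permanent accession number.
            accession = f"{PREFIXES[category]}{next(self._counter):09d}"
            record = Record(accession, category, gene, bases)
            self._records[key] = record
        # Equivalent polymorphism: pool alleles and sources in the same record.
        record.alleles.update(alleles)
        record.sources.append(source)
        return record.accession

    def get(self, category, gene, bases):
        return self._records[(category, gene, bases)]
```

Submitting a T–C change and then a T–G change at the same base of the same gene returns the same accession number both times, and the single resulting record carries the alleles T, C and G together with both sources.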
The additional information specified for each record is as follows:
1) DNA sequences comprising 25 bp 5′ of the polymorphism, the allelic bases themselves, and 25 bp 3′ of the polymorphism—all sequences shown in the same orientation as the direction of transcription. Currently this is declared as either genomic or cDNA sequence, but both will be included once the full human genome sequence becomes publicly available.
2) The HUGO nomenclature committee approved name and symbol for the host gene. When the host gene is presently known only as an anonymous Expressed Sequence Tag (EST) or computationally predicted gene, then this is stated and no name or symbol is given.
3) A DDBJ/EMBL/GenBank accession number for at least one nucleotide reference sequence, with specification of the residues therein that are altered by the polymorphism. Where possible, cDNA level and genomic DNA level references are provided, and in many cases an accession number and residue position for a reference protein sequence file (SWISS-PROT) may also be given. Since most gene structures and coding domains are still far from completely defined, no attempt is made to use any formalised or standardised naming system for polymorphisms represented in HGBASE. Instead, unambiguous and doubly foolproof polymorphic base specification is achieved by providing (i) the gene name/symbol plus 25 bp 5′ and 25 bp 3′ flanking sequences (effective since within any one stated gene the given ≥50 base string surrounding each variation is highly likely to be unique), and (ii) the numbered polymorphic base(s) in a reference DNA sequence file (effective since each given DDBJ/EMBL/GenBank accession plus version number combination will indicate an unequivocal position in a definitive sequence).
4) Indication of all known sources of the polymorphism, comprising a standardised comment (e.g., database, literature, in silico) as well as a detailed citation/pointer to each information source.
5) The intra-genic location of the polymorphism (e.g., exon, intron, coding sequence, 3′ untranslated region), and details of any predictable or known consequences thereof (e.g., codon and deduced amino-acid changes, splice site modifications, altered transcription factor binding sites).
6) Indication as to whether the variant is experimentally proven or merely suspected to exist, and why—in each case this judgement is required to be made by the data submitter (or copied from the data source) based upon their own criteria, and further details in specific cases must be sought from the original data source. A free text comments box is provided for the submitter to expand on this if they wish to do so.
7) Allele frequency for any number of ‘populations’ (as defined by the researchers who made the frequency determinations), plus for each determination, the number of individuals studied (so that the reliability of the given frequency can be estimated).
8) A free text comments box for any additional information not covered by the above.
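The two complementary ways of specifying a polymorphic base (flanking sequence in items 1 and 3, and a numbered position in a reference sequence in item 3) can be related by a short Python sketch. This is an illustration under stated assumptions, not HGBASE code: the function names are hypothetical, positions are taken to be 1-based and inclusive, and the bracketed `[T/C]` notation for the allelic bases is an assumed display convention.

```python
FLANK = 25  # bp of flanking sequence on each side, as stated in item 1


def flanks(reference, start, end):
    """Return the 5' and 3' flanks of bases start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(reference):
        raise ValueError("polymorphic base(s) lie outside the reference sequence")
    five_prime = reference[max(0, start - 1 - FLANK):start - 1]
    three_prime = reference[end:end + FLANK]
    return five_prime, three_prime


def context_string(reference, start, end, alleles):
    """Flank + alleles + flank, using a hypothetical bracket notation."""
    five_prime, three_prime = flanks(reference, start, end)
    return f"{five_prime}[{'/'.join(alleles)}]{three_prime}"


def occurs_once(reference, start, end):
    """Check that the flanked reference allele matches only one site in the sequence."""
    five_prime, three_prime = flanks(reference, start, end)
    probe = five_prime + reference[start - 1:end] + three_prime
    first = reference.find(probe)
    # Search again from the next offset so overlapping matches are also detected.
    return first != -1 and reference.find(probe, first + 1) == -1
```

A check such as `occurs_once` makes explicit why the flanking string is only "highly likely" to be unique: in repetitive sequence the same 51-base window can match at more than one site, which is where the numbered position in a versioned reference sequence removes the ambiguity.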
For database organisation we have established a two-level system. Local storage and handling of individual records is performed using the MS Access relational database tool. Purpose-built scripts are then used to transfer data either to a different relational database platform (Oracle Server 8) or to a simple flat-file format, which together provide inputs to the various web page interfaces. This arrangement is designed to allow convenient implementation of custom interface programs for advanced tasks, such as a Java bulk submission program. Data are always finally presented to the user as simple text flat-files for easy download.
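As a rough illustration of the transfer step from relational storage to flat-file output, the following Python sketch exports records from a relational table to a tagged text format. Everything specific here is an assumption: an in-memory SQLite database stands in for the MS Access and Oracle platforms named above, and the table name, column names, field tags and `//` record terminator are hypothetical rather than the actual HGBASE flat-file layout.

```python
import io
import sqlite3


def load_primary_store(rows):
    """Create a stand-in relational store (SQLite here, not MS Access or Oracle)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE polymorphism ("
        "accession TEXT PRIMARY KEY, category TEXT, gene TEXT, alleles TEXT)"
    )
    conn.executemany("INSERT INTO polymorphism VALUES (?, ?, ?, ?)", rows)
    return conn


def export_flat_file(conn, out):
    """Write every record as a block of tagged lines ending in '//'."""
    query = (
        "SELECT accession, category, gene, alleles "
        "FROM polymorphism ORDER BY accession"
    )
    for accession, category, gene, alleles in conn.execute(query):
        out.write(f"ACCESSION  {accession}\n")
        out.write(f"CATEGORY   {category}\n")
        out.write(f"GENE       {gene}\n")
        out.write(f"ALLELES    {alleles}\n")
        out.write("//\n")  # hypothetical end-of-record marker
```

Passing any writable text stream (an open file, or `io.StringIO()` for testing) as `out` keeps the export script independent of where the flat file is eventually served from.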