The IMGT/HLA database contains entries for all HLA alleles, and alleles of some related genes, officially named by the Nomenclature Committee. These entries are derived from expertly annotated copies of the original EMBL-Bank/GenBank/DDBJ entries. This means that the IMGT/HLA database may contain multiple entries for any single allele. These component entries are submitted to the database either by the original author, or by our curators, when sequences of interest have been identified by data-mining but have yet to be submitted to the database. To distinguish each IMGT/HLA entry from the component EMBL entries, each new allele is assigned a unique accession number. The accession numbers follow the format HLA00000, where the ‘00000’ represents a numerical code.
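The accession format described above can be checked mechanically. The following is a minimal sketch, assuming only what the text states (the literal prefix `HLA` followed by a five-digit numerical code); the function names `is_valid_accession` and `format_accession` are hypothetical and not part of any IMGT/HLA software.

```python
import re

# Pattern assumed from the text: "HLA" followed by exactly five digits.
ACCESSION_RE = re.compile(r"HLA[0-9]{5}")


def is_valid_accession(accession: str) -> bool:
    """Return True if the whole string has the form HLA00000."""
    return ACCESSION_RE.fullmatch(accession) is not None


def format_accession(number: int) -> str:
    """Build an accession string from a numerical code (hypothetical helper)."""
    if not 0 <= number <= 99999:
        raise ValueError("numerical code must fit in five digits")
    return f"HLA{number:05d}"
```

For example, `format_accession(1)` produces `HLA00001`, which `is_valid_accession` accepts, whereas a string such as `HLA123` is rejected.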
It must be noted that all sequences within the IMGT/HLA database should also be available from the more general nucleotide sequence databases: EMBL-Bank (24), GenBank (26) and the DNA Database of Japan (DDBJ) (27). The main problem when accessing HLA sequences from these databases lies in the definition of the sequence. Despite the work of the members of the WHO Nomenclature Committee for Factors of the HLA System in monitoring HLA allele designations and maintaining the sequences, they have no control of how sequences are defined in these generalist databases. Readers should, therefore, be aware that entries in these generalist databases may be incorrectly named, contain unofficial designations or contain known, but uncorrected, sequencing errors.
Retrieving allele information and displaying polymorphisms
The main access point for users is the World Wide Web (WWW), where a number of search tools and other facilities are available to retrieve, manipulate and analyse HLA data. The IMGT/HLA website can be split into three main areas. The first area comprises information and help pages that give background on the database, in-depth help on the tools and data available, and documentation of the IMGT/HLA file formats. The second area includes the tools designed specifically for the IMGT/HLA database. These core tools allow users to perform sequence alignments, allele queries and sequence searches, as well as queries more relevant to how the data are used and interpreted in a clinical setting. The third area comprises pages that link to commonly used third-party applications, such as the sequence-analysis tools at the EBI, including SRS, BLAST and FASTA.
As the primary users of the database are members of the clinical HLA community involved in transplantation of tissues and organs, the most commonly accessed tools have been written to help with their common queries. All tools are written in Perl as CGI scripts and access restricted views of the underlying Oracle relational database. The transplant and tissue typing community has two main queries: either to retrieve information on a particular allele or to view how a number of alleles differ in sequence. To answer these questions, the database provides a detailed report on any allele, as well as an interactive alignment tool to view how allelic sequences differ. The Allele Search tool provides a simple-to-use interface for retrieving allele information. The output for each allele (see Figure 1) includes the official allele designation, previously used designations and the unique IMGT/HLA accession number. Other information provided includes the date that the allele was named, its current status (as some allele designations have been deleted) and information on the individual or cell line from which the sequence was derived. Links to all component EMBL-Bank/GenBank/DDBJ entries are also included. Recently, information from the HLA Dictionary (29) has also been added to some entries. The dictionary presents the serological equivalents of HLA-A, -B, -C, -DRB1, -DRB3, -DRB4, -DRB5 allotypes. The data summarize equivalents obtained by the WHO Nomenclature Committee for Factors of the HLA System, the International Cell Exchange (UCLA), the National Marrow Donor Program (NMDP), the 13th International Histocompatibility Workshop, recent publications and individual laboratories. Any citations are also included with, wherever possible, a link to the PubMed entry for that citation. The PubMed link provides an online version of the abstract as well as links to other citations by the author and to similar papers. This is also done for any other citations that appear on the website. The final section of the output details the official nucleotide and protein sequence, as well as any genomic sequence that is available for the allele.
Figure 1. Section of an HLA allele report. The report provides cross-references to a text flat-file in SRS (HLA0001), the OMIM entry for HLA-A, the source entries in EMBL-Bank (AJ278305-Z93949) and to the seminal citations in PubMed. Other information provided ...
HLA allele sequences can differ from each other by as little as a single nucleotide substitution within a genomic sequence of 3300 bases. Such nucleotide differences between the alleles of prospective transplant donors and recipients can make the difference between a successful transplant, graft failure and death. This means that the database must be able to display this information to the user quickly and easily. The HLA community is interested in seeing the polymorphisms in terms of the changes to the sequence rather than as a list of individual single nucleotide polymorphisms (SNPs). To this end, we have developed the alignment tool, rather than requiring users to produce their own alignments for the sequences of interest or simply reporting the polymorphic positions. These alignments allow a visual interpretation of sequence similarity, so that polymorphic positions and motifs found in multiple alleles can easily be identified. Representing HLA sequences in this manner can be useful when designing reagents for HLA typing, such as primers or oligonucleotide probes, or when comparing mismatches between potential donors. The interface lets the user define a number of key variables for the alignments. These include the gene(s) to be aligned, the alleles of interest and the reference sequence they are aligned against, as well as the type of sequence to be aligned: nucleotide coding region, nucleotide genomic sequence or the amino-acid sequence of the protein. Further, specific regions such as individual exons or signal peptides can be selected. The alignment tool uses standard formatting conventions for the display of sequence alignments, and the alignments adhere to standard conventions for displaying evolutionary events and numbering.
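The idea of displaying alleles as changes relative to a reference can be sketched in a few lines. The example below is an illustration only, not the IMGT/HLA alignment tool: it assumes the sequences are already aligned to equal length, shows identity to the reference with a hyphen (the convention mentioned in the Figure 2 caption), and uses short made-up sequences with hypothetical labels (`REF`, `VAR1`) rather than real HLA data.

```python
from typing import Dict, List


def mask_identity(reference: str, allele: str) -> str:
    """Replace every position identical to the reference with '-'."""
    if len(reference) != len(allele):
        raise ValueError("sequences must be pre-aligned to the same length")
    return "".join("-" if a == r else a for r, a in zip(reference, allele))


def render_alignment(ref_name: str, reference: str,
                     alleles: Dict[str, str]) -> List[str]:
    """Return display lines: the reference in full, other alleles masked."""
    # Pad names to a common width so the sequence columns line up.
    width = max([len(ref_name)] + [len(name) for name in alleles])
    lines = [f"{ref_name:<{width}}  {reference}"]
    for name, seq in alleles.items():
        lines.append(f"{name:<{width}}  {mask_identity(reference, seq)}")
    return lines


# Toy, made-up sequences differing by a single substitution.
toy_lines = render_alignment("REF", "ATGGCCGTC", {"VAR1": "ATGGCTGTC"})
print("\n".join(toy_lines))
```

Masking identical positions leaves only the substituted base visible in the second row, which is what makes single-nucleotide differences and shared motifs easy to spot by eye.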
An example of alignments specially tailored to the HLA transplant community is the presentation of alleles with an alternative splice site. For most alleles, the nucleotide sequence displayed as a coding sequence (CDS) represents the contiguous, correctly spliced exons. For alternatively spliced alleles, the sequence displayed will contain the spliced exons plus any alternatively spliced segment that lies within the traditional exon framework, when compared to a reference sequence. Rather than being omitted, the otherwise missing sequence is included and highlighted to emphasize the region of interest, a feature important for the design of reagents that allow for typing of the alternatively spliced allele. Figure 2 illustrates how an alternatively spliced allele (A*0111N) is represented in the sequence alignments.
Figure 2. CDS HLA Allele alignment indicating the alternatively spliced allele, A*0111N. The alternatively spliced A*0111N allele is shown aligned to the reference allele A*01010101. Identity to the A*01010101 allele is shown by hyphens (-) and the exon borders ...
The previous text-only versions of the alignments are still requested and, as a result, remain available from the ANRI website and as a zipped file in the FTP directory. For users who prefer to produce their own alignments with other existing software, the FTP directory contains files in popular formats to download and import.
Recent developments and future applications
Recent developments to the website have seen the addition of a search tool for identifying primer and probe sequences. Many HLA typing laboratories that have designed their own reagents for HLA typing keep spreadsheets detailing probe-hit patterns for different alleles. These are used when typing samples to identify known alleles based on the reaction patterns seen. Each time a new release of the database was made, it was necessary to update these ever-expanding lists manually by cross-referencing the primer sequence with the sequence alignments, which, with the rapidly increasing number of alleles, was becoming a slow and laborious task. The new ‘Probe & Primer Search Tool’ allows users to enter a list of primer sequences; the tool then searches the known alleles for the presence of these sequences and reports any matches in a file format suitable for cutting-and-pasting into existing spreadsheets. The tool is currently limited to coding sequences, but as the number of genomic sequences in the database expands it will be modified to search these regions as well.
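The kind of search described here can be illustrated with a minimal sketch. It is not the ‘Probe & Primer Search Tool’ itself, whose matching rules are not described in the text: it assumes exact, case-insensitive substring matching on the forward strand only, and it writes tab-separated rows as one plausible spreadsheet-friendly format. All names (`find_primer_hits`, `hits_to_tsv`, `P1`, `ALLELE_1`) and sequences are hypothetical toy data.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, int]  # (primer name, allele name, 1-based position)


def find_primer_hits(primers: Dict[str, str],
                     alleles: Dict[str, str]) -> List[Hit]:
    """Report the first exact occurrence of each primer in each allele."""
    hits: List[Hit] = []
    for primer_name, primer_seq in primers.items():
        probe = primer_seq.upper()
        for allele_name, allele_seq in alleles.items():
            pos = allele_seq.upper().find(probe)
            if pos != -1:
                hits.append((primer_name, allele_name, pos + 1))
    return hits


def hits_to_tsv(hits: List[Hit]) -> str:
    """Format hits as tab-separated text that pastes into a spreadsheet."""
    rows = ["primer\tallele\tposition"]
    rows += [f"{primer}\t{allele}\t{pos}" for primer, allele, pos in hits]
    return "\n".join(rows)


# Toy, made-up coding sequences and primers.
toy_alleles = {"ALLELE_1": "ATGGCCGTCATG", "ALLELE_2": "ATGGCTGTCATG"}
toy_primers = {"P1": "GCCGTC", "P2": "GTCATG"}
print(hits_to_tsv(find_primer_hits(toy_primers, toy_alleles)))
```

In this toy data, `P1` spans the substituted position and so matches only `ALLELE_1`, while `P2` lies in a shared region and matches both alleles, which is the probe-hit pattern a typing laboratory would record. A realistic implementation would also need to consider reverse-complement matches and mismatch tolerance, which this sketch deliberately leaves out.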
The IMGT/HLA database is also involved in developing data format standards for HLA information exchange between the reference database, HLA typing laboratories and commercial typing-kit manufacturers. This work, which is being performed in collaboration with other immuno-informatics groups, will provide both an XML output format for the IMGT/HLA database and an XML reporting format for tissue typing laboratories. The XML output will contain similar information to that described for the flat files and allele output (30).
The rise of high-throughput genome typing has seen the expansion of genome browsers like ENSEMBL (31). These browsers have different priorities in how a gene, its alleles and any SNPs are viewed. The IMGT/HLA database is working with groups like ENSEMBL, EMBL-Bank and UniProt (32) to help define HLA references to suit all parties at the different levels through the development of Locus Reference Genomic Sequences (LRGS). A current project is to improve cross-referencing of the HLA data with that from other systems. The aim is to make sure that when users find an entry referring to an HLA allele in a third-party system, they can also find a link back to the IMGT/HLA entry for that allele, which should be considered the primary reference for the sequence.