The IMGT/HLA database was established to provide a locus-specific database (LSDB) for the allelic sequences of the genes in the HLA system, also known as the human Major Histocompatibility Complex (MHC). This complex of >4 Mb is located within the 6p21.3 region of the short arm of human chromosome 6 and contains in excess of 220 genes (1). The core genes of interest in the HLA system are 21 highly polymorphic HLA genes, whose protein products mediate the host response to infectious disease and influence the outcome of cell and organ transplants. With a nomenclature spanning over 50 genes and currently over 5000 alleles, there is an obvious need for an LSDB to curate these highly polymorphic variants. The sequencing of HLA alleles began in the late 1970s, predominantly using protein-based techniques to determine the sequences of HLA class I allotypes. The first complete HLA class I allotype sequence, B7.2, now known as B*07:02:01, was published in 1979 (2). The first HLA class II allele defined by DNA sequencing, DRA*01:01, followed in 1982 (3). The first HLA DNA sequences or alleles were named by the WHO Nomenclature Committee for Factors of the HLA System (4) in 1987. At that time 12 class I alleles and nine class II alleles were named; by comparison, in the first 8 months of 2010 the Nomenclature Committee assigned names to 1165 alleles.
The dissemination of new allele names and sequences is of paramount importance in the clinical setting. The first public release of the IMGT/HLA database was made on 16 December 1998 (5). Since then the database has been updated every 3 months, in a total of 51 releases, to include all the publicly available sequences officially named by the WHO Nomenclature Committee at the time of release.
The database was first available as the HLA Sequence Databank (HLA-DB) (6), which allowed the periodic publication of HLA class I (7–10) and class II (11–16) sequence alignments in a variety of journals. By 1995, the first distribution of the HLA sequence alignments was made online through the web pages of the Tissue Antigen Laboratory at the Imperial Cancer Research Fund (ICRF), London, UK. This work was transferred to the Anthony Nolan Research Institute (ANRI) in 1996, where it continues to this day as part of the IMGT/HLA database and the hla.alleles.org web site.
IMGT/HLA data sources
The IMGT/HLA database receives submissions from laboratories across the world (Figure 1). These submissions are curated and analyzed, and if they meet the strict requirements an official allele designation is assigned. The IMGT/HLA database is the official repository for the WHO Nomenclature Committee for Factors of the HLA System, and is the only way of receiving an official allele designation for a sequence. The sequence is then incorporated into the next 3-monthly release of the database. Since its release in December 1998 the database has received nearly 9000 submissions, from around 600 submitters. These submissions come from a variety of sources; the majority are from routine HLA typing laboratories or commercial organizations performing contract HLA typing for large haematopoietic stem cell donor registries. Further data have been submitted following large-scale genome sequencing projects. All submissions must meet strict acceptance criteria before the sequence receives an official designation; ~3% of the submissions received fail to meet these criteria and are rejected. In addition, all the submissions received by the IMGT/HLA database are also available from the EMBL-Bank/GenBank/DDBJ collaboration (17–19). The EMBL-Bank entries also contain database cross-references to the IMGT/HLA entries.
Figure 1. World map showing the source and volume of IMGT/HLA submissions by country.
The past few years have seen a dramatic increase in the number of submissions received and processed, with the number of novel allele sequences identified each year rising rapidly from around 300 in 2008 to over 1000 in 2009. This trend looks set to continue, with over 1200 novel alleles being reported in the first 9 months of 2010 (Figure 2). The increase reflects the greater affordability and availability of sequencing-based typing (SBT) as the method of choice for HLA typing; a consequence of this high-resolution typing is the determination of many novel HLA sequences. A notable increase in volume has come from sequences originating from China. Prior to 2008, the database had only 28 submitters located in China; we now have over 70. The volume of submissions has also increased: up to 2008, we averaged only 18 submissions a year from China, whereas we are now averaging nearly 200 a year, a 10-fold increase.
Figure 2. Graph of the number of submissions to the IMGT/HLA database by year. The recent surge in the number of submissions received by the database is clearly shown. The values listed for 2010 are up to the end of September 2010, and do not represent a full year.
Another change in the data source has been the type of submission received. In the early days of the database, we received very few full-length or genomic sequences. Now, with improved sequencing techniques, we receive a much larger number of both full-length and genomic sequences covering a range of genes. These submissions cover both new and confirmatory sequences, and the database welcomes both. Confirmatory sequences are important as they verify the existence of the single nucleotide polymorphisms (SNPs) found in many novel alleles. The confirmatory sequences often extend the sequence of an allele beyond that currently held in the database, where many allele sequences cover only the minimum length required. Over the last 2 years just under 40% of the submissions to the database have been confirmatory sequences.
The increase in the number of submissions has been accompanied by a change in the type of new alleles seen. Over 97% of new alleles now being submitted are derived from SNPs. In contrast, in 2000, ~20% of new alleles identified were based on motif shuffling. This is most likely due to the methods used to identify alleles at that time, which were largely based on sequence-specific oligonucleotide probes (20). Nowadays sequencing-based typing methods are used extensively to perform HLA typing, and this allows for the easy identification of novel SNPs (Figure 3).
Figure 3. Heat maps of the polymorphic amino acid positions in HLA-B. The two sets of maps show the increase in the number of polymorphic positions identified between the first release of the database in 1998 (A) and the latest release in 2010 (B).
New HLA nomenclature
In April 2010, the official nomenclature used to name HLA alleles was changed (21). The changes were needed because the existing system could no longer cope with the number of allele variants found in some allele families. The convention of using a four-digit code to distinguish HLA alleles that differed in the proteins they encoded was introduced in the 1987 HLA Nomenclature Report (4). Since that time additional digits have been added, and prior to the change an allele name could be composed of four, six or eight digits, depending on its sequence. Each pair of digits described one aspect of the allele. The first two digits described the allele family, which often corresponds to the serological antigen carried by the allotype. The third and fourth digits were assigned in the order in which the sequences had been determined. Alleles whose numbers differed in the first four digits differed by one or more nucleotide substitutions that changed the amino-acid sequence of the encoded protein. Alleles that differed only by synonymous nucleotide substitutions within the coding sequence were distinguished by the use of the fifth and sixth digits. Alleles that differed only by sequence polymorphisms in introns or in the 5′- and 3′-untranslated regions that flanked the exons and introns were distinguished by the use of the seventh and eighth digits. To deal with the ever-increasing number of HLA alleles described, it was decided to introduce colons (:) into the allele names to act as delimiters of the separate fields.
For some users the changes to the nomenclature were minor; for others, such as HLA typing laboratories and donor registries, the change had a major impact on their informatics systems. The IMGT/HLA database helped to co-ordinate the move to the new nomenclature by providing conversion lists and tools to help identify alleles in both the new and old nomenclature. The nomenclature officially changed on 1 April 2010. To aid our users in preparing for this change, the database provided conversion tables for 9 months prior to the release. These tables allowed users to see what the changes would be and how they would affect their own systems. The database also provided online tools for the conversion of allele names, as well as links to external software designed for the conversion of large data sets from the old to the new nomenclature (22).
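As a rough illustration of what such a conversion involves, the following Python sketch inserts colons between the digit pairs of an old-style name. It is a simplification under stated assumptions, not the database's conversion tool: the function name `old_to_new` and the `overrides` table are hypothetical, and the sketch assumes the old name is a gene followed by exactly four, six or eight digits. Any name whose new form is not a simple pair-by-pair split would have to come from the published conversion tables, represented here by the optional lookup.

```python
import re
from typing import Dict, Optional

# Assumed old-style shape: GENE* followed by 4, 6 or 8 digits.
_OLD = re.compile(r"^([A-Za-z][A-Za-z0-9]*)\*(\d{4}|\d{6}|\d{8})$")


def old_to_new(old_name: str, overrides: Optional[Dict[str, str]] = None) -> str:
    """Convert a digit-only allele name to a colon-delimited one.

    `overrides` stands in for an official conversion table and is
    consulted first; the mechanical split is only a fallback.
    """
    if overrides and old_name in overrides:
        return overrides[old_name]
    match = _OLD.match(old_name)
    if match is None:
        raise ValueError(f"unrecognised old-style allele name: {old_name!r}")
    gene, digits = match.groups()
    # Each pair of digits becomes one colon-delimited field.
    pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return f"{gene}*{':'.join(pairs)}"
```

With no overrides, `old_to_new("B*070201")` returns `B*07:02:01`. Checking the lookup table before applying the mechanical rule reflects the text's emphasis on conversion tables: the split is safe only for names that follow the pair pattern.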
Further information on HLA nomenclature can be found at the IMGT/HLA database’s sister site, http://hla.alleles.org. This site concentrates on HLA nomenclature, whereas the IMGT/HLA database is more focussed on sequence data. There is some overlap between the sites, but because each has a different prime focus, each can deliver a set of data and downloadable content that may not be suitable for the other.
Tools available at IMGT/HLA
The IMGT/HLA database provides a large number of tools for the analysis of HLA sequences. These tools are either custom written for the database or are incorporated into existing tools on the EBI web site (23):
- Sequence alignments—access to an alignment tool that filters pre-generated alignments to the user’s specification. Provides alignments at the protein, cDNA and gDNA level.
- Allele queries—access to detailed information on any HLA allele, including information on the ethnic origin of the source, database cross-references and seminal publications. This information is also available through integration with EBI’s SRS search engine (25).
- Sequence search tools—integration into EBI’s suite of search tools including FASTA (26) and BLAST (27).
- Downloads—access to an FTP directory containing all the data from the current and previous releases in a variety of commonly used formats such as FASTA, MSF and PIR.
- Cell queries—a detailed, searchable database of all the source material characterized in the submissions.
- Primer search tools—a simple search tool allowing users to update primer hit pattern tables against each release of the database.
- Ambiguous allele combinations—the use of SBT as a method for defining the HLA type is well documented. Most SBT typing strategies currently employed use the exon 2 and 3 sequences for HLA class I analysis and exon 2 alone for HLA class II analysis. Because SBT analyses both alleles of a heterozygous sample together, different pairs of alleles may give the same, ambiguous typing result. The document includes a list of all alleles that are identical over exons 2+3 for HLA class I and exon 2 for HLA class II.
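The ambiguity described in the last item can be made concrete with a small Python sketch. It models an unphased heterozygous read as the set of bases observed at each position, which is a simplifying assumption about SBT output. The function names and the four toy alleles, including their names and sequences, are invented for illustration and are not real HLA data.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Read = Tuple[FrozenSet[str], ...]


def heterozygous_read(seq_a: str, seq_b: str) -> Read:
    """Unphased read: the set of bases seen at each position of two alleles."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must cover the same region")
    return tuple(frozenset(pair) for pair in zip(seq_a, seq_b))


def ambiguous_combinations(alleles: Dict[str, str]) -> List[List[Tuple[str, str]]]:
    """Group allele pairs that produce an identical unphased read."""
    by_read: Dict[Read, List[Tuple[str, str]]] = defaultdict(list)
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(alleles.items()), 2):
        by_read[heterozygous_read(seq_a, seq_b)].append((name_a, name_b))
    # Only reads explained by more than one allele pair are ambiguous.
    return [pairs for pairs in by_read.values() if len(pairs) > 1]


# Invented toy alleles over a four-base region (not real HLA sequences).
TOY_ALLELES = {"X*01": "ACGT", "X*02": "ATGA", "X*03": "ACGA", "X*04": "ATGT"}
```

Here `ambiguous_combinations(TOY_ALLELES)` reports that the pairs `X*01`+`X*02` and `X*03`+`X*04` cannot be told apart: both give C/T at the second position and T/A at the fourth, and the read does not record which bases sit on the same chromosome.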