We have developed two sample preparation methods and a systematic algorithm for class I HLA typing using next-generation sequencing. These methods leverage the longer read lengths and clonal amplification inherent to the 454 sequencing technology with the Titanium chemistry. Longer reads (>300 bp) capture critical phase information of nearby DNA variants, and enable accurate sequence alignment to the polymorphic HLA region. This information enables differentiation between highly similar alleles and is an important advantage over capillary-based methods.
The ability to capture DNA sequences with high depth of coverage and with phase information enabled us to confidently detect and correct errors in the gold standard database, originally typed using SSO hybridization. The high depth of coverage of 454 sequences minimizes genotyping errors, and the long read lengths and use of locus-specific primer sequences together help prevent alignment errors. Compared to SSO hybridization, 454 sequencing holds two main advantages. First, 454 generates many DNA reads for each chromosome while SSO generates only one intensity-based read, so 454 can correct for sequencing errors while SSO cannot. Second, reads from 454 sequencing contain phase information that can be used to resolve ambiguous allele combinations, while SSO reads do not. These advantages allowed us to confidently correct errors in the SSO-based gold standard, and the corrections were verified by the laboratory that performed the original SSO experiments.
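The role of phase can be illustrated with a toy example. The sketch below is not the typing algorithm used in this work; the four two-base alleles, the read counts, and the `min_support` threshold are hypothetical values chosen only for illustration. It shows that an unphased observation at two heterozygous positions is consistent with two different allele pairs, whereas reads spanning both positions leave only one.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Hypothetical toy alleles, each described by the bases at two variable positions.
ALLELES = {
    "allele_1": "AC",
    "allele_2": "GT",
    "allele_3": "AT",
    "allele_4": "GC",
}
ALL_PAIRS = list(combinations_with_replacement(sorted(ALLELES), 2))


def unphased_signature(pair):
    """Set of bases seen at each position, ignoring which chromosome carries them."""
    first, second = (ALLELES[name] for name in pair)
    return tuple(frozenset(column) for column in zip(first, second))


def pairs_from_unphased(observed):
    """Allele pairs consistent with per-position base sets only (no phase)."""
    return [pair for pair in ALL_PAIRS if unphased_signature(pair) == observed]


def pairs_from_phased_reads(reads, min_support=10):
    """Allele pairs whose two haplotypes are exactly the well-supported reads.

    min_support is an illustrative threshold for discarding rare error reads.
    """
    counts = Counter(reads)
    supported = {hap for hap, n in counts.items() if n >= min_support}
    return [
        pair for pair in ALL_PAIRS
        if {ALLELES[pair[0]], ALLELES[pair[1]]} == supported
    ]


# Unphased view: position 1 shows A and G, position 2 shows C and T.
observed = (frozenset("AG"), frozenset("CT"))
# Phased view: reads spanning both positions, including two error reads.
reads = ["AC"] * 55 + ["GT"] * 48 + ["AT"] * 2

print(pairs_from_unphased(observed))   # two candidate pairs remain
print(pairs_from_phased_reads(reads))  # a single pair remains
```

In this toy case the many reads per haplotype also let the two low-count error reads be discarded, which mirrors the first advantage described above.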
Related methods for performing HLA typing using 454 sequencing were recently developed by Gabriel et al., Lank et al. and Bentley et al. [16]. All of these methods employ a PCR primer modified with both barcode and adaptor sequences [16], resulting in significantly more expensive primers (requiring HPLC cleanup) for DNA library amplification than those in our method. In addition, we multiplex 95-96 barcoded samples into a single lane of an 8-region PicoTiter Plate (PTP), allowing for a maximum of 760 samples per run compared to 24-48 samples per run in the method by Bentley et al. [16]. The current estimate for the minimum coverage required for accurate HLA typing by our HLA caller is 20 reads, whereas our samples received on average >100 reads per exon. This leaves open the possibility of sequencing additional amplicons per sample, or of multiplexing more samples per sequencing run in the future if necessary. The longer read lengths provided by the FLX Titanium technology also permit more flexibility in primer selection than the FLX technology used in the Bentley et al. study. See Additional File 10 for a comprehensive comparison of these methods to ours.
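The throughput and coverage figures quoted above follow from simple arithmetic, sketched here. The function names are hypothetical, and the observed depth is entered as 100 reads although the text reports it as a lower bound (>100), so the headroom shown is a minimum.

```python
def samples_per_run(samples_per_region, regions_per_plate):
    """Samples that fit on one plate when every region carries the same pool size."""
    return samples_per_region * regions_per_plate


def coverage_headroom(observed_reads, required_reads):
    """Ratio of observed depth to the minimum depth needed for typing."""
    return observed_reads / required_reads


# 95 barcoded samples in each of the 8 PTP regions.
print(samples_per_run(95, 8))
# At least 100 reads per exon observed, 20 reads required by the caller.
print(coverage_headroom(100, 20))
```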
While sequencing HLA exons 2 and 3 is not sufficient for distinguishing between all known HLA genotypes, it provides an extraordinary amount of identifying information, especially when combined with allele frequency information. Using the HLA database, Lank et al. (2010) predict that exons 2, 3, and 4 theoretically differentiate 85% of all known alleles; however, this figure likely underestimates the true typing capacity, since they did not consider the effect of allele frequency on the incidence of uniquely identifiable alleles. We were able to achieve 96.4% to 98.6% empirical 4-digit typing accuracy by sequencing only exons 2 and 3 and using frequency information to differentiate between ambiguous HLA genotypes. Without frequency information, it is difficult to identify the true HLA types, but it is possible to generate a short list (average 3 to 6) of candidate HLA allele pairs from a total of 34,584 possibilities at HLA-A, 143,648 at HLA-B, and 48,672 at HLA-C. In the vast majority of cases, the true allele pair can be differentiated from the other candidates because it has the highest multiplicative frequency. In the rare instance that the HLA genotype remains ambiguous after using frequency information, a list of top candidate alleles may be provided. Additional exons may be sequenced if high-resolution HLA types are desired, for example to type individuals in rare populations or to discover new HLA alleles.
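The frequency-based ranking of candidate allele pairs described above can be sketched as follows. This is a minimal illustration, not the HLA caller itself: the allele names and frequencies are made-up placeholders, `rank_allele_pairs` is a hypothetical helper, and treating an allele with no recorded frequency as having frequency 0 is an assumption made for the sketch.

```python
def rank_allele_pairs(candidate_pairs, allele_freq):
    """Sort candidate allele pairs by the product of their population frequencies.

    Alleles missing from allele_freq are scored as 0.0 (an assumption here).
    Returns a list of (pair, score) tuples, highest score first.
    """
    scored = [
        (pair, allele_freq.get(pair[0], 0.0) * allele_freq.get(pair[1], 0.0))
        for pair in candidate_pairs
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up frequencies for four placeholder alleles.
allele_freq = {
    "allele_a": 0.20,
    "allele_b": 0.10,
    "allele_c": 0.01,
    "allele_d": 0.005,
}
# Candidate pairs that all match the same exon 2 and 3 sequence and phase.
candidates = [
    ("allele_c", "allele_d"),
    ("allele_a", "allele_b"),
    ("allele_a", "allele_d"),
]

ranked = rank_allele_pairs(candidates, allele_freq)
for pair, score in ranked:
    print(pair, score)
```

Returning the full ranked list rather than only the top pair keeps every matching candidate available for reporting when the genotype remains ambiguous.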
One concern is that the incorporation of population frequency information may bias the HLA typing toward a few specific HLA alleles. The definitive solution for preventing report bias is to sequence the entire HLA gene to remove all ambiguities in the genotype background. However, this approach is not yet cost effective for most technologies and may hinder multiplexing throughput and efficiency. The interim solution is to report all alleles that match the particular genotype background (nucleotide sequence) and phase, with one pair of alleles designated as the most likely given the expected frequencies in the population in question. Another concern may be the accuracy of allele frequencies. All frequency information was collected from http://www.allelefrequencies.net, a central resource that has aggregated HLA type frequencies from hundreds of sources including blood banks, research facilities, and other studies [19]. To our knowledge, this is the most reliable source of HLA (and KIR) allele frequencies available. The problem of allele skew is common to all HLA typing methods that do not sequence the entire gene; however, report bias may be prevented by reporting all matching alleles in addition to the most likely alleles with frequency information.
Our sequencing methods may be modified to accommodate additional amplicons at the HLA loci (e.g. class I exon 4, or exons in HLA class II). Because the average depth of coverage in the current method (>100 reads per site) exceeds the estimated coverage required for accurate typing (10-20 reads), a small number of additional amplicons (e.g. exon 4 of HLA-A, -B, and/or -C) may be accommodated in the current multiplexing protocol. Efforts are also underway to develop class II HLA typing capacity. The addition of a significant number of amplicons (1 to 2 for each of HLA-DRB1, -DQA, -DQB, -DPA, -DPB) may require decreasing the multiplexing rate by about half to maintain a similar depth of coverage; however, such a change would be well compensated by the high efficiency of sample multiplexing and the cost savings from pre-barcoded primers. The main challenge in class II is to design primers that target specific class II HLA genes instead of others in the same homologous family (HLA-DRB1, -DRB2, -DRB3, etc.).
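The trade-off between amplicon number, multiplexing rate, and depth can be approximated with a back-of-the-envelope calculation. The sketch below assumes that reads are divided evenly among samples and amplicons, and that the current protocol uses 6 amplicons (exons 2 and 3 at each of HLA-A, -B, and -C); both are simplifying assumptions rather than details taken from the protocol, and `expected_depth` is a hypothetical helper.

```python
def expected_depth(base_depth, base_amplicons, new_amplicons, multiplex_fraction=1.0):
    """Approximate per-amplicon depth after changing the amplicon count.

    Assumes a fixed number of reads shared evenly across samples and amplicons.
    multiplex_fraction is the new number of samples per region divided by the
    current number (0.5 means half as many samples are pooled).
    """
    if new_amplicons <= 0 or multiplex_fraction <= 0:
        raise ValueError("new_amplicons and multiplex_fraction must be positive")
    return base_depth * base_amplicons / (new_amplicons * multiplex_fraction)


# Adding exon 4 at HLA-A, -B, and -C: 6 -> 9 amplicons, same multiplexing.
print(expected_depth(100, 6, 9))
# Adding 10 class II amplicons: 6 -> 16 amplicons, same multiplexing.
print(expected_depth(100, 6, 16))
# Same 16 amplicons with half as many samples pooled per region.
print(expected_depth(100, 6, 16, multiplex_fraction=0.5))
```

Under these assumptions, three extra amplicons leave the depth well above the 10-20 reads required, while ten extra amplicons bring it to roughly 38 reads unless the number of pooled samples is about halved.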
We have described two sample preparation methods that differ in the pooling and barcoding strategies employed. Both methods represent facile approaches to amplification and HLA class I genotyping. If large numbers of samples (thousands) are to be processed, the PCR-based barcoding method may be preferable, as it facilitates pooling of all amplicons directly after PCR, meaning that only a single library construction is required downstream. These changes significantly cut the cost and increase the throughput of HLA typing. Using this process, a single technician can amplify, pool and adapt up to 96 samples in two days. Class I HLA typing of samples prepared in this way costs less than $40 per sample, as calculated using our all-in, end-to-end internal accounting model.