The proposed method comprises three steps.
1. Selection of candidate probes: Select all short subsequences (n-mers) of length 16 to 22 nucleotides (16 ≤ n ≤ 22) containing each SNP in the allelic sequences, producing the set of all candidate probes.
2. Identification of ultraspecific probes: Exclude from consideration all candidate probe sequences that are present in, or can be converted to, a subsequence in the synthetic background given any combination of two or more base changes, as such sequences are likely to have a high probability of mispairing with non-target DNA present in the sample. The result of this step, the most computationally intensive of the three, is the set of ultraspecific candidate probes: those sequences that are "distant" from the background collection.
3. Selection of optimal set of ultraspecific probes: Determine the smallest set of ultraspecific probe sequences such that each known allele is expected to hybridize with a specific subset of probes.
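Step 1 can be illustrated with a short sketch. It assumes each allele is available as a plain string and each SNP as a 0-based position in that string; the function name `candidate_probes` and its parameters are hypothetical and are not taken from the implementation used in this work.

```python
def candidate_probes(allele_seq, snp_pos, min_len=16, max_len=22):
    """Return every n-mer (min_len <= n <= max_len) of allele_seq that
    covers the 0-based position snp_pos."""
    probes = set()
    for n in range(min_len, max_len + 1):
        # A window [start, start + n) covers snp_pos when
        # snp_pos - n + 1 <= start <= snp_pos; it must also fit in the sequence.
        first = max(0, snp_pos - n + 1)
        last = min(snp_pos, len(allele_seq) - n)
        for start in range(first, last + 1):
            probes.add(allele_seq[start:start + n])
    return probes


# Toy example with 3-mers instead of 16- to 22-mers:
# the SNP sits at index 2 ("G") of "ACGTAC".
toy = candidate_probes("ACGTAC", snp_pos=2, min_len=3, max_len=3)
```

Because both the window length and the window offset vary, a single SNP in a single allele yields many candidate probes, which is why the subsequent filtering step dominates the running time.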
For each of the 889 alleles within the HLA-B locus, the set of all 16- through 22-mer sequences was identified as candidate probe sequences. Thus, multiple n-mers containing the same SNP for the same allele were identified, in which both the location of the SNP within the n-mer and the length of the n-mer vary. To use these probes as markers of the HLA-B-specific polymorphisms, each candidate probe was compared to the background collection: the remainder of the human genome, which includes the other highly similar HLA loci (e.g. HLA-A has 89% sequence similarity and HLA-C has 93% sequence similarity with HLA-B) and known polymorphisms outside of the HLA-B region. With regard to the ability to detect the set of alleles, all of the alleles contain 16- through 22-mers that are at least 1 change away from the background. Table 1 lists the percentage of the alleles which can be detected by probes of a particular size and distance from the background; a complete listing of the number of candidate probes for each of the individual 889 alleles and distance values can be found in Additional File 1. To reduce the likelihood that probes would hybridize to non-target sequence by tolerating single-base mispairing, we stipulate that ultraspecific n-mers must be 2 or more changes away from the background.
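The notion of being "k changes away from the background" can be sketched as a minimum mismatch count over all same-length background windows. The sketch assumes that a "change" is a single-base substitution (Hamming distance) and uses a brute-force scan; the function names are hypothetical, and a genome-scale background would need an indexed search rather than this quadratic loop.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def distance_from_background(probe, background):
    """Smallest number of substitutions needed to turn probe into some
    window of a background sequence. Returns len(probe) if no window of
    that length exists (the worst case)."""
    n = len(probe)
    best = n
    for seq in background:
        for i in range(len(seq) - n + 1):
            best = min(best, hamming(probe, seq[i:i + n]))
            if best == 0:
                return 0  # exact match; cannot do better
    return best


def ultraspecific(probes, background, min_changes=2):
    """Keep probes that are at least min_changes substitutions away."""
    return [p for p in probes
            if distance_from_background(p, background) >= min_changes]


# Toy 4-mers against a toy background.
kept = ultraspecific(["ACGT", "ACGA", "TTTT"], ["ACGTACGT"])
```

With `min_changes=2`, a probe that matches the background exactly or after one substitution is discarded, which mirrors the requirement stated above.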
| Table 1. Percentage of the 889 alleles that have at least one probe sequence of each length, calculated for each distance from the background collection sequences. |
While these ultraspecific probe sequences, when incorporated in a microarray assay, are expected to detect all of the 889 alleles considered, it is not guaranteed that a single assay could distinguish between different alleles. Thus, the set of ultraspecific probe sequences was analyzed using a greedy algorithm (see Methods) to generate the minimum set of ultraspecific probes required such that all of the alleles can be detected with the highest typing resolution possible. Based upon the results of our computations, a probe set of 115 ultraspecific probes was identified. Probes were selected for the set based upon their ability both to detect the 889 alleles and to provide resolution for distinguishing between alleles. Each allele is expected to hybridize with between two and 19 different probes in this set. While each ultraspecific probe can distinguish one or more targets, not all members of the set provide the ability to identify a particular allele (distinguishability). This set will produce distinctive hybridization patterns for 634 of the 889 alleles, such that 71.32% of the alleles can be not only detected but also definitively typed; the remaining 255 alleles, while detectable by this probe set, will not hybridize to any unique probe or unique combination of ultraspecific probe sequences.
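The selection idea can be illustrated with a small greedy sketch. This is not the algorithm described in Methods; it assumes a hypothetical input `hits` mapping each probe to the set of alleles it is expected to hybridize with, and at each step adds the probe that most improves first the number of detected alleles and then the number of distinct hybridization patterns.

```python
def greedy_probe_set(hits, alleles):
    """hits: dict probe -> set of alleles expected to hybridize.
    Returns (chosen probes in selection order, allele -> pattern tuple)."""
    patterns = {a: () for a in alleles}  # one boolean per chosen probe

    def score(pats):
        detected = sum(1 for p in pats.values() if any(p))
        distinct = len(set(pats.values()))
        return (detected, distinct)  # compared lexicographically

    chosen, current, remaining = [], score(patterns), set(hits)
    while remaining:
        best_probe, best_score, best_pats = None, current, None
        for probe in sorted(remaining):  # sorted for reproducible ties
            trial = {a: pat + (a in hits[probe],)
                     for a, pat in patterns.items()}
            s = score(trial)
            if s > best_score:
                best_probe, best_score, best_pats = probe, s, trial
        if best_probe is None:
            break  # no remaining probe adds detection or resolution
        chosen.append(best_probe)
        patterns, current = best_pats, best_score
        remaining.discard(best_probe)
    return chosen, patterns


# Toy data: p1 detects every allele; p2 and p3 add resolution; p4 is redundant.
toy_hits = {"p1": {"A", "B", "C"}, "p2": {"A"}, "p3": {"B"},
            "p4": {"A", "B", "C"}}
chosen, patterns = greedy_probe_set(toy_hits, ["A", "B", "C"])
```

A greedy choice of this kind does not guarantee a globally minimal set, but it stops as soon as no further probe improves detection or resolution, so redundant probes such as `p4` in the toy data are never added.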
As shown in Figure , the first five probes selected in the set are able to detect alleles but do not provide any distinction between the alleles detected; with the inclusion of the sixth probe, however, it becomes possible to distinguish the expected hybridization patterns of the B_1303 and B_560502 alleles from those of all other alleles considered. While the complete set of 115 ultraspecific probes can recognize 634 alleles, 40 of these ultraspecific probes are able to individually distinguish between 509 alleles (57.26%). Additional File 2 lists each of the ultraspecific probe sequences included in the set. The percentage of the 634 distinguishable alleles with which each of these probes is expected to hybridize is shown in Figure . As this figure shows, nearly half (49.57%) of the probes included in the set are expected to hybridize with only a single allele, thus generating a unique hybridization pattern by which that allele can be distinguished from all of the other alleles considered.
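The two summary quantities discussed here, the alleles with a unique hybridization pattern and the share of probes that hit a single allele, can be computed from the expected patterns as in the following sketch. The input layout (a dictionary from allele name to the set of probes it hybridizes with) and the function names are assumptions made for illustration.

```python
from collections import Counter


def typing_summary(patterns):
    """patterns: dict allele -> set of probes expected to hybridize.
    An allele counts as typed when its pattern is non-empty and shared
    with no other allele; all other alleles are reported as ambiguous."""
    keys = {a: frozenset(p) for a, p in patterns.items()}
    counts = Counter(keys.values())
    typed = sorted(a for a, k in keys.items() if k and counts[k] == 1)
    ambiguous = sorted(a for a in keys if a not in typed)
    return typed, ambiguous


def single_allele_probe_fraction(patterns):
    """Fraction of probes that hybridize with exactly one allele."""
    hits = Counter(p for probes in patterns.values() for p in set(probes))
    if not hits:
        return 0.0
    return sum(1 for n in hits.values() if n == 1) / len(hits)


# Toy data: a3 and a4 share a pattern, so neither can be typed.
toy = {"a1": {"p1", "p2"}, "a2": {"p1", "p3"},
       "a3": {"p1", "p4"}, "a4": {"p1", "p4"}}
typed, ambiguous = typing_summary(toy)
```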
In silico assays for the 115 ultraspecific probe set and each of the alleles were conducted. As one would expect, it is the groups of highly similar alleles that the ultraspecific probe set is unable to distinguish between. In order to visualize the degree of similarity between the alleles based upon their expected hybridization patterns, we calculated the distance (D) between any two patterns as:
where n1 and n2 are the numbers of probes present in each of the alleles being compared and n12 is the number of probes present in both alleles simultaneously. By computing the distances between each pair of the 889 patterns (the distance matrix) and passing them to PHYLIP's neighbor program [22], we were able to group alleles, using the publicly available NJPlot software package [23], based on the distances between patterns observed on the microarray. The tree was generated for the set of alleles which cannot be individually typed. Figure illustrates a subtree in this tree; the NEWICK file for the complete tree can be found in Additional File 3.
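Preparing the input for PHYLIP's neighbor program amounts to writing a square distance matrix with a taxon count on the first line and a 10-character name field on each row. The sketch below shows that step only. Since the expression for D does not appear in this text, the distance is passed in as a function, and the `jaccard_distance` default is a placeholder assumption built from n1, n2 and n12, not the measure used in this work; both function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Placeholder distance (an assumption, not the paper's D):
    1 - n12 / (n1 + n2 - n12) for two sets of probes."""
    n1, n2, n12 = len(a), len(b), len(a & b)
    union = n1 + n2 - n12
    return 0.0 if union == 0 else 1.0 - n12 / union


def phylip_matrix(patterns, dist=jaccard_distance):
    """patterns: dict allele -> set of probes. Returns the text of a
    square PHYLIP distance matrix. Names are cut or padded to 10
    characters, so longer names must stay unique after truncation."""
    names = sorted(patterns)
    lines = [f"{len(names):5d}"]
    for a in names:
        row = " ".join(f"{dist(patterns[a], patterns[b]):.4f}"
                       for b in names)
        lines.append(f"{a[:10]:<10}{row}")
    return "\n".join(lines) + "\n"


# Toy patterns for two alleles named in the text; the probe names are made up.
text = phylip_matrix({"B_1303": {"p1", "p2"}, "B_560502": {"p1"}})
```

The resulting text can be saved to a file and supplied to neighbor, whose output tree in NEWICK format can then be opened in a tree viewer.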