|Home | About | Journals | Submit | Contact Us | Français|
Bounds on distances or similarity measures can be useful to help search large databases efficiently. Here we consider the case of large databases of small molecules represented by molecular fingerprint vectors with the Tanimoto similarity measure. We derive a new intersection inequality which provides a bound on the Tanimoto similarity between two fingerprint vectors and show that this bound is considerably sharper than the bound associated with the triangle inequality of the Tanimoto distance. The inequality can be applied to other intersection-based similarity measures. We introduce a new integer representation which relies on partitioning the fingerprint components, for instance by taking components modulo some integer M, and reporting the total number of 1-bits falling in each partition. We show how the intersection inequality can be generalized immediately to these integer representations and used to search large databases of binary fingerprint vectors efficiently.
Bounds on distances or similarity measures can be useful for searching large databases.1–5 In a fairly typical similarity search situation, if item is close to item and item is far from item , then one would like to infer that is also far from , thereby allowing one to prune the search space and ignore In particular, we consider the typical chemoinformatics situation where a large database of molecules represented by binary fingerprint vectors of fixed length N must be searched using the Tanimoto similarity measure.6–9 Given a molecule , we denote its fingerprint vector by . The Tanimoto similarity measure is given by
Here and everywhere else we use A, A ∩ B, A B … to denote the number of 1-bits in the vectors (or sets) ,∩, …. It is well known that D(,) is a distance.10,11 Thus, using the triangle inequality we can bound the distance D(,) by
from which one obtains the following bound on the similarity T(,)
Our goal here is to first derive another general bound on T(,) and then study its relationship to the triangle inequality. We then further generalize this bound and show how it can be applied to efficiently search large databases of chemical fingerprint vectors.
As can be seen in the Venn Diagram of Figure 1, the following relation holds:
By letting A ∩ B = β and A ∩ C = γ
Combining these inequalities, we get
Since T(,) = (B∩C)/(BC)=(B∩C)/(B+C−(B∩C)), we finally have
which we call the intersection bound or intersection inequality.
In this section, we prove that the intersection inequality is sharper than the triangle inequality. More specifically we have the following theorem.
For any fingerprints ,, and , we have the inequality
Furthermore the inequality is strict, except in degenerate cases detailed in the proof where: (1) at least one of the fingerprints is equal to ; or (2) or ; or (3) and are symmetric with respect to (i.e. B = C and β = γ). Furthermore, if two of the fingerprints are identical ( = or = or = ) then the two bounds are equal and exact.
First note that B and C play entirely symmetric roles and this symmetry can be used to reduce the number of cases that need to be examined. Second consider situations where some of the denominators in Equation 9 are equal to 0. If A B = 0, then A = B = 0 and similarly for A C = 0. The denominator on the left hand side is 0 if and only if B = 0 or C = 0. Cases where A = 0, or B = 0, or C = 0 can be examined directly and are summarized in Table 1 and Table 2.
Thus in the rest of the proof we can assume that A ≠ 0, B ≠ 0, and C ≠ 0 and that none of the denominators in Equation 9 is equal to 0. To aggregate the fractions in the right-hand side of Equation 9, it is convenient to parameterize the problem by assuming that A B = k(A C) where k is a strictly positive real number. Without any loss of generality, we can assume that k ≥ 1. By the definition of k, we have A + B − β = k(A + C − γ). This can be rewritten as:
and also as:
Equation 11 shows that under our assumptions B − β ≥ C − γ, with equality possible only if and only if k = 1. By noticing that A + C ≥ 2γ, we also have
Now we can write Equation 9 as:
By assumption here none of the denominators is equal to 0, and we have seen that min (B − β, C − γ) = C − γ. Thus after reducing to common denominators and rearranging the terms, we need to prove the equivalent relation given by:
Three cases can be distinguished depending on the size of β relative to γ and kγ. It is convenient to treat the special symmetric case k = 1 separately, and we begin with that.
k = 1. In this case, Equation 10 yields B − β = C − γ. Two subcases can be considered.
Thus in the rest of the proof, we can assume that k > 1 (in addition to A ≠ 0, B ≠ 0, and C ≠ 0).
kγ ≤ β. In this case, Equation 14 yields:
This inequality is true because B ≤ A B, and β − kγ ≤ B − C by Equation 10. Under the current assumptions, equality in Equation 15 cannot be achieved since equality could occur only in the following two situations:
γ ≤ β ≤ kγ. In this case, Equation 14 yields:
Again this inequality is true because B ≤ AB, and kγ−β ≤ B − C by Equation 10 and Equation 12. Equality between the bounds in Equation 16 can only be achieved when = . In this case, both bounds are exact and equal to C/B. To see this, under the current assumptions, equality could occur only in the following two situations:
β ≤ γ. In this case, Equation 14 yields:
Again this inequality is true because B − β + γ ≤ A B, and kγ− β ≤ B − C + 2γ− 2β by Equation 10 and Equation 12. Equality between the bounds in Equation 16 can only be achieved when = . In this case, both bounds are exact and equal to T(, = T(,) = β/(AB) = β/(AC). To see this, under the current assumptions, equality could occur only in the following two situations:
The following properties can be derived immediately.
The intersection bound does not depend on any bits in that are not in common with or .
The intersection bound can be applied to many other similarity measures S(, for binary fingerprints. Although the Tanimoto measure is by far the most widely used measure, two dozen or so other measures have been described in the literature.12 All these measures involve algebraic expressions that explicitly depend on the intersection A ∩ B. Thus, with the proper adjustments, bounds on the intersection naturally lead to bounds on these other similarity measures.
In general, the intersection inequality is considerably sharper than the triangle inequality. We have seen in the theorem above that the intersection bound is always better than the triangle inequality, except in some degenerate cases. But is it much better and worth using? The answer to this question is clearly yes as illustrated by the following two examples.
Consider the case where the three sets ,, and are disjoint, and A ≠ 0. Then Equation 9 gives
In other words, in this case the triangle inequality bound is useless and equal to 1. The intersection bound is much sharper and can diverge maximally from the triangle inequality bound by being as small as 0 when min(B,C) = 0 and max(B,C) ≠ 0.
Consider now a typical case (e.g. Daylight fingerprint system with OR-folding lossy compression) where fingerprints of length N = 512 are used. Depending on various implementation details, the average number of 1-bits per fingerprint vector could be something like 250. Consider then two fairly typical fingerprints with B = 150 and C = 350. Note that to derive a bound on T(), we are free to choose the vector (and only the vectors of all 0-bits or all 1-bits can be used to compute bounds, without making additional assumptions on and . Consider the choice = . Then Equation 9 yields
Thus in this fairly typical case the triangle inequality gives a useless bound equal to 1, whereas the intersection inequality gives a much sharper bound equal to 3/7. If, for instance, B was the query and we were interested in finding molecules with a Tanimoto similarity of at least 0.5, the intersection inequality would allow us to immediately discard the molecule from the search, without having to compute precisely the value of T(,). This is of course also the basic idea used in5 and generalized below in the Section on Applications to Efficient Database Searches.
Let us use the notation C to denote the complement of . i.e. C is the vector obtained by replacing the 0-bits of with 1-bits, and the 1-bits of with 0-bits. Then, with the usual notation, Equation 7 can also be written as
More generally, if A1,…,AM denotes any family of disjoint subsets of components (Ai ∩ Aj = 0 for any i and j) then we have immediately the generalized triangle inequality
In particular, this is true for any partition A1,…,AM of the entire set of components. Thus one important question is how to choose the partition in order to get good bounds. Below we further investigate the case where the partition is obtained by considering the fingerprint components modulo some integer M.
Let M be any integer between 1 and N and assume that to any molecule and binary fingerprint , we also attach a new integer vector of counts = (a1,…aM) of length M. The count ai represents how many 1-bits in fall on components that are congruent to i modulo M. For instance, if = (1,0,1,1,0,0) and M = 3, then = (2,0,1) because there are two 1-bits in positions congruent to 1 modulo 3 (1st and 4th), zero 1-bits in positions congruent to 2 modulo 3 (2nd and 5th), and one 1-bit in positions congruent to 3 modulo 3 (3rd and 6th).
If we consider two fingerprint vectors and , with their respective count vectors and , then and can have at most min(bi, ci) bits in common at positions that are congruent with i modulo M. Thus in this case the intersection inequality of Equation 21 can be written
resulting in the following bound on the Tanimoto similarity
Suppose that is the query fingerprint and that we are interested in retrieving from the database all the molecules satisfying T(,)>t, i.e. with a Tanimoto similarity of at least t to the query. Then, using the intersection bound, we can discard from the search all the molecules satisfying
Thus, in summary, if we are searching a database of molecules with a query and a Tanimoto similarity threshold t, we can prune the search by removing all the molecules that satisfy Equation 25 and are thereby guaranteed to have a degree of similarity to that is below the threshold t. To show that this corresponds to a very significant amount of pruning, we consider in more details the cases of M = 1 and M = 2.
Using the same example as above with fingerprints of length N = 512, consider a typical query with B = 300 and a similarity threshold of t = 0.8. Then, using the intersection inequality, we can immediately discard from the search all the molecules with C ≤ 240 = 0.8 × 300 or C ≥ 375 = 300/0.8.
More generally, Figure 2 shows the significant amount of database pruning achieved by the intersection inequality with M = 1 in a practical setting, for different Tanimoto similarity thresholds t and different queries. The curves are derived using a random subset of 50,000 molecules from the ChemDB database8,13 using Daylight-style fingerprints of length 512 (with folded-OR lossy compression). For example, for a similarity threshold of t = 0.8 and a query containing B = 200 1-bits, on average 80% of the database can be pruned from the search.
The case M = 2 subsumes the case M = 1 and allows one to prune additional molecules. In this case, we have = (b1,b2) and = (c1,b2). Here b1 represents the number of 1-bits in falling on odd-numbered components, b2 the number of 1-bits in falling on even-numbered components, and similarly for . Obviously, b1+b2 = B and c1 + c2 = C. When M = 2, the intersection bound allows one to prune all the molecules such that
This yields four subcases:
The first two subcases correspond to the pruning derived when M = 1. The last two subcases correspond to new additional pruning.
To see this, consider the example given above in the case of M = 1 with a query with B = 300 and a similarity threshold of t = 0.8. By using the M = 1 level corresponding to the first two subcases, one needs to focus only on candidate molecules satisfying 240 < C < 375. By using M = 2 and the last two subcases, one can prune additional molecules that fall within this interval. For instance, assume that = (200,100). Then even among the molecules containing the same number of 1-bits as the query (i.e. C = B = 300), we can eliminate all of those associated with = (c1,c2) for which c1 ≤ 166 (subcase 4) or c2 ≤ 66 (subcase 3).
In short, it should be clear that the greater the value of M, the greater the amount of pruning associated with the intersection bound and Equation 25. While the amount of pruning is an essential consideration, other equally important factors must be taken into account in an actual database implementation. For instance, larger values of M may require additional storage space and computing time. Furthermore, the data structure required to implement the pruning algorithm is also an important consideration. While a complete analysis of these issues is beyond the scope of this paper, it is worth noting that the cases of M = 1 and M = 2 suggest a natural and efficient data structure. First, organize the fingerprints in the database into bins containing all the fingerprints that have the same number C of 1-bits. Organize the bins by increasing values of C. Second, within any such bin, organize the fingerprints by increasing values of, for instance, c1 corresponding to the number of bits falling on odd components. For a given query and threshold t, one can first use the intersection inequality with M = 1 to discard all the bins associated with values of C that are too large or too small and focus on bins with C in a limited range. Then, for these remaining bins, one can use the intersection inequality with M = 2 to discard additional molecules and focus on sub-bins of molecules with c1 in a limited range. This procedure can be further repeated using, for instance, M = 4 raising again interesting questions of optimality and tradeoffs.
Formulae for estimating the amount of pruning for different values of M and different thresholds t have been derived, but will be given elsewhere, together with an analysis of the corresponding tradeoffs. Yet, even without those analyses, it is clear that the intersection bound is a simple but powerful tool for organizing and searching large databases of fingerprint vectors.
Work supported by NIH Biomedical Informatics Training grant (LM-07443-01), NSF MRI grant (EIA- 0321390), and NSF grant 0513376 to PB.