J Chem Inf Model. Author manuscript; available in PMC 2010 August 1.
PMCID: PMC2758932
NIHMSID: NIHMS132611

An Intersection Inequality Sharper than the Tanimoto Triangle Inequality for Efficiently Searching Large Databases

Abstract

Bounds on distances or similarity measures can be useful to help search large databases efficiently. Here we consider the case of large databases of small molecules represented by molecular fingerprint vectors with the Tanimoto similarity measure. We derive a new intersection inequality which provides a bound on the Tanimoto similarity between two fingerprint vectors and show that this bound is considerably sharper than the bound associated with the triangle inequality of the Tanimoto distance. The inequality can be applied to other intersection-based similarity measures. We introduce a new integer representation which relies on partitioning the fingerprint components, for instance by taking components modulo some integer M, and reporting the total number of 1-bits falling in each partition. We show how the intersection inequality can be generalized immediately to these integer representations and used to search large databases of binary fingerprint vectors efficiently.

Introduction

Bounds on distances or similarity measures can be useful for searching large databases.1–5 In a fairly typical similarity search situation, if item 𝒜 is close to item ℬ and item 𝒞 is far from item 𝒜, then one would like to infer that 𝒞 is also far from ℬ, thereby allowing one to prune the search space and ignore ℬ. In particular, we consider the typical chemoinformatics situation where a large database of molecules represented by binary fingerprint vectors of fixed length N must be searched using the Tanimoto similarity measure.6–9 Given a molecule 𝒜, we denote its fingerprint vector by A. The Tanimoto similarity measure is given by

T(𝒜,ℬ) = T(A,B) = (A ∩ B)/(A ∪ B)
(1)

Here and everywhere else we use A, A ∩ B, A ∪ B, … to denote the number of 1-bits in the vectors (or sets) A, A ∩ B, A ∪ B. It is well known that D(A,B) = 1 − T(A,B) is a distance.10,11 Thus, using the triangle inequality, we can bound the distance D(B,C) by

|D(A,B) − D(A,C)| ≤ D(B,C) ≤ D(A,C) + D(A,B)
(2)

from which one obtains the following bound on the similarity T(B,C)

T(B,C) ≤ 1 − |T(A,B) − T(A,C)|
(3)
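For concreteness, here is a minimal sketch (not from the original paper) of how the Tanimoto similarity and the triangle-inequality bound of Equation 3 might be computed, assuming each fingerprint is represented as a Python set of its 1-bit positions; the function names and example sets are illustrative only.

```python
def tanimoto(A, B):
    """Tanimoto similarity of two fingerprints given as sets of 1-bit positions."""
    inter = len(A & B)
    union = len(A) + len(B) - inter
    return inter / union if union > 0 else 1.0  # convention for two empty fingerprints

def triangle_bound(A, B, C):
    """Upper bound on T(B,C) from Equation 3, using a reference fingerprint A."""
    return 1.0 - abs(tanimoto(A, B) - tanimoto(A, C))

# Illustrative bit-position sets (not real chemical fingerprints)
A = {0, 2, 3, 7}
B = {0, 2, 5, 7, 9}
C = {3, 5, 7, 9, 11}
print(tanimoto(B, C))           # exact similarity
print(triangle_bound(A, B, C))  # upper bound, never smaller than the exact value
```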

Our goal here is to first derive another general bound on T(B,C) and then study its relationship to the triangle inequality. We then further generalize this bound and show how it can be applied to efficiently search large databases of chemical fingerprint vectors.

The Intersection Bound

As can be seen in the Venn Diagram of Figure 1, the following relation holds:

B ∩ C = (B ∩ C ∩ A) ∪ [(B − A) ∩ (C − A)]
(4)

By letting A ∩ B = β and A ∩ C = γ, we have

|B ∩ C ∩ A| ≤ min(A ∩ B, A ∩ C) = min(β, γ)
(5)

and

|(B − A) ∩ (C − A)| ≤ min[B − (A ∩ B), C − (A ∩ C)] = min(B − β, C − γ)
(6)

Combining these inequalities, we get

B ∩ C ≤ min(A ∩ B, A ∩ C) + min[C − (A ∩ C), B − (A ∩ B)] = min(β, γ) + min(B − β, C − γ)
(7)

Since T(B,C) = (B ∩ C)/(B ∪ C) = (B ∩ C)/(B + C − (B ∩ C)), we finally have

T(B,C) ≤ [min(β, γ) + min(B − β, C − γ)] / [B + C − min(β, γ) − min(B − β, C − γ)]
(8)

which we call the intersection bound or intersection inequality.
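A direct transcription of Equation 8 might look as follows; again a sketch, with fingerprints represented as sets of 1-bit positions and the function name chosen by us.

```python
def intersection_bound(A, B, C):
    """Upper bound on T(B,C) from Equation 8.

    A, B, C are sets of 1-bit positions; beta = |A ∩ B| and gamma = |A ∩ C|.
    """
    beta, gamma = len(A & B), len(A & C)
    b, c = len(B), len(C)
    numer = min(beta, gamma) + min(b - beta, c - gamma)
    denom = b + c - numer
    return numer / denom if denom > 0 else 1.0  # denom is 0 only when B and C are both empty
```

Reusing the sets from the previous sketch, intersection_bound(A, B, C) never exceeds triangle_bound(A, B, C), as proved in the next section.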

Figure 1
Venn Diagram. The three sets represent the 1-bit components of three fingerprint vectors associated with three molecules 𝒜, ℬ, and 𝒞. β = A ∩ B and γ = A ∩ C. We are interested in estimating or bounding the intersection B ∩ C.

The Intersection Inequality is Sharper than the Triangle Inequality

In this section, we prove that the intersection inequality is sharper than the triangle inequality. More specifically we have the following theorem.

Theorem

For any fingerprints A,B, and C, we have the inequality

T(B,C) ≤ [min(β, γ) + min(B − β, C − γ)] / [B + C − min(β, γ) − min(B − β, C − γ)] ≤ 1 − |β/(A ∪ B) − γ/(A ∪ C)|
(9)

Furthermore, the inequality is strict, except in degenerate cases detailed in the proof where: (1) at least one of the fingerprints is equal to 0 (the all-zero vector); or (2) A ⊂ B or A ⊂ C; or (3) B and C are symmetric with respect to A (i.e. B = C and β = γ). Moreover, if two of the fingerprints are identical (A = B or B = C or A = C), then the two bounds are equal and exact.

Proof

First note that B and C play entirely symmetric roles, and this symmetry can be used to reduce the number of cases that need to be examined. Second, consider situations where some of the denominators in Equation 9 are equal to 0. If A ∪ B = 0, then A = B = 0, and similarly for A ∪ C = 0. The denominator on the left-hand side is 0 if and only if B = 0 and C = 0. Cases where A = 0, or B = 0, or C = 0 can be examined directly and are summarized in Table 1 and Table 2.

Table 1
Table of possibilities with A = 0. In each case, the left entry is the value of the bound given by the left-hand side of Equation 9 and the right entry is the bound resulting from the triangle inequality (the right-hand side of Equation 9). Note that ...
Table 2
Table of possibilities with A ≠ 0. In each case, the left entry is the value of the bound given by the left-hand side of Equation 9 and the right entry is the bound resulting from the triangle inequality (the right-hand side of Equation 9). Note ...

Thus in the rest of the proof we can assume that A ≠ 0, B ≠ 0, and C ≠ 0 and that none of the denominators in Equation 9 is equal to 0. To aggregate the fractions in the right-hand side of Equation 9, it is convenient to parameterize the problem by assuming that A ∪ B = k(A ∪ C), where k is a strictly positive real number. Without any loss of generality, we can assume that k ≥ 1. By the definition of k, we have A + B − β = k(A + C − γ). This can be rewritten as:

B − C = (k − 1)(A + C) + β − kγ
(10)

and also as:

B − β = (k − 1)(A ∪ C) + C − γ
(11)

Equation 11 shows that under our assumptions B − β ≥ C − γ, with equality possible only if k = 1. By noticing that A + C ≥ 2γ (since A ≥ γ and C ≥ γ), we also have

B − C ≥ kγ − 2γ + β
(12)

Now we can write Equation 9 as:

[min(β, γ) + min(B − β, C − γ)] / [B + C − min(β, γ) − min(B − β, C − γ)] ≤ 1 − |β − kγ|/(A ∪ B) = 1 − [max(β, kγ) − min(β, kγ)]/(A ∪ B)
(13)

By assumption here none of the denominators is equal to 0, and we have seen that min (B − β, C − γ) = C − γ. Thus after reducing to common denominators and rearranging the terms, we need to prove the equivalent relation given by:

[B − min(β, γ) + γ][max(β, kγ) − min(β, kγ)] ≤ [A ∪ B][B − C + 2γ − 2 min(β, γ)]
(14)

Three cases can be distinguished depending on the size of β relative to γ and kγ. It is convenient to treat the special symmetric case k = 1 separately, and we begin with that.

Case 1

k = 1. In this case, Equation 10 yields B − β = C − γ. Two subcases can be considered.

  • In the most symmetric subcase, γ = β which implies that B = C, and both bounds are equal to 1.
  • Alternatively, without any loss of generality, assume that γ < β. Then Equation 14 becomes B(β − γ) ≤ (A ∪ B)(B − C). Since here B − C = β − γ, the inequality simplifies to B ≤ A ∪ B, which is always true. Furthermore, equality is achieved only when A ⊂ B, in which case both bounds are equal to C/B and are not necessarily exact.

Thus in the rest of the proof, we can assume that k > 1 (in addition to A ≠ 0, B ≠ 0, and C ≠ 0).

Case 2

kγ ≤ β. In this case, Equation 14 yields:

B[β − kγ] ≤ [A ∪ B][B − C]
(15)

This inequality is true because B ≤ A ∪ B, and β − kγ ≤ B − C by Equation 10. Under the current assumptions, equality in Equation 15 cannot be achieved since equality could occur only in the following two situations:

  • β = kγ and B = C. But then Equation 10 would imply k = 1, contradicting the assumption k > 1.
  • B = A ∪ B and β − kγ = B − C, which also implies k = 1.

Case 3

γ ≤ β ≤ kγ. In this case, Equation 14 yields:

B[kγ − β] ≤ [A ∪ B][B − C]
(16)

Again this inequality is true because B ≤ A ∪ B, and kγ − β ≤ B − C by Equation 10 and Equation 12. Equality between the bounds in Equation 16 can only be achieved when A = C ⊂ B. In this case, both bounds are exact and equal to C/B. To see this, under the current assumptions, equality could occur only in the following two situations:

  • β = kγ and B = C. But then Equation 10 would imply k = 1, contradicting the assumption k > 1.
  • B = A ∪ B and kγ − β = B − C. But then A ⊂ B and Equation 12 would imply γ ≥ β, hence β = γ = A. In turn, this would yield B = A ∪ B = k(A ∪ C) = kC which, combined with kγ − β = B − C, forces C = γ and therefore C = A ⊂ B. In this case, we have T(B,C) = C/B, and this is the value assumed by both the intersection inequality and the triangle inequality; they are both exact.

Case 4

β ≤ γ. In this case, Equation 14 yields:

[B − β + γ][kγ − β] ≤ [A ∪ B][B − C + 2γ − 2β]
(17)

Again this inequality is true because B − β + γ ≤ A ∪ B, and kγ − β ≤ B − C + 2γ − 2β by Equation 10 and Equation 12. Equality between the bounds in Equation 17 can only be achieved when A = C. In this case, both bounds are exact and equal to T(B,C) = T(A,B) = β/(A ∪ B). To see this, under the current assumptions, equality could occur only in the following two situations:

  • B − C = 2β − 2γ. Using Equation 12, this would imply β = kγ, which is possible here only if β = γ = 0 and B = C. But this implies A ∪ B = A ∪ C, which contradicts k > 1.
  • A ∪ B = B − β + γ and kγ − β = B − C + 2γ − 2β. The first equality yields A = γ, hence A ⊂ C. Combining this with the second equality and with A ∪ B = k(A ∪ C) gives C = γ, and so A = C. When A = C, the intersection and triangle inequality bounds are identical and exact and equal to β/(A ∪ B).
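As an informal numerical sanity check of the theorem (not part of the original proof), one can compare the two bounds on randomly generated fingerprints; the sketch below reuses the tanimoto, triangle_bound, and intersection_bound helpers from the earlier sketches, and the density and length parameters are arbitrary choices.

```python
import random

def random_fingerprint(n_bits=512, density=0.3):
    """Random set of 1-bit positions; an illustrative stand-in for a real fingerprint."""
    return {i for i in range(n_bits) if random.random() < density}

random.seed(0)
for _ in range(10_000):
    A, B, C = (random_fingerprint() for _ in range(3))
    t_exact = tanimoto(B, C)
    ib = intersection_bound(A, B, C)
    tb = triangle_bound(A, B, C)
    assert t_exact <= ib + 1e-12   # Equation 8: the intersection bound is valid
    assert ib <= tb + 1e-12        # the theorem: intersection bound <= triangle bound
```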

Additional Properties and Generalization of the Intersection Inequality

The following properties can be derived immediately.

Property 1

The intersection bound does not depend on any bits in A that are not in common with B or C.

Property 2

The intersection bound can be applied to many other similarity measures S(A,B) for binary fingerprints. Although the Tanimoto measure is by far the most widely used measure, two dozen or so other measures have been described in the literature.12 All these measures involve algebraic expressions that explicitly depend on the intersection A ∩ B. Thus, with the proper adjustments, bounds on the intersection naturally lead to bounds on these other similarity measures.

Property 3

In general, the intersection inequality is considerably sharper than the triangle inequality. We have seen in the theorem above that the intersection bound is always better than the triangle inequality, except in some degenerate cases. But is it much better and worth using? The answer to this question is clearly yes as illustrated by the following two examples.

Example 1

Consider the case where the three sets A,B, and C are disjoint, and A ≠ 0. Then Equation 9 gives

T(B,C) ≤ min(B,C)/max(B,C) ≤ 1
(18)

In other words, in this case the triangle inequality bound is useless and equal to 1. The intersection bound is much sharper and can diverge maximally from the triangle inequality bound by being as small as 0 when min(B,C) = 0 and max(B,C) ≠ 0.

Example 2

Consider now a typical case (e.g. the Daylight fingerprint system with OR-folding lossy compression) where fingerprints of length N = 512 are used. Depending on various implementation details, the average number of 1-bits per fingerprint vector could be something like 250. Consider then two fairly typical fingerprints with B = 150 and C = 350. Note that to derive a bound on T(B,C), we are free to choose the vector A (and only the vectors of all 0-bits or all 1-bits can be used to compute bounds without making additional assumptions about B and C). Consider the choice A = 0, the all-zero vector. Then Equation 9 yields

T(B,C) ≤ 150/350 ≤ 1
(19)

Thus in this fairly typical case the triangle inequality gives a useless bound equal to 1, whereas the intersection inequality gives a much sharper bound equal to 3/7. If, for instance, B were the query and we were interested in finding molecules with a Tanimoto similarity of at least 0.5, the intersection inequality would allow us to immediately discard the molecule C from the search, without having to compute the value of T(B,C) precisely. This is of course also the basic idea used in reference 5 and generalized below in the section on Applications to Efficient Database Searches.
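In code, the screening decision of Example 2 amounts to comparing min(B,C)/max(B,C) with the threshold; a minimal sketch with illustrative names:

```python
def zero_reference_bound(b_count, c_count):
    """Intersection bound on T(B,C) obtained by choosing A to be the all-zero vector,
    in which case Equation 9 reduces to min(B,C)/max(B,C)."""
    if max(b_count, c_count) == 0:
        return 1.0
    return min(b_count, c_count) / max(b_count, c_count)

# Example 2: B = 150, C = 350, threshold t = 0.5
bound = zero_reference_bound(150, 350)   # 3/7, roughly 0.43
print(bound < 0.5)                       # True: C can be discarded without computing T(B,C)
```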

Generalized Intersection Inequality

Let us use the notation Aᶜ to denote the complement of A, i.e., Aᶜ is the vector obtained by replacing the 0-bits of A with 1-bits, and the 1-bits of A with 0-bits. Then, with the usual notation, Equation 7 can also be written as

B ∩ C ≤ min(A ∩ B, A ∩ C) + min(Aᶜ ∩ B, Aᶜ ∩ C)
(20)

More generally, if A1,…,AM denotes any family of pairwise disjoint subsets of components (Ai ∩ Aj = ∅ for any i ≠ j) whose union covers all the components, then we immediately have the generalized intersection inequality

B ∩ C ≤ ∑_{i=1}^{M} min(Ai ∩ B, Ai ∩ C)
(21)

In particular, this is true for any partition A1,…,AM of the entire set of components. Thus one important question is how to choose the partition in order to get good bounds. Below we further investigate the case where the partition is obtained by considering the fingerprint components modulo some integer M.
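A sketch of the partition-based bound of Equation 21 for an arbitrary family of disjoint index sets covering all components; the function names, the representation of fingerprints as sets of 1-bit positions, and the parity example are ours.

```python
def partition_intersection_bound(B, C, partition):
    """Upper bound on |B ∩ C| from Equation 21.

    B, C: sets of 1-bit positions; partition: disjoint sets of positions that
    together cover every component of the fingerprints."""
    return sum(min(len(Ai & B), len(Ai & C)) for Ai in partition)

def partition_tanimoto_bound(B, C, partition):
    """Corresponding upper bound on T(B,C) (anticipating Equation 23)."""
    cap = partition_intersection_bound(B, C, partition)
    denom = len(B) + len(C) - cap
    return cap / denom if denom > 0 else 1.0

# Example: partition the positions 0..511 by parity (essentially the M = 2 case below)
partition = [{i for i in range(512) if i % 2 == 0},
             {i for i in range(512) if i % 2 == 1}]
```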

Applications to Efficient Database Searches: the Modulo Hashing Approach

Let M be any integer between 1 and N and assume that to any molecule 𝒜 and binary fingerprint A, we also attach a new integer vector of counts a = (a1,…,aM) of length M. The count ai represents how many 1-bits in A fall on components that are congruent to i modulo M. For instance, if A = (1,0,1,1,0,0) and M = 3, then a = (2,0,1) because there are two 1-bits in positions congruent to 1 modulo 3 (1st and 4th), zero 1-bits in positions congruent to 2 modulo 3 (2nd and 5th), and one 1-bit in positions congruent to 3 (i.e. 0) modulo 3 (3rd and 6th).
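A minimal sketch of the modulo-M count vector just described, keeping the 1-indexed position convention of the text (the function name is ours):

```python
def modulo_counts(A, M):
    """Count vector a = (a1,…,aM): ai is the number of 1-bits of A at positions
    congruent to i modulo M, with positions numbered from 1 as in the text.

    A is given as a set of (1-indexed) 1-bit positions."""
    counts = [0] * M
    for pos in A:
        counts[(pos - 1) % M] += 1
    return counts

# The example from the text: A = (1,0,1,1,0,0) has 1-bits at positions 1, 3, 4
print(modulo_counts({1, 3, 4}, 3))   # [2, 0, 1], i.e. a = (2,0,1)
```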

If we consider two fingerprint vectors B and C, with their respective count vectors b and c, then B and C can have at most min(bi, ci) bits in common at positions that are congruent to i modulo M. Thus in this case the intersection inequality of Equation 21 can be written

B ∩ C ≤ ∑_{i=1}^{M} min(bi, ci)
(22)

resulting in the following bound on the Tanimoto similarity

T(B,C) ≤ [∑_{i=1}^{M} min(bi, ci)] / [B + C − ∑_{i=1}^{M} min(bi, ci)]
(23)

Suppose that B is the query fingerprint and that we are interested in retrieving from the database all the molecules C satisfying T(B,C) > t, i.e., with a Tanimoto similarity to the query greater than t. Then, using the intersection bound, we can discard from the search all the molecules C satisfying

[∑_{i=1}^{M} min(bi, ci)] / [B + C − ∑_{i=1}^{M} min(bi, ci)] ≤ t
(24)

or equivalently

∑_{i=1}^{M} min(bi, ci) ≤ [t/(1 + t)](B + C)
(25)

Thus, in summary, if we are searching a database of molecules with a query B and a Tanimoto similarity threshold t, we can prune the search by removing all the molecules that satisfy Equation 25 and are thereby guaranteed to have a degree of similarity to B that is at or below the threshold t. To show that this corresponds to a very significant amount of pruning, we consider in more detail the cases M = 1 and M = 2.
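The pruning criterion of Equation 25 translates directly into a screening test; a sketch, assuming the count vectors b and c have been precomputed, for instance with the modulo_counts helper above, and using hypothetical values in the example call.

```python
def can_discard(b_counts, c_counts, b_total, c_total, t):
    """Return True when Equation 25 holds, i.e. when T(B,C) <= t is guaranteed
    and the candidate C can be removed from a search with threshold t."""
    cap = sum(min(bi, ci) for bi, ci in zip(b_counts, c_counts))
    return cap <= (t / (1.0 + t)) * (b_total + c_total)

# Example usage with M = 2 count vectors (hypothetical values)
print(can_discard([200, 100], [50, 250], 300, 300, 0.8))   # True: 150 <= 266.7
```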

Case M=1

This is exactly the case described and used in reference 5. In this case, Equation 25 implies that we can discard all the molecules C with

C ≤ tB    or    C ≥ B/t
(26)

Using the same example as above with fingerprints of length N = 512, consider a typical query B with B = 300 and a similarity threshold of t = 0.8. Then, using the intersection inequality, we can immediately discard from the search all the molecules C with C ≤ 240 = 0.8 × 300 or C ≥ 375 = 300/0.8.
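For M = 1, the same criterion reduces to a simple filter on the number of 1-bits; a sketch (with an illustrative function name) reproducing the range from the example above:

```python
def m1_keep_range(b_total, t):
    """For M = 1, only candidates with tB < C < B/t can have T(B,C) > t (Equation 26);
    everything outside this open interval can be discarded."""
    return t * b_total, b_total / t

print(m1_keep_range(300, 0.8))   # (240.0, 375.0), matching the example above
```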

More generally, Figure 2 shows the significant amount of database pruning achieved by the intersection inequality with M = 1 in a practical setting, for different Tanimoto similarity thresholds t and different queries. The curves are derived using a random subset of 50,000 molecules from the ChemDB database8,13 using Daylight-style fingerprints of length 512 (with folded-OR lossy compression). For example, for a similarity threshold of t = 0.8 and a query containing B = 200 1-bits, on average 80% of the database can be pruned from the search.

Figure 2
Average database pruning for the case of M = 1. The y-axis represents the fraction of the database being pruned. The x-axis represents the number B of bits in the query molecule. Curves correspond to three different Tanimoto similarity thresholds. Blue ...

Case M=2

The case M = 2 subsumes the case M = 1 and allows one to prune additional molecules. In this case, we have b = (b1,b2) and c = (c1,c2). Here b1 represents the number of 1-bits in B falling on odd-numbered components, b2 the number of 1-bits in B falling on even-numbered components, and similarly for c. Obviously, b1 + b2 = B and c1 + c2 = C. When M = 2, the intersection bound allows one to prune all the molecules C such that

min(b1, c1) + min(b2, c2) ≤ [t/(1 + t)](B + C)
(27)

This yields four subcases:

  1. b1 + b2 = B ≤ t(B + C)/(1 + t), which yields C ≥ B/t (as in the case of M = 1).
  2. c1 + c2 = C ≤ t(B + C)/(1 + t), which yields C ≤ tB (as in the case of M = 1).
  3. b1 + c2 ≤ t(B + C)/(1 + t), which yields c2 ≤ −b1 + t(B + C)/(1 + t).
  4. c1 + b2 ≤ t(B + C)/(1 + t), which yields c1 ≤ −b2 + t(B + C)/(1 + t).

The first two subcases correspond to the pruning derived when M = 1. The last two subcases correspond to new additional pruning.

To see this, consider the example given above in the case of M = 1 with a query B with B = 300 and a similarity threshold of t = 0.8. By using the M = 1 level corresponding to the first two subcases, one needs to focus only on candidate molecules satisfying 240 < C < 375. By using M = 2 and the last two subcases, one can prune additional molecules that fall within this interval. For instance, assume that b = (200,100). Then even among the molecules C containing the same number of 1-bits as the query (i.e. C = B = 300), we can eliminate all of those associated with c = (c1,c2) for which c1 ≤ 166 (subcase 4) or c2 ≤ 66 (subcase 3).
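A sketch of the additional M = 2 test (subcases 3 and 4) on this example; the function name and the candidate count vectors in the usage lines are illustrative.

```python
def m2_extra_discard(b1, b2, c1, c2, b_total, c_total, t):
    """Additional pruning available at M = 2 (subcases 3 and 4 above): discard C
    when either mixed sum already falls at or below t(B + C)/(1 + t)."""
    threshold = (t / (1.0 + t)) * (b_total + c_total)
    return (b1 + c2 <= threshold) or (c1 + b2 <= threshold)

# Example from the text: B = 300, b = (200,100), t = 0.8, candidates with C = 300,
# so t(B + C)/(1 + t) is about 266.7 and pruning triggers when c1 <= 166 or c2 <= 66.
print(m2_extra_discard(200, 100, 200, 100, 300, 300, 0.8))   # False: candidate survives
print(m2_extra_discard(200, 100, 250, 50, 300, 300, 0.8))    # True: c2 = 50 <= 66
```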

In short, it should be clear that the greater the value of M, the greater the amount of pruning associated with the intersection bound and Equation 25. While the amount of pruning is an essential consideration, other equally important factors must be taken into account in an actual database implementation. For instance, larger values of M may require additional storage space and computing time. Furthermore, the data structure required to implement the pruning algorithm is also an important consideration. While a complete analysis of these issues is beyond the scope of this paper, it is worth noting that the cases of M = 1 and M = 2 suggest a natural and efficient data structure. First, organize the fingerprints C in the database into bins containing all the fingerprints that have the same number C of 1-bits. Organize the bins by increasing values of C. Second, within any such bin, organize the fingerprints by increasing values of, for instance, c1, the number of bits falling on odd-numbered components. For a given query B and threshold t, one can first use the intersection inequality with M = 1 to discard all the bins associated with values of C that are too large or too small and focus on bins with C in a limited range. Then, for these remaining bins, one can use the intersection inequality with M = 2 to discard additional molecules and focus on sub-bins of molecules with c1 in a limited range. This procedure can be further repeated using, for instance, M = 4, raising again interesting questions of optimality and tradeoffs.
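One possible, purely illustrative realization of the data structure just described: bins keyed by the total count C, with the fingerprints in each bin kept sorted by c1, so that both levels of pruning become range restrictions. This is a sketch of one plausible organization under our own naming and representation choices, not the authors' implementation.

```python
import bisect
from collections import defaultdict

class FingerprintIndex:
    """Toy two-level index: outer bins keyed by the total count C,
    fingerprints within each bin kept sorted by c1 (1-bits on odd components)."""

    def __init__(self):
        self.bins = defaultdict(list)   # C -> sorted list of (c1, fingerprint tuple)

    def add(self, fp):
        """fp is a set of (1-indexed) 1-bit positions."""
        c = len(fp)
        c1 = sum(1 for pos in fp if pos % 2 == 1)
        bisect.insort(self.bins[c], (c1, tuple(sorted(fp))))

    def search_candidates(self, query, t):
        """Return candidates that survive the M = 1 and M = 2 pruning; their exact
        Tanimoto similarity to the query still has to be computed afterwards."""
        b = len(query)
        b1 = sum(1 for pos in query if pos % 2 == 1)
        b2 = b - b1
        survivors = []
        for c, entries in self.bins.items():
            if c <= t * b or c >= b / t:                 # M = 1 pruning (Equation 26)
                continue
            threshold = (t / (1.0 + t)) * (b + c)
            # M = 2, subcase 4: only entries with c1 + b2 > threshold can survive (range scan).
            lo = bisect.bisect_right(entries, (threshold - b2, ()))
            for c1, fp in entries[lo:]:
                if c1 + b2 > threshold and b1 + (c - c1) > threshold:   # subcases 4 and 3
                    survivors.append(set(fp))
        return survivors
```

At query time, the M = 1 test discards whole bins, the sort by c1 turns subcase 4 into a single binary search per surviving bin, and subcase 3 is checked entry by entry in this sketch; the surviving candidates still require an exact Tanimoto computation.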

Formulae for estimating the amount of pruning for different values of M and different thresholds t have been derived, but will be given elsewhere, together with an analysis of the corresponding tradeoffs. Yet, even without those analyses, it is clear that the intersection bound is a simple but powerful tool for organizing and searching large databases of fingerprint vectors.

Acknowledgement

Work supported by NIH Biomedical Informatics Training grant (LM-07443-01), NSF MRI grant (EIA- 0321390), and NSF grant 0513376 to PB.

References

1. Burkhard W, Keller R. Some Approaches to Best-Match File Searching. Communications of the ACM. 1973;16:230–236.
2. Shapiro M. The Choice of Reference Points in Best-Match File Searching. Communications of the ACM. 1977;20:339–343.
3. Shasha D, Wang T-L. New Techniques for Best-Match Retrieval. ACM Trans. Inf. Syst. 1990;8:140–158.
4. Downs GM, Willett P. Similarity Searching and Clustering of Chemical-Structure Databases Using Molecular Property Data. J. Chem. Info. Comp. Sci. 1994;34:1094–1102.
5. Swamidass S, Baldi P. Bounds and Algorithms for Exact Searches of Chemical Fingerprints in Linear and Sub-Linear Time. J. Chem. Info. Model. 2007;47:302–317.
6. Willett P, Barnard J, Downs G. Chemical Similarity Searching. J. Chem. Info. Comp. Sci. 1998;38:983–996.
7. Leach AR, Gillet VJ. An Introduction to Chemoinformatics. Springer: Dordrecht, The Netherlands; 2005.
8. Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P. ChemDB: a Public Database of Small Molecules and Related Chemoinformatics Resources. Bioinformatics. 2005;21:4133–4139.
9. Baldi P, Hirschberg DS, Nasr RJ. Speeding Up Chemical Database Searches Using a Proximity Filter Based on the Logical Exclusive-OR. J. Chem. Info. Model. 2008;48:1367–1378.
10. Spath H. Cluster Analysis Algorithms. Ellis Horwood; 1980.
11. Lipkus A. A Proof of the Triangle Inequality for the Tanimoto Distance. J. Math. Chem. 1999;26:263–265.
12. Holliday JD, Hu CY, Willett P. Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity Using 2D Fragment Bit-Strings. Comb. Chem. High. Throughput Screen. 2002;5:155–166.
13. Chen J, Linstead E, Swamidass SJ, Wang D, Baldi P. ChemDB Update-Full Text Search and Virtual Chemical Space. Bioinformatics. 2007;23:2348–2351.