Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2758932

Formats

Article sections

- Abstract
- Introduction
- The Intersection Bound
- The Intersection Inequality is Sharper than the Triangle Inequality
- Additional Properties and Generalization of the Intersection Inequality
- Generalized Triangle Inequality
- Applications to Efficient Database Searches: the Modulo Hashing Approach
- References

Authors

Related links

J Chem Inf Model. Author manuscript; available in PMC 2010 August 1.

Published in final edited form as:

PMCID: PMC2758932

NIHMSID: NIHMS132611

School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, CA 92697-3435, USA

E-mail: ude.icu.sci@idlabfp

The publisher's final edited version of this article is available at J Chem Inf Model

See other articles in PMC that cite the published article.

Bounds on distances or similarity measures can be useful to help search large databases efficiently. Here we consider the case of large databases of small molecules represented by molecular fingerprint vectors with the Tanimoto similarity measure. We derive a new intersection inequality which provides a bound on the Tanimoto similarity between two fingerprint vectors and show that this bound is considerably sharper than the bound associated with the triangle inequality of the Tanimoto distance. The inequality can be applied to other intersection-based similarity measures. We introduce a new integer representation which relies on partitioning the fingerprint components, for instance by taking components modulo some integer *M*, and reporting the total number of 1-bits falling in each partition. We show how the intersection inequality can be generalized immediately to these integer representations and used to search large databases of binary fingerprint vectors efficiently.

Bounds on distances or similarity measures can be useful for searching large databases.^{1}^{–}^{5} In a fairly typical similarity search situation, if item is close to item and item is far from item , then one would like to infer that is also far from , thereby allowing one to prune the search space and ignore In particular, we consider the typical chemoinformatics situation where a large database of molecules represented by binary fingerprint vectors of fixed length *N* must be searched using the Tanimoto similarity measure.^{6}^{–}^{9} Given a molecule , we denote its fingerprint vector by *$\stackrel{\u20d7}{A}$*. The Tanimoto similarity measure is given by

$$T(\mathcal{A},\mathcal{B})=T(\overrightarrow{A},\overrightarrow{B})=\frac{A\cap B}{A\cup B}$$

(1)

Here and everywhere else we use *A, A* ∩ *B, A* *B* … to denote the number of 1-bits in the vectors (or sets) *$\stackrel{\u20d7}{A}$,$\stackrel{\u20d7}{A}$∩$\stackrel{\u20d7}{B}$, $\stackrel{\u20d7}{A}$$\stackrel{\u20d7}{B}$ …*. It is well known that *D*(*$\stackrel{\u20d7}{A}$,$\stackrel{\u20d7}{B}$*) is a distance.^{10}^{,}^{11} Thus, using the triangle inequality we can bound the distance *D*(*$\stackrel{\u20d7}{B}$,$\stackrel{\u20d7}{C}$*) by

$$|D(\overrightarrow{A},\overrightarrow{B})-D(\overrightarrow{A},\overrightarrow{C})|\le D(\overrightarrow{B},\overrightarrow{C})\le D(\overrightarrow{A},\overrightarrow{C})+D(\overrightarrow{A},\overrightarrow{B})$$

(2)

from which one obtains the following bound on the similarity *T*(*$\stackrel{\u20d7}{B}$,$\stackrel{\u20d7}{C}$*)

$$T(\overrightarrow{B},\overrightarrow{C})\le 1-|T(\overrightarrow{A},\overrightarrow{B})-T(\overrightarrow{A},\overrightarrow{C})|$$

(3)

Our goal here is to first derive another general bound on *T*(*$\stackrel{\u20d7}{B}$,$\stackrel{\u20d7}{C}$*) and then study its relationship to the triangle inequality. We then further generalize this bound and show how it can be applied to efficiently search large databases of chemical fingerprint vectors.

As can be seen in the Venn Diagram of Figure 1, the following relation holds:

$$\overrightarrow{B}\cap \overrightarrow{C}=(\overrightarrow{B}\cap \overrightarrow{C}\cap \overrightarrow{A})\cup [(\overrightarrow{B}-\overrightarrow{A})\cap (\overrightarrow{C}-\overrightarrow{A})]$$

(4)

By letting *A ∩ B* = β and *A ∩ C* = γ

$$|\overrightarrow{B}\cap \overrightarrow{C}\cap \overrightarrow{A}|\le \text{min}(A\cap B,A\cap C)=\text{min}(\beta ,\gamma )$$

(5)

and

$$|(\overrightarrow{B}-\overrightarrow{A})\cap (\overrightarrow{C}-\overrightarrow{A})|\le \text{min}[B-(A\cap B),C-(A\cap C)]=\text{min}(B-\beta ,C-\gamma )$$

(6)

Combining these inequalities, we get

$$B\cap C\le \text{min}(A\cap B,A\cap C)+\text{min}[C-(A\cap C),B-(A\cap B)]=\text{min}(\beta ,\gamma )+\text{min}(B-\beta ,C-\gamma )$$

(7)

Since *T*(*$\stackrel{\u20d7}{B}$,$\stackrel{\u20d7}{C}$*) = (*B∩C*)/(*BC*)=(*B∩C*)/(*B*+*C*−(*B∩C*)), we finally have

$$T(\overrightarrow{B},\overrightarrow{C})\le \frac{\text{min}(\beta ,\gamma )+\text{min}(B-\beta ,C-\gamma )}{B+C-\text{min}(\beta ,\gamma )-\text{min}(B-\beta ,C-\gamma )}$$

(8)

which we call the intersection bound or intersection inequality.

In this section, we prove that the intersection inequality is sharper than the triangle inequality. More specifically we have the following theorem.

For any fingerprints *$\stackrel{\u20d7}{A}$,$\stackrel{\u20d7}{B}$*, and *$\stackrel{\u20d7}{C}$*, we have the inequality

$$T(\overrightarrow{B},\overrightarrow{C})\le \frac{\text{min}(\beta ,\gamma )+\text{min}(B-\beta ,C-\gamma )}{B+C-\text{min}(\beta ,\gamma )-\text{min}(B-\beta ,C-\gamma )}\le 1-|\frac{\beta}{A\cup B}-\frac{\gamma}{A\cup C}|$$

(9)

Furthermore the inequality is strict, except in degenerate cases detailed in the proof where: (^{1}) at least one of the fingerprints is equal to ; or (2) *$\stackrel{\u20d7}{A}$* *$\stackrel{\u20d7}{B}$* or *$\stackrel{\u20d7}{A}$* *$\stackrel{\u20d7}{C}$*; or (3) *$\stackrel{\u20d7}{B}$* and *$\stackrel{\u20d7}{C}$* are symmetric with respect to *$\stackrel{\u20d7}{A}$* (i.e. *B = C* and β = γ). Furthermore, if two of the fingerprints are identical (*$\stackrel{\u20d7}{A}$* = *$\stackrel{\u20d7}{B}$* or *$\stackrel{\u20d7}{B}$* = *$\stackrel{\u20d7}{C}$* or *$\stackrel{\u20d7}{A}$* = *$\stackrel{\u20d7}{C}$*) then the two bounds are equal and exact.

First note that *B* and *C* play entirely symmetric roles and this symmetry can be used to reduce the number of cases that need to be examined. Second consider situations where some of the denominators in Equation 9 are equal to 0. If *A* *B* = 0, then *A* = *B* = 0 and similarly for *A* *C* = 0. The denominator on the left hand side is 0 if and only if *B* = 0 or *C* = 0. Cases where *A* = 0, or *B* = 0, or *C* = 0 can be examined directly and are summarized in Table 1 and Table 2.

Table of possibilities with *A* = 0. In each case, the left entry is the value of the bound given by the left-hand side of Equation 9 and the right-hand side is the bound resulting from the triangle inequality (right-hand side of Equation 9). Note that **...**

Table of possibilities with *A* ≠ 0. In each case, the left entry is the value of the bound given by the left-hand side of Equation 9 and the right-hand side is the bound resulting from the triangle inequality (right-hand side of Equation 9). Note **...**

Thus in the rest of the proof we can assume that *A* ≠ 0, *B* ≠ 0, and *C* ≠ 0 and that none of the denominators in Equation 9 is equal to 0. To aggregate the fractions in the right-hand side of Equation 9, it is convenient to parameterize the problem by assuming that *A* *B* = *k*(*A* *C*) where *k* is a strictly positive real number. Without any loss of generality, we can assume that *k* ≥ 1. By the definition of *k*, we have *A* + *B* − β = *k*(*A* + *C* − γ). This can be rewritten as:

$$B-C=(k-1)(A+C)+\beta -k\gamma $$

(10)

and also as:

$$B-\beta =(k-1)(A\cup C)+C-\gamma $$

(11)

Equation 11 shows that under our assumptions *B* − β ≥ *C* − γ, with equality possible only if and only if *k* = 1. By noticing that *A* + *C ≥* 2γ, we also have

$$B-C\ge k\gamma -2\gamma +\beta $$

(12)

Now we can write Equation 9 as:

$$\frac{\text{min}(\beta ,\gamma )+\text{min}(B-\beta ,C-\gamma )}{B+C-\text{min}(\beta ,\gamma )-\text{min}(B-\beta ,C-\gamma )}\le 1-\frac{|\beta -k\gamma |}{A\cup B}=1-\frac{\text{max}(\beta ,k\gamma )-\text{min}(\beta ,k\gamma )}{A\cup B}$$

(13)

By assumption here none of the denominators is equal to 0, and we have seen that min (*B* − β, *C* − γ) = *C* − γ. Thus after reducing to common denominators and rearranging the terms, we need to prove the equivalent relation given by:

$$[B-\text{min}(\beta ,\gamma )+\gamma ][\text{max}(\beta ,k\gamma )-\text{min}(\beta ,k\gamma )]\le [A\cup B][B-C+2\gamma -2\text{min}(\beta ,\gamma )]$$

(14)

Three cases can be distinguished depending on the size of β relative to γ and *k*γ. It is convenient to treat the special symmetric case *k* = 1 separately, and we begin with that.

*k* = 1. In this case, Equation 10 yields *B* − β = *C* − γ. Two subcases can be considered.

- In the most symmetric subcase, γ = β which implies that
*B = C*, and both bounds are equal to 1. - Alternatively, without any loss of generality, assume that γ < β. Then Equation 14 becomes
*B*(β − γ) ≤ (*A**B*)(*B − C*). Since here*B*−*C*= β − γ, the inequality simplifies to*B ≤ A**B*which is always true. Furthermore, equality is achieved only when*$\stackrel{\u20d7}{A}$**$\stackrel{\u20d7}{B}$*in which case both bounds are equal to*C/B*and are not necessarily exact.

Thus in the rest of the proof, we can assume that *k* > 1 (in addition to *A ≠* 0, *B ≠* 0, and *C ≠* 0).

*k*γ ≤ β. In this case, Equation 14 yields:

$$B[\beta -k\gamma ]\le [A\cup B][B-C]$$

(15)

This inequality is true because *B ≤ A* *B*, and β − *k*γ ≤ *B* − *C* by Equation 10. Under the current assumptions, equality in Equation 15 cannot be achieved since equality could occur only in the following two situations:

*B = A**B*and β −*k*γ =*B − C*, which also implies*k*= 1.

γ ≤ β ≤ *k*γ. In this case, Equation 14 yields:

$$B[k\gamma -\beta ]\le [A\cup B][B-C]$$

(16)

Again this inequality is true because *B* ≤ *A**B*, and *k*γ−β ≤ *B − C* by Equation 10 and Equation 12. Equality between the bounds in Equation 16 can only be achieved when *$\stackrel{\u20d7}{A}$* = *$\stackrel{\u20d7}{C}$* *$\stackrel{\u20d7}{B}$*. In this case, both bounds are exact and equal to *C/B*. To see this, under the current assumptions, equality could occur only in the following two situations:

*B = A**B*and*k*γ− β =*B*−*C*. But then*$\stackrel{\u20d7}{A}$**$\stackrel{\u20d7}{B}$*and Equation 12 would imply γ ≥ β, hence β = γ =*A*. In turn, this would yield*B = A**B = k*(*A**C*) =*kC*which combined with*k*γ− β =*B*−*C*forces*C*= γ and therefore*$\stackrel{\u20d7}{C}$*=*$\stackrel{\u20d7}{A}$**$\stackrel{\u20d7}{B}$*. In this case, we have*T*(*$\stackrel{\u20d7}{B}$*,*$\stackrel{\u20d7}{C}$*) =*C/B*and this is the value assumed by both the intersection inequality and the triangle inequality and they are both exact.

β ≤ γ. In this case, Equation 14 yields:

$$[B-\beta +\gamma ][k\gamma -\beta ]\le [A\cup B][B-C+2\gamma -2\beta ]$$

(17)

Again this inequality is true because *B* − β + γ ≤ *A* *B*, and *k*γ− β ≤ *B − C + 2γ−* 2β by Equation 10 and Equation 12. Equality between the bounds in Equation 16 can only be achieved when *$\stackrel{\u20d7}{A}$* = *$\stackrel{\u20d7}{C}$*. In this case, both bounds are exact and equal to *T*(*$\stackrel{\u20d7}{B}$,$\stackrel{\u20d7}{C}$* = *T*(*$\stackrel{\u20d7}{A}$,$\stackrel{\u20d7}{B}$*) = β/(*A**B*) = β/(*A**C*). To see this, under the current assumptions, equality could occur only in the following two situations:

*B − C*= 2β − 2γ. Using Equation 12, this would imply β =*k*γ which is possible here only if β = γ = 0 and*B = C*. But this implies*A**B = A**C*which contradicts*k*> 1*A**B*=*B*− β + γ and*k*γ− β =*B*−*C*+ 2γ− 2β. The first equality yields*A*= γ, hence*$\stackrel{\u20d7}{A}$**$\stackrel{\u20d7}{C}$*. Combining this with*A**B = k*(*A**C*) gives*C*= γ, and so*$\stackrel{\u20d7}{A}$*=*$\stackrel{\u20d7}{C}$*. When*$\stackrel{\u20d7}{A}$*=*$\stackrel{\u20d7}{C}$*, the intersection and triangle inequality bounds are identical and exact and equal to β/(*A**B*).

The following properties can be derived immediately.

The intersection bound does not depend on any bits in *$\stackrel{\u20d7}{A}$* that are not in common with *$\stackrel{\u20d7}{B}$* or *$\stackrel{\u20d7}{C}$*.

The intersection bound can be applied to many other similarity measures *S*(*$\stackrel{\u20d7}{A}$,$\stackrel{\u20d7}{B}$* for binary fingerprints. Although the Tanimoto measure is by far the most widely used measure, two dozen or so other measures have been described in the literature.^{12} All these measures involve algebraic expressions that explicitly depend on the intersection *A* ∩ *B*. Thus, with the proper adjustments, bounds on the intersection naturally lead to bounds on these other similarity measures.

In general, the intersection inequality is considerably sharper than the triangle inequality. We have seen in the theorem above that the intersection bound is always better than the triangle inequality, except in some degenerate cases. But is it much better and worth using? The answer to this question is clearly yes as illustrated by the following two examples.

Consider the case where the three sets *$\stackrel{\u20d7}{A}$*,*$\stackrel{\u20d7}{B}$*, and *$\stackrel{\u20d7}{C}$* are disjoint, and *A* ≠ 0. Then Equation 9 gives

$$T(\overrightarrow{B},\overrightarrow{C}\le \frac{\text{min}(B,C)}{\text{max}(B,C)}\le 1$$

(18)

In other words, in this case the triangle inequality bound is useless and equal to 1. The intersection bound is much sharper and can diverge maximally from the triangle inequality bound by being as small as 0 when min(*B,C*) = 0 and max(*B,C*) ≠ 0.

Consider now a typical case (e.g. Daylight fingerprint system with OR-folding lossy compression) where fingerprints of length *N* = 512 are used. Depending on various implementation details, the average number of 1-bits per fingerprint vector could be something like 250. Consider then two fairly typical fingerprints with *B* = 150 and *C* = 350. Note that to derive a bound on *T*(*$\stackrel{\u20d7}{B}$$\stackrel{\u20d7}{C}$*), we are free to choose the vector *$\stackrel{\u20d7}{A}$* (and only the vectors of all 0-bits or all 1-bits can be used to compute bounds, without making additional assumptions on *$\stackrel{\u20d7}{B}$* and *$\stackrel{\u20d7}{C}$*. Consider the choice *$\stackrel{\u20d7}{A}$* = . Then Equation 9 yields

$$T(\overrightarrow{B},\overrightarrow{C})\le \frac{150}{350}\le 1$$

(19)

Thus in this fairly typical case the triangle inequality gives a useless bound equal to 1, whereas the intersection inequality gives a much sharper bound equal to 3/7. If, for instance, *B* was the query and we were interested in finding molecules with a Tanimoto similarity of at least 0.5, the intersection inequality would allow us to immediately discard the molecule *$\stackrel{\u20d7}{C}$* from the search, without having to compute precisely the value of *T*(*$\stackrel{\u20d7}{B}$,$\stackrel{\u20d7}{C}$*). This is of course also the basic idea used in^{5} and generalized below in the Section on Applications to Efficient Database Searches.

Let us use the notation *$\stackrel{\u20d7}{A}$ ^{C}* to denote the complement of

$$B\cap C\le \text{min}(A\cap B,A\cap C)+\text{min}({A}^{C}\cap B,{A}^{C}\cap C)$$

(20)

More generally, if *A*_{1},…,*A _{M}* denotes any family of disjoint subsets of components (

$$B\cap C\le {\displaystyle \sum _{i=1}^{M}\text{min}({A}_{i}\cap B,{A}_{i}\cap C)}$$

(21)

In particular, this is true for any partition *A*_{1},…,*A _{M}* of the entire set of components. Thus one important question is how to choose the partition in order to get good bounds. Below we further investigate the case where the partition is obtained by considering the fingerprint components modulo some integer

Let *M* be any integer between 1 and *N* and assume that to any molecule and binary fingerprint *$\stackrel{\u20d7}{A}$*, we also attach a new integer vector of counts *$\stackrel{\u20d7}{a}$* = (*a*_{1},…*a _{M}*) of length

If we consider two fingerprint vectors *$\stackrel{\u20d7}{B}$* and *$\stackrel{\u20d7}{C}$*, with their respective count vectors *$\stackrel{\u20d7}{b}$* and *$\stackrel{\u20d7}{c}$*, then *$\stackrel{\u20d7}{B}$* and *$\stackrel{\u20d7}{C}$* can have at most min(*b _{i}*,

$$B\cap C\le {\displaystyle \sum _{i=1}^{M}\text{min}({b}_{i},{c}_{i})}$$

(22)

resulting in the following bound on the Tanimoto similarity

$$T(\overrightarrow{B},\overrightarrow{C})\le \frac{{\sum}_{i=1}^{M}\text{min}({b}_{i},{c}_{i})}{B+C-{\sum}_{i=1}^{M}\text{min}({b}_{i},{c}_{i})}$$

(23)

Suppose that *$\stackrel{\u20d7}{B}$* is the query fingerprint and that we are interested in retrieving from the database all the molecules *$\stackrel{\u20d7}{C}$* satisfying *T*(*$\stackrel{\u20d7}{B}$*,*$\stackrel{\u20d7}{C}$*)>*t*, i.e. with a Tanimoto similarity of at least *t* to the query. Then, using the intersection bound, we can discard from the search all the molecules *$\stackrel{\u20d7}{C}$* satisfying

$$\frac{{\sum}_{i=1}^{M}\text{min}({b}_{i},{c}_{i})}{B+C-{\sum}_{i=1}^{M}\text{min}({b}_{i},{c}_{i})}\le t$$

(24)

or equivalently

$$\sum _{i=1}^{M}\text{min}({b}_{i},{c}_{i})\le \frac{t}{1+t}(B+C)$$

(25)

Thus, in summary, if we are searching a database of molecules with a query *$\stackrel{\u20d7}{B}$* and a Tanimoto similarity threshold *t*, we can prune the search by removing all the molecules that satisfy Equation 25 and are thereby guaranteed to have a degree of similarity to *$\stackrel{\u20d7}{B}$* that is below the threshold *t*. To show that this corresponds to a very significant amount of pruning, we consider in more details the cases of *M* = 1 and *M* = 2.

This is exactly the case described and used in.^{5} In this case, Equation 25 implies that we can discard all the molecules *$\stackrel{\u20d7}{C}$* with

$$C\le tB\phantom{\rule{thinmathspace}{0ex}}\text{or}\phantom{\rule{thinmathspace}{0ex}}C\ge B/t$$

(26)

Using the same example as above with fingerprints of length *N* = 512, consider a typical query *$\stackrel{\u20d7}{B}$* with *B* = 300 and a similarity threshold of *t* = 0.8. Then, using the intersection inequality, we can immediately discard from the search all the molecules *$\stackrel{\u20d7}{C}$* with *C* ≤ 240 = 0.8 × 300 or *C* ≥ 375 = 300/0.8.

More generally, Figure 2 shows the significant amount of database pruning achieved by the intersection inequality with *M* = 1 in a practical setting, for different Tanimoto similarity thresholds *t* and different queries. The curves are derived using a random subset of 50,000 molecules from the ChemDB database^{8}^{,}^{13} using Daylight-style fingerprints of length 512 (with folded-OR lossy compression). For example, for a similarity threshold of *t* = 0.8 and a query containing *B* = 200 1-bits, on average 80% of the database can be pruned from the search.

The case *M* = 2 subsumes the case *M* = 1 and allows one to prune additional molecules. In this case, we have *$\stackrel{\u20d7}{b}$* = (*b*_{1},*b*_{2}) and *$\stackrel{\u20d7}{c}$* = (*c*_{1},*b*_{2}). Here *b*_{1} represents the number of 1-bits in *$\stackrel{\u20d7}{B}$* falling on odd-numbered components, *b*_{2} the number of 1-bits in *$\stackrel{\u20d7}{B}$* falling on even-numbered components, and similarly for *$\stackrel{\u20d7}{c}$*. Obviously, *b*_{1}+*b*_{2} = *B* and *c*_{1} + *c*_{2} = *C*. When *M* = 2, the intersection bound allows one to prune all the molecules *$\stackrel{\u20d7}{C}$* such that

$$\text{min}({b}_{1},{c}_{1})+\text{min}({b}_{2},{c}_{2})\le \frac{t}{1+t}(B+C)$$

(27)

This yields four subcases:

*b*_{1}+*b*_{2}=*B*≤*t*(*B*+*C*)/(1 +*t*) which yields*C*≥*B/t*(as in the case of*M*= 1).*c*_{1}+*c*_{2}=*C*≤*t*(*B*+*C*)/(1 +*t*) which yields*C*≤*tB*(as in the case of*M*= 1).*b*_{1}+*c*_{2}≤*t*(*B*+*C*)/(1 +*t*) which yields*c*_{2}≤ −*b*_{1}+*t*(*B*+*C*)/(1 +*t*).*c*_{1}+*b*_{2}<*t*(*B*+*C*)/(1 +*t*) which yields*c*_{1}≤ −*b*_{2}+*t*(*B*+*C*)/(1 +*t*).

The first two subcases correspond to the pruning derived when *M* = 1. The last two subcases correspond to new additional pruning.

To see this, consider the example given above in the case of *M* = 1 with a query *$\stackrel{\u20d7}{B}$* with *B* = 300 and a similarity threshold of *t* = 0.8. By using the *M* = 1 level corresponding to the first two subcases, one needs to focus only on candidate molecules satisfying 240 < *C* < 375. By using *M* = 2 and the last two subcases, one can prune additional molecules that fall within this interval. For instance, assume that *$\stackrel{\u20d7}{b}$* = (200,100). Then even among the molecules *$\stackrel{\u20d7}{C}$* containing the same number of 1-bits as the query (i.e. *C* = *B* = 300), we can eliminate all of those associated with *$\stackrel{\u20d7}{c}$* = (*c*_{1},*c*_{2}) for which *c*_{1} ≤ 166 (subcase 4) or *c*_{2} ≤ 66 (subcase 3).

In short, it should be clear that the greater the value of *M*, the greater the amount of pruning associated with the intersection bound and Equation 25. While the amount of pruning is an essential consideration, other equally important factors must be taken into account in an actual database implementation. For instance, larger values of *M* may require additional storage space and computing time. Furthermore, the data structure required to implement the pruning algorithm is also an important consideration. While a complete analysis of these issues is beyond the scope of this paper, it is worth noting that the cases of *M* = 1 and *M* = ^{2} suggest a natural and efficient data structure. First, organize the fingerprints *$\stackrel{\u20d7}{C}$* in the database into bins containing all the fingerprints that have the same number *C* of 1-bits. Organize the bins by increasing values of *C*. Second, within any such bin, organize the fingerprints by increasing values of, for instance, *c*_{1} corresponding to the number of bits falling on odd components. For a given query *$\stackrel{\u20d7}{B}$* and threshold *t*, one can first use the intersection inequality with *M* = 1 to discard all the bins associated with values of *C* that are too large or too small and focus on bins with *C* in a limited range. Then, for these remaining bins, one can use the intersection inequality with *M* = 2 to discard additional molecules and focus on sub-bins of molecules with *c*_{1} in a limited range. This procedure can be further repeated using, for instance, *M* = 4 raising again interesting questions of optimality and tradeoffs.

Formulae for estimating the amount of pruning for different values of *M* and different thresholds *t* have been derived, but will be given elsewhere, together with an analysis of the corresponding tradeoffs. Yet, even without those analyses, it is clear that the intersection bound is a simple but powerful tool for organizing and searching large databases of fingerprint vectors.

Work supported by NIH Biomedical Informatics Training grant (LM-07443-01), NSF MRI grant (EIA- 0321390), and NSF grant 0513376 to PB.

1. Burkhard W, Keller R. Some Approaches to Best-Match File Searching. Communications of the ACM. 1973;16:230–236.

2. Shapiro M. The Choice of Reference Points in Best-Match File Searching. Communications of the ACM. 1977;20:339–343.

3. Shasha D, Wang T-L. New Techniques for Best-Match Retrieval. ACM Trans. Inf. Syst. 1990;8:140–158.

4. Downs GM, Willett P. Similarity Searching and Clustering of Chemical-Structure Databases Using Molecular Property Data. J. Chem. Info. Comp. Sci. 1994;34:1094–1102.

5. Swamidass S, Baldi P. Bounds and Algorithms for Exact Searches of Chemical Fingerprints in Linear and Sub-Linear Time. J. Chem. Info. Model. 2007;47:302–317. [PMC free article] [PubMed]

6. Willett P, Barnard J, Downs G. Chemical Similarity Searching. J. Chem. Info. Comp. Sci. 1998;38:983–996.

7. Leach AR, Gillet VJ. An Introduction to Chemoinformatics. Springer: Dordrecht, The Netherlands; 2005.

8. Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P. ChemDB: a Public Database of Small Molecules and Related Chemoinformatics Resources. Bioinformatics. 2005;21:4133–4139. [PubMed]

9. Baldi P, Hirschberg DS, Nasr RJ. Speeding Up Chemical Database Searches Using a Proximity Filter Based on the Logical Exclusive-OR. J. Chem. Info. Comp. Model. 2008;48:1367–1378. [PubMed]

10. Spath H. Cluster Analysis Algorithms. Ellis Horwood; 1980.

11. Lipkus A. A Proof of the Triangle Inequality for the Tanimoto Distance. J. Math. Chem. 1999;26:263–265.

12. Holliday JD, Hu CY, Willett P. Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity Using 2D Fragment Bit-Strings. Comb. Chem. High. Throughput Screen. 2002;5:155–166. [PubMed]

13. Chen J, Linstead E, Swamidass SJ, Wang D, Baldi P. ChemDB Update-Full Text Search and Virtual Chemical Space. Bioinformatics. 2007;23:2348–2351. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |