Computer-aided drug design has achieved wide use, judging by the prevalence of molecular modeling software within the pharmaceutical industry. A common observation, however, is that the performance of methods in practice, on real projects involving the design and testing of new molecules, falls far short of the expectations generated by initial reports of new methods and corresponding validation data. Each such report involves a hypothesis of the form that some method is good for some application, and data are supplied to bolster the assertion.
There are two logical fallacies of hypothesis formation and testing, which, while not exclusive
to molecular modeling, form especially frequent traps within this field. The first is known in Latin as cum hoc ergo propter hoc,
meaning “with this, therefore because of this.” This will be referred to in what follows as the correlation fallacy
because the error lies in conflating mere correlation with causation. The second encompasses a large variety of human behaviors and is called confirmation bias. This logical fallacy takes the form of seeking confirmatory evidence for a hypothesis while discounting or excluding evidence that may tend against it.
The correlation fallacy was notably highlighted with respect to QSAR by Stephen Johnson [2
]. Johnson illustrated the issue with the strong correlation over time between the quantity of fresh lemons imported into the USA and the decline in the US highway fatality rate. There are, of course, many possible reasons why the fatality rate dropped, but selecting one purely by virtue of its correlation with the outcome, independent of any physical mechanism, is not supportable. Yet this is how much of QSAR has been practiced for many years: selection of a particular model from among many based on which has the best correlation or fitness score of some type, irrespective of any relationship to underlying physical reality. Such models are then typically tested against some set of data that has been withheld, often by random partitioning. Random partitioning of a set of strongly related molecules will typically yield a test set in which, for each molecule, there exists a highly similar training molecule. In general, the models will exhibit adequate performance on such tests, but will tend to perform poorly in the presence of many “activity cliffs” (relatively infrequent cases where small changes in structure produce large changes in activity). This is an example of a confirmation bias masked by what might seem to be a reasonable validation procedure.
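The optimism introduced by random partitioning can be seen in a toy experiment. The sketch below is purely illustrative: the feature sets stand in for molecular fingerprints, and the "series" structure and all names are invented, not real data. It compares, for each test molecule, the Tanimoto similarity to its nearest training molecule under a random split versus a split that withholds whole series at once.

```python
import random

random.seed(0)

def tanimoto(a, b):
    """Tanimoto similarity between two feature sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

# Ten invented chemical "series": each shares a large common core of
# features, plus small per-analog variations (mimicking close analogs).
molecules = []
for s in range(10):
    core = {f"s{s}_core{i}" for i in range(20)}
    for a in range(10):
        variation = {f"s{s}_analog{a}_{i}" for i in range(3)}
        molecules.append((s, core | variation))

def mean_nearest_train_sim(train, test):
    """Average, over test molecules, of similarity to the nearest training molecule."""
    return sum(
        max(tanimoto(fp, fp_tr) for _, fp_tr in train) for _, fp in test
    ) / len(test)

# Random partitioning: close analogs straddle the train/test boundary.
shuffled = molecules[:]
random.shuffle(shuffled)
rand_sim = mean_nearest_train_sim(shuffled[:80], shuffled[80:])

# Series-held-out partitioning: whole series are withheld together.
grp_sim = mean_nearest_train_sim(
    [m for m in molecules if m[0] < 8], [m for m in molecules if m[0] >= 8]
)

# Nearly every random-split test molecule has a near-duplicate in training
# (within-series similarity is 20/26, about 0.77); held-out series have none.
print(f"random split:    {rand_sim:.2f}")
print(f"series held out: {grp_sim:.2f}")
```

On clustered data like this, the random split reports high nearest-neighbor similarity, so any model driven by 2D similarity will look good on the withheld set, while holding out whole series exposes the lack of generalization. This is why cluster- or scaffold-based splits are a common remedy for the confirmation bias described above.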
There is a famous scene in “Monty Python and the Holy Grail” beginning with an accusation of witchcraft. Through serial application of the correlation fallacy, the crowd reasons that because witches are to be burned, the accused is a witch if she weighs the same as a duck. A suitable balancing scale and duck were produced, showing that the accused did, in fact, weigh the same as a duck. Therefore, went the reasoning, she was a witch! Confirmation bias was ubiquitous in historical witch hunts [1]. In 17th-century France, a legal authority was quoted as follows:
“He who is accused of sorcery should never be acquitted, unless the malice of the prosecutor be clearer than the sun; for it is so difficult to bring full proof of this secret crime, that out of a million witches not one would be convicted if the usual course were followed!”
Thus, confirmation of witchcraft was essentially guaranteed, making use of special rules of evidence in such trials in order to achieve the desired outcome.
For modeling, this amounts to rejecting a model or method by impugning the motives of the proponent, but not through the use of data and analysis to test the method itself. Far too large a fraction of research in computer-aided drug design is the logical equivalent of witch hunting. Modeling methods are built upon faulty reasoning rife with correlation fallacies and then are “tried” with validation procedures that embed confirmation bias, either by design or through ignorance. A model or method is thus produced with magical predictive powers. As a field, we should strive to do better.
But doing better is not without challenges. One reason that molecular modeling is particularly badly afflicted by these two fallacies is that the data themselves suffer from the afflictions, since small molecules are made by people who, in their own reasoning, exhibit both the correlation fallacy and confirmation bias. Figure 1 shows nine molecules that illustrate this point, all of whose modulatory effects on cardiac potassium channel current were published by 1992 (later understood to be primarily governed by hERG [3]). The six molecules in the upper left are all phenyl-methane-sulfonamides with a large para substituent containing a tertiary amine. The preponderance of this exact substructure shows an adherence to a correlative assumption: that this particular right-hand side of these molecules is related to favorable activity. It also shows a tendency toward confirmation bias: many more molecules were probably made with the common core of dofetilide/ibutilide than with something else.
Figure 1: Structures of nine molecules known, as of 1992, to modulate cardiac potassium channel current (one major responsible gene product was later found to be hERG).
We have previously quantified this effect in molecular design [4
], describing it as an artifact of a human inductive bias toward 2D topological similarity in reasoning about and predicting molecular activity. We showed that the 2D similarity among molecule pairs designed intentionally to hit a particular target was much higher than between molecule pairs where one hit the target of the other as a side-effect. Recently, we showed that such a design bias limits the potential therapeutic novelty of new small molecules [5
]. The important point here is that the logical fallacies of correlation and confirmation drive the production of medicinal molecules themselves.
Given that the very production of molecules on which to make predictions of biological activity is linked with these two logical fallacies, it is easy to see how methodological development and validation can be led astray. For example, in developing a theory of the movement of celestial bodies, one might choose to make only observations that tend to confirm the theory. However, such a choice does not prevent contrary observations from existing. Others are free to make those observations, and the theory can be invalidated. In contrast, in molecular modeling, molecules that would test a contrary hypothesis about biological activity often are never made. The space of observables in molecular modeling is thus shaped by the correlation fallacy and confirmation bias, and even a benign selection of data will tend to be favorably confirmatory of methods whose underpinnings parallel the biases of medicinal chemical production.
Why does this matter? First, as evident from the molecules on the lower right of Figure 1, excessive bias in design would miss the fact that neither the methyl-sulfonamide, nor the phenyl, nor the tertiary amine is necessary for activity against the hERG potassium channel. Second, it is not the nominal cost of a successful drug discovery and development project that dominates pharmaceutical innovation. It is the amortized cost of the much more frequent failures that is the chief problem, stemming from failure rates from Phase 1 onward of about nine in ten clinical candidates [6]. A particularly expensive form of failure can arise from post-marketing withdrawal due to unwanted side effects. Figure 2 shows the structures of five molecules, all withdrawn for hERG-related toxicity after the molecules in Figure 1 and their activity in corresponding potassium channel assays were well known. Terfenadine and astemizole were developed as antihistamines, mibefradil as a calcium channel blocker for hypertension, cisapride as an agonist of 5HT4 for heartburn, and thioridazine as an antipsychotic that derived its therapeutic effect primarily from dopamine receptor antagonism. None of these molecules exhibits such obvious structural similarity to those in Figure 1 that a reasonable medicinal chemist or modeler would have made a confident guess that they would have hERG activity sufficient to cause therapeutically disastrous side effects.
Figure 2: Structures of five drugs withdrawn from the US market due to inappropriate modulation of hERG (dates shown are FDA approval and withdrawal from market).
In order to help address the most serious challenges of the pharmaceutical industry, the question should not be the moral equivalent of “Does your model weigh the same as a duck?” The ability to identify data that provide favorable evidence about the performance of a method is not sufficient, just as finding a suitable scale and duck ought not to have been sufficient in the case of the witch in the “Holy Grail.” Those involved in methodological development and validation must take special care to avoid reasoning that involves the correlation fallacy and confirmation bias. The data we have available make these traps natural and ubiquitous, but difficult problems such as those highlighted by Figures 1 and 2 are unlikely to be solved by methods that do not get at the underlying physical phenomena that drive the biological activity of small medicinal molecules. In what follows, we will touch on these issues as they relate to off-target predictive modeling, QSAR, molecular similarity computation, and docking.