An ideal lead candidate for an anticancer drug is one that is non-toxic to the host, is well absorbed and so can be administered orally, and is effective at inhibiting cancer cell growth. Data on safety, pharmacokinetics, and cytotoxicity are expensive to generate in the laboratory, however, and there is a need to develop reliable in-silico predictive models.
One aspect of developing reliable models is to make efficient use of all available training data. For example, if training data are available for an additional task that is related to the primary task of interest, such data could be useful in constructing a more reliable model. This paper explores the use of a multitask model for oral clearance, where bioavailability is the second task. It is among the first papers to report results for a human oral clearance model. Multitask models are most useful when data are limited, as is the case with the oral clearance model. In some cases, however, multitask models can also be useful with larger data sets. For the cytotoxicity models constructed here, use of a multitask model did not affect accuracy, but it did reduce the computation time of the record selection algorithm.
Another aspect of developing reliable models is to base the models on random samples from well-defined populations. Moreover, the training populations should be very similar or identical to the population of compounds that one wishes to make predictions for. This is often difficult to achieve in practice, and new solutions are not proposed in this paper. This important topic is addressed more fully in the discussion section, but the reader should be aware that while the accuracy results presented here are valid for the training and testing sets, they may or may not be valid for predictions on other sets. Nevertheless, predictions are made here on other compounds to demonstrate the approach of using multiple QSAR models to screen a large compound library. Predicted results for any particular compound passing the screen would need to be verified in the laboratory. Such an approach has been used by Boik et al. [1] in a small study.
Predictive models of safety, pharmacokinetics, and cytotoxicity could be designed and used for a variety of purposes. Keeping in mind the model's limitations, the intended purpose in this paper was to screen a large library of natural products for those that might be suitable for preclinical study as components of anticancer drug mixtures. The criteria for suitability were that a compound be predicted to:
• inhibit multiple cancer cell lines in vitro at modest to low concentrations (IC50 of 50 μM or below),
• be of low systemic toxicity (rat LD50 > 1920 mg/kg/day), and
• exhibit a low to modest oral clearance (<83 L/hr in humans).
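The three criteria amount to a conjunctive filter over predicted properties. A minimal sketch of such a filter is shown below; the function and compound names are hypothetical, and the predicted values are illustrative rather than taken from the paper:

```python
# Hypothetical screening filter applying the three suitability criteria.
# Thresholds are those stated in the text; compound data are illustrative.

def passes_screen(pred_ic50_uM, pred_ld50_mg_kg, pred_clearance_L_hr):
    """Return True if a compound meets all three predicted criteria."""
    cytotoxic = pred_ic50_uM <= 50           # inhibits cancer cells at <= 50 uM
    low_toxicity = pred_ld50_mg_kg > 1920    # rat LD50 above 1920 mg/kg/day
    modest_clearance = pred_clearance_L_hr < 83  # human oral clearance < 83 L/hr
    return cytotoxic and low_toxicity and modest_clearance

candidates = {
    "compound_A": (12.0, 2500.0, 40.0),
    "compound_B": (75.0, 3000.0, 20.0),  # fails the cytotoxicity criterion
}
hits = [name for name, vals in candidates.items() if passes_screen(*vals)]
print(hits)  # ['compound_A']
```

In practice each predicted value would come from the corresponding QSAR classifier, and any compound passing the screen would still require laboratory verification.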
Note that if the goal were to identify promising compounds for study as individual drugs, as opposed to components of mixtures, different criteria would likely be used. For example, more potent cytotoxic agents might be desired. In addition, only novel compounds might be of interest.
Three QSAR classification models were constructed. QSAR models identify statistical relationships between a response (also called a task or target) and molecular features of a compound, such as molecular weight, logP, and functional group counts. As noted above, the three proposed models are based on data for human oral clearance, rat LD50, and in-vitro cytotoxicity. Oral clearance is a measure of the rate of drug removal from the body after oral administration, and LD50 refers to the expected dose needed to kill 50 percent of an animal population. The three models were applied to a set of over 115,000 natural compounds and hundreds were predicted to be cytotoxic, of low systemic toxicity, and of low to modest oral clearance.
The data modeled here were challenging. Correlations between single features and responses were very weak and many of the features were highly correlated with one another. In addition, a large number of features (>1600) were employed, which in the case of the oral clearance model was greater than the number of records. (The term record is used to refer to the combination of observed responses and calculated features for a single drug.) Lastly, the processes modeled were biochemically complex and observed responses were noisy. For example, measurements of oral clearance commonly exhibit a within-study coefficient of variation of 25 to 100 percent [2]. In a 1979 report, LD50 values were observed to vary by as much as 3- to 11-fold between different laboratories [7]. The cytotoxicity data were also noisy, typical of high-throughput screening experiments. With regard to classification, noisy measurements are particularly problematic when they occur at the thresholds used to demarcate active from inactive drugs. In the cytotoxicity data modeled here, compounds were concentrated near these thresholds.
Models were constructed using Kernel Multitask Latent Analysis (KMLA), an algorithm developed by Xiang and Bennett [8] based on earlier work by Momma and Bennett [9] and used here with minor changes. KMLA is closely related to partial least squares (PLS), an algorithm that is commonly used in QSAR and microarray studies when features are highly correlated and the number of records is small compared to the number of features [10]. PLS algorithms were originated by Wold in the 1970s and were later refined by a number of researchers [17]. Briefly, PLS algorithms use a series of linear projections to create a small set of orthogonal "latent" data columns from the original data so that the covariance between the latent features and response is maximized. In this way, the dimensions of the original data are greatly reduced, problems with correlated features are eliminated, and maximal information related to the response is retained. Whereas PLS produces linear models, KMLA can produce nonlinear models, with the degree of nonlinearity determined by the choice of kernel function and kernel parameters.
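The projection step can be illustrated with a minimal NIPALS-style PLS1 sketch, which extracts orthogonal latent columns that maximize covariance with the response and deflates the feature matrix after each extraction. This is a didactic sketch only, not the PLS or KMLA variant used in the paper, and the data are synthetic:

```python
import numpy as np

def pls1_latent(X, y, n_components=2):
    """Minimal NIPALS-style PLS1: extract orthogonal latent columns
    whose covariance with the response is maximal, deflating X after
    each component. Didactic sketch, single task, linear only."""
    X = X - X.mean(axis=0)           # center features
    y = y - y.mean()                 # center response
    T = []
    for _ in range(n_components):
        w = X.T @ y                  # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = X @ w                    # latent feature (score) column
        p = X.T @ t / (t @ t)        # loading used to deflate X
        X = X - np.outer(t, p)       # remove variation explained by t
        y = y - t * (y @ t) / (t @ t)
        T.append(t)
    return np.column_stack(T)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))       # many features, few records
y = X[:, 0] + 0.1 * rng.normal(size=30)
T = pls1_latent(X, y, n_components=3)
print(T.shape)  # (30, 3)
```

The successive score columns are mutually orthogonal, which is what eliminates the correlated-feature problem in the reduced space; KMLA applies the same idea in a kernel-induced feature space, yielding nonlinear models.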
KMLA is designed for (nonlinear) multitask learning. Multitask learning can be useful when the number of records is small relative to the number of features [21], as is the case for the oral clearance data. Multitask learning models have been proposed by several authors [22], although their use in QSAR is still rare. In KMLA, collective learning is ensured by forcing all problems to use a shared set of latent features. This type of collective learning has been referred to as common feature mapping. In the common latent feature space each task is independently treated as a single-task learning problem [26]. Because each task is independently modeled, tasks need not share common records and multiple types of models can be used (classification for one task and regression for another, for example). The common set of latent features is obtained by minimizing loss functions across modeled tasks.
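The structure of common feature mapping can be sketched as follows: one shared latent map, with each task fitted independently on its own records in the latent space. In this toy sketch the shared map is simply assumed to be given (KMLA learns it by minimizing loss across tasks, which is omitted here), each per-task model is ordinary least squares, and all names and data are illustrative:

```python
import numpy as np

# Sketch of common-feature-mapping multitask learning: every task shares
# one latent feature map, but each task is fitted independently on its
# own records. The shared map is assumed given here; KMLA would learn it
# by minimizing loss functions across all modeled tasks.

rng = np.random.default_rng(1)
p, k = 100, 5
W_shared = rng.normal(size=(p, k))       # shared latent feature map (assumed)

def latent(X):
    return X @ W_shared                  # map raw features to k latent columns

def fit_task(X, y):
    """Independent single-task model in the shared latent space."""
    Z = latent(X)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

# Tasks need not share records: different compound sets and sizes per task.
X_clearance, y_clearance = rng.normal(size=(40, p)), rng.normal(size=40)
X_bioavail, y_bioavail = rng.normal(size=(70, p)), rng.normal(size=70)

beta_cl = fit_task(X_clearance, y_clearance)
beta_ba = fit_task(X_bioavail, y_bioavail)
print(beta_cl.shape, beta_ba.shape)  # (5,) (5,)
```

Because each task is solved separately in the shared latent space, one task could just as well use a classifier while another uses a regressor, mirroring the flexibility described above.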