A decision theoretic approach to model evaluation in computational drug discovery
Watson O., Cortes-Ciriano I., Taylor A., Watson JA.
Artificial intelligence, trained via machine learning or computational statistics algorithms, holds much promise for the improvement of small-molecule drug discovery. However, structure-activity data are high-dimensional with low signal-to-noise ratios, and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs. Twenty-five publicly available molecular datasets were extracted from ChEMBL. Neural nets, random forests, support vector machines (regression) and ridge regression were then fitted to the structure-activity data. A new validation method, based on quantile splits on the activity distribution function, is proposed for the construction of training and testing sets. Model validation based on random partitioning of available data favours models which overfit and 'memorize' the training set, namely random forests and deep neural nets. Partitioning based on quantiles of the activity distribution correctly penalizes models which cannot extrapolate onto structurally different molecules outside of the training data. This approach favours more constrained models, namely ridge regression and support vector regression. In addition, our new rank-based loss functions give considerably different results from mean squared error, highlighting the necessity to define model optimality with respect to the decision task at hand. Model performance should be evaluated from a decision theoretic perspective with subjective loss functions. Data-splitting based on the separation of high and low activity data provides a robust methodology for determining the best extrapolating model. Simpler, traditional statistical methods such as ridge regression outperform state-of-the-art machine learning methods in this setting.
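The contrast between random partitioning and an activity-quantile split can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_split` and `quantile_split`, the 25% hold-out fraction, the choice to hold out the high-activity tail, and the toy activity values are all assumptions made for illustration, since the abstract does not specify them.

```python
import random

def random_split(n, test_fraction=0.25, seed=0):
    """Baseline: randomly partition indices 0..n-1 into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])

def quantile_split(activities, test_fraction=0.25):
    """Assumed variant: hold out the most active compounds as the test set.

    The training set is everything below the (1 - test_fraction) quantile
    of the activity values; the test set is everything above it.
    """
    # Indices ordered from least to most active.
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    n_test = int(round(len(activities) * test_fraction))
    cut = len(order) - n_test
    return sorted(order[:cut]), sorted(order[cut:])

# Hypothetical activity values (illustrative numbers, not from ChEMBL).
activities = [5.1, 7.9, 6.3, 4.8, 8.4, 6.0, 5.5, 7.2]

train, test = quantile_split(activities, test_fraction=0.25)
print("train indices:", train)
print("test indices:", test)
```

Under this split every test compound is more active than every training compound, so a model scores well only if it extrapolates beyond the activity range it was fitted on. With a random split the test activities typically fall inside the training range, so interpolation alone can score well.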