Supervised learning models, also known as quantitative structure-activity regression (QSAR) models, are increasingly used in assisting the process of preclinical, small molecule drug discovery. The models are trained on data consisting of a finite dimensional representation of molecular structures and their corresponding target specific activities. These models can then be used to predict the activity of previously unmeasured novel compounds. In this work we address two problems related to this approach. The first is to estimate the extent to which the quality of the model predictions degrades for compounds very different from the compounds in the training data. The second is to adjust for the screening dependent selection bias inherent in many training data sets. In the most extreme cases, only compounds which pass an activity-dependent screening are reported. By using a semi-supervised learning framework, we show that it is possible to make predictions which take into account the similarity of the testing compounds to those in the training data and adjust for the reporting selection bias. We illustrate this approach using publicly available structure-activity data on a large set of compounds reported by GlaxoSmithKline (the Tres Cantos AntiMalarial Set) to inhibit in vitro P. falciparum growth.
stat.AP, stat.AP, stat.ML