Abstract:
In recent years, there has been significant interest in using Machine Learning (ML) and Deep Learning (DL) to predict protein-ligand binding affinity, driven by the rapid growth of computational approaches in drug discovery. Binding affinity prediction is useful in the virtual screening and drug screening optimization steps of drug discovery. Conventional approaches are time-consuming, complex, and challenging, whereas computational approaches have expedited the drug discovery timeline, and ML- and DL-based approaches have shown notable improvements over the conventional ones. In this study, we develop ML models and benchmark selected DL models for predicting protein-ligand binding affinity. We used the refined set of the PDBbind database (version 2020) as the source of protein-ligand structural data and binding affinity data. For the ML models, we featurized the protein-ligand complexes using tools such as RDKit/Mordred and Pfeature, followed by feature selection. Models such as SVM, Random Forest, and Multiple Linear Regression were used to predict the binding affinity of the protein-ligand complexes. Of the ML models we tested, Random Forest performed best, with an R-squared value of 0.6. Further, we benchmarked the CNN-based DL models Pafnucy and OnionNet-2 using the PDBbind refined set as the benchmarking test dataset. OnionNet-2 showed better predictive performance (R-squared of 0.85) than Pafnucy (R-squared of 0.46). We discuss this relative performance in the study. Hence, of all the approaches we used, OnionNet-2 benchmarked on the PDBbind refined set gave the highest R-squared value.
We also discuss the reasons for the variation in performance across models and the future scope of the study.
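
For readers comparing the reported scores, the sketch below shows how an R-squared value can be computed from experimental and predicted affinities. It assumes R-squared means the coefficient of determination, 1 - SS_res / SS_tot; the abstract does not state which definition was used, and some benchmarks report the squared Pearson correlation instead. The function name `r_squared` and the affinity values are hypothetical and are not data from this study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical experimental and predicted affinities (illustrative only).
experimental = [4.0, 5.0, 6.0, 7.0, 8.0]
predicted = [4.5, 5.0, 5.5, 7.5, 7.5]

score = r_squared(experimental, predicted)
print(round(score, 2))  # 0.9
```

Under this definition, a model that always predicts the mean experimental affinity scores 0, and a perfect model scores 1, which gives a scale for reading the values of 0.46, 0.6, and 0.85 reported above.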