Abstract:
Several computational tools have been developed to predict the anticancer nature of peptides, including AntiCP and AntiCP2. While these methods have been widely adopted by the scientific community, they are not suitable for predicting anticancer proteins, because proteins differ significantly from peptides in composition and sequence characteristics. In this study, we introduce AntiCP3, the first dedicated platform for the accurate prediction of anticancer proteins. Our approach begins with an in-depth compositional analysis, which revealed clear differences between anticancer peptides and proteins, reinforcing the need for a distinct predictive framework. To build this framework, we first implemented similarity-based methods, which provided only moderate performance. We then developed a range of machine learning and deep learning models using conventional protein features such as amino acid composition (AAC), dipeptide composition (DPC), and physicochemical properties (PCP). The Extra Trees classifier achieved the best performance among traditional models, with a maximum AUROC of 0.72. To enhance performance, we integrated evolutionary features by extracting Position-Specific Scoring Matrix (PSSM) profiles, which improved the AUROC to 0.79. We further fine-tuned the pre-trained ESM2-t33 protein language model on our curated dataset, using its ability to capture both structural and contextual information. This led to a significant increase in performance, achieving an AUROC of 0.90. Finally, we developed a hybrid model that combines BLAST-based sequence similarity scores with the fine-tuned ESM2 model, resulting in the highest performance with an AUROC of 0.91. All models were rigorously trained using manual five-fold cross-validation, and the performance was further validated using an independent test set. To facilitate widespread usage, AntiCP3 has been implemented as both a user-friendly web server and a standalone package.
Additionally, the best-performing model has been deployed on Hugging Face with open access, enabling direct integration into computational pipelines and promoting reproducible research.
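
For readers unfamiliar with the conventional features mentioned above, the following is a minimal Python sketch of how amino acid composition (AAC) and dipeptide composition (DPC) are commonly computed from a sequence. It is an illustration under stated assumptions, not the AntiCP3 implementation: the function names `aac` and `dpc` are hypothetical, values are expressed as percentages, and non-standard residues are assumed to count toward the sequence length without contributing to any feature.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq: str) -> list[float]:
    """Amino acid composition: percentage of each standard residue (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Assumption: non-standard residues stay in the denominator.
    return [100.0 * seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq: str) -> list[float]:
    """Dipeptide composition: percentage of each adjacent residue pair (400 values)."""
    seq = seq.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    # Dict insertion order fixes the feature order: AA, AC, AD, ..., YY.
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [100.0 * c / n_pairs for c in counts.values()]
```

Concatenating the two vectors gives a 420-dimensional feature vector per sequence, which is the kind of fixed-length input that a tree-based classifier such as Extra Trees can consume directly.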