AntiCP3: prediction of anticancer proteins using evolutionary information from protein language models

Gupta, Amisha; Raghava, Gajendra Pal Singh (Advisor)

Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/1895

Title:	AntiCP3: prediction of anticancer proteins using evolutionary information from protein language models
Authors:	Gupta, Amisha; Raghava, Gajendra Pal Singh (Advisor)
Keywords:	AntiCP3; Language Models
Issue Date:	May-2025
Publisher:	IIIT-Delhi
Abstract:	Several computational tools have been developed to predict the anticancer nature of peptides, including AntiCP and AntiCP2. While these methods have been widely adopted by the scientific community, they are not suitable for predicting anticancer proteins, because proteins differ significantly from peptides in composition and sequence characteristics. In this study, we introduce AntiCP3, the first dedicated platform for the accurate prediction of anticancer proteins. Our approach begins with an in-depth compositional analysis, which revealed clear differences between anticancer peptides and proteins, reinforcing the need for a distinct predictive framework. To build this framework, we first implemented similarity-based methods, which provided only moderate performance. We then developed a range of machine learning and deep learning models using conventional protein features such as amino acid composition (AAC), dipeptide composition (DPC), and physicochemical properties (PCP). The Extra Trees classifier achieved the best performance among traditional models, with a maximum AUROC of 0.72. To enhance performance, we integrated evolutionary features by extracting Position-Specific Scoring Matrix (PSSM) profiles, which improved the AUROC to 0.79. We further fine-tuned the pre-trained ESM2-t33 protein language model on our curated dataset, using its ability to capture both structural and contextual information. This led to a substantial increase in performance, achieving an AUROC of 0.90. Finally, we developed a hybrid model that combines BLAST-based sequence similarity scores with the fine-tuned ESM2 model, resulting in the highest performance with an AUROC of 0.91. All models were rigorously trained using manual five-fold cross-validation, and the performance was further validated using an independent test set. To facilitate widespread usage, AntiCP3 has been implemented as both a user-friendly web server and a standalone package. Additionally, the best-performing model has been deployed on Hugging Face as an open-access resource, enabling direct integration into computational pipelines and promoting reproducible research.
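The conventional features named in the abstract, amino acid composition (AAC) and dipeptide composition (DPC), can be illustrated with a minimal sketch. This is not the AntiCP3 code: the function names `aac` and `dpc`, the 20-letter alphabet constant, and the choice to skip non-standard residues are assumptions made here for illustration, using the standard definitions of the two features (residue frequencies and adjacent-pair frequencies).

```python
from itertools import product

# The 20 standard amino acids (one-letter codes); an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    total = len(residues)
    return {a: (residues.count(a) / total if total else 0.0) for a in AMINO_ACIDS}


def dpc(sequence):
    """Dipeptide composition: fraction of each adjacent residue pair (400 values)."""
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Pairs containing a non-standard residue are skipped (illustrative choice).
        if pair in counts:
            counts[pair] += 1
            total += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical sequence, not from the AntiCP3 dataset
    features = list(aac(example).values()) + list(dpc(example).values())
    print(len(features))  # 420 features per sequence
```

Concatenating the two dictionaries' values gives a fixed-length vector of 420 features per sequence, which is the kind of input a tree-based classifier such as Extra Trees can consume regardless of protein length.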
URI:	http://repository.iiitd.edu.in/xmlui/handle/123456789/1895
Appears in Collections:	Year-2025

Files in This Item:

File	Description	Size	Format
MT23225_Amisha Gupta.pdf		8.56 MB	Adobe PDF
