Abstract:
The thermal stability of a protein is a fundamental biophysical property that governs its structural integrity, functional efficacy, and practical utility, particularly its shelf-life in therapeutic and industrial applications. Consequently, the ability to accurately predict a protein's melting temperature (Tm) from its primary sequence is a central challenge in protein engineering, synthetic biology, and drug development. While experimental determination of Tm is precise, it is resource-intensive and low-throughput, creating a bottleneck in large-scale protein design projects. Various computational methods have been developed to address this gap, and the recent success of deep learning, especially of large language models in understanding biological sequences, presents powerful new opportunities to decipher the intricate relationship between a protein's sequence, its structure, and its thermal tolerance.

This thesis presents a comprehensive investigation into deep learning methodologies for predicting protein thermal stability, systematically comparing sequence-based and structure-based computational paradigms. The entire analysis is benchmarked on the extensive Novozymes dataset, comprising 31,384 diverse protein sequences with experimentally validated melting temperatures.

Our investigation commenced with sequence-based models. We first established performance baselines using traditional machine learning approaches built on classical biochemical features: an Artificial Neural Network (ANN) and a LightGBM model, both trained on physicochemical properties (PCP) extracted with the pFeature library. To move beyond handcrafted features and capture richer contextual information, we then leveraged pre-trained protein language models; an ANN utilizing static embeddings extracted from the ProtT5-BERT model showed a marked improvement over these baselines. The pinnacle of our sequence-based approach was achieved by fine-tuning the entire ProtT5-BERT model directly on our thermal stability dataset. This transfer learning strategy yielded the highest overall performance, demonstrating that allowing the model to adapt its internal representations to the specific task of stability prediction is superior to using static, pre-computed embeddings.

In parallel, we explored a structure-based approach, predicated on the principle that a protein's three-dimensional fold is a key determinant of its stability. Because experimental structures were unavailable for the vast majority of the dataset, we first predicted 3D structures for all protein sequences using the state-of-the-art ESMFold model. We then developed a Graph Neural Network (GNN), a class of models inherently suited to graph-structured data such as protein structures, to predict thermal stability from these predicted folds. While valuable, this structure-based model performed notably below our top sequence-based models, potentially reflecting compounding errors from the initial structure prediction step or the inherent difficulty of learning stability from static structural snapshots.

Our comparative analysis conclusively demonstrates that, while both sequence and structural modalities contain stability-related information, the fine-tuned sequence-based language model significantly outperforms all other methods. This work underscores the immense and still-unfolding potential of protein language models to accurately predict complex functional properties like thermal stability directly from sequence data.
This provides the scientific community with a powerful and scalable tool for high-throughput in-silico screening, accelerating the design and optimization of novel, hyper-stable proteins for a wide range of applications.
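To make the classical baseline concrete, the sketch below trains a LightGBM regressor on precomputed physicochemical features. The CSV layout, the "tm" column name, and the hyperparameters are illustrative assumptions; the thesis's actual features come from pFeature.

```python
# A minimal sketch of the classical baseline, assuming PCP feature vectors
# have already been exported (e.g., by pFeature's standalone tool) to a CSV
# with one row per sequence and a "tm" column holding the measured melting
# temperature. File layout and hyperparameters are illustrative assumptions.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("pcp_features.csv")            # hypothetical feature export
X, y = df.drop(columns=["tm"]), df["tm"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```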
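The static-embedding pipeline can be sketched as follows. The public `Rostlab/prot_t5_xl_uniref50` checkpoint and the mean-pooling step are assumptions standing in for the thesis's ProtT5-BERT setup; any fixed-size pooled vector can then feed an ANN regressor.

```python
# A minimal sketch of static embedding extraction with Hugging Face
# transformers; the checkpoint name is an assumed stand-in for the
# thesis's ProtT5 variant.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
encoder.eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pool per-residue encoder states into one fixed-size vector."""
    # ProtT5 expects space-separated residues, with rare amino acids mapped to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", sequence))
    batch = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        states = encoder(**batch).last_hidden_state  # (1, L+1, 1024)
    return states[0, :-1].mean(dim=0)  # drop the </s> token, average residues
```

Mean pooling is one common choice here; per-residue embeddings could equally be kept for residue-level models.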
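Fine-tuning differs from the static pipeline mainly in that the encoder's weights also receive gradients. A minimal sketch, assuming the same checkpoint, an MSE objective on Tm, and an illustrative head width and learning rate:

```python
# A minimal sketch of end-to-end fine-tuning: the encoder stays trainable
# and a small regression head maps pooled residue states to Tm. Checkpoint,
# head size, and learning rate are assumptions, not the thesis's exact setup.
import torch
import torch.nn as nn
from transformers import T5EncoderModel

class TmRegressor(nn.Module):
    def __init__(self, checkpoint="Rostlab/prot_t5_xl_uniref50"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(checkpoint)  # trainable
        self.head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()       # ignore padding
        pooled = (states * mask).sum(1) / mask.sum(1)     # masked mean pooling
        return self.head(pooled).squeeze(-1)

model = TmRegressor()
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # low LR for fine-tuning
# Training loop (per batch): loss_fn(model(ids, mask), tm).backward(); optimizer.step()
```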
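For the structure-based branch, a residue-level graph regressor in PyTorch Geometric might look like the following. The contact-graph construction (e.g., edges between residues whose C-alpha atoms lie within 8 Å) and one-hot amino-acid node features are assumptions about preprocessing, not the thesis's exact recipe; ESMFold predictions (e.g., via `esm.pretrained.esmfold_v1()`) are assumed to have been converted to `torch_geometric.data.Data` objects beforehand.

```python
# A minimal sketch of a structure-based Tm regressor over residue graphs
# built from predicted structures; graph construction is assumed upstream.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class StabilityGNN(nn.Module):
    def __init__(self, in_dim=20, hidden=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, data):
        # Message passing over the residue contact graph.
        h = torch.relu(self.conv1(data.x, data.edge_index))
        h = torch.relu(self.conv2(h, data.edge_index))
        h = global_mean_pool(h, data.batch)   # one vector per protein
        return self.head(h).squeeze(-1)       # predicted Tm
```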