Abstract:
This project compares traditional baseline models, namely a Naive Bayes model and a Long Short-Term Memory (LSTM) model, with advanced transformer models, namely Strainformer and Vaxformer. The primary objective is to assess how effectively each model generates new sequences, evaluating the generated sequences on three criteria: antigenicity score (NetMHCpan), stability (DDGun), and Root Mean Square Deviation (RMSD) from a reference spike protein (AlphaFold). Each model is trained on relevant biological sequence datasets chosen to reflect the diversity of antigenic proteins. After training, the models will generate novel sequences, whose antigenic properties will be quantified with NetMHCpan. Stability will be assessed with DDGun to determine whether the generated sequences can consistently maintain desirable antigenic features. In addition, the generated sequences will be compared to the reference spike protein, with RMSD quantifying the structural differences between each generated sequence and the reference. This analysis aims to provide insight into the structural fidelity of the generated sequences and their potential practical utility in vaccine development for COVID-19 or other diseases.
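
As an illustration of the RMSD metric mentioned above, the following is a minimal sketch rather than the project's actual pipeline. It assumes that the two structures have already been superimposed and that their atoms (for example, C-alpha atoms) are listed in matching order; the function name `rmsd` and the tuple-based coordinate format are hypothetical choices made for this example. No structural alignment is performed here.

```python
import math
from typing import Sequence, Tuple

# Assumed representation: one (x, y, z) tuple per atom, in angstroms.
Coord = Tuple[float, float, float]


def rmsd(generated: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """Root mean square deviation between two coordinate sets.

    Assumes both sets are already superimposed and that atom i in
    `generated` corresponds to atom i in `reference`.
    """
    if len(generated) != len(reference):
        raise ValueError("coordinate sets must contain the same number of atoms")
    if not generated:
        raise ValueError("coordinate sets must be non-empty")

    # Sum of squared Euclidean distances between corresponding atoms.
    squared_sum = sum(
        (gx - rx) ** 2 + (gy - ry) ** 2 + (gz - rz) ** 2
        for (gx, gy, gz), (rx, ry, rz) in zip(generated, reference)
    )
    # Mean over atoms, then square root.
    return math.sqrt(squared_sum / len(generated))
```

A lower value indicates that the generated structure lies closer to the reference. In practice the two structures would first need to be optimally superimposed, a step this sketch deliberately leaves out.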