Abstract:
Long non-coding RNAs (lncRNAs) are key regulators of gene expression, and their stability, commonly quantified as half-life, plays a critical role in cellular function. Recent computational efforts to predict RNA half-life from sequence have had limited success. For instance, Shi et al. applied deep learning models and initially reported Spearman correlations of 0.7-0.8, but performance dropped to 0.06-0.09 under five-fold cross-validation. In this study, we developed machine learning and deep learning models that use sequence-derived features to predict lncRNA half-life. Among the approaches tested, a Random Forest trained on nucleotide composition features performed best, achieving a Spearman correlation of 0.9862 on the training dataset but only 0.0592 on the validation dataset, a gap that indicates severe overfitting. Furthermore, clustering analysis revealed that different transcript groups exhibited nearly identical mean half-life distributions, indicating that sequence-derived features alone do not meaningfully stratify lncRNAs by stability. These results, consistent with prior studies, demonstrate the persistent difficulty of predicting RNA half-life in silico. Adding features such as RNA-binding protein motifs, structure-based minimum free energy, and subcellular localization did not improve model performance, which suggests that RNA stability is regulated by factors beyond those included. In this paper, we therefore outline the approaches studied and the challenges of predicting RNA stability, and we highlight the need to integrate multi-omic strategies or to design new algorithms for predicting RNA half-life.
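
As an illustration of the two quantities the abstract relies on, the following is a minimal sketch of nucleotide composition features and a rank-based (Spearman) correlation, written in plain Python. It is not the pipeline used in this study: the function names `kmer_composition`, `average_ranks`, and `spearman` are hypothetical, the sequences and half-lives are made up, and the Random Forest step is omitted entirely.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=1):
    """Relative frequency of every k-mer over the ACGU alphabet (hypothetical helper)."""
    seq = seq.upper().replace("T", "U")  # treat DNA-style input as RNA
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # k-mers with other symbols are ignored
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else float("nan")


# Toy, made-up sequences and half-lives (hours), purely for illustration.
toy = {"AUGGCCAUUG": 2.1, "GGGCCCGGGA": 5.4, "AUAUAUUUAA": 1.3, "ACGUACGUAC": 3.8}
gc = [kmer_composition(s)["G"] + kmer_composition(s)["C"] for s in toy]
rho = spearman(gc, list(toy.values()))
```

Computing `spearman` separately on training predictions and on held-out predictions is what exposes a train/validation gap of the kind reported above; a high value on training data alone says little about generalization.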