Abstract:
Blood cancer has emerged as a growing concern over the past decade, making early detection essential for timely and effective treatment. Traditional diagnosis of blood cancers involves a series of pathological tests and consultations with medical experts, a process that is both time-consuming and financially burdensome. Genomic data analysis offers a promising avenue for understanding the pathogenesis of blood cancers and for identifying crucial biomarkers that could serve as therapeutic targets and ultimately impede disease progression. In this study, we examined the genomics of two prominent blood cancer types: Chronic Lymphocytic Leukemia (CLL) and Multiple Myeloma (MM). Treatment decisions for CLL and MM rely heavily on patient symptoms and are underpinned by the genetic anomalies in the patient’s genome. Here, we have undertaken a comprehensive omics data analysis using novel pipelines and methodologies developed in-house. Our aim was to uncover the genetic aberrations that underlie the development of these diseases and to identify pivotal biomarkers that hold promise as therapeutic targets for each category of haematological malignancy. Our first objective was to identify clinically relevant small non-coding RNAs (sncRNAs) in CLL through a comprehensive genome-wide study of RNA-Seq data. This analysis revealed a distinct pattern of dysregulated miRNAs in the CLL cohort. Among these, three miRNAs were up-regulated (hsa-mir-1295a, hsa-mir-155, and hsa-mir-4524a), while five miRNAs were down-regulated (hsa-mir-30a, hsa-mir-423, hsa-mir-486*, hsa-let-7e, and hsa-mir-744). Moreover, our investigation identified seven novel miRNA sequences with elevated expression in CLL, including tRNAs, piRNAs (piRNA-30799, piRNA-36225), and snoRNAs (SNORD43).
Notably, in CLL patients we observed a significant association between increased expression of hsa-mir-4524a and a shorter time to first treatment (TTFT) (HR: 1.916, 95% CI: 1.080–3.4, p-value: 0.026), and between higher expression of hsa-mir-744 and a longer TTFT (HR: 0.415, 95% CI: 0.224–0.769, p-value: 0.005). These findings suggest that, with further research, these differentially expressed miRNA (DEM) markers could be integrated into risk stratification models and prognostic approaches for CLL. We next developed an integrated and reproducible workflow for RNA-Seq data analysis, named miRPipe. This pipeline was designed to identify dysregulated sncRNAs, including miRNAs and piRNAs, as well as functionally similar miRNAs, often called miRNA paralogues. To evaluate and benchmark miRPipe, we introduced an in-house synthetic sequence simulator called miRSim. miRSim uses seed and xseed information from sncRNA sequences to generate synthetic sequences. It also provides ground-truth data in a comma-separated file that records the known miRNAs, piRNAs, and novel miRNAs, together with their sequences, chromosome locations, expression counts, and CIGAR strings. We benchmarked miRPipe against seven existing state-of-the-art pipelines using synthetic and publicly available real RNA-Seq expression datasets (lung cancer, breast cancer, and CLL). On the synthetic datasets, miRPipe outperformed the existing pipelines, achieving an accuracy of 95.23% and an F1-score of 94.17%. Furthermore, on all three cancer datasets, miRPipe extracted a larger number of known dysregulated miRNAs and piRNAs than the existing pipelines.
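As a rough consistency check on the TTFT hazard ratios reported above for hsa-mir-4524a and hsa-mir-744, the following minimal sketch recovers an approximate two-sided p-value from each hazard ratio and its 95% confidence interval. It assumes a Wald-type interval that is symmetric on the log-hazard scale, which the text does not state; the function name `wald_p_from_ci` is hypothetical and is not part of miRPipe or miRSim.

```python
import math

def wald_p_from_ci(hr, ci_low, ci_high, z_crit=1.959964):
    """Approximate the Wald z-statistic and two-sided p-value from an HR and its 95% CI.

    Assumption (not stated in the abstract): the interval is
    exp(log(HR) +/- z_crit * SE), i.e. symmetric on the log scale.
    """
    # Standard error of log(HR), recovered from the width of the interval.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(hr) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Values taken from the abstract.
reported = [
    ("hsa-mir-4524a", 1.916, 1.080, 3.4),
    ("hsa-mir-744", 0.415, 0.224, 0.769),
]

for name, hr, lo, hi in reported:
    z, p = wald_p_from_ci(hr, lo, hi)
    print(f"{name}: z = {z:.2f}, approximate p = {p:.3f}")
```

Under these assumptions, the recovered p-values are close to the reported 0.026 and 0.005, which suggests the reported intervals and p-values are mutually consistent. The exact test used in the study may differ.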
Then, we designed an AI-driven, bio-inspired deep learning architecture to identify altered signaling pathways (BDL-SP) and to determine the pivotal genomic biomarkers that can distinguish MM from its precursor stage, monoclonal gammopathy of undetermined significance (MGUS). The proposed BDL-SP model captures gene-gene interactions using the protein-protein interaction (PPI) network and analyzes genomic features using a deep learning (DL) architecture to identify significantly altered genes and signaling pathways in MM and MGUS. Exome sequencing data from 1174 MM and 61 MGUS patients were analyzed for this purpose. In quantitative benchmarking against other popular machine learning (ML) models, BDL-SP performed comparably to the best-performing predictive models, Random Forest and CatBoost. However, an extensive post-hoc explainability analysis, capturing the application-specific nuances, clearly established the significance of the BDL-SP model. This analysis revealed that, among its top significantly altered genes, BDL-SP identified the largest number of previously reported oncogenes (OGs), tumour-suppressor genes (TSGs), genes that are both oncogenes and driver genes (ODGs), and actionable genes (AGs) of high relevance in MM. Further, the post-hoc analysis revealed a significant contribution of single nucleotide variants (SNVs) and of genomic features associated with synonymous SNVs to disease stage classification. Finally, pathway enrichment analysis of the top significantly altered genes showed that many cancer pathways are selectively and significantly dysregulated in MM compared to its precursor stage, MGUS. At the same time, the few pathways that lost their significance with disease progression from MGUS to MM were related to other disease types. These observations may pave the way for appropriate therapeutic interventions to halt the progression to overt MM in the future.
Lastly, we designed a curated, comprehensive targeted sequencing panel focusing on 282 MM-relevant genes and employing clinically oriented NGS-targeted sequencing approaches. To identify these 282 MM-relevant genes, we designed an AI-based Biological Network for Directed Gene-Gene Interaction Learning (BIO-DGI) model for detecting biomarkers and gene interactions that can potentially differentiate MM from MGUS. The BIO-DGI model leverages gene interactions from nine PPI networks and analyzes the genomic features from 1154 MM and 61 MGUS samples. The proposed model outperformed baseline ML and DL models both quantitatively and qualitatively, identifying the largest number of MM-relevant genes in the post-hoc analysis. The pathway analysis underscored the importance of the top-ranked genes by highlighting MM-relevant pathways as the top significantly altered pathways. The 282-gene panel encompasses 9272 coding regions and spans 2.577 Mb. Additionally, the 282-gene panel outperformed previously published panels in detecting genomic and transformative events. Notably, the proposed gene panel also highlighted highly influential genes and their interactions within gene communities in MM. Its clinical relevance was confirmed through a two-fold univariate survival analysis. The study’s findings shed light on essential gene biomarkers and their interactions, providing valuable insights into disease progression.