Year-2024

http://repository.iiitd.edu.in/xmlui/handle/123456789/1503
2026-07-24T00:04:27Z

Comprehending the synergistic effects of gene ensembles in the context of disease biology, prognosis, and therapy
http://repository.iiitd.edu.in/xmlui/handle/123456789/1719
Sharma, Madhu; Kumar, Vibhor (Advisor)

It is rare for individual genes to exert influence on biological processes in isolation. Instead, these processes are controlled by intricate networks of genes that collaborate in a well-organized manner. Complex biological processes and their dysregulation in disease states are governed by the collaborative action of gene ensembles via epigenetic, genetic, and proteomic mechanisms. By analyzing these synergistic actions, researchers can understand the complex interplay of biological systems involved in the development, progression, and treatment of diseases. However, despite the abundance of readily accessible high-throughput technology, unraveling disease-related molecular pathways remains difficult. Possible contributing factors are background noise, batch effects, environmental conditions, individual heterogeneity, and technology limitations. To overcome these limitations, it is necessary to create more advanced, integrated, and personalized diagnostic methods that provide a thorough understanding of disease biology, leading to enhanced diagnosis and treatment options. In disease diagnosis and treatment, genes often work together in the form of pathways that provide valuable insights into the fundamental processes of many disorders. In our study, we examined the challenge of determining the direct relationships between pathways and diseases. We present sci-PDC, a method that leverages single-cell expression data to infer relationships between disease, cell type, and pathways.
This approach offers valuable perspectives on the causal connections between these variables and has the potential to improve current precision medicine methods. Another similar set of gene ensembles, known as cancer hallmarks, also plays a vital role in cancer identification by providing insights into the underlying features of cancer cells. To acquire a deeper understanding of the fundamental processes, we analyzed the hallmark properties of cancer in relation to canonical pathways. A further objective was to investigate a drug's mechanism of action in connection with its specificity towards different cell types. Therefore, in this study, we used our technique to investigate the connections between drug-targeted pathways and the distinctive characteristics of different single-cell cancer transcriptomes. In genomic conformations, gene ensembles often collaborate via spatial organization, thereby influencing cellular activities and phenotype. These higher-order chromatin topologies bring genes and their regulatory elements together, enabling synchronised gene expression and regulation. This research used a novel methodology based on analyzing topologically associating domain (TAD) activity to investigate the diversity of cancer and patients' responsiveness to drugs. Our results show that TAD activity may function as a biomarker for estimating survival in the midst of tumor heterogeneity and for predicting drug responsiveness. Regulation of the transcriptome and of genomic conformations is often profoundly affected by epigenetic marks through mechanisms such as DNA methylation. The functional integrity of gene ensembles may be compromised by dysregulation of such epigenetic mechanisms, which can also lead to a number of diseases.
Our work elucidates the computational difficulties associated with DNA methylation analysis, which arise from the inherent bias of the various profiling approaches. Moreover, this study assesses the efficacy of deconvolution and machine learning methodologies in the examination of cell-free DNA (cfDNA) methylation, thereby indicating their potential use in the early detection of cancer. Overall, our proposed methodologies have the potential to leverage the synergistic effects of gene ensembles via diverse genomic and epigenomic patterns in order to provide a holistic comprehension of disease biology, thereby enhancing diagnostic and therapeutic approaches.
2024-09-01T00:00:00Z

Explainable machine learning with epigenomic features for insights into regulatory and functional genomics
http://repository.iiitd.edu.in/xmlui/handle/123456789/1710
R Chandra, Omkar; Kumar, Vibhor (Advisor)

There are thousands of genes with incomplete functional annotations, particularly non-coding genes. Understanding the functional roles of genes is crucial for dissecting the complex genomic regulatory mechanisms underlying biological processes, which in turn provides control over cellular processes such as the immune response and cell cycle for potential clinical interventions. Over the years, numerous computational methods have emerged to link genes with biological processes and molecular functions. However, these methods often fail to account for non-coding genes and rarely provide interpretations of their predictions. To address this problem, a computational framework has been developed that incorporates features of non-coding genes at the promoter level using epigenome profiles, open-chromatin profiles, and transcription factor (TF) binding profiles of gene promoters.
This approach allows for reliable predictions of gene functions, which are independently validated using available CRISPR screens and PubMed abstract mining. The explainable machine learning algorithms used to predict gene function allowed for post hoc analysis using the top predictors of the learned models, yielding latent clusters of functions that collectively contribute to larger cellular processes. Additionally, downstream analysis using only transcription factors as top predictors provided insights into their synergy and pleiotropy in regulating various biological functions. The entire computational framework is built into an R package, "GFPredict," which can be used to predict genes that are biologically similar to user-defined query genes. Further analysis using TF binding and epigenome profiles as features identified novel disease-gene associations. The predicted associations of coding and non-coding genes with diseases were validated using GWAS data and PubMed abstract mining. The genomic regulation analysis using top predictors of individual disease gene-sets revealed associations of divergent cell types in diseases. These association insights were validated with evidence from the literature, providing a basis for generating putative hypotheses for developing strategies for diagnosis, prognosis, and potential therapeutics.
2024-07-01T00:00:00Z

Language models for temporal decisions in health datasets
http://repository.iiitd.edu.in/xmlui/handle/123456789/1649
Pal, Ridam; Sethi, Tavpritesh (Advisor)

Healthcare has been undergoing a data-driven transformation, further accelerated by the COVID-19 pandemic. A significant amount of healthcare data is unstructured and underutilized. The success of Large Language Models (LLMs) in achieving human-like conversations has unlocked their potential in healthcare.
For example, language models can help improve patient outcomes through temporal decision support, early warning systems, and clinical risk assessment. Through our work, we have explained how language models can assist in pandemic preparedness and support decision-making processes in critical care. Integrated frameworks incorporating machine learning, deep learning, and language models have been developed to effectively track and analyze temporal changes in unstructured healthcare data, to make informed decisions, and to enhance patient outcomes in a dynamic healthcare landscape. In this thesis, my first contribution was a deep learning based language model for modeling the spike region of COVID-19 genome sequences. This led to novel knowledge discovery and to StrainFlow, a real-world implementation for predicting pandemic progression, which successfully captured COVID-19 caseloads two months ahead of their occurrence. The integrative framework, which combines language models, statistical features, and machine learning to capture temporal changes in the semantics of the genomic sequence, was deployed as a publicly available web application. In my second contribution, I constructed language models on COVID-19 scientific literature to track and predict emerging scientific evidence. The findings of this contribution illustrated that temporal changes in unsupervised word embeddings of scientific literature effectively captured and tracked new knowledge. Additionally, my work leveraged machine learning techniques and predicted emerging themes based on evolving word associations. This was also implemented as an openly available web application called EvidenceFlow. In my third contribution, I developed language models on unstructured clinical notes data from intensive care units (ICU) for prognosticating critical outcomes. Shock Index (SI) is a prognostic indicator commonly used in ICUs and emergency settings to assess patient outcomes.
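As a minimal illustration of the indicator just mentioned: the shock index is conventionally defined as heart rate divided by systolic blood pressure. The sketch below computes it and flags readings above a cutoff. The 0.7 cutoff, the function names, and the sample readings are illustrative assumptions, not the settings or pipeline used in the thesis.

```python
# Minimal sketch: shock index (SI) = heart rate / systolic blood pressure.
# The cutoff below is an assumed, illustrative value; the thesis abstract
# does not state which threshold defines an "abnormal" shock index.
ASSUMED_ABNORMAL_SI_CUTOFF = 0.7


def shock_index(heart_rate_bpm: float, systolic_bp_mmhg: float) -> float:
    """Return heart rate (beats/min) divided by systolic blood pressure (mmHg)."""
    if systolic_bp_mmhg <= 0:
        raise ValueError("systolic blood pressure must be positive")
    return heart_rate_bpm / systolic_bp_mmhg


def is_abnormal(heart_rate_bpm: float, systolic_bp_mmhg: float,
                cutoff: float = ASSUMED_ABNORMAL_SI_CUTOFF) -> bool:
    """Flag a reading whose shock index exceeds the (assumed) cutoff."""
    return shock_index(heart_rate_bpm, systolic_bp_mmhg) > cutoff


if __name__ == "__main__":
    # Hypothetical (heart rate, systolic BP) readings, for illustration only.
    for hr, sbp in [(72, 120), (110, 95), (130, 80)]:
        si = shock_index(hr, sbp)
        print(f"HR={hr} SBP={sbp} SI={si:.2f} abnormal={is_abnormal(hr, sbp)}")
```

An early warning system would predict such a flag ahead of time from vital signs and notes; this sketch only shows how the target quantity is derived from two routinely recorded vitals.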
We developed a comprehensive multimodal early warning system (EWS) utilizing an integrated framework combining machine learning, deep learning, and language models. The framework leverages routinely available vital signs and clinical notes data to detect abnormal shock index and provide timely alerts for potential deteriorations in patient health. This model is planned to be evaluated prospectively for real-world clinical decision making, which is outside the scope of my thesis. In my final contribution, I helped develop and deploy an end-to-end language model pipeline and Android application, WashKaro, for raising WASH (water, sanitation, and hygiene) awareness during the COVID-19 pandemic. This was one of the first AI-based information dissemination applications built during COVID-19; it provided bite-sized text and audio in both Hindi and English, based on text summarization, word embedding similarities, and text-to-speech technologies using advanced NLP methods. The application and research publication also demonstrated user-feedback-based improvement of our AI model, providing pointers for designing public health intervention systems for pandemic preparedness. Overall, my thesis contributed to the development, evaluation, and deployment of language model based technologies in ICU and pandemic preparedness settings, specifically for future predictions and early warning systems using temporal data. The findings contribute to advancing knowledge and methodologies while assisting medical practitioners and policymakers in effectively responding to disease outbreaks and formulating data- and AI-augmented policy for healthcare settings.
2024-07-01T00:00:00Z

Design and development of AI-based computational tools for identifying predictive biomarkers and signaling pathways for blood cancer
http://repository.iiitd.edu.in/xmlui/handle/123456789/1504
Ruhela, Vivek; Gupta, Anubha (Advisor); Gupta, Ritu (Advisor)

Blood cancer has emerged as a growing concern over the past decade, necessitating early detection for timely and effective treatment. Traditional methods of diagnosing blood cancers involve a series of pathological tests and consultations with medical experts, a process that is not only time-consuming but also financially burdensome. The advent of genomic data analysis offers a promising avenue for understanding the pathogenesis of blood cancers, providing valuable insights into crucial biomarkers that could serve as potential therapeutic targets, ultimately impeding the progression of the disease. In this study, we have examined the genomic intricacies of two prominent blood cancer types: Chronic Lymphocytic Leukemia (CLL) and Multiple Myeloma (MM). The treatment decisions for CLL and MM rely heavily on patient symptoms and are underpinned by the genetic anomalies in the patient’s genome. Here, we have undertaken a comprehensive omics data analysis, employing novel pipelines and methodologies developed in-house. Our objective has been to uncover the genetic aberrations that underlie the development of these diseases and to identify pivotal biomarkers that hold promise as therapeutic targets for each category of haematological malignancy. Our first objective was to identify clinically relevant small non-coding RNAs (sncRNAs) in CLL through a comprehensive genome-wide study of RNA-Seq data. This analysis revealed a distinct pattern of dysregulated miRNAs in the CLL cohort.
Among these, three miRNAs were up-regulated (hsa-mir-1295a, hsa-mir-155, and hsa-mir-4524a), while five miRNAs were down-regulated (hsa-mir-30a, hsa-mir-423, hsa-mir-486*, hsa-let-7e, and hsa-mir-744). Moreover, our investigation identified seven novel miRNA sequences with elevated expression in CLL, including tRNAs, piRNAs (piRNA-30799, piRNA-36225), and snoRNAs (SNORD43). Notably, in CLL patients, increased expression of hsa-mir-4524a was significantly associated with a shorter time to first treatment (TTFT) (HR: 1.916, 95% CI: 1.080–3.4, p-value: 0.026), and higher expression of hsa-mir-744 with a longer TTFT (HR: 0.415, 95% CI: 0.224–0.769, p-value: 0.005). These findings suggest that further research may establish the potential integration of these differentially expressed miRNA (DEM) markers into risk stratification models and prognostic approaches for CLL. We proceeded by developing an integrated and reproducible workflow for RNA-Seq data analysis, known as miRPipe. This pipeline was designed to identify dysregulated sncRNAs, including miRNAs and piRNAs, and functionally similar miRNAs, often called miRNA paralogues. To evaluate and benchmark miRPipe, we introduced an in-house synthetic sequence simulator called miRSim. miRSim utilizes seed and xseed information from sncRNA sequences to generate synthetic sequences. Additionally, it provides ground-truth data in a user-friendly comma-separated file format, offering comprehensive information on known miRNAs, piRNAs, novel miRNAs, their sequences, chromosome locations, expression counts, and CIGAR strings for all sequences. We rigorously benchmarked miRPipe against seven existing state-of-the-art pipelines using synthetic and publicly available real RNA-Seq expression datasets (lung cancer, breast cancer, and CLL). On the synthetic datasets, miRPipe outperformed the existing pipelines, achieving an accuracy of 95.23% and an F1-score of 94.17%.
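Accuracy and F1-score, as reported above, are standard classification metrics. The following minimal sketch shows how both are derived from confusion-matrix counts when predictions are compared with ground truth. The counts are invented for illustration and are not miRPipe results, and the function names are hypothetical.

```python
# Minimal sketch: accuracy and F1-score from confusion-matrix counts.
# tp/tn/fp/fn = true positives, true negatives, false positives, false negatives.


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all calls that agree with the ground truth."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Invented counts for illustration only.
    tp, tn, fp, fn = 90, 5, 3, 2
    print(f"accuracy = {accuracy(tp, tn, fp, fn):.4f}")
    print(f"F1-score = {f1_score(tp, fp, fn):.4f}")
```

F1 ignores true negatives, which is why it is often reported alongside accuracy when the classes are imbalanced.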
Furthermore, our analysis of all three cancer datasets indicated that miRPipe extracted a larger number of known dysregulated miRNAs and piRNAs than existing pipelines. Then, we designed an innovative AI-driven bio-inspired deep learning architecture to identify altered signaling pathways (BDL-SP) and determine the pivotal genomic biomarkers that can distinguish MM from its precursor stage, monoclonal gammopathy of undetermined significance (MGUS). The proposed BDL-SP model captures gene-gene interactions using the protein-protein interaction (PPI) network and analyzes genomic features using a deep learning (DL) architecture to identify significantly altered genes and signaling pathways in MM and MGUS. The exome sequencing data of 1174 MM and 61 MGUS patients were analyzed for this. In the quantitative benchmarking against other popular machine learning models, BDL-SP performed comparably to the best-performing predictive machine learning (ML) models, Random Forest and CatBoost. However, an extensive post-hoc explainability analysis, capturing the application-specific nuances, clearly established the significance of the BDL-SP model. This analysis revealed that BDL-SP identified the largest number of previously reported oncogenes (OGs), tumour-suppressor genes (TSGs), genes that are both oncogenes and driver genes (ODGs), and actionable genes (AGs) of high relevance in MM as the top significantly altered genes. Further, the post-hoc analysis revealed a significant contribution of single nucleotide variants (SNVs) and genomic features associated with synonymous SNVs in disease stage classification. Finally, the pathway enrichment analysis of the top significantly altered genes showed that many cancer pathways are selectively and significantly dysregulated in MM compared to its precursor stage of MGUS. At the same time, the few pathways that lost their significance with disease progression from MGUS to MM were related to other disease types.
These observations may pave the way for appropriate therapeutic interventions to halt the progression to overt MM in the future. Lastly, we designed a curated, comprehensive, targeted sequencing panel focusing on 282 MM-relevant genes and employing clinically oriented NGS-targeted sequencing approaches. To identify these 282 MM-relevant genes, we designed an innovative AI-based Biological Network for Directed Gene-Gene Interaction Learning (BIO-DGI) model for detecting biomarkers and gene interactions that can potentially differentiate MM from MGUS. The BIO-DGI model leverages gene interactions from nine PPI networks and analyzes the genomic features from 1154 MM and 61 MGUS samples. The proposed model outperformed baseline ML and DL models, demonstrating quantitative and qualitative superiority by identifying the largest number of MM-relevant genes in the post-hoc analysis. The pathway analysis underscored the importance of top-ranked genes by highlighting the MM-relevant pathways as the top significantly altered pathways. The 282-gene panel encompasses 9272 coding regions and has a length of 2.577 Mb. Additionally, the 282-gene panel showed superior performance compared to previously published panels in detecting genomic and transformative events. Notably, the proposed gene panel also highlighted highly influential genes and their interactions within gene communities in MM. The clinical relevance is confirmed through a two-fold univariate survival analysis. The study’s findings shed light on essential gene biomarkers and their interactions, providing valuable insights into disease progression.
2024-05-01T00:00:00Z