Abstract:
Thousands of genes, particularly non-coding genes, have incomplete functional annotations. Understanding the functional roles of genes is crucial for dissecting the complex genomic regulatory mechanisms underlying biological processes, which in turn could enable control over cellular processes such as the immune response and the cell cycle for potential clinical interventions. Over the years, numerous computational methods have emerged to link genes with biological processes and molecular functions. However, these methods often fail to account for non-coding genes and rarely provide interpretations of their predictions. To address this problem, a computational framework has been developed that incorporates features of non-coding genes at the promoter level, using epigenome profiles, open-chromatin profiles, and transcription factor (TF) binding profiles of gene promoters. This approach allows for reliable predictions of gene functions, which were independently validated using available CRISPR screens and PubMed abstract mining. Because the machine learning algorithms used to predict gene function are explainable, the top predictors of the learned models could be analyzed post hoc, yielding latent clusters of functions that collectively contribute to larger cellular processes. Additionally, downstream analysis using only the TFs among the top predictors provided insights into their synergy and pleiotropy in regulating various biological functions. The entire computational framework is built into an R package, "GFPredict," which can be used to predict genes that are biologically similar to user-defined query genes.
Further analysis using TF binding and epigenome profiles as features identified novel disease-gene associations. The predicted associations of coding and non-coding genes with diseases were validated using GWAS data and PubMed abstract mining. Genomic regulation analysis using the top predictors of individual disease gene sets revealed associations of divergent cell types with diseases. These associations were validated with evidence from the literature, providing a basis for generating hypotheses to guide strategies for diagnosis, prognosis, and potential therapeutics.