Abstract:
Understanding and predicting complex biological processes, from molecular interactions to cellular aging, depends on extracting useful information from diverse types of biological data. Biological features are the measurable characteristics derived from these data, and they take many forms: molecular structures represented as graphs, gene expression levels in single cells, and the shape, texture, and motion of cells observed in images. Using these features effectively is essential for building accurate models and gaining biological insight. However, converting such heterogeneous data into machine-readable features and selecting the most relevant ones is challenging and often requires techniques tailored to the data type and the biological question. This thesis presents three computational tools that address these challenges through robust feature engineering across multiple biological scales and data types. First, deepGraphh uses graph-based structural features to predict molecular activity. Rather than relying on conventional, pre-computed descriptors, it learns representations directly from molecular graphs using a suite of Graph Neural Networks (GNNs), such as Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), Directed Acyclic Graph (DAG) networks, and Attentive FP. This open-source web service streamlines model development, parameter tuning, and validation, making Quantitative Structure-Activity Relationship (QSAR) modeling more accessible while achieving performance comparable to traditional descriptor-based approaches. deepGraphh was used to predict the permeability of human and microbiome-generated metabolites across the blood-brain barrier.
Second, EcTracker analyzes single-cell RNA sequencing data to identify ectopically expressed genes and characterize cell types. This R/Shiny-based web server compares gene expression against physiological norms, enabling the identification of cell identities and, through regulon analysis, their regulatory networks. In a reanalysis of a CRISPRi dataset, EcTracker clarified previously ambiguous identities in SMAD2 knockout cells, demonstrating its ability to uncover critical regulatory insights. Finally, scCamAge leverages single-cell microscopy images and associated bioactivity measurements to predict cellular states, particularly aging. This multimodal deep learning engine, packaged in Docker and trained initially on approximately one million yeast cells, predicts cellular age and functional bioactivities by capturing complex spatiotemporal and morphological features. It was validated using genetic and chemical perturbations. Although trained on yeast data, it also predicted senescence in human fibroblasts, which points to conserved morphometric features of aging and to its promise for high-throughput screening. Collectively, deepGraphh, EcTracker, and scCamAge provide a comprehensive suite of tools for robust feature engineering and predictive modeling across diverse biological data, facilitating a deeper understanding of complex biological processes.