Abstract:
The accurate classification of Central Nervous System (CNS) tumors into their respective subtypes and grades is vital for prognosis, therapeutic decision-making, and patient management. Traditional diagnostic methods, which rely primarily on radiological imaging and histopathology, are time-intensive and prone to inter-observer variability. In this work, we propose a multimodal deep learning framework for the automated detection and characterization of CNS tumors using the AIIMS brain tumor dataset. Our approach leverages a modified CLIP (Contrastive Language–Image Pretraining) architecture tailored for medical imaging, combining a Vision Transformer (ViT) as the image encoder with BioBERT as the text encoder. This pairing enables robust cross-modal learning between medical images and corresponding textual metadata, such as clinical notes, radiology findings, and histopathological labels. The model is trained with a contrastive objective that aligns image and text embeddings in a shared latent space, supporting both image-to-text and text-to-image retrieval.
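To make the dual-encoder setup concrete, the following is a minimal sketch of a CLIP-style model with a ViT image encoder, a BioBERT text encoder, and a symmetric contrastive loss. The checkpoint names (google/vit-base-patch16-224-in21k, dmis-lab/biobert-v1.1), the 512-dimensional shared space, CLS-token pooling, and the learnable temperature are illustrative assumptions; the paper's specific architectural modifications are not reproduced here.

```python
# Hypothetical sketch of the dual-encoder contrastive setup described above.
# Checkpoints, projection width, pooling, and temperature are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTModel, AutoModel


class CLIPStyleModel(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Image encoder: a standard ViT backbone (assumed checkpoint).
        self.image_encoder = ViTModel.from_pretrained(
            "google/vit-base-patch16-224-in21k")
        # Text encoder: BioBERT for biomedical text (assumed checkpoint).
        self.text_encoder = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")
        # Linear projections into the shared latent space.
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        # Learnable temperature, initialized to ln(1/0.07) as in original CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, pixel_values, input_ids, attention_mask):
        # Use the [CLS] token from each encoder as a pooled representation.
        img = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        # L2-normalize so similarity reduces to cosine similarity.
        img = F.normalize(self.image_proj(img), dim=-1)
        txt = F.normalize(self.text_proj(txt), dim=-1)
        return img, txt


def contrastive_loss(img_emb, txt_emb, logit_scale):
    # Symmetric InfoNCE: matched image/text pairs sit on the diagonal;
    # all off-diagonal pairings act as in-batch negatives.
    logits = logit_scale.exp() * img_emb @ txt_emb.t()
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because the loss is symmetric over the image-to-text and text-to-image directions of the logit matrix, both retrieval tasks fall out of the same training objective, which is what the abstract refers to when it says the aligned latent space supports retrieval in either direction.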