Learning cross-modal embedding mappings for Image-to-music generation

Deepika, N; Abrol, Vinayak (Advisor)

dc.contributor.author	Deepika, N
dc.contributor.author	Abrol, Vinayak (Advisor)
dc.date.accessioned	2026-04-15T07:18:32Z
dc.date.available	2026-04-15T07:18:32Z
dc.date.issued	2025-05
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/1882
dc.description.abstract	This thesis reports two sequential studies toward real-time music synthesis directly from images via learned cross-modal embedding mappings and presents a unified deep-learning framework. In Study I, we explored a one-step projection from CLIP’s 512-dimensional image embeddings to MusicGen’s audio embeddings using a ViT-based network trained with a combination of latent-space alignment, mel-spectrogram, adversarial, and feature-matching losses. Although this confirmed that visual features carry musical intent, the generated outputs lacked coherent structure and emotional depth. To address these limitations, Study II (the proposed framework) constructs a supervised dataset by converting images into rich musical descriptions: BLIP generates semantic captions, Llama 3.1-8B refines them into concise musical themes, and MusicGen’s text encoder transforms those themes into robust 1,024-dimensional embeddings. A lightweight projection network is trained to align CLIP’s visual vectors with these text-derived music embeddings using the same multi-loss objective. At inference, the network directly converts image embeddings into MusicGen-compatible vectors, eliminating any runtime text processing, and these vectors condition the MusicGen decoder to synthesize coherent, emotionally resonant compositions. By removing textual intermediaries at inference and leveraging efficient token interleaving, our approach markedly reduces latency and computational overhead, enabling practical applications in automated soundtrack creation, interactive art installations, and immersive multimedia storytelling. This work establishes a streamlined, end-to-end pathway from visual perception to auditory experience, effectively preserving semantic and emotional nuances in generated music.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	Cross-Modal Embedding	en_US
dc.subject	Image Embedding	en_US
dc.subject	Text Embedding	en_US
dc.subject	Audio Embedding	en_US
dc.subject	MusicGen	en_US
dc.subject	mel-spectrogram	en_US
dc.title	Learning cross-modal embedding mappings for Image-to-music generation	en_US
dc.type	Thesis	en_US
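
Illustrative note: the abstract describes a lightweight projection network that maps 512-dimensional CLIP image embeddings to 1,024-dimensional MusicGen-compatible embeddings. The sketch below is only a rough illustration of that idea, not the thesis implementation. The two-layer MLP, the hidden width of 768, the ReLU activation, and the cosine form of the latent-space alignment term are all assumptions; the abstract specifies only the two embedding sizes. Random vectors stand in for real CLIP outputs and text-derived targets, and the mel-spectrogram, adversarial, and feature-matching losses are omitted.

```python
import numpy as np

CLIP_DIM = 512      # image embedding size stated in the abstract
MUSIC_DIM = 1024    # target embedding size stated in the abstract
HIDDEN_DIM = 768    # assumption: hidden width is not given in the abstract


def init_params(rng):
    """Randomly initialise a two-layer MLP (hypothetical architecture)."""
    return {
        "w1": rng.normal(0.0, 0.02, size=(CLIP_DIM, HIDDEN_DIM)),
        "b1": np.zeros(HIDDEN_DIM),
        "w2": rng.normal(0.0, 0.02, size=(HIDDEN_DIM, MUSIC_DIM)),
        "b2": np.zeros(MUSIC_DIM),
    }


def project(params, image_emb):
    """Map image embeddings of shape (N, 512) to vectors of shape (N, 1024)."""
    hidden = np.maximum(image_emb @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]


def cosine_alignment_loss(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) over rows; an assumed alignment term."""
    pred_n = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=1)))


rng = np.random.default_rng(0)
params = init_params(rng)
image_emb = rng.normal(size=(4, CLIP_DIM))    # stand-in for CLIP image embeddings
target_emb = rng.normal(size=(4, MUSIC_DIM))  # stand-in for text-derived targets
pred = project(params, image_emb)
loss = cosine_alignment_loss(pred, target_emb)
```

In a training loop, a loss of this kind would be minimised over image and target pairs; at inference only `project` would run, which matches the abstract's point that no text processing is needed at that stage.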