IIIT-Delhi Institutional Repository

Learning cross-modal embedding mappings for Image-to-music generation


dc.contributor.author Deepika, N
dc.contributor.author Abrol, Vinayak (Advisor)
dc.date.accessioned 2026-04-15T07:18:32Z
dc.date.available 2026-04-15T07:18:32Z
dc.date.issued 2025-05
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1882
dc.description.abstract This thesis investigates two sequential studies toward real-time music synthesis directly from images via learned cross-modal embedding mappings and presents a unified deep-learning framework. In Study I, we explored a one-step projection from CLIP’s 512-dimensional image embeddings to MusicGen’s audio embeddings using a ViT-based network trained with a combination of latent-space alignment, mel-spectrogram, adversarial, and feature-matching losses. Although this confirmed that visual features carry musical intent, the generated outputs lacked coherent structure and emotional depth. To address these limitations, Study II, the proposed framework, constructs a supervised dataset by converting images into rich musical descriptions: BLIP generates semantic captions, Llama 3.1-8B refines them into concise musical themes, and MusicGen’s text encoder then transforms these themes into robust 1,024-dimensional embeddings. A lightweight projection network is trained to align CLIP’s visual vectors with these text-derived music embeddings using the same multi-loss objective. At inference, the network converts image embeddings directly into MusicGen-compatible vectors, eliminating any runtime text processing; these vectors then condition the MusicGen decoder to synthesize coherent, emotionally resonant compositions. By removing textual intermediaries at inference and leveraging efficient token interleaving, our approach markedly reduces latency and computational overhead, enabling practical applications in automated soundtrack creation, interactive art installations, and immersive multimedia storytelling. This work establishes a streamlined, end-to-end pathway from visual perception to auditory experience, effectively preserving semantic and emotional nuances in generated music. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Cross-Modal Embedding en_US
dc.subject Image Embedding en_US
dc.subject Text Embedding en_US
dc.subject Audio Embedding en_US
dc.subject MusicGen en_US
dc.subject mel-spectrogram en_US
dc.title Learning cross-modal embedding mappings for Image-to-music generation en_US
dc.type Thesis en_US
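
The projection step described in the abstract (mapping a 512-dimensional CLIP image embedding to a 1,024-dimensional MusicGen-compatible vector and scoring it against a text-derived target) can be illustrated with a minimal sketch. This is not the thesis implementation: the single linear layer, the cosine-based alignment loss, the random stand-in embeddings, and every function name below are assumptions made for illustration only. The abstract states only that a lightweight projection network is trained with a multi-loss objective.

```python
import math
import random

CLIP_DIM = 512    # CLIP image-embedding size stated in the abstract
MUSIC_DIM = 1024  # MusicGen text-embedding size stated in the abstract


def make_projection(in_dim, out_dim, seed=0):
    """Create random weights for one linear layer (hypothetical stand-in network)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(image_embedding, weights, bias):
    """Map an image embedding to the music-embedding space: y = W x + b."""
    return [
        sum(w * x for w, x in zip(row, image_embedding)) + b
        for row, b in zip(weights, bias)
    ]


def cosine_alignment_loss(predicted, target):
    """Return 1 - cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)


# Random vectors stand in for real CLIP and MusicGen embeddings (assumption).
rng = random.Random(42)
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)]
target_embedding = [rng.gauss(0.0, 1.0) for _ in range(MUSIC_DIM)]

weights, bias = make_projection(CLIP_DIM, MUSIC_DIM)
projected = project(image_embedding, weights, bias)
loss = cosine_alignment_loss(projected, target_embedding)
```

In a trained system the weights would be optimized so that this loss, together with the other terms of the multi-loss objective named in the abstract, decreases across the supervised image and music-description pairs; the sketch only shows the shapes involved and one plausible alignment term.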

