Abstract:
This thesis presents two sequential studies toward real-time music synthesis directly from images via learned cross-modal embedding mappings, culminating in a unified deep-learning framework. In Study I, we explored a one-step projection from CLIP’s 512-dimensional image embeddings to MusicGen’s audio embeddings using a ViT-based network trained with a combination of latent-space alignment, mel-spectrogram, adversarial, and feature-matching losses. Although this confirmed that visual features carry musical intent, the generated outputs lacked coherent structure and emotional depth. To address these limitations, Study II, the proposed framework, constructs a supervised dataset by converting images into rich musical descriptions: BLIP generates semantic captions, Llama 3.1-8B refines them into concise musical themes, and MusicGen’s text encoder transforms these themes into robust 1,024-dimensional embeddings. A lightweight projection network is then trained to align CLIP’s visual vectors with these text-derived music embeddings using the same multi-loss objective. At inference, the network directly converts image embeddings into MusicGen-compatible vectors, eliminating runtime text processing, and conditions the MusicGen decoder to synthesize coherent, emotionally resonant compositions. By removing textual intermediaries at inference and leveraging efficient token interleaving, our approach markedly reduces latency and computational overhead, enabling practical applications in automated soundtrack creation, interactive art installations, and immersive multimedia storytelling. This work establishes a streamlined, end-to-end pathway from visual perception to auditory experience while preserving the semantic and emotional nuances of the source image in the generated music.
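To make the core mapping concrete, the sketch below shows one plausible form of the lightweight projection network and its latent-space alignment term in PyTorch. Only the 512-dimensional CLIP input and the 1,024-dimensional MusicGen target dimensions come from the text above; the layer layout, hidden width, loss weighting, and placeholder tensors are illustrative assumptions rather than the thesis implementation.

```python
# Minimal sketch (assumptions noted above): a lightweight projection network
# mapping 512-d CLIP image embeddings to 1,024-d vectors shaped like MusicGen
# text-encoder embeddings, trained here with only the latent-space alignment
# term of the multi-loss objective. Random tensors stand in for precomputed
# CLIP image features and MusicGen text-encoder targets.
import torch
import torch.nn as nn

class ImageToMusicProjector(nn.Module):
    """Projects CLIP image embeddings into MusicGen's text-embedding space."""
    def __init__(self, clip_dim: int = 512, music_dim: int = 1024, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, music_dim),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        return self.net(clip_emb)

# Placeholder batch: CLIP image embeddings and target MusicGen text-encoder
# embeddings derived from the BLIP + Llama musical descriptions.
clip_emb = torch.randn(8, 512)      # would come from CLIP's image encoder
target_emb = torch.randn(8, 1024)   # would come from MusicGen's text encoder

model = ImageToMusicProjector()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Latent-space alignment: MSE plus a cosine term; the full objective described
# in the abstract also adds mel-spectrogram, adversarial, and feature-matching losses.
pred = model(clip_emb)
mse = nn.functional.mse_loss(pred, target_emb)
cos = 1.0 - nn.functional.cosine_similarity(pred, target_emb, dim=-1).mean()
loss = mse + 0.5 * cos

loss.backward()
optimizer.step()
print(f"alignment loss: {loss.item():.4f}")
```

At inference, such a projector would replace the caption-and-prompt stage entirely: the image embedding is projected once and passed to the MusicGen decoder as its conditioning vector, which is the source of the latency reduction claimed above.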