DSpace Repository

Visual voice activity detection using multimodal foundation models

dc.contributor.author Shubham
dc.contributor.author Buduru, Arun Balaji (Advisor)
dc.date.accessioned 2026-06-17T11:49:56Z
dc.date.available 2026-06-17T11:49:56Z
dc.date.issued 2025-07
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1990
dc.description.abstract This project explores Visual Voice Activity Detection (VVAD) using only facial video, without access to audio. We evaluate pretrained vision models (VideoMAE, ViViT, TimeSformer, and ResNet50) as well as multimodal models such as ImageBind, LanguageBind, and Video-LLaVA. The goal is to classify whether a person is speaking in a given video segment from visual cues alone. The models are tested on the VVAD-LRS3 dataset, and the results show strong promise for multimodal models even in vision-only setups. We hypothesize that large vision-language models can be adapted for explainable VVAD using prompt-based querying. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Visual Voice Activity Detection en_US
dc.subject Multimodal Learning en_US
dc.subject ImageBind en_US
dc.subject Vision Transformers en_US
dc.title Visual voice activity detection using multimodal foundation models en_US
dc.type Other en_US