Visual voice activity detection using multimodal foundation models

Shubham; Buduru, Arun Balaji (Advisor)

URI: http://repository.iiitd.edu.in/xmlui/handle/123456789/1990

Date: 2025-07

Abstract:

This project explores Visual Voice Activity Detection (VVAD) from facial video alone, with no access to audio. We evaluate pretrained vision models (VideoMAE, ViViT, TimeSformer, and ResNet50) alongside multimodal models (ImageBind, LanguageBind, and Video-LLaVA). The task is to classify whether a person is speaking in a given video segment using only visual cues. The models are tested on the VVAD-LRS3 dataset, and the results show strong promise for multimodal models even in vision-only settings. We hypothesize that large vision-language models can be adapted for explainable VVAD through prompt-based querying.
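
To make the task framing concrete, the following is a minimal sketch of one common way to set up a speaking/not-speaking classifier on top of a frozen pretrained encoder: pool per-frame features into one clip-level vector, then fit a small logistic-regression probe. It is not taken from the thesis and may differ from the authors' actual pipeline. The synthetic features stand in for real encoder output, and every name here (`mean_pool`, `train_probe`, `predict`, `make_synthetic_clip`) is hypothetical.

```python
import math
import random


def mean_pool(frame_features):
    """Average per-frame feature vectors into a single clip-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / n for d in range(dim)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(clips, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(clips[0])
    n = len(clips)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(clips, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (speaking) or 0 (not speaking) for one clip-level vector."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def make_synthetic_clip(rng, speaking, frames=8, dim=4):
    """ASSUMPTION: toy stand-in for frozen-encoder output, not real data.

    Speaking clips are centred at +1 and non-speaking clips at -1,
    with Gaussian noise added to every frame.
    """
    shift = 1.0 if speaking else -1.0
    return [[shift + rng.gauss(0.0, 0.5) for _ in range(dim)]
            for _ in range(frames)]


rng = random.Random(0)
labels = [i % 2 for i in range(40)]  # alternating speaking / not speaking
clips = [mean_pool(make_synthetic_clip(rng, y == 1)) for y in labels]

w, b = train_probe(clips, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(clips, labels)) / len(labels)
print(f"training accuracy on synthetic clips: {accuracy:.2f}")
```

In a real setup, `make_synthetic_clip` would be replaced by features extracted from one of the pretrained encoders named above, and accuracy would be measured on a held-out split rather than on the training clips.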
