Visual voice activity detection using multimodal foundation models

Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/1990

Full metadata record

DC Field	Value	Language
dc.contributor.author	Shubham	-
dc.contributor.author	Buduru, Arun Balaji (Advisor)	-
dc.date.accessioned	2026-06-17T11:49:56Z	-
dc.date.available	2026-06-17T11:49:56Z	-
dc.date.issued	2025-07	-
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/1990	-
dc.description.abstract	This project explores Visual Voice Activity Detection (VVAD) from facial video alone, without access to audio. We evaluate pretrained vision models (VideoMAE, ViViT, TimeSformer, and ResNet50) alongside multimodal models (ImageBind, LanguageBind, and Video-LLaVA). The goal is to classify whether a person is speaking in a given video segment using only visual cues. The models are tested on the VVAD-LRS3 dataset, and the results show strong promise for multimodal models even in vision-only setups. We hypothesize that large vision-language models can be adapted for explainable VVAD using prompt-based querying.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	Visual Voice Activity Detection	en_US
dc.subject	Multimodal Learning	en_US
dc.subject	ImageBind	en_US
dc.subject	Vision Transformers	en_US
dc.title	Visual voice activity detection using multimodal foundation models	en_US
dc.type	Other	en_US
Appears in Collections:	Year-2025

Files in This Item:

File	Description	Size	Format
btp_report (2) - Shubham IIITD.pdf Restricted Access		101.93 kB	Adobe PDF
