Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/1990
Full metadata record
DC Field | Value | Language
dc.contributor.author | Shubham | -
dc.contributor.author | Buduru, Arun Balaji (Advisor) | -
dc.date.accessioned | 2026-06-17T11:49:56Z | -
dc.date.available | 2026-06-17T11:49:56Z | -
dc.date.issued | 2025-07 | -
dc.identifier.uri | http://repository.iiitd.edu.in/xmlui/handle/123456789/1990 | -
dc.description.abstract | This project explores the task of Visual Voice Activity Detection (VVAD) using only facial video data without access to audio. We evaluate the effectiveness of pretrained models including VideoMAE, ViViT, TimeSformer, ResNet50, as well as multimodal models like ImageBind, LanguageBind, and Video-LLaVA. Our goal is to classify whether a person is speaking in a given video segment using only visual cues. The models are tested on the VVAD-LRS3 dataset, and the results show strong promise for multimodal models even in vision-only setups. We hypothesize that large vision-language models can be adapted for explainable VVAD using prompt-based querying. | en_US
dc.language.iso | en_US | en_US
dc.publisher | IIIT-Delhi | en_US
dc.subject | Visual Voice Activity Detection | en_US
dc.subject | Multimodal Learning | en_US
dc.subject | ImageBind | en_US
dc.subject | Vision Transformers | en_US
dc.title | Visual voice activity detection using multimodal foundation models | en_US
dc.type | Other | en_US
Appears in Collections: Year-2025

Files in This Item:
File | Description | Size | Format
btp_report (2) - Shubham IIITD.pdf (Restricted Access) | - | 101.93 kB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.