DSpace Collection: Year-2025

http://repository.iiitd.edu.in/xmlui/handle/123456789/1806
Sat, 25 Jul 2026 22:10:24 GMT

Title: Visual voice activity detection using multimodal foundation models
http://repository.iiitd.edu.in/xmlui/handle/123456789/1990
Authors: Shubham; Buduru, Arun Balaji (Advisor)
Abstract: This project explores the task of Visual Voice Activity Detection (VVAD) using only facial video data without access to audio. We evaluate the effectiveness of pretrained models including VideoMAE, ViViT, TimeSformer, ResNet50, as well as multimodal models like ImageBind, LanguageBind, and Video-LLaVA. Our goal is to classify whether a person is speaking in a given video segment using only visual cues. The models are tested on the VVAD-LRS3 dataset, and the results show strong promise for multimodal models even in vision-only setups. We hypothesize that large vision-language models can be adapted for explainable VVAD using prompt-based querying.
Tue, 01 Jul 2025 00:00:00 GMT

Title: Interactive task learning framework for human-robot collaboration using generative AI
http://repository.iiitd.edu.in/xmlui/handle/123456789/1969
Authors: Garg, Himang Chandra; Jain, Aditya Raj; Shukla, Jainendra (Advisor); Kundu, Tanmoy (Advisor)
Abstract: Navigating a robot in a new environment without predefined graphs presents significant challenges in perception, planning, and adaptability. Traditional approaches rely on structured maps, which limits their flexibility in dynamic and unexplored environments. Therefore, we propose a foundational framework that enables robots to navigate and execute tasks through human interaction in natural language.
By leveraging Generative AI and multimodal learning, our system allows robots to dynamically adapt to new environments without requiring predefined graphs.
Fri, 18 Jul 2025 00:00:00 GMT

Title: SLAM technology for VR application
http://repository.iiitd.edu.in/xmlui/handle/123456789/1968
Authors: Chauhan, Abhijeet; Dwivedi, Ashutosh; Sankit; Shankhwar, Kalpana (Advisor)
Abstract: This project presents the development of a mobile telepresence system that enables real-time 3D mapping and remote environment visualization using a VR headset. At its core, the system integrates an Intel RealSense D455 RGB-D camera mounted on a rover, which streams depth and color data to a ROS 2-based onboard computer. Utilizing the RTAB-Map SLAM framework, the system incrementally reconstructs a dense 3D map of the environment as the rover explores. This spatial data is visualized using RViz2 and is intended for further integration with Unity for immersive VR rendering via HTC Vive. The current implementation successfully demonstrates the capture and visualization of live camera feeds and the foundational setup for 3D mapping. This progress lays the groundwork for future enhancements such as full VR telepresence, remote control, and multi-sensor fusion.
Fri, 18 Jul 2025 00:00:00 GMT

Title: Interaction prediction between protein and ligand using 1D and 3D features
http://repository.iiitd.edu.in/xmlui/handle/123456789/1967
Authors: Hasan, Ayaan; Dhanjal, Jaspreet Kaur (Advisor)
Abstract: Determining whether a drug molecule inhibits a target protein is a critical step in the drug discovery process.
While the pIC50 value is commonly used to quantify the inhibitory effect of a drug, experimentally determining these values is often expensive, slow, and not feasible for large-scale screening. To address this, we propose a deep learning-based approach to classify protein-ligand interactions as inhibitory or non-inhibitory, using 1D sequence data. Our method uses protein amino acid sequences and ligand representations in the form of SMILES strings as inputs. The corresponding interaction label is derived from experimentally known pIC50 values, binarized into inhibitory and non-inhibitory classes based on a defined threshold. We utilize pretrained transformer models from the Hugging Face library to encode both protein and ligand sequences into contextual embeddings, which are then combined and passed through a neural classification head. This structure-free, sequence-only approach eliminates the need for 3D structural data, making it computationally efficient and scalable for high-throughput applications. The model is trained and evaluated on datasets involving cancer-related targets, and it demonstrates promising performance across standard binary classification metrics. Our results validate the use of transformer-based sequence models for predicting drug–target interaction classes, enabling faster virtual screening pipelines in early-stage drug discovery.
Mon, 21 Jul 2025 00:00:00 GMT
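
The labelling step described in the last abstract (pIC50 values binarized into inhibitory and non-inhibitory classes) might look roughly like the sketch below. It is not the authors' code. The abstract only mentions "a defined threshold", so the cutoff of 6.0, the `>=` comparison, the function names, and the toy records are all assumptions made for illustration. The conversion itself uses the standard definition pIC50 = -log10(IC50 in molar).

```python
import math

# Assumption: the abstract does not state the threshold; 6.0 is a
# hypothetical cutoff (IC50 = 1 micromolar) used only for illustration.
PIC50_THRESHOLD = 6.0


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # IC50 [M] = IC50 [nM] * 1e-9, so -log10 of it is 9 - log10(IC50 [nM]).
    return 9.0 - math.log10(ic50_nm)


def binarize(pic50: float, threshold: float = PIC50_THRESHOLD) -> int:
    """Return 1 (inhibitory) when pIC50 >= threshold, else 0 (non-inhibitory)."""
    return 1 if pic50 >= threshold else 0


def label_records(records, threshold: float = PIC50_THRESHOLD):
    """Attach a binary label to (protein_sequence, smiles, pic50) tuples."""
    return [(seq, smi, binarize(p, threshold)) for seq, smi, p in records]


if __name__ == "__main__":
    # Toy placeholder records; the sequences, SMILES strings, and pIC50
    # values are invented and do not come from any dataset.
    toy = [("MKTAYIAK", "CCO", 7.2), ("MKTAYIAK", "CCN", 4.8)]
    for row in label_records(toy):
        print(row)
```

The labelled triples would then be the training targets for whatever sequence encoders and classification head are used; that part is left out here because the abstract does not name the specific pretrained models.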