IIIT-Delhi Institutional Repository

Analysing long egocentric videos


dc.contributor.author Nagar, Pravin
dc.contributor.author Arora, Chetan (Advisor)
dc.date.accessioned 2023-03-21T11:56:08Z
dc.date.available 2023-03-21T11:56:08Z
dc.date.issued 2023-02
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1054
dc.description.abstract Egocentric videos are recorded in a hands-free, always-on, and privacy-sensitive setting, and are often collected over days to weeks. For efficient consumption, such videos require robust video analysis techniques that can handle extremely long sequences in an unsupervised setting. This dissertation explores a novel research area by developing video analysis tasks for extremely long, sequential data (ranging from a day to weeks) in a self-supervised/unsupervised setting. We address three key video analysis problems, namely temporal segmentation, summarization, and activity pattern recovery, specifically designed to deal with the issues of scalability, privacy, and unlabeled data.

There is a plethora of work in the literature on third-person video analysis. However, third-person videos are typically recorded with point-and-shoot cameras, yielding short video samples (up to a few minutes). In this dissertation, we work on the Disney (up to 8-hour video sequences), UT Egocentric (UTE) (up to 5 hours), and EgoRoutine (up to 20 days of photo-stream lifelogs) datasets, which are recorded in real-life settings. Third-person video analysis techniques therefore do not typically scale to long sequences. For example, even the simplest task, temporal segmentation, becomes challenging for extremely long sequences because event lengths range from a few seconds to several hours. Similarly, for video summarization, we usually consider the whole video sequence when selecting the appropriate frames/sub-shots to generate a compact yet comprehensive summary. In activity pattern recovery, we need to model the underlying distribution of activity patterns over the whole data (weeks-long lifelogs), and the task becomes cumbersome when the distributions are highly skewed. In all these instances, the complexity of the task increases multifold and requires a different level of comprehension for modeling extremely long video sequences. We further demonstrate that state-of-the-art (SOTA) approaches based on Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Graph Convolutional Networks (GCNs), or Transformer networks fail to handle massively long sequences. This dissertation proposes scalable solutions to analyze extremely long egocentric videos, typically ranging from a day to weeks.

The long and unconstrained nature of egocentric videos makes temporal segmentation an important pre-processing step for many higher-level video analysis tasks. In the first work, we present a novel unsupervised temporal segmentation technique especially suited to extremely long egocentric videos. We formulate the problem as detecting concept drift in a time-varying, non-i.i.d. sequence of frames. Statistically bounded thresholds are calculated to detect concept drift between two temporally adjacent multivariate data segments with different underlying distributions, while establishing guarantees on false positives.

Egocentric videos are extremely long and highly redundant, making them difficult to watch from beginning to end; hence, they require summarization tools for efficient consumption. The second work presents a novel unsupervised deep reinforcement learning framework to generate video summaries from day-long egocentric videos. We also incorporate user choices through interactive feedback for including or excluding particular types of content in the generated summaries.
Lifelogging applications for egocentric videos require analyzing a huge volume of data, often captured over weeks to months for a particular subject and containing long-term dependencies. High-level video analysis tasks over lifelogs include recognizing activities of daily living (ADL), routine discovery, event detection, anomaly detection, etc. We observe that even Transformer-based SOTA architectures fail for extremely long video sequences. Our analysis reveals that the missing key ingredient is the architecture's inability to exploit the strong spatio-temporal visual cues inherent in video data. To capture such cues within a transformer architecture, we propose a novel architecture named Semantic Attention TransFormer (SATFormer), which factorizes the self-attention matrix into a semantically meaningful subspace. We use SATFormer within a novel self-supervised training pipeline developed specifically for the task of recovering activity patterns in extremely long (weeks-long) egocentric lifelogs. In the proposed pipeline, we alternately learn feature embeddings from SATFormer using the pseudo-label assigned to each frame, and learn the pseudo-labels by clustering the feature embeddings from SATFormer.

Overall, this dissertation addresses the broader issues of scalability, privacy, and unlabeled data, and establishes SOTA performance for the respective tasks. The proposed works pioneer the handling of massively long (up to 60k time-step) video sequences in an unsupervised setting. en_US
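The abstract above (and the dc.subject entries for Hoeffding's Bound and Drift Detection) describe temporal segmentation as concept drift detection with statistically bounded thresholds. The following is a minimal NumPy sketch of that general idea, not the dissertation's exact formulation: the window size, confidence level, and the assumption that per-frame features are scaled to [0, 1] are illustrative choices.

    import numpy as np

    def hoeffding_threshold(n_ref, n_cur, delta, value_range=1.0):
        # Bound on |mean(ref) - mean(cur)| that holds with probability 1 - delta
        # under Hoeffding's inequality for values in [0, value_range].
        return value_range * np.sqrt(0.5 * (1.0 / n_ref + 1.0 / n_cur) * np.log(2.0 / delta))

    def segment_boundaries(features, window=120, delta=0.05):
        # features: (T, d) array of per-frame descriptors scaled to [0, 1].
        # Compare the growing reference segment with the next window and start a
        # new segment whenever the mean shift exceeds the Hoeffding threshold.
        # (A union bound over the d dimensions would be needed for a joint guarantee.)
        boundaries, start = [], 0
        for t in range(window, len(features) - window + 1, window):
            ref, cur = features[start:t], features[t:t + window]
            eps = hoeffding_threshold(len(ref), len(cur), delta)
            shift = np.abs(ref.mean(axis=0) - cur.mean(axis=0)).max()
            if shift > eps:          # concept drift detected
                boundaries.append(t)
                start = t            # restart the reference segment
        return boundaries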
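The second contribution described above is an unsupervised deep reinforcement learning summarizer. The sketch below illustrates one common way such a framework can be set up in PyTorch, assuming a GRU-based frame scorer and a hand-rolled diversity-plus-representativeness reward; the network, reward terms, and hyperparameters are placeholders rather than the dissertation's architecture, and the interactive user-feedback component is not shown.

    import torch
    import torch.nn as nn

    class FrameScorer(nn.Module):
        # Bidirectional GRU that outputs a keep-probability for every frame.
        def __init__(self, feat_dim=1024, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

        def forward(self, feats):                 # feats: (1, T, feat_dim)
            h, _ = self.rnn(feats)
            return self.head(h).squeeze(-1)       # (1, T) keep-probabilities

    def summary_reward(feats, picks):
        # Unsupervised reward: diversity among selected frames plus how well the
        # selection represents the whole video (distance to nearest selected frame).
        sel = feats[0, picks]                                          # (k, d)
        sim = nn.functional.cosine_similarity(sel.unsqueeze(1), sel.unsqueeze(0), dim=-1)
        diversity = 1.0 - sim.mean()
        representativeness = torch.exp(-torch.cdist(feats[0], sel).min(dim=1).values.mean())
        return diversity + representativeness

    # One REINFORCE step: sample a frame selection, score it, reinforce the log-probs.
    feats = torch.rand(1, 500, 1024)              # stand-in for per-frame CNN features
    model = FrameScorer()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    probs = model(feats)
    dist = torch.distributions.Bernoulli(probs)
    actions = dist.sample()                       # 1 = keep the frame in the summary
    picks = actions[0].nonzero(as_tuple=True)[0]
    loss = -(dist.log_prob(actions).sum() * summary_reward(feats, picks).detach())
    opt.zero_grad(); loss.backward(); opt.step()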
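The SATFormer pipeline described above alternates between clustering embeddings into pseudo-labels and training the encoder on those labels. The sketch below shows that alternating loop in PyTorch with scikit-learn's KMeans, using a generic encoder with an assumed out_dim attribute as a stand-in for SATFormer; the factorized semantic attention itself is not reproduced here.

    import torch
    import torch.nn as nn
    from sklearn.cluster import KMeans

    def alternating_pattern_discovery(encoder, frames, n_patterns=10, rounds=5, epochs=3):
        # encoder: any nn.Module mapping (N, feat_dim) frame features to
        # (N, encoder.out_dim) embeddings -- a stand-in for SATFormer here.
        classifier = nn.Linear(encoder.out_dim, n_patterns)
        params = list(encoder.parameters()) + list(classifier.parameters())
        opt = torch.optim.Adam(params, lr=1e-4)
        ce = nn.CrossEntropyLoss()
        for _ in range(rounds):
            # Step (a): cluster the current embeddings to obtain pseudo-labels.
            with torch.no_grad():
                emb = encoder(frames).cpu().numpy()
            pseudo = torch.as_tensor(KMeans(n_clusters=n_patterns, n_init=10).fit_predict(emb)).long()
            # Step (b): train encoder + classifier to predict the pseudo-labels.
            for _ in range(epochs):
                loss = ce(classifier(encoder(frames)), pseudo)
                opt.zero_grad(); loss.backward(); opt.step()
        return encoder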
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Weeks Long Lifelogs en_US
dc.subject Hoeffding’s Bound en_US
dc.subject Drift Detection en_US
dc.subject Short Hand-held Videos en_US
dc.subject Photo-stream Data en_US
dc.subject Recorded Video en_US
dc.title Analysing long egocentric videos en_US
dc.type Thesis en_US

