Abstract:
First person videos captured from wearable cameras are growing in popularity. Standard algorithms developed for third person videos often do not work for such egocentric videos because of the drastic change in camera perspective as well as the unavailability of common cues such as the actor's pose. In the last few years, researchers have developed various deep neural network models for a variety of first person tasks such as action detection, object classification, hand detection and pose classification. These models are often constrained by the limited amount of annotated training data as well as the inherently wide variations in egocentric tasks and contexts. In this paper we propose a multi-task learning framework which allows the model to learn various egocentric cues automatically by explicitly training for multiple egocentric tasks together. The joint training allows the cues from multiple tasks to fuse with each other and exploits the training samples available for each of the tasks. We show that our approach simultaneously improves the state-of-the-art accuracy on all the trained tasks. We also show that the proposed model extends easily to new tasks with scarce data.