Abstract:
Video summarization is an important step in analysing video feeds in scenarios such as surveillance and event monitoring. It is a challenging task, however, because the inherently dynamic nature of video makes understanding frames and extracting keyframes for a summary difficult. In this paper, we propose an end-to-end unsupervised video summarization algorithm. Our algorithm combines a Convolutional Neural Network (CNN), a discriminative visual attention network, a Semi-Dense Long Short-Term Memory (SD-LSTM) network and an adversarial SD-LSTM to extract keyframes. The CNN followed by the visual attention network weighs only the relevant regions of each frame, while the SD-LSTMs score each frame by its relevance to the summary. Inspired by densely connected convolutional networks, we propose the SD-LSTM, in which each cell takes the present input as well as the immediate past input. We evaluate the proposed network through experiments on publicly available datasets. Our experimental results show that the proposed method is highly effective at generating a sound summary of a given video. This paper also exploits the audio stream of the video for highlight generation in tennis videos, using crowd-cheer detection to find the key shots to be included in the final summary.
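As a rough illustration of the semi-dense connection described above, the toy sketch below feeds each LSTM cell the concatenation of the present and immediate-past frame features and scores each frame for summary relevance. All dimensions, weight initialisation, and the scoring head are hypothetical placeholders, not the paper's actual formulation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SemiDenseLSTMCell:
    """Toy LSTM cell whose input is [x_t ; x_{t-1}]: a 'semi-dense'
    connection giving each step the present and immediate past frame
    feature (illustrative sketch, not the authors' implementation)."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        cat = 2 * in_dim + hidden_dim  # [x_t ; x_{t-1} ; h_{t-1}]
        # one weight row per hidden unit, for each of the 4 gates
        self.W = [[[rng.uniform(-0.1, 0.1) for _ in range(cat)]
                   for _ in range(hidden_dim)] for _ in range(4)]
        self.in_dim, self.hidden_dim = in_dim, hidden_dim

    def step(self, x_t, x_prev, h, c):
        z = x_t + x_prev + h  # list concatenation of the three inputs
        acts = [[sum(w * v for w, v in zip(row, z)) for row in self.W[g]]
                for g in range(4)]
        i = [sigmoid(a) for a in acts[0]]   # input gate
        f = [sigmoid(a) for a in acts[1]]   # forget gate
        o = [sigmoid(a) for a in acts[2]]   # output gate
        g = [math.tanh(a) for a in acts[3]] # candidate state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

def frame_scores(frames, cell):
    """Score each frame in (0, 1); higher = more summary-worthy."""
    h = [0.0] * cell.hidden_dim
    c = [0.0] * cell.hidden_dim
    prev = [0.0] * cell.in_dim  # zero 'past' input for the first frame
    scores = []
    for x in frames:
        h, c = cell.step(x, prev, h, c)
        scores.append(sigmoid(sum(h)))  # toy per-frame scoring head
        prev = x
    return scores

cell = SemiDenseLSTMCell(in_dim=4, hidden_dim=3)
frames = [[0.5, -0.2, 0.1, 0.3],
          [0.9, 0.4, -0.1, 0.0],
          [0.2, 0.2, 0.2, 0.2]]
scores = frame_scores(frames, cell)
```

In the full model, the frame features would come from the CNN and attention network rather than raw vectors, and the scores would be trained adversarially; here the loop merely demonstrates how the immediate past input is reused at every step.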