Abstract:
Understanding a video from concise summaries is of great importance for various applications such as browsing, retrieval and assistive technologies. In this work, we present unsupervised summarization of videos. Video summarization is extremely challenging as it is difficult to find concise and semantic frame representations. In order to address this problem, our contributions are twofold. First, we study different convolutional and transformer based architectures which can obtain efficient spatio-temporal representations. Second, we propose an optimal transport method to obtain representative clusters of a video. Experimental results on benchmark datasets such as TVSum and SumMe demonstrate that our approach achieves competitive performance.