Abstract:
The proliferation of surveillance cameras has led to a significant focus on the large-scale deployment of intelligent surveillance systems. Person re-identification (Re-ID), one of the quintessential surveillance problems, is the task of matching people across a network of non-overlapping cameras. It is non-trivial because of several visual recognition challenges, such as pose change, occlusion, illumination variation, and low resolution, with additional challenges arising from the temporal dimension of videos in homogeneous Re-ID. In heterogeneous Re-ID, the large modality gap is a further challenge. Re-ID enables many intelligent applications, including long-term multi-camera tracking, crime prevention, forensic search, threat detection, instance search, activity analysis, and photo tagging. In this dissertation, we propose both homogeneous and heterogeneous Re-ID models. In the homogeneous case, we contribute video-to-video and image-to-image matching models; in the heterogeneous case, we propose novel models for RGB-infrared (RGB-IR) and text-image matching.

Our first work addresses video-to-video Re-ID. We propose a novel shallow end-to-end model that combines two-stream CNNs, discriminative visual attention, and a recurrent neural network, trained with triplet and softmax losses, to learn spatiotemporal fusion features and improve generalization. In addition, we contribute a large novel dataset of airborne videos for person Re-ID, named DJI01. It covers challenging conditions such as occlusion, illumination changes, people with similar clothes, and the same people on different days. Elaborate qualitative and quantitative experiments on PRID-2011 [1], iLIDS-VID [3], MARS [2], and our drone dataset DJI01 demonstrate the robustness of the extracted discriminative features and the efficacy of the proposed model.

Our second work develops an image-based Re-ID model. To obtain features that are both generalizable and discriminative, we propose a novel deep reconstruction re-identification network (HDRNet). HDRNet comprises an encoder and a multi-resolution decoder, which learn embeddings invariant to pose, occlusion, illumination, and low resolution. We further propose a hybrid sampling strategy to boost the effectiveness of the training loss, and a test-set augmentation scheme that uses reconstructed images to explicitly transform the single-query setting into a multi-query one. In this multi-task approach, feature robustness is enhanced by the multi-resolution decoder, while the sampling strategy and test-time augmentation further improve overall performance. Rigorous analysis on the publicly available datasets CUHK03 [7], Market-1501 [8], and DukeMTMC-reID [9] demonstrates state-of-the-art accuracy.
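The joint triplet-plus-softmax objective mentioned for the video model above can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the dissertation's implementation: the stand-in backbone, embedding size, number of identities, margin, and loss weight are all assumptions chosen for readability.

    # Minimal sketch of a joint triplet + softmax (identification) objective.
    # All dimensions and hyperparameters below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class EmbeddingNet(nn.Module):
        """Toy stand-in for the spatiotemporal feature extractor."""
        def __init__(self, in_dim=2048, emb_dim=512, num_ids=751):
            super().__init__()
            self.embed = nn.Linear(in_dim, emb_dim)        # feature embedding
            self.classifier = nn.Linear(emb_dim, num_ids)  # identity logits for the softmax loss
        def forward(self, x):
            f = self.embed(x)
            return f, self.classifier(f)

    triplet = nn.TripletMarginLoss(margin=0.3)  # margin is an assumed value
    softmax_ce = nn.CrossEntropyLoss()          # "softmax loss" = cross-entropy over identities

    def joint_loss(model, anchor, positive, negative, anchor_ids, w=1.0):
        fa, logits = model(anchor)
        fp, _ = model(positive)
        fn, _ = model(negative)
        # The triplet term pulls same-identity samples together and pushes others
        # apart; the softmax term keeps the embedding identity-discriminative.
        return triplet(fa, fp, fn) + w * softmax_ce(logits, anchor_ids)

    model = EmbeddingNet()
    a, p, n = torch.randn(8, 2048), torch.randn(8, 2048), torch.randn(8, 2048)
    ids = torch.randint(0, 751, (8,))
    joint_loss(model, a, p, n, ids).backward()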
Homogeneous Re-ID assumes a single modality, which restricts its utility when samples are captured under different spectra. Our third contribution addresses this limitation with an image-based heterogeneous Re-ID model. Visible-infrared (RGB-IR) Re-ID is an important heterogeneous Re-ID task for surveillance under poor illumination, where the spectrum discrepancy compounds the conventional Re-ID challenges. To address it, we propose to disentangle spectrum information while learning identity-discriminative features. Specifically, we propose a novel network with a disentanglement loss that distills identity features and dispels spectrum features. The network has two branches: a spectrum-dispelling branch, on which an identification loss learns identity-related, spectrum-disentangled features, and a spectrum-distilling branch, on which an identity-dispeller loss fools the identity classifier so that the branch primarily captures spectrum-related information. The entire network is trained end-to-end, minimizing spectrum information and maximizing invariant identity-relevant information on the spectrum-dispelling branch. Extensive experiments on the existing datasets SYSU-MM01 [5] and RegDB [6] demonstrate the superior performance of our approach.

The previous problem settings deal only with images or videos. In many cases, however, only a verbal description of a person's appearance is available, and such descriptions can also serve as cues for finding a specific person in visual surveillance data. This scenario motivates us to address challenges beyond single- and multi-modal visual Re-ID, in a task termed text-image Re-ID. We first analyze its two major challenges: text complexity, arising from different words with the same meaning, and alignment uncertainty, occurring during matching due to poor correspondence of text-image pairs. To solve them, we propose an end-to-end Hierarchical Attention Alignment Network comprising: i) a new Term Frequency-Inverse Document Frequency (TF-IDF) thresholding strategy that extracts salient tokens to alleviate text complexity; and ii) a hierarchical attention alignment module that determines the potential relationships between image content and textual information at the word-patch, phrase-patch, and sentence-image levels to address alignment uncertainty. Because the hierarchical attention exploits salient regions, it offers an additional advantage for fine-grained text-image retrieval. The network is optimized end-to-end via a joint weighted hierarchical attention loss and a cross-modal loss. Extensive quantitative and qualitative analyses on the challenging datasets CUHK-PEDES [10] and Flickr-30K [11] demonstrate the superiority of the proposed approach.
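The TF-IDF thresholding step for extracting salient tokens can be sketched in a few lines. The following is a minimal, dependency-free Python illustration; the whitespace tokenizer, the toy corpus, the smoothed IDF form, and the threshold value are assumptions made for the example, not the dissertation's actual settings.

    # Minimal sketch of TF-IDF thresholding to keep salient caption tokens.
    # Tokenization, IDF smoothing, and the threshold are illustrative assumptions.
    import math
    from collections import Counter

    def tfidf_salient_tokens(caption, corpus, threshold=0.25):
        """Return tokens of `caption` whose TF-IDF score exceeds `threshold`."""
        docs = [d.lower().split() for d in corpus]
        n_docs = len(docs)
        tokens = caption.lower().split()
        tf = Counter(tokens)
        salient = []
        for tok, count in tf.items():
            df = sum(1 for d in docs if tok in d)        # document frequency
            idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
            if (count / len(tokens)) * idf > threshold:
                salient.append(tok)
        return salient

    corpus = ["a man wearing a red jacket and jeans",
              "a woman in a blue dress carrying a bag",
              "a man with a backpack walking on the street"]
    print(tfidf_salient_tokens("a man wearing a red jacket", corpus))

With a realistic caption corpus, frequent function words receive the minimum IDF and tend to fall below the threshold, while rare appearance terms survive as the salient tokens passed to the attention alignment stages.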