Abstract:
Image-Text Retrieval (ITR) is the task of retrieving an image from its corresponding textual description and/or a textual description from its corresponding image. Person Re-Identification (Person Re-ID) is a downstream task of ITR in which the images and texts describe persons. Our paper focuses only on text-based Person Re-ID, i.e., retrieving images from their textual descriptions. The key challenge in Person Re-ID is that the textual modality is feature-coarse, whereas the image modality is feature-dense, so the granularity gap between the two modalities is large. In addition, the inherent modalities of images and texts differ, leading to a large modality gap. Feature learning therefore becomes difficult. Another problem in this domain is the shortage of datasets, primarily due to the privacy concerns of pedestrians whose images are captured. A possible solution is to learn from a combination of datasets, incorporating meta-learning to learn across datasets while retaining model robustness. In this paper, we aim to develop a model that learns effectively despite the image-text granularity gap while incorporating multiple datasets into its training via meta-learning. CUHK-PEDES, RSTPReid, and ICFG-PEDES are the three available benchmarks for evaluating text-to-image (T2I) ReID methods. RSTPReid and ICFG-PEDES comprise identities from MSMT17, but their limited number of unique persons restricts their diversity. CUHK-PEDES, on the other hand, comprises 13,003 identities but has relatively short text descriptions on average. Further, these datasets are captured in restricted environments with a limited number of cameras. To further diversify the identities and provide dense captions, we propose a novel dataset called IIITD-20K. IIITD-20K comprises 20,000 unique identities captured in the wild and provides a rich dataset for text-to-image ReID.
We further synthetically generate images and fine-grained captions using Stable Diffusion and BLIP models trained on our dataset. We perform elaborate experiments using state-of-the-art text-to-image ReID models and vision-language pretrained models, and present a comprehensive analysis of the dataset. Our experiments also reveal that synthetically generated data leads to substantial performance improvements in both same-dataset and cross-dataset settings.