
dc.contributor.author Sundararajan, Niranjan
dc.contributor.author Dubey, Vibhu
dc.contributor.author Subramanyam, A V (Advisor)
dc.date.accessioned 2024-05-16T12:03:55Z
dc.date.available 2024-05-16T12:03:55Z
dc.date.issued 2023-12-11
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1492
dc.description.abstract Image-Text Retrieval (ITR) is the task of retrieving an image from a corresponding textual description and/or a textual description from the corresponding image. Person Re-Identification (Person Re-ID) is a downstream task of ITR in which the images and texts describe persons. Our paper focuses only on text-based Person Re-ID, i.e., retrieving images of a person from their textual descriptions. The key challenge in Person Re-ID is that the textual modality is feature-coarse whereas the image modality is feature-dense, so the granularity gap between the two modalities is large. In addition, images and texts are inherently different modalities, leading to a large modality gap; feature learning therefore becomes difficult. Another problem in this domain is the shortage of datasets, primarily due to the privacy concerns of pedestrians whose images are captured. A possible solution is to learn from a combination of datasets, and incorporating meta-learning to learn across datasets while retaining model robustness is one way to achieve this. In this paper, we aim to develop a model that can learn effectively despite the image-text granularity gap while incorporating multiple datasets into its training through meta-learning. CUHK-PEDES, RSTPReid and ICFG-PEDES are the three available benchmarks for evaluating T2I ReID methods. RSTPReid and ICFG-PEDES comprise identities from MSMT17, but their diversity is limited by the small number of unique persons. CUHK-PEDES, on the other hand, comprises 13,003 identities but has relatively short text descriptions on average. Further, these datasets are captured in restricted environments with a limited number of cameras. To further diversify the identities and provide dense captions, we propose a novel dataset called IIITD-20K. IIITD-20K comprises 20,000 unique identities captured in the wild and provides a rich dataset for text-to-image ReID. We further synthetically generate images and fine-grained captions using Stable Diffusion and BLIP models trained on our dataset. We perform elaborate experiments using state-of-the-art text-to-image ReID models and vision-language pre-trained models, and present a comprehensive analysis of the dataset. Our experiments also reveal that synthetically generated data leads to a substantial performance improvement in both same-dataset and cross-dataset settings. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Text-based Person Re-Identification en_US
dc.subject Image-Text Retrieval en_US
dc.subject Meta-learning en_US
dc.subject Information Retrieval en_US
dc.subject Machine Learning en_US
dc.title Person Re-identification en_US
dc.type Other en_US
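
As a minimal illustration of the text-to-image retrieval setup described in the abstract above (not the authors' actual method), the sketch below ranks a gallery of person images against one textual query by cosine similarity of embeddings. The encoder functions encode_text and encode_images, the gallery_ids list, and the embedding shapes are assumptions introduced here for illustration only; any dual-encoder (e.g. CLIP-style) model could play that role.

# Minimal, illustrative sketch of text-to-image person retrieval.
# encode_text / encode_images stand in for a hypothetical dual-encoder
# vision-language model; only numpy is used here.
import numpy as np

def cosine_sim(query_vec, gallery_mat):
    # Cosine similarity between one query vector (d,) and a gallery matrix (N, d).
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_mat / np.linalg.norm(gallery_mat, axis=1, keepdims=True)
    return g @ q  # shape (N,)

def rank_gallery(text_embedding, gallery_embeddings, gallery_ids, top_k=10):
    # Return the top-k (identity, score) pairs for one textual description.
    scores = cosine_sim(text_embedding, gallery_embeddings)
    order = np.argsort(-scores)[:top_k]
    return [(gallery_ids[i], float(scores[i])) for i in order]

# Usage with placeholder encoders (assumed, not part of the original record):
# q = encode_text("a man in a red jacket carrying a black backpack")
# G = encode_images(gallery_images)          # shape (N, d)
# hits = rank_gallery(q, G, gallery_ids)     # ranked (identity, score) pairs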

