Abstract:
With the massive amount of online multimedia data (e.g., images, videos, text articles)
and the growing needs of users, venue discovery using multimedia data has
become a prominent research topic. In this study, we refer to business and travel
locations as venues and aim to improve the efficiency of venue discovery through
hashing. Previously, a lot of work has been done in the field of cross-modal retrieval
to reduce the heterogeneity gap between multiple modalities, so that samples
from those modalities can be compared directly. Such techniques have also been
applied to venue discovery. However, advances in technology have increased
the volume of multimedia data, making retrieval slower and more difficult.
Therefore, hashing techniques are being developed to project features from
different modalities into a common Hamming space. Hash codes require very little
storage space and can be compared faster than real-valued features using the
Hamming distance. In this thesis, we propose an adversarial learning-based approach
for generating hash codes for venue-related heterogeneous multimedia data to ease the
task of venue discovery without any location information.
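As a rough illustration of why compact binary codes speed up retrieval, the sketch below compares packed hash codes with the Hamming distance using bitwise operations; the 64-bit code length, the helper names, and the use of NumPy are illustrative assumptions rather than details of the proposed method.

import numpy as np

def pack_codes(bits):
    # Pack an (n, n_bits) array of {0, 1} bits into bytes, one row per sample.
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_distance(query, database):
    # Count differing bits between one packed query code and every packed database code.
    xor = np.bitwise_xor(database, query)          # differing bits, byte by byte
    return np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per database item

# Toy usage: 1000 database items and one query, each with a random 64-bit code.
rng = np.random.default_rng(0)
db = pack_codes(rng.integers(0, 2, size=(1000, 64)))
query = pack_codes(rng.integers(0, 2, size=(1, 64)))
top5 = np.argsort(hamming_distance(query, db))[:5]  # nearest codes by Hamming distance

Because the comparison reduces to XOR and bit counting over a few bytes per item, it is both faster and far more memory-efficient than distance computations over real-valued feature vectors.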
Previous works have shown the great ability of Generative Adversarial Networks
(GANs) to model the distribution of the data and learn discriminative representations.
We show how GANs can be used to learn to generate hash codes using category
and pairwise information that occurs naturally in the data. Most existing supervised
cross-modal hashing methods map data from different modalities to a Hamming space,
where semantic information is exploited to supervise data in different modalities
during the training stage. However, previous works neglect the pairwise similarity between
data in different modalities, which leads to degraded performance
when finding exact matches for queries. To address this issue, we propose a supervised
Generative Adversarial Cross-modal Hashing method by Transferring Pairwise
Similarities (SGACH-TPS). This work makes three significant contributions: i) we propose
a model for efficient venue discovery on a new dataset, WikiVenue, of
real-world images produced by people; ii) we design a supervised generative adversarial
network that constructs a hash function mapping multimodal image-text
pairs to a common Hamming space; and iii) we suggest a simple transfer training strategy for
the adversarial network to supervise samples from different modalities, in which
the pairwise similarity is transferred to the fine-tuning stage of training. To
show that our work generalizes to the broader field of cross-modal retrieval, we report experiments
on the benchmark datasets Wiki and NUS-WIDE.
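To make the pairwise-similarity supervision concrete, the following is a minimal, non-authoritative sketch of how relaxed binary codes from two modality networks could be trained with a category loss plus a pairwise similarity term; the layer sizes, feature dimensions, and exact loss form are assumptions for illustration, and the adversarial discriminator of SGACH-TPS is omitted here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HashNet(nn.Module):
    # Illustrative per-modality hash network; dimensions are assumptions.
    def __init__(self, in_dim, code_len=64, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, code_len), nn.Tanh())  # relaxed codes in (-1, 1)
        self.classifier = nn.Linear(code_len, n_classes)

    def forward(self, x):
        code = self.encoder(x)
        return code, self.classifier(code)

def pairwise_similarity_loss(img_codes, txt_codes, sim):
    # Negative log-likelihood of the cross-modal similarity matrix given code inner products.
    theta = img_codes @ txt_codes.t() / 2.0
    return (F.softplus(theta) - sim * theta).mean()

img_net, txt_net = HashNet(4096), HashNet(1000)
img_feat, txt_feat = torch.randn(8, 4096), torch.randn(8, 1000)
labels = torch.randint(0, 10, (8,))
sim = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # 1 if the pair shares a category

img_code, img_logits = img_net(img_feat)
txt_code, txt_logits = txt_net(txt_feat)
loss = (F.cross_entropy(img_logits, labels) + F.cross_entropy(txt_logits, labels)
        + pairwise_similarity_loss(img_code, txt_code, sim))
loss.backward()

In this sketch, the category loss supervises each modality individually, while the pairwise term pulls the relaxed codes of semantically similar image-text pairs together across modalities, which is the role pairwise similarity plays during the fine-tuning stage described above.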