dc.description.abstract |
In the past few years, cross-modal information retrieval has drawn much attention due to the significant growth of multimodal data. It takes one type of data as the query to retrieve relevant data of multiple modalities. For example, a user can submit a text query to retrieve relevant images or videos. Since the query and its retrieved results can be of different modalities, measuring content similarity between different modalities of data remains a challenge. Existing solutions typically project data from different modalities into a common latent space and then learn an independent mapping from one modality to another. In this paper, we propose a novel fully-coupled deep learning architecture that can effectively exploit inter-modal and intra-modal associations in heterogeneous data. The proposed learning objective captures correlations between cross-modal data while preserving intra-modal relationships. We also propose a training method based on expectation maximization for learning the mapping function from one modality to another. The proposed training method is memory efficient, and large training datasets can be split into mini-batches for parameter updates. |
en_US |