dc.description.abstract |
Sarcasm is a pervasive linguistic phenomenon that is highly challenging to explain due to
its subjectivity, lack of context, and deeply felt opinions. In a multimodal setup, sarcasm
is conveyed through the incongruity between the textual and visual entities. Although recent
approaches treat it as a classification problem, they do not explain why an online post is identified as sarcastic. Without a proper explanation, end users may not be able to perceive the
underlying use of irony. In this paper, we propose a novel problem – Multimodal Sarcasm
Explanation (MSE) – given a multimodal sarcastic post containing an image and a caption,
we aim to generate a natural language explanation to reveal the intended sarcasm. To this
end, we develop a novel dataset, MORE, with explanations for 3,510 sarcastic multimodal
posts. Each explanation is a natural language (English) sentence that describes the hidden
irony. We then propose EXMORE, a multimodal transformer-based architecture to address
MSE. It incorporates cross-modal attention in the transformer's encoder, which attends to the distinguishing features between the two modalities. Subsequently, a BART-based auto-regressive
decoder is used as the generator. Empirical results demonstrate the efficacy of EXMORE
over six baselines (adopted for MSE) and show a > 10% improvement over the best
baseline across five evaluation metrics. |
en_US |