| dc.description.abstract |
In today’s world, digital-born media, especially advertisements, have a substantial influence on our daily lives, from persuading us to buy particular brands to raising awareness of a social or environmental cause. This work proposes LearnAd, a learning method for the challenging task of understanding advertisements. Marketing graphics such as advertisements are digital-born and multimodal (containing both text and visual content), and they employ rhetorical devices such as emotions, symbolism, and slogans to convey meaning. Most current work on visual content understanding, by contrast, targets camera-shot images and does not translate well to marketing graphics. To address this gap, we propose using human content-interaction patterns, in the form of eye movements, to fine-tune the understanding of a Vision Transformer (ViT). This helps LearnAd, a multimodal transformer-based cross-attention model, achieve state-of-the-art results on three advertisement understanding tasks: generating the action that an ad persuades a user to take and the reason it provides for that action (the what-why of the ad), and predicting the sentiment and the topic of the advertisement image. Although real customer gaze patterns over marketing images are not available, LearnAd obtains these results with the help of generated human saliency patterns. |
en_US |