Multi-modal fusion transformer for understanding digital advertisements

Khurana, Varun; Shah, Rajiv Ratn (Advisor)

dc.contributor.author	Khurana, Varun
dc.contributor.author	Shah, Rajiv Ratn (Advisor)
dc.date.accessioned	2025-06-19T12:25:21Z
dc.date.available	2025-06-19T12:25:21Z
dc.date.issued	2023-05-10
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/1750
dc.description.abstract	In today’s world, digital-born media, especially advertisements, have a substantial influence on our daily lives, from persuading us to buy particular brands to creating awareness about a social or environmental cause. This work proposes LearnAd, a learning method for the challenging task of understanding advertisements. Marketing graphics such as advertisements are digital-born and multimodal (containing both text and visual content), and they employ rhetorical devices such as emotions, symbolism, and slogans to convey meaning. In contrast, most current work in visual content understanding targets camera-shot images and does not translate well to marketing graphics. To address this gap, we propose using human content-interaction patterns, in the form of eye movements, to fine-tune a Vision Transformer (ViT). This helps LearnAd, a multimodal transformer-based cross-attention model, achieve state-of-the-art results on three advertisement understanding tasks: generation of the action that an ad persuades a user to take and the reason it provides for that action (the what-why of the ad), and prediction of the sentiment and topic of the advertisement image. Because real customer gaze patterns over marketing images are not available, LearnAd obtains these results with the help of generated human saliency patterns.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	Digital advertisements	en_US
dc.subject	Digital marketing	en_US
dc.subject	Multi-modal content understanding	en_US
dc.subject	Advertisement understanding	en_US
dc.subject	Transformer	en_US
dc.subject	Cross attention	en_US
dc.title	Multi-modal fusion transformer for understanding digital advertisements	en_US
dc.type	Other	en_US