Abstract:
The spread of fake news poses a serious problem in today’s world, where the masses consume and produce news on online platforms. One main reason fake news detection is hard is the lack of a ground-truth database for training classification models. In this paper, we present a benchmark dataset for fake news detection. This dataset is an order of magnitude larger than existing datasets for fake news detection. Moreover, we collect our training and testing datasets from different news sources to understand how well deep detection architectures generalize to unseen data. We also present an augmented training dataset generated using a custom data augmentation algorithm. The proposed dataset comprises two modalities, image and text; therefore, both unimodal and multimodal (deep learning) models can be trained on it. We also present baseline results for unimodal and multimodal approaches, and we observe that the multimodal approaches yield better results than the unimodal ones. We assert that the availability of such a large database can stimulate research on this difficult problem.