Abstract:
Machine learning models output high confidence scores when presented with input samples drawn from the training data distribution, referred to as in-distribution (ID) samples. However, when presented with an out-of-distribution (OOD) sample at inference time, the same model often produces unreliable confidence scores, raising questions about its interpretability and reliability. In particular, ML models typically fail to detect near-OOD samples: samples that are perceptually similar to, yet semantically different from, the training distribution, differing from it only through fine-grained variations. We address such fine-grained variations using Vision Transformers (ViT), capturing patch-level correlations through the self-attention mechanism. The ViT forms part of our VAE architecture, together with a neural network that performs patch-level disentanglement; this patch-level disentanglement with a ViT encoder in turn disentangles the latent factors common to the entire image. To the best of our knowledge, this is the first work that uses a ViT in an encoder-decoder architecture for OOD detection. Through experiments on fine-grained datasets such as Oxford Flowers-102 and CUB-200 Birds, we demonstrate that the proposed method outperforms both density-based and classification-based OOD-aware baselines on OOD detection metrics such as AUC, AUPR, and FPR@95. We also demonstrate that training the entire network with an SRGAN decoder, using a combination of mean squared error and perceptual losses, learns better representations for density-based methods.
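
The abstract describes a VAE whose encoder is a ViT producing patch-level latents, trained with a reconstruction objective that combines mean squared error with a perceptual loss. The sketch below is only a rough illustration of that combination, not the authors' implementation: the class and function names (`PatchViTEncoder`, `ConvDecoder`, `vae_loss`), all hyperparameters, and the plain convolutional decoder standing in for the SRGAN decoder are assumptions.

```python
# Minimal sketch, assuming a patch-level Gaussian posterior and a VGG-based
# perceptual term; not the paper's architecture or hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class PatchViTEncoder(nn.Module):
    """ViT-style encoder that outputs a Gaussian posterior per image patch."""
    def __init__(self, img_size=64, patch=8, dim=128, depth=4, heads=4, z_dim=32):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))     # learned positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.to_mu = nn.Linear(dim, z_dim)        # patch-level mean
        self.to_logvar = nn.Linear(dim, z_dim)    # patch-level log-variance

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        h = self.transformer(tokens)                                  # self-attention over patches
        return self.to_mu(h), self.to_logvar(h)


class ConvDecoder(nn.Module):
    """Simple upsampling decoder used here as a placeholder for the SRGAN decoder."""
    def __init__(self, img_size=64, patch=8, z_dim=32):
        super().__init__()
        self.grid = img_size // patch
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, z):  # z: (B, N, z_dim) -> image (B, 3, H, W)
        z = z.transpose(1, 2).reshape(z.size(0), -1, self.grid, self.grid)
        return self.net(z)


# Frozen VGG16 features for the perceptual term (downloads ImageNet weights;
# input normalization is omitted here for brevity).
vgg = torchvision.models.vgg16(weights="DEFAULT").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)


def vae_loss(x, x_hat, mu, logvar, beta=1.0, lam=0.1):
    mse = F.mse_loss(x_hat, x)                       # pixel-space reconstruction
    perceptual = F.mse_loss(vgg(x_hat), vgg(x))      # feature-space reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + lam * perceptual + beta * kl


# Usage: reparameterize per patch, reconstruct, and combine the losses.
enc, dec = PatchViTEncoder(), ConvDecoder()
x = torch.rand(4, 3, 64, 64)
mu, logvar = enc(x)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
loss = vae_loss(x, dec(z), mu, logvar)
loss.backward()
```

In a density-based OOD setup along these lines, the per-patch posterior parameters (or the reconstruction error) would then be scored at test time, with low likelihood or high reconstruction error flagging a sample as OOD; the exact scoring rule used by the paper is not specified in the abstract.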