Abstract:
The commercial use of Natural Language Processing (NLP) has grown significantly in recent years. Many companies train and deploy language models to perform tasks such as sentiment classification and machine translation. These models are often published as black-box APIs that charge the user per query. However, such models are vulnerable to model stealing attacks, in which an attacker repeatedly queries the API and uses the resulting input-output pairs to train a thief model. The thief model can closely replicate the input-output behaviour of the original model. This attack thus poses a serious intellectual property risk and compromises the accuracy and reliability guarantees of the original model. Previous work in this domain has focused primarily on image classification models. In this study, we show that text classification models can be stolen using the same techniques. We conduct experiments to understand the impact of domain mismatch, model architecture variation, and query budget on extraction accuracy.
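The query-based extraction loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual setup: the victim model, its training texts, and the attacker's query set are all hypothetical stand-ins (a real attack would query a commercial black-box API rather than a locally trained classifier).

```python
# Minimal sketch of a query-based model stealing attack on a text
# classifier. All models and data here are hypothetical stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# "Secret" victim model, trained on data the attacker never sees.
victim_texts = ["great movie", "loved it", "terrible film", "awful plot",
                "wonderful acting", "boring and bad"]
victim_labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative
victim = make_pipeline(TfidfVectorizer(), LogisticRegression())
victim.fit(victim_texts, victim_labels)

# Attacker's unlabeled query set; its size is the query budget.
queries = ["great acting", "terrible movie", "loved the plot",
           "boring film", "wonderful movie", "awful acting"]

# Step 1: query the black-box API and record its predictions.
stolen_labels = victim.predict(queries)

# Step 2: train a thief model on the (query, prediction) pairs.
thief = make_pipeline(TfidfVectorizer(), LogisticRegression())
thief.fit(queries, stolen_labels)

# Step 3: measure agreement between thief and victim on held-out text.
test = ["great plot", "awful film"]
agreement = sum(thief.predict(test) == victim.predict(test)) / len(test)
```

The agreement score plays the role of extraction accuracy: it measures how well the thief replicates the victim's input-output behaviour, independently of the victim's accuracy on ground-truth labels.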