Abstract:
Object detection is a fundamental task in computer vision, involving the localization and classification of objects within images or video frames. However, training object detectors requires large amounts of labeled data, which motivates active learning strategies that optimize annotation efficiency, model performance and training time. This thesis explores the application of active learning to object detection in practical settings, focusing on reducing the model's localization and classification errors. We investigate the applicability of recent advances in foundation models for this purpose, designing a semi-supervised learning system that complements large vision-language models with humans in the loop. We also study recent advances in Neural Collapse, a phenomenon observed during the terminal phase of training (TPT) and characterized by the emergence of distinctive geometric structures in the model weights and learned feature representations. We study neural collapse under various settings to develop an active sampling algorithm grounded in established theoretical research. Through this research, we contribute to an overall system that facilitates efficient training of object detectors, maximizing performance while minimizing the expenditure of human, monetary and computational resources.