Abstract:
A scene can be interpreted from two perspectives: geometric and semantic. Geometric scene understanding requires inferring the 3D layout of the scene from an image or video, while semantic scene understanding requires identifying the types of objects and their relationships. It is well known that geometric estimation is sensitive to random noise and outliers in the data and therefore typically leverages robust estimation methods. Thanks to deep learning (DL) approaches, semantic scene understanding has seen substantial progress through advances in tasks such as image classification, object detection, and semantic segmentation, even in complex, cluttered scenes. However, DL techniques are known to be susceptible to carefully crafted noisy samples, popularly known as adversarial examples. Perceptually, clean and adversarial samples look very similar; even a human finds it difficult to differentiate between them semantically. Essentially, the goal of an adversary is to add just enough noise to a clean sample that the underlying classifier, learned in the form of neural network weights, no longer predicts it correctly. This vulnerability has inspired the investigation of robust methods for deep neural networks as well.
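To make the contrast concrete, the two notions of robustness discussed above can be written in generic form. The sketch below is illustrative only and is not the specific formulation used in this dissertation: $\theta$ denotes a geometric model with residuals $r_i$, $\rho$ a robust loss (e.g., a truncated or Huber loss), $f$ a classifier with loss $\ell$, and $\epsilon$ the adversary's perturbation budget on a clean sample $x$ with label $y$.
\[
\hat{\theta} \;=\; \arg\min_{\theta} \sum_{i} \rho\!\big(r_i(\theta)\big)
\qquad \text{(robust geometric model fitting)}
\]
\[
\delta^{\ast} \;=\; \arg\max_{\|\delta\|_{p} \le \epsilon} \ell\!\big(f(x+\delta),\, y\big)
\qquad \text{(adversarial perturbation)}
\]
In both cases, robustness amounts to making the estimate or prediction insensitive to such worst-case or outlying deviations in the input.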
The ultimate goal of an intelligent visual perception system is to mimic human-level scene understanding and reasoning from images and videos. Together, semantic and geometric understanding play a vital role in bridging the gap between human and machine vision. The universal presence of noisy and corrupted input data, and the sensitivity of both geometric and semantic tasks to it, raises an important question about the reliability of such tasks in security-critical applications. For example, an autonomous car navigating a city may misclassify a red light as green or inaccurately estimate its distance from an obstacle; such an event can lead to a catastrophic situation. What makes human perception unique is its robustness. Therefore, to mimic human-level understanding, we believe that all methods intended for scene understanding tasks must treat robustness as a prime and necessary component, and evaluate their effectiveness accordingly.
In this dissertation, we investigate the robustness of two geometric tasks and one semantic task toward robust scene understanding. Geometric: multiple geometric model fitting and Simultaneous Localization and Mapping (SLAM) using a monocular camera. Semantic: robust image classification. We propose robust solutions to these three important scene understanding tasks.