IIIT-Delhi Institutional Repository

Detection to interpretation: advancing tabular data processing with multimodal AI

dc.contributor.author Bhuyan, Pijush
dc.contributor.author Shah, Rajiv Ratn (Advisor)
dc.date.accessioned 2025-12-20T06:58:33Z
dc.date.available 2025-12-20T06:58:33Z
dc.date.issued 2024-12-21
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1788
dc.description.abstract Tables are the most common form of structured data found in documents. Proper interpretation of such raw tabular data by computer systems remains an open challenge. We take a deep dive into document intelligence, covering table detection, table reconstruction, and table structure interpretation by AI models. First, we address domain adaptation in table detection. Pre-trained table detection models perform poorly when the target domain differs from the source. We resolve this by building a domain-invariant table detection dataset in which we inject additional noisy synthetic detection data. Empirical tests show that a detection model trained on this synthetic data suffers a significantly smaller drop in performance when tested on out-of-distribution datasets. Next, we build a fast yet efficient end-to-end pipeline for Table-OCR, which reconstructs table structure and content from raw detection crops and converts them into a computer-storable text format. Finally, we design a comprehensive benchmark suite to evaluate the table structure understanding capabilities and limitations of existing Large Language Models (LLMs) and Vision Language Models (VLMs) using both text and image modalities. The vision component of VLMs is found to be a bottleneck in multimodal table interpretability. We work with a lightweight yet efficient, model-agnostic adapter module that injects positional information into the image modality through positional embeddings during model training. We also design a novel pre-training task for image-text alignment in open-source VLMs and study the change in model performance when interpreting visual tabular data. We also study the feasibility and future scope of true multimodal table understanding: reasoning over tabular data from both image and text modalities. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Table Detection en_US
dc.subject Unsupervised Domain Adaptation en_US
dc.subject Table Structure and Content Recognition (Table OCR) en_US
dc.subject Table Question Answering en_US
dc.subject Large Language Models en_US
dc.title Detection to interpretation: advancing tabular data processing with multimodal AI en_US
dc.type Thesis en_US
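
Note: the abstract above mentions a model-agnostic adapter that injects positional information into the image modality via positional embeddings. The thesis itself is not reproduced here, so the following is only a minimal illustrative sketch of that general idea, assuming PyTorch, a ViT-style patch grid, and arbitrary shapes; class and parameter names are hypothetical and do not reflect the author's implementation.

import torch
import torch.nn as nn


class PositionalAdapter(nn.Module):
    """Adds learnable 2D (row/column) positional embeddings to patch features."""

    def __init__(self, grid_h: int, grid_w: int, dim: int):
        super().__init__()
        # Separate row and column embeddings; their sum encodes the 2D
        # position of each patch in the image grid.
        self.row_emb = nn.Embedding(grid_h, dim)
        self.col_emb = nn.Embedding(grid_w, dim)
        self.grid_h, self.grid_w = grid_h, grid_w

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, grid_h * grid_w, dim) from a patch-based vision encoder.
        rows = torch.arange(self.grid_h, device=patch_feats.device)
        cols = torch.arange(self.grid_w, device=patch_feats.device)
        pos = self.row_emb(rows)[:, None, :] + self.col_emb(cols)[None, :, :]
        pos = pos.reshape(1, self.grid_h * self.grid_w, -1)
        # Residual injection of positional information into the image features.
        return patch_feats + pos


# Usage sketch: wrap the output of any patch-based vision encoder during training.
if __name__ == "__main__":
    adapter = PositionalAdapter(grid_h=24, grid_w=24, dim=768)
    feats = torch.randn(2, 24 * 24, 768)  # stand-in for encoder output
    print(adapter(feats).shape)  # torch.Size([2, 576, 768])

Because the adapter only modifies the encoder's output features, it can in principle be attached to different vision backbones without changing their weights, which is one plausible reading of "model-agnostic" in the abstract.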