IIIT-Delhi Institutional Repository

Detection to interpretation: advancing tabular data processing with multimodal AI

dc.contributor.author Bhuyan, Pijush
dc.contributor.author Shah, Rajiv Ratn (Advisor)
dc.date.accessioned 2025-12-20T06:58:33Z
dc.date.available 2025-12-20T06:58:33Z
dc.date.issued 2024-12-21
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1788
dc.description.abstract Tables are the most common form of structured data found in documents. Proper interpretation of such raw tabular data by computer systems remains an open challenge. We take a deep dive into document intelligence, covering table detection, table reconstruction, and table structure interpretation by AI models. First, we address domain adaptation in table detection. Pre-trained table detection models perform poorly when the target domain differs from the source. We resolve this by building a domain-invariant table detection dataset in which we inject additional noisy synthetic detection data. Empirical tests show that a detection model trained on this synthetic data suffers a significantly smaller drop in performance when tested on out-of-distribution datasets. Next, we build a fast yet efficient end-to-end pipeline for Table-OCR, which reconstructs table structure and content from raw detection crops and converts them into a computer-storable text format. Finally, we design a comprehensive benchmark suite to evaluate the table structure understanding capabilities and limitations of existing Large Language Models (LLMs) and Vision Language Models (VLMs) using both text and image modalities. The vision component of VLMs is found to be a bottleneck in multimodal table interpretability. We work with a lightweight yet efficient, model-agnostic adapter module that injects positional information into the image modality through positional embeddings during model training. We also design a novel pre-training task for image-text alignment in open-source VLMs and study the change in model performance when interpreting visual tabular data. We also study the feasibility and future scope of true multimodal table understanding: reasoning over tabular data from both image and text modalities. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Table Detection en_US
dc.subject Unsupervised Domain Adaptation en_US
dc.subject Table Structure and Content Recognition (Table OCR) en_US
dc.subject Table Question Answering en_US
dc.subject Large Language Models en_US
dc.title Detection to interpretation: advancing tabular data processing with multimodal AI en_US
dc.type Thesis en_US
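
Note: the abstract above mentions a model-agnostic adapter that injects positional information into the image modality via positional embeddings. The thesis itself is not reproduced here, so the following is only a minimal illustrative sketch of that general idea, assuming PyTorch, a ViT-style patch grid, and arbitrary shapes; class and parameter names are hypothetical and do not reflect the author's implementation.

import torch
import torch.nn as nn


class PositionalAdapter(nn.Module):
    """Adds learnable 2D (row/column) positional embeddings to patch features."""

    def __init__(self, grid_h: int, grid_w: int, dim: int):
        super().__init__()
        # Separate row and column embeddings; their sum encodes the 2D
        # position of each patch in the image grid.
        self.row_emb = nn.Embedding(grid_h, dim)
        self.col_emb = nn.Embedding(grid_w, dim)
        self.grid_h, self.grid_w = grid_h, grid_w

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, grid_h * grid_w, dim) from a patch-based vision encoder.
        rows = torch.arange(self.grid_h, device=patch_feats.device)
        cols = torch.arange(self.grid_w, device=patch_feats.device)
        pos = self.row_emb(rows)[:, None, :] + self.col_emb(cols)[None, :, :]
        pos = pos.reshape(1, self.grid_h * self.grid_w, -1)
        # Residual injection of positional information into the image features.
        return patch_feats + pos


# Usage sketch: wrap the output of any patch-based vision encoder during training.
if __name__ == "__main__":
    adapter = PositionalAdapter(grid_h=24, grid_w=24, dim=768)
    feats = torch.randn(2, 24 * 24, 768)  # stand-in for encoder output
    print(adapter(feats).shape)  # torch.Size([2, 576, 768])

Because the adapter only modifies the encoder's output features, it can in principle be attached to different vision backbones without changing their weights, which is one plausible reading of "model-agnostic" in the abstract.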