Indian legal case judgment document mining : semantic segmentation and information extraction

Das, Antara; Goyal, Vikram (Advisor)

dc.contributor.author	Das, Antara
dc.contributor.author	Goyal, Vikram (Advisor)
dc.date.accessioned	2024-09-21T10:31:37Z
dc.date.available	2024-09-21T10:31:37Z
dc.date.issued	2024-05-01
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/1678
dc.description.abstract	NLP-based techniques that automate tasks in the Indian legal domain are in high demand because of the rapidly growing volume of legal text documents, intricate legal terminology, and the need for efficient information retrieval and document analysis for legal professionals. These techniques streamline the processing, extraction, and understanding of legal information, improving productivity within the judicial framework. In this work, we experiment with two tasks: Task 1 extracts eight legal domain-specific named entities from Indian court judgment texts, and Task 2 is the semantic segmentation of Indian case judgment documents into functional or rhetorical components such as Facts, Arguments, and Judgment statement. We introduce two new large corpora for each task, which enabled us to experiment with different transformer-based models. For Task 1, we propose a hybrid approach that combines a BERT-CRF model for token classification with custom-designed rule-based information extraction. The semantic segmentation task can be modelled in two ways: a high-level approach automatically segregates a given text document into multiple functional chunks using a subtask called Label Shift Prediction (LSP), and a more detailed approach classifies the rhetorical roles (RR) of those text chunks. We experimented extensively on Task 2 to improve on prior research by introducing different ways of incorporating the Label Shift Prediction task into the hierarchical BERT-based approach to rhetorical role identification. Also in Task 2, we worked on a dataset with more fine-grained RR labels and severe label imbalance, and significantly improved performance on rare labels using a dynamically weighted loss.
Further, we experimented with the cross-domain performance of the RR and LSP models and showed that fine-tuning a model on a small corpus from a target domain can efficiently provide a solution for cases from an unseen domain.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	NER	en_US
dc.subject	Pretraining Models	en_US
dc.subject	Baselines: Legal NER	en_US
dc.title	Indian legal case judgment document mining : semantic segmentation and information extraction	en_US
dc.type	Thesis	en_US
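
The abstract describes Label Shift Prediction as a subtask that splits a judgment into functional chunks. The following is a minimal sketch of that idea, not the thesis implementation: it assumes that a label shift is a binary flag marking whether the rhetorical role changes between two consecutive sentences, and the function names `label_shifts` and `segment` as well as the toy data are hypothetical.

```python
from typing import List


def label_shifts(roles: List[str]) -> List[int]:
    """Return 1 where the role of sentence i+1 differs from sentence i, else 0.

    Assumption: this binary definition of a label shift is for illustration only.
    """
    return [int(a != b) for a, b in zip(roles, roles[1:])]


def segment(sentences: List[str], shifts: List[int]) -> List[List[str]]:
    """Group sentences into chunks, opening a new chunk after every shift."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for sentence, shift in zip(sentences[1:], shifts):
        if shift:
            chunks.append([sentence])    # role changed: start a new chunk
        else:
            chunks[-1].append(sentence)  # same role: extend the current chunk
    return chunks


# Toy example with hypothetical sentence identifiers and role labels.
roles = ["Facts", "Facts", "Arguments", "Arguments", "Judgment"]
sentences = [f"s{i}" for i in range(len(roles))]
shifts = label_shifts(roles)
chunks = segment(sentences, shifts)
```

In a trained system the shift flags would come from a model's predictions rather than from gold role labels; the chunking step stays the same either way.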