Abstract:
Developing NLP-based techniques to automate tasks in the Indian legal domain is in high demand due to the rapidly growing volume of legal text documents, intricate legal terminology, and the need for efficient information retrieval and document analysis by legal professionals. These techniques streamline the large-scale processing, extraction, and understanding of legal information, improving productivity within the judicial framework. In this work, we experiment with two tasks: Task 1 deals with extracting eight legal domain-specific named entities from Indian court judgment texts, and Task 2 addresses the semantic segmentation of Indian case judgment documents into different functional or rhetorical components such as Facts, Arguments, and Judgment statements. We introduce two new large corpora, one for each task, which enabled us to experiment with different transformer-based models. For Task 1, we propose a hybrid approach combining a BERT-CRF model for token classification with specially designed rule-based information extraction. The semantic segmentation task can be modelled in two ways: a high-level approach that automatically segregates a given text document into multiple functional chunks using a subtask called Label Shift Prediction (LSP), and a more detailed approach that classifies the rhetorical roles (RR) of those text chunks. We experiment extensively on Task 2, improving upon prior research by introducing different ways of incorporating the Label Shift Prediction task to enhance the hierarchical BERT-based approach to rhetorical role identification. Also for Task 2, we work on a dataset with more fine-grained RR labels and severe label imbalance, and significantly improve the performance on rare labels using a dynamically weighted loss. Further, we experiment with the cross-domain performance of the RR and LSP prediction models and show that fine-tuning a model on a small corpus from a target domain can efficiently provide solutions for cases from an unseen domain.