Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/166
Title: Finding influential people from a historical news repository
Authors: Gupta, Aayushee
Dutta, Haimonti (Advisor)
Keywords: Gazetteer
Text Mining
Information Retrieval
OCR
Spelling Correction
Historical data
Influential people detection
Issue Date: 5-Sep-2014
Publisher: IIIT Delhi
Abstract: Historical newspaper archives provide a wealth of information. They are of particular interest to genealogists, historians and scholars for People Search. In this thesis, we design a People Gazetteer from the noisy OCR text of historical newspapers and identify "influential" people from it. A People Gazetteer is a dictionary of personal names; each entry of the gazetteer is a tuple containing a person name and a list of articles in which his name occurs along with the corresponding topic associated with each article. To build the People Gazetteer, we first spell correct the noisy text using an edit distance based algorithm. A novel N-gram based evaluation algorithm is designed for measuring the performance of the spell corrector. Next, a Named Entity Recognizer is run on the text of each article to identify person entities and an LDA-based topic detector to assign categories to articles. To identify influential people across each category of People Gazetteer, we define the notion of an Influential Person Index (IPI) and rank based on it. Our corpus is a sample of 14020 OCR newspaper articles (roughly two months' data) obtained from "The Sun" newspaper in the Chronicling America project. We present results on the top-K influential people obtained from our algorithm by varying its parameters and verify results using Wikipedia.
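Illustration: the abstract names two building blocks, edit-distance-based spell correction and a gazetteer whose entries pair a person name with a list of (article, topic) items. The Python sketch below shows one minimal way these could look. It is not the thesis's implementation: the function names (edit_distance, correct, build_gazetteer), the max_distance threshold, and the sample records are all hypothetical, and the IPI ranking is omitted because its definition is not given in this record.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word within max_distance (an assumed
    threshold); otherwise return the word unchanged."""
    if not vocabulary or word in vocabulary:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v))
    return best if edit_distance(word, best) <= max_distance else word


def build_gazetteer(records):
    """Group (person_name, article_id, topic) records into a mapping
    person_name -> list of (article_id, topic)."""
    gazetteer = defaultdict(list)
    for name, article_id, topic in records:
        gazetteer[name].append((article_id, topic))
    return dict(gazetteer)


if __name__ == "__main__":
    vocab = {"influential", "people"}
    print(correct("infiuential", vocab))
    # Hypothetical records, as a named entity recognizer and topic model might emit them.
    print(build_gazetteer([("John Smith", 1, "politics"),
                           ("John Smith", 7, "crime")]))
```

In this sketch the person names and topics are supplied directly; in the pipeline the abstract describes they would come from the Named Entity Recognizer and the LDA-based topic detector.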
URI: https://repository.iiitd.edu.in/jspui/handle/123456789/166
Appears in Collections: Year-2014

Files in This Item:
File: MT12030.pdf
Size: 1.56 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.