Abstract:
Images generate a tremendous amount of impact on social media, as they account for more than 60% of the content available online. Understanding the textual content of an image is therefore important for drawing constructive inferences. A significant number of optical character recognition (OCR) tools exist for conducting research and extracting text from images, including Tesseract, the Google Vision API, Microsoft Cognitive Services, and ocropy. However, some of these tools are paid and expensive, while others give less accurate results on memes and user-generated online social media (OSM) content. This report describes the methodology adopted for developing an OCR tool for this specific purpose. It discusses two mainstream methods for text recognition: tweaking the Tesseract pipeline to improve its existing results, and using a Single Shot MultiBox Detector (SSD) to segment the text regions, trained on synthetically generated annotated data. The results are compared using multiple string-matching metrics, including Jaccard similarity and Jaro-Winkler, among others.
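To illustrate how one of the string-matching metrics named above can score OCR output against ground-truth text, here is a minimal sketch of Jaccard similarity. It assumes lowercased, whitespace-separated word tokens. The abstract does not say whether the comparison is made at word or character level, so the tokenization and the function name `jaccard_similarity` are illustrative assumptions, not the report's implementation.

```python
def jaccard_similarity(predicted: str, reference: str) -> float:
    """Jaccard similarity between two strings, computed over word sets.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = set(predicted.lower().split())
    ref_tokens = set(reference.lower().split())

    # Two empty strings are treated as identical to avoid dividing by zero.
    if not pred_tokens and not ref_tokens:
        return 1.0

    # |intersection| / |union| of the two token sets.
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)
```

A score of 1.0 means the OCR output and the reference contain exactly the same set of words, while 0.0 means they share none. Because the metric works on sets, it ignores word order and repeated words.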