Optical Character Recognition and Analysis of Tamil Characters


Abstract


Several tools are available for the analysis of English text, but few exist for Tamil. Although most applications and programming languages on the internet now support non-English languages, they remain difficult to use for Tamil. Popular programming languages such as Python and Java can process Tamil text, but their built-in string classes do not follow the rules of the language. For example, கீ is treated as two separate characters, ‘க’ followed by the vowel sign ‘ீ’ (the dependent form of ‘ஈ’), rather than as a single letter. This creates unnecessary overhead when processing the language. There is also no open-source library that can process and analyze Tamil in a manner comparable to the NLTK library. The aim of this project is to build an open-source, user-friendly Python library for analyzing a Tamil corpus in its correct form. A separate class is created for the language to overcome the shortcomings of Python's built-in string class when handling Tamil letters. The library is designed to provide users with several functions for processing any given Tamil text. We also perform optical character recognition of printed Tamil characters and process the detected characters further for analysis.
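
The character-splitting issue described above can be illustrated with a minimal sketch. This is not the library's implementation; the function name tamil_letters is hypothetical, and the approach (attaching Unicode combining marks to the preceding base character, using only the standard unicodedata module) is an assumption chosen for brevity. It does not handle conjuncts such as க்ஷ, which this simple rule would still split.

```python
import unicodedata


def tamil_letters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration: Tamil vowel signs and the pulli
    have Unicode categories Mn or Mc, so they are appended to the
    previous letter instead of being counted separately.
    """
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch      # attach vowel sign / pulli to its base
        else:
            letters.append(ch)     # start a new letter
    return letters


word = "தமிழ்"
print(len(word))                  # number of code points, not letters
print(len(tamil_letters(word)))   # number of letters after grouping
```

With Python's built-in string type, len counts code points, so a word of three Tamil letters is reported as longer than it is; grouping marks with their base character recovers the letter count a Tamil reader would expect.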




Keywords


Character Analysis; Corpus Analysis; Library; OCR; Package; Python; Tamil Characters.