Comprehensive Evaluation of Tamil OCR Systems: A Survey, Dataset Creation, Benchmarking, and Error Analysis

Sivashanth Suthakar
Sarveswaran Kengatharaiyer
Andrew Charles Eugene Yugarajah

Abstract

This paper presents a comprehensive evaluation of Optical Character Recognition (OCR) systems for Tamil, a low-resource language whose recognition accuracy continues to trail that of high-resource languages despite substantial research effort. Tamil OCR is inherently challenging because of the script’s intricate character shapes, numerous ligatures, and large set of 247 characters, including complex vowel-consonant combinations. These features make segmentation and recognition considerably harder than for Latin scripts. To contextualize the evaluation, we conducted a literature survey covering pre-processing, recognition, and post-processing techniques. A major barrier in Tamil OCR research is the absence of a standardized diachronic benchmark dataset. To address this, we curated a dataset of 164 scanned images of printed documents spanning 1850 to the present, sampled at 10-year intervals. The collection captures eleven types of page layout and eight document characteristics, such as noise level, monolingual versus multilingual content, and printing technology. Expert-verified ground-truth data, including full-page transcriptions, line segmentations, and bounding boxes, enable detailed system evaluation. Using this dataset, we evaluated both commercial and open-source OCR systems through three strategies: full-page recognition, line-level segmentation, and bounding-box-based processing. Results show that although commercial systems achieve higher character and word accuracy, they still struggle with complex layouts, degraded text, and historical typefaces. Although the focus is on Tamil, the evaluation approach and findings are relevant to OCR research in other complex-script languages. The dataset and evaluation results are publicly available on GitHub to support future work in this domain.
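
As an illustration of what character and word accuracy measure, the sketch below computes character error rate (CER) and word error rate (WER) from the edit distance between a ground-truth transcription and an OCR output. It is a minimal sketch under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, text is compared at the Unicode code point level after NFC normalization, and words are split on whitespace. The abstract does not specify how the paper defines these metrics, so accuracy is shown here simply as one minus the error rate.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (an assumption;
    grapheme clusters would be another reasonable unit for Tamil)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return error_rate(list(ref), list(hyp))


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    return error_rate(ref, hyp)


if __name__ == "__main__":
    truth = "தமிழ்"   # five code points, ending in the pulli (virama)
    ocr = "தமிழ"      # the OCR output dropped the pulli
    print("CER:", cer(truth, ocr))
    print("character accuracy:", 1 - cer(truth, ocr))
```

Counting code points means that a single dropped vowel sign or pulli registers as one error even though the reader sees a whole character change, which is one reason the choice of unit matters for complex scripts such as Tamil.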

Author Biographies

Sivashanth Suthakar, Department of Computer Science, University of Jaffna

Suthakar Sivashanth received the B.Sc. (Hons.) degree in Computer Science from the University of Jaffna, Sri Lanka, in 2024, and is currently an MPhil candidate and Research Assistant with the Department of Computer Science at the same institution. His research interests include machine learning, deep learning, natural language processing (NLP), and artificial intelligence (AI).

Sarveswaran Kengatharaiyer, Department of Computer Science, University of Jaffna

Kengatharaiyer Sarveswaran received his B.Sc. (Hons.) degree in Computer Science from the University of Peradeniya, Sri Lanka, in 2006, and his M.Sc. (2011) and Ph.D. (2022) degrees in Computer Science from the University of Moratuwa, Sri Lanka. He is currently a Senior Lecturer (Grade I) in the Department of Computer Science at the University of Jaffna. His research interests include natural language processing and computational linguistics, with a particular focus on Tamil and other low-resource languages.

Andrew Charles Eugene Yugarajah, Department of Computer Science, University of Jaffna

Eugene Yugarajah Andrew Charles received the B.Sc. (First Class) degree in Computer Science from the University of Jaffna, Sri Lanka, in 1999, and a Ph.D. degree in Computer Science from Cardiff University, UK, in 2007. He is currently a Senior Lecturer (Grade I) in the Department of Computer Science, University of Jaffna. His research interests include blockchain transaction analysis, natural language processing (NLP) for Tamil, digital library resource description, and applied machine learning.