Comprehensive Evaluation of Tamil OCR Systems: A Survey, Dataset Creation, Benchmarking, and Error Analysis

Sivashanth Suthakar; Sarveswaran Kengatharaiyer; Andrew Charles Eugene Yugarajah


Published Mar 6, 2026

Sivashanth Suthakar

Department of Computer Science, University of Jaffna

https://orcid.org/0009-0002-2538-3485

Sarveswaran Kengatharaiyer

Department of Computer Science, University of Jaffna

https://orcid.org/0000-0003-1579-0597

Andrew Charles Eugene Yugarajah

Department of Computer Science, University of Jaffna

https://orcid.org/0000-0002-0678-3486

Abstract

This paper presents a comprehensive evaluation of Optical Character Recognition (OCR) systems for Tamil, a low-resource language that continues to trail high-resource languages in recognition accuracy despite substantial research effort. Tamil OCR is inherently challenging because of the script’s intricate character shapes, numerous ligatures, and large set of 247 characters, including complex vowel-consonant combinations. These features make segmentation and recognition considerably harder than for Latin scripts. To contextualize the evaluation, we conducted a literature survey covering pre-processing, recognition, and post-processing techniques. A major barrier in Tamil OCR research is the absence of a standardized diachronic benchmark dataset. To address this, we curated a dataset of 164 scanned images of printed documents spanning 1850 to the present, sampled at 10-year intervals. The collection captures eleven types of page layout and eight document characteristics, including noise level, monolingual versus multilingual content, and printing technology. Expert-verified ground-truth data, including full-page transcriptions, line segmentations, and bounding boxes, enable detailed system evaluation. Using this dataset, we evaluated both commercial and open-source OCR systems through three strategies: full-page recognition, line-level segmentation, and bounding-box-based processing. Results show that although commercial systems achieve higher character and word accuracy, they still struggle with complex layouts, degraded text, and historical typefaces. Although the focus is on Tamil, the evaluation approach and findings are relevant to OCR research on other complex-script languages. The dataset and evaluation results are publicly available on GitHub to support future work in this domain.
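The abstract reports character and word accuracy without stating how they are computed. As a minimal sketch, assuming the common edit-distance definition (accuracy = 1 - error rate, with the error rate being the Levenshtein distance divided by the reference length), the two metrics could be computed as follows. The function names are hypothetical and are not taken from the paper or its GitHub repository. The sketch compares NFC-normalized Unicode code points; an evaluation of Tamil text might instead count grapheme clusters, which would change the numbers.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute r with h (or match)
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Character error rate over NFC-normalized code points (an assumption)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def word_error_rate(reference, hypothesis):
    """Word error rate over whitespace-separated tokens (an assumption)."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    ground_truth = "தமிழ்"   # five code points
    ocr_output = "தமழ்"      # the vowel sign is missing
    cer = char_error_rate(ground_truth, ocr_output)
    print(f"CER = {cer:.2f}, character accuracy = {1 - cer:.2f}")
```

Because a single missing vowel sign changes the whole syllable a reader sees, code-point-level scores such as these can understate how visible an error is on the page.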

Issue

Vol 18 No 3 (2025): 2025 December Issue


Author Biographies

Sivashanth Suthakar, Department of Computer Science, University of Jaffna

Suthakar Sivashanth received the B.Sc. (Hons.) degree in Computer Science from the University of Jaffna, Sri Lanka, in 2024, and is currently an MPhil candidate and Research Assistant with the Department of Computer Science at the same institution. His research interests include machine learning, deep learning, natural language processing (NLP), and artificial intelligence (AI).

Sarveswaran Kengatharaiyer, Department of Computer Science, University of Jaffna

Kengatharaiyer Sarveswaran received his B.Sc. (Hons.) degree in Computer Science from the University of Peradeniya, Sri Lanka, in 2006, and his M.Sc. (2011) and Ph.D. (2022) degrees in Computer Science from the University of Moratuwa, Sri Lanka. He is currently a Senior Lecturer (Grade I) in the Department of Computer Science at the University of Jaffna. His research interests include natural language processing and computational linguistics, with a particular focus on Tamil and other low-resource languages.

Andrew Charles Eugene Yugarajah, Department of Computer Science, University of Jaffna

Eugene Yugarajah Andrew Charles received the B.Sc. (First Class) degree in Computer Science from the University of Jaffna, Sri Lanka, in 1999, and a Ph.D. degree in Computer Science from Cardiff University, UK, in 2007. He is currently a Senior Lecturer (Grade I) in the Department of Computer Science, University of Jaffna. His research interests include blockchain transaction analysis, natural language processing (NLP) for Tamil, digital library resource description, and applied machine learning.
