Optimizing n-gram order of an N-gram based language identification algorithm for 63 written languages

Yew Choong Chew
Yoshiki Mikami
Chandrajith Ashuboda Marasinghe
S. T. Nandasara

Abstract

Language identification (LI) technology is widely used in machine learning and text mining. Many past researchers have achieved strong results on a small set of popular languages. However, the majority of less widely used world languages remain untested. This research presents an extensive empirical evaluation of our N-gram based LI algorithm against 63 languages, including many from Africa and Asia. The algorithm is designed to automatically detect the language, script and encoding system (LSE) of a text. In addition, we measure how factors such as n-gram order (bigram vs. trigram, etc.) and mixed n-gram order affect the performance and accuracy of identification. The experimental results show that our algorithm achieves very high accuracy in identifying different languages on our test corpus. The results also provide useful information for selecting the best n-gram order for the tested languages.
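
To illustrate what "n-gram order" means in this setting, the following is a minimal Python sketch of character n-gram matching. It is not the authors' algorithm: the overlap score, the function names (`ngram_profile`, `overlap_score`, `identify`) and the two-sentence toy training texts are all assumptions made for illustration, and the sketch works on decoded text, so it does not attempt script or encoding detection.

```python
from collections import Counter


def ngram_profile(text, n):
    """Count overlapping character n-grams of order n (2 = bigram, 3 = trigram)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def overlap_score(sample, reference):
    """Fraction of the sample's n-gram occurrences that also occur in the reference.

    This scoring rule is an assumption for illustration, not the paper's measure.
    """
    total = sum(sample.values())
    if total == 0:
        return 0.0
    matched = sum(count for gram, count in sample.items() if gram in reference)
    return matched / total


def identify(text, training, n):
    """Return the label whose training profile best matches the text at order n."""
    sample = ngram_profile(text, n)
    scores = {
        label: overlap_score(sample, ngram_profile(train_text, n))
        for label, train_text in training.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data; a real system would use a large corpus per language.
TOY_TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y luego el perro duerme",
}

if __name__ == "__main__":
    # Compare the decision at two n-gram orders for the same input.
    for order in (2, 3):
        print(order, identify("the dog jumps", TOY_TRAINING, order))
```

Changing the `n` argument is the only thing needed to compare orders in this sketch; a mixed-order variant could combine the scores obtained at several values of `n`, although how the paper combines them is not stated in the abstract.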
