Automatic Syllable Segmentation of Myanmar Texts using Finite State Transducer

Tin Htay Hlaing
Yoshiki MIKAMI

Abstract

Automatic syllabification lies at the heart of script processing, especially for Southeast Asian scripts such as Myanmar. The Myanmar syllabification algorithms implemented so far are either rule-based or data-driven. This paper proposes a new method for Myanmar syllabification that uses a formal grammar and unweighted finite state transducers (FSTs), since Myanmar syllabification relies heavily on a formal model of syllable structure. The proposed method performs orthographic syllabification of input text encoded in Unicode. We handle Myanmar words with standard syllable structure as well as words with irregular structures, such as kinzi and consonant stacking, which previous methods have not resolved. Our FST-based syllabifier was tested on 11,732 distinct words extracted from the Myanmar Orthography Corpus. These words yielded 32,238 syllables, which were compared against correct hand-syllabified words. The FST-based method achieves 99.93% accuracy. We used the Stuttgart FST tools for our experiments.
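To make the idea of orthographic syllabification concrete, the following is a minimal Python sketch of a single left-to-right pass with one character of context on each side. It is not the authors' grammar and does not use the Stuttgart FST tools; the function name `segment_syllables` and the simplified boundary rule (break before a consonant unless it follows a virama or is followed by asat or virama) are assumptions for illustration only. The rule ignores independent vowels, digits, and punctuation.

```python
# Illustrative sketch only: a simplified boundary rule, not the paper's FST grammar.
ASAT = "\u103a"     # MYANMAR SIGN ASAT (marks a syllable-final consonant)
VIRAMA = "\u1039"   # MYANMAR SIGN VIRAMA (stacks the following consonant)


def is_consonant(ch: str) -> bool:
    """True for the Myanmar consonant letters U+1000 through U+1021."""
    return "\u1000" <= ch <= "\u1021"


def segment_syllables(text: str) -> list[str]:
    """Split Unicode Myanmar text into orthographic syllables.

    A boundary is placed before a consonant unless:
      * the previous character is a virama (the consonant is stacked), or
      * the next character is asat or virama (the consonant closes the
        current syllable, as in kinzi or a stacked cluster).
    """
    syllables = []
    start = 0
    for i in range(1, len(text)):
        ch = text[i]
        if not is_consonant(ch):
            continue
        prev_ch = text[i - 1]
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if prev_ch == VIRAMA or next_ch in (ASAT, VIRAMA):
            continue
        syllables.append(text[start:i])
        start = i
    if text:
        syllables.append(text[start:])
    return syllables


if __name__ == "__main__":
    # Regular word, kinzi word, and stacked-consonant word, in that order.
    for word in ("\u1019\u103c\u1014\u103a\u1019\u102c",
                 "\u1021\u1004\u103a\u1039\u1002\u101c\u102d\u1015\u103a",
                 "\u1000\u1019\u1039\u1018\u102c"):
        print(" | ".join(segment_syllables(word)))
```

Under this simplified rule a kinzi sequence or a stacked cluster stays attached to the neighbouring consonants as one orthographic unit; a full formal grammar of syllable structure, as the abstract describes, would cover many more cases.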
