Automatic Syllable Segmentation of Myanmar Texts using Finite State Transducer
Abstract: Automatic syllabification lies at the heart of script processing, especially for South East Asian scripts such as Myanmar. The Myanmar syllabification algorithms implemented so far are either rule-based or data-driven. This paper proposes a new method for Myanmar syllabification that uses a formal grammar and unweighted finite state transducers (FSTs), since Myanmar syllabification relies heavily on a formal model of syllable structure. The proposed method performs orthographic syllabification of input texts encoded in Unicode. We handle Myanmar words with standard syllable structure as well as words with irregular structures, such as kinzi and consonant stacking, which previous methods have not resolved. Our FST-based syllabifier was tested on 11,732 distinct words extracted from the Myanmar Orthography Corpus. These words yielded 32,238 syllables, which were compared against correct hand-syllabified forms. The FST-based method achieves 99.93% accuracy. We used the Stuttgart Finite State Transducer Tools (SFST) for our experiments.
Myanmar Syllabification, Finite State Syllabification
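As a rough illustration of what orthographic syllabification of Unicode Myanmar text involves, the sketch below inserts a boundary before each consonant that opens a new syllable. It is not the grammar or the FST described in the abstract. It is a simplified lookahead rule that assumes a consonant (U+1000 to U+1021) starts a syllable unless it is followed directly by asat (U+103A) or virama (U+1039), or preceded by virama. The names `syllabify`, `CONSONANTS`, and the `|` boundary symbol are hypothetical choices made for this example.

```python
# Simplified orthographic syllable segmentation for Unicode Myanmar text.
# Assumption: this lookahead rule only approximates the behaviour of a
# syllabification transducer; it is not the paper's formal grammar.

CONSONANTS = {chr(c) for c in range(0x1000, 0x1022)}  # U+1000 .. U+1021
ASAT = "\u103A"    # vowel killer; marks a syllable-final consonant
VIRAMA = "\u1039"  # stacking sign; used in kinzi and consonant stacks


def syllabify(text, boundary="|"):
    """Return text with a boundary inserted before each syllable-initial consonant."""
    out = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        starts_syllable = (
            ch in CONSONANTS
            and prev != VIRAMA                 # stacked consonant stays attached
            and nxt not in (ASAT, VIRAMA)      # final or kinzi/stack-initial consonant
        )
        if starts_syllable and out:            # no boundary before the first character
            out.append(boundary)
        out.append(ch)
    return "".join(out)


# The word "Myanmar" splits into two syllables.
print(syllabify("\u1019\u103C\u1014\u103A\u1019\u102C"))
# A word with kinzi: the kinzi sequence stays inside the first segment.
print(syllabify("\u1021\u1004\u103A\u1039\u1002\u101C\u102D\u1015\u103A"))
```

Under this rule a kinzi sequence or a stacked consonant is never separated from the preceding syllable, which is one possible orthographic convention. The sketch also assumes that asat immediately follows the final consonant in the encoded text.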
Managed & Published by the University of Colombo School of Computing
This journal is published under a Creative Commons Attribution 4.0 International License.