Neural Machine Translation for Sinhala-English Code-Mixed Text


Archchana Kugathasan
Sagara Sumathipala

Abstract

Multilingual societies often mix two or more languages when communicating, and code-mixing has become a common mode of communication on social media in South Asian communities. Sinhala-English Code-Mixed Text (SCMT) is the most popular informal text representation used in Sri Lanka, appearing in social media chats, comments, casual conversation and so on. This paper addresses the challenges in utilizing SCMT sentences. The main focus of this study is translating code-mixed Sinhala-English sentences into standard Sinhala. Since Sinhala is a low-resource language, only a limited number of SCMT-Sinhala parallel sentences could be collected, and creating the SCMT-Sinhala parallel corpus was a time-consuming and costly task. The proposed Neural Machine Translation (NMT) architecture for translating SCMT text to Sinhala combines a normalization pipeline, Long Short-Term Memory (LSTM) units, a Sequence-to-Sequence (Seq2Seq) model and the Teacher Forcing mechanism. The proposed model is evaluated against current state-of-the-art models under the same experimental setup, showing that Teacher Forcing combined with Seq2Seq and normalization improves translation quality. The predicted outputs are compared using the BLEU (Bilingual Evaluation Understudy) metric, and the proposed model achieved a better BLEU score of 33.89 in the evaluation.
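To make the described architecture concrete, the following is a minimal sketch of a Seq2Seq LSTM trained with teacher forcing, in the spirit of the abstract; it is not the authors' implementation. It assumes Keras and already-tokenized, normalized inputs, and the vocabulary sizes, dimensions and input names (e.g. `scmt_tokens`, `sinhala_tokens_shifted`) are placeholders.

```python
# A hypothetical sketch: encoder-decoder LSTM with teacher forcing for
# SCMT -> Sinhala translation. All sizes and names are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

src_vocab, tgt_vocab = 8000, 8000   # assumed vocabulary sizes
embed_dim, latent_dim = 256, 512    # assumed model dimensions

# Encoder: embeds the normalized SCMT token sequence and keeps only the
# final LSTM states as the sentence representation.
enc_inputs = keras.Input(shape=(None,), name="scmt_tokens")
enc_emb = layers.Embedding(src_vocab, embed_dim, mask_zero=True)(enc_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: under teacher forcing it is fed the reference Sinhala sentence
# shifted right by one token, rather than its own previous predictions.
dec_inputs = keras.Input(shape=(None,), name="sinhala_tokens_shifted")
dec_emb = layers.Embedding(tgt_vocab, embed_dim, mask_zero=True)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(
    latent_dim, return_sequences=True, return_state=True
)(dec_emb, initial_state=[state_h, state_c])
logits = layers.Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = keras.Model([enc_inputs, dec_inputs], logits)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Training pairs the shifted target as decoder input with the next-token
# target, e.g.:
# model.fit([scmt_ids, sinhala_ids[:, :-1]], sinhala_ids[:, 1:], ...)
```

At inference the decoder would be run autoregressively, and the generated Sinhala sentences could then be scored against references with a standard BLEU implementation such as `nltk.translate.bleu_score.corpus_bleu`.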
