Multi-modal Deep Learning Approach to Improve Sentence-Level Sinhala Sign Language Recognition


H.H.S.N. Haputhanthri
H.M.N. Tennakoon
M.A.S.M. Wijesekara
B.H.R. Pushpananda
H.N.D. Thilini

Abstract

Sign language is used across the world for communication within hearing-impaired communities. Hearing people are generally not well versed in sign language, and many hearing-impaired people are not proficient in written text, which creates a communication barrier. Research on Sign Language Recognition (SLR) systems has shown promising solutions to this issue. In Sri Lanka, machine learning combined with neural networks has been the prominent research direction in Sinhala SLR. Previous research has focused mainly on word-level SLR that translates hand gestures alone. While this works for a limited vocabulary, many signs are interpreted through other spatial cues such as lip movements and facial expressions, so translation is restricted and the interpretations can sometimes be misleading. In this research, we propose a multi-modal Deep Learning approach that can effectively recognize sentence-level sign gestures using hand and lip movements and translate them to Sinhala text. The model consists of modules for visual feature extraction (ResNet), contextual relationship modeling (a transformer encoder with multi-head attention), alignment (CTC), and decoding (prefix beam search). A dataset consisting of 22 sentences, used for evaluation, was collected under controlled conditions for a specific day-to-day scenario (a conversation between a vendor and a customer in a shop). The proposed model achieves a best Word Error Rate (WER) of 12.70 on the testing split, improving over the single-stream model's best WER of 17.41 and suggesting that a multi-modal approach improves overall SLR.
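To make the described pipeline concrete, the sketch below shows one way such an architecture could be assembled in PyTorch: a ResNet-18 backbone per modality (hand and lip crops), feature-level fusion, a Transformer encoder with multi-head attention for contextual modeling, and a CTC head over the sentence vocabulary. The class name MultiStreamSLR, all layer sizes, and the dummy input shapes are illustrative assumptions rather than the authors' exact configuration, and prefix beam search decoding is omitted.

```python
# Minimal sketch (assumed PyTorch configuration) of a two-stream SLR model:
# per-modality ResNet features -> fusion -> Transformer encoder -> CTC head.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class MultiStreamSLR(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        # One ResNet-18 backbone per modality; the final fc layer is removed
        # so each frame maps to a 512-dim visual feature.
        self.hand_cnn = resnet18(weights=None)
        self.hand_cnn.fc = nn.Identity()
        self.lip_cnn = resnet18(weights=None)
        self.lip_cnn.fc = nn.Identity()
        # Fuse the two 512-dim streams by concatenation, then project to d_model.
        self.fuse = nn.Linear(512 * 2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # One extra output class for the CTC blank token.
        self.classifier = nn.Linear(d_model, vocab_size + 1)

    def _frame_features(self, cnn: nn.Module, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width) -> (batch, time, 512)
        b, t, c, h, w = frames.shape
        feats = cnn(frames.reshape(b * t, c, h, w))
        return feats.reshape(b, t, -1)

    def forward(self, hand_frames: torch.Tensor, lip_frames: torch.Tensor) -> torch.Tensor:
        hand = self._frame_features(self.hand_cnn, hand_frames)
        lip = self._frame_features(self.lip_cnn, lip_frames)
        x = self.fuse(torch.cat([hand, lip], dim=-1))  # (batch, time, d_model)
        x = self.encoder(x)                            # contextual relationship modeling
        # CTC expects log-probabilities shaped (time, batch, classes).
        return self.classifier(x).log_softmax(-1).transpose(0, 1)


if __name__ == "__main__":
    model = MultiStreamSLR(vocab_size=50)
    hand = torch.randn(2, 16, 3, 112, 112)   # 2 clips, 16 frames per clip (dummy data)
    lip = torch.randn(2, 16, 3, 112, 112)
    log_probs = model(hand, lip)             # (16, 2, 51)
    targets = torch.randint(1, 51, (2, 5))   # dummy sentence labels, blank index 0 excluded
    loss = nn.CTCLoss(blank=0)(log_probs, targets,
                               input_lengths=torch.full((2,), 16),
                               target_lengths=torch.full((2,), 5))
    print(log_probs.shape, loss.item())
```

In such a setup, decoding at inference time would replace the CTC loss with a prefix beam search over the per-frame log-probabilities to produce the Sinhala sentence.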

Article Details

Section: Articles