Exploring Recent NLP Advances for Tamil: Word Vectors and Hybrid Deep Learning Architectures

Archchitha Aravinthan
Charles Eugene

Abstract

Advances in deep learning, together with the availability of large corpora and data sets, have brought substantial gains in the performance of Natural Language Processing (NLP) methods, resulting in successful NLP applications for everyday tasks such as language translation, voice-to-text, grammar checking, and sentiment analysis. These advances have enabled well-resourced languages to adapt to the digital era, while the gap for low-resource languages has widened. This research explores the suitability of recent NLP advances for Tamil, a low-resource language spoken mainly in South India, Sri Lanka, and Malaysia. In particular, it analyses the applicability of deep learning approaches, namely word embeddings, Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), to Tamil text classification tasks. Pre-trained word vectors based on Word2Vec and FastText were built for Tamil and their effectiveness was evaluated. Four simple hybrid models combining a CNN with a bidirectional gated recurrent unit (Bi-GRU) were proposed for Tamil text classification and their performance was evaluated. The study found that the hybrid CNN and Bi-GRU models outperform classical machine learning models, individual CNN and RNN models, and the Multilingual BERT model. Moreover, the pre-trained 300-dimensional FastText word vectors performed better than the other pre-trained word vectors. These results confirm that jointly learned embeddings combined with different deep learning architectures such as CNN and RNN achieve remarkable results for Tamil text classification, indicating that deep learning approaches can be successful for NLP on Tamil text.
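
As a small illustration of the FastText word vectors mentioned in the abstract, the sketch below shows the character n-gram decomposition that FastText-style models use to build a word vector from subword units. It is a minimal sketch, not the authors' pipeline: the function names `char_ngrams` and `shared_ngrams` are hypothetical, the n-gram range 3 to 6 is the commonly used FastText default rather than a setting reported here, and the code splits on Unicode code points, which may not match Tamil grapheme clusters. The abstract does not state that subword information is the reason FastText performed best.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return FastText-style character n-grams for one word.

    The word is wrapped in '<' and '>' boundary markers, then every
    substring of length min_n..max_n is collected (by code point).
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_ngrams(a, b, min_n=3, max_n=6):
    """Return the set of n-grams that two words have in common."""
    return set(char_ngrams(a, min_n, max_n)) & set(char_ngrams(b, min_n, max_n))


if __name__ == "__main__":
    # Illustrative word pair (assumption): "house" and its plural "houses".
    singular, plural = "வீடு", "வீடுகள்"
    common = shared_ngrams(singular, plural)
    print(f"{len(common)} shared n-grams, e.g. {sorted(common)[:3]}")
```

Because an inflected form shares many n-grams with its stem, a subword model can assign related vectors to both forms even when one of them is rare in the training corpus.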
