Identification of Abusive Sinhala Comments in Social Media using Text Mining and Machine Learning Techniques

Supun Tharaka Sandaruwan; Susil Aruna Shantha Lorensuhewa; Kalyani Munasinghe

Published Apr 9, 2020

Supun Tharaka Sandaruwan

University of Ruhuna

Susil Aruna Shantha Lorensuhewa

University of Ruhuna

Kalyani Munasinghe

University of Ruhuna

Abstract

With the technology revolution, most natural languages used around the world have entered the digital world, and people now use modern technologies such as social media and the Internet in their own languages. As a result, people who take excessive pride in their tradition, race, caste, religion, or other social factors tend to abuse others who do not belong to the same social group. Because social media platforms lack centralized control, they have become a convenient stage for spreading such regressive ideas without oversight. The Sinhala language is now supported on most popular social media platforms. Although Sinhala has a history of more than 2,500 years, it has few natural language processing resources, which makes it very difficult to automatically detect abusive Sinhala comments published and shared on social media. Here, we used a corpus of 2,000 comments, evenly distributed between the offensive and neutral classes, to train three models: Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), and Random Forest Decision Tree (RFDT). Features were extracted using the bag-of-words, word n-gram, character n-gram, and word skip-gram models. After training, each model was tested on an evenly distributed corpus of 200 comments; MNB achieved the highest accuracy, 96.5%, with an average recall of 96%, for both the character tri-gram and character four-gram models. In addition, two lexicon-based approaches, a cross-lingual lexicon approach and a corpus-based lexicon approach, were considered for detecting abusive Sinhala comments. Of the two, the corpus-based lexicon approach gave the higher accuracy, 90.5%, with an average recall of 90.5%.
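The best-performing configuration named in the abstract, character tri-gram features fed to Multinomial Naïve Bayes, can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the names `char_ngrams` and `CharNgramNB`, the add-one (Laplace) smoothing, and the short English toy comments are all assumptions made for illustration, and the abstract does not specify the preprocessing or smoothing actually used.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-gram counts.

    Hypothetical helper class; alpha=1.0 (add-one smoothing) is an assumption.
    """

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha

    def fit(self, texts, labels):
        # Count documents per class and n-gram occurrences per class.
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text, self.n))
        # The vocabulary is every n-gram seen in any training comment.
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, evenly distributed training data (illustrative English placeholders,
# not the Sinhala corpus described in the abstract).
train_texts = [
    "you are a fool",
    "what a stupid fool",
    "have a nice day",
    "thank you very much",
]
train_labels = ["offensive", "offensive", "neutral", "neutral"]

model = CharNgramNB(n=3).fit(train_texts, train_labels)
print(model.predict("stupid fool"))  # offensive
print(model.predict("nice day"))     # neutral
```

Character n-grams operate on raw code points, so the same sketch would accept Sinhala text unchanged; whether that matches the tokenization used in the study is not stated in the abstract.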

Issue

Vol 13 No 1 (2020): 2020 Special Issue

Articles

Author Biographies

Susil Aruna Shantha Lorensuhewa, University of Ruhuna

Senior Lecturer, Department of Computer Science

Kalyani Munasinghe, University of Ruhuna

Lecturer, Department of Computer Science
