Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study
Abstract
This research investigates Music Information Retrieval (MIR) and Music Emotion Recognition (MER) in relation to Sinhala songs, an underexplored field in music studies. The purpose of this study is to analyze the behavior of Sinhala comments on YouTube videos of Sinhala songs, using social media comments as the primary data source. Comments were collected from 27 YouTube videos covering 20 different Sinhala songs, which were carefully selected to maintain strict linguistic reliability and ensure relevance. This process yielded a total of 93,116 comments, which were then refined using advanced filtering methods and transliteration mechanisms, resulting in 63,471 Sinhala comments. Additionally, 964 stop-words specific to the Sinhala language were algorithmically derived, of which 182 matched exactly with English stop-words from the NLTK corpus once translated. Comparisons were also made between general-domain Sinhala corpora and the Sinhala YouTube Comment Corpus, confirming that the latter is a good representation of the general domain. The meticulously curated dataset and the derived stop-words form important resources for future research in MIR and MER, since they demonstrate the potential of computational techniques to address complex musical experiences across varied cultural traditions.
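The abstract does not specify how the filtering or the stop-word derivation was implemented. As a rough illustration only, the following Python sketch shows one plausible approach: keep comments whose characters fall mostly within the Unicode Sinhala block (U+0D80 to U+0DFF), then rank tokens by frequency as stop-word candidates. The function names, the 0.5 threshold, and the frequency-based ranking are all assumptions made for this sketch, not the study's method, and the transliteration step is not covered.

```python
from collections import Counter

# Unicode Sinhala block. Joiners such as U+200D fall outside this range,
# so they count as non-Sinhala characters in this simplified sketch.
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF


def sinhala_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Sinhala block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(SINHALA_START <= ord(c) <= SINHALA_END for c in chars)
    return hits / len(chars)


def filter_sinhala(comments, threshold=0.5):
    """Keep comments that are at least `threshold` Sinhala (hypothetical cutoff)."""
    return [c for c in comments if sinhala_ratio(c) >= threshold]


def candidate_stop_words(comments, top_k=10):
    """Return the `top_k` most frequent whitespace-separated tokens.

    Raw frequency is only one possible criterion for stop-word candidates;
    it is used here as a stand-in for the unspecified derivation algorithm.
    """
    counts = Counter(word for c in comments for word in c.split())
    return [word for word, _ in counts.most_common(top_k)]
```

In practice, the filtered list from `filter_sinhala` would be passed to `candidate_stop_words`, and the resulting candidates would still need manual or statistical vetting before being treated as stop-words.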