Enhancing Social Media Content Analysis with Advanced Topic Modeling Techniques: A Comparative Study

Amila Chethana Nanayakkara
Danuka Mahesh Thennakoon

Abstract

Topic modeling, an unsupervised machine learning approach, uncovers latent themes in large document collections. It helps organize, summarize, and interpret extensive textual data by revealing the themes that run through a corpus. Social media content is typically short, text-heavy, and unstructured, which poses methodological challenges for data collection and analysis. To bridge computer science and the empirical social sciences, this research assesses the effectiveness of three topic modeling methods: BERTopic, which builds on Bidirectional Encoder Representations from Transformers (BERT); Non-negative Matrix Factorization (NMF); and Latent Dirichlet Allocation (LDA). NMF relies on matrix factorization and LDA on a probabilistic framework, whereas BERT-based techniques, a more recent development, generate topics from sentence embeddings. In this study, BERTopic is evaluated with multiple pre-trained sentence embedding models, and its results are compared with those of LDA and NMF. Two coherence measures, C_V and U_MASS, are used to evaluate the topic models. The study analyzes the strengths and limitations of each algorithm in a social science context, using YouTube comments as the benchmark dataset. Notably, the investigation highlights the utility of BERTopic and NMF for evaluating YouTube video content disclosure based on specific attributes, thereby enhancing the analysis process and addressing performance concerns.
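
To make the coherence evaluation mentioned in the abstract concrete, the following is a minimal sketch of a U_MASS-style score for a single topic. It is not the authors' implementation: the function name `umass_coherence`, the add-one smoothing, the averaging over word pairs, and the toy `comments` corpus are all assumptions chosen for illustration, and library implementations may differ in smoothing and normalization.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average UMass-style coherence for one topic's ranked top words.

    Assumption: each pair contributes log((D(w_lo, w_hi) + 1) / D(w_hi)),
    where D counts documents containing the given word(s) and w_hi is the
    higher-ranked word of the pair.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    # combinations() yields index pairs (i, j) with i < j, so
    # top_words[i] is always the higher-ranked word of the pair.
    for i, j in combinations(range(len(top_words)), 2):
        w_hi, w_lo = top_words[i], top_words[j]
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # skip words absent from the corpus
        scores.append(math.log((doc_freq(w_lo, w_hi) + 1) / d_hi))
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical tokenized comments, invented for illustration only.
comments = [
    ["great", "video", "music"],
    ["love", "music", "video"],
    ["music", "video", "editing"],
    ["price", "shipping", "refund"],
]

coherent = umass_coherence(["music", "video"], comments)
mixed = umass_coherence(["music", "refund"], comments)
```

In this toy corpus, "music" and "video" always co-occur while "music" and "refund" never do, so `coherent` is higher than `mixed`. Scores computed this way are only comparable across models when the same corpus and the same number of top words per topic are used.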
