Development of clustering models for extended opinion holders based on aggregated stylometric and sentiment features of chat messages

Heorhii Chyzhmak; Valeriy Sydorenko

doi:10.15587/2706-5448.2025.344630

Authors

Heorhii Chyzhmak Kremenchuk Mykhailo Ostrohradskyi National University, Ukraine https://orcid.org/0000-0001-9284-4195
Valeriy Sydorenko Kremenchuk Mykhailo Ostrohradskyi National University, Ukraine https://orcid.org/0000-0002-4449-073X

Keywords:

clustering models, natural language processing, semantic and sentiment analysis, explainable artificial intelligence

Abstract

The subject of research is the methods and technologies for monitoring groups of opinion holders in social media based on stylometric and sentiment features. A central problem is the increasing complexity of text content: anonymity, informal language, slang, emojis, and non-standard writing styles make user behavior analysis more difficult. Methods based on single-message evaluation fail to capture stable, long-term behavioral patterns.

This research proposes a holder-level clustering method based on aggregated stylometric and sentiment features drawn from several messages per user. The methodology comprises data preprocessing (VarianceThreshold filtering and removal of highly correlated features), quantile normalization, dimensionality reduction via PCA (six components explaining 81.7% of the variance for LiveJournal and four components explaining 83.5% for Instagram), and agglomerative hierarchical clustering, supplemented by decision tree analysis for feature selection and cluster interpretability. Ultimately, two clusters covered the majority of users for LiveJournal and three clusters for Instagram. The result is a set of clustering models that group holders into coherent, interpretable clusters based on their overall communication style and emotional expression. The proposed approach has three primary advantages: holder-level aggregation ensures stability and consistency in profiling; two-stage clustering with intermediate feature selection enhances explainability; and the method demonstrates cross-platform applicability, validated on both LiveJournal and Instagram. Compared with conventional single-message analysis, the approach offers greater transparency of results, deeper behavioral insight, and more stable profiles. As a result, more accurate and dynamic user profiles can be developed over time, supporting improved social sentiment analysis, automated moderation, and customized social media recommendations and user interaction.
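
The two central ideas of the abstract, holder-level aggregation followed by agglomerative clustering, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The feature names, the sample values, the holder identifiers, and the choice of average linkage with Euclidean distance are all assumptions made for illustration; the preprocessing, quantile normalization, PCA, and decision tree stages are omitted.

```python
from collections import defaultdict
from itertools import combinations
from math import dist
from statistics import fmean

# Hypothetical per-message features (not from the paper):
# (holder_id, [mean_word_length, emoji_rate, sentiment_score])
MESSAGES = [
    ("anna", [4.0, 0.0, 0.6]),
    ("anna", [4.4, 0.1, 0.8]),
    ("boris", [4.1, 0.1, 0.5]),
    ("boris", [4.3, 0.0, 0.7]),
    ("cleo", [2.0, 0.9, -0.7]),
    ("cleo", [2.4, 0.7, -0.5]),
]


def aggregate_by_holder(messages):
    """Average each feature over all messages written by the same holder."""
    grouped = defaultdict(list)
    for holder, features in messages:
        grouped[holder].append(features)
    # zip(*rows) transposes the message rows into per-feature columns.
    return {h: [fmean(col) for col in zip(*rows)] for h, rows in grouped.items()}


def agglomerative(profiles, n_clusters):
    """Merge the two closest clusters until only n_clusters remain.

    Assumed linkage: average pairwise Euclidean distance between members.
    """
    clusters = [[holder] for holder in profiles]

    def linkage(a, b):
        return fmean(dist(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return [sorted(c) for c in clusters]


profiles = aggregate_by_holder(MESSAGES)
clusters = agglomerative(profiles, n_clusters=2)
print(profiles)
print(clusters)
```

Averaging over several messages before clustering means that a single atypical message shifts a holder's profile only slightly, which is the stability argument made for holder-level aggregation. In practice a library implementation of hierarchical clustering would likely replace the quadratic-time loop shown here.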

Author Biographies

Heorhii Chyzhmak, Kremenchuk Mykhailo Ostrohradskyi National University

PhD Student, Assistant

Department of Computer Engineering and Electronics

Valeriy Sydorenko, Kremenchuk Mykhailo Ostrohradskyi National University

PhD, Associate Professor

Department of Computer Engineering and Electronics

References

Sydorenko, V., Kravchenko, S., Rychok, Y., Zeman, K. (2020). Method of Classification of Tonal Estimations Time Series in Problems of Intellectual Analysis of Text Content. Transportation Research Procedia, 44, 102–109. https://doi.org/10.1016/j.trpro.2020.02.015
Sydorenko, V., Rychok, Y., Oladko, M. (2022). Method for Evaluation the Pattern of Internet Service Customers Based on Stylometric Analysis of their Text Content. 2022 IEEE 4th International Conference on Modern Electrical and Energy System (MEES), 1–6. https://doi.org/10.1109/mees58014.2022.10005654
Mosteller, F., Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley Series in Behavioral Science: Quantitative Methods. Reading, Mass.: Addison-Wesley, XV, 287 p. (1965). Recherches Économiques de Louvain, 31 (8), 721. https://doi.org/10.1017/s0770451800020777
Stamatatos, E. (2008). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60 (3), 538–556. https://doi.org/10.1002/asi.21001
Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G. (2013). Overview of the Author Profiling Task at PAN 2013. Working Notes of CLEF 2013 Conference. Valencia: CEUR, 1179. https://ceur-ws.org/Vol-1179/CLEF2013wn-PAN-RangelEt2013.pdf
Giorgi, S., Preoţiuc-Pietro, D., Buffone, A., Rieman, D., Ungar, L., Schwartz, H. A. (2018). The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels: Association for Computational Linguistics, 1167–1172. https://doi.org/10.18653/v1/d18-1148
Chyzhmak, H., Sydorenko, V. (2023). Classification models of direct opinion holders in the space of stylometric and sentiment features of chat messages. 2023 IEEE 5th International Conference on Modern Electrical and Energy System (MEES), 1–6. https://doi.org/10.1109/mees61502.2023.10402395
Rychok, Yu. S., Sydorenko, V. M. (2021). Model otsinky sentyment-komponent u zadachakh sentyment-analizu skladnoho tekstovoho kontenta. Fizychni protsesy ta polia tekhnichnykh i biolohichnykh obiektiv. Kremenchuk, 83–86.
LiveJournal. Available at: https://www.livejournal.com/
Instagram. Available at: https://www.instagram.com
Ali, S., Abuhmed, T., El-Sappagh, S., Muhammad, K., Alonso-Moral, J. M., Confalonieri, R. et al. (2023). Explainable Artificial Intelligence (XAI): What we know and what is left to attain Trustworthy Artificial Intelligence. Information Fusion, 99, 101805. https://doi.org/10.1016/j.inffus.2023.101805
GitHub – agentcooper/node-livejournal: LiveJournal API. Available at: https://github.com/agentcooper/node-livejournal
7 000 000 Russian comments from Instagram (2025). Available at: https://t.me/danokhlopkov/395
VarianceThreshold. Scikit-learn. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
Kuhn, M., Johnson, K. (2013). Applied Predictive Modeling. New York: Springer. https://doi.org/10.1007/978-1-4614-6849-3
Amaratunga, D., Cabrera, J. (2001). Analysis of Data From Viral DNA Microchips. Journal of the American Statistical Association, 96 (456), 1161–1170. https://doi.org/10.1198/016214501753381814
Aitchison, J., Brown, J. A. C. (1958). The Lognormal Distribution. The Incorporated Statistician, 8 (3), 145. https://doi.org/10.2307/2986416
Box, G. E. P., Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26 (2), 211–252. Available at: http://www.econ.illinois.edu/~econ508/Papers/boxcox64.pdf
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2 (11), 559–572. https://doi.org/10.1080/14786440109462720
Nielsen, F. (2016). Hierarchical Clustering. Introduction to HPC with MPI for Data Science. Cham: Springer, 195–211. https://doi.org/10.1007/978-3-319-21903-5_8
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1 (5), 206–215. https://doi.org/10.1038/s42256-019-0048-x
Phillips, P. J., Hahn, C. A., Fontana, P. C., Yates, A. N., Greene, K., Broniatowski, D. A., Przybocki, M. A. (2021). Four principles of explainable artificial intelligence. National Institute of Standards and Technology. https://doi.org/10.6028/nist.ir.8312
