Development of clustering models for extended opinion holders based on aggregated stylometric and sentiment features of chat messages

Authors

Heorhii Chyzhmak, Valeriy Sydorenko

DOI:

https://doi.org/10.15587/2706-5448.2025.344630

Keywords:

clustering models, natural language processing, semantic and sentiment analysis, explainable artificial intelligence

Abstract

The subject of research is the methods and technologies for monitoring groups of opinion holders in social media based on stylometric and sentiment features. A key problem is the growing complexity of text content: anonymity, informal language, slang, emojis, and non-standard writing styles make user behavior analysis more difficult. Methods based on single-message evaluation do not capture stable, long-term behavioral patterns.

This research proposes a holder-level clustering method based on aggregated stylometric and sentiment features computed over several messages per user. The methodology comprises data preprocessing (VarianceThreshold, removal of highly correlated features), quantile normalization, dimensionality reduction via PCA (six components explaining 81.7% of the variance for LiveJournal and four components explaining 83.5% for Instagram), and agglomerative hierarchical clustering, supplemented by decision tree analysis for feature selection and cluster interpretability. In the final models, two clusters for LiveJournal and three clusters for Instagram cover the majority of users. The result is a set of clustering models that group holders into coherent, interpretable clusters based on their overall communication style and emotional expression. The main advantages of the proposed approach are: holder-level aggregation ensures stability and consistency in profiling; two-stage clustering with intermediate feature selection enhances explainability; and the method demonstrates cross-platform applicability, validated on both LiveJournal and Instagram. Compared with conventional single-message analysis, the approach offers greater transparency of results, deeper behavioral insight, and more stable profiles. As a result, more accurate and dynamic user profiles can be developed over time, supporting social sentiment analysis, automated moderation, and customized social media recommendations.
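
The holder-level aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the four per-message features, the toy sentiment lexicon, the `min_messages` cut-off, and the function names `message_features` and `aggregate_by_holder` are all assumptions made for illustration. The sketch only shows the general idea of replacing single-message feature vectors with per-holder means and standard deviations, which could then be passed on to preprocessing and clustering.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy lexicon; a real system would use a proper sentiment model.
POSITIVE = {"good", "great", "love", "nice"}
NEGATIVE = {"bad", "awful", "hate", "boring"}


def message_features(text):
    """Compute a few illustrative per-message features (assumed feature set)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    n = len(words) or 1  # avoid division by zero for empty messages
    return {
        "word_count": float(len(words)),
        "mean_word_len": sum(len(w) for w in words) / n,
        "exclamations": float(text.count("!")),
        "sentiment": (sum(w in POSITIVE for w in words)
                      - sum(w in NEGATIVE for w in words)) / n,
    }


def aggregate_by_holder(messages, min_messages=2):
    """Aggregate per-message features into one profile per holder.

    `messages` is an iterable of (holder_id, text) pairs. Holders with fewer
    than `min_messages` messages are skipped, because a single message does
    not give a stable profile.
    """
    per_holder = defaultdict(list)
    for holder_id, text in messages:
        per_holder[holder_id].append(message_features(text))

    profiles = {}
    for holder_id, feats in per_holder.items():
        if len(feats) < min_messages:
            continue
        profile = {}
        for name in feats[0]:
            values = [f[name] for f in feats]
            profile[name + "_mean"] = mean(values)    # typical behavior
            profile[name + "_std"] = pstdev(values)   # variability across messages
        profiles[holder_id] = profile
    return profiles


messages = [
    ("alice", "Great photo, love it!"),
    ("alice", "Nice colors!!"),
    ("bob", "Boring."),
    ("bob", "This is bad, awful framing."),
    ("carol", "ok"),
]
profiles = aggregate_by_holder(messages)
```

Here `profiles` holds one aggregated feature vector each for `alice` and `bob`, while `carol` is dropped for having only one message. Keeping both the mean and the standard deviation is a design choice of this sketch: the mean captures a holder's typical style, and the spread indicates how consistent that style is across messages.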

Author Biographies

Heorhii Chyzhmak, Kremenchuk Mykhailo Ostrohradskyi National University

PhD Student, Assistant

Department of Computer Engineering and Electronics

Valeriy Sydorenko, Kremenchuk Mykhailo Ostrohradskyi National University

PhD, Associate Professor

Department of Computer Engineering and Electronics

Published

2025-12-29

How to Cite

Chyzhmak, H., & Sydorenko, V. (2025). Development of clustering models for extended opinion holders based on aggregated stylometric and sentiment features of chat messages. Technology Audit and Production Reserves, 6(2(86)), 23–30. https://doi.org/10.15587/2706-5448.2025.344630

Section

Information Technologies