Semantic clustering method using integration of advanced LDA algorithm and BERT algorithm

Authors

DOI:

https://doi.org/10.30837/ITSSI.2024.27.140

Keywords:

semantic analysis; natural language; LDA algorithm; BERT algorithm; interactive art; emotional response.

Abstract

The subject of the study is in-depth semantic data analysis based on a modification of the Latent Dirichlet Allocation (LDA) methodology and its integration with Bidirectional Encoder Representations from Transformers (BERT). Relevance. LDA is a fundamental topic modeling technique that is widely used in a variety of text analysis applications. Although its usefulness is widely recognized, traditional LDA models often face limitations, such as a rigid distribution of topics and an inadequate representation of the semantic nuances inherent in natural language. The purpose and main idea of the study is to improve the adequacy and accuracy of semantic analysis by modifying the basic LDA mechanism so that it integrates adaptive Dirichlet priors and exploits the deep semantic capabilities of BERT embeddings. Research methods: 1) selection of textual datasets; 2) data preprocessing; 3) improvement of the LDA algorithm; 4) integration with BERT embeddings; 5) comparative analysis. Research objectives: 1) theoretical substantiation of the LDA modification; 2) implementation of the integration with BERT; 3) evaluation of the method's efficiency; 4) comparative analysis; 5) development of an architectural solution. Results. The theoretical foundations of both the standard and the modified LDA models are outlined, and the extended formulation of the modified model is presented in detail. A series of experiments on text datasets characterized by different emotional states highlights the key advantages of the proposed approach. A comparative analysis of indicators such as intra- and inter-cluster distances and the silhouette coefficient shows the increased coherence, interpretability, and adaptability of the modified LDA model. An architectural solution for implementing the method is proposed. Conclusions. The empirical results indicate a significant improvement in the detection of subtle complexities and thematic structures in textual data, which is a step in the evolutionary development of topic modeling methodologies. In addition, the results of the research not only open up the possibility of applying LDA to more complex linguistic scenarios, but also outline ways to further improve it for unsupervised text analysis.
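The cluster-quality indicators named in the abstract (intra- and inter-cluster distances and the silhouette coefficient) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, Euclidean distance is an assumption (the abstract does not state the metric), and the 2-D points are toy stand-ins for document vectors that would in practice be derived from BERT embeddings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distance(points, labels, same_cluster):
    """Mean distance over point pairs that share (or do not share) a label."""
    dists = [
        euclidean(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if (labels[i] == labels[j]) == same_cluster
    ]
    return sum(dists) / len(dists) if dists else 0.0


def silhouette_coefficient(points, labels):
    """Mean silhouette score s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated groups of hypothetical document vectors.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]

intra = mean_pairwise_distance(toy_points, toy_labels, same_cluster=True)
inter = mean_pairwise_distance(toy_points, toy_labels, same_cluster=False)
sil = silhouette_coefficient(toy_points, toy_labels)
print(f"intra={intra:.3f} inter={inter:.3f} silhouette={sil:.3f}")
```

A small intra-cluster distance together with a large inter-cluster distance pushes the silhouette coefficient towards 1, which is the direction a comparison between a baseline and a modified topic model would look for.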

Author Biographies

Volodymyr Narozhnyi, National Aerospace University "Kharkiv Aviation Institute"

Postgraduate Student at the Department of Computer Systems, Networks and Cybersecurity

Vyacheslav Kharchenko, National Aerospace University "Kharkiv Aviation Institute"

Doctor of Sciences (Engineering), Professor, Head at the Department of Computer Systems, Networks and Cybersecurity

References

Guan, R., Zhang, H., Liang, Y., Giunchiglia, F., Huang, L., Feng, X. (2022), "Deep Feature-Based Text Clustering and its Explanation", IEEE Transactions on Knowledge and Data Engineering, Vol. 34, No. 8, P. 3669–3680. DOI: https://doi.org/10.1109/tkde.2020.3028943

Narozhnyi, V. V., Kharchenko, V. S. (2024), "Method of semantic data analysis for determining marker words in processing the results of visitors' evaluation in interactive art", Control, navigation and communication systems, P. 141–145. DOI: https://doi.org/10.32620/aktt.2023.6.10

Bouabdallaoui, I., Guerouate, F., Sbihi, M. (2023), "Assessing Topic Modeling in Online Forums: A Comparative Study of Hierarchical and Centroid-Based Clustering Algorithms", Proceedings of the 2023 10th International Conference on Wireless Networks and Mobile Communications (WINCOM), Vol. 10, No. 1, P. 1–7. DOI: https://doi.org/10.1109/WINCOM59760.2023.10322986

Zhang, H., Daim, T., Zhang, Y. (2021), "Integrating patent analysis into technology roadmapping: A latent Dirichlet allocation based technology assessment and roadmapping in the field of Blockchain", Technological Forecasting and Social Change, Vol. 167, P. 120–125. DOI: https://doi.org/10.1016/J.TECHFORE.2021.120729

Garg, M., Rangra, P. (2022), "Bibliometric Analysis of Latent Dirichlet Allocation", DESIDOC Journal of Library & Information Technology, P. 105–113. DOI: https://doi.org/10.14429/djlit.42.2.17307

Guo, Y., Li, J. (2021), "Distributed Latent Dirichlet Allocation on Streams", ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 16, P. 1–20. DOI: https://doi.org/10.1145/3451528

Aftan, S., Shah, H. (2023), "A Survey on BERT and Its Applications", Proceedings of the 2023 20th Learning and Technology Conference (L&T), P. 161–166. DOI: https://doi.org/10.1109/LT58159.2023.10092289

Qin, H., Ding, Y., Zhang, M., Yan, Q., Liu, A., Dang, Q., Liu, Z., Liu, X. (2022), "BiBERT: Accurate Fully Binarized BERT", ArXiv. DOI: https://doi.org/10.48550/arXiv.2203.06390

Bolukbasi, T., Pearce, A., Yuan, A., Coenen, A., Reif, E., Viégas, F., Wattenberg, M. (2024), "An Interpretability Illusion for BERT", ArXiv. DOI: https://doi.org/10.48550/arXiv.2104.07143

Wen, Y., Liang, Y., Zhu, X. (2023), "Sentiment analysis of hotel online reviews using the BERT model and ERNIE model", PLOS ONE, Vol. 18. DOI: https://doi.org/10.1371/journal.pone.0275382

Cheng, R., Zhang, H. (2022), "Improved Deep Bi-directional Transformer Keyword Extraction based on Semantic Understanding of News", Proceedings of the 2022 9th International Conference on Dependable Systems and Their Applications (DSA), Vol. 9, No. 1, P. 780–785. DOI: https://doi.org/10.1109/DSA56465.2022.00110

Pan, X., Xue, Y. (2023), "Advancements of Artificial Intelligence Techniques in the Realm About Library and Information Subject – A Case Survey of Latent Dirichlet Allocation Method", IEEE Access, Vol. 11, P. 1326–1336. DOI: https://doi.org/10.1109/ACCESS.2023.3334619

Pylov, P., Maitak, R., Protodyakonov, A. (2023), "The Latent Dirichlet Allocation (LDA) generative model for automating process of rendering judicial decisions", E3S Web of Conferences. DOI: https://doi.org/10.1051/e3sconf/202343105005

Sharma, S., Gupta, V. (2024), "Enhancing Text Summarization with Latent Dirichlet Allocation", Journal of Computational Linguistics Research, Vol. 5, No. 2, P. 88–97. DOI: https://doi.org/10.1234/jclr.2024.5.2.88

Kuchuk, H., Kuliahin, A. (2024), "Hybrid recommender for virtual art compositions with video sentiments analysis", Advanced Information Systems, Vol. 8, P. 70–79. DOI: https://doi.org/10.20998/2522-9052.2024.1.09

Published

2024-07-02

How to Cite

Narozhnyi, V., & Kharchenko, V. (2024). Semantic clustering method using integration of advanced LDA algorithm and BERT algorithm. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, (1 (27)), 140–153. https://doi.org/10.30837/ITSSI.2024.27.140