Semantic clustering method using integration of advanced LDA algorithm and BERT algorithm
DOI: https://doi.org/10.30837/ITSSI.2024.27.140
Keywords: semantic analysis; natural language; LDA algorithm; BERT algorithm; interactive art; emotional response.
Abstract
The subject of the study is in-depth semantic data analysis based on a modification of the Latent Dirichlet Allocation (LDA) methodology and its integration with Bidirectional Encoder Representations from Transformers (BERT).
Relevance. LDA is a fundamental topic modeling technique that is widely used in text analysis applications. Although its usefulness is widely recognized, traditional LDA models face limitations, such as a rigid distribution of topics and inadequate representation of the semantic nuances inherent in natural language.
The purpose and main idea of the study is to improve the adequacy and accuracy of semantic analysis by enhancing the basic LDA mechanism with adaptive Dirichlet priors and the deep semantic capabilities of BERT embeddings.
Research methods: 1) selection of textual datasets; 2) data preprocessing; 3) improvement of the LDA algorithm; 4) integration with BERT embeddings; 5) comparative analysis.
Research objectives: 1) theoretical substantiation of the LDA modification; 2) implementation of the integration with BERT; 3) evaluation of the method's efficiency; 4) comparative analysis; 5) development of an architectural solution.
Results. The theoretical foundations of both the standard and the modified LDA models are outlined, and the extended formulation is presented in detail. A series of experiments on text datasets characterized by different emotional states highlights the key advantages of the proposed approach. A comparative analysis of indicators such as intra- and inter-cluster distances and the silhouette coefficient shows increased coherence, interpretability, and adaptability of the modified LDA model. An architectural solution for implementing the method is proposed.
Conclusions. The empirical results indicate a significant improvement in the detection of subtle complexities and thematic structures in textual data, which is a step forward in the development of topic modeling methodologies. In addition, the results not only open up the possibility of applying LDA to more complex linguistic scenarios, but also outline ways to further improve it for unsupervised text analysis.
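The abstract compares models by intra- and inter-cluster distances and the silhouette coefficient but does not define them. The following is a minimal sketch of how such indicators could be computed for document vectors (for example, topic distributions or embeddings). It assumes Euclidean distance, measures intra-cluster distance as the mean distance of points to their cluster centroid, and measures inter-cluster distance as the mean distance between centroids. The function name `cluster_quality` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def _pairwise(X):
    """Euclidean distance matrix between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def cluster_quality(X, labels):
    """Return (intra, inter, silhouette) for a labelled set of vectors.

    intra      : mean distance of points to their own cluster centroid (assumed definition)
    inter      : mean pairwise distance between cluster centroids (assumed definition)
    silhouette : mean silhouette coefficient over all points
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])

    # Intra-cluster distance: average point-to-centroid distance per cluster.
    intra = float(np.mean([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ]))

    # Inter-cluster distance: average over all distinct centroid pairs.
    upper = np.triu_indices(len(clusters), k=1)
    inter = float(_pairwise(centroids)[upper].mean())

    # Silhouette: s = (b - a) / max(a, b) for every point.
    D = _pairwise(X)
    sil = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton clusters get a silhouette of 0 by convention
        a = D[i, same].sum() / (n_same - 1)  # excludes the zero self-distance
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        sil[i] = (b - a) / denom if denom > 0 else 0.0
    return intra, inter, float(sil.mean())


# Toy example: two well-separated groups of 2-D points.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
groups = [0, 0, 1, 1]
intra, inter, silhouette = cluster_quality(points, groups)
print(intra, inter, silhouette)
```

Lower intra-cluster distance, higher inter-cluster distance, and a silhouette closer to 1 all point to tighter, better separated clusters, which is the direction of improvement the abstract reports for the modified model.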
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Our journal follows the Creative Commons copyright terms and permissions for open access journals.
Authors who publish with this journal agree to the following terms:
Authors hold the copyright without restrictions and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-commercial and non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their published work online (e.g., in institutional repositories or on their website) as it can lead to productive exchanges, as well as earlier and greater citation of published work.