Organization of software and neural network algorithms for machine analysis of text messages presented in natural language

Viacheslav Shkurko; Andrii Poliakov

doi:10.30837/2522-9818.2025.2.151

Author(s)

Viacheslav Shkurko, Kharkiv National University of Radio Electronics, Ukraine, https://orcid.org/0009-0003-3274-3941
Andrii Poliakov, Simon Kuznets Kharkiv National University of Economics, Ukraine, https://orcid.org/0000-0003-1805-9011

DOI:

https://doi.org/10.30837/2522-9818.2025.2.151

Keywords:

machine analysis; text data; natural language; software algorithms; neural network architecture.

Abstract

The article addresses the organization of software and neural network algorithms for machine analysis of text data presented in natural language. It substantiates the relevance of processing both short text messages and reviews, which must be handled quickly with minimal resource costs, and complex structured documents, which require preservation of structural characteristics and deep contextual analysis. A comprehensive analysis of current methods of machine text processing is carried out, including tokenization, clustering, semantically relevant search, and the use of neural network architectures. Particular attention is paid to approaches that reduce computational costs without a significant loss in the quality of the analysis results, which is critical when resources are limited. Based on this analysis, a multi-level methodology for organizing machine analysis of text data is developed. The methodology involves preliminary classification of text corpora by document type, grouping of texts by clustering to improve the relevance of processing, and the application of deep learning neural network models. For deep analysis of text information, an architecture based on a bidirectional recurrent neural network (Bidirectional LSTM) is implemented, with Dropout regularization and early stopping of training. To test the proposed methodology in practice, an application for automated analysis of short natural-language text messages was developed. The training results of the model are presented, loss curves for the training and validation sets are plotted, and confusion matrices and visualizations of the prediction results are produced. A stable decrease in the loss function is demonstrated without a significant increase in the computational costs of the system.
The proposed methodology can be applied in information systems of various purposes for automated processing of text messages under resource constraints. It also has prospects for further development toward the analysis of multimodal data and deployment in real information and analytical systems.
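The early stopping mentioned in the abstract can be illustrated with a minimal sketch. The authors' implementation is not shown on this page, so everything below is an assumption made for illustration: the helper names `EarlyStopping` and `stopping_epoch`, the `patience` and `min_delta` parameters, and their default values are hypothetical. The sketch only shows the general criterion of halting training once the validation loss has not improved for several consecutive epochs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class EarlyStopping:
    """Hypothetical helper: signals a stop when validation loss stops improving."""

    patience: int = 3            # assumed: epochs to wait without improvement
    min_delta: float = 0.0       # assumed: smallest decrease counted as improvement
    best_loss: float = float("inf")
    best_epoch: int = -1
    wait: int = 0

    def update(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember this epoch and reset the waiting counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.wait = 0
            return False
        # No improvement: count the epoch and stop once patience is exhausted.
        self.wait += 1
        return self.wait >= self.patience


def stopping_epoch(
    val_losses: Sequence[float], patience: int = 3, min_delta: float = 0.0
) -> Tuple[Optional[int], int]:
    """Return (epoch at which training would stop, or None; best epoch seen)."""
    stopper = EarlyStopping(patience=patience, min_delta=min_delta)
    for epoch, loss in enumerate(val_losses):
        if stopper.update(epoch, loss):
            return epoch, stopper.best_epoch
    return None, stopper.best_epoch


if __name__ == "__main__":
    # Illustrative validation losses, not results from the article.
    losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
    print(stopping_epoch(losses, patience=3))
```

In a real training loop the weights from the best epoch would typically be restored after stopping. Deep learning frameworks usually provide this behaviour as a ready-made callback, which would likely be preferable to a hand-written helper.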

Author biographies

Viacheslav Shkurko, Kharkiv National University of Radio Electronics

Postgraduate student, Department of Applied Mathematics

Andrii Poliakov, Simon Kuznets Kharkiv National University of Economics

Candidate of Technical Sciences, Associate Professor; Associate Professor at the Department of Information Systems, Simon Kuznets Kharkiv National University of Economics; Associate Professor at the Department of Applied Mathematics, Kharkiv National University of Radio Electronics
