A comparison of Kazakh language processing models for improving semantic search results

Authors

Aigerim Aitim, Ryskhan Satybaldiyeva

DOI:

https://doi.org/10.15587/1729-4061.2025.315954

Keywords:

semantic search, natural language processing, Kazakh language, information retrieval systems

Abstract

The object of the study is text classification and semantic search tailored to the linguistic features of the Kazakh language. The research addresses the challenge of improving the accuracy, relevance, and efficiency of semantic search.

The study analyzes computational models suited to Kazakh's agglutinative morphology and rich inflectional system. It compares traditional rule-based approaches with transformer architectures, including fine-tuned models such as RoBERTa, on their ability to handle semantic nuances, contextual relationships, and user intent. The fine-tuned RoBERTa model attained a Precision@10 of 89.4 %, a Mean Reciprocal Rank (MRR) of 85.6 %, and an F1-score of 88.0 %. In addition, the semantic search system developed in this study reached a precision of 88.4 %, a recall of 87.6 %, and an F1-score of 88.0 % on a domain-specific Kazakh dataset.
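For readers unfamiliar with the reported metrics, the following is a minimal sketch of how Precision@k, MRR, and the F1-score are commonly computed. It is not the authors' evaluation code; the function names and the boolean relevance-flag input format are assumptions made for illustration. The last line only checks that the reported precision and recall are arithmetically consistent with the reported F1-score.

```python
from typing import Sequence


def precision_at_k(relevant_flags: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k ranked results that are relevant.

    relevant_flags[i] is True if the result at rank i+1 is relevant
    (hypothetical input format). If fewer than k results are returned,
    the missing positions count as non-relevant.
    """
    return sum(list(relevant_flags)[:k]) / k


def mean_reciprocal_rank(ranked_lists: Sequence[Sequence[bool]]) -> float:
    """Average of 1/rank of the first relevant result over all queries.

    A query with no relevant result contributes 0.
    """
    total = 0.0
    for flags in ranked_lists:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Consistency check using the precision and recall reported in the abstract.
reported_f1 = f1_score(0.884, 0.876)  # rounds to 0.880, i.e. 88.0 %
```

With a precision of 88.4 % and a recall of 87.6 %, the harmonic mean is 2 · 0.884 · 0.876 / (0.884 + 0.876) ≈ 0.880, which matches the F1-score of 88.0 % given for the search system.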

Key to these improvements were innovations in the preprocessing pipeline, including custom tokenization and lemmatization tailored to Kazakh's agglutinative morphology, and the integration of contextual embeddings to resolve issues such as synonymy and homonymy. Computational efficiency was enhanced through resource optimization techniques, enabling the deployment of these models in constrained environments. These findings underscore the potential of tailored transformer models to bridge the gap in semantic search capabilities for underrepresented languages like Kazakh, advancing the inclusivity of natural language processing technologies.
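As a rough illustration of embedding-based retrieval in general, the sketch below ranks documents by cosine similarity to a query vector. It is not the system described in the study: the three-dimensional vectors are made-up toy values standing in for embeddings that a fine-tuned model would produce, and the names `cosine`, `rank_documents`, `toy_docs`, and `toy_query` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumed values, not real model output).
toy_docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.2],
    "doc_c": [0.7, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

ranking = rank_documents(toy_query, toy_docs)
```

Because similarity is measured between vectors rather than surface strings, two differently inflected or synonymous word forms can rank close together when their embeddings are close, which is the property the abstract attributes to contextual embeddings.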

Author Biographies

Aigerim Aitim, International IT University

Master of Technical Sciences, Assistant Professor

Department of Information Systems

Ryskhan Satybaldiyeva, Satbayev University

Candidate of Technical Sciences, Associate Professor, Head of Department

Department of Cybersecurity, Information Processing and Storage


Published

2025-02-27

How to Cite

Aitim, A., & Satybaldiyeva, R. (2025). A comparison of Kazakh language processing models for improving semantic search results. Eastern-European Journal of Enterprise Technologies, 1 (2 (133)), 66–75. https://doi.org/10.15587/1729-4061.2025.315954