A comparison of Kazakh language processing models for improving semantic search results

Aigerim Aitim; Ryskhan Satybaldiyeva

doi:10.15587/1729-4061.2025.315954

Authors

Aigerim Aitim International IT University, Kazakhstan https://orcid.org/0000-0003-2982-214X
Ryskhan Satybaldiyeva Satbayev University, Kazakhstan https://orcid.org/0000-0002-0678-7583


Keywords:

semantic search, natural language processing, Kazakh language, information retrieval systems

Abstract

The object of the study is text classification and semantic search tailored to the linguistic features of the Kazakh language. The research addresses the challenge of improving the accuracy, relevance, and efficiency of semantic search.

This study focuses on improving semantic search for the Kazakh language by analyzing computational models tailored to its linguistic features, such as agglutinative morphology and rich inflectional systems. The research compares traditional rule-based approaches with transformer architectures, including fine-tuned models such as RoBERTa, on their ability to handle semantic nuances, contextual relationships, and user intent. The results show that fine-tuned transformer models performed strongly, with the RoBERTa model attaining a Precision@10 of 89.4 %, a Mean Reciprocal Rank (MRR) of 85.6 %, and an F1-score of 88.0 %. In addition, the semantic search system developed in this study achieved a precision of 88.4 %, a recall of 87.6 %, and an F1-score of 88.0 % on a domain-specific Kazakh dataset.
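For readers unfamiliar with the reported metrics, the following minimal sketch shows how Precision@k, Mean Reciprocal Rank, and F1 are conventionally computed. It is an illustration under standard textbook definitions, not the authors' evaluation code; the function names and the toy document identifiers are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs, one per query."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy queries with hypothetical document ids (not data from the study).
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d5", "d6", "d7", "d8"], {"d6"}),
    ]
    print(precision_at_k(runs[0][0], runs[0][1], k=4))
    print(mean_reciprocal_rank(runs))
    # Precision and recall values as reported in the abstract.
    print(round(f1_score(0.884, 0.876), 3))
```

Applying the F1 formula to the reported precision (88.4 %) and recall (87.6 %) gives approximately 88.0 %, which agrees with the F1-score stated for the semantic search system.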

Key to these improvements were innovations in preprocessing pipelines, including custom tokenization and lemmatization tailored to Kazakh's agglutinative morphology, and the integration of contextual embeddings to resolve issues such as synonymy and homonymy. Computational efficiency was enhanced through resource optimization techniques, enabling the deployment of these models in resource-constrained environments. These findings underscore the potential of tailored transformer models to bridge the gap in semantic search capabilities for underrepresented languages like Kazakh, advancing the inclusivity of natural language processing technologies.

Author Biographies

Aigerim Aitim, International IT University

Master of Technical Sciences, Assistant Professor

Department of Information Systems

Ryskhan Satybaldiyeva, Satbayev University

Candidate of Technical Sciences, Associate Professor, Head of Department

Department of Cybersecurity, Information Processing and Storage

