A comparison of Kazakh language processing models for improving semantic search results
DOI:
https://doi.org/10.15587/1729-4061.2025.315954
Keywords:
semantic search, natural language processing, Kazakh language, information retrieval systems
Abstract
The object of the study is text classification and semantic search tailored to the linguistic features of the Kazakh language. The research addresses the challenge of improving the accuracy, relevance, and efficiency of semantic search.
This study focuses on improving semantic search for the Kazakh language by analyzing computational models suited to its linguistic features, such as agglutinative morphology and rich inflectional systems. The research compares traditional rule-based approaches with transformer architectures, including fine-tuned models such as RoBERTa, on their ability to handle semantic nuances, contextual relationships, and user intent. The results show that fine-tuned transformer models improved substantially on these tasks, with the RoBERTa model attaining a Precision@10 of 89.4 %, a Mean Reciprocal Rank (MRR) of 85.6 %, and an F1-score of 88.0 %. In addition, the semantic search system developed in this study achieved a precision of 88.4 %, a recall of 87.6 %, and an F1-score of 88.0 % on a domain-specific Kazakh dataset.
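For readers unfamiliar with the metrics named above, the following is a minimal sketch of how Precision@k, MRR, and F1 are commonly computed. It is not the authors' evaluation code: the function names, the document identifiers, and the convention of dividing Precision@k by k (rather than by the number of results returned) are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked results that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(doc in relevant for doc in ranked[:k]) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy queries; the document IDs are invented for illustration.
    runs = [
        (["d1", "d7", "d3"], {"d1", "d3"}),  # first relevant hit at rank 1
        (["d9", "d2", "d4"], {"d2"}),        # first relevant hit at rank 2
    ]
    print(precision_at_k(runs[0][0], runs[0][1], k=3))
    print(mean_reciprocal_rank(runs))
    # Precision and recall values reported in the abstract for the search system.
    print(round(f1_score(0.884, 0.876), 3))
```

Applied to the reported precision of 88.4 % and recall of 87.6 %, the harmonic mean rounds to 88.0 %, which agrees with the F1-score stated for the semantic search system.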
Key to these improvements were innovations in the preprocessing pipeline, including custom tokenization and lemmatization tailored to Kazakh's agglutinative morphology, and the integration of contextual embeddings to resolve issues such as synonymy and homonymy. Computational efficiency was improved through resource optimization techniques, enabling these models to be deployed in constrained environments. These findings underscore the potential of tailored transformer models to close the gap in semantic search capabilities for underrepresented languages like Kazakh, advancing the inclusivity of natural language processing technologies.
License
Copyright (c) 2025 Aigerim Aitim, Ryskhan Satybaldiyeva

This work is licensed under a Creative Commons Attribution 4.0 International License.