Grammatical categories determination for Turkish and Kazakh languages based on machine learning algorithms and fulfilling dictionaries of link grammar parser

Aigerim Yerimbetova; Madina Tussupova; Madina Sambetbayeva; Mussa Turdalyuly; Bakzhan Sakenov

doi:10.15587/1729-4061.2021.238743

Authors

Aigerim Yerimbetova, Institute of Information and Computational Technologies, Kazakhstan, https://orcid.org/0000-0002-2013-1513
Madina Tussupova, ENGIE IT, Belgium, https://orcid.org/0000-0002-6597-0716
Madina Sambetbayeva, Institute of Information and Computational Technologies, Kazakhstan, https://orcid.org/0000-0001-9358-1614
Mussa Turdalyuly, Institute of Automation and Information Technologies, Kazakhstan, https://orcid.org/0000-0002-1470-3706
Bakzhan Sakenov, Institute of Information and Computational Technologies, Kazakhstan, https://orcid.org/0000-0002-9849-6176

DOI:

https://doi.org/10.15587/1729-4061.2021.238743

Keywords:

natural language processing, part-of-speech, machine learning algorithms, agglutinative language, Word2vec

Abstract

This research aims to identify parts of speech for the Kazakh and Turkish languages in an information retrieval system. The proposed algorithms are based on machine learning techniques. We treat the task as a binary classification of words by part of speech and study several of the most widely used machine learning algorithms. We defined 7 dictionaries and tagged 135 million words for Kazakh, and 9 dictionaries and 50 million words for Turkish.
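The binary classification setting described above can be illustrated with a small sketch. It is not the authors' method: the abstract does not name the algorithms used, so a plain perceptron over word-suffix features is swapped in here as a stand-in, and the ten Turkish training words, the function names, and the feature scheme are all assumptions made for illustration.

```python
from collections import defaultdict

def suffix_features(word):
    """Return the last 1, 2 and 3 characters of a word as string features."""
    return [f"suf{n}={word[-n:]}" for n in (1, 2, 3)]

def train_perceptron(samples, epochs=10):
    """Train a binary perceptron; label 1 means 'verb', 0 means 'not a verb'."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for word, label in samples:
            feats = suffix_features(word)
            score = bias + sum(weights[f] for f in feats)
            predicted = 1 if score > 0 else 0
            error = label - predicted
            if error:  # update only on a misclassified sample
                for f in feats:
                    weights[f] += error
                bias += error
    return weights, bias

def is_verb(word, weights, bias):
    """Classify a single word with the trained weights."""
    score = bias + sum(weights.get(f, 0.0) for f in suffix_features(word))
    return 1 if score > 0 else 0

# Toy training set (an assumption): Turkish infinitives vs. common nouns.
TRAIN = [
    ("gelmek", 1), ("gitmek", 1), ("yapmak", 1), ("okumak", 1), ("yazmak", 1),
    ("kitap", 0), ("ev", 0), ("masa", 0), ("kalem", 0), ("okul", 0),
]

weights, bias = train_perceptron(TRAIN)
print(is_verb("almak", weights, bias), is_verb("araba", weights, bias))
```

One such yes/no classifier per part of speech gives the binary formulation. A toy set this small overfits to surface endings, so a realistic model would need a tagged corpus and richer features.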

The main problem considered in the paper is the creation of algorithms for filling the dictionaries of the Link Grammar Parser (LGP) system, in particular for the Kazakh and Turkish languages, using machine learning techniques.

The research focuses on reviewing and comparing machine learning algorithms and methods that have achieved results on natural language processing tasks such as the determination of grammatical categories.

To operate, the LGP system requires a dictionary that specifies a connector for each word, that is, the type of link that the word can form. The authors considered methods of filling in LGP dictionaries using machine learning.
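As a rough sketch of what filling such a dictionary could look like, the snippet below groups words by a predicted class and writes one entry per class in the `words: expression;` form used by Link Grammar dictionary files. The `CONNECTORS` table, the function name, and the class labels are hypothetical. The connector expressions are borrowed from introductory English examples and serve only as placeholders, not as a valid grammar for Kazakh or Turkish.

```python
from collections import defaultdict

# Hypothetical mapping from a predicted word class to a connector expression.
# Placeholder expressions only; they do not describe Kazakh or Turkish syntax.
CONNECTORS = {
    "noun": "D- & (S+ or O-)",
    "verb": "S- & O+",
}

def build_entries(predicted):
    """Group words by predicted class and format one dictionary entry per class."""
    by_class = defaultdict(list)
    for word, word_class in predicted.items():
        by_class[word_class].append(word)
    entries = []
    for word_class in sorted(by_class):
        words = " ".join(sorted(by_class[word_class]))
        entries.append(f"{words}: {CONNECTORS[word_class]};")
    return entries

# Assumed classifier output for three Turkish words.
predicted = {"kitap": "noun", "okul": "noun", "gelmek": "verb"}
for entry in build_entries(predicted):
    print(entry)
```

Each printed line lists the words that share a connector expression, so a classifier's output maps directly onto dictionary entries.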

The complexity of natural language processing does not rule out identifying narrower tasks that can already be solved algorithmically, for example determining parts of speech or splitting texts into logical groups. However, some features of natural languages significantly reduce the effectiveness of these solutions. In particular, taking into account all word forms of each word in the Kazakh and Turkish languages increases the complexity of text processing by an order of magnitude.
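The growth in word forms can be made concrete with a small sketch. It assumes a simplified model of Turkish noun morphology with three optional suffix slots (plural, first-person possessive, locative) applied to the stem "ev" (house). Vowel harmony and consonant alternations are deliberately ignored, so the suffix table is valid only for this stem, and the function name and slot layout are illustrative.

```python
from itertools import product

def word_forms(stem, slots):
    """Return every form obtained by picking one option from each suffix slot."""
    return sorted(stem + "".join(suffixes) for suffixes in product(*slots))

# Each slot holds the empty option plus one suffix (an assumed, simplified table).
SLOTS = [
    ("", "ler"),  # plural
    ("", "im"),   # first-person singular possessive
    ("", "de"),   # locative case
]

forms = word_forms("ev", SLOTS)
print(len(forms), forms)
```

Three binary slots already yield eight forms of one stem. Each further slot or suffix option multiplies the count, which is why covering all forms in a dictionary becomes expensive for agglutinative languages.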

Supporting Agency

We would like to offer special thanks to Dr. Feodor Murzin who, although no longer with us, continues to inspire through his example and his dedication to the students he served over the course of his career. This research was funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP08857179).

Author Biographies

Aigerim Yerimbetova, Institute of Information and Computational Technologies

PhD, Associate Professor, Leading Researcher

Madina Tussupova, ENGIE IT

Master of Science in Applied Mathematics and Informatics, Data Scientist

Madina Sambetbayeva, Institute of Information and Computational Technologies

PhD, Associate Professor, Senior Researcher

Mussa Turdalyuly, Institute of Automation and Information Technologies

PhD, Head of Department

Department of Software Engineering

Bakzhan Sakenov, Institute of Information and Computational Technologies

Software Engineer

