Aligning and extending technologies of parallel corpora for the Kazakh language
DOI:
https://doi.org/10.15587/1729-4061.2022.259452
Keywords:
parallel corpora, aligning, Kazakh, English, sentence generation, extending technology
Abstract
The paper presents a two-stage alignment method and an extension method for parallel corpora of the Kazakh language. Kazakh is an agglutinative language with rich morphology and belongs to the Turkic language group, so traditional alignment methods developed for similar languages do not work for it. Alignment is used primarily to ensure that the fragment corresponding to the original is found in the translation; the matched fragments of the parallel texts are then compared with each other. The first question is what unit should be aligned. Word-by-word alignment is possible in principle, but in practice it often becomes almost impossible because the sets of lexemes and expressions do not match across languages. Given the linguistic peculiarities of individual languages, existing technologies for universal alignment of parallel texts may not work for agglutinative languages, in which word forms are built with additional affixes and auxiliary words that carry semantic and morphological information. The approach presented in this paper is a two-stage alignment that uses a bilingual dictionary of synonyms. Evaluation on an English-Kazakh corpus shows that the method achieves 89 % correct alignment on average. The second method is designed to extend the parallel corpus, since good-quality natural parallel corpora for the Kazakh-English language pair are scarce. It uses a combinatorial approach that takes into account the semantic and grammatical features of the Kazakh language: sentences are generated in different Kazakh tenses, and different endings for parts of speech are also considered.
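The idea of dictionary-based sentence alignment can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the two stages, so the code below only shows how a bilingual dictionary with synonyms might be used to score candidate sentence pairs. The toy dictionary, the function names (`tokenize`, `pair_score`, `align`), the prefix match used as a crude stand-in for Kazakh morphological analysis, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary: English word -> set of Kazakh stems (synonyms).
TOY_DICT: Dict[str, Set[str]] = {
    "book": {"кітап"},
    "read": {"оқы"},
    "school": {"мектеп"},
    "go": {"бар", "кет"},
    "child": {"бала"},
}

PUNCT = ".,!?;:"


def tokenize(sentence: str) -> List[str]:
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in sentence.split())
    return [t for t in tokens if t]


def pair_score(en: str, kk: str, dictionary: Dict[str, Set[str]]) -> float:
    """Share of dictionary-known English words with a Kazakh match.

    A Kazakh token matches if it starts with one of the listed stems.
    This prefix test is an assumption: it ignores stem alternations and
    can produce false matches on short stems.
    """
    kk_tokens = tokenize(kk)
    known = [w for w in tokenize(en) if w in dictionary]
    if not known:
        return 0.0
    hits = sum(
        1
        for w in known
        if any(t.startswith(stem) for stem in dictionary[w] for t in kk_tokens)
    )
    return hits / len(known)


def align(
    en_sents: List[str],
    kk_sents: List[str],
    dictionary: Dict[str, Set[str]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """For each English sentence, keep the best-scoring Kazakh sentence
    if its score reaches the threshold."""
    pairs = []
    for i, en in enumerate(en_sents):
        best_j, best = -1, 0.0
        for j, kk in enumerate(kk_sents):
            score = pair_score(en, kk, dictionary)
            if score > best:
                best_j, best = j, score
        if best_j >= 0 and best >= threshold:
            pairs.append((i, best_j, best))
    return pairs


if __name__ == "__main__":
    english = ["The child reads a book.", "Children go to school."]
    kazakh = ["Балалар мектепке барады.", "Бала кітап оқиды."]
    for i, j, score in align(english, kazakh, TOY_DICT):
        print(english[i], "<->", kazakh[j], round(score, 2))
```

The sketch matches affixed forms such as "мектепке" and "барады" to their dictionary stems only because the stems happen to be prefixes. A realistic system for an agglutinative language would need proper stemming or morphological analysis on the Kazakh side and lemmatization on the English side.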
Supporting Agency
- This research was performed and financed by the grant Project IRN AP08052421 of the Ministry of Science and Higher Education of the Republic of Kazakhstan.
License
Copyright (c) 2022 Diana Rakhimova, Aidana Karibayeva
This work is licensed under a Creative Commons Attribution 4.0 International License.