Aligning and extending technologies of parallel corpora for the Kazakh language

Authors

DOI:

https://doi.org/10.15587/1729-4061.2022.259452

Keywords:

parallel corpora, aligning, Kazakh, English, sentence generation, extending technology

Abstract

The paper presents a two-stage alignment method and an extension method for parallel corpora of the Kazakh language. Kazakh is an agglutinative language with rich morphology and belongs to the Turkic language group, so traditional alignment methods built for similar languages do not work for it. Alignment is used primarily to ensure that the fragment of the translation corresponding to a fragment of the original can be found. The matching fragments of the parallel texts are then compared with each other. The first question is what unit should be aligned. Word-by-word alignment is possible, but it is often impractical because the sets of lexemes and expressions do not match across languages. Because of such linguistic peculiarities, existing universal technologies for aligning parallel texts may not work for agglutinative languages, in which a word form is built from additional affixes and auxiliary words that carry semantic and morphological information. The approach presented in this paper is a two-stage alignment that uses a bilingual dictionary of synonyms. Evaluation on an English-Kazakh corpus shows that the method achieves 89 % correct alignment on average. The second method is designed to extend the parallel corpus, addressing the lack of good-quality natural parallel corpora for the Kazakh-English language pair. It uses a combinatorial approach that takes into account the semantic and grammatical features of the Kazakh language: sentences are generated in different Kazakh tenses, and different endings for parts of speech are also considered.
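
The idea of dictionary-assisted alignment can be illustrated with a minimal Python sketch. The abstract does not specify what the two stages are, so everything here is an assumption made for illustration: stage one restricts candidates to a positional window, stage two scores each candidate by bilingual-dictionary overlap, and a prefix comparison stands in for real Kazakh morphological analysis. The function names, the toy dictionary, and the `window`, `threshold`, and `stem_len` parameters are hypothetical and are not taken from the paper.

```python
PUNCT = ".,!?;:\"'()"

def tokenize(sentence):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in sentence.split())
    return [t for t in tokens if t]

def overlap_score(en_tokens, kk_tokens, dictionary, stem_len=4):
    """Fraction of English tokens that have a dictionary entry whose
    leading characters start some Kazakh token."""
    if not en_tokens:
        return 0.0
    hits = 0
    for en in en_tokens:
        candidates = dictionary.get(en, ())
        # Assumption: prefix matching approximates stem matching, since
        # Kazakh attaches suffixes to a mostly stable stem.
        if any(kk.startswith(c[:stem_len]) for c in candidates for kk in kk_tokens):
            hits += 1
    return hits / len(en_tokens)

def align(en_sents, kk_sents, dictionary, window=1, threshold=0.3):
    """Return (english_index, kazakh_index, score) for each aligned pair."""
    pairs = []
    for i, en in enumerate(en_sents):
        en_tokens = tokenize(en)
        # Stage 1 (assumed): only look at Kazakh sentences near position i.
        lo, hi = max(0, i - window), min(len(kk_sents), i + window + 1)
        best_j, best = None, 0.0
        # Stage 2 (assumed): pick the candidate with the highest overlap.
        for j in range(lo, hi):
            score = overlap_score(en_tokens, tokenize(kk_sents[j]), dictionary)
            if score > best:
                best_j, best = j, score
        if best_j is not None and best >= threshold:
            pairs.append((i, best_j, best))
    return pairs

# Toy dictionary: each English word maps to Kazakh synonyms or stem variants.
toy_dictionary = {
    "i": ("мен",),
    "read": ("оқу", "оқы"),
    "book": ("кітап",),
    "child": ("бала",),
    "went": ("бару", "бар"),
    "school": ("мектеп",),
}
english = ["I read a book.", "The child went to school."]
kazakh = ["Мен кітап оқыдым.", "Бала мектепке барды."]
result = align(english, kazakh, toy_dictionary)
```

Listing several Kazakh variants per English word, as the toy dictionary does, is one simple way to let an inflected form such as "оқыдым" match its dictionary entry without a full morphological analyzer.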

Supporting Agency

  • This research was performed and financed by the grant Project IRN AP08052421 of the Ministry of Science and Higher Education of the Republic of Kazakhstan.

Author Biographies

Diana Rakhimova, Al-Farabi Kazakh National University

PhD

Department of Information Systems

Aidana Karibayeva, Al-Farabi Kazakh National University

Master

Department of Information Systems

Published

2022-08-31

How to Cite

Rakhimova, D., & Karibayeva, A. (2022). Aligning and extending technologies of parallel corpora for the Kazakh language. Eastern-European Journal of Enterprise Technologies, 4(2 (118)), 32–39. https://doi.org/10.15587/1729-4061.2022.259452