Devising an entropy-based approach for identifying patterns in multilingual texts
DOI:
https://doi.org/10.15587/1729-4061.2021.228695
Keywords:
Google Translator, Yandex.Translator, Renyi entropy, Minkowski metric, Hamming distance
Abstract
Plagiarism detection remains a relevant problem, yet modern detection methods are still resource-intensive. This paper reports a more efficient alternative to existing solutions.
The devised system for identifying patterns in multilingual texts compares two texts and determines, by using different approaches, whether the second text is a translation of the first or not. This study's approach is based on Renyi entropy.
An original text from an English writer's work and five texts in Russian were selected for this research. The real and "fake" translations comprised translations by Google Translator and Yandex Translator, an author's book translation, a text from another work by an English writer, and a fake text. The fake text was compiled with the same keyword frequencies as the authentic text.
After a key series of high-frequency words was formed for the original text, the corresponding key series were identified for the other texts. The entropies of the texts were then calculated with each text divided into "sentences" and "paragraphs".
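The entropy step can be illustrated with a minimal Python sketch. It is not the authors' PHP implementation. The entropy order `alpha`, the regex tokenizer, the splitting of text on sentence punctuation, and the helper names `keyword_entropy` and `entropy_series` are all assumptions made for illustration; the abstract does not specify how "sentences" and "paragraphs" are delimited or which order of Renyi entropy is used.

```python
import math
import re
from collections import Counter


def renyi_entropy(probs, alpha=2.0):
    """Renyi entropy (in bits) of a discrete distribution.

    For alpha == 1 the Shannon entropy is returned, which is the limit
    of the Renyi entropy as alpha approaches 1.
    """
    probs = [p for p in probs if p > 0]
    if not probs:
        return 0.0
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def keyword_entropy(segment, keywords, alpha=2.0):
    """Entropy of the keyword frequency distribution within one segment."""
    tokens = re.findall(r"\w+", segment.lower())
    counts = Counter(t for t in tokens if t in keywords)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: a segment without keywords contributes zero
    return renyi_entropy([c / total for c in counts.values()], alpha)


def entropy_series(text, keywords, alpha=2.0):
    """One entropy value per segment; the splitting rule is an assumption."""
    segments = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [keyword_entropy(s, keywords, alpha) for s in segments]
```

Applying `entropy_series` to the original and to each candidate text, with the matching key series for each language, would give the sets of entropies whose proximity is then measured.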
The Minkowski metric was used to calculate the proximity of the texts. It underlies the calculation of the Hamming distance, the Cartesian distance, the distance between the centers of masses, the distance between the geometric centers, and the distance between the centers of parametric means.
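As a hedged illustration, the Minkowski metric of order p between two equal-length vectors of entropies can be sketched in Python as follows. The function name `minkowski_distance` is hypothetical. Treating the paper's Hamming distance as the p = 1 case is an assumption (on binary vectors, the p = 1 distance equals the number of differing positions); p = 2 gives the Cartesian (Euclidean) distance. The center-based distances are not reproduced here because the abstract does not define them.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# p = 1: city-block distance (assumed here to match the paper's Hamming distance)
city_block = minkowski_distance([1, 0, 1, 1], [0, 0, 1, 0], p=1)

# p = 2: Cartesian (Euclidean) distance
euclidean = minkowski_distance([0.0, 0.0], [3.0, 4.0], p=2)
```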
It was found that the proximity of texts is best determined by calculating the relative distances between the centers of parametric means (for "fake" texts ‒ exceeding 3, for translations ‒ less than 1).
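The reported thresholds can be expressed as a simple decision rule. The Python sketch below is an illustration only: the function name `classify_by_relative_distance` is hypothetical, the relative distance is assumed to be computed beforehand, and the "undetermined" label for values between 1 and 3 is an assumption, since the abstract does not state how that range is treated.

```python
def classify_by_relative_distance(relative_distance,
                                  translation_max=1.0,
                                  fake_min=3.0):
    """Label a text pair using the thresholds reported in the abstract.

    Below translation_max the pair is treated as a translation, above
    fake_min as a "fake"; anything in between is left undetermined
    (an assumption of this sketch, not a result from the paper).
    """
    if relative_distance < 0:
        raise ValueError("a distance cannot be negative")
    if relative_distance < translation_max:
        return "translation"
    if relative_distance > fake_min:
        return "fake"
    return "undetermined"
```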
Calculating the proximity of texts with the algorithm based on Renyi entropy, reported in this work, makes it possible to save resources and time compared to methods based on neural networks. All the raw data and an example of the entropy calculation in PHP are publicly available.
License
Copyright (c) 2021 Gulnur Yerkebulan, Valentina Kulikova, Vladimir Kulikov, Zaru Kulsharipova
This work is licensed under a Creative Commons Attribution 4.0 International License.