Identification of authorship of Ukrainian-language texts of journalistic style using neural networks
DOI:
https://doi.org/10.15587/1729-4061.2020.195041
Keywords:
authorship identification, text analysis, artificial neural networks, multilayer perceptron, text vectorization
Abstract
This paper addresses the development of an effective method for identifying the authorship of texts, using publications by well-known Ukrainian journalists as material. Most existing methods require text preprocessing, which adds cost to solving the problem. When the set of possible authors can be narrowed down to a small number, this preprocessing is often excessive. Another drawback is that the vast majority of existing approaches were applied to texts in other languages and did not take into account the peculiarities of Ukrainian. The aim was therefore to develop an approach that identifies the author of a Ukrainian-language text without preprocessing and with high accuracy, and to establish which types of artificial neural networks give the minimum error for Ukrainian publicists.
The developed method uses a feedforward multilayer perceptron trained with a supervised learning algorithm, HashingVectorizer for text vectorization, and the Adam optimizer. Even with a small number of training iterations (4–5), the network identifies the authorship of journalistic texts with fairly high accuracy and a fairly small error. More than 1,000 text fragments by three Ukrainian authors were used in the experiments. For texts of at least 500 characters, accuracy reaches 91 %, and the number of training iterations does not exceed 15. These results are attributed primarily to the choice of vectorization method at the preparatory stage and to the structure of the artificial neural network.
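The combination named in the abstract (HashingVectorizer, a multilayer perceptron, and the Adam optimizer) can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the hidden layer sizes, the number of hash features, or the corpus, so `n_features`, `hidden_layer_sizes`, and the toy fragments below are invented placeholders, and the "iterations" of the abstract are assumed to correspond to `max_iter`.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Invented placeholder fragments for three hypothetical authors.
# The study itself used over 1,000 real fragments of at least 500 characters.
fragments = [
    "економіка бюджет податки інфляція банки кредити",
    "бюджет країни податки та курс валют сьогодні",
    "футбол матч тренер команда гол перемога",
    "команда виграла матч тренер задоволений грою",
    "театр вистава актор сцена прем'єра глядачі",
    "актор вийшов на сцену глядачі аплодували",
]
authors = ["author_a", "author_a", "author_b", "author_b", "author_c", "author_c"]

model = make_pipeline(
    # The hashing trick maps raw text to a fixed-length vector without
    # building a vocabulary. n_features is an assumed value.
    HashingVectorizer(n_features=2**12, alternate_sign=False),
    # Feedforward MLP trained with Adam. The single hidden layer of 64 units
    # is an assumption; max_iter=15 mirrors the upper bound in the abstract.
    MLPClassifier(
        hidden_layer_sizes=(64,),
        solver="adam",
        max_iter=15,
        random_state=0,
    ),
)

model.fit(fragments, authors)
predicted = model.predict(["тренер сказав що команда готова до матчу"])
print(predicted[0])
```

With so few training epochs scikit-learn may emit a `ConvergenceWarning`; that is expected here, since the point of the abstract is that a small, fixed number of iterations already suffices on the real corpus. No accuracy figure should be read from this toy data.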
License
Copyright (c) 2020 Maksym Lupei, Alexander Mitsa, Volodymyr Repariuk, Vasyl Sharkan
This work is licensed under a Creative Commons Attribution 4.0 International License.