Identification of authorship of Ukrainian-language texts of journalistic style using neural networks

Authors

DOI:

https://doi.org/10.15587/1729-4061.2020.195041

Keywords:

authorship identification, text analysis, artificial neural networks, multilayer perceptron, text vectorization

Abstract

This paper addresses the development of an effective method for identifying the authorship of texts, using publications by well-known Ukrainian journalists as material. Most existing methods require text preprocessing, which adds cost to solving the problem. When the set of candidate authors can be narrowed to a few, such preprocessing is often excessive. Another drawback of existing approaches is that the vast majority were applied to texts in other languages and did not take into account the peculiarities of the Ukrainian language. The aim was therefore to develop an approach that identifies the author of a Ukrainian-language text without preprocessing and with high accuracy, and to establish which types of artificial neural networks give the minimum error for Ukrainian publicists.

The developed method uses a feedforward multilayer perceptron trained with a supervised learning algorithm, HashingVectorizer for text vectorization, and the Adam optimizer. Even with a small number of training iterations (4–5), the network identifies the authorship of journalistic texts with fairly high accuracy and a fairly small error. Over 1,000 text fragments by three Ukrainian authors were used. The experiments showed that for texts of at least 500 characters the accuracy reaches 91 %, and the maximum number of training iterations does not exceed 15. These results were achieved primarily through the choice of vectorization method at the preparatory stage and the structure of the artificial neural network.
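The vectorization step named above relies on the hashing trick: each token is mapped straight to a column index by a hash function, so no vocabulary has to be built and no linguistic preprocessing is needed. The following is only a minimal standard-library sketch of that idea, not the authors' implementation. The function name `hashed_vector`, the whitespace tokenization, the use of MD5 as the hash, and the vector size of 1024 are all assumptions made for illustration; scikit-learn's HashingVectorizer uses its own tokenizer and hash function.

```python
import hashlib
import math


def hashed_vector(text, n_features=1024):
    """Map raw text to a fixed-length, L2-normalized vector of hashed tokens.

    Hypothetical helper for illustration only: tokens are split on
    whitespace and hashed with MD5 (an assumption, chosen because it is
    deterministic across runs, unlike Python's built-in hash()).
    """
    vec = [0.0] * n_features
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # First four bytes choose the column, one further bit chooses the sign.
        index = int.from_bytes(digest[:4], "little") % n_features
        sign = 1.0 if digest[4] & 1 == 0 else -1.0
        vec[index] += sign
    # Normalize so that fragments of different length are comparable.
    norm = math.sqrt(sum(v * v for v in vec))
    if norm > 0:
        vec = [v / norm for v in vec]
    return vec


if __name__ == "__main__":
    fragment = "Це приклад фрагмента публіцистичного тексту"
    features = hashed_vector(fragment)
    print(len(features), sum(1 for v in features if v != 0.0))
```

Vectors of this kind have a fixed size regardless of how many distinct words the corpus contains, which is what allows them to be fed directly to the input layer of a multilayer perceptron.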

Author Biographies

Maksym Lupei, Uzhhorod National University Narodna sq., 3, Uzhhorod, Ukraine, 88000

Postgraduate Student

Department of Information Management Systems and Technologies

Alexander Mitsa, Uzhhorod National University Narodna sq., 3, Uzhhorod, Ukraine, 88000

PhD, Associate Professor, Head of Department

Department of Information Management Systems and Technologies

Volodymyr Repariuk, Uzhhorod National University Narodna sq., 3, Uzhhorod, Ukraine, 88000

Department of Information Management Systems and Technologies

Vasyl Sharkan, Uzhhorod National University Narodna sq., 3, Uzhhorod, Ukraine, 88000

PhD, Associate Professor

Department of Journalism

Published

2020-02-29

How to Cite

Lupei, M., Mitsa, A., Repariuk, V., & Sharkan, V. (2020). Identification of authorship of Ukrainian-language texts of journalistic style using neural networks. Eastern-European Journal of Enterprise Technologies, 1(2 (103)), 30–36. https://doi.org/10.15587/1729-4061.2020.195041