IDENTIFICATION OF AUTHORSHIP OF UKRAINIAN-LANGUAGE TEXTS OF JOURNALISTIC STYLE USING NEURAL NETWORKS


Introduction
With the advancement of technology, artificial neural networks are increasingly used for tasks that take a person much longer to solve than a computer. Such tasks include identifying the primary source of a text, identifying the authorship of anonymous texts, combating plagiarism, and attributing a text to a particular author during forensic examination. Many approaches to these tasks exist; they rely on different methods and achieve different levels of accuracy. However, the development of a universal method that provides the highest accuracy of authorship identification while consuming fewer resources remains an unresolved issue.
The use of computational approaches to identify the authorship of Ukrainian publicists who have similar styles has been little explored. Approaches to this problem can have their own peculiarities and differ from the generally accepted universal approaches, because the main parameters that researchers try to isolate are significantly affected by general features of the language rather than by individual features of the author's style. By taking into account the specifics of a particular language, one can develop an effective approach to authorship identification.

Literature review and problem statement
In Ukrainian linguistics, the greatest attention is traditionally paid to the belles-lettres style and less to other styles, including the journalistic one. The same applies to the study of the individual style of authors: more studies deal with the language of writers than with the language of publicists [1]. Studies by Ukrainian linguists mainly address the speech of individual authors or the language of specific works, whereas traditional linguistics pays little attention to the identification of authorship of texts.
A series of papers exploring the stylometric parameters of scientific texts in Ukrainian has been published in recent years. In particular, in paper [2], the identification of the author's style is based on a comparative analysis of the author's speech coefficients (speech coherence, vocabulary diversity, syntax complexity, concentration, and exclusivity indices) computed for the author's passage and for other analyzed passages, in order to determine the degree to which the analyzed text belongs to a particular author. Support vector machines (SVM) are used at one of the stages. A disadvantage is the small sample of about 200 single-author papers in the technical field written by about 100 different authors. Article [3] traces the dynamics of change in different parameters of the author's style (the number of different language units: words in a text, sentences, prepositions, conjunctions, and the number of words with frequency 1 and with frequency 10 or more). The methods developed in these works for the scientific style can be applied to other functional styles of the Ukrainian language.
A comparative study of the statistical parameters of the publicist ("newspaper") style and other styles is presented in a number of articles. Thus, article [4] addresses the distinction between the scientific, fiction, journalistic, and conversational styles at the phonological level, using mathematical-statistical methods based on data about the use of consonants in texts by English-speaking authors, and article [5] examines the statistical parameters of the conversational and journalistic styles separately. However, these studies were performed on English-language material. In this regard, there remains a need to develop the optimal structure of an artificial neural network (ANN) to identify the authorship of texts of the journalistic style in the Ukrainian language.
Approaches to authorship identification can combine accumulated knowledge from the theory of image recognition, mathematical statistics and probability theory, neural networks, cluster analysis, Markov chains, and others [6-11]. Paper [6] studies the current state of the problem; it notes that when the training and testing samples contain texts by 3-4 authors, trained classifiers confidently reach up to 85 % accuracy in identifying the authorship of a text in the test sample. Article [6] also proposed a duplex architecture using a support vector machine (SVM). Paper [7] examined the recognition of short texts taken from the Internet in order to detect criminals. Because the messages are quite short, it was important to have a lot of information about possible candidates. The approach in [7] was based on an SVM, but the preparatory stage required stylometric analysis. Paper [8] explored the problem of identifying the authorship of e-mails for 12 people, each of whom wrote 10 letters of up to 150 words. The accuracy in this case was 75-80 %. The k-nearest neighbors method was used for recognition. Article [9] proposed two approaches, based on multiple discriminant analysis (MDA) and SVM. Their effectiveness is compared on various problems, including the identification of authorship of the disputed texts of the "Federalist" papers, and the authorship of the Epistle to the Hebrews in the New Testament is explored. Some problems are solved quite effectively, but the research lacks clear recommendations on a universal approach to authorship identification that would work equally effectively for any task.
In paper [10], the authors used support vector machines (SVM), a machine learning algorithm that constructs a hyperplane in a multidimensional space separating the training instances into target classes, together with Platt's sequential minimal optimization (SMO). This method also combines various text analysis features and does not require prior processing of the text. It solves the problem of attributing a text fragment to the Bengali authors Rabindranath Tagore and Sarat Chandra Chattopadhyay. However, it uses a rather small sample (only 1-2 texts by each author) and in effect recognizes whether a text fragment belongs to a certain work rather than identifying authorship. Transferring this approach to the identification of authorship of Ukrainian publicists is therefore unlikely to give equally good results. Paper [11] proposed neural networks based on complex neural cells with generalized activation functions. Article [12] showed that neural networks with two-threshold activation functions can significantly improve the recognition ability of the network. Paper [13] proposed approaches that use stylistic features as more reliable style markers than, for example, units of the lexical-semantic level, because stylistic markers are less consciously controlled by the author. Three well-known statistical similarity measures are used to obtain the individual effect: cosine similarity (COS), the chi-square measure (CS), and Euclidean distance (ED). The machine learning model includes three different modules: decision trees (DT), neural networks (NN), and support vector machines (SVM). However, even the most effective of these methods gives a result of 83.3 %, which can be improved. Study [14] deals with authorship identification among 55 classics of world literature and proposes a probabilistic approach to it.
The identification accuracy differs significantly between authors, although for some of them it is quite high and exceeds 90 %. The well-known Marlowe-Shakespeare problem is considered separately; the method used refutes the hypothesis that Christopher Marlowe is the co-author of the early plays of his contemporary Shakespeare. Article [14] proposes using the Kullback-Leibler divergence (KLD) to identify authorship. An effective combination of vectorization and artificial neural network architecture makes it possible to reduce computation costs and obtain high accuracy in author identification problems [15]. This approach should also be considered for authorship identification among well-known Ukrainian publicists.
The considered approaches are each adjusted to the specifics of a particular problem. This suggests that an approach built around the peculiarities of a particular problem can improve the accuracy of authorship identification, in particular for journalistic texts in Ukrainian.

The aim and objectives of the study
The aim of this study is to determine an effective combination of the vectorization method and the artificial neural network structure for identifying the authorship of texts of the journalistic style in the Ukrainian language.
To achieve this aim, it is necessary to consider different methods of text vectorization and different neural network architectures, to determine their most effective combinations, and to establish their accuracy and the speed of the ANN learning process for the problem of authorship identification of journalistic texts in the Ukrainian language.

Method for studying the authorship identification of a journalistic text in the Ukrainian language with the help of ANN
For analysis, we chose texts by Y. Makarov, I. Losev, and O. Pokalchuk, 50 texts by each author, published in the "Ukrainian Week" and "Weekly Mirror Ukraine" during 2015-2019 (selected chronologically, from the latest to the oldest). The total number of word usages in the studied texts exceeds 150 thousand (Y. Makarov: 32,758; I. Losev: 61,556; O. Pokalchuk: 72,286). The largest number of word usages is in the texts by O. Pokalchuk, the smallest in the texts by Y. Makarov. Using a specially created program, the texts (including the headings but without the name of the author) were divided into fragments (1,194 in total) of at least 500 characters each, extended to the end of the sentence.
A feedforward multilayer neural network, the Multi-Layer Perceptron [16], which is a supervised learning algorithm [17, 18], and a corresponding technology stack (Table 1) were used for authorship identification. A perceptron learns a function $f(\cdot): R^m \to R^1$ by training on a dataset, where $m$ is the input dimension and 1 is the output dimension [19]. Given the input data $X = (x_1, x_2, \ldots, x_m)^T$ and the output $y$, it can learn a nonlinear function approximator for classification or regression; there may be one or more hidden layers between the input and output layers. Perceptrons with different numbers of layers (Fig. 1) are considered. Each neuron in a hidden layer transforms the values from the previous layer by a weighted linear summation $w_1 x_1 + w_2 x_2 + \ldots + w_m x_m$ followed by a nonlinear activation function $f(\cdot): R^1 \to R^1$, which is implemented as relu or sigmoid. The output layer yields the mean value of the output signal $y$. The error backpropagation algorithm is used as the learning algorithm [20].
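The forward pass of such a perceptron can be illustrated with a minimal pure-Python sketch. The layer widths, the random weight initialization, and the bias terms are assumptions made for illustration only (the text above does not mention biases), and the input here has 6 components rather than the dimensionality used in the study.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w_1*x_1 + ... + w_m*x_m + b) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Propagate x through all layers; the last layer has a single neuron."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x[0]

rng = random.Random(0)
m = 6  # illustrative input size only
layers = [
    (*init_layer(m, 4, rng), relu),     # hidden layer 1
    (*init_layer(4, 4, rng), relu),     # hidden layer 2
    (*init_layer(4, 1, rng), sigmoid),  # output layer, value in (0, 1)
]
y = forward([0.2, 0.0, 0.7, 0.1, 0.0, 0.5], layers)
```

Training by error backpropagation would adjust the weights so that `y` approaches the target label; that step is omitted here.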
The developed approach works according to the following scheme (Fig. 2).
The initial stage is a search for texts on the Internet to form the dataset. The selected texts are then processed by a program written in C++. This program removes unnecessary spaces and blank parts of a text and produces a .csv file in which all the information is split and grouped according to predetermined parameters. The artificial neural network subsequently works with this file. The next step is vectorization; it is an extremely important stage, because the result is very sensitive to how the input data are vectorized. Then, the texts are divided into training and testing sets. In the flowchart, this is the K-fold block, where K indicates how many times the texts were divided into training and testing sets in different ways. Next, the artificial neural network is trained on the training texts, identifies the authors of the testing texts, and outputs the results; this is repeated the specified number of times. Finally, the results of the experiments are collected, and the accuracy and error are determined.
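The K-fold step of this scheme can be sketched as follows. This is a minimal illustration, not the program used in the study: the function names, the shuffling seed, and the `train_and_score` callback (which stands for training the network on one split and returning its score) are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, k, train_and_score):
    """Run train_and_score on every split and return (mean score, list of scores)."""
    scores = [train_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores

# 1,194 fragments and K = 10, as in the experiments described in the text.
splits = list(k_fold_indices(1194, 10))
```

Each fragment appears in exactly one testing set, so every fragment is used for testing once and for training K-1 times.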
We consider separately the problem of finding an effective vectorization, which plays a very important role in creating an effective method [15]. The Hashing Vectorizer showed better results in conjunction with the considered architectures of artificial neural networks. This vectorization method operates in two stages. In the first stage, the collection of text documents is converted into a frequency matrix of lexemes. In the second stage, the collection is converted into a two-dimensional matrix that contains the lexeme counts (or binary occurrence information), normalized as lexeme frequencies if norm='l1' or projected onto the Euclidean unit sphere if norm='l2'. This implementation of the text vectorizer uses the hashing trick to map each lexeme string to an integer feature index.
The Adam optimization algorithm is also used. It is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is simple to implement, computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited to problems that are large in terms of data or parameters. It is also suitable for non-stationary objectives and for problems with very noisy or sparse gradients. Its hyperparameters have an intuitive interpretation and usually require little tuning. Empirical results show that Adam works well in practice and compares favorably with other stochastic optimization methods. In this research, the variant of the method based on the infinity norm (known as AdaMax) was used.
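The hashing-based vectorization described above can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not the library implementation: the tokenizer is a plain regular expression, `zlib.crc32` replaces the MurmurHash3 function used by scikit-learn's HashingVectorizer, and the sign alternation that the library applies by default is omitted. The function names are hypothetical.

```python
import math
import re
import zlib

def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def hashing_vectorize(text, n_features=10_000, norm="l2"):
    """Map a text to a fixed-length vector of hashed token counts."""
    vec = [0.0] * n_features
    for token in tokenize(text):
        # Hashing trick: the index comes from a hash, so no vocabulary is stored.
        idx = zlib.crc32(token.encode("utf-8")) % n_features
        vec[idx] += 1.0
    if norm == "l1":
        scale = sum(abs(v) for v in vec)            # counts become frequencies
    elif norm == "l2":
        scale = math.sqrt(sum(v * v for v in vec))  # projection onto the unit sphere
    else:
        scale = 1.0                                 # raw counts
    if scale > 0:
        vec = [v / scale for v in vec]
    return vec

vector = hashing_vectorize("мова і стиль, мова автора", n_features=64, norm="l2")
```

Because the index is computed directly from the token, the vector length is fixed in advance and does not depend on the size of the corpus vocabulary.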
The number of iterations to perform while training the artificial neural network is a very important issue. We studied it, and the results are shown in Fig. 3, which illustrates how accuracy increases and error decreases with the number of training iterations. As one can see, accuracy rises sharply during the first 4-5 iterations and increases much more slowly afterwards.
It should be noted that the number of iterations required to achieve the desired accuracy can be reduced by 10-15 % through the use of polynomial neurons [21] in the first hidden layer of the network.
The basic parameters and toolsets used in the computational experiments are listed in Table 2. They are constant in all experiments. All experiments use the sample of 1,194 text fragments, which is subsequently divided into two datasets: training and testing. The fragments consist of complete sentences and are at least 500 characters long. From them, we form vectors of dimensionality 10,000, which are fed to the input of the neural network. The Hashing Vectorizer algorithm is used to form these vectors. All experiments use the keras.Sequential neural network model, the Adam optimization algorithm, and the binary cross-entropy loss function.
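The fragmentation rule (at least 500 characters, extended to the end of the sentence) can be sketched in Python as follows. The original preprocessing program is not reproduced here; the sentence-boundary regular expression and the handling of a short trailing remainder (attached to the last fragment) are assumptions.

```python
import re

def split_into_fragments(text, min_chars=500):
    """Split text into fragments of at least min_chars, each ending at a sentence end."""
    # Assumed sentence boundary: whitespace that follows '.', '!', '?' or an ellipsis.
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    fragments, current = [], ""
    for sentence in sentences:
        current = f"{current} {sentence}".strip()
        if len(current) >= min_chars:
            fragments.append(current)
            current = ""
    # Assumption: a remainder shorter than min_chars is attached to the last fragment.
    if current:
        if fragments:
            fragments[-1] = f"{fragments[-1]} {current}"
        else:
            fragments.append(current)
    return fragments

sample = "Aaaa bbbb cccc. " * 100
fragments = split_into_fragments(sample, min_chars=50)
```

With this rule, no sentence is cut in the middle, so fragment lengths vary and are bounded below by `min_chars` whenever the source text itself is long enough.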

The results of the study of the Ukrainian-language journalistic text belonging to a particular author
As a result of the computational experiment, we determined the effectiveness of different artificial neural network structures for the problem of attributing a text in the journalistic style to one of the three authors. The results produced by the developed software are given in Table 3.
Table 3. Results of experiments
Fig. 3. Change in accuracy and error with an increase in the number of iterations: a - increase in accuracy; b - decrease in error
Table 3 shows that, depending on changes to certain parameters or to the architecture of the network, the results become better or worse. The most effective architecture of neurons in the ANN was determined on the database of the studied text fragments by 3 authors (the texts are evenly distributed in alphabetical order beginning with the first word). The 10-fold cross-validation revealed that the ANN architecture [10, relu] [10, relu] [1, sigmoid] is the most effective. With fragments of at least 500 characters and a maximum of 15 iterations, the average accuracy is 0.9161 and the average F1 score (f1) is 0.8609. The experiments showed that these parameters (computational experiment No. 1) are the most effective among those tested. Next, we consider the accuracy of authorship identification separately for each author and check how the dimensionality of the vectors fed to the input of the neural network affects this accuracy.
As Table 4 shows, the authorship of the publicist I. Losev was identified most effectively. We also see that when the dimensionality of the vectors increases from 10,000 to 100,000, the accuracy of identifying the authorship of O. Pokalchuk increases noticeably, from 0.9347 to 0.9559. For the other authors, an increase in dimensionality affects the accuracy only insignificantly.
Consider how the authorship of I. Losev was identified using the example of samples from the dataset.

Discussion of results of studying the identification of authorship of a journalistic text in the Ukrainian language
After training, the developed approach makes it possible to check online, quickly and effectively, whether a text or its fragment belongs to one of the three studied publicists.
The proposed simple scheme (Fig. 2) achieved high accuracy in identifying the authorship of journalistic texts in Ukrainian. This can be explained by several factors. First of all, an effective combination of the vectorization method and the ANN structure, which showed the highest results (Tables 3, 4), was chosen automatically from a large number of possible variants. In addition, the authors selected for analysis are among the best-known contemporary Ukrainian publicists, who have a rather unique and recognizable writing style and whose works are published in the leading Ukrainian periodicals. The stylistic expressiveness of the studied texts is also determined by the peculiarities of the journalistic style: an author has a better opportunity to express his "self" than, for example, in the official-business or scientific style.
Given this, it is appropriate in the future to test this method on broader empirical material. Firstly, a greater number of authors (perhaps with less pronounced individual language peculiarities) and their journalistic texts should be selected. Secondly, it might be interesting to take texts belonging to other styles (scientific, belles-lettres).
The specific features of the style of a particular author are traditionally identified in linguistics, but the reverse process (authorship identification from a text fragment) is a very difficult task without an ANN. Traditional methods of studying the individual style of authors in linguistics are based on general features identified by a person (at the lexical, morphological, and syntactic levels). The ANN itself chooses the criteria for distinguishing the authors, and these criteria can be based on phenomena that a linguist would not notice. Therefore, the next stage of the study is to trace the features that help the ANN to identify the authorship.
Unlike the approaches in other studies, the proposed approach does not identify at the preparatory stage any specific elements that will act as parameters in authorship identification. This process is passed entirely to the vectorization procedure and the ANN. It is the vectorization procedure that converts the input texts into a matrix that records the frequency parameters of lexemes [15]. The main advantage of this approach is that it scales well to large input datasets, because there is no need to keep a dictionary in memory. The shortcomings include the impossibility of computing the inverse transformation (from feature indices back to lexeme strings), which can be a problem when trying to inspect the model and figure out which features are the most important for it. In addition, collisions may occur: distinct lexemes can be mapped to the same feature index. In practice, however, this rarely happens if n_features is large enough (for example, 2^18 for text classification problems). Note that increasing the dimensionality of the input vector can improve the accuracy of the model, although, as Table 4 shows, the gain differs between authors.
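The effect of n_features on collisions can be illustrated with a short sketch. It is only a rough illustration under stated assumptions: `zlib.crc32` is used as a stand-in hash (not the function used by the library vectorizer), and the 100 tokens are synthetic, hypothetical lexemes rather than data from the study.

```python
import zlib

def count_collisions(tokens, n_features):
    """Number of distinct tokens that share a feature index with another token."""
    unique = set(tokens)
    buckets = {zlib.crc32(t.encode("utf-8")) % n_features for t in unique}
    # Every occupied index holds at least one token; the rest are collisions.
    return len(unique) - len(buckets)

tokens = [f"слово{i}" for i in range(100)]  # synthetic lexemes for illustration
small = count_collisions(tokens, 8)         # tiny feature space: collisions are unavoidable
large = count_collisions(tokens, 2 ** 18)   # large feature space: collisions become rare
```

With only 8 indices, at least 92 of the 100 lexemes must share an index with another one, whereas with 2^18 indices the number of collisions cannot be higher and is typically close to zero.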
In developing this direction further, we plan to increase the amount of input data and the number of authors, and to use other types of ANN and other methods of input data processing.