ANALYSIS OF THE DEVELOPED QUANTITATIVE METHOD FOR AUTOMATIC ATTRIBUTION OF SCIENTIFIC AND TECHNICAL TEXT CONTENT WRITTEN IN UKRAINIAN


Introduction
The scheme of combining methods of attribution of Ukrainian scientific and technical text content consists of lexical and syntactic levels [1]. Use of the syntactic level involves calculation of linguistic correlations in word combinations [2]. A model for constructing an author's style profile is proposed in [3]. It consists of a characteristic author's vocabulary and author's syntax [4]. To describe syntax, it is necessary to use a formalized description of linguistic relations.
Methods of attribution of Ukrainian scientific and technical text content were proposed and studied in [1-5]. Various algorithms [14], in particular a quantitative one [15], can be used to implement these methods. This raises the problem of analyzing such algorithms in order to find the most effective one [16].
Authorship identification is a technique for attributing a text when its authorship is in question [17]. It is useful when several people claim authorship of the same publication [18] or when nobody claims authorship of the text content [19], as with so-called trolls in social networks during information warfare [20]. The complexity of the authorship problem obviously grows, exponentially, with the number of probable authors [21]. The availability of samples of the authors' texts is also significant for solving this problem [22]. Text attribution includes the following three problems [23]:
- identification of the text author in a group of probable or expected authors, where the author is always in the group of suspects [24];
- non-identification of the text author in a group of probable or expected authors, where the author may not be in the group of suspects [25];
- assessment of whether or not the text could have been written by the author under consideration [26].
Therefore, the problem of automatic attribution of scientific and technical text content is relevant and requires new, improved approaches to its solution [27].

Literature review and problem statement
Text attribution is the study of a text in order to establish its author or to obtain information about the author and the conditions under which the textual document was created [17]. Attribution is divided into identification and diagnostic tasks [18]. Identification tasks make it possible to verify authorship [19]:
- confirm or exclude authorship of a certain person [20];
- check that the whole text was written by one person [21];
- check whether the person who wrote the text is also its actual author [22].
Identification tasks are solved under the assumption that the author of the text is known [23]. Diagnostic tasks make it possible to determine personal characteristics of the author (educational level, mother tongue, origin, knowledge of foreign languages, place of permanent residence, etc.) and/or the fact of conscious distortion of the written language [24]. Diagnostic tasks are solved under the assumption that the text author is unknown [25]. In these cases, it is usually impossible to compare the text under study with texts of another author [26]. Attribution methods enable study of the text at five levels [27]:
- punctuation (features of the use of punctuation marks and characteristic errors) [28];
- spelling (characteristic errors in word spelling) [29];
- syntactic (features of sentence construction, preference for particular language structures, use of tenses, active or passive voice, word order, characteristic syntactic errors) [30];
- lexical-phraseological (author's vocabulary [31], peculiarities of the use of words and expressions [32], tendency to use rare and foreign words, dialecticisms, archaisms, neologisms, professionalisms, argotisms [33], skill in using phraseologisms, proverbs, sayings, winged words, etc.) [34];
- stylistic (genre [35], general structure of the text [36], plot for literary works [37], characteristic pictorial tools (metaphor, irony, allegory, hyperbole, comparison) [38], stylistic figures (gradation, antithesis, rhetorical questions, etc.) [39], other linguistic techniques).
There are quite a lot of methods for style analysis [42]. In general, they fall into two large groups: expert and formal methods [43]. Expert methods involve study of the text by a professional linguist [44]. Formal approaches include techniques from probability theory and mathematical statistics, algorithms of cluster analysis and neural networks [46]. The most complete classification of the main formal methods of text attribution is given, e.g., in studies [1-8, 47]. Formal methods are most often based on comparison of computed characteristics of texts, as in the theory of pattern recognition [49]. An application of pattern recognition theory to the task of text attribution can be found, e.g., in [50]. In general, the text is mapped to a vector of parameters calculated for it, each of them objectively characterizing a certain set of text features [51]. Thus, the text is represented as a point in n-dimensional space [52]. With such formalization, the author is represented by a similar vector of parameters, derived from the texts written by that author [53].
The distance between the corresponding vectors is calculated as a criterion for proximity of two texts [54]. The sets of parameters and dispersion factors are presented as ordinary vectors in n-dimensional Cartesian space, starting from the origin of coordinates. The distance between the texts is then the usual Euclidean distance between the ends of the corresponding vectors. This distance is an integral characteristic of the difference between texts: texts with a large distance between them belong, with high probability, to different authors. So, in order to compare authorship of two texts, it is enough to calculate parameters for them and determine the distance [55]. To match a text with an author, the vector of the author's parameters and the vector of the given text are compared. In effect, two texts are compared again: the text with the known author (reference text) and the text whose authorship has to be established, confirmed or refuted (the analyzed/investigated text) [56]. Vectors of formal parameters are also constructed that do not distinguish concrete authors (or groups) but establish certain characteristics of authors, e.g. educational level [57]. In most cases, statistical features are chosen as characteristic parameters of the text:
- frequency of use of certain parts of speech, certain specific words, punctuation marks, phraseologisms, archaisms, rare and foreign words;
- number and length of sentences (measured in words, syllables, characters), mean sentence length;
- number of meaningful and auxiliary words;
- volume of vocabulary, ratio of the number of verbs to the total number of words in the text, etc. [58].
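The distance criterion described above can be illustrated with a minimal Python sketch. The function names, the three example parameters and their values are assumptions chosen for illustration only; the sketch also assumes the parameters have already been brought to comparable scales.

```python
import math

def text_distance(u, v):
    """Euclidean distance between two equal-length parameter vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of parameters")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_author(text_vec, author_vecs):
    """Return the author whose reference vector lies nearest to text_vec."""
    return min(author_vecs,
               key=lambda name: text_distance(text_vec, author_vecs[name]))

# Hypothetical, already normalized parameters:
# (mean sentence length, share of verbs, share of auxiliary words)
authors = {"A": (0.40, 0.15, 0.30), "B": (0.70, 0.10, 0.45)}
unknown = (0.42, 0.16, 0.31)
print(closest_author(unknown, authors))
```

A small distance only suggests closeness of style; as the text notes, it does not prove common authorship.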
The main problem of formal methods of authorship analysis is precisely the choice of speech parameters and coefficients [59]. A number of formal statistical characteristics of texts are not suitable for attribution because of one of two shortcomings [1-5, 60]:
- lack of stability. The spread of parameter values in texts of the same author is so great that the ranges of possible values for different authors intersect. Obviously, such a parameter will not help distinguish authors, and when used as part of a group of parameters, it only adds informational noise [61];
- lack of distinguishing ability. The parameter may take close values for all or most authors since its value is determined by properties of the language in which the texts are written and not by individual features of the text author.
Therefore, parameters must be investigated beforehand for stability and ability to distinguish, preferably in the texts of many different authors [62].
The following conditions of applicability of a formal speech coefficient of the author's style are determined in [3, 63-65]:
- mass character (the use of those characteristics of the text that are poorly controlled by the author at the conscious level, in order to eliminate the possibility that the author consciously distorts his or her characteristic style or imitates the style of another author) [3, 63];
- stability (the value remains roughly constant for one participant, with only small deviations from the mean value) [3, 64];
- ability to distinguish (the coefficient takes substantially different values for different authors, that is, the differences exceed the variations that are possible for one participant) [3, 65].
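As a rough illustration of the stability and distinguishing-ability conditions, the following Python sketch compares the spread of a parameter within each author's texts to the spread of the per-author means. The function names, the sample values and the threshold `ratio=2.0` are assumptions made for this example, not values taken from the study.

```python
from statistics import mean, pstdev

def parameter_quality(samples):
    """samples maps author -> list of parameter values, one per text.

    Returns (within, between): the average within-author standard
    deviation and the standard deviation of the per-author means.
    """
    means = [mean(values) for values in samples.values()]
    within = mean(pstdev(values) for values in samples.values())
    between = pstdev(means)
    return within, between

def is_usable(samples, ratio=2.0):
    """Assumed rule of thumb: the between-author spread must exceed
    the within-author spread by the given ratio."""
    within, between = parameter_quality(samples)
    return between > ratio * within

# Hypothetical values of one parameter for two authors.
stable = {"A": [0.50, 0.52, 0.51], "B": [0.70, 0.71, 0.69]}
noisy = {"A": [0.3, 0.7, 0.5], "B": [0.4, 0.8, 0.6]}
print(is_usable(stable), is_usable(noisy))
```

A parameter that fails this check corresponds to the "lack of stability" or "lack of distinguishing ability" cases and would only add noise to the parameter vector.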
It is very hard to choose speech coefficients and parameters which reliably distinguish any two authors [66]. Whatever the parameters, there is always a probability that two or more participants are close by virtue of accidental coincidence [67]. Therefore, it is sufficient in practice that the parameter allows us to confidently distinguish between different subsets of authors, that is, that there exists a sufficiently large number of subsets of authors for which the mean values of the parameter differ significantly [68]. Such a parameter obviously will not help distinguish texts of authors from one subset but will allow us to confidently distinguish between texts of authors who fall into different subsets [69]. Texts of authors of one subset can be distinguished by simultaneous use of a sufficiently large vector of parameters of different kinds. In this case, the probability of accidental coincidence will be noticeably lower [70]. To reliably resolve texts for which the formally calculated parametric distance is small, an additional study by expert methods is necessary, e.g. analysis of key and/or stop (auxiliary) words [70].
Consequently, a study in this direction is necessary because of the lack of practical experiments on identifying the author's style in Ukrainian scientific and technical texts. Many systems have recently been developed to detect plagiarism or protect copyright. Rewritten text is quite difficult to detect in Slavic languages because of the large number of synonyms and the possibility of restructuring sentences with the use of other endings. This difficulty does not apply to auxiliary words, as most people do not even pay attention to them when it comes to plagiarism. This motivates exploring the problem of identifying the author's style to determine the degree of belonging of a particular text to a particular author.

The aim and objectives of the study
The objective of this study was to analyze quantitative algorithms for automatic attribution of Ukrainian scientific and technical text content on the basis of stylistics and NLP methods.
To achieve the objective, the following tasks were formulated:
- develop a method for attribution of the text based on analysis of algorithms and coefficients of the author's lexical speech in a reference text;
- develop content analysis software for attribution of Ukrainian texts based on stylistic analysis of speech coefficients of the text content;
- analyze results of experimental testing of the proposed method based on content analysis for comparison of algorithms of automatic attribution of Ukrainian scientific and technical texts.

Method for determining a style of the text content
Several algorithms were taken as the basis of the developed method.
Algorithm I. Pre-processing of data based on content analysis (parsing, segmentation and tokenization of the text and linguistic analysis of the text).
Algorithm II. Calculation and analysis of speech parameters for each author (frequency of word use, number of punctuation marks, characters, sentences, words and the ratio of the number of characters to the number of sentences).
Algorithm III. Calculation and analysis of speech coefficients for each author (lexical diversity, degree of syntactic complexity, speech coherence, exclusivity and text concentration indexes).
Algorithm IV. Classification of project participants according to these coefficients (three classifiers are used: fuzzy, SVM and a combination of the two).
Algorithm V. Performance analysis to determine exactness of each classifier.
Algorithm VI. Identification of a subset of probable authors from the set of all investigated participants (algorithms VIII-XI) by superimposing the filters.
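Algorithms II and III can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the coefficient formulas below are definitions commonly used in quantitative stylistics and are assumed here, since this text does not state them; the sentence splitter is deliberately naive; `string.punctuation` covers only ASCII punctuation; all function and key names are hypothetical. The threshold of 10 occurrences follows the "words with 10 or more occurrences" parameter mentioned in the conclusions.

```python
import re
import string
from collections import Counter

# Letters only, with optional apostrophes inside a word (as in Ukrainian).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*")
SENTENCE_END_RE = re.compile(r"[.!?]+")  # naive sentence delimiter

def speech_parameters(text, function_words):
    """Algorithm II (sketch): raw counts for one text."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    sentences = [s for s in SENTENCE_END_RE.split(text) if s.strip()]
    freq = Counter(words)
    return {
        "words": len(words),
        "unique_words": len(freq),
        "sentences": len(sentences),
        "punctuation": sum(ch in string.punctuation for ch in text),
        "characters": len(text),
        "function_words": sum(c for w, c in freq.items() if w in function_words),
        "hapax": sum(1 for c in freq.values() if c == 1),
        "frequent": sum(1 for c in freq.values() if c >= 10),
    }

def speech_coefficients(p):
    """Algorithm III (sketch): coefficients under ASSUMED definitions."""
    n, s = p["words"], p["sentences"]
    if n == 0 or s == 0:
        raise ValueError("text must contain at least one word and one sentence")
    return {
        "chars_per_sentence": p["characters"] / s,
        "lexical_diversity": p["unique_words"] / n,
        "syntactic_complexity": 1 - s / n,
        "coherence": p["function_words"] / (3 * s),
        "exclusivity": p["hapax"] / n,
        "concentration": p["frequent"] / n,
    }

params = speech_parameters("Текст і стиль. Автор і текст!", {"і", "та"})
print(speech_coefficients(params))
```

The resulting dictionary can serve as the parameter vector of a text in the classification step (algorithm IV).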
To achieve the study objective, a lexer-type system with the ability to select the language or languages of the analyzed content was developed and implemented at the Victana web-site [16]. The lexer (tokenizer, segmenter) is the section of the natural-language text analyzer whose task is to define the basic structural units in the text, lexemes, and to recognize them by comparison with dictionary forms or other morphological samples. The lexer produces a complex data structure, the tokenization graph, which is the source material for the syntactic parser (Fig. 1). The graph nodes contain tokens. Each token stores information about the location of the extracted word in the original text (index of the first character and the number of characters in the word), the word itself and the results of its identification. At the left, there is always a special token that marks the beginning of the sentence. Each leaf in the graph is a special sentence-end token, so each path in the graph ends with a special token. In most cases, this token denotes the right boundary of the sentence. Thus, the parser is able to take into account the proximity of words to the sentence boundaries, which is useful for optimizing some rules of token filtering.
The lexer works in very close collaboration with the text parser. The words recognized by the lexer confirm or refute the parser's hypotheses about the syntactic structure of the text. On the basis of the current context, the parser puts forward new hypotheses that interrupt or continue certain tokenization paths in the graph. Thus, token chains are extended while the parser rules run and are immediately checked against the conditions in these rules. For example, if syntactic rules are not taken into account, the chain "ямаладонькататамого" allows many variants of breakdown into words (Fig. 2).
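The ambiguity of this chain can be reproduced with a small Python sketch that enumerates every lexicon-compatible path, which is roughly what the paths of a tokenization graph represent. The lexicon below is a hypothetical toy word list chosen for this example, and the recursive search is an illustrative stand-in, not the lexer described in [16].

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon words."""
    results = []

    def walk(start, path):
        if start == len(text):
            results.append(path)  # a complete path through the graph
            return
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in lexicon:
                walk(end, path + [word])
        # If no word matches at this position, the path is simply dropped.

    walk(0, [])
    return results

# Hypothetical toy lexicon (an assumption for illustration only).
LEXICON = {"я", "яма", "мала", "ладонька", "донька",
           "та", "тата", "там", "мого", "ого"}

for path in segmentations("ямаладонькататамого", LEXICON):
    print(" ".join(path))
```

With this toy lexicon the chain has six complete segmentations, so choosing among them requires syntactic context rather than the lexicon alone.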
Some paths are interrupted because no appropriate word can be found in the vocabulary. A simple greedy algorithm that follows the rule "take from the input buffer the longest word found in the lexicon" does not work here. The lexer appears in the text processing system [16] as a result of decomposition of the parsing problem. It simplifies implementation of the morphological and syntactic analyzers (Fig. 3) since it allows one to work with larger units, lexemes. The simplicity introduced in this way implicitly limits the generality of the entire system since the idea of splitting the text into independent lexemes does not work for all languages. Moreover, even for languages in which words are naturally separated in writing, complex effects of merging words into larger units appear in speech. In Germanic languages, this is reflected in writing as articles and prepositions merged with other words. The lexer and tokenizer can work without any explicit rules, using only information in the lexicon. Only the specification of the type of word boundaries in the language and the list of separator characters is more or less mandatory. Additional rules help solve some practical tasks, increasing the efficiency of the grammar engine. In particular, the rules make it possible to reduce ambiguity of word identification through partial removal of homonymy. The algorithms designed to solve the above tasks allow different settings for the object language and for the features of the processed texts and messages. Rules were created for each type of setting. Rules are written in text source files of the vocabulary according to the given specifications. The specifications are designed so that the rules can be easily edited in any plain text editor or generated by software, e.g. as a result of statistical processing of language corpora. When translating these specifications, the vocabulary compiler formally checks correctness of the rules, optimizes them and saves them in a special internal representation. The engine then loads the compiled rules during text parsing, usually without wasting time on their syntax (algorithm VII). In this way, a compromise is achieved between convenience of writing rules and efficiency of the engine.
Algorithm VII. Segmentation of the text content.
Step 2. Definition of the lexeme boundaries.
Step 3. Definition of complete word-forms.
Step 4. Identification of indivisible tokens having dots, spaces, etc.
Step 5. Splitting the text into sentences. The characters that delimit sentences (period, question mark and exclamation mark) are determined by the appropriate parameter in the language description. Another parameter in the language description specifies the maximum length of a sentence. It is used to prevent overflow of internal buffers and looping when parsing complex formatted texts in which the algorithm cannot find the sentence end marker. If a period is used as a separator, it is processed in a special way, unlike the '?' and '!' signs. The reason is that some words may contain a period, and this should not cause a sentence break. Typical examples are shortenings of the "etc." type or abbreviations such as "N.Y.". Similarly, numbers with a decimal point like "9.3" are specially processed. Processing of such exceptions (tokens with a dot inside) relies on the tokenizer's ability to recognize special chains with separators inside in a stream of characters.
A period after a complete word form is considered an absolute sentence separator. A list of special tokens is used for this purpose. If a complete word form follows such a token, starts with a capital letter, and the vocabulary lexicon notes that the entry does not begin with a capital letter, then the special token is the boundary of the sentence. For example, the first sentence in the text "Text, video, etc. Message, article, etc." will be cut after "etc." since the next word (Message) starts with a capital letter.
The minimum length of the full word form is used in the case when a period stands after a full word. Since the segmenter usually looks ahead at the following characters and treats a next word beginning with a capital letter as the sign of a sentence boundary, the text "sun. sea. sand" would be considered one sentence. A corresponding rule forces the segmenter to check the word before the period against the lexicon and, in case of success, to consider the period a sentence boundary regardless of the case of the next word's characters. The corresponding parameter makes it possible to avoid unnecessary checks for cases like "etc. characters". It sets the minimum length of the analyzed complete word.
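The period-handling rules of Step 5 can be approximated with the following Python sketch. It is a simplified illustration, not the segmenter of [16]: it works on whitespace-separated chunks, omits the maximum-sentence-length parameter and the check of whether the next word is capitalized in the lexicon, and the abbreviation list, lexicon and minimum word length are hypothetical values.

```python
ABBREVIATIONS = {"etc.", "e.g.", "n.y."}  # hypothetical special tokens
LEXICON = {"sun", "sea", "sand"}          # hypothetical complete word forms
MIN_WORD_LENGTH = 3                       # assumed minimum length for the lexicon check

def is_boundary(chunk, next_chunk):
    """Decide whether a sentence ends after chunk."""
    if chunk.endswith(("?", "!")):
        return True                       # '?' and '!' always end a sentence
    if not chunk.endswith("."):
        return False                      # e.g. "9.3" has its dot inside
    next_is_capitalized = next_chunk is None or next_chunk[:1].isupper()
    if chunk.lower() in ABBREVIATIONS:
        return next_is_capitalized        # "etc. Message" splits, "etc. and" does not
    core = chunk.rstrip(".").lower()
    in_lexicon = len(core) >= MIN_WORD_LENGTH and core in LEXICON
    return next_is_capitalized or in_lexicon

def split_sentences(text):
    chunks = text.split()
    sentences, current = [], []
    for i, chunk in enumerate(chunks):
        current.append(chunk)
        next_chunk = chunks[i + 1] if i + 1 < len(chunks) else None
        if is_boundary(chunk, next_chunk):
            sentences.append(" ".join(current))
            current = []
    if current:                           # trailing text without an end marker
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Text, video, etc. Message, article, etc."))
print(split_sentences("sun. sea. sand"))
```

Splitting on whitespace keeps tokens such as "9.3" and "N.Y." intact, which mirrors the requirement that tokens with a dot inside must not break the sentence.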
In addition to defining boundaries of lexemes, the lexer also performs preliminary recognition of morphological attributes of words by transforming lexemes into tokens. To this end, the lexer uses information in the lexicon and rules for recognizing non-vocabulary tokens as well as a number of auxiliary algorithms, including fuzzy recognition. When a word is recognized, characteristics such as its part of speech and a set of grammatical attributes are defined. Noun phrases, Ñ, and verbal phrases, Ř, are distinguished in the structure of Ukrainian sentences with direct word order (Fig. 4, 5) [1-5].
A user can only observe how the tree of constituents or syntactic structure of the analyzed sentence (Fig. 6) is obtained.
A vocabulary entry of a lexeme form is also defined for vocabulary lexemes. In alphabetic-frequency dictionaries, word characteristics are given after a slash (Fig. 7), where A denotes a verb, the uppercase English letters are additional characteristics of the verb, V denotes an adjective, and the lowercase English letters indicate the noun characteristics.
The rules for reducing words to their stems are kept in the database (Fig. 9, a), where 'flag' is the rule for word type identification (e.g. noun, singular), 'mask' is the word flexion (exceptions are given in square brackets), 'find' is the word flexion in the nominative case, and 'repl' is the word flexion in the inflected form (Fig. 10).
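One possible reading of such a rule table is sketched below in Python. Only the field names `flag`, `mask`, `find` and `repl` come from the description above; their exact semantics in the database of [16] are not specified here, so the interpretation (a regular-expression mask on the base form, `repl` replaced by `find` to recover the base form), the two sample rules and the function name are assumptions for illustration.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class FlexionRule:
    flag: str  # word type the rule applies to, e.g. "noun"
    mask: str  # pattern the base form must match; [^...] lists exceptions
    find: str  # flexion of the base (nominative) form
    repl: str  # flexion of the inflected form

# Hypothetical rules: "книги" / "книгу" -> "книга".
RULES = [
    FlexionRule(flag="noun", mask=r".*[^ь]а", find="а", repl="и"),
    FlexionRule(flag="noun", mask=r".*[^ь]а", find="а", repl="у"),
]

def to_base_form(word, rules):
    """Return (base form, flag) using the first matching rule,
    or (word, None) when no rule applies."""
    for rule in rules:
        if rule.repl and word.endswith(rule.repl):
            candidate = word[: -len(rule.repl)] + rule.find
            if re.fullmatch(rule.mask, candidate):
                return candidate, rule.flag
    return word, None

print(to_base_form("книги", RULES))
print(to_base_form("стіл", RULES))
```

In practice a candidate base form would also have to be confirmed against the vocabulary, since purely suffix-based rules overgenerate.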
In addition, the database (Fig. 9, b) contains a vocabulary of auxiliary words, that is, words that serve as additional parameters for analysis of the author's speech style; taking them into consideration in the analysis of texts has a significant effect on the final result. As a result, we obtained the values given in Table 2 (algorithm VIII). Columns A contain the result of analysis of all values of the vectors of speech coefficients and parameters for the authors from Table 1. Columns B contain the result of analysis of only the last 5 columns in Table 1. Unfortunately, the results of this algorithm suggest that the mentioned authors are unlikely to have written the papers by themselves (the best results are highlighted in red), and they are not sufficient to assert that these authors are the actual authors of more than 50 % of these composite papers. On the other hand, the algorithm does yield one good result: it reduces the number of authors at the first stage of attribution (to 34.04 % of the total number of project participants). This is necessary for further filtering by means of analysis of stop words (prepositions and conjunctions) and keywords, features of semantics and vocabulary in the construction of sentences, etc.
where V[l] is the array of mean absolute values of deviations of data points from the mean value. As a result, the values given in Table 2 (algorithm IX) were obtained. The results improved a little, but not enough to assert that authors Nos. 6 and 30 are the actual authors of the composite papers 1-4, although they actually wrote them. On the other hand, the number of authors with a similar speech style increased slightly (up to 38.56 % of the total number of project participants). Now, let us analyze algorithm X. Also, replace the condition in the third loop of algorithm 1 with the following: As a result, the values given in Table 2 (algorithm X) are obtained. As can be seen, the obtained values give firm grounds to assert that the style of authors Nos. 6 and 30 is rather close (over 75-100 %) to the style of composite papers 1-4, respectively (positive results are highlighted in red). However, the number of authors with a similar speech style increased significantly (up to 42.02 % of the total number of project participants). Moreover, many authors who were not in the list at the previous stages of the study now appear in it and, conversely, some who were in the list at the previous two stages dropped out of the present list. Next, let us try to reduce that total number by applying algorithm XI to the obtained initial data, namely the speech parameters and coefficients of the 94 project participants. Improve the condition in the third loop (by filtering) in algorithm X as follows:

Discussion of results obtained in the study of author's style in Ukrainian scientific and technical texts
Detailed graphs of the results obtained using algorithms VIII-XI (Nos. 1-4, respectively) for analysis of our method of style attribution are given in Fig. 11. At the next stage, analysis of stop words (prepositions and conjunctions) and key words in papers of the authors who fell within those 38.03 % was used to attribute the author's style (Fig. 12). Each individual uses their own special vocabulary to convey their thoughts, including so-called filler words (e.g. "that is", "therefore", "though", etc.) and auxiliary words ("and", "but", "at least"). The analysis of auxiliary and key words, taking into account various filters (analysis of full texts with references and abstracts in various languages, and analysis of only the informative part of the publication, i.e. the main text), with compilation of frequency vocabularies containing 200, 10 and 50 words respectively, is given in Fig. 12. However, it should be noted that the number of texts sampled for analysis (over 200) and the number of authors (94) are too small to guarantee exact results. The study should be continued with more texts (which are not always available). In the future, it is also necessary to improve the method by analyzing the texts using methods of stylometry and glottochronology.
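Compilation of a frequency vocabulary of a given size, as mentioned above, can be sketched in Python as follows. The function names and the sample word lists are hypothetical, and the Jaccard overlap used to compare two vocabularies is an assumption made for illustration; the study's own comparison procedure is not described in this text.

```python
from collections import Counter

def frequency_vocabulary(words, size, stop_words=None):
    """Return the `size` most frequent words as (word, count) pairs,
    optionally restricted to a given set of stop (auxiliary) words."""
    if stop_words is not None:
        words = (w for w in words if w in stop_words)
    return Counter(words).most_common(size)

def vocabulary_overlap(vocab_a, vocab_b):
    """Jaccard overlap of the word sets of two frequency vocabularies."""
    a = {w for w, _ in vocab_a}
    b = {w for w, _ in vocab_b}
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical tokenized texts of two authors.
author_a = "і але і та і але тому".split()
author_b = "і та і та отже".split()

vocab_a = frequency_vocabulary(author_a, 2)
vocab_b = frequency_vocabulary(author_b, 2)
print(vocab_a, vocab_b, vocabulary_overlap(vocab_a, vocab_b))
```

Restricting the vocabulary to stop words keeps the comparison independent of the topic of the publication.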

Conclusions
1. A method for attribution of texts based on analysis of coefficients of the author's lexical speech in a reference excerpt of the author's text was developed. Establishment of the author's style is based on a comparative analysis of these coefficients (speech connectivity, lexical diversity, syntactic complexity, concentration and exclusivity indexes) for the author's excerpt and for the other analyzed excerpt. The values of the coefficients are then compared to determine the degree of belonging of the analyzed text to a particular author. The developed method features adaptation of the morphological and syntactic analysis of lexical units to peculiarities of Ukrainian word/text structure. That is, analysis of linguistic units of the word type took into account their part of speech and their inflection within this part of speech. To this end, we analyzed flexions of these words for classification and extraction of word stems for compilation of the corresponding alphabetic and frequency dictionaries. These dictionaries were further taken into account in subsequent steps of text attribution, such as calculation of parameters and coefficients of the author's speech. It is precisely the auxiliary (stop or reference) words that are indicative of the individual writer's style because they are in no way related to the topic and content of the publication. An algorithm for defining stop words of the text content on the basis of linguistic analysis of the text content was designed. It features adaptation of the morphological and syntactic analysis of lexical units to peculiarities of the structure of Ukrainian words/texts. Theoretical and experimental substantiation of the method of content monitoring and definition of stop words of Ukrainian texts was made. The method is aimed at automatic detection of significant stop words in a Ukrainian text by means of the proposed formal approach to parsing of scientific and technical text content.
2. A formal approach to attribution of Ukrainian texts was proposed. The study was conducted with Ukrainian scientific and technical texts. The author's method of attribution was decomposed on the basis of analysis of such speech coefficients as lexical diversity, degree of syntactic complexity, speech connectivity, and indexes of text exclusivity and concentration. In parallel, author's style parameters were analyzed, such as the number of words in a particular text, the total number of words in this text, the number of sentences, the number of prepositions and conjunctions, the number of words with one occurrence and the number of words with 10 or more occurrences. The developed system has analyzed over 200 individual scientific publications from all issues of Lviv Polytechnic National University Bulletin, Information Systems and Networks series, for the period from 2001 to 2017.
3. The results of application of the designed algorithms of automatic attribution of the text content on the basis of NLP and stylometry methods were analyzed. Prospects and peculiarities of application of information stylometry methods for attribution of text content were considered. Quantitative analysis of scientific and technical text content uses the benefits of content monitoring and content analysis of texts based on NLP, Web-Mining and stylometry methods to determine the number of authors whose speech styles are similar to that of the text fragment being studied. This narrows the search for later use of stylometry methods to determine the degree of belonging of the analyzed text to a particular author. The results obtained with 200 individual technical papers written by about 100 different authors during the period from 2001 to 2017 were compared to determine whether the coefficients of diversity of the texts of these authors varied over different time intervals. Experimental results of the proposed approach were obtained for determining whether the analyzed text belongs to a particular author in the presence of a reference information stream of the author's text content. Excluding the introduction and conclusion sections from the analysis somewhat improved the results, as the main section usually reveals the author's style when describing the essence of the study. This is achieved through training the system, checking the refined stop words and using the refined idioglossary.

Fig. 3. The result of syntactic analysis of a Ukrainian sentence

Fig. 6. An example of syntactic structure of the analyzed sentence

Table 1
Result of work of the algorithm for analyzing the author's publication style at Victana's information resource [16]