DEVELOPMENT OF THE QUANTITATIVE METHOD FOR AUTOMATED TEXT CONTENT AUTHORSHIP ATTRIBUTION BASED ON THE STATISTICAL ANALYSIS OF N-GRAMS DISTRIBUTION

The peculiarities of applying linguo-statistical technologies to identify the style of the author of scientific and technical text content are considered. Quantitative linguistic analysis of a text uses the benefits of content monitoring based on NLP methods to identify and analyze the set of stop words, keywords, and set phrases, and to study N-grams. The latter are used in linguometry methods to determine, as a percentage, whether a given text belongs to a particular author. The quantitative method for automatic text content authorship attribution was developed based on statistical analysis of the 3-gram distribution. An approach to identifying the author of a Ukrainian-language text of scientific and technical profile was proposed. Experimental results of applying the proposed method to determine whether the analyzed text belongs to a specific author, given a reference text, were obtained. Applying linguo-statistical analysis of 3-grams to a set of articles makes it possible to form a subset of publications that are similar in their linguistic descriptions. Imposing additional conditions in the form of statistical and quantitative analyses (a set of keywords, set expressions, stylometric and linguometric analyses, etc.) on this subset allows a significant reduction of the subset, narrowing it to the most likely authors. For qualitative and effective content analysis when determining the degree of authorship of a particular author, we propose to analyze the reference text and the one under consideration in several stages: linguometric analysis of the coefficients of the diversity of the author's speech, stylometric analysis, analysis of set expressions, and linguo-statistical analysis of 3-grams. For automated text processing, not only the frequency of occurrence of a certain category but also its very presence in the studied text is important.
Quantitative computation makes it possible to draw objective conclusions about the orientation of materials from the number of occurrences of the units of analysis in the studied texts. Qualitative analysis does the same, but by studying whether (and in what context) a certain important, original category occurs at all.

writing, format (a novel, an essay, a scientific article), emotional coloration, style of speech, as well as the problem of text authorship attribution [2]. With simplified access to various data and the expanded ability to find, copy, and distribute information on the Internet, the problem of authorship attribution is becoming increasingly relevant [3]. Problems related to authorship attribution are important in linguistic, historical, and forensic research [4]. The availability of electronic devices makes it possible to move away from authorship recognition involving a large number of experts, accelerating and simplifying this process through automation [5]. Authorship attribution is defined as the process of recognizing the author by a set of general and individual features of a text constituting the author's style [6].
Statistical methods for authorship attribution based on the search for the author's invariant enjoy great popularity today [7]. An invariant is a characteristic of the linguistic peculiarity of a text (lexical, grammatical, phraseological, etc.) [8]. In particular, an invariant may be the percentage of vowels/consonants, the frequency of use of a certain part of speech, the probability of transitions from one part of speech to another, parasite words, information entropy, etc. [9]. The statistical method of identifying the author and the genre of a text based on the distribution of frequencies of letter combinations (N-grams) is also effective. However, the accuracy of statistical methods for authorship attribution relies heavily on the specifics of the data used: the speech style and text length [11]. Due to this, it is difficult to draw conclusions about the accuracy of this approach on scientific and technical articles [12]. For this reason, it is necessary to analyze the applicability of such mathematical apparatus as the distribution of frequencies for different languages, simultaneously with other techniques, when solving the problem of authorship attribution for texts of different lengths written in different speech styles [13]. Methods of authorship attribution for Ukrainian-language text content in the scientific and technical area are proposed and studied in papers [1-7]. Various algorithms can be used to implement these methods [14], including quantitative ones [15]. Therefore, there arises the problem of analyzing such algorithms to find the most effective one [16]. Authorship attribution is the technique of determining the author of a text when it is ambiguous who wrote it [17]. This is useful when several people claim authorship of one publication [18] or when no one claims to be the author of the textual content [19], for instance, so-called trolls in social networks during information warfare.
The complexity of the problem of determining a text's author evidently grows exponentially as the number of likely authors increases [21]. The existence of samples of the author's texts is also significant in addressing this problem [22]. The author's text attribution includes the following three issues [23]:
-identification of the author of a text from a group of probable or expected authors, where the author is always in the group of suspects [24];
-non-identification of the text author from a group of probable or expected authors, where the author may not be in the group of suspects [25];
-evaluation of the possibility that the text was or was not written by a given author.
Therefore, the problem of automatic authorship attribution for the text content of scientific and technical direction is relevant and requires new (more advanced) approaches to its solution [27].

Literature review and problem statement
Papers [1-3, 28] show the results of studies of language and speech material on a representative array of texts. This should be a homogeneous array (corpus) of certain units, that is, the general totality (GT) [3]. It was shown that the volume and character of the GT depend on the tasks of the study. For example, if the peculiarities of Ivan Franko's style are explored, the GT is all his works. If we explore the Ukrainian language of the XX century, the GT is all the texts (spoken and written) of the XX century [3]. The boundaries of the latter are difficult to identify, and it is simply impossible to explore entire oral speech [29], especially when analyzing the author's speech. The issues related to attribution of authorship of a text in collective works of scientific and technical direction based on the analysis of a reference sample also remain unresolved. The reason is the lack of experiments in this direction. Another reason is insufficient statistics for forming conclusions, since authors in this direction rarely write papers individually, and in some areas works are generally rarely written cooperatively. An option for overcoming the corresponding difficulties, when an overall study of the GT is impossible, is sampling and the formation of a set of parameters for the corresponding analysis [1-3, 30]. All this gives grounds to argue that it is appropriate to conduct research dedicated to textual content authorship attribution based on statistical analysis of the distribution of characteristics of the author's speech with a sufficient number of data samples.
Samples are a certain amount of material, by studying which it is possible to draw correct conclusions about the entire GT [31]. The basic requirements for samples are representativity and homogeneity [32]. To be representative, a sample should [33]: 1) be evenly distributed across the GT [34]; 2) have a volume large enough for correct conclusions about the GT [35].
There are two types of sampling uniformity: linguistic and statistical.
Within the linguistic homogeneity of the sampling, the following types are distinguished [3]:
-chronological (sample texts must have chronological boundaries) [36];
-genre (sample texts should be genre-limited) [37];
-thematic (texts should be thematically limited) [38].
Statistically homogeneous samples are samples in which the studied units have statistical behavior that does not differ substantially among them [39]. If the average frequency of a phenomenon (letters, morphemes, words, word length, sentence length, etc.) in a single sample does not significantly differ from its frequency in other samples, then these samples are statistically homogeneous in relation to this phenomenon [40].
According to the way of organization, the following kinds of samples are distinguished [3]:
-mechanical: organized taking into consideration the uniformity of distribution of a studied unit in the general totality [41]. All texts of the general totality are numbered, and then, for example, a segment of the necessary length is chosen from every fifth, tenth, or twentieth text [42];
-random: arranged by random choice of texts from the GT [43]. At the core of this method of organizing samples is the hypothesis that a sufficiently large number of randomly chosen units from the GT must adequately represent it [44]. Thus, each page, section, or other unit of a text of the GT should have the same chance of getting into the sample. Therefore, as a rule, random sampling is based on a table of random numbers [45];
-zonal (typical): organized on the basis of a linguistically homogeneous totality of texts, that is, zones [46]. Depending on the purpose of the study, a zone can be prose, poetry, or drama in fiction; the works of one author or a specific work; the totality of words of a certain morpheme structure (for example, prefixed or monomorphemic), etc. [47].
Samples may be structural, that is, composed of smaller parts (sub-samples), or non-structural, that is, continuous [48].
The ratio between the frequency of language and speech units can be shown by an example: "if one takes 33 bingo barrels, glues the letters of the Ukrainian alphabet onto them and mixes them, the probability that the first barrel taken bears a pure vowel will be 6:33 (6 pure vowel letters (а, о, у, е, и, і) to all 33 letters of the Ukrainian alphabet), that is, approximately 18 %" [3]. If one takes a random Ukrainian text and chooses one letter from it at random, the probability that it turns out to be a pure vowel will be approximately 30 % [3]. In the first case, we deal with the probability of a group of six letters at the paradigmatic level (language); in the second case, at the syntagmatic level (speech) [49]. To assume that all vowels, or all case forms, or all members of the sentence are equally probable would mean replacing natural speech with its scheme [50]. Thus, speech prefers a small number of units (the prevalence law), which constitutes the core of the speech subsystem, whereas in the language all units are equally probable [51]. In different languages, the frequency of the same letter or sequence of letters is different, so knowing the order of the most frequent letters, bigrams, trigrams, and four-grams of a particular language, it is possible to identify it automatically [52]. The frequency of these units in a language is determined using representative sampling, since frequency in the works of specific authors, styles, or themes also differs [53]. For example, it was found for Ukrainian texts that the frequencies of vowels, consonants, spaces between words, as well as groups of consonants (soft, sonorant) can be considered statistical style parameters [1-7].
The frequencies of letters in texts were studied for the needs of cryptography (the science of enciphering and deciphering messages), in particular for the Morse code (the more frequent a letter or letter combination, the shorter the sequence of dots and dashes designating it), for shorthand, for automatic language identification, for confirming or denying the authorship of a work, etc. [54]. Morphemes and grammatical categories also have their own quantitative characteristics:
-non-uniform use of foreign-language morphemes and language-specific morphemes [55];
-verbs of present, past, and future tense, and of indicative, conditional, and imperative moods [56];
-forms of the verb (infinitive, personal forms, participles, impersonal forms ending in -но, -то) [57];
-different parts of speech, depending on the style [58].
A regularity was found whereby the quantitative ratio of the functioning of different cases differs across functional styles [59]. For example, scientific prose prefers the genitive case and neglects the nominative case, while the opposite is true for spoken language, etc. [3].
The quantitative characteristics of words are best seen in the WF [59]. The functional dependence between word frequency and polysemy, as well as between word frequency and its rank in a dictionary ordered by descending frequency, is expressed by the Zipf-Mandelbrot law. The most frequent words are functional parts of speech or general abstract concepts [60]. By contrast, words with a specific meaning (required for conversation in an ordinary situation) are low-frequency words [61]. Although they are rarely used, they are always in the speaker's mind [62]. In other words, the frequency criterion is supplemented by the subject-matter criterion [3]. The formula to establish the synonymy degree (semantic proximity) of words is [63]: C = 2c/(n1 + n2), where n1 is the number of meanings of the first word, n2 is the number of meanings of the second word, and c is the number of meanings common to the given pair of words. Quantitative characteristics of syntactic constructions also depend on the functional style: simple uncomplicated, even incomplete and broken sentences prevail in the colloquial everyday style, while composite sentences, complicated by constructions, parentheses, and inserted structures, dominate in the official style [64].
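As a sketch, the synonymy-degree formula above can be computed directly; the meaning counts in the usage example are illustrative, not taken from a real dictionary:

```python
def synonymy_degree(n1: int, n2: int, c: int) -> float:
    """Semantic proximity of two words: C = 2c / (n1 + n2),
    where n1 and n2 are the meaning counts of the two words and
    c is the number of meanings they share."""
    return 2 * c / (n1 + n2)

# Hypothetical example: word A has 4 meanings, word B has 2,
# and they share 2 meanings.
print(synonymy_degree(4, 2, 2))  # 2*2/(4+2) ≈ 0.667
```

C ranges from 0 (no common meanings) to 1 (fully coinciding meaning sets), which makes it convenient as a normalized proximity score.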
The tempo of speech-thought can be represented simply as the ratio of the number of independent words to the number of simple sentences, since the fewer words included in one sentence, the more frequent the sentences (and, subsequently, the thoughts) [65]. It was revealed that the tempo of speech-thought in a fairy tale is 2.39, and in a scientific text only 0.42 [3]. This means that speech and action in a fairy tale unfold almost 6 times as fast. And this is understandable: in a fairy tale, the thoughts and statements expressed are simple in structure, and therefore faster and easier to construct in a dynamic sequence; in a scientific article, the structure of a speech-thought is much more complicated, so the consciousness channels pass the units of such speech-thought more slowly [66].
It is logical to measure the speech coherence coefficient as the ratio of the number of prepositions and conjunctions to the number of separate sentences [67]. Let this coefficient equal unity when one sentence has, on average, three connecting elements (prepositions and conjunctions) [3]: K_z = (P + C)/(3N), where P is the number of prepositions, C is the number of conjunctions, and N is the number of separate sentences. It was found that the text of a fairy tale has a coherence coefficient of 0.77, and the text of a scientific article 3.0; that is, coherence in the second text is 3.9 times stronger than in the first [3].
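The coherence coefficient is a one-line computation; a sketch with illustrative counts (not taken from the fairy-tale or article corpora cited above):

```python
def coherence_coefficient(p: int, c: int, n: int) -> float:
    """Speech coherence K_z = (P + C) / (3N): equals 1.0 when the
    text averages three connecting elements (prepositions and
    conjunctions) per sentence."""
    return (p + c) / (3 * n)

# Hypothetical counts: 5 prepositions and 4 conjunctions over 3 sentences.
print(coherence_coefficient(5, 4, 3))  # (5+4)/(3*3) = 1.0
```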
The synthetics index of a language is defined as M/W, where M is the number of morphs in a certain segment of text and W is the number of words in that segment [68]. Languages with an index from 1 to 2 are considered analytical, from 2 to 3 synthetic, and 3 or more polysynthetic [69]. The Vietnamese language has the lowest value, 1.06, that is, 106 morphs per 100 words; the Eskimo language has the highest, 3.72, that is, 372 morphs per 100 words [3]. English has an index of 1.68, Russian 2.33 [3]. By the synthetics index, analytical languages include Vietnamese, Chinese, Persian, Italian, German, and Danish; synthetic languages include Ukrainian, Russian, Sanskrit, Lithuanian, Czech, Polish, and Yakut; polysynthetic languages include Eskimo, Native American, and Ibero-Caucasian languages [69]. Due to the increasing availability and spread of text documents in electronic form, the importance of automatic methods for analyzing document content has increased [70]. Text analysis can be considered to include the tasks of document classification and clustering by various criteria, for example, genre, epoch of writing, format (novel, essay, sketch), emotional coloration, and style of speech, as well as the task of determining the author of a text [71]. With simplified access to different data and expanded capabilities for searching, copying, and distributing data in networks, the task of authorship attribution is becoming even more urgent [72]. Similarly, issues related to authorship attribution are important in linguistic, historical, and forensic studies [73]. The availability of electronic devices makes it possible to move away from authorship attribution involving a large number of experts, accelerating and simplifying this process through automation [74].
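The synthetics index and its classification thresholds can be sketched as follows; the numeric examples reuse the indices quoted in the text:

```python
def synthetics_index(morphs: int, words: int) -> float:
    """Language synthetics index M/W: morphs per word in a text segment."""
    return morphs / words

def classify_language(index: float) -> str:
    """Index 1..2 -> analytical, 2..3 -> synthetic, 3+ -> polysynthetic."""
    if index < 2:
        return "analytical"
    if index < 3:
        return "synthetic"
    return "polysynthetic"

print(classify_language(1.68))  # English  -> analytical
print(classify_language(2.33))  # Russian  -> synthetic
print(classify_language(3.72))  # Eskimo   -> polysynthetic
```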
The notion of the authorship attribution is defined as the process of finding the author by many general and private features of a text that constitute the author's style [75].
In existing systems of text authorship attribution, statistical methods based on the search for the "author's invariant" are popular [76]. The "author's invariant" characterizes the language peculiarity (lexical, grammatical, phraseological, and other) of a text [77]. An invariant can include the proportion of vowels or consonants, the frequency of use of a certain part of speech, the probability of transitions from one part of speech to another, "favorite" words, information entropy, and so on [78]. In paper [3], a statistical method for determining the author and genre of a text based on the distribution of frequencies of letter combinations (N-grams) was proposed. The method showed decent results for works of Russian literature [79]. However, the accuracy of statistical methods for authorship attribution relies heavily on the specifics of the data used: the language in which the texts are written [80], the speech style of the text [82], and, above all, the lengths of the texts under research [83]. Because of this, it is difficult to draw conclusions about the accuracy of this approach for data of another nature. For this reason, the purpose of this work was to analyze the applicability of such mathematical apparatus as the distribution of frequencies of letter combinations for different languages in solving the problem of identifying the authorship of texts of different lengths written in different speech styles [84].
The distance between the corresponding vectors is used as the criterion of proximity of two texts [85]. The sets of parameters and speech coefficients are represented as vectors from the coordinate origin in n-dimensional Cartesian space [86]. The distance between the texts is then the usual Cartesian distance between the ends of the corresponding vectors. This distance is an integral characteristic of the differences between texts [87], and texts separated by a large distance are highly likely to belong to different authors. Thus, in order to compare the authorship of two texts, it is enough to compute their parameters and determine the distance between them [88]. To associate a text with an author, the vector of the author's parameters and that of the text are compared; that is, two texts are in fact compared again: a text whose author is known (the reference text) and the text whose authorship must be identified, confirmed, or refuted (the analyzed/researched text) [89]. Vectors of formal parameters can also be constructed that recognize not specific authors (or groups) but rather distinguish certain characteristics of authors (for example, educational level) [90]. In most cases, according to [91], statistical characteristics are selected as the characteristic parameters of a text:
-the number of occurrences of certain parts of speech, specific words, punctuation marks, phraseological units, archaisms, and rare and foreign words [92];
-the number and length of sentences (measured in words, syllables, and signs), and the average sentence length [93];
-the number of notional and functional words [94];
-vocabulary volume, the ratio of the number of verbs to the total number of words used in a text, etc. [95].
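The distance criterion described above can be sketched as a plain Euclidean distance between two parameter vectors; the three parameters in the usage example are hypothetical placeholders, not the paper's actual feature set:

```python
import math

def text_distance(v1, v2):
    """Cartesian (Euclidean) distance between two texts' parameter
    vectors; a small distance suggests, but does not prove, common
    authorship."""
    if len(v1) != len(v2):
        raise ValueError("parameter vectors must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Hypothetical 3-parameter vectors (e.g. average sentence length,
# verb ratio, stop-word share) for a reference and a studied text.
print(text_distance([18.2, 0.21, 0.34], [17.9, 0.22, 0.33]))
```

In practice the parameters should be normalized to comparable scales first, otherwise a parameter with large absolute values (such as sentence length) dominates the distance.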
The main problem of the formal methods for authorship analysis is selecting the parameters and speech coefficients [96]. A number of formal statistical characteristics of texts are unsuitable for authorship identification because of one of two shortcomings [1-7, 97]:
-Lack of stability. The scatter of parameter values for the texts of the same author is so large that the ranges of possible values for different authors intersect [99]. Obviously, such a parameter will not help distinguish between authors, and when used as part of a parameter group, it will only add informational noise [100].
-Lack of distinguishing ability. The parameter can take close values for all or most authors, because its values are determined by the properties of the language in which the texts are written rather than by the individual features of the author of a text [101]. That is why parameters must be studied in advance in terms of stability and distinguishing ability, preferably on the texts of a large number of different authors [102].
In articles [1-7], the following conditions of applicability of a formal speech coefficient of the author's style were identified:
-Mass character (the use of those characteristics of the text that are poorly controlled by the author at the conscious level, to eliminate the possibility of the author consciously distorting his characteristic style or imitating the style of another author) [103].
-Stability (the coefficient retains a nearly constant value for one participant; the deviation of values from the mean should be rather small) [104].
-Distinguishing ability (the coefficient takes substantially different values for different authors, that is, its variation exceeds the fluctuations possible for one participant) [105].
It is very difficult to choose speech coefficients and parameters that are sure to distinguish between any two authors [106]. Whatever the parameters are, there is always the possibility that two or more participants are close in these parameters due to random coincidence [107]. That is why, in practice, it is sufficient for a parameter to make it possible to distinguish between different subsets of authors; that is, there should be a sufficiently large number of subsets of authors for whom the mean values of the parameter differ significantly [108]. Such a parameter obviously will not help distinguish between the texts of authors from one subset but will enable us to distinguish confidently between the texts of authors who fall into different subsets [1-7]. It is possible to distinguish between the texts of authors of one subset by simultaneously using a rather large vector of parameters of different natures; in this case, the probability of random coincidence will be noticeably smaller. For confident conclusions about texts for which the formally calculated distance is small, it is necessary to conduct additional research by expert methods, for example, analysis of key words and/or stop (auxiliary) words [1-7].
Thus, there is a need to conduct research in this direction due to the lack of practical experiments on identifying the author's style for Ukrainian scientific and technical texts. Recently, many systems have been developed to address plagiarism as a copyright issue. When it comes to rewriting, such a problem is quite difficult to solve for the Slavic languages due to the existence of a large set of synonyms and the possibility of restructuring sentences using other endings. This issue does not apply to the use of auxiliary words, as most people do not even pay attention to them when plagiarizing. This encourages exploration of the problem of author-style identification to determine the degree to which a particular text belongs to a particular author.

The aim and objectives of the study
The aim of this study is to develop the method for automatic text content authorship attribution based on statistical analysis of N-gram distribution.
To achieve the set aim, the following tasks were to be solved:
-to develop the quantitative method for identifying the potential author of a text from a set of possible ones by comparing the results of analysis of the reference text with the studied text;
-to develop content-monitoring software to determine the author of texts in the Ukrainian language based on linguo-statistical analysis of the reference text content;
-to obtain and analyze the results of experimental approbation of the proposed content monitoring method for determining the author of scientific texts of technical profile in the Ukrainian language.

Quantitative method
The quantitative method for identification of the potential author of a text from a set of possible ones is based on comparison of the results of analysis of the reference text with the studied text.
Linguometry is the field of applied linguistics that detects, measures, and analyzes quantitative characteristics of the units of different language or speech levels [3]. Using the apparatus of mathematical statistics, linguometry helps solve such linguistic problems as the creation of:
-dictionaries (including frequency and statistical ones) and their comparison;
-automatic dictionaries and thesauruses;
-transcription systems;
-methods and means of automatic language identification;
-methods and means of information search, etc.
Each language has its own statistical parameters, and knowledge of the frequency of occurrence of letters and their combinations (2-grams, 3-grams, 4-grams) in a particular language makes it possible to identify it automatically. For example, for Ukrainian texts it was found that the statistical parameters of styles can be the frequencies of vowels, consonants, spaces between words, as well as the soft and sonorant groups of consonants [3]. We will show how to evaluate the speech of a certain author on a particular passage of his work [77] with the help of a certain reference, for example, data on the frequency of the letters of the Ukrainian language. Consider two passages of a technical text in Ukrainian, presented in a format where the letters are arranged in order of descending frequency of occurrence in the passage (frequencies are presented in Table 1), with small and capital letters not distinguished. We will find the type of correlation between the letters of the passages [76] and the reference [77], and the results supporting these conclusions will be presented, in particular, in graphical form. For convenience, the following data were entered in Table 1: the frequency of the letters of the Ukrainian language, and the absolute and relative frequencies of the letters in studied Passage 1 by Author 1 [76] and Passage 2 by Author 2 [77]. Note that Passage 1 contains 556 characters and Passage 2 contains 541 characters.
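Building such a letter-frequency profile for a passage can be sketched as follows; the alphabet set below is an assumption of this sketch, and the real tooling may normalize text differently:

```python
from collections import Counter

# The 33 letters of the Ukrainian alphabet (an assumption of this sketch).
UKR_LETTERS = set("абвгґдеєжзиіїйклмнопрстуфхцчшщьюя")

def letter_frequencies(text: str) -> dict:
    """Relative frequencies of Ukrainian letters in a passage,
    case-insensitive; all non-letter characters are ignored."""
    letters = [ch for ch in text.lower() if ch in UKR_LETTERS]
    total = len(letters)
    counts = Counter(letters)
    # Order by descending frequency, as in Table 1.
    return {ch: counts[ch] / total
            for ch in sorted(counts, key=counts.get, reverse=True)}
```

Profiles built this way for a passage and a reference can then be compared letter by letter or correlated.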
Note that the term "others" in the letters column covers letters authentic to the Ukrainian language (ї, є, ґ, і), which are less used in most technical texts. This makes it possible to achieve a certain independence in the analysis. The obtained results are shown graphically in Fig. 1.
The graphical representation of the relative frequencies of letter occurrence in the passages gives a convincing answer to the question of which passage was written by which author.
The 1-gram distributions in the works are different. 3-gram analysis gives the optimal indicators for text research [3]; this will be checked at the following stages of the research. There is an abrupt jump in the relative frequency of occurrence of the letter "е" for Passage 2 relative to the reference values of Reference passage 1 [77] (Fig. 2), so we will assume that Reference passage 1 was more likely written by the author of Passage 1 [76]. We will also give numerical values for the correlation of letter frequencies in the Passages and in the Reference passage. We find two correlation factors: between the reference passage and Passage 1 [76], and between the reference passage and Passage 2 [78]; the factor closest to unity will indicate the greater probability that the corresponding passage belongs with the reference passage. Calculation of the correlation factor for the reference passage and Passage 1 gives R_e,P1 = 0.962716, and the correlation factor for the reference passage and Passage 2 is R_e,P2 = 0.909958. Similarly, the relative frequencies in Reference passage 2 and in Passages 1 and 2 in Fig. 3 differ substantially, so it is likely that the author of Reference passage 2 [75] is not the author of Passages 1 and 2.
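The correlation factor between a passage's letter-frequency vector and the reference can be computed as a Pearson coefficient; a sketch (the input vectors in the example are illustrative, not the Table 1 data):

```python
import math

def correlation(x, y):
    """Pearson correlation of two equal-length frequency vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly proportional frequency vectors correlate at ≈ 1.0.
print(correlation([0.10, 0.08, 0.05], [0.20, 0.16, 0.10]))
```

The two vectors must list relative frequencies for the same letters in the same order, otherwise the coefficient is meaningless.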
The obtained values of the factors, as well as the analysis of the graphical results, suggest that the probability that Passage 1 [76] belongs with Reference passage 1 [77] is higher than for Passage 2 [75].

Content-monitoring software to identify the author in Ukrainian language texts
To achieve the aim of the research, a system with the possibility of choosing the language(s) of the analyzed content was developed and implemented at the Victana Web resource. For qualitative and effective content analysis when determining the degree of authorship of a particular person, we propose to analyze the reference text and the studied one in several stages:
-linguometric analysis of the coefficients of diversity of the author's speech (Algorithm 1);
-stylometric analysis (Algorithm 2);
-analysis of set expressions (Algorithm 3);
-linguistic analysis via N-grams (Algorithm 4).
The Web resource provides the following fields for linguometric analysis (Fig. 4):
-Characters: sets the maximum size of the content (the input text must contain not less than 100 and not more than 10,000 characters).
-Content: the field where the studied text is pasted from the buffer.
-Clear: clears the input data.
Algorithm 1. Linguometric analysis of the text for authorship attribution.
Step 1. Checking the text length; the excess is removed.
Step 2. Determining the number of sentences.
Step 3. Clearing the studied text (figures, special characters).
Step 4. Determining the total number of the words in text N.
Step 5. Determining the number of words W.
Step 6. Determining the number of prepositions Z.
Step 7. Determining the number of conjunctions S.
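The counting steps above can be sketched as follows; the preposition and conjunction lists are illustrative placeholders, not the dictionaries used by the actual Victana implementation:

```python
import re

# Illustrative (incomplete) Ukrainian function-word lists.
PREPOSITIONS = {"в", "у", "на", "до", "з", "за", "про", "під"}
CONJUNCTIONS = {"і", "й", "та", "але", "що", "як", "чи", "бо"}

def linguometric_counts(text: str, max_chars: int = 10_000) -> dict:
    text = text[:max_chars]                                       # Step 1
    sentences = len(re.findall(r"[.!?]+", text))                  # Step 2
    cleaned = re.sub(r"[^а-щьюяєіїґА-ЩЬЮЯЄІЇҐ'\s-]", " ", text)   # Step 3
    words = cleaned.split()                                       # Steps 4-5
    z = sum(w.lower() in PREPOSITIONS for w in words)             # Step 6
    s = sum(w.lower() in CONJUNCTIONS for w in words)             # Step 7
    return {"sentences": sentences, "words": len(words),
            "prepositions": z, "conjunctions": s}
```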
The coefficient of speech coherence, the exclusiveness index, and the concentration index are then computed. The Web resource provides the following fields for stylometric analysis (Fig. 5):
-Reference text: the field where the reference text is pasted from the buffer.
-Choose Passage 1 (2, 3): opens access to the passages. Access to the next passage is granted only after activation of access to the previous one; access is opened sequentially from the smaller to the larger number.
-Passage 1 (2, 3): the field where the text of the corresponding passage is pasted. The input text should have not less than 100 characters. (Now 0): after the calculation is run, the actual number of characters of each passage is computed and shown separately.
-Clear -clearing the input data. Algorithm 2. Stylometric analysis of the text for authorship attribution.
Step 1. Checking the lengths of the reference text and of selected passages and bringing the length of the reference text to the minimum of the checked texts.
Step 2. Clearing the reference text from special characters and others.
Step 3. Determining the number of words in the reference text.
Step 5. Checking that the length of Passage 1 does not exceed the minimal text length.
Step 6. Clearing Passage 1 from special characters and others.
Step 7. Determining the number of words W1 for Passage 1.
Step 8. Determining the number of stop words (prepositions + conjunctions + particles) in the text.
Step 9. Preparation of separate arrays (passage and reference text) for the calculation of correlation factor (Fig. 7).
Step 10. Calling the function for calculation of the correlation factor.
Step 11. Generating the array for the formation of the graphic image of the relative frequency of occurrence of stop words in Passage 1 and in the reference sample.
Step 12. Calling the function for calculation of the diagram of relative frequencies (Fig. 8).
Step 13. Calling the function for calculation of the correlation factor for Passages 2 (3) for each of the auxiliary words.
Step 14. Forming the words of the Swadesh list from the directory and determining the number of words from the Swadesh list in the text of each passage (for the reference text and the selected passages, Table 3).
Step 15. Forming the lists of words common to the reference sample, Passages 1-3 and the Swadesh list.
Step 16. Research results are displayed on the screen (Table 4).
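Steps 8-13 above can be sketched as follows; the use of Pearson correlation over relative stop-word frequencies is an assumption consistent with the description, and the stop-word list would come from the resource's dictionaries:

```python
import math
import re

def stopword_profile(text, stopwords):
    # relative frequency of each auxiliary word in the text
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words) or 1
    return [words.count(sw) / total for sw in stopwords]

def pearson(x, y):
    # correlation factor between a passage profile and the reference profile
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```

A correlation close to 1 between a passage profile and the reference profile supports attribution to the reference author.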
For automated text processing, not only the frequency of occurrence of a particular category in the text is important, but also its presence in the studied text in general. Quantitative counting makes it possible to draw objective conclusions about the orientation of the materials by the number of uses of the units of analysis (key quotations) in the studied texts. Qualitative analysis does the same, but as a result of studying whether (and in what context) some important, original category occurs at all. Summing up, it should be noted that the use of content analysis for the creation of information systems makes it possible to capture the prevalence of a particular feature in the studied totality of texts (Fig. 4 shows an example of the result of linguometric analysis). In this case, not the absolute but the relative value of the feature, that is, the characteristic of its place (fraction) among other features, is important. Measurement of the ratio between the features in the texts gives empirical material for understanding the functional relations between the elements of reality displayed in the texts. In the presence of texts that have a chronological sequence, it is possible to obtain a number of time-fixed portraits of the studied reality, which makes it possible to put forward hypotheses of a predictive nature about the functioning of the elements of the system. For example, frequency characteristics of a text (such as the average sentence length) may indicate certain specificity of the intellectual abilities of a person in terms of verbal representation of thoughts. By determining the average sentence length, it is possible to characterize a change in the emotional state of an individual.
The choice of analysis of the vocabulary variant in context dependence is one of the most significant and powerful in psycholinguistic diagnostics. Due to the established coefficient of vocabulary diversity (Table 5) in the speech of a person, it is possible to identify psychopathology, for example, schizophrenia, as well as a tendency to it.
Another criterion of language competence is the verb coefficient (aggressiveness coefficient). The essence of this coefficient is the ratio of the number of verbs and verb forms (participles) to the total number of words. As in psychology, a high aggressiveness coefficient indicates possible high emotional tension of an author, which is reflected in the text by changes in the dynamics of events and other characteristic features. The coefficient of logical coherence is calculated as the ratio of the total number of functional words (conjunctions, prepositions and particles) to the total number of sentences. At magnitudes within unity, rather harmonious relations between functional words and syntactic constructions are ensured. The embolus coefficient (med. embolus, a blockage of a blood vessel), or speech "contamination", is the ratio of the total number of emboli (words that carry no semantic load) to the total number of words in a sentence. The emboli include exclamations (nu-nu, ha-ha, ehe, zh, oi, etc.), vulgarisms (rough vocabulary), and unnecessary repetitions. The embolus coefficient demonstrates the peculiarities of verbal intelligence and the emotional state of a speaker/author of the text. It can also give an idea of the general culture of speech. Even taking into consideration that a belles-lettres text is generally considered androgenic, it is a weave of subordinated functions: the qualities of the author's "I", which are in some way graded depending on the characterological profile of a particular author. In other words, the text of the original and the text of the translation depend on their authors.
There are the following fields for the analysis of set expressions in the Web resource (Fig. 9):
-Enter the number of phrases to be displayed on the screen (10; 100): this many word combinations will be displayed on the screen after calculation.
-Select Passage 1 (2, 3): opens access to the passages. Access to the next passage is granted only after activating access to the previous one; access opens sequentially from the smaller number to the larger. (Not implemented; only one passage is analyzed.)
-Passage 1: the field into which the text of the corresponding passage is copied. Used: 57 %. The input text must contain not less than 100 characters. (Limit: 4000): analysis of the text size.
Fig. 5. Example of input data for stylometric analysis
-Clear: clears the input data.
Algorithm 3. Quantitative analysis of set expressions.
Step 1. Clearing the obtained content from special characters and others.
Step 2. Form the list of blocked words from the database depending on the chosen language of the context.
Step 3. Preparation for the formation of the arrays of two-word combinations and of all words. The array at the input: the keys are numbers, the values are the text split into sentences (the divider is a dot). Each word is compared with the database of given keywords and, by the rule described in the database, is reduced to its word base if it is not already a word base.
Step 4. Determining set expressions by the FREQ (absolute frequency) method: obtaining the absolute frequency of word combinations (Fig. 10).
Step 5. Determining set expressions by the t-test method: comparing the observed frequency of a pair with the expectation P(w1)*P(w2) under independence, thus accounting not only for the pairs but also for the frequency of use of the separate words that make up each pair.
Step 6. Determining set expressions by the LR (likelihood ratio) method.
Step 8. Research results are displayed on the screen.
If a word is missing in the database, it is added automatically. For this word, a moderator needs to describe the rule of bringing the word to the word base.
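The t-test scoring from Step 5 can be sketched in the standard collocation-statistics form; the tokens below are illustrative, and the decision threshold is not specified by the algorithm above:

```python
import math
from collections import Counter

def t_score(words, w1, w2):
    # t-test collocation score: observed bigram probability vs.
    # the independence expectation P(w1)*P(w2)
    n = len(words)
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    x_bar = bi[(w1, w2)] / n             # observed bigram probability
    mu = (uni[w1] / n) * (uni[w2] / n)   # expected if w1 and w2 were independent
    if x_bar == 0:
        return 0.0
    return (x_bar - mu) / math.sqrt(x_bar / n)
```

High scores indicate that the pair occurs together far more often than the frequencies of its separate words would predict, i.e. a likely set expression.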
When identifying the author of a text, it is assumed that the text reflects the individual manner of the author's writing, which makes it possible to distinguish it from others. To compare texts with each other, it is necessary to associate a text with some numerical characteristic that would be close for the texts of the same author and substantially different for the works of other authors. Such a characteristic can be the density of the distribution function (DDF) of letter combinations of three consecutive characters (3-grams). The DDF is defined as the set of empirical frequencies of use of the letters or their combinations. When analyzing the text using the DDF, punctuation marks, spaces and digits are not taken into consideration. The task of identification of the author of an unknown text in terms of the DDF is stated as follows.
There is a set of texts, which contains works by A well-known authors. Let us assume that K_a is the amount of content by the a-th author and N_{i,a} is the number of characters in the i-th content of the a-th author, i = 1, ..., K_a. All the texts in this set are given in the DDF form. The DDF of the content of volume N_{i,a} is assigned as the set of values f_{i,a}(j) = k_j / N_{i,a}, where k_j is the number of uses of the N-gram with number j, and the argument j runs over all possible N-grams. These DDF are the author's references. To compare two texts, or a text and the author's reference, one must assign the distance between the corresponding distribution functions. The norm in the space of functions, taken as a sum over N-grams, is used as the distance metric. Thus, for example, the distance p_{0,a} between the DDF of an unknown text f_0 and the author's DDF F_a is calculated as p_{0,a} = SUM_j |f_0(j) - F_a(j)|. Accordingly, text "0" is attributed to the author for which the distance to the DDF is the shortest. In solving the classification problem, the dataset was not split explicitly into test and training sets; weighted average DDF were constructed over the whole set of the content of one author.
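A sketch of the DDF and the distance between two texts follows; normalizing by the number of grams rather than by the raw character count, and the choice of the L1 norm as the sum over grams, are assumptions consistent with the description:

```python
import re
from collections import Counter

def ddf(text, n=3):
    # keep only letters: punctuation marks, spaces and digits are not considered
    clean = re.sub(r"[^а-щьюяєіїґa-z]", "", text.lower())
    grams = [clean[i:i + n] for i in range(len(clean) - n + 1)]
    total = len(grams) or 1
    # empirical frequency of every n-gram actually present in the text
    return {g: c / total for g, c in Counter(grams).items()}

def distance(f0, fa):
    # L1 norm between two distribution functions
    keys = set(f0) | set(fa)
    return sum(abs(f0.get(k, 0.0) - fa.get(k, 0.0)) for k in keys)
```

An unknown text would then be attributed to the author whose reference DDF yields the smallest distance.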
The distance from content i to a specific author a was calculated as p_{i,a} = SUM_j |f_{i,a}(j) - F_a^(-i)(j)|, where F_a^(-i)(j) = SUM_{m != i} N_{m,a} f_{m,a}(j) / SUM_{m != i} N_{m,a} is the weighted average DDF of the a-th author constructed without content i. The formula makes it possible to exclude participation of the DDF of content i in the average DDF of a specific author. In the Web resource, there are the following fields for N-gram analysis (Fig. 11):
-Choose the language of the text: the language of the text for analysis (research). By default, Ukrainian.
-The number of grams -the number of characters in the gram. By default, 3. It can be changed for 1, 2, 3, 4.
-Text limitation in characters.
-Text is the field, where the studied text is copied from the buffer.
-Generate -to run N-gram generation.
-Clear: clears the entered data.
Algorithm 4. Linguo-statistical analysis of N-grams of the text.
Step 1. Clearing the studied text (digits, special characters).
Step 2. Calculating the number of words in the text.
Step 3. All the words in the text are put into lower case.
Step 5. Depending on the chosen language, substitute the appropriate alphabet.
Step 6. Depending on the established number of grams, run the appropriate function, which calculates all possible variants of grams and stores them in the array.
Step 7. Next, run the function for calculating the number of occurrences of the grams. Here, the relative occurrence frequency is calculated and stored in the array: the sequence number of the gram, the gram itself, the number of occurrences of this gram, and the relative frequency of occurrence of this gram.
Step 8. The next function takes the array obtained in the previous step and forms it into a CSV file. This file is stored on the server and can be downloaded to the user's (researcher's) computer via the link, which becomes available after the form with the research results is generated.
Step 9. Research results are displayed on the screen (only the grams found in the text).
Step 10. Access to the export file is opened.
Step 11. The summarizing results are displayed:
-alphabet size;
-number of words in the text;
-number of characters in the text with spaces;
-number of characters in the completely cleared text;
-total number of possible N-grams;
-total number of found N-grams without repetitions;
-total number of found N-grams with repetitions.
Fig. 10. Example of the result of using analysis of set expressions
Fig. 11. Example of using analysis of N-grams of the text
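The counting and summary of Algorithm 4 (Steps 6-11) can be sketched as follows; the toy alphabet is illustrative, while the full Ukrainian alphabet of 33 letters gives 33**3 = 35,937 possible 3-grams:

```python
from collections import Counter

def ngram_summary(text, alphabet, n=3):
    # Steps 1, 3, 5: lower-case the text and keep only letters of the chosen alphabet
    clean = "".join(ch for ch in text.lower() if ch in alphabet)
    # Steps 6-7: count every n-gram actually occurring in the text
    found = Counter(clean[i:i + n] for i in range(len(clean) - n + 1))
    # Step 11: summarizing results
    return {
        "possible": len(alphabet) ** n,      # all possible n-grams over the alphabet
        "found_unique": len(found),          # found n-grams without repetitions
        "found_total": sum(found.values()),  # found n-grams with repetitions
    }
```

The per-gram counts in `found` correspond to the rows exported to the CSV file in Step 8.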

Results of experimental testing of the proposed content-monitoring method for text authorship attribution
Let us compare three publications [1, 74, 77] in the scientific and technical area based on the linguo-statistical analysis of 3-grams. Articles 1 and 2 were written by one team of authors [1, 74]; Article 3 was written by another author [77] (Table 7). The language of the texts is Ukrainian (there are 33 letters in the alphabet, so the number of all possible 3-grams is 33^3 = 35,937).
Table 7. Values of parameters for the analyzed Articles 1-3
When comparing the articles, we take into account only the 3-grams that occur at least once in all three articles simultaneously. For this specific example, there are in total 2,147 such 3-grams. That is, for Article 1 we analyze 78.4814 % of its 3-grams, for Article 2, 72.6332 %, and for Article 3, 84.1271 %. Accordingly, the difference in the use of the corresponding 3-grams between Articles 1 and 2 is R12=56.5254 %, between Articles 2 and 3, R23=69.4271 %, and between Articles 1 and 3, R13=62.9839 %. These indicators show that the characteristics of Articles 1 and 2 are more similar (R23>R12 by 12.9017 %, R23>R13 by 6.4432 %, R13>R12 by 6.4585 %, that is, R23>R13>R12) than the characteristics of Articles 1 and 3 and of Articles 2 and 3, respectively. The lower Rij, the higher the degree of confidence that the articles were written by the same author. Hence, Articles 1 and 2 are more likely to have been written by one author/team of authors than Articles 2-3 and Articles 1-3, respectively. Let us analyze the use of separate clusters of 3-grams in the corresponding articles and compare the results obtained. Fig. 12, 13 show the results of the use in Articles 1-3 of 3-grams beginning with letter а (occurrence in Articles 1-3 in the range of 6.1125-6.7087 %). Most often, the curves for Articles 1-2 (4.2322 %) and Articles 1-3 (4.197 %) coincide or approach each other (average divergence 0.02713 % and 0.0269 %, respectively).
But this is not always so: for Articles 2-3 (4.6322 %) there are substantial divergences (average divergence is 0.02969 %). If we analyze only such 3-grams, it follows that all three articles are most likely to have been written by one author. This is explained by the fact that this letter is one of the most often used in the formation of Ukrainian words. Fig. 14, 15 show the results of analysis of the use in Articles 1-3 of 3-grams beginning with letter б (occurrence in Articles 1-3 in the range of 0.48884-0.77738 %). Most often the curves for Articles 1-2 (0.594 %), unlike those for Articles 1-3 (0.7072 %) and Articles 2-3 (1.1208 %), coincide or approach each other. The trajectories of the curves of Article 1 and Article 2 coincide most often (most likely these articles were written by one author: the average divergence is 0.01809 %, while for Articles 1-3 it is 0.0261 % and for Articles 2-3 it is 0.02866 %). If we analyze only these 3-grams (which are less common), it turns out that Articles 1-2 are more likely to have been written by one author, and Article 3 by another author. This is explained by the fact that letter б is rare in the formation of Ukrainian words, and some authors use words containing it more often out of habit and/or because of the subject of their publications (this needs additional research). Fig. 16 shows the results of analysis of the use in Articles 1-3 of 3-grams beginning with letter в (occurrence in Articles 1-3 in the range of 4.2622-4.5219 %). Most often the curves for Articles 1-2 (3.55581 %), Articles 1-3 (3.6523 %) and Articles 2-3 (4.1064 %) coincide or approach each other (average divergence is 0.03067 %, 0.03149 % and 0.0354 %, respectively). According to these data, all three articles are most likely to have been written by one author. Fig. 17, 18 show the results of analysis of the use of 3-grams beginning with letter г (occurrence in Articles 1-3 in the range of 0.7493-1.4544 %) in Articles 1-3.
Most often the curves for Articles 1-2 (0.6551 %), unlike those for Articles 1-3 (1.309 %) and Articles 2-3 (1.3451 %), coincide or approach each other. The trajectories of the curves of Article 1 and Article 2 coincide most often (most likely they were written by one author: the average divergence is 0.02047 %, while for Articles 2-3 it is 0.04203 % and for Articles 1-3 it is 0.04091 %). If we analyze only such 3-grams (which are less common), it turns out that Articles 1-2 were most likely written by one author, while Articles 2-3 and Articles 1-3 were definitely written by different authors.
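The restriction to 3-grams shared by all compared articles can be sketched as follows; the R_ij formula shown is one plausible reading (a per-cent sum of absolute frequency differences over the shared grams), not necessarily the exact definition used in the experiment:

```python
def shared_grams(*freqs):
    # keep only the grams that occur in every compared text at least once
    keys = set(freqs[0])
    for f in freqs[1:]:
        keys &= set(f)
    return keys

def r_ij(fi, fj, shared):
    # assumed divergence measure between two articles, in per cent,
    # computed over the shared grams only
    return 100.0 * sum(abs(fi[g] - fj[g]) for g in shared)
```

Each `fi` here is a mapping from a 3-gram to its relative frequency in article i; lower R_ij values correspond to more similar articles.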

Discussion of results of research into the authorship attribution in Ukrainian-language texts based on the technology of statistical linguistics
According to the data in Table 8 and Fig. 19, some letters in the Ukrainian language are used very often, while others are used much more rarely. For the most used letters, the frequency of occurrence of 3-grams with such initial letters has almost the same distribution across texts (the peak values in the graph in Fig. 19), while for the other letters the distributions differ.
Therefore, it is advisable to investigate only the trigrams whose initial letters are less common in texts of a specific language in order to determine the degree of belonging of a text to a particular author (for example, Fig. 20, 21). Thus, for the 3-grams beginning with letter є (occurrence in Articles 1-3 in the range of 0.2517-0.707 %), most frequently the curves for Articles 1-2 (0.2508 %), unlike those for Articles 1-3 (0.6077 %) and Articles 2-3 (0.5443 %), coincide or approach each other. The trajectories of the curves of Article 1 and Article 2 coincide most often (the articles are most likely to have been written by one author: the average divergence is 0.0114 %, while for Articles 2-3 it is 0.02478 % and for Articles 1-3 it is 0.02762 %, more than twice as high).
Table 8. Distribution of frequencies of appearance of 1-grams in Articles 1-3
However, this often does not work. Thus, for 3-grams beginning with letter ж (occurrence in Articles 1-3 in the range of 0.3408-0.4738 %), all the curves for Articles 1-2 (0.25 %), Articles 1-3 (0.2126 %) and Articles 2-3 (0.2302 %) coincide or approach each other. The average divergence for Articles 1-2 is 0.01786 %, while for Articles 2-3 it is 0.01644 % and for Articles 1-3 it is 0.01519 %. It looks as if all the articles were written by one author, although the trajectories of the curves in Fig. 22 and the columns of the graph in Fig. 23 show that Articles 1-2 are likely to have been written by one author and Article 3 by another.
Fig. 19. The graph of distribution of frequencies of occurrence of 1-grams in Articles 1-3
Let us check again. For 3-grams beginning with letter з (occurrence in Articles 1-3 in the range of 1.3108-1.973 %), the curves for Articles 1-2 (1.1879 %), Articles 1-3 (1.3259 %) and Articles 2-3 (1.25 %) coincide or approach each other. The average divergence for Articles 1-2 is 0.02121 %, while for Articles 2-3 it is 0.02232 % and for Articles 1-3 it is 0.02368 %. It looks as if all the articles were written by one author, although the trajectory of the curves in Fig. 24 shows that Articles 1 and 2 are likely to have been written by one author, and Article 3 by another one.
For 3-grams beginning with letter й (occurrence in articles 1-3 in the range of 0.301-0.4319 %), the curves for Articles 1-2 (0.3352 %), Articles 1-3 (0.3483 %) and Articles 2-3 (0.3469 %) coincide or approach each other. Average divergence for Articles 1-2 is 0.01457 %, while for Articles 2-3 -0.01508 % and for Articles 1-3 -0.01514 %. It looks like all the articles were written by one author. Though the trajectory of the curves in Fig. 25 shows that Articles 1-2 are more likely to have been written by one author, and Article 3 -by another.
For 3-grams beginning with letter м (occurrence in Articles 1-3 in the range of 2.1681-3.1225 %), the curves for Articles 1-2 (1.7619 %) and Articles 1-3 (1.8193 %), unlike Articles 2-3 (2.6606 %), coincide or approach each other. The average divergence for Articles 1-2 is 0.01936 %, while for Articles 2-3 it is 0.02936 % and for Articles 1-3 it is 0.02 %. It looks as if all the articles were written by one author, though the trajectory of the curves in Fig. 26 shows that Articles 1 and 2 are more likely to have been written by one author, and Article 3 by another. Thus, not only the number of occurrences of trigrams with a certain initial letter influences the correctness of the result, but also the frequency of occurrence of such 3-grams. For 3-grams beginning with letter п (occurrence in Articles 1-3 in the range of 1.8583-2.8092 %), the curves for Articles 1-2 (1.6619 %), unlike Articles 1-3 (2.1261 %) and Articles 2-3 (2.5456 %), coincide or approach each other (Fig. 27). The average divergence for Articles 1-2 is 0.04261 %, while for Articles 2-3 it is 0.06527 % and for Articles 1-3 it is 0.05452 %. Articles 1-2 were written by one author, and Article 3 by another.
For 3-grams beginning with letter у (occurrence in Articles 1-3 in the range of 2.1927-2.7261 %), the curves for Articles 1-2 (1.7905 %), unlike those for Articles 1-3 (1.9443 %) and Articles 2-3 (1.9852 %), coincide or approach each other (Fig. 29). The average divergence for Articles 1-2 is 0.02184 %, while for Articles 2-3 it is 0.02421 % and for Articles 1-3 it is 0.02371 %. It looks as if all the articles were written by one author. If the trigrams beginning with a certain letter make up more than 1 % of the text, the average divergence when comparing articles will be almost the same irrespective of authorship. Therefore, it is necessary to take into account only the trigrams whose percentage of occurrence is less than 1.
For 3-grams beginning with letter ф (occurrence in Articles 1-3 in the range of 0.3069-0.4759 %), the curves for Articles 1-2 (0.2762 %) and Articles 1-3 (0.299 %), unlike Articles 2-3 (0.495 %), coincide or approach each other (Fig. 30). The average divergence for Articles 1-2 is 0.03453 %, while for Articles 2-3 it is 0.06188 % and for Articles 1-3 it is 0.03738 %. It looks as if Articles 1-2 were written by one author and, similarly, Articles 1-3 were written by one author, while Articles 2-3 were written by different authors. The trajectory of the curves coincides best for Articles 1-2. Though the frequency of appearance of trigrams with the initial letter ф is less than 1 % in each article (which allowed splitting the set of potential authors into two subsets), the frequency of appearance of each trigram beginning with letter ф is very small (8 trigrams out of 333 possible, compared with 156 trigrams out of 333 possible for letter а). The best results are shown by trigrams with a specific initial letter when their number is within the range of 30 to 90. These numbers are approximate. To specify their values when studying scientific and technical texts in the Ukrainian language, it is necessary to carry out additional research using a sufficiently large volume of texts (over 1,000) by a large number of authors (over 100) and to have reliable reference texts by sole authors (with confirmed authorship). This is next to impossible to do, because most scientific and technical literature is co-authored, which imposes subjective characteristics on the analyzed text.
For 3-grams beginning with letter я (occurrence in articles 1-3 in the range of 1.4442-1.5541 %, in total 72 trigrams), all curves for Articles 1-2 (0.9522 %), Articles 1-3 (0.9361 %) and Articles 2-3 (1.0555 %) coincide or approach each other (Fig. 37). Average divergence for Articles 1-2 is 0.013225 %, while for Articles 2-3 -0.01466 % and for Articles 1-3 -0.013 %. It looks like all the articles were written by one author. This is explained by the fact that the frequency of occurrence of such trigrams is much more than 1 %. Therefore, we compare the frequencies of appearance of all the trigrams that begin with a specific letter (Fig. 38, 39).
According to these graphs, Article 1 and Article 2 are most likely to have been written by one author, although Article 1 and Article 3 may also seem to have been written by one author (but this is not the case). Articles 2-3 have definitely been written by different authors. Application of the linguo-statistical analysis of 3-grams to a set of articles makes it possible to form a subset of publications that are similar in linguistic characteristics. Imposing additional conditions on this subset in the form of statistical and quantitative analyses (sets of keywords, set expressions, stylometric and linguometric analyses, etc.) will significantly reduce this subset by specifying the list of the most probable authors' papers. Thus, the analysis of the content and frequency of occurrence of functional words alone separates Articles 1 and 3 into different subsets, while Articles 1 and 2 remain in the same subset.
This study does not claim to solve the task of identifying the author in full, because the distinction of the author's traits is subjective in nature and depends on the limitations imposed on the author's creative process. However, as a result, a system that implements such methods can provide recommendations on the degree of belonging of a text to a specific author. The testing of the proposed method of identification of the author's style for other categories of texts (scientific humanities, fiction, journalism, etc.) requires further experimental research.

Conclusions
1. The quantitative method for identifying a potential author of a text from a set of possible ones was developed based on the comparison of the results of analysis of the reference text with those of the studied text. The algorithm for determining the stop words of text content based on linguistic analysis was developed, as well as the algorithm of lexical analysis of texts in the Ukrainian language and the algorithm of the syntax analyzer of text content. Their specific feature is the adaptation of morphological and syntactic analysis of lexical units to the peculiarities of Ukrainian-language words and texts. The theoretical and experimental substantiation of the method of content monitoring and determining the stop words of a text in the Ukrainian language was presented. The method is aimed at automatic detection of significant stop words of a Ukrainian-language text owing to the proposed formal approach to the implementation of content parsing.
2. The approach to the development of content-monitoring software was proposed to identify the style of the author of a text written in Ukrainian based on Web Mining. The problem of identification of the author of a text in the Ukrainian language using reference characteristics of the author's speech based on the methods of NLP and stylometry was considered. This is important because the introduction of stylometric information technologies for textual content authorship attribution leads to a higher coefficient of reliability of authorship identification for the studied text. However, there are objective difficulties related to the accuracy of attributing authorship to a particular person, because the sample of single-author scientific and technical publications is small (most articles in this field are written by co-authors). Only taking their personal characteristics into consideration through system learning can significantly reduce the range of potential authors of a particular technical text. As a part of the research described in this article, the quantitative method for automatic textual content authorship attribution based on statistical analysis of the N-gram distribution was developed. The system, based on the modern methods of NLP and stylometry, with consideration of the metrics of evaluation of the analyzed text compared to the reference text, was developed. In addition, based on modern Machine Learning methods, the developed system learns to refine the results of analyzing a text for the degree of authorship compared to the reference sample. This makes it possible to reasonably approach determining the quality of automatic identification of the author of a scientific and technical text and to obtain certain effects from its implementation in production. In particular, the coefficients of the author's speech can be clarified.
In short, the algorithms of authorship attribution based on modern approaches of the NLP and stylometry taken together enable us to decrease the set of potential authors of the studied text. Further analysis of keywords, use of functional words and set expressions makes it possible to determine more accurately the degree of belonging of a paper to a particular author.
3. The results of experimental testing of the proposed content-monitoring method for determining the style of an author of scientific and technical texts written in the Ukrainian language were explored. Typically, authorship attribution systems use plagiarism detection algorithms based on copyright and rewrite metrics. This is necessary only in order to determine whether a paper was borrowed in whole or in part; however, such systems do not take into consideration the situation when a paper has not yet been published. The quantitative content analysis of textual content of the scientific and technical direction uses the benefits of content monitoring and content analysis of a text based on the methods of NLP, Web Mining and stylometry for determining a set of authors whose writing styles are similar to the studied text passage.
Fig. 39. The graph of the difference of using 3-grams beginning with a specific letter
This narrows the range of the search at further use of stylometry methods to determine the degree of belonging of the analyzed text to a particular author. We performed the decomposition of the authorship attribution method based on the analysis of such speech coefficients as lexical diversity, degree (measure) of syntax complexity, speech coherence, exclusiveness index and text concentration index. In parallel, such parameters of the author's style were analyzed as the number of distinct words in a certain text, the total number of words in this text, the number of sentences, the number of prepositions, the number of conjunctions, the number of words with frequency 1, and the number of words with frequency 10 and more. As an example, 3-grams from 3 articles were analyzed. For Article 1, 78.4814 % of 3-grams were analyzed, for Article 2, 72.6332 %, and for Article 3, 84.1271 %.
Accordingly, the difference in the use of the corresponding 3-grams between Articles 1 and 2 is R12=56.5254 %, between Articles 2 and 3, R23=69.4271 %, and between Articles 1 and 3, R13=62.9839 %. These indicators show that the characteristics of Articles 1 and 2 are more similar (R23>R12 by 12.9017 %, R23>R13 by 6.4432 %, R13>R12 by 6.4585 %, that is, R23>R13>R12) than the characteristics of Articles 1 and 3 and of Articles 2 and 3, respectively. The lower Rij, the greater the degree of confidence that the articles were written by the same author. In this case, Articles 1 and 2 are more likely to have been written by one author/team than Articles 2-3 and Articles 1-3, respectively. This paper contains the materials of completed scientific research in the field of information technology, related to computer linguistics, artificial intelligence, and Machine Learning. The obtained results give grounds to argue for the possibility of their implementation in actual industrial production.