DEVELOPMENT OF THE LINGUOMETRIC METHOD FOR AUTOMATIC IDENTIFICATION OF THE AUTHOR OF TEXT CONTENT BASED ON STATISTICAL ANALYSIS OF LANGUAGE DIVERSITY COEFFICIENTS

V. Lytvyn, Doctor of Technical Sciences, Professor, Department of Information Systems and Networks**, E-mail: yevhen.v.burov@lpnu.ua
V. Vysotska, PhD, Associate Professor, Department of Information Systems and Networks**, E-mail: victoria.a.vysotska@lpnu.ua
P. Pukach, Doctor of Technical Sciences, Professor*, E-mail: ppukach@gmail.com
Z. Nytrebych, Doctor of Physical and Mathematical Sciences, Professor, Department of Mathematics**, E-mail: znytrebych@gmail.com
I. Demkiv, Doctor of Physical and Mathematical Sciences, Associate Professor, Department of Computational Mathematics and Programming**, E-mail: ihor.i.demkiv@lpnu.ua
R. Kovalchuk, PhD, Associate Professor*, E-mail: roma_kov@meta.ua
N. Huzyk, PhD*, E-mail: hryntsiv@ukr.net
* Department of Engineering Mechanics (Weapons and Equipment of Military Engineering Forces), Hetman Petro Sahaidachnyi National Army Academy, Heroiv Maidanu str., 32, Lviv, Ukraine, 79026
** Lviv Polytechnic National University, S. Bandery str., 12, Lviv, Ukraine, 79013

A linguometric method providing algorithmic support for content-monitoring processes was developed to solve the problem of automatically identifying the author of Ukrainian-language text content, based on statistical analysis of language diversity coefficients. The author-identification method was decomposed on the basis of an analysis of such speech coefficients as lexical diversity, degree (measure) of syntactic complexity, speech coherence, and the indices of exclusivity and concentration of a text. Parameters of the author's style were also analyzed: the number of words in a given text, the total number of words of this text, the number of sentences, the number of prepositions, the number of conjunctions, the number of words with frequency 1, and the number of words with frequency 10 or more. A distinctive feature of the developed method is the adaptation of the morphological and syntactic analysis of lexical units to the structural features of Ukrainian words/texts.
That is, when analyzing linguistic units such as words, their part of speech and their declension within that part of speech were taken into account. For this purpose, the inflections of these words were analyzed for classification and for extracting the stem to build the corresponding alphabetical-frequency dictionaries. The contents of these dictionaries were then used at the subsequent steps of authorship determination to calculate the parameters and coefficients of the author's speech. Function (stop or anchor) words are the most indicative of a writer's individual style, since they are in no way related to the topic and content of the publication. The results were compared on a set of 200 single-author technical papers by about 100 different authors over the period 2001-2017 to determine whether and how the text diversity coefficients of these authors change over different time intervals. It was found that, for the selected experimental base of more than 200 papers, the best results by the density criterion are achieved by analyzing an article without the initial mandatory information, such as abstracts and keywords in different languages, and without the list of references.
Keywords: NLP, content monitoring, stop words, content analysis, statistical linguistic analysis, quantitative linguistics
UDC 572. 511:61+004.94
DOI: 10.15587/1729-4061.2018.141451


Introduction
Important tasks of linguometry are the creation and comparison of dictionaries (including frequency and statistical dictionaries), automatic dictionaries, thesauruses, and shorthand systems, as well as automatic language identification, information search, etc. [1]. For example, statistical and transition probabilities of the morphemes of a text are found in order to model information search processes [2]. Based on the constructed tables, the proofreading of a studied word is modeled and the most probable variants are offered [3]. In turn, stylemetry, as a subdivision of applied linguistics, reveals and analyzes the quantitative characteristics of a certain functional style of the language or of the speech of the authors of text content, that is, the author's attributions [4]. Attribution implies determining, with the use of methods of quantitative linguistics, the validity and authenticity of an author's work, its author, and the place and time of its creation, based on the analysis of the technological and stylistic patterns and the characteristics of the language diversity coefficients of a particular author and/or a particular text [5]. For example, one of the known linguistic problems is determining the author's attribution of passages of a particular text content [6]. To do this, the frequencies of word usage in the proposed passages are calculated [7]. Using frequency dictionaries of the author's creative work in general or of his separate works, the author of a work (or the work itself, if the dictionary makes it possible) is identified [8]. The disadvantage is the need to store or automatically generate large data arrays in the form of frequency dictionaries of the author's works [9]. Processing such dictionaries requires a lot of time, while storing them demands a lot of resources [10]. Moreover, some authors have not created a large number of works, which makes it impossible to reproduce exactly the results of the analysis of the author's attribution [11].
A well-known method of dating, used to determine the duration of the separate existence of two closely related languages, is based on the assumption that the bulk of the lexical structure of any language (the nuclear vocabulary) changes at the same rate, and it requires counting the percentage of common elements in the basic vocabulary [12]. Modified methods of glottochronology are used to determine the dynamics of a change in an author's speech in his text content over a long period in order to date the approximate period within which a particular text was created by this author [13]. That is why the problem of automatic identification of the author of text content is relevant and requires new (improved) approaches to its solution, for example, based on statistical analysis of language diversity coefficients.

Literature review and problem statement
The separation (distribution) of a linguistic unit in a text, that is, the presence of a linguistic unit in various (usually equal) subsamples (passages), is of great importance in quantitative linguistics [15]. If a studied linguistic unit occurs in only one subsample, even with a high frequency, such a sample is non-representative with respect to this linguistic unit [16]. It is important that a studied linguistic unit be evenly distributed over the general population [17]. To check this, the distribution factor is analyzed [18]: $K_r = N_p / N_z$, where $N_p$ is the number of subsamples containing a certain linguistic unit and $N_z$ is the total number of subsamples. However, the characteristics obtained from the material of a sample usually differ from the real characteristics of the general population, as there is a relative inaccuracy of research in quantitative linguistics [19]. The distribution of the frequency of linguistic units in text content has a certain regularity and forms its statistical (frequency, probabilistic) structure [20]. This distribution is different for each of the language elements: lexemes, morphemes, phonemes, etc. [21]. That is why the linguo-statistical parameters of the authors' styles, established at different levels (phonemic, morphemic, N-gram, lexemic, etc.), have a different power to distinguish the authors' speech for different pairs of styles [22]. For example, related styles are more clearly distinguished at the semantic level, while less related styles are distinguished at the lexical level [23]. To do this, frequency dictionaries of certain linguistic units are created, and with their use the average frequency of words in a text, the hapax legomena coefficient (words that have frequency 1 in the studied sample), the exclusivity index, the concentration index, etc. are analyzed [1-5, 14, 24].
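The distribution factor can be illustrated with a minimal Python sketch. The function names and the rule of dropping an incomplete final subsample are assumptions made for illustration and are not taken from the cited works.

```python
from typing import List, Sequence


def split_into_subsamples(tokens: Sequence[str], size: int) -> List[List[str]]:
    """Cut a token sequence into equal subsamples; a shorter tail is dropped (assumption)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, size)]


def distribution_factor(unit: str, subsamples: Sequence[Sequence[str]]) -> float:
    """K_r = N_p / N_z: the share of subsamples in which the unit occurs."""
    if not subsamples:
        raise ValueError("at least one subsample is required")
    n_p = sum(1 for sample in subsamples if unit in sample)  # N_p
    return n_p / len(subsamples)                             # N_p / N_z


tokens = "a b c a d e f g h a b c".split()
subsamples = split_into_subsamples(tokens, 4)
k_r = distribution_factor("a", subsamples)  # "a" occurs in 2 of the 3 subsamples
```

A value of $K_r$ close to 1 indicates that the unit is spread over almost all subsamples, while a value close to 0 indicates that it is concentrated in a few of them.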
According to the WF, one calculates such characteristics as vocabulary richness and the diversity index ($K_l$), which is the ratio of the volume of the lexeme vocabulary ($W$) to the text volume ($N$), that is, $K_l = W/N$. According to Table 1, the most diverse, richest lexis is found in poetry, followed in descending order by prose, everyday colloquial style, journalism, and scientific and formal business style [14, 25]. The average frequency of a word in a text, $A$, is the ratio of the text volume $N$ to the volume of the lexeme vocabulary $W$ (the inverse of the diversity index), that is, $A = N/W$ [26]. According to the WF data, each word in the colloquial everyday style was used on average 14 times, and in the scientific style 17 times [27].
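As a small illustration of these two ratios, the sketch below computes $K_l$ and $A$ for a token list. It assumes that the tokens are already reduced to lexemes; here plain lowercase word forms stand in for them, which is a simplification.

```python
from collections import Counter
from typing import Sequence, Tuple


def diversity_and_average_frequency(tokens: Sequence[str]) -> Tuple[float, float]:
    """Return (K_l, A), where K_l = W / N and A = N / W."""
    if not tokens:
        raise ValueError("the text must contain at least one token")
    n = len(tokens)           # N: text volume
    w = len(Counter(tokens))  # W: vocabulary volume (distinct units)
    return w / n, n / w


tokens = "the cat and the dog and the bird".split()
k_l, a = diversity_and_average_frequency(tokens)  # N = 8, W = 5
```

Because $A$ is the inverse of $K_l$, the product of the two values is always 1, which gives a quick sanity check for any implementation.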
The exclusivity index characterizes lexis variability, that is, the percentage of a text (vocabulary) occupied by the words that were found once (Table 1) [28]:
- of vocabulary, $I_{wt}$: the ratio of the number of lexemes with frequency 1, $W_1$, to the total number of lexemes: $I_{wt} = W_1 / W$ [14];
- of text, $I_t$: the ratio of the number of lexemes with frequency 1, $W_1$, to the text volume $N$: $I_t = W_1 / N$ [14].
The concentration index indicates the percentage of a text (vocabulary) occupied by the words that were used 10 or more times (Table 1) [29]:
- of vocabulary, $I_{kt}$: the ratio of the number of words in the vocabulary with absolute frequency 10 and more, $W_{10}$, to the total number of words in the vocabulary $W$: $I_{kt} = W_{10} / W$ [14];
- of text, $I_{tn}$: the ratio of the sum of absolute frequencies of the words with absolute frequency 10 and more, $W_{10t}$, to the text volume $N$: $I_{tn} = W_{10t} / N$ [14].
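The four indices follow directly from a frequency count. The sketch below is one possible way to compute them; the function name and the dictionary keys are hypothetical, and the threshold of 10 is taken from the definitions above.

```python
from collections import Counter
from typing import Dict, Sequence


def exclusivity_concentration(tokens: Sequence[str], threshold: int = 10) -> Dict[str, float]:
    """Exclusivity (I_wt, I_t) and concentration (I_kt, I_tn) indices of a token list."""
    if not tokens:
        raise ValueError("the text must contain at least one token")
    counts = Counter(tokens)
    n = len(tokens)                                                 # N
    w = len(counts)                                                 # W
    w1 = sum(1 for c in counts.values() if c == 1)                  # W_1
    w10 = sum(1 for c in counts.values() if c >= threshold)         # W_10
    w10t = sum(c for c in counts.values() if c >= threshold)        # W_10t
    return {"I_wt": w1 / w, "I_t": w1 / n, "I_kt": w10 / w, "I_tn": w10t / n}


tokens = ["and"] * 10 + ["on"] * 2 + ["method", "text", "author"]
indices = exclusivity_concentration(tokens)  # N = 15, W = 5, W_1 = 3, W_10 = 1, W_10t = 10
```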
As indicated by the WF, speech gives preference to a small number of units, which are used often [30]. They form the core of any speech subsystem, while most units are of low frequency [31]. This regularity was noticed by Dewey at the beginning of the 20th century, who called it the outweigh law [32]. It was studied further by the American linguist G. K. Zipf, who formulated Zipf's law, which sets the following dependences [33]:
- between the word frequency and its rank in the vocabulary: the more frequent a word, the higher its rank, with $F \times i = \mathrm{const}$, where $F$ is the word frequency in the frequency vocabulary and $i$ is the rank of the word [34];
- between the word frequency and its length: the more frequent a word, the shorter it is, with $k = C \lg r$, where $k$ is the length of a word in phonemes, $C$ is a constant, and $r$ is the rank [35];
- between the word frequency and the number of its meanings: the more frequent a word, the more meanings it has, with $m = C\sqrt{f}$, where $m$ is the number of meanings of a word, $C$ is a constant, and $f$ is the word frequency [36];
- between the word frequency and its origin: the older a word, the more frequent it is [37].
According to the law of the German linguist P. Menzerath, the length of a language structure (word, word combination, super-phrase unity, sentence) is inversely proportional to the length of its components (syllables, words, word combinations, etc.), that is, the longer the language structure, the shorter its components [14]. According to the research of G. Altmann, $y = a x^{b}$, where $y$ is the average length of the constituents, $x$ is the length of a language structure, and $b$ is the indicator that characterizes the dynamics of a change in the length of the components (the law is valid if $b < 0$) [38].
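To check whether the condition $b < 0$ holds for a given set of measurements, the parameters $a$ and $b$ can be estimated. The sketch below uses ordinary least squares on the logarithms, since $\ln y = \ln a + b \ln x$. This fitting procedure is an assumption chosen for illustration and is not prescribed by the cited works, and the data are synthetic.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Estimate (a, b) in y = a * x**b by linear regression on log-transformed data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two points")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    if sxx == 0:
        raise ValueError("all x values are identical")
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sxx  # slope in log-log space
    a = math.exp(my - b * mx)                                   # intercept mapped back
    return a, b


# Synthetic data generated from y = 3 * x**(-0.5), used only to exercise the fit.
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** -0.5 for x in xs]
a, b = fit_power_law(xs, ys)
law_holds = b < 0  # the law is considered valid when b is negative
```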
The Krilov law establishes the relationship between the number of polysemic words and frequency in terms of $p_x$, the probability of using a word that has $x$ meanings, and $\omega$, the average number of meanings of a word in the dictionary [14].
Some major quantitative characteristics of a language are very simple. One example is the difference between the number of words ($10^4$-$10^5$), the number of morphemes (several thousand), the number of syllables (from several hundred to several thousand), and the number of phonemes (from 10 to 80) [31-49]. There is an assumption that such ratios are associated with a property of human memory [39]. We also note that the more frequent a word, the faster a person can recollect it [40]. However, there is no research into how the lexical coefficients of an author's speech change over the period of his creative work [41].

The aim and objectives of the study
The aim of this work is to develop a method for identifying the author in texts in the Ukrainian language based on the linguometry technology.
To accomplish the aim, the following tasks have been set:
- to develop algorithms for identifying the author of a text based on the analysis of the lexical coefficients of the author's speech in the reference text;
- to develop content-monitoring software for identifying the author of texts in the Ukrainian language based on the linguometric analysis of the stop words identified in text content;
- to analyze the results of experimental testing of the proposed content-monitoring method for identifying the author of Ukrainian scientific texts in the technical area.

The method for identifying the style of the author of text content
Linguometry is a branch of applied linguistics that detects, measures, and analyzes the quantitative characteristics of the units of different levels of a language or speech [42]. One of the ways to characterize the literary richness of a text is to evaluate how language units are used at all language levels [43]. This makes it possible to equate the concepts of richness and diversity of speech [44]. The calculation of linguistic diversity coefficients should assume a relationship between such coefficients as lexical diversity, degree (measure) of syntactic complexity [14], speech coherence, and the indices of exclusivity and concentration of a text [45]. Because a coefficient is an absolute value, it is possible, within certain limits, to neglect the length of the compared texts [46]. It is of theoretical interest to study the internal "dynamics" of a text by matching the coefficients from its various sections against the coefficient that is general for the entire text (Table 2) [47]:
- for lexical diversity, the larger the resulting decimal fraction, the higher the lexical diversity of a text [48];
- for syntactic complexity, the larger the fraction (within [0; 1]), the more words in general there are in such a text, and therefore the higher the possibility of diversity of syntactic relationships between the words in a single sentence [49];
- for speech coherence, the coefficient is equal to unity when there are three connective elements (prepositions and conjunctions) in one sentence [50].

Table 2
Lexical diversity: the ratio of the number of words to the total number of word forms. The value of the coefficient lies within [0; 1].
Syntactic complexity: the ratio of the number of sentences to the number of words of a certain text.

It was found [14] that the text of a Ukrainian fairy tale has $K_z = 0.77$, and the text of a scientific article in Ukrainian has $K_z = 3.0$, that is, the coherence in the second text is 3.9 times stronger than in the first one. There are no official standards for the speech diversity coefficients $K_l$ and $K_s$ [51], but the reference point for comparing and evaluating a text within a homogeneous group of texts is the average statistical value of the coefficient for passages of equal length [52]. We accept 100 words as the minimum size (length) of a passage and assume that at this length the coefficients have already stabilized, reflecting the real features of the author's language [53]. The proximity or remoteness of an individual coefficient from the average one serves as the basis for evaluating the speech diversity of the respective text [54]. The texts whose diversity coefficients fall within the area of the root mean square deviation $D$ from a certain mean are considered satisfactory.
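The coherence coefficient and the "satisfactory" band can be sketched as follows. The formula $K_z = (Z + S)/(3P)$, with $Z$ prepositions, $S$ conjunctions, and $P$ sentences, is the one that appears in the paper; treating $D$ as the population standard deviation of the group is an assumption, and the sample values are invented for illustration.

```python
import math
from typing import List, Sequence


def speech_coherence(prepositions: int, conjunctions: int, sentences: int) -> float:
    """K_z = (Z + S) / (3 * P); equals 1 at three connectives per sentence."""
    if sentences <= 0:
        raise ValueError("the passage must contain at least one sentence")
    return (prepositions + conjunctions) / (3 * sentences)


def within_deviation(values: Sequence[float]) -> List[bool]:
    """Flag the values that lie within one root mean square deviation D of the group mean."""
    if not values:
        raise ValueError("at least one value is required")
    mean = sum(values) / len(values)
    d = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))  # assumed definition of D
    return [abs(v - mean) <= d for v in values]


k_z = speech_coherence(prepositions=4, conjunctions=2, sentences=2)
flags = within_deviation([0.5, 0.6, 0.7, 1.5])  # invented coefficients of four passages
```

In this invented group, the last passage lies outside the band and would therefore not be considered satisfactory under the criterion above.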
Analysis and interpretation at the linguistic level of the stylistic features and peculiarities of the writing style of a certain author (or a certain literary epoch) includes the basic stages presented in Algorithm 1 [14, 56].

Algorithm 1. Analysis and interpretation at the linguistic level of the stylistic features and peculiarities of the writing style of a certain author
Stage 1. Selection and primary processing of text content. For selection, text filters are built by parameters (the main language of the text, the text sample volume, the time of publication, the publication source, the format, etc.) [41, 57]. The basic steps of primary text processing are:
- bringing the text to a unified format (such as removing tags if the text was previously published on an Internet resource as a static page);
- eliminating information noise (pictures, formulas, references, abstracts in other languages, etc.), which does not affect the outcome but increases the processing time;
- bringing the text to a unified volume (shortening it if necessary, removing non-informative sections at the beginning and end of the text).
Stage 3. Removal of non-homogeneity of text linguistic units. This solves the problem of non-homogeneity of text linguistic units, for example, with respect to their belonging to different kinds of a language (author's, non-author's, etc.).
Stage 4. Construction of a system of frequency dictionaries based on statistical distributions in the necessary frequency scales. A frequency dictionary is a type of dictionary that contains the number of usages (frequency) of a certain linguistic unit of a language (composition, words, word forms, word combinations, idioms) in various texts of a certain volume [59]. Usually, the absolute and relative frequencies of the usage of language units are presented, and the dictionary entries are placed in descending order of frequency [60].
Stage 5. Search for parameters that adequately reflect the structure of the frequency dictionary. The following parameters make it possible to formulate some basic linguostatistical methods for researching a text [61]:
- the method of anchor words (calculation of the total frequency of usage and finding the percentage of syntactic words: prepositions, conjunctions, particles) [62];
- the punctuation signs method (calculation only of the number of internal and external punctuation signs) [63];
- the method of words (calculation only of the words of a certain length) [64];
- the method of sentences (calculation only of the sentences of a certain length) [65];
- the syntactical method (calculation of punctuation signs, words, and sentences of a certain length) [66];
- the combined method (a combination of the syntactical method and the method of anchor words) [67].
Stage 6. Checking the effectiveness of the parameters. The results obtained for works of known authorship are analyzed and compared in order to identify how the author's stylistics influences the formation of the author's structure of the frequency dictionary by these parameters [68].
Stage 8. Construction of statistic classifications, i.e., author's reference, which reflect stylistic patterns within the works by a certain author or a certain literary style taking into consideration a literary epoch or specifics of the language, in which the analyzed works are written [70].
Stage 9. Interpretation of the results from the standpoint of the stylistic ideas of a specific time and of the general and the author's style, taking into consideration time parameters [71].
Thus, we will also solve the problem of the author's attribution, which we formulate as follows. Let us assume that there is a statistically processed work by an author (the reference work). It is necessary to estimate, with the use of appropriate methods, whether certain passages belong to the reference work. A graphic representation of the relative frequency of occurrence of syntactic words in Passage 4 and in the reference work is shown in Fig. 1. The correlation coefficient for the syntactic words in this case is $R_{e\text{-}U4} = 0.7326$. We will also present the correlation factors for each of the syntactic words for Passages 1-4 (Table 4). Analyzing the correlation factors for syntactic words, we conclude that the probability of belonging to the studied reference is the highest for Passage 4, followed by Passage 2, Passage 1, and Passage 3. We note that for all four passages, consistently high correlations are observed for particles, which can be understood as the lack of influence of particles on the author's style. In addition, we will analyze the frequencies of occurrence of only prepositions and conjunctions for the passages, find the appropriate correlation factors, and compare the results (Table 3). Passage 4 remained the likely candidate to belong to the reference sample, followed with a slight margin by Passage 1 and then Passage 2. Passage 3, as in the previous study, is the least likely to belong to the reference sample. To prove the results, we will turn to [1-4], from which the three passages were taken to be studied.
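The recoverable stages of Algorithm 1 and the attribution step can be sketched as a small Python pipeline: primary processing (Stage 1), a frequency dictionary (Stage 4), the anchor-word profile (Stage 5), and the ranking of passages by correlation with the reference. This is only an illustrative sketch and not the authors' software. The anchor-word list is a short, incomplete sample, the choice of Pearson's coefficient is an assumption (the paper does not name the coefficient used), and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")                     # leftover HTML tags
REF_RE = re.compile(r"\[\d+(?:\s*[-–,]\s*\d+)*\]")  # bracketed citations such as [1, 2]
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

# Illustrative, incomplete sample of Ukrainian anchor words (an assumption).
ANCHOR_WORDS = ("в", "у", "на", "з", "до", "для", "від", "за", "про",
                "і", "й", "та", "а", "але", "що", "як", "або", "чи",
                "не", "ж", "би")


def preprocess(raw, max_chars=10_000):
    """Stage 1: strip tags and citations, normalize spaces, cut to a unified volume."""
    text = unescape(TAG_RE.sub(" ", raw))
    text = REF_RE.sub(" ", text)
    return " ".join(text.split())[:max_chars]


def tokenize(text):
    """Lowercase word forms; no lemmatization is attempted in this sketch."""
    return [w.lower() for w in WORD_RE.findall(text)]


def frequency_dictionary(tokens):
    """Stage 4: (unit, absolute, relative) entries in descending order of frequency."""
    total = len(tokens)
    ordered = sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))
    return [(unit, freq, freq / total) for unit, freq in ordered]


def anchor_profile(tokens):
    """Stage 5, anchor-word method: relative frequency of each anchor word."""
    if not tokens:
        raise ValueError("empty passage")
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in ANCHOR_WORDS]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("a profile with zero variance cannot be correlated")
    return cov / (sx * sy)


def rank_passages(reference, passages):
    """Order passages by the correlation of their anchor profile with the reference."""
    ref = anchor_profile(tokenize(preprocess(reference)))
    scores = {name: pearson(ref, anchor_profile(tokenize(preprocess(text))))
              for name, text in passages.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A call such as `rank_passages(reference_text, {"Passage 1": text_1, "Passage 2": text_2})` returns the passages ordered from the most to the least likely to belong to the reference, which mirrors the comparison of correlation factors described above.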

Results of research into identifying the author in the Ukrainian scientific and technical texts
In the course of the research, we developed a system that allows selecting the language or languages of the analyzed content; it is implemented on the Victana website [25] (Fig. 2). Analyzing the components of the formulas for estimating the richness of a work, we conclude that it is necessary to find such magnitudes as the number of words and word forms, sentences, conjunctions, and prepositions, as well as the number of words with a frequency of 1 and of not less than 10. The algorithm for the analysis of the whole text is launched on the server once the calculation of the text diversity coefficients is started (Algorithm 2). For convenience, we enter the results into a table. The generated table (Table 4) and the obtained results of the study are displayed on the screen of the information resource. Based on the above, we will estimate the richness of passages from scientific articles in the technical field published in Visnyk of the National University "Lviv Polytechnic" in the series "Information systems and networks", each written by one author over the period of 2001-2017 [25], using the coefficients of diversity and speech coherence and the indices of exclusivity and concentration. For the analysis, we select the first part (10,000 characters) of each article (Algorithm 3).
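The magnitudes listed above can be collected in one pass over the first 10,000 characters of an article. The sketch below is a simplified stand-in for Algorithm 2, whose stages are not reproduced here: the preposition and conjunction lists are short illustrative samples, sentences are split naively on terminal punctuation, and the function name is hypothetical.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")
SENTENCE_END_RE = re.compile(r"[.!?…]+")  # naive split; abbreviations are not handled

# Short illustrative samples, not complete lists (an assumption).
PREPOSITIONS = frozenset({"в", "у", "на", "з", "із", "до", "для", "від", "за",
                          "про", "під", "над", "при", "без", "через", "між"})
CONJUNCTIONS = frozenset({"і", "й", "та", "а", "але", "що", "як", "або",
                          "чи", "бо", "щоб", "якщо"})


def text_parameters(text, max_chars=10_000):
    """Count N, W, P, Z, S, W1 and W10 for the first max_chars characters of a text."""
    passage = text[:max_chars]
    words = [w.lower() for w in WORD_RE.findall(passage)]
    counts = Counter(words)
    sentences = [s for s in SENTENCE_END_RE.split(passage) if WORD_RE.search(s)]
    return {
        "N": len(words),                                   # word forms (text volume)
        "W": len(counts),                                  # distinct words
        "P": len(sentences),                               # sentences
        "Z": sum(counts[w] for w in PREPOSITIONS),         # prepositions
        "S": sum(counts[w] for w in CONJUNCTIONS),         # conjunctions
        "W1": sum(1 for c in counts.values() if c == 1),   # words with frequency 1
        "W10": sum(1 for c in counts.values() if c >= 10), # words with frequency >= 10
    }


params = text_parameters("Метод працює на тексті. Автор і стиль для аналізу.")
```

The resulting dictionary holds everything needed for the diversity, coherence, exclusivity, and concentration formulas, so the coefficients can be derived from it without a second pass over the text.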

Algorithm 3. Analysis of statistics of functioning of the system of stop words identification from 215 scientific articles of the technical area
Stage 1. Analysis of 100 scientific articles to determine the range of the optimal size of the text. First, the full volume of the texts was analyzed; then passages of different numbers of characters from these texts were analyzed. The results showed that the optimal passage size lies in the range of [100; 10,000] characters. If there are fewer than 100 characters, the obtained information is non-informative: very often the values of the coefficients of different authors are similar, while those of one author differ significantly across texts. If there are more than 10,000 characters, the coefficients do not change significantly, but the texts available for comparison differ in length and few of them are long, so a maximum of 10,000 characters was selected for the analysis.
Stage 5. Analysis of the obtained speech coefficients of more than 100 different authors over the period of 2001-2017 to determine the subset of authors whose style is similar to that of 4 reference papers (collective papers whose authors are among the authors of the studied one-author papers).
Stage 6. Analysis of the results obtained at Stage 5: checking whether the actual authors of these reference texts are among the authors found, and selecting the best algorithm for studying the author's style in Ukrainian scientific and technical texts based on the technology of quantitative linguistics.
For the accuracy of the research, it is necessary to analyze whether the time of publication influences the text diversity coefficients, that is, whether these coefficients change over time for a sample of the same authors and texts. First, we analyze how the total number of words in passages of the same size changes in the range 2001-2017. As one can see, over time the same authors use shorter words more often (Fig. 3, a).
Over time, the lexical diversity coefficient $K_l$ does not change substantially (Fig. 3, b-d). Similarly, the syntactic complexity coefficient $K_s$ does not change substantially over time either. The speech coherence coefficient $K_z$, however, changes slightly over the 16 years. At the beginning (2001), it varies in the range of [0.5; 1.2], and at the end of the period in the range of [0.4; 0.9] (Fig. 4).
Similarly, we compare the distributions of the indices of exclusivity and concentration (Fig. 5). While the scope of the distribution does not change significantly over time for $I_{wt}$, significant changes were recorded for $I_{kt}$. Over time, the authors of these papers more often repeat some terms more than 10 times in their papers, narrowing down the circle of their research. Fig. 5, d shows the result of the analysis of the speech coefficients for passages of equal size in the range 2001-2017 as the minimum, maximum, and mean values for this period (determining the fluctuations of the values in this period). More substantial fluctuations are observed for $K_z$ (Fig. 6). We analyze separately the distribution of the usage of all word forms (Fig. 6, d), of the words used once, and of the words used 10 times in the studied texts for passages of equal size in the range 2001-2017 (Fig. 7). Fig. 7, b shows the analysis of the usage of prepositions, conjunctions, and separate sentences in the studied texts for passages of equal size in the range 2001-2017, where $Z$ is the number of prepositions, $S$ is the number of conjunctions, and $P$ is the number of separate sentences. According to Fig. 7, c, over time the authors use shorter sentences to describe the subject area than at the beginning of the studied period. While the number of prepositions decreases, the distribution of the use of conjunctions does not change essentially (Fig. 7, e). Fig. 8, a-b shows the analysis of a change in the dynamics of the use of words in the studied texts within the specified period. Fig. 8, c, 9, d show the result of the analysis of a change in the dynamics of the use of prepositions, conjunctions, and sentences in the studied texts for the specified period.
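The minimum, maximum, and mean values per year, as used for such diagrams, can be obtained with a short grouping routine. The sketch below is illustrative: the record format and the function name are assumptions, and the sample values are merely the range bounds of $K_z$ quoted for 2001 and the end of the period, not the actual data set.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def yearly_summary(records: Iterable[Tuple[int, float]]) -> Dict[int, Tuple[float, float, float]]:
    """Group (year, coefficient) pairs by year and return (min, max, mean) for each year."""
    by_year = defaultdict(list)
    for year, value in records:
        by_year[year].append(value)
    return {year: (min(values), max(values), mean(values))
            for year, values in sorted(by_year.items())}


# Illustrative records: only the range bounds mentioned in the text are used.
records = [(2001, 0.5), (2001, 1.2), (2017, 0.4), (2017, 0.9)]
summary = yearly_summary(records)
```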
It was shown that, within the specified period of an author's work, there are dynamics of change not only in the speech coefficients of the author's text, but also in their separate components, such as the number of word forms per total number of words, the number of conjunctions, prepositions, and sentences in a passage of a fixed volume, and the number of word forms that are used only once and of those used more than 10 times.

Discussion of results of research into identifying the author of the Ukrainian scientific and technical texts
To identify more accurately the magnitude of the increase in each studied parameter, it is necessary to carry out more substantial research on a larger sample of papers written by one author and to extend the research into the creative work of different authors over a longer period of their creative work.
Then, we analyze the sample for the author's style and select the best algorithm to determine the style of the author. In Fig. 9, a, the diagram displays the identification of the author's style by the speech coefficients. In Fig. 9, b, the diagram with accumulation displays changes in the total sum of the speech coefficients. In Fig. 9, c, the normalized diagram reflects a change in the contribution of each value by the speech coefficients.
As we can see, the coefficients of the author's speech, except for $K_z$, do not change much depending on the style of a spe- And the larger such a set, the more difficult it is to identify a specific style of the author without any additional parameters. Then, we analyze the sample for the author's style by such additional parameters as the total number of sentences in passages that are equal in volume, the number of words in the sample, and the frequency and occurrence of prepositions and conjunctions. In Fig. 10, the diagram displays the identification of the author's style by the additional parameters of the author's speech.
In Fig. 10, b, the diagram with accumulation displays changes in the total sum according to the parameters. In Fig. 10, c, the normalized diagram displays a change in the contribution of each value by the parameters. As we see, the introduction of the additional parameters decreases the set of authors whose speech styles are similar to the Ukrainian scientific and technical style of publications. We introduce the additional parameters, such as the number of sentences, conjunctions, and prepositions (Fig. 11), and analyze the dynamics (Table 5). Table 5 shows the results of the analysis of the style of 94 authors in papers written by one author (over 200 papers) in the technical field over the period of 2001-2017. For each author, we derive the arithmetic mean of each coefficient and parameter of speech based on the analysis of several of his works within the specified period. The styles of 4 articles by one team of authors are listed at numbers 95-98 (highlighted in yellow in the table); some of these authors appear in Table 5 at numbers 6 and 30 (highlighted in blue in the table).
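The averaging per author and the comparison with a reference paper can be sketched as follows. The paper does not state which similarity measure is used to compare styles, so the Euclidean distance over unscaled coefficients below is purely an assumption, as are the function names and the sample values.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Coefficients = Dict[str, float]


def author_profiles(papers: Sequence[Tuple[str, Coefficients]]) -> Dict[str, Coefficients]:
    """Arithmetic mean of every coefficient over all papers of each author."""
    grouped = defaultdict(list)
    for author, coefficients in papers:
        grouped[author].append(coefficients)
    return {author: {name: mean(row[name] for row in rows) for name in rows[0]}
            for author, rows in grouped.items()}


def nearest_authors(profiles: Dict[str, Coefficients], reference: Coefficients,
                    k: int = 3) -> List[Tuple[str, float]]:
    """The k authors closest to the reference paper (Euclidean distance, an assumption)."""
    def distance(profile: Coefficients) -> float:
        return math.sqrt(sum((profile[name] - value) ** 2
                             for name, value in reference.items()))
    ranked = sorted(((author, distance(p)) for author, p in profiles.items()),
                    key=lambda pair: pair[1])
    return ranked[:k]


# Invented values for two hypothetical authors.
papers = [("A", {"K_l": 0.5, "K_z": 0.8}),
          ("A", {"K_l": 0.7, "K_z": 1.0}),
          ("B", {"K_l": 0.2, "K_z": 0.4})]
profiles = author_profiles(papers)
closest = nearest_authors(profiles, {"K_l": 0.6, "K_z": 0.9}, k=2)
```

In practice the parameters would need to be brought to a comparable scale before measuring distance, because counts such as the number of sentences are much larger than the coefficients.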
However, the sample of texts (more than 200) and the number of authors (94) are too small to guarantee exact results. The study should be extended to a greater number of texts, which are not always easily accessible. In the future, it is necessary to improve the method by analyzing texts with the methods of stylemetry and glottochronology.

Conclusions
1.The method for identifying the author of the text based on the analysis of coefficients of the lexical author's speech in the reference sample passage of the author's text was developed.The algorithm of lexical analysis of Ukrainian texts and the algorithm of the parser of text content based on analysis of each word taking into consideration its part of speech and declension was designed.That is, when analyzing the linguistic units of the type of words, belonging to a part of speech and declension within this part of speech were taken into consideration.For this, the analysis of flexions of these words for classification, separation of the base for the formation of the corresponding alphabetical-frequency dictionaries was performed.Filling these dictionaries was subsequently considered at the following stages of determining the authorship of a text as calculation of parameters and coefficients of the author's speech.Syntactic words (stop or anchor) words are most essential for an individual style of an author, as they are not related to the subject and content of the publication.The algorithm for determining stop words in the text content based on linguistic analysis of text content was developed.Its features are the adaptation of morphological and syntactic analysis of the lexical units to the features of the structure of Ukrainian words/texts.The theoretical and experimental substantiation of the method for content monitoring and determining stop words of the Ukrainian text were presented.The method is aimed at automatic detection of significant stop words of the Ukrainian text at the expense of the proposed formal approach to implementation of parsing of text content in the scientific and technical area.
2. The approach to the development of software of content monitoring to identify the author in Ukrainian scientific and technical texts based on NLP, stylemetry, and Web Mining was proposed.More than 200 scientific publications written by one author from all issues in Visnyk of the National University "Lviv Polytechnic" from the series "Information systems and networks" (Ukraine) over the period of 2001-2017 were analyzed by the developed system.The internal "dynamics" of these texts of randomly selected authors was studied through the analysis of coefficients of speech coherence, lexical diversity, and syntactic complexity, as well as indices of concentration and exclusivity for the first k, n and m (without a title) words of the author's passage and the one that was analyzed.
3. The results of experimental testing of the proposed content-monitoring method for identifying the author of Ukrainian scientific texts of the technical profile were studied. We compared the results for a set of 200 single-author papers in the technical area by more than 100 different authors over the period 2001-2017 to determine whether and how the text diversity coefficients of these authors change across different periods of time. Using the developed software, we obtained the results of experimental testing of the proposed content-monitoring method for identifying and analyzing stop words in Ukrainian scientific texts of the technical area based on Web Mining technology. It was found that, for the selected experimental base of more than 200 papers, the best results according to the density criterion are achieved by analyzing an article without its initial compulsory information, such as abstracts and keywords in different languages, as well as the list of …

$K_z = (Z + S)/(3P)$.

Stage 15. Displaying the results on the Internet page of the Victana website [25].
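The speech coherence coefficient $K_z = (Z + S)/(3P)$ given above can be computed directly from word and sentence counts. The following is a minimal Python sketch, not the authors' implementation. It assumes that Z is the number of prepositions, S the number of conjunctions, and P the number of sentences (this reading matches the style parameters listed in the abstract, but the symbols are not defined in this section). The word lists, the sentence-splitting rule, and all function names are hypothetical and purely illustrative.

```python
import re

# Tiny illustrative word lists (assumptions; a real system would rely on
# full dictionaries of Ukrainian prepositions and conjunctions).
PREPOSITIONS = {"в", "у", "на", "до", "з", "для"}
CONJUNCTIONS = {"і", "та", "але", "що", "або"}


def coherence_coefficient(z: int, s: int, p: int) -> float:
    """Return K_z = (Z + S) / (3 * P) for the given counts."""
    if p <= 0:
        raise ValueError("the number of sentences must be positive")
    return (z + s) / (3 * p)


def coherence_from_text(text: str) -> float:
    """Estimate K_z from raw text using naive tokenization (assumption)."""
    # Split on runs of sentence-final punctuation and drop empty fragments.
    sentences = [part for part in re.split(r"[.!?]+", text) if part.strip()]
    # \w matches Cyrillic letters for Python 3 str patterns.
    words = re.findall(r"\w+", text.lower())
    z = sum(word in PREPOSITIONS for word in words)
    s = sum(word in CONJUNCTIONS for word in words)
    return coherence_coefficient(z, s, len(sentences))


if __name__ == "__main__":
    sample = "Метод працює на текстах і словниках. Автор пише для журналу та збірника."
    print(round(coherence_from_text(sample), 3))
```

Keeping the formula in a separate function from the counting step means the same coefficient can be reused with counts taken from the alphabetical-frequency dictionaries instead of the naive tokenizer shown here.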

Fig. 1. Relative frequency of occurrence of syntactic words in Passage 4 and in the reference sample

Stage 2. Analysis of over 200 single-author papers in the technical area by over 50 different authors over the period 2001-2017 to determine whether and how the text diversity coefficients of these authors change across different periods of time.
Stage 3. Analysis of over 200 single-author papers in the technical area by over 100 different authors over the period 2001-2017 to determine whether and how the text diversity coefficients of these authors change across different periods of time.
Stage 4. Analysis of over 200 single-author papers in the technical area by over 100 different authors over the period 2001-2017 to determine the speech style of these authors.

Fig. 3. Distribution for passages of equal size in the range of 2001-2017: a - of the words; b - of the speech coefficient $K_l$; c - $K_s$; d - $K_z$

Fig. 4. Comparison of the distributions of the speech coefficients $K_l$, $K_s$ and $K_z$

Fig. 8. Result of the analysis of changes in the use of words in the studied texts within a certain period of time: a - dynamics of change in speech parameters; b - distribution of the values of speech parameters within the specified research period; c - dynamics of change in the use of word combinations, prepositions and sentences in the studied texts; d - distribution of the values of the use of word combinations, prepositions and sentences for the specified period of research of authors' styles

Fig. 9. Detailed analysis: a - of the process of identifying the author's style by speech coefficients over time; b - of the change in the total sum of the speech coefficients; c - of the change in the contribution of each value of the speech coefficients

Table 1
Results of speech coefficients according to the styles of the Ukrainian language [14]

Table 3
Correlation factors for a syntactic part of speech and each of the passages
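
The correlation factors referred to in this caption compare how often syntactic words occur in a passage and in the reference sample. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes that relative frequency means occurrences divided by the total number of tokens, and that the correlation factor is the Pearson coefficient (the section does not name the measure). The function names and the token lists are hypothetical.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(tokens, stop_words):
    """Relative frequency of each stop word, in the order given by stop_words."""
    if not tokens:
        raise ValueError("the token list must not be empty")
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in stop_words]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


if __name__ == "__main__":
    # Hypothetical stop-word list and token sequences (not data from the study).
    stop_words = ["і", "на", "що"]
    reference = ["і", "метод", "на", "і", "що", "текст", "і", "на"]
    passage = ["на", "і", "автор", "і", "що", "стиль"]
    ref_freq = relative_frequencies(reference, stop_words)
    pas_freq = relative_frequencies(passage, stop_words)
    print(pearson(ref_freq, pas_freq))
```

A value close to 1 would indicate that the passage uses syntactic words in proportions similar to the reference sample; any decision threshold would have to be chosen experimentally.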

Table 4
Example of a table generated by the algorithm for analysis of the style of the author of a publication at the Internet site Victana [25]

Table 5
Result of operation of the algorithm for analysis of the style of the author of the publication at the Internet site Victana [25]

Testing the proposed method for identifying the author's style on other categories of texts (scientific, humanitarian, artistic, journalistic, etc.) requires further experimental research.