ANALYSIS OF STATISTICAL METHODS FOR STABLE COMBINATIONS DETERMINATION OF KEYWORDS IDENTIFICATION



Introduction
In modern intelligent linguistic systems, it is important to determine stable word combinations effectively in order to identify a set of keywords while processing Web resources [1]. A well-chosen set of stable word combinations is used in information retrieval (IR), SEO and Web-mining technologies, natural language processing (NLP) and automatic machine translation. It is also essential for identifying content by means of specific natural language texts and rubrics, as well as for reflexive and automatic analysis of comments on published products. A new direction is the automatic processing of texts while integrating data from various sources in different fields, including Internet tourism [2]. Stable word combinations are used in algorithms for correct tokenization, compiling dictionaries (lexicography), automatic translation, learning foreign languages (міцний чай, strong tea, vs. міцний сон, fast sleep), and distinguishing terminology [3].
Analysis of stable word combinations is used for identifying relevant content, indexing in IR, tokenization, content categorization, creating a search image of some content, and constructing thematic ontologies [4]. Usually, this work is the prerogative of the person who moderates the Web resources [5]. Automating the extraction of data or knowledge from natural language content using NLP methods greatly reduces the time and the number of Web resources needed to obtain the desired result [6]. The use of artificial intelligence (AI) methods in the linguistic processing of natural language texts is usually effective only after high-quality morphological and syntactic parsing of these texts [7]. While for English texts these tasks are readily solved by a simple parser and the Porter algorithm, for Slavic languages, including Ukrainian, this is not so easy [8]. Therefore, the problem arises of choosing an optimal statistical method for determining stable word combinations for identifying keywords in the development of Ukrainian-language Web resources [9].
The use of knowledge engineering for effective NLP improves the quality of the results of research on texts [10]. This entails developing new NLP approaches and techniques, including the automatic determination of stable word combinations for identifying keywords when processing Web resources [11]. With the proliferation of Internet services and their introduction into everyday life, IR results have become redundant. This so-called informational noise harms Internet business and irritates regular users of these services. The daily IR results are the following: Google > 8 billion pages, Yandex > 600 million pages, and 2.5 million sites [12]. Therefore, a high-quality and optimal definition of stable word combinations as a set of keywords in Ukrainian and English texts will significantly reduce the time needed to receive relevant search results in response to user queries.

Literature review and problem statement
Modern NLP methods are increasingly used not only in AI and computational linguistics but also in Internet environments, especially in IR (Fig. 1). Today, IR implies not only the representation, storage and organization of information elements and access to them. It also focuses on the needs of regular and potential users for online information, with the emphasis on finding relevant content rather than data (Fig. 2) [13]. The main models and methods of IR are indexing, the Boolean model, the vector model and the evaluation of search quality [14]. The average complexities of a direct search (brute force, O(n+m)) and of a more sophisticated search (Boyer-Moore, O(n/m)) [14] were tested experimentally long ago. The effectiveness of the indexing method is directly proportional to the effectiveness of creating a search image (logical representation) of a document or content. This, in turn, affects the efficiency of presenting relevant information on the Internet [14].
The effectiveness of any NLP method depends directly on the quality of the prior processing of the text content. This, in turn, depends on extracting and/or receiving text (HTML, PDF...) [15]; encoding and language [16]; breakdown into words and sentences (tokenization); elimination of stop words [17]; and stemming as determining the word form [18]. Tokenization is the process of demarcating and classifying sections of a string of input characters, such as dates and numbers. The resulting tokens are then passed on to other forms of processing. The process is considered a subtask of analysing input data [19]. Stop words (or noise words) are words that carry no meaningful load, so their usefulness and role in searching are not significant. A text, in turn, is an unstructured set of meaningful words (a "bag of words"), whereas stop words belong to the functional parts of speech, that is, they are prepositions, conjunctions and particles (а, га, ай, ау, ах, ба, без, поблизу, брр, зась, ніби, б, бути, в, ви, ваш, вглиб, до того ж, уздовж, адже, замість, поза, усередині, як, біля, навколо, геть,...) [20].
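The preprocessing steps described above (tokenization with token classes, followed by stop-word elimination) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular expression, the three token classes and the very short stop-word list are assumptions chosen for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list (a real list would be much longer).
STOP_WORDS = {"а", "без", "б", "в", "до", "як", "біля", "за", "і", "що", "на"}

DATE = r"\d{1,2}\.\d{1,2}\.\d{4}"          # assumed date format: dd.mm.yyyy
NUMBER = r"\d+(?:[.,]\d+)?"                # integers and decimals
WORD = r"[^\W\d_]+(?:[-'’][^\W\d_]+)*"     # letters, with inner hyphen or apostrophe
TOKEN_RE = re.compile(f"{DATE}|{NUMBER}|{WORD}")


def tokenize(text):
    """Split text into (token, kind) pairs, where kind is date, number or word."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        tok = match.group()
        if re.fullmatch(DATE, tok):
            kind = "date"
        elif tok[0].isdigit():
            kind = "number"
        else:
            kind = "word"
        tokens.append((tok, kind))
    return tokens


def content_words(text):
    """Keep only word tokens that are not stop words."""
    return [tok for tok, kind in tokenize(text)
            if kind == "word" and tok not in STOP_WORDS]
```

For instance, `content_words("Система рубрикує контент за ключовими словами")` drops the preposition за and keeps the remaining five words.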
An effective IR model depends heavily on the quality and method of presenting text files and content [21], of setting the informational needs (queries) of users, and of estimating the proximity between the query and the document [22].
The Boolean model of IR considers content as a set of words (terms) and a query as a Boolean expression, for example "(кішка OR пес) AND корм" or "птаха AND NOT військовий" [14]. Processing a query in this model is an operation on the sets that correspond to the words (terms) (Table 1).
Table 1
An example of the Boolean model (keywords found in the articles according to [23])
The advantages of the Boolean model are its simplicity and its convenience for those who are familiar with logical operators. The disadvantage is that this model is too "contrast-based" (in terms of both content representation and relevance): a document either matches a query or does not.
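The two Boolean queries quoted above can be evaluated directly as set operations over an inverted index. The sketch below assumes a toy collection of four invented documents; only the queries themselves come from the text.

```python
from collections import defaultdict


def build_index(docs):
    """Map every term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical toy collection, invented for the example.
docs = {
    1: "кішка їсть корм",
    2: "пес їсть корм",
    3: "птаха співає",
    4: "військовий птаха літак",
}
index = build_index(docs)

# "(кішка OR пес) AND корм": union, then intersection
q1 = (index["кішка"] | index["пес"]) & index["корм"]

# "птаха AND NOT військовий": set difference
q2 = index["птаха"] - index["військовий"]
```

Every document is either in the answer set or not, which is exactly the "contrast-based" behaviour mentioned above: no ranking is produced.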
The vector model represents IR content and a query as vectors in the space of words (terms), where each vector component is the weight of a word for the document (query). The model ranks by a measure of proximity, the cosine of the angle between the vectors (Fig. 3):

$$\cos(d, q) = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}},$$

where $d_i$ is the weight of term i in the content (frequency of use in the content/collection) and $q_i$ is the weight of term i in the query. A well-known example of the vector model is the TF×IDF approach (TF is term frequency, IDF is inverse document frequency). The basic version of TF×IDF [14] uses avg_dl, the average length of a document; c, the size of the collection; and a parameter b in the range 0…1.

Fig. 3. The vector pattern of IR
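The cosine ranking of the vector model can be illustrated with a short sketch. The weighting used here is the plain textbook variant, relative term frequency multiplied by log(collection size / document frequency); this is an assumption made for the example and is not the length-normalized formula of [14].

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document (textbook TF x IDF)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return vectors


def cosine(d, q):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)
```

A term that occurs in every document receives weight zero under this weighting, so it does not contribute to the ranking.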
The advantage of the vector model is its effectiveness in processing primary static collections. It also allows for partial matches. The disadvantage of the model is that it is easily attacked (spammed) and does not work well on short texts. However, the Web is an uncontrolled collection (Fig. 4) characterized by large volumes of content, various formats, variety (language, themes, etc.), high competition (spam), and the presence of clicks and links (PageRank). Therefore, the two previous IR models do not resolve the problem of the quality of searching for relevant content. The basis of the quality assessment of IR is the notion of relevance (compliance with the information needs) of the content sought. Quality is measured by precision $P = a/b$ and recall $R = a/c$, where a is the number of relevant items in the answer, b is the number of all items in the answer, and c is the number of all relevant items (Fig. 5).
Fig. 4. The method of general information search pool
Fig. 5. The 11-point graph of P/R information retrieval results
The main initiators of the IR evaluation methods (Fig. 4, 5) are TREC (Text REtrieval Conference, trec.nist.gov) and CLEF (Cross-Language Evaluation Forum, www.clef-campaign.org).
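The relevance-based quality measures can be made concrete with a small sketch. It assumes the usual definitions of precision (relevant items in the answer divided by all items in the answer) and recall (relevant items in the answer divided by all relevant items), with a, b and c as defined in the text; the function name is hypothetical.

```python
def precision_recall(answer, relevant):
    """Return (precision, recall) for a retrieved answer and a set of relevant items."""
    answer, relevant = set(answer), set(relevant)
    a = len(answer & relevant)   # relevant items in the answer
    b = len(answer)              # all items in the answer
    c = len(relevant)            # all relevant items
    precision = a / b if b else 0.0
    recall = a / c if c else 0.0
    return precision, recall
```

With an answer of four documents, two of which are among three relevant ones, precision is 2/4 and recall is 2/3.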

The aim and objectives of the study
The aim of the work is to analyse statistical methods for developing an optimal approach to determining stable word combinations in identifying keywords when developing and processing Ukrainian-language Web-resources based on the technology of computational linguistics.
To achieve this aim, the following tasks were set:
- to develop a method for determining stable word combinations while identifying keywords in Ukrainian-language texts based on the analysis of lexical speech coefficients in standard content fragmentation;
- to devise a formal approach to designing content monitoring software to determine stable word combinations when identifying keywords in Ukrainian texts based on Web Mining and NLP;
- to obtain and analyse the results of experimental testing of the proposed content-monitoring method for determining stable word combinations in identifying keywords in Ukrainian-language scientific texts on technical matter.

The method for determining stable word combinations when identifying keywords for text content
The method for determining stable word combinations consists of the following phases: morphological analysis (MA), syntactic analysis (SA), keyword selection, and stability analysis of word combinations from the set of keywords. Fig. 6 lists the main steps in determining stable word combinations when identifying keywords for text content.

Fig. 6. A flowchart of linguistic analysis of Ukrainian-language texts to identify stable word combinations as keywords
Stage 1. The aim of MA is to define the keyword equivalence classes in IR [30]. MA methods for identifying keywords are procedural, tabular, and statistical stemming, or various combinations of them. One of the well-known MA algorithms is the Porter stemmer [31], a stemming algorithm published by Martin Porter in 1980. The original version of the stemmer is for English and was written in BCPL. Stemming is the process of reducing a word to its stem by discarding auxiliary parts such as an inflexion or a suffix. For Ukrainian texts, MA works best with a combination of the procedural, tabular and statistical approaches [32]. In the procedural approach to MA, the emphasis is on looking words up in dictionaries of stems and in full-form dictionaries (FFDs). The MA algorithm then consists of three main stages: the search in the FFD, the selection of the stem, and the search for the stem in the dictionary. Examples of the tabular approach are вовка → вовк (masc., animate, sg., [gen.|acc.]); не → не (particle); годуй → годувати (imperf., imperat., sg.); в → в (preposition). Examples of inflexion models for Ukrainian words are лев masc. 1*b (animal); лев masc. 1*a (currency); стриже imperf. 8*b (-г-); гостьова femin. 4а (п). The basis of most machine MA of the Ukrainian language is a tree or a finite state automaton (Fig. 7) [33].
The statistical stemmer is based on the probability of determining the stem of the word. The main rule is one vowel in the stem of the word. The types of words are determined by the forms of their inflexions (Fig. 8).
Features of the algorithm. The algorithm works with separate words, so the context in which a word is used is unknown. Other linguistic categories that are unavailable are the word structure (root, suffix, etc.) and the part of speech (noun, adjective, etc.). We currently have the following techniques for analysing words:
- the ending is removed from the word: for example, removing the ending увати turns критикувати into критик;
- the word has a stable ending: words with this ending are left unchanged, for example, ск and the invariable words блиск, тиск, обеліск, etc.;
- the word changes its ending: this rule applies to words in which certain letters drop out (ядро and ядер, where the ending ер changes to р) or change (чоловік and чоловіче, where к changes into ч);
- the word corresponds to a stable (regular) expression: this is an attempt to combine several rules into one complex rule; for example, the code contains expressions similar to (ов)*ува(в|вши|вшись|ла|ло|ли|ння|нні|нням|нню|ти|вся|всь|лись|лися|тись|тися);
- the word does not change during stemming, but it is an exception to the rules: it is necessary to maintain a dictionary of exception words, for example, віче, наче;
- the word changes with stemming, but it is also an exception: it is necessary to keep two forms of the word in the dictionary at once (original and stemmed); for example, the word відер should change to відр, although other words ending in ер are not stemmed in this way (авіадиспетчер, вітер, гравер, etc.);
- short words remain unchanged: the functional parts of speech (prepositions, conjunctions, particles) are usually very short words and are ignored by the algorithm (words of up to 2 letters inclusive).
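The three stages of the procedural approach (search in the FFD, selection of the stem, search for the stem in the dictionary) can be sketched as below. The dictionaries are hypothetical miniatures built from the examples in the text, and the list of endings is an assumption; a real analyser would load full dictionaries.

```python
# Hypothetical full-form dictionary (FFD): word form -> (lemma, grammatical tags).
FULL_FORMS = {
    "вовка": ("вовк", "noun, masc., animate, sg., gen.|acc."),
    "не": ("не", "particle"),
    "годуй": ("годувати", "verb, imperf., imperat., sg."),
    "в": ("в", "preposition"),
}

# Hypothetical stem dictionary: stem -> lemma.
STEMS = {"критик": "критикувати"}

# Assumed list of endings that may be cut off to obtain a stem.
ENDINGS = ["увати", "ував", "ує"]


def analyse(word):
    """Return (lemma, note) for a word form, or None if the word is unknown."""
    word = word.lower()
    # Stage 1: search in the full-form dictionary.
    if word in FULL_FORMS:
        return FULL_FORMS[word]
    # Stage 2: select a stem by cutting off the longest matching ending.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[:-len(ending)]
            # Stage 3: search for the stem in the stem dictionary.
            if stem in STEMS:
                return (STEMS[stem], "found via stem dictionary")
    return None
```

Here `analyse("вовка")` is answered directly by the FFD, while `analyse("критикує")` falls through to the stem dictionary.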
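Several of the techniques listed above can be combined into a small rule-based stemmer. The sketch below is only an illustration under stated assumptions: the order in which the rules are applied, the function name and the tiny rule tables are hypothetical, and the letter-alternation rule (ядро/ядер, чоловік/чоловіче) is left out for brevity.

```python
import re

# Exception words that must not be changed (dictionary of exceptions).
UNCHANGED = {"віче", "наче"}

# Exceptions stored in two forms at once: original -> stemmed.
IRREGULAR = {"відер": "відр"}

# Stable endings: words ending in them are left as they are.
STABLE_ENDINGS = ("ск",)

# The complex expression quoted in the text, anchored at the end of the word.
VERB_ENDING = re.compile(
    r"(ов)*ува(в|вши|вшись|ла|ло|ли|ння|нні|нням|нню|ти|вся|всь"
    r"|лись|лися|тись|тися)$"
)


def stem(word):
    """Apply the rules in a fixed (assumed) order and return the stem."""
    word = word.lower()
    if len(word) <= 2:            # short functional words are ignored
        return word
    if word in UNCHANGED:
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith(STABLE_ENDINGS):
        return word
    return VERB_ENDING.sub("", word)  # cut the ending if the expression matches
```

Checking the exception dictionaries before the general rules keeps words such as відер from being handled like the other words in ер.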

Fig. 7. Methods for storing MA results: a is a tree and b is a finite state automaton (FSA)
All these techniques are used for groups of words that generate and illustrate the rules of stemming. However, this greatly complicates the keyword search algorithm. First, it is necessary to take into account the widespread endings (not the traditional inflexions, as parts of the word), that is, the sequences of letters in which words end. Tables 2, 3 contain endings from 1 to 4 letters in length. Endings of five or more letters are not given, as there are few such words (at most 6,837 for the five-letter ending йтесь, 4,656 for six letters, etc.). This serves as a kind of map for designing the stemmer. For the search algorithm to be effective, it is necessary to construct a static tree of endings and to cover all branches of the tree [34]. The level of detail of the tree varies within 500-600 words with a common ending.
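A table of widespread endings and the corresponding static tree can be derived from a word list roughly as follows. This is a sketch under assumptions: endings are taken simply as the last 1 to 4 letters of each word, and the tree is stored as nested dictionaries read from the last letter backwards, with the key "#" (a hypothetical convention) holding the frequency of the ending.

```python
from collections import Counter


def ending_counts(words, max_len=4):
    """Count how many words end in each letter sequence of length 1..max_len."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]] += 1
    return counts


def build_ending_tree(counts):
    """Build a tree of endings; each level adds one more letter from the right."""
    tree = {}
    for ending, freq in counts.items():
        node = tree
        for ch in reversed(ending):
            node = node.setdefault(ch, {"#": 0})
        node["#"] = freq
    return tree
```

In such a tree the branch и → т holds the number of words ending in ти, so a branch can be cut off as soon as its frequency drops below the chosen level of detail.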
Table 3
A static tree of endings the total proportion of which is less than 1 %: р (2,709), ч (959), г (636) …
Stage 2. Syntax is the set of rules for combining words into correct expressions such as word combinations and sentences [35]. The task of SA (a syntactic analyser, or parser) is to construct the syntactic structure of an input sentence [36]. Implementing SA involves dictionaries (data on individual units of language), formal rules, and interaction with adjacent levels of processing (MA, semantic analysis). SA often uses the rules of a context-free grammar (CFG) $G = \langle N, T, X, R \rangle$, where N is the set of nonterminal characters, T is the set of terminal characters, for example T = {система, рубрикувати, україномовний, контент, за, ключовий, слово}, X is the start symbol, and R is the set of production rules.
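A CFG of this kind can assign more than one structure to the same sentence. The sketch below counts the parse trees of the lemmatized example sentence with a CYK-style chart. The grammar rules and the lexicon are assumptions invented for the illustration (they are in Chomsky normal form, with lexical nouns allowed to act as N or as a complete NP); they are not the grammar used by the authors.

```python
from collections import defaultdict

# Hypothetical lexicon over the terminal set T: word -> possible categories.
LEXICON = {
    "система": {"N", "NP"}, "контент": {"N", "NP"}, "слово": {"N", "NP"},
    "рубрикувати": {"V"}, "україномовний": {"A"}, "ключовий": {"A"},
    "за": {"P"},
}

# Hypothetical binary production rules R: (head, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # prepositional phrase attached to the verb
    ("NP", "A", "N"),
    ("NP", "NP", "PP"),   # prepositional phrase attached to the noun
    ("PP", "P", "NP"),
]


def count_parses(tokens, start="S"):
    """Count the distinct parse trees of tokens (CYK chart with tree counts)."""
    n = len(tokens)
    chart = defaultdict(lambda: defaultdict(int))  # (i, j) -> category -> count
    for i, tok in enumerate(tokens):
        for cat in LEXICON.get(tok, ()):
            chart[(i, i + 1)][cat] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):          # split point
                for head, left, right in RULES:
                    l_count = chart[(i, k)].get(left, 0)
                    r_count = chart[(k, j)].get(right, 0)
                    if l_count and r_count:
                        chart[(i, j)][head] += l_count * r_count
    return chart[(0, n)].get(start, 0)


sentence = "система рубрикувати україномовний контент за ключовий слово".split()
n_trees = count_parses(sentence)
```

Under these assumed rules the sentence receives two trees, because the phrase за ключовий слово can attach either to the verb or to the noun контент.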
The disadvantage of using a CFG is the periodic appearance of ambiguity in SA, for example, in "Система рубрикує україномовний контент за ключовими словами / The system categorizes Ukrainian-language content by keywords" (Fig. 9). Examples of known SA systems for English-language texts are the Machinese Phrase Tagger [37] and VISL [38]. There is no available online information resource for the SA of Ukrainian-language texts. We will analyse the results of the SA of an English text with these resources, using the following pair of sentences as an example: "The train went up the track out of sight, around one of the hills of burnt timber. Nick sat down on the bundle of canvas and the baggage that I had got out of the door of the baggage car." The Machinese Phrase Tagger is a text analyser that produces base forms and phrase structures. It also recognizes part-of-speech classes (noun, adjective, verb, pronoun, etc.), generates a shallow syntactic marking of each phrase, and marks or brackets noun phrases (Fig. 10).
The Connexor Machinese Tokenizer is a set of program components that performs the basic tasks of text analysis at very high speed and provides relevant word information for high-volume applications. The Machinese Tokenizer splits the text into separate words and provides possible base forms and classes for the words (Fig. 11). The first column displays the position of a token in the text (counted in characters); the second column gives the length of the token; the third column contains the text form, and the other columns contain the base form(s) and the tag denoting the part of speech (PRON=pronoun, V=verb, DET=determiner, or N=noun). If a word has several readings, the analysis includes several columns of base forms and parts of speech.
The Ontology Matcher Demo uses metadata to identify ontology objects in the text (Fig. 12). The program matches the text against the concepts of the Finnish general ontology, which has approximately 28,000 notions in each language. The ontology notions found are given below the text as references. When the cursor is placed over a word, the notion to which the word refers appears. For the SA of Ukrainian-language texts, such information resources do not exist [39][40][41][42]. Moreover, the SA process itself is rather cumbersome for Ukrainian-language content [43][44][45][46]. Let us consider the example of the input sentence: "Він зробив це так незручно, що зачепив образок мого ангела, який висів на дубовій спинці ліжка, і що вбита муха впала мені прямо на голову" ("He did it so awkwardly that he touched the image of my angel that hung on the oak headboard of the bed, and the killed fly fell right onto my head").
An example of its SA using pre-syntax is shown in Fig. 15.

Results of studying the definition of stable word combinations when identifying keywords for text content
To isolate stable word combinations in the analysed texts and to conduct a comparative analysis, we use four methods: FREQ (frequency plus morphological patterns, that is, direct counting of the number of words) [52]; the t-test [53]; the χ² statistic [54]; and LR, the likelihood ratio [55]. A collocation is a word combination that has the features of a syntactically and semantically integral unit [56]. In it, the choice of one component is based on the context, and the choice of the other depends on the choice of the first [57]. For example, in ставити умови (to set conditions), the choice of the verb ставити (to set) is determined by tradition and depends on the noun умови; with the noun пропозицію (suggestion, proposal), another verb is used, вносити (to make). This concerns a limited (selective) combining of words: phraseologisms, idioms, proper names, and brand names. A collocation also usually includes components of toponyms, anthroponyms, and other frequently used naming conventions (for example, супермаркет «Метро» (Metro supermarket), завод «Електрон» (Electron factory)) [58]. Other names for the same phenomenon are stable (set) or phraseological units and N-grams. Examples of collocations are the following:
- грати роль, мати значення, впливати, справляти враження;
- засоби масової…, зброя масової…, вищий навчальний…;
- глибокий старець vs. поверхневий/мілкий невеликий юнак;
- міцний чай vs. сильний чай;
- кока-кола, Microsoft Windows;
- Гола Пристань, Володимир Волинський, Нью Йорк, Стів Джобс.
1. The FREQ method is a direct calculation of the frequency of using pairs (triples) of words. For example, FREQ for the sentence "В літературі описано декілька підходів до автоматичного виділення стійких словосполучень." → в літературі; літературі описано; описано декілька; декілька підходів; підходів до; до автоматичного; автоматичного виділення; виділення стійких; стійких словосполучень. Unfortunately, applying this method to large volumes of text produces information noise due to the high frequency of function words. The method therefore also requires taking into account the frequency of use and the morphological patterns of word combinations. Examples of morphological patterns in FREQ are as follows:
- A N: турецький гамбіт (Turkish gambit), перша похідна (first derivative), інформаційний ресурс (information resource);
- N N_G: контент аналіз (content analysis), баланс інтересів (balance of interests), контент-комерція (content commerce), контент моніторинг (content monitoring);
- N Pr N: трава у дворі (grass in the yard), дрова на траві (firewood on the grass).
2. The t-test method consists in checking statistical hypotheses and using the statistical model of MA:
- H0: the words occur together by chance.
The statistic is

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}},$$

where $\bar{x}$ is the empirical average, $\mu$ is the theoretical average, $s^2$ is the empirical variance, and N is the size of the empirical sample.
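The FREQ counting from item 1, together with the filtering by morphological patterns, can be sketched as follows. The part-of-speech table is a hypothetical hand-made lookup covering only the example sentence; a real system would take the tags from the MA stage.

```python
import re
from collections import Counter


def bigrams(text):
    """Count adjacent word pairs, never crossing a sentence boundary."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[^\W\d_]+", sentence)
        counts.update(zip(words, words[1:]))
    return counts


# Hypothetical part-of-speech lookup (A = adjective, N = noun, Pr = preposition).
POS = {
    "в": "Pr", "до": "Pr",
    "літературі": "N", "підходів": "N", "виділення": "N", "словосполучень": "N",
    "автоматичного": "A", "стійких": "A",
}

# Patterns accepted as candidate word combinations.
PATTERNS = {("A", "N"), ("N", "N")}


def filter_by_pattern(counts, pos=POS, patterns=PATTERNS):
    """Keep only the bigrams whose tag sequence matches an accepted pattern."""
    return Counter({pair: n for pair, n in counts.items()
                    if (pos.get(pair[0]), pos.get(pair[1])) in patterns})
```

For the example sentence the raw count yields the nine pairs listed in item 1, and the pattern filter keeps only the two adjective plus noun pairs, which removes the noise produced by function words.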
The method is not quite correct for language data, but it helps to get results in practice, for example, for the occurrence of the stable word combination контент аналіз (content analysis) in [14], with Р(контент)=28/1368 and Р(аналіз)=38/1368.
3. The χ² method is applied to 2×2 tables (Table 4). The calculations do not assume normality.
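Both statistics can be computed from four counts: the frequencies of the two words, the frequency of the pair, and the sample size. The sketch below uses the word frequencies 28 and 38 and the sample size 1368 quoted above; the pair frequency 18 is an assumption, taken from the value of c12 in the LR example. The variance of the t-test is approximated by the Bernoulli variance, and the χ² statistic uses the standard closed form for a 2×2 contingency table.

```python
import math


def t_score(c1, c2, c12, n):
    """t statistic for a bigram: observed pair probability vs. independence."""
    x_bar = c12 / n                 # empirical average: observed P(w1 w2)
    mu = (c1 / n) * (c2 / n)        # theoretical average under H0: P(w1) * P(w2)
    s2 = x_bar * (1 - x_bar)        # Bernoulli variance of the indicator
    return (x_bar - mu) / math.sqrt(s2 / n)


def chi_square(c1, c2, c12, n):
    """Pearson chi-square statistic for the 2x2 table of a bigram."""
    o11 = c12                       # w1 followed by w2
    o12 = c1 - c12                  # w1 followed by another word
    o21 = c2 - c12                  # another word followed by w2
    o22 = n - c1 - c2 + c12         # neither
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# контент аналіз: c1 = 28, c2 = 38, N = 1368; c12 = 18 is assumed (see above)
t = t_score(28, 38, 18, 1368)
chi2 = chi_square(28, 38, 18, 1368)
```

Both values are then compared with the critical values of the corresponding distributions (for example, 2.576 for the t-test at the 0.005 level and 3.841 for χ² with one degree of freedom at the 0.05 level), or simply used to rank candidate bigrams.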
4. The LR method is used to compare hypotheses, from which we obtain the likelihood ratio LR, for example, with $c_1 = 28$, $c_{12} = 18$ and $c_2 = 38$.
In order to choose the optimal statistical method for determining stable word combinations, it is necessary to analyse a Ukrainian-language text on the basis of word stems, without taking into account their inflexions. This greatly improves the accuracy of the result.
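One common formulation of the likelihood ratio for bigrams is Dunning's log-likelihood ratio, which compares the hypothesis that the second word is independent of the first with the hypothesis that it depends on it. The sketch below implements that textbook formulation; whether [55] uses exactly this form is an assumption. The counts are those from the example above, and N = 1368 is taken from the t-test example.

```python
import math


def log_l(k, n, x):
    """Log of the binomial likelihood x^k (1 - x)^(n - k), without the constant."""
    eps = 1e-12
    x = min(max(x, eps), 1 - eps)   # keep the logarithms finite
    return k * math.log(x) + (n - k) * math.log(1 - x)


def log_likelihood_ratio(c1, c2, c12, n):
    """Return -2 log(lambda) for a bigram with counts c1, c2, c12 in n tokens."""
    p = c2 / n                      # independence: P(w2 | w1) = P(w2 | not w1)
    p1 = c12 / c1                   # dependence: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)      # dependence: P(w2 | not w1)
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda


lr = log_likelihood_ratio(28, 38, 18, 1368)
```

The quantity -2 log λ is asymptotically χ²-distributed, so it can be compared with the same critical values as the χ² statistic or used directly for ranking.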

Discussion of the research results on identifying stable word combinations for keyword identification
An experiment on distinguishing terms was carried out on 3 technical articles [1][2][3] written in two languages, Ukrainian and English. The template for the experiment covered the methods labelled A to G, the last of which is χ² (G). The analysis of the 3 articles [1][2][3] was conducted in Ukrainian, and the results were translated into English (Tables 5, 6). The keywords in bold are those that occurred in the results of all the methods, the italicized keywords are those obtained only through methods B-G, and the underlined keywords are those obtained through methods A and C-G. While conducting the linguistic analysis for compiling alphanumeric dictionaries of two-word combinations, the following features and algorithms were used:
- bigrams were formed within the punctuation marks (if there was at least one punctuation mark between two words, these words were not considered a bigram);
- an alphanumeric dictionary of two-word combinations was formed on the basis of stems, that is, the bigrams контентний аналіз and контентного аналізу were considered one and the same bigram;
- in the analysis of the inflexions of the analysed words, verbs were not taken into account when forming the alphanumeric dictionary of bigrams (verbs were considered punctuation marks);
- before the linguistic analysis of the texts, all stop words (particles, adverbs, conjunctions) and pronouns (they were also considered punctuation marks) were excluded.
The statistical methods make it possible to take into account not only the frequency of pairs but also the frequency of the separate words that make up a pair. The methods differ in how they behave for different volumes of data and probability ranges: χ² is better than the t-test for larger p, where normality is violated, and for small volumes the likelihood ratio is better approximated by the χ² distribution than the statistic of the 2×2 tables is. The methods are often used not for the acceptance or rejection of hypotheses but for the ranking of candidate word combinations.
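The bigram formation rules listed above can be sketched as follows. Everything in the rule tables is a hypothetical toy: the stop-word and pronoun lists, the verb detector based on a few endings, and the crude stemmer stand in for the real dictionaries. The sketch also assumes one reading of the rules, namely that stop words are dropped, while verbs and pronouns break the sequence like punctuation marks.

```python
import re
from collections import Counter

STOP_WORDS = {"і", "та", "що", "на", "за", "в", "не"}     # hypothetical
PRONOUNS = {"він", "вона", "це", "ми"}                      # hypothetical
VERB_ENDINGS = ("ти", "ться", "ує", "ає", "ють")           # crude verb detector
INFLEXIONS = ("ого", "ими", "ами", "ий", "у", "а", "и")    # crude toy stemmer


def toy_stem(word):
    """Cut the longest known inflexion, keeping at least three letters."""
    for ending in sorted(INFLEXIONS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[:-len(ending)]
    return word


def stem_bigrams(text):
    """Count bigrams of stems inside runs bounded by punctuation, verbs, pronouns."""
    counts = Counter()
    for segment in re.split(r"[^\w\s'’-]+", text.lower()):   # split on punctuation
        run = []
        for word in re.findall(r"[^\W\d_]+", segment) + [None]:
            if word is not None and word in STOP_WORDS:
                continue                                      # stop words are dropped
            if word is None or word in PRONOUNS or word.endswith(VERB_ENDINGS):
                counts.update(zip(run, run[1:]))              # boundary: close the run
                run = []
            else:
                run.append(toy_stem(word))
    return counts
```

Because the dictionary is built on stems, контентний аналіз and контентного аналізу are counted as the same bigram.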
Table 5
The list of frequency indices for stable word combinations in articles [1][2][3]
Table 6
Differences of the methods according to the rating list of 100 stable word combinations
To compare the results, we used word2vec, a library developed at Google, which has proven itself as an alternative to TF×IDF (A1 in Table 7, following the template ['bigram', number of uses]). We also used the built-in methods for searching for word combinations in Python. However, for these datasets word2vec did not work effectively, because high-quality results require a huge corpus [58]. Its most interesting property is that, after each word of the corpus is mapped into a space whose dimension is specified by the user, the system supports operations such as ['king' + 'woman' - 'man' = 'queen']. After the mapping into a space of some dimension, each word becomes a vector, so words support basic operations such as addition, subtraction, and multiplication. In addition, we considered analysis through bigrams (A2 in Table 7) and skipgrams (A3 in Table 7). The results are better than those obtained through word2vec; the best option is to analyse skipgrams with a value of 3 and also to eliminate stop words in English (A4 in Table 7). However, these results are still far from the ones listed in Table 5. The outcome is worse because punctuation marks were not identified and stop words were used in the linguistic analysis as content units of speech.
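The skipgram variant can be sketched with the standard library alone. The sketch assumes that "a value of 3" means that up to three words may be skipped between the two members of a pair; the stop-word list is a hypothetical miniature, and the function name is invented for the example.

```python
from collections import Counter

# Hypothetical miniature list of English stop words.
EN_STOP_WORDS = {"the", "of", "a", "and", "on", "in", "to", "for"}


def skip_bigrams(tokens, max_skip=3):
    """Count ordered word pairs with at most max_skip words between them."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # the partner may be at distance 1 .. max_skip + 1 from the left word
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs[(left, tokens[j])] += 1
    return pairs


text = "analysis of stable word combinations"
tokens = [w for w in text.lower().split() if w not in EN_STOP_WORDS]
pairs = skip_bigrams(tokens, max_skip=3)
```

With `max_skip=0` the function reduces to ordinary bigram counting, which makes the two variants directly comparable.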

Conclusions
1. The study has developed a method for determining stable word combinations while identifying keywords of text content in standard passages of an author's text. For this purpose, the well-known statistical methods for determining stable word combinations when identifying keywords of text content were analysed. The factors influencing the quality of identifying stable word combinations were determined during the pre-linguistic elaboration of these texts. A comparative analysis of the corresponding methods was carried out on the basis of the obtained results. The developed method consists in using Zipf's law in the formation of stable word combinations as keywords, taking into account the following rules of preliminary linguistic processing of the text:
- all stop words are removed; bigrams are formed only within the limits of punctuation marks; the verb and the pronoun are considered punctuation marks;
- verbs are determined by their inflexions; bigrams are formed on the basis of stems without taking into account inflexions;
- adjectives are identified by their inflexions, and it is assumed that adjectives should occupy only the first place in the bigrams of Ukrainian texts.
This made it possible to take into account the peculiarities of constructing keywords in the Ukrainian language, regardless of the inflexions within the word combinations. Also, the results obtained were closer to the number of keywords identified by the authors. This increases the degree of relevancy of the analysed content by a factor of 1.4.
2. A program set has been developed to identify stable word combinations as keywords. An approach has been suggested for devising linguistic content analysis software to determine stable word combinations while identifying keywords of Ukrainian and English text-based content. The peculiarity of the approach is that the linguistic statistical analysis of lexical units is adapted to the peculiarities of Ukrainian-language and English-language words and texts. The developed information system, which is based on the identified stable word combinations, helps convey the analysed content more accurately, in accordance with the author's idea of it. This can produce a more accurate search result for the user and can better render the opinion of the author about the content under analysis.
3. The results of the experimental testing of the proposed method of content analysis of English and Ukrainian texts for determining stable word combinations when identifying the keywords of technical texts have been verified.
The developed method conveys the content of the analysed text by the identified keywords in the form of stable word combinations more accurately than other known resources. Further experimental research requires testing the proposed method for determining stable word combinations on other categories of texts: scientific, humanities, fiction, journalistic, etc.

Fig. 8. Definition of the word type by the inflexion form

Fig. 13. The structure of a tree on the VISL information resource

Table 2
A static table of common endings in the Ukrainian language

Table 4
An example of using the Pearson χ² method