DEVELOPMENT OF INFORMATION TECHNOLOGY OF TERM EXTRACTION FROM DOCUMENTS IN NATURAL LANGUAGE


Introduction
Domain dictionaries (DD) are widely used in software design [1]. In particular, when determining the roles of

Literature review and problem statement
In [9], the LEXTER software package for term extraction is proposed. Of interest is that terms are formed on the basis of noun selection. Related words are determined by empirical rules, which limits the scope of the package to French. The statistical method of term extraction considered in [10] is applicable to Slavic languages. However, the authors solve the problem of term extraction in the context of document clustering and the search for contrasting terms, which leads to a large number of false terms. In [11], the task was set to extract not only individual keywords but also phrases united by frequency characteristics. However, the solution is intended for hierarchical document clustering, where not all terms have to be identified, and the extracted terms contain no more than two words. The extraction of key phrases [12], which can be used in the interpretation of terms in a DD, is also of interest. However, from the point of view of term extraction, a key phrase requires further analysis. In [7], a method of automated preliminary text grouping and term extraction by frequency characteristics and simplified syntax rules is proposed. This made it possible to extract terms of two and, partially, three words. However, term formation according to the "noun + adjectives" scheme did not allow extracting all multi-word terms, and a double pass through the documents reduced performance. In [13], deep syntactic and semantic analysis of documents in natural languages is presented. However, the proposed models are not formalized to a degree that allows using them in applied problems. An interesting solution that reduces the complexity of keyword extraction by organizing parallel computing is proposed in [14]. However, the proposed algorithm is applicable only to the extraction of single-word terms and loses effectiveness with a small number of documents, which is typical of narrow knowledge domains.
Thus, the task of term extraction for DD construction has a number of unsolved problems, namely:
- the study of the characteristics of terms that make it possible to formulate requirements for extracting them from the text (the number of words, the arrangement and type of head-words, limits);
- preliminary text grouping that ensures term detection in small documents;
- development of a technology that provides the extraction of terms containing an arbitrary number of words;
- development of a software product that implements the proposed technology and makes it possible to validate the decisions made.

The aim and objectives of the study
The aim of the study is to reduce the time and improve the quality of term extraction from documents in a narrow knowledge domain.
To achieve the aim, the following objectives were formulated:
- to determine the characteristics of terms affecting the technology of their extraction from the text;
- to develop an information technology of term extraction from the text, including preliminary document grouping;
- to develop a software product and estimate the quality of term extraction.

Characterization of multi-word terms
To automate the process of multi-word term (MWT) extraction, it was necessary to identify a number of MWT characteristics affecting the technology of this process. The defined characteristics use the concept of a "head-word", that is, a noun included in the MWT. To work with MWT, the following characteristics were required:
- the possible number of words included in the term;
- the arrangement of "head-words" in the MWT;
- the possible number of "head-words" in the MWT;
- the words and punctuation marks that limit the MWT.
The domain dictionary is used for the design and maintenance of the software product. Therefore, text documents from various fields of technology and applied sciences in Russian, Ukrainian and Belarusian were chosen for the study. For each knowledge domain, 200 terms were extracted.
Fig. 1 shows the average probability distribution of the number of words in the multi-word term. The scatter of values between particular knowledge domains did not exceed 1-2 %. Fig. 2 shows the results of the analysis of the arrangement of the head-word in the multi-word term. Nouns were chosen as head-words; for example, for the term "information system" the head-word is "system". If the term contains more than one noun, as, for example, in the term "relational databases" (in Russian, literally "relational bases of data"), each of them was assigned to the corresponding category ("bases" in the middle, "data" on the right). Fig. 3 shows the probability of occurrence of several head-words (nouns) in the MWT. From these data, it follows that the probability of occurrence of several nouns (head-words) in the term is high. Therefore, the method of term extraction by a noun and related adjectives [7] leads to large errors, while a deeper analysis of the connection of words requires complex syntax analysis. This confirms the need to look for a more efficient method of MWT extraction.

Fig. 3. Probability of occurrence of nouns in the MWT
Table 1 shows the results of determining possible MWT limits. The case in which a comma is included in the MWT (No. 2) turned out to be the only one out of 1,000 analyzed MWT ("decision makers").
In accordance with the results of the study, the following conclusions were made:
- a multi-word term can be represented by no more than five words;
- the arrangement and number of head-words in the multi-word term can be arbitrary;
- the sequence of words included in the multi-word term should be limited on the left and right by punctuation marks or pronouns.

Technology of term extraction from the text
The technology involves a number of stages:
- selection and grouping of documents reflecting the knowledge domain;
- document format conversion;
- morphological analysis of the text, extraction of nouns;
- determination of possible MWT based on head-words;
- identification of inter-phrase relations and replacement of references with terms;
- calculation of the number of MWT occurrences in the text;
- dictionary updating by searching for the occurrence of some terms in others.
The proposed technology is applicable to the most common European languages. For Slavic languages, because of case declension, gender agreement and rather complex rules of plural formation, a mechanism that uses the normalized form of term representation is additionally introduced for comparing terms.

1. Preliminary document grouping
In the process of knowledge domain (KD) analysis, the system analyst has to deal with a variety of documents in order to determine the requirements for the software product being developed. These documents may represent various aspects of the organization's activities. Term extraction from the entire set of documents as a whole can lead to an underestimation of those terms that are concentrated in separate small documents.
On the other hand, processing each document separately may not allow sufficient statistics to accumulate when some of the documents are small. To determine the influence of the document size on the quality of term extraction, a set of documents of different volume was studied. It was assumed that if a phrase occurred in the document only once, it would not be identified as a term by the automated method of term extraction. If the expert considers this phrase a term, this indicates a potential error in the automated term search. Based on the analysis of 100 documents with different numbers of words, the dependence of the probability of a term definition error on the size of the document was obtained (Fig. 4).
In [7], it is proposed to group documents on the basis of the normalized distance between them, which requires partial morphological analysis. In this paper, we introduce the concept of the volume $v_i$ of a document $D_i$, measured in words. If $v_i < 5{,}000$, then, in accordance with Fig. 4, we assume that the document $D_i$ should be combined with other documents into an integrated document $T_x$ so that the volume of the group is at least 5,000 words: $\sum_{D_i \in T_x} v_i \ge 5{,}000$. The selection of documents for the group can be assigned to an expert in the knowledge domain or to a systems analyst.
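The grouping rule can be illustrated with a minimal sketch. It assumes that document volumes are already known as word counts and that short documents are simply packed in input order; the function name `group_documents` and the greedy strategy are illustrative assumptions, since in the described technology the choice of documents for a group is left to an expert.

```python
from typing import Dict, List

MIN_GROUP_WORDS = 5000  # threshold taken from the text (see Fig. 4)


def group_documents(volumes: Dict[str, int],
                    min_words: int = MIN_GROUP_WORDS) -> List[List[str]]:
    """Greedily pack short documents into groups of at least min_words words.

    Assumption: documents are packed in input order; an expert could
    choose a thematically better grouping.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_volume = 0
    for name, volume in volumes.items():
        if volume >= min_words:
            groups.append([name])  # large documents are processed on their own
            continue
        current.append(name)
        current_volume += volume
        if current_volume >= min_words:
            groups.append(current)
            current, current_volume = [], 0
    if current:
        # Leftover short documents stay below the threshold;
        # they are returned as a separate group for the expert to review.
        groups.append(current)
    return groups


groups = group_documents({"a": 6000, "b": 2000, "c": 3500, "d": 1000})
```

In this example, document `a` is large enough to be processed alone, `b` and `c` together reach 5,500 words, and `d` remains an undersized group that would need the expert's attention.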

2. Document format conversion
Known text analyzers [15] accept documents in the .txt format as input; therefore, the corresponding conversion is required: $T_x \Rightarrow T_{txt}$. Such a conversion can be performed in any text editor.

3. Mathematical model of multi-word terms
The proposed model considers the single-word term as a special case of a multi-word term. As a result of processing the text $T_{txt}$, a list of terms should be obtained. We will call this list a dictionary because later on, when term interpretations are added to the list, it becomes a KD dictionary. At the stage of term extraction, the dictionary is represented as a set of records: $D = \{r_i\},\ i = 1, \ldots, n$. (1) Each record has the following form: $r = \langle tm, lsn, nf, q \rangle$, (2) where tm is the set of term representations, lsn is the list of head-words (nouns) included in the term in the normalized form, nf is the normalized representation of the term, and q is the number of occurrences of the term in the document. The normalized form of term representation is necessary only for Slavic languages. As an example, three sentences in Russian, English and French are given.
We work with relational databases.
We made changes to a relational database.
The relational database contains a set of tables.

Nous faisons les changements dans la base de données relationnelle
La base de données relationnelle contient la multitude de tableaux.
All sentences contain the term "relational database". However, the options of its representation in Russian differ significantly from one another, whereas in English and French these differences are minimal. Therefore, for comparing terms from texts in non-Slavic languages, fuzzy string matching can be successfully applied (for example, using the Levenshtein distance), while for texts in Slavic languages it is suggested to use the normalized form of term representation.
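To make the fuzzy comparison concrete, the following sketch computes the Levenshtein distance with the standard dynamic-programming recurrence and uses it to decide whether two strings denote the same term. The helper `same_term` and the threshold of 2 edits are illustrative assumptions and are not taken from the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def same_term(a: str, b: str, max_distance: int = 2) -> bool:
    """Hypothetical matching rule: terms are equal if they differ by few edits."""
    return levenshtein(a.lower(), b.lower()) <= max_distance


print(same_term("Relational database", "relational databases"))  # True
```

The English singular and plural forms differ by one edit, so a small threshold is enough. For inflected Slavic forms the distances are larger and less regular, which is why the text prefers exact comparison of normalized forms there.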
Representation of one term by a set tm of options is also used for Slavic languages, since it allows choosing the correct representation of a multi-word term containing several head-words at the end of the analysis. Each element of the set tm consists of the same sequence of words. The elements differ in the case and number of the corresponding words. Table 2 illustrates the use of a set of representation options of one term in Slavic languages. The normalized form of representation nf is the same for all representation options of the term. In [13], it is proposed to compare the content of texts by means of a special linguistic processor. Using the normalized form allows comparing multi-word terms by a very simple procedure of exact (non-fuzzy) string matching, which significantly reduces text processing time.
For non-Slavic languages, the set tm from (2) will contain one element (the term), and nf will be the same term.
The list of head-words lsn of the term, available once the dictionary D has been formed, allows choosing the most appropriate representation of the term from tm. According to the diagram in Fig. 1, a multi-word term can include up to 5 words. According to the diagram in Fig. 2, the head-word can take any position in the multi-word term. Therefore, it is proposed to form all possible groups of words relative to the head-word. In order to reduce the number of possible groups, the set B of types of left and right limits of the MWT is defined in accordance with Table 1: $B = \{\text{":"}, \text{";"}, \text{"."}, \text{"?"}, \text{"!"}, \text{"("}, \text{")"}, \ldots\}$. It is proposed to form possible terms as sequences of 5, 4, 3, 2 and 1 words containing at least one head-word. We first assume that the sequence includes only one head-word.
We represent a piece of the text S as a sequence of elements: $S = e_1, \ldots, e_l, \ldots, e_m$. (4)
An element can be a single word or a punctuation mark. Each word is represented by a sequence of letters W (directly from the text), the set of attributes A, and the normalized form of representation nf (the analyzer result): $e = \langle W, A, nf \rangle$. We define the attributes that will be needed to determine the MWT limits, as well as to take into account the inter-phrase relations [16]. Let $A_1$ represent the part of speech, $A_2$ the number, $A_3$ the gender, $A_4$ the person, and $A_5$ the case.
Punctuation marks are represented only by their spelling: $e = \langle W, \varnothing, * \rangle$. Let some element be the head-word $e_0 = \langle W, A, q \rangle$, where $A_1$ = noun and q is the number of occurrences of the term in the text S.
We formulate the rules for composing sequences of words:
- the sequence is formed of adjacent elements;
- the head-word must be included in the sequence;
- the number of elements in the sequence should be no more than 5 and no less than 1 (punctuation marks included in the sequence are not counted);
- the sequence is limited to the left or right of the head-word by an element $e_j$ of the sentence, provided that $e_j \in B$.
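Let some text contain a sequence of elements $e_{-4}, e_{-3}, e_{-2}, e_{-1}, e_0, e_1, e_2, e_3, e_4$, where $e_0$ is the head-word. Then the possible sequences of words (without limits) are all runs $e_{-l}, \ldots, e_0, \ldots, e_r$ with $l \ge 0$, $r \ge 0$ and $l + r \le 4$. Since a sequence of n words can be placed around the head-word in n ways, the number of possible combinations is $K = \sum_{n=1}^{5} n = 15$. (6) We now consider possible limits for combinations. Let some element $e_j \in B$, $j < 0$. Then all combinations including elements with indices $i \le j$ are excluded from further analysis. It follows from the rules above that the number of possible combinations under the left limit is $K_j = \sum_{l=0}^{|j|-1} (5 - l)$.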
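The formation of candidate sequences around one head-word can be sketched as follows. This is a simplified illustration and not the TermsSelect implementation: the function name `candidate_sequences` is hypothetical, only the legible punctuation limits are used, and punctuation inside a sequence (which the rules above allow without counting it) is not handled.

```python
from typing import List, Sequence

# Legible part of the limit set B; pronoun limits are omitted in this sketch.
LIMITS = {":", ";", ".", "?", "!", "(", ")"}
MAX_WORDS = 5


def candidate_sequences(elements: Sequence[str], head: int,
                        max_words: int = MAX_WORDS) -> List[List[str]]:
    """Contiguous runs of 1..max_words elements that contain elements[head]
    and do not cross a limit element."""
    # Extend to the nearest limits on both sides of the head-word.
    left = head
    while left > 0 and elements[left - 1] not in LIMITS:
        left -= 1
    right = head
    while right < len(elements) - 1 and elements[right + 1] not in LIMITS:
        right += 1
    result = []
    for start in range(max(left, head - max_words + 1), head + 1):
        for end in range(head, min(right, start + max_words - 1) + 1):
            result.append(list(elements[start:end + 1]))
    return result


tokens = ["the", "relational", "database", "contains", "tables", "."]
candidates = candidate_sequences(tokens, head=2)
```

With four free elements on each side of the head-word and no limits, the function returns 15 sequences. In the example above, the sentence start and the final period reduce this to 9 candidates, one of which is `["relational", "database"]`.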
Let us consider a more general case, when a group may include more than one head-word. Let some text have a sequence of elements $\ldots, e_j^*, \ldots, e_k^*, \ldots$, where $e_j^*$ and $e_k^*$ are head-words. Then, provided that $k - j \ge 5$, the formed word sequences contain one head-word each. If $k - j < 5$, then, by applying the previously described method of forming word sequences separately for the head-word $e_j^*$ and for the head-word $e_k^*$, we obtain repeated sequences: a sequence built around $e_j^*$ that also covers $e_k^*$ is built again around $e_k^*$. It will be shown below how to eliminate repeated word sequences in the dictionary. The number of possible word sequences in the presence of several head-words in a sequence depends on the number of head-words but cannot exceed K from (6) per head-word.

4. Inclusion of the word sequence in the dictionary
Each sequence of words E obtained after accounting for the limits should be represented by a record (2) in the dictionary (1). For this purpose, we determine its normalized form $E_{nf}$. We introduce the notation $E \in D$ for the belonging of a word sequence to the dictionary. If there is a record $r_i \in D$ such that $r_i.nf = E_{nf}$, then the combination of words is already present in the dictionary. In this case, we increase the number of occurrences by 1 ($r_i.q := r_i.q + 1$) and check the occurrence of E in $r_i.tm$; if E is absent, it is added to $r_i.tm$. Otherwise, a new record $r = \langle \{E\}, lsn, E_{nf}, 1 \rangle$ is added to the dictionary, where lsn contains all nouns (head-words selected at the stage of building sequences) from E in the normalized form.
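The inclusion step can be illustrated by a small sketch in which the dictionary is keyed by the normalized form. The class `Record` mirrors the fields tm, lsn, nf and q named in the text; the function `add_sequence` and the use of a Python dict are assumptions made for illustration, and the normalized form is expected to come from an external morphological analyzer.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Record:
    tm: Set[str]     # observed representations of the term
    lsn: List[str]   # head-words (nouns) in normalized form
    nf: str          # normalized representation of the term
    q: int = 1       # number of occurrences


def add_sequence(dictionary: Dict[str, Record], surface: str,
                 nf: str, head_words: List[str]) -> Record:
    """Register one occurrence of a word sequence in the dictionary."""
    record = dictionary.get(nf)
    if record is None:
        # First occurrence: create a new record for this normalized form.
        record = Record(tm={surface}, lsn=list(head_words), nf=nf, q=1)
        dictionary[nf] = record
    else:
        # Known term: count the occurrence and remember the new representation.
        record.q += 1
        record.tm.add(surface)
    return record


d: Dict[str, Record] = {}
add_sequence(d, "relational databases", "relational database", ["database"])
add_sequence(d, "relational database", "relational database", ["database"])
```

After the two calls the dictionary holds a single record with `q == 2` and two representation options. Keying by the normalized form also means that a sequence generated twice from two different head-words maps to the same record, although its count would still have to be corrected.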

5. Accounting of inter-phrase relations
The main criterion of term selection is the frequency of occurrence in the analyzed text. Inter-phrase relations occur if a term is replaced in subsequent sentences with a pronoun, an ordinal number, etc. For example, in the sentence "Hard drive is the main data storage device for the majority of personal computers.", the phrase "Hard drive" can be defined as a term. In the next sentence, "Usually it is characterized by capacity and speed.", the term "Hard drive" is replaced with the pronoun "it". If the relation between the sentences is not found, then only one occurrence of the term "Hard drive" will be counted. In the present study, we used the results obtained in [16], where algorithms for identifying inter-phrase relations are presented. Let some element of the sentence $e_i$ be an anaphor (a replacement or reference) of a previously found term stored in the record $r_i$; then the number of occurrences of this term in the text should be increased: $r_i.q := r_i.q + 1$.

6. Updating of the dictionary
For each term, it is necessary to introduce the lower limit Be of the number of term occurrences in the text: $r_i.q \ge Be$. The minimum value of the lower limit is Be=2. With this value of Be, a sequence of words extracted in accordance with (8) is kept only if it occurred in the text repeatedly. For large texts, the value of Be can be increased. It is recommended to entrust this operation to an expert in the knowledge domain. While gradually increasing Be, the expert should fix the largest value at which all important terms of the given knowledge domain still remain in the dictionary.
As a result of the analysis of the document, terms that are included in other terms may occur. Whether such terms are kept in the dictionary or excluded from it depends on their independent use in the text. The procedure of dictionary updating provides a comparison of records. If $r_i.nf \subset r_j.nf \wedge r_i.q = r_j.q$, the record $r_i$ is excluded from the dictionary. If $r_i.nf \subset r_j.nf \wedge r_i.q > r_j.q$, then it is necessary to analyze $\Delta = r_i.q - r_j.q$. If $\Delta \ge Be$, then the record $r_i$ is not excluded from the dictionary. After determining the terms to be included in the dictionary, it is necessary to choose one of the representation options of each term in the set tm. For this purpose, we introduce the concept of the "main word" of the term. A number of signs distinguish it from other words:
- it must be a noun (mandatory);
- it usually ranks first among the nouns in the term;
- its spelling options (case and number variations) usually define the various options of term representation in tm.
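The updating procedure can be sketched as follows, assuming the dictionary is reduced to a mapping from the normalized form to the number of occurrences. The function names are hypothetical, and the handling of a term nested in several longer terms (it must pass the check against each of them) is an assumption, since the text describes only the pairwise comparison.

```python
from typing import Dict, List


def contains(longer: List[str], shorter: List[str]) -> bool:
    """True if `shorter` occurs in `longer` as a contiguous run of words."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def prune_dictionary(counts: Dict[str, int], be: int = 2) -> Dict[str, int]:
    """Drop rare sequences and nested terms that are not used independently."""
    # Step 1: apply the lower limit Be to the number of occurrences.
    kept = {nf: q for nf, q in counts.items() if q >= be}
    result = {}
    for nf, q in kept.items():
        words = nf.split()
        # Counts of longer terms that include this term.
        containing = [q_other for other, q_other in kept.items()
                      if other != nf and contains(other.split(), words)]
        # Step 2: keep a nested term only if it occurs on its own
        # at least `be` times more often than each containing term.
        if all(q - q_other >= be for q_other in containing):
            result[nf] = q
    return result


counts = {"relational database": 6, "database": 6,
          "database table": 2, "table": 7, "set of": 1}
print(prune_dictionary(counts))
# {'relational database': 6, 'database table': 2, 'table': 7}
```

Here "set of" is removed by the lower limit, "database" is removed because it never occurs outside "relational database", and "table" is kept because it occurs five more times than "database table".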
Thus, the process of choosing an option of term representation involves the following sequence of actions.
We determine the number of elements of the set tm. If $|tm| = 1$, then there is only one option of term representation in the dictionary.
If $|tm| = k \wedge k > 1$, then the number of head-words in the list lsn is determined.
If $|lsn| = 1$, then there is only one head-word $w_1 \in lsn$ in the term. Its position j in the options of term representation is determined from the position of this word in r.nf: $w_j = w_1,\ w_j \in r.nf$. (10) Then, the word $w_{i,j}$ in position j is selected from each representation option of the term $tm_i \in tm$ and compared with the normalized representation. If $w_{i,j} = w_1$, (11) then all elements except $tm_i$ are removed from the set tm, that is, $tm = \{tm_i\}$.
If condition (11) is not met, then (12) applies, and the problem of formulating the term definition is solved by an expert. If $|lsn| = l \wedge l > 1$, then there are several head-words in the definition of the term. For each head-word $w_p \in lsn$, $p = 1, \ldots, l$, its position j in the options of term representation is determined in accordance with (10).
Further, from each representation option of the term $tm_i \in tm$, the word $w_{i,j}$ in position j is selected and compared with the normalized representation.
If $w_{i,j} = w_p$, then all elements except $tm_i$ are removed from the set tm, that is, $tm = \{tm_i\}$, and the cycle of searching for the best option of term representation is completed. Otherwise, $p = p + 1$ and the cycle continues. If the best representation is not found, then the decision is made in accordance with (12) and the expert solves the problem of formulating the term definition.
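The selection cycle can be sketched as follows. The sketch assumes that every representation option and the normalized form are space-separated strings with the same word order; the function name `choose_representation` is hypothetical, and returning `None` stands for handing the decision over to the expert.

```python
from typing import List, Optional


def choose_representation(tm: List[str], lsn: List[str], nf: str) -> Optional[str]:
    """Pick the option whose head-word coincides with its normalized form."""
    if len(tm) == 1:
        return tm[0]                  # only one option: nothing to choose
    nf_words = nf.split()
    for head in lsn:                  # head-words are tried in list order
        if head not in nf_words:
            continue
        j = nf_words.index(head)      # position of the head-word in the term
        for option in tm:
            words = option.split()
            if j < len(words) and words[j].lower() == head.lower():
                return option         # head-word is already in normalized form
    return None                       # no automatic choice: left to the expert


best = choose_representation(
    ["relational databases", "relational database"],
    ["database"],
    "relational database",
)
print(best)  # relational database
```

For Slavic languages the same comparison selects the option in which the main noun stands in the nominative singular, while the accompanying adjectives keep the agreement they had in the text, which is the reason for keeping whole representation options rather than rebuilding the term from normalized words.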

Development of the software product and assessment of the quality of term extraction
To implement the proposed technology and models, the TermsSelect software product was developed. The scheme of document processing is presented in Fig. 5.
Fig. 6 presents the window that allows the expert to edit the list of terms found in the text. The terms were obtained from the analysis of 15 texts on the subject "Materials and technologies of ceramics production" with a total volume of about 20,000 words [17]. The terms are ordered by the first noun. The content of the first column, "Term", can be edited. In addition, the expert can remove a row from the table or enter a new term.
The purpose of testing the software product was a comparative assessment of the new and previously existing technologies by time characteristics and quality of term extraction. Quality was understood as the percentage of errors of the first kind ("excess" terms) and of the second kind ("lost" terms) in the total number of terms found.
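The quality measure can be written down as a short calculation. The sketch assumes that the expert's list of terms is available as a reference set; the function name `extraction_errors` and the sample terms are purely illustrative and are not data from the experiments.

```python
from typing import Set, Tuple


def extraction_errors(found: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Return (excess %, lost %) relative to the number of terms found."""
    if not found:
        raise ValueError("no terms were found")
    excess = found - reference   # first kind: extracted, but not real terms
    lost = reference - found     # second kind: real terms that were missed
    total = len(found)
    return 100.0 * len(excess) / total, 100.0 * len(lost) / total


found = {"hard drive", "data storage device", "personal computer", "is the main"}
reference = {"hard drive", "data storage device", "personal computer", "capacity"}
print(extraction_errors(found, reference))  # (25.0, 25.0)
```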
To test the proposed technology and software product, texts from various fields of science and technology were used. The experiments showed that, with TermsSelect, the average time of term extraction from a document of 10,000 words was 15.6 seconds. Timing the expert's "manual" work on extracting the terms and their frequency characteristics gave a result of about 10 hours. The simplified task, term extraction only, was performed by the expert within 1.5 hours. During term extraction with the program, "excess terms" were found. They made up about 5 % of the extracted terms. At the same time, no "lost terms" were found. It should be noted that the removal of "excess terms" does not require a special procedure, since in all cases the list of extracted terms is reviewed by the expert.
For comparison, the DictionaryCreator software product proposed in [7] was tested. Here, the time of term extraction was 12.4 seconds for a document of 10,000 words. However, the share of "lost terms" was 22 % (mostly terms of three or more words). Finding "lost terms" is a very labor-intensive procedure that can only be performed manually. Thus, with an insignificant increase in text processing time, a significant improvement in the quality of extraction of multi-word terms was obtained.

Discussion of the results of research on the speed and quality of term extraction
A significant reduction in the number of "lost terms" at a high speed of processing of source texts is explained by two main solutions:
- the proposed method of forming potential terms as all admissible chains of words located near the head-words;
- preliminary grouping of short documents.
Representation of a term as a set of word chains allows defining terms as the subset of chains that are repeated in the text. Such a principle can be used for the majority of natural languages and requires only morphological analysis. Grouping short documents for the period of analysis makes it possible to find terms that occur only once in a single document. The proposed solution requires the expert only to edit the terms included in the dictionary. The existing solutions for determining the frequencies of single words are characterized by high speed but leave the expert a lot of work related to the analysis of source texts. The methods that extract a term as a noun with related adjectives do not cover the whole variety of terms. The speed of such methods is commensurate with that of the method proposed in this work; however, the large number of "lost terms" also requires the expert to work with the source text.
The studies were limited to Slavic and the most common European languages, for which the concept of the head-word can be introduced. They cannot be applied, for example, to Vietnamese, Chinese and similar languages.
The disadvantages of the study include the representation of the extracted term in a form that in some cases requires editing by an expert. Attempts to present a multi-word term in its final form without the use of known labor-intensive methods of text generation have so far failed.
In addition to the generally accepted concept of a term, texts representing a narrow knowledge domain can use specific abbreviations and names (of programs, processes, machines, etc.), which can also be regarded as terms. Extraction of such terms requires the formalization of the concepts of "abbreviation" and "name" and is a continuation of this study. The domain dictionary should contain an interpretation of terms, which is currently performed manually. Automation of this process involves the search for relevant sources of information and the selection of suitable pieces of text. This problem also requires further research.

Conclusions
1. The characteristics of multi-word terms are determined: the number of words included, the possible number and arrangement of nouns in the term, as well as possible limiters of the chain of words included in the term. The results of the study are needed to construct a mathematical model of the term.
2. The information technology of term extraction from text documents is developed. It includes document grouping, a mathematical model of the term that allows extracting it from the sentence, and adjustment of the frequency of terms based on the identification of inter-phrase relations and the occurrence of some terms in others. The technology allows term extraction without a detailed syntax analysis of the sentence, which significantly reduces the processing time of the document.
3. The TermsSelect software product implementing the proposed technology is developed. Text documents in any standard format were supplied as input. To identify parts of speech and obtain the normalized form of word representation, freely available plug-in text analyzers were used. The maximum length of the word chain was set to five. The expert's task was only the editing of terms. The analogue was the earlier developed DictionaryCreator software product, which extracts terms as nouns and syntactically related adjectives. Comparative tests of the products on the same texts showed that, with almost the same time spent on text processing, TermsSelect found all the terms, whereas DictionaryCreator found 78 % of the terms. The search for "lost terms" was estimated at 1.5 hours of the expert's work. Thus, the achieved improvement in the quality of term extraction significantly reduced the total time of term extraction.

Fig. 1. Probabilities of occurrence of the term containing one or more words

Fig. 2. Probabilities of arrangement of the head-word in the multi-word term

Fig. 4. Dependence of the term extraction error on the document size in words

Table 1
Possible limits of MWT inclusion in the text

Table 2
Representation options of the multi-word term