Development of Methods for Pre-Clustering and Virtual Merging of Short Documents for Building Domain Dictionaries

The aim of this research is to improve the quality of domain dictionaries by expanding the corpus of the documents under study with short documents. A document model is proposed that makes it possible to define a short document and the need to combine it with other documents in order to highlight verbose terms. An algorithm for highlighting the substantive part of a document has been developed, since in a short document the heading and closing parts usually contain terms that are not related to the studied domain. A method for preliminary clustering of short documents to highlight verbose terms has been developed. The method is based on highlighting and counting occurrences of nouns (one-word terms) in all analyzed documents. The concept of document proximity is introduced, which is determined by the combination of two criteria: the relative number of matching terms and the relative frequency of occurrence of matching terms. The principle of grouping documents at the customer's site often does not correspond to the principles of grouping necessary for building a dictionary of the domain. In a short document, it is usually impossible to isolate a verbose term because the repetition of terms is very low. A method has been developed for the virtual combination of short documents based on the principle of achieving the necessary repeatability of one-word terms. The merged document has the highest possible frequency of terms for the cluster it belongs to. At the same time, the original text of the documents is preserved, along with the ability to associate a selected verbose term with the documents in which it occurs. The experiment made it possible to find the best ratio of the elements of the document proximity coefficient and confirmed the effectiveness of the proposed preliminary clustering method.


Introduction
When designing software products "to order", it becomes necessary to create a dictionary for a narrow domain (DSA). The DSA contains the terms of the domain and their interpretation. It allows the customer and the developer of the project to communicate in "the same language", which is necessary at the first stage of the project to determine the requirements for the software product [1]. Later, such a dictionary is used to create user interfaces and instructions, and in the development support process [2]. Automatic construction of domain-specific ontologies is a difficult task [3, 4], since it requires extracting domain-specific terms from the body of documents and assigning them the corresponding labels of domain concepts. On the other hand, the availability of a DSA makes it possible to solve many problems associated with document processing [5]. Usually, the terminology of the customer's domain is heterogeneous and specific, which does not allow using existing specialized explanatory dictionaries. The documents used in the customer's organization are examined as source material for the DSA. Until now, the selection of terms for a DSA has usually been performed by an expert in "manual mode", which is a very time-consuming process. When automating the process of building a DSA, the problem arises of highlighting terms in short documents. The low frequency of occurrence of some word or phrase in a document does not allow it to be considered a term; however, if the document were processed manually, the expert would probably include it in the DSA.

Literature review and problem statement
In [6] it is shown that to represent the domain, it is sufficient to use terms based on nouns. This approach is adopted in this article. The analysis of the effectiveness of various approaches to the selection of terms is carried out in [7]. It is shown that the rule-based approach is more accurate, but less versatile and more laborious. A hybrid approach can be a solution to the problem. In [8], using the examples of search engines, it is shown that deep parsing of a text is less effective than primitive analysis and subsequent statistical processing. A similar method of text processing was used in [9] to highlight keywords. However, the author limited himself to one-word terms, which greatly simplified the task but lowered the quality of the result. The work [10] is devoted to the study of the effectiveness of methods for extracting terms. However, the authors limited themselves to only one-word and two-word terms. At the same time, research has shown that terms can contain up to five or more words [11]. In [12], a general approach to the analysis of texts in natural language was proposed, but the problems of the practical implementation of procedures for extracting terms and determining their labor intensity remained unresolved. The most noteworthy is the method that combines statistical processing of texts with partial syntactic analysis [11]. The method is based on forming a possible term from a noun and the surrounding words. The implementation of the method is fraught with a number of problems. First, the corpus of documents to be analyzed is heterogeneous in terms of topics. Searching for terms based on the entire corpus can lead to errors both in the selection of terms and in their further interpretation. Second, a feature of the array of documents representing a certain organizational structure is the presence of a large number of short documents.
In the study [13], the analysis problems associated with the size of text documents are indicated, and it is proposed to use a classification method based on weight coefficients. Noteworthy is the proposal to split documents into three types: long, short, and very short, and terms in the thesaurus into unique, rare, and general. However, the work lacks a clear definition of a short document in terms of the number of errors that occur during its analysis. In addition, the corpus of the studied documents was limited to documents from the area of the executive and legislative powers, which made it possible to provide a certain dynamism of the headings, but the results could not be extended to other domains. The dependence of errors in the selection of verbose terms on the number of words in documents in Slavic languages is determined experimentally in [11] (Table 1). Here, the loss of a term means a one-time appearance of some noun in a document, since in this case there is no way to build a verbose term on its basis. In addition, short documents often have some formalized form [14], the elements of which can significantly affect the selection of terms and their frequency characteristics.
The interpretation of a term depends on the domain where it is used; therefore, to create a DSA, it is necessary to cluster the corpus of the documents under consideration. Since the search for verbose terms in short documents is associated with a large number of errors [13], in this study it is proposed to perform preliminary clustering at the first stage using one-word terms. Subsequently, it is required to perform virtual merging of short documents (creating a merged document while preserving its constituent documents), which will make it possible to highlight verbose terms and, if necessary, carry out clustering based on verbose terms. In [15], clustering of a corpus of short documents is considered. It has been established that "the short length of documents makes it difficult to infer the hidden distribution of topics," therefore the focus is not on quality but on the speed of clustering. In [16-18], various approaches to document clustering are considered, in particular the K-means, K*-means, EM, sGEM, LAR, HSTC and TCFS algorithms, and their efficiency is analyzed. It is shown that all methods are effective in clustering documents, but TCFS performs slightly better in terms of clustering quality, while the traditional K-means algorithm obtains clustering results quickly. At the same time, the conducted studies do not extend to short documents. The paper [19] shows the efficiency of the K-means algorithm with four clusters for partitioning unlabeled documents into groups. A proposal was made on the possible unification of documents. However, the use of the results obtained is limited to a library of dissertations and documents exclusively in pdf format.
The analysis of the literature allows the following unsolved problems to be identified:
- preliminary processing of short documents;
- preliminary clustering of documents before a full set of their features is obtained;
- virtual merging of short documents.

The aim and objectives of research
The aim of this research is to present short documents in a form that will make it possible to extract verbose terms from them. In this case, the document language can be any common European language for which a corresponding analyzer exists.
Based on the identified problems, the following research objectives are formulated:
- create a mathematical model of a short document;
- develop an algorithm for highlighting the content part of the document;
- develop a method for preliminary clustering of short documents;
- develop a method for the virtual combination of short documents;
- carry out an experimental study of the proposed methods.

Mathematical model of the document
Let's assume that the corpus of all documents under study is represented by the set

D = {d_i}, i = 1, …, nD. (1)

As a result of the preliminary analysis of the documents, at the stage of which nouns were selected, each document can be represented by the tuple

d_i = ⟨text_i, nw_i, Mt_i⟩, (2)

where text_i is the text of the document; nw_i is the document size in words; Mt_i is the set of one-word terms of the document d_i:

Mt_i = {(t_q, nn_q)}, q = 1, …, nn_i, (3)

where t_q is a term represented by a noun; nn_q is the number of occurrences of the term t_q in document d_i; nn_i is the number of different terms in the document.
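The model above can be sketched in code. A minimal illustration, assuming a trivial whitespace tokenizer and a precomputed noun list in place of the morphological analyzer the paper relies on:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Document:
    """The tuple <text, nw, Mt>: text, size in words, and the
    set of one-word terms with their occurrence counts."""
    text: str
    nw: int        # document size in words, nw_i
    Mt: Counter    # term t_q -> number of occurrences nn_q

def build_document(text, nouns):
    """Build the document tuple; `nouns` stands in for the output of a
    morphological analyzer that marks which tokens are nouns."""
    words = text.lower().split()
    terms = Counter(w for w in words if w in nouns)
    return Document(text=text, nw=len(words), Mt=terms)

doc = build_document("the order assigns the student a task and a task deadline",
                     nouns={"order", "student", "task", "deadline"})
# doc.nw == 11; doc.Mt["task"] == 2; len(doc.Mt) == 4 different terms
```

Here `len(doc.Mt)` corresponds to nn_i, the number of different terms in the document.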

Algorithm for highlighting the content of the document
Short documents often have a formalized structure (orders, statements, reports, etc.). For example, an order has the following components: 1) the full name of the institution where the order is issued; 2) the title of the document; 3) the date; 4) the number; 5) the name (title) of the order (what the order is about); 6) the main content; 7) the signature (position of the head, signature, initials, surname).
Consequently, short documents can contain a number of general terms that define the form, but not the content of the document (name of the organization, address, etc.). Such terms should be excluded from further analysis. In general, formalized documents in an organization can be in the same groups with other documents. If they are collected in separate groups, then this can somewhat simplify the preprocessing process.
It is proposed to present a formalized document in the form of three sections:
- the heading part h;
- the content part b;
- the final part (signatures) f.

The heading part contains a term (of one or more words) that defines the type of the document; it actually separates the title from the content. The number of document types is small, therefore it is proposed to compose a set M_b of types, in which a separate line corresponds to each type, for example, "Protocol", "Order", "Office memorandum", etc. A distinctive feature of the term representing a type is that it is written as a title in the center of the page. Therefore, the elements of the set are represented by combining the newline character and the text itself, which makes it possible to distinguish the document type "Statement" from the word "statement", which may occur in some phrase of the content section of the document.
The final part may include a list of the signatories of the document, indicating the position of each person, and possibly the words "signature", "approved", etc. It is proposed to create a set M_f of terms that could be included in the final part. The upper boundary of the final part of the document should be sought only if it was previously established that the document in question belongs to the group of formalized documents. For this, the occurrence of elements of the set M_f in the text of the document is determined, starting from the end of the document.
To reduce the processing time of the document, it is proposed to limit the search for the boundary of the heading part to K words, and of the final part to N words.
The content of the M b , M f sets and the values of K and N are determined on the basis of the accepted office work and are refined with the participation of a domain expert.
In the general case, an element of the sets M b , M f is a phrase that includes a different number of words.
The text of the document is checked for the occurrence of word combinations from M b , M f according to the following algorithm.
Let n_stb be the number of the word with which the content of document d_i begins; n_stf the number of the word from which the final part of the document begins; l the number of words in the studied phrase; j the position (in words) in the document at which the studied phrase is located.

1. n_stb = 0; n_stf = 0.
2. Define the text ts as a fragment from the beginning of the document with a length of K words.
3. m = 1, where w_m is the key word combination under study, w_m ∈ M_b.
4. Determine j as the result of the function of finding the position of the substring w_m in ts.
5. If j is non-empty, then n_stb = j. Go to step 7.
6. m = m + 1. If m ≤ |M_b|, then go to step 4.
7. Define the text tf as a fragment of length N words at the end of the document.
8. m = 1, where w_m ∈ M_f.
9. Determine j as the result of the function of finding the position of the substring w_m in tf.
10. If j is non-empty, then n_stf = j. Go to step 12.
11. m = m + 1. If m ≤ |M_f|, then go to step 9.
12. Mark up the document by marking the text from position 0 to (n_stb + l_b) and from n_stf to nw_i as comments that are not considered in the further analysis of the document d_i.
Since the concept of a "short document" is rather subjective until the frequency of the terms included in it is determined, extending the procedure for isolating the content part to documents that turn out not to be short will not lead to any negative consequences.

Method of preliminary clustering of short documents
The aim is to obtain clusters within which short documents can be combined automatically, which significantly reduces the time of the merging process. The inclusion of documents that are not short in the clusters does not affect the achievement of this goal. Other goals, for example, using the resulting clusters in the future to search for information in the corpus of documents, are not set, but are a "side" result of the method.

At this stage, the original language of the document does not actually matter.

1. Determining the proximity of documents
To decide whether documents d_i and d_j should be included in the same cluster, let's define the set of identical terms in these documents and their number n_i,j.

The number of identical terms n_i,j of documents d_i and d_j, referred to the total number of terms in the two documents, can serve as the first component of the 1st-order proximity coefficient:

Ks1_i,j = 2 n_i,j / (nn_i + nn_j). (4)

The second component, Ks2_i,j, is the relative frequency of occurrence of the matching terms (the ratio of the total number of occurrences of the matching terms to the total number of occurrences of all terms in the two documents). The ratio of the two components in the general formula for the 1st-order proximity coefficient Ks_i,j is not obvious, therefore let's introduce a certain factor γ, which will be determined experimentally:

Ks_i,j = γ Ks1_i,j + (1 − γ) Ks2_i,j. (5)

Let's introduce the concept of the minimum value of the first-order proximity coefficient, Ksmin. If we accept Ksmin = 0, then all documents will be combined into one cluster. If we take Ksmin = 1, then with γ = 0 each document will be a separate cluster. A number of experiments were carried out to determine the recommended values of Ksmin and γ.
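The 1st-order proximity coefficient can be sketched in code. The paper names the two components and the experimentally chosen weight γ; the exact normalizations used here are assumptions:

```python
from collections import Counter

def proximity_1st_order(Mt_i, Mt_j, gamma=0.5):
    """Ks_i,j = gamma * Ks1 + (1 - gamma) * Ks2.
    Ks1: relative number of matching terms;
    Ks2: relative frequency of occurrence of the matching terms."""
    common = set(Mt_i) & set(Mt_j)
    if not common:
        return 0.0
    Ks1 = 2 * len(common) / (len(Mt_i) + len(Mt_j))
    freq_common = sum(Mt_i[t] + Mt_j[t] for t in common)
    freq_all = sum(Mt_i.values()) + sum(Mt_j.values())
    Ks2 = freq_common / freq_all
    return gamma * Ks1 + (1 - gamma) * Ks2

a = Counter({"student": 3, "task": 2, "order": 1})
b = Counter({"student": 2, "exam": 1})
# common = {"student"}; Ks1 = 2*1/(3+2) = 0.4; Ks2 = (3+2)/(6+3) = 5/9
```

Documents whose Ks reaches Ksmin are candidates for the same cluster.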

2. Stages of the method of preliminary clustering of documents
There is an ordered set of documents D. Clusters are formed sequentially: a document is included in an existing cluster if its 1st-order proximity coefficient with the documents of that cluster is not less than Ksmin; otherwise it starts a new cluster; the algorithm ends when all documents have been considered. A number of documents that fell into a certain cluster could also fall into other clusters. It is possible that redistributing such documents will give a better clustering result than the initial one.
To improve the distribution of documents across clusters, let's introduce the concept of a "cluster core" in the form of a set of documents that can't be included in other clusters. For each core, let's find the document for which the relative total coefficient of proximity to the other documents of the core is maximal; this document will be called the central document of the cluster. Possible document movements between clusters are shown in Fig. 1.
Upon completion of the clustering process, cluster f contains only the core. However, in the process of redistributing documents between clusters, cluster f can be supplemented with documents from other clusters.

3. Algorithm for preliminary clustering
1. Perform preliminary processing of the documents.
2. Determine Ksmin and γ.
3. Form clusters using the pre-clustering algorithm.
4. For the last cluster, find the central document.
5. For each next cluster, in descending order of numbers, determine the core by excluding documents that may be included in previous clusters.
6. For all documents of the current cluster that are not included in its core, determine the possibility of transferring them to the previously considered clusters.
7. Repeat steps 5 and 6 until all clusters have been adjusted.
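Steps 1-3 can be sketched as a single greedy pass; the core computation and redistribution of documents (steps 4-7) are omitted here for brevity:

```python
def pre_cluster(docs, proximity, ks_min=0.35):
    """Greedy sketch of initial cluster formation: each document joins the
    first cluster whose seed (first) document is close enough, i.e.
    proximity >= ks_min; otherwise it opens a new cluster."""
    clusters = []                       # each cluster is a list of indices
    for i, d in enumerate(docs):
        for cluster in clusters:
            if proximity(docs[cluster[0]], d) >= ks_min:
                cluster.append(i)
                break
        else:
            clusters.append([i])        # no close cluster: start a new one
    return clusters
```

Any pairwise proximity function can be passed in, for example the 1st-order coefficient Ks with the experimentally chosen γ.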

Fig. 1. Possible movement of documents between clusters (clusters 1, …, f−3, f−2, f−1, f and their cores)

Method of virtual merging of short documents
In what follows, let's consider documents that belong to only one cluster, and short documents that have been pre-processed.
From Table 1 it follows that significant errors are possible if the document size is less than 5,000 words. Thus, let's consider a document d short if d.nw ≤ 5,000, where nw is the number of words in the document. Raising this threshold does not affect the quality of the merge result and only increases the overall document processing time. As mentioned earlier, it can be assumed that verbose terms formed on the basis of nouns that occur in a document only once will be lost. Therefore, it is of interest to form a set of unique nouns for each document.
Let's form for each document the set of its unique terms:

Mtu_i = {t_i,k | nn_i,k = 1}, (6)

where nn_i,k is the number of occurrences of the term t_i,k in the text of document d_i. Thus, the document will be represented by the following tuple:

d_i = ⟨text_i, nw_i, Mt_i, Mtu_i⟩. (7)

Based on (7), it is possible to clarify the concept of a short document and select the subset of short documents from the corpus of all documents:

Ds = {d_i ∈ D | |Mtu_i| / |Mt_i| ≥ Kc}. (8)

Here Kc defines the boundary value of the concept of a short document and can be established by an expert.
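Selecting short documents can be sketched as follows. The criterion comparing the share of unique (once-occurring) terms with the boundary value Kc is an assumption made for illustration; the paper only states that Kc is set by an expert:

```python
from collections import Counter

def unique_terms(Mt):
    """Terms that occur exactly once in the document."""
    return {t for t, n in Mt.items() if n == 1}

def short_documents(docs, Kc=0.05):
    """Select the documents whose share of unique terms is at least Kc."""
    return [Mt for Mt in docs if len(unique_terms(Mt)) / len(Mt) >= Kc]

d1 = Counter({"student": 5, "task": 4, "plan": 1})   # one unique term
d2 = Counter({"student": 5, "task": 4, "exam": 2})   # no unique terms
```

With Kc = 0.05, d1 (share 1/3) is selected as short and d2 (share 0) is not.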
By the virtual union of documents, let's mean a meta-document that is represented by a tuple of the form (7) obtained by combining the texts of two or more documents.
To combine the documents, let's introduce the 2nd-order proximity coefficient Ku_i,j of the document d_i to the document d_j. It is determined by two components:

Ku_i,j = δ Ku1_i,j + (1 − δ) Ku2_i,j. (9)

The coefficient Ku1_i,j determines the relative number of terms of document d_i that, after the documents are combined, will no longer be unique:

Ku1_i,j = |Mtu_i ∩ Mt_j| / |Mtu_i|, (10)

where Mtu_i is the set of unique terms of document d_i. The coefficient Ku2_i,j determines the relative frequency of occurrence of the matching terms. (11)

To determine the joint 2nd-order proximity of two documents, it is proposed to introduce the integrated coefficient

KP_i,j = Ku_i,j + Ku_j,i. (12)

Since the 2nd-order proximity coefficient of the document d_i to the document d_j differs from the 2nd-order proximity coefficient of the document d_j to the document d_i, the formula for KP_i,j gives a certain priority to the combination of short documents.
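The 2nd-order coefficients can be sketched in code. The definition of Ku1 (the share of d_i's unique terms that stop being unique after the union) follows the text; the Ku2 normalization and the default weight δ are assumptions, since the paper determines δ empirically:

```python
from collections import Counter

def unique_set(Mt):
    """Terms that occur exactly once in the document."""
    return {t for t, n in Mt.items() if n == 1}

def Ku(Mt_i, Mt_j, delta=0.7):
    """2nd-order proximity of d_i to d_j.
    Ku1: share of d_i's unique terms that become non-unique after merging;
    Ku2: relative frequency of occurrence of the matching terms."""
    u_i = unique_set(Mt_i)
    if not u_i:
        return 0.0
    Ku1 = len(u_i & set(Mt_j)) / len(u_i)
    common = set(Mt_i) & set(Mt_j)
    total = sum(Mt_i.values()) + sum(Mt_j.values())
    Ku2 = sum(Mt_i[t] + Mt_j[t] for t in common) / total if total else 0.0
    return delta * Ku1 + (1 - delta) * Ku2

def KP(Mt_i, Mt_j, delta=0.7):
    """Integrated coefficient: sum of the two directed 2nd-order proximities."""
    return Ku(Mt_i, Mt_j, delta) + Ku(Mt_j, Mt_i, delta)
```

Because Ku is directed, KP favors pairs where a short document loses many of its unique terms in the union.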
The method contains the following steps:

1) determine the total number of unique terms in all documents: n1 = Σ|Mtu_i|, where Mtu_i is the set of unique terms of document d_i;

2)-4) for each pair of documents d_i, d_j, compare their terms: if a term t_i,k from document d_i coincides with a term of document d_j, it is counted as matching, and the occurrence numbers of the matching terms are used to calculate the coefficients Ku_i,j and Ku_j,i;

5) form the set of unique terms Mt'v for a virtual document that unites all documents in the corpus. The number of such terms is n1v = |Mt'v|; (16)

6) construct a matrix of document proximity (Table 2). The rows and columns of the matrix are numbered in accordance with the numbering of the documents in the corpus. At the intersection of the i-th row and the j-th column, KP_i,j is written: the sum of the 2nd-order proximity coefficient of the document d_i to the document d_j and the 2nd-order proximity coefficient of the document d_j to the document d_i (12). There should be no empty cells in the matrix;

7) find the maximum element of the matrix and virtually combine the corresponding documents d_imax and d_jmax into the document d_i,j;

8) determine the current number of unique terms in the corpus, n1, taking the union into account. If n1 ≠ n1v, then the documents d_imax and d_jmax are removed from the proximity matrix, the document d_i,j is added, and the transition to step 6 is performed. Otherwise, the processing of the documents is completed.
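The merging loop can be sketched as follows. The pairwise proximity function kp is passed in, and the stopping rule n1 = n1v follows the method; representing each document only by its term Counter is a simplification:

```python
from collections import Counter

def n_unique(counter):
    """Number of terms occurring exactly once."""
    return sum(1 for n in counter.values() if n == 1)

def virtual_merge(docs, kp):
    """Greedy sketch of the merging loop: repeatedly combine the pair of
    documents with the maximum KP until the total number of unique terms
    n1 drops to n1v, the number of unique terms of an imaginary document
    uniting the whole corpus. docs: dict id -> term Counter; merging sums
    the Counters, and the merged id keeps both original ids so the source
    texts can still be associated with the result."""
    docs = dict(docs)
    n1v = n_unique(sum(docs.values(), Counter()))
    while len(docs) > 1:
        if sum(n_unique(c) for c in docs.values()) == n1v:
            break                       # n1 == n1v: nothing left to gain
        keys = list(docs)
        i, j = max(((keys[a], keys[b]) for a in range(len(keys))
                    for b in range(a + 1, len(keys))),
                   key=lambda p: kp(docs[p[0]], docs[p[1]]))
        docs[(i, j)] = docs.pop(i) + docs.pop(j)
    return docs
```

Since each merge reduces the number of documents by one, the loop always terminates, at the latest when a single meta-document remains.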

Experimental study of the proposed methods
To test the clustering method, the DocCluster program was created (Fig. 2). The source documents go to the preprocessing block, where the content part of each document is extracted and the text is converted to the format the parser works with.

Cognitive Dwarf (Russian and English, version for non-commercial use), MyStem (Russian, a very fast and compact analyzer), and the LanguageTool API (Ukrainian) were used as parsers. For the Cognitive Dwarf and MyStem analyzers, the preprocessing block was used to convert the original text of the documents into txt format.
In the noun extraction block, nouns are selected from the tables obtained at the output of the analyzer, and their occurrences in the document are counted. Thus, the internal form of the document is formed, represented by one-word terms and the numbers of their occurrences in the document (frequencies). Further processing of the documents does not depend on the language of the source documents.
In the pre-clustering block, in accordance with the given algorithm, the documents are distributed into clusters. To control the results, it is possible to see the resulting clusters in the form of groups of source documents.
In the block for virtual merging of documents, documents are grouped within clusters. The resulting groups can later be used to identify verbose terms.
To determine the values of Ksmin and γ, 22 short documents were selected from the university's field of activity (educational process, scientific work, personnel management and repair work). It was assumed that this set of documents should be grouped into 4 clusters. This was confirmed by the results of the analysis of the output of the noun extraction block. At the same time, a number of terms were identified that were included in different supposed clusters (student, teacher, audience, workload, task, etc.). The pre-clustering unit processed these documents 100 times for 10 Ksmin values and 10 γ values. The clustering results were estimated by the relative error

Rer = nDer / nD,

where nDer is the number of documents that did not fall into their "own" cluster; nD is the total number of documents. From the analysis of the graphs in Fig. 3 it follows that the values of Ksmin should be set in the range 0.3-0.4, and the values of γ in the range 0.3-0.6. These recommendations were accepted for further research.
To test the algorithm for the virtual combination of documents, the cluster of documents representing the educational process was expanded to 10 documents. After preliminary processing and identification of one-word terms, 8 documents fell into the category of short ones when the value Kc = 0.05 was set. The experiment showed that the second component of the proximity coefficient, Ku2_i,j, can improve the repeatability of terms in a combined document (the case was observed once) by combining documents that have more repeated terms than other pairs of documents. However, since priority should be given to the "exclusion" of unique terms, it was proposed to determine δ by a formula that gives greater weight to the first component. After the introduction of the recommended values of Ksmin, γ and δ, the quality of clustering and merging was experimentally tested on another corpus of documents. In total, the corpus includes 32 documents. The documents were selected reports from conferences on various topics ("Business Management in the Digital Economy" [20], "Strongly Correlated Two-Dimensional Systems: From Theory to Practice" [21], "Transport in the Integration Processes of the World Economy" [22], "Digital Transformation of Education" [23], "Ensuring Life Safety at the Present Stage of Development of Society" [24]).
As a result of clustering, 6 groups of documents were formed (Fig. 4).
The expert confirmed the number of clusters, but considered that one document fell into the wrong cluster. From the point of view of virtual merging of documents, there were no errors in the clusters. As a result of virtual merging, the number of unique terms in each cluster was reduced to the theoretically achievable minimum.

Discussion of the results of clustering and virtual merging of documents
The mathematical model of the document in accordance with formula (2) made it possible to further determine the coefficients of document proximity, which are the basis for clustering.
The results obtained in clustering the documents became possible, first, due to the extraction of the content part of the documents, which is especially important for short documents. According to the document model, the value of the nw_i component (2) is reduced, that is, words that can't be considered terms are excluded.
Secondly, for the clustering of documents, a proximity coefficient (5) was proposed, which contains two components. The first was determined by the relative number of overlapping terms, and the second was determined by the repeatability of these terms. Since the ratio of these components (γ) was not obvious, the recommended value was obtained experimentally (Fig. 3). The clustering method provides for a preliminary determination of the number of clusters, and then the refinement of their composition based on the calculation of nuclei for each cluster (Fig. 1).
The virtual merging of documents makes it possible to consider a group of documents as one document (meta-document) in terms of highlighting terms. The merging process is based on the proposed 2nd-order proximity factor (9), which gives priority to merging short documents. The virtual combination of documents, built on the principle of maximum "elimination" of unique terms in the combined document, made it possible to achieve the theoretically possible minimum of this indicator in the cluster. This can be verified by comparing the number of unique terms in an imaginary document that combines the entire cluster (16) with the number of unique terms in all combined documents. The proposed method makes it possible to expand the concept of a "unique term", supplementing it with cases when the term occurs in a document no more than a specified number of times. Some discrepancy between the results of manual and automated clustering observed during the experiments (one document out of 32 fell into the wrong cluster) is explained by the fact that the expert evaluated the result in terms of the semantics of the document, and not the frequency of occurrence of terms. Therefore, the expert's conclusions are only an advisory assessment.
The main purpose of the proposed method of preliminary clustering is to reduce the time for building DSA. However, it showed quite good results in solving the clustering problem for further information retrieval. In the future, it is planned to improve the quality of clustering by using verbose terms.
At this stage of the study, preliminary clustering was performed. In the future, it is planned to carry out clustering based on verbose terms, which will reduce the time needed to find the necessary information.

Conclusions
1. A mathematical model of the document has been developed, taking into account such characteristics as the number of words, a set of one-word terms, and their frequency. The model is needed for further clustering and possible merging of documents.
2. An algorithm for highlighting the substantive part of the document has been developed, which implies the removal of the individual structural components of the document that obviously do not contain words that can be attributed to terms of a narrow domain. A distinctive feature is the detection of the heading and closing parts of the document based on the introduced concept of a document type, determined by a set of keywords. The implementation of the algorithm can significantly reduce the time for preprocessing documents.
3. A method for preliminary clustering of short documents has been developed, which is distinguished by the use of a proximity coefficient that takes into account both the coinciding terms and their relative frequency. To improve the quality of grouping of documents, an iterative process of their distribution among clusters is provided. The method reduces the time and the number of errors compared with grouping documents manually.
4. A method has been developed for the virtual combination of short documents based on their closest proximity. The method is distinguished by the ability to reduce the number of unique terms in the document corpus to the theoretically achievable level.
5. An experiment was carried out to process 32 documents from 6 different domains. The documents were submitted to the program input in random order; as a result, the program identified 6 clusters, which confirmed the efficiency of the proposed methods and algorithms. The experimental studies also made it possible to refine the ratio of the components of the proximity coefficient: Ksmin in the range 0.3-0.4, γ in the range 0.3-0.6. The quality of clustering and virtual combining of documents allows the proposed methods to be used in the technology of creating a DSA, and also showed the prospects of their development based on the use of verbose terms.