Development of information technology of term extraction from documents in natural language
DOI:
https://doi.org/10.15587/1729-4061.2018.147978
Keywords:
domain dictionary, multi-word term, morphological analysis, mathematical model of the term, text document
Abstract
Domain dictionaries are widely used at various stages of the design and operation of software products. Developing such a dictionary, and term extraction in particular, is labor-intensive and requires a highly qualified expert. The study identifies the most important characteristics of multi-word terms (MWT): the probability that a document contains terms consisting of different numbers of words, the arrangement of nouns within an MWT, and the possible number of nouns in an MWT. The context in which terms are used is analyzed, and the possible boundaries of terms in the text are identified. A procedure for preliminary document grouping is proposed that avoids the “loss” of terms contained in short documents. The dependence of term extraction errors on the size of the analyzed document is determined.
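The abstract does not spell out the grouping procedure, so the following is only a minimal sketch of one way short documents could be merged before analysis. The function name `group_documents`, the `min_words` threshold, and the greedy merge-by-size strategy are all assumptions made for illustration, not the authors' method.

```python
from typing import List


def group_documents(docs: List[str], min_words: int = 1000) -> List[str]:
    """Greedily merge consecutive documents until each group reaches min_words.

    Hypothetical helper: the threshold and the greedy strategy are assumptions.
    """
    groups: List[str] = []
    current: List[str] = []
    current_size = 0
    for doc in docs:
        current.append(doc)
        current_size += len(doc.split())  # crude word count by whitespace
        if current_size >= min_words:
            groups.append("\n".join(current))
            current, current_size = [], 0
    if current:
        # Leftover short documents join the last group so none stays undersized.
        if groups:
            groups[-1] = groups[-1] + "\n" + "\n".join(current)
        else:
            groups.append("\n".join(current))
    return groups


if __name__ == "__main__":
    sample = ["a b c", "d e", "f g h i j", "k"]
    for group in group_documents(sample, min_words=5):
        print(repr(group))
```

Merging by size alone keeps the sketch short; a real procedure would probably also check that the merged documents belong to the same domain.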
A mathematical model of term representation is proposed, based on defining the set of word chains grouped around a head word, which is a noun. Chains are filtered according to the frequency of their occurrence in the text, based on a comparison of normalized representations of MWT.
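The idea of word chains around a noun, filtered by frequency, can be illustrated with a small sketch. It assumes the tokens have already been lemmatized and tagged by a morphological analyzer. The tag names `NOUN` and `ADJ`, the parameters `max_len` and `min_freq`, and the use of lowercased lemmas as the normalized representation are hypothetical choices, and the sketch does not reproduce the paper's actual model.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# (lemma, part-of-speech tag); assumed output of a morphological analyzer
Token = Tuple[str, str]

# Hypothetical set of tags allowed inside a candidate chain
TERM_TAGS = {"NOUN", "ADJ"}


def candidate_chains(tokens: List[Token], max_len: int = 4) -> Iterable[Tuple[str, ...]]:
    """Yield normalized chains of 2..max_len adjacent words that contain a noun."""
    n = len(tokens)
    for start in range(n):
        for end in range(start + 2, min(start + max_len, n) + 1):
            window = tokens[start:end]
            if any(tag not in TERM_TAGS for _, tag in window):
                break  # longer windows would contain the same disallowed word
            if any(tag == "NOUN" for _, tag in window):
                # Normalized representation: tuple of lowercased lemmas
                yield tuple(lemma.lower() for lemma, _ in window)


def extract_terms(tokens: List[Token], min_freq: int = 2, max_len: int = 4) -> Dict[Tuple[str, ...], int]:
    """Keep only the chains whose normalized form occurs at least min_freq times."""
    counts = Counter(candidate_chains(tokens, max_len))
    return {chain: freq for chain, freq in counts.items() if freq >= min_freq}


if __name__ == "__main__":
    tagged = [
        ("domain", "NOUN"), ("dictionary", "NOUN"), ("be", "VERB"), ("use", "VERB"),
        ("the", "DET"), ("domain", "NOUN"), ("dictionary", "NOUN"),
        ("store", "VERB"), ("term", "NOUN"),
    ]
    print(extract_terms(tagged))
```

Counting chains by lemma rather than by surface form lets inflected variants of the same phrase add to a single frequency, which is the point of comparing normalized representations.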
Mechanisms are developed for filling the domain dictionary with new records and adjusting existing ones while the input document is analyzed. A solution is proposed for adjusting the frequency of occurrence of terms based on the identification of inter-phrase relations. All processes and models are combined into a single information technology for constructing the domain dictionary. The problem of term interpretation is not considered in this paper, since it requires a separate solution. A software product that substantially automates term extraction from text documents has been developed. Testing of the proposed solutions showed no “lost terms” and, as a result, a reduction of 1.5 hours in the time needed to extract terms from texts of 10,000 words, achieved by freeing the expert from analyzing the original document. The research results can be used at various stages of the design and operation of software products.
License
Copyright (c) 2018 Oleksii Kungurtsev, Svetlana Zinovatnaya, Iana Potochniak, Maxim Kutasevych
This work is licensed under a Creative Commons Attribution 4.0 International License.