Development of information technology of term extraction from documents in natural language

Oleksii Kungurtsev; Svetlana Zinovatnaya; Iana Potochniak; Maxim Kutasevych

doi:10.15587/1729-4061.2018.147978

Authors

Oleksii Kungurtsev, Odessa National Polytechnic University, Shevchenka ave., 1, Odessa, Ukraine, 65044, https://orcid.org/0000-0002-3207-7315
Svetlana Zinovatnaya, Odessa National Polytechnic University, Shevchenka ave., 1, Odessa, Ukraine, 65044, https://orcid.org/0000-0002-9190-6486
Iana Potochniak, Odessa National Polytechnic University, Shevchenka ave., 1, Odessa, Ukraine, 65044, https://orcid.org/0000-0003-1291-1146
Maxim Kutasevych, Odessa National Polytechnic University, Shevchenka ave., 1, Odessa, Ukraine, 65044, https://orcid.org/0000-0003-0059-4964

Keywords:

domain dictionary, multi-word term, morphological analysis, mathematical model of the term, text document

Abstract

Domain dictionaries are widely used at various stages of the design and operation of software products. Dictionary development, and term extraction in particular, is labor-intensive and requires a highly qualified expert. The study identifies the most important characteristics of multi-word terms (MWT): the probability that a document contains terms with a given number of words, the arrangement of nouns within an MWT, and the possible number of nouns in an MWT. The context in which terms are used is analyzed, and the possible boundaries of terms in the text are identified. A procedure for preliminary document grouping is proposed that avoids the “loss” of terms occurring in short documents. The dependence of term extraction errors on the size of the analyzed document is determined.
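The preliminary grouping of short documents can be illustrated with a minimal sketch. The abstract does not state the grouping criterion, so the greedy strategy, the function name `group_documents`, and the `min_words` threshold below are assumptions made only for illustration.

```python
def group_documents(docs, min_words=1000):
    """Greedily merge consecutive documents until each group has at least
    min_words words (assumed criterion, not taken from the paper)."""
    groups, current, size = [], [], 0
    for doc in docs:
        current.append(doc)
        size += len(doc.split())  # crude word count by whitespace
        if size >= min_words:
            groups.append(current)
            current, size = [], 0
    if current:
        # Leftover documents are too short on their own:
        # attach them to the last group if one exists.
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups
```

Each resulting group would then be analyzed as a single text, so that a term appearing once in each of several short documents can still reach a frequency threshold.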

A mathematical model of term representation is proposed, based on defining the set of word chains grouped around a head word, which is a noun. Chains are filtered by their frequency of occurrence in the text, using a comparison of normalized representations of MWT.
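The idea of building word chains around a noun and filtering them by frequency can be sketched as follows. This is not the authors' model: the input format (tokens already tagged with lemma and part of speech), the tag names `ADJ` and `NOUN`, the use of lemmas as the normalized representation, and the parameters `max_len` and `min_freq` are all assumptions for illustration.

```python
from collections import Counter

# Assumed part-of-speech tags that may occur inside a multi-word term.
TERM_POS = {"ADJ", "NOUN"}


def candidate_chains(tokens, max_len=4):
    """Yield normalized (lemma) chains of 2..max_len words containing a noun.

    tokens: iterable of (surface form, lemma, pos) triples (hypothetical format).
    """
    run = []
    # A sentinel token with a non-term tag flushes the final run.
    for tok in list(tokens) + [("", "", "END")]:
        if tok[2] in TERM_POS:
            run.append(tok)
            continue
        # Any other tag marks a term boundary: emit sub-chains of the run.
        for start in range(len(run)):
            for end in range(start + 2, min(start + max_len, len(run)) + 1):
                chain = run[start:end]
                if any(pos == "NOUN" for _, _, pos in chain):
                    yield tuple(lemma for _, lemma, _ in chain)
        run = []


def frequent_terms(tokens, min_freq=2, max_len=4):
    """Keep only chains whose normalized form occurs at least min_freq times."""
    counts = Counter(candidate_chains(tokens, max_len))
    return {chain: n for chain, n in counts.items() if n >= min_freq}
```

Comparing lemma tuples rather than surface forms means that "domain dictionaries" and "domain dictionary" are counted as the same candidate term.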

Mechanisms are developed for filling the domain dictionary with new records and adjusting existing ones while the input document is analyzed. A solution is proposed for adjusting the frequency of occurrence of terms based on the identification of inter-phrase relations. All processes and models are combined into a single information technology for constructing the domain dictionary. The problem of term interpretation is not considered in this paper, since it requires a separate solution. A software product that substantially automates term extraction from text documents is developed. Testing of the proposed solutions showed the absence of “lost terms” and, as a result, a reduction of 1.5 hours in the time needed to extract terms from texts of 10,000 words, because the expert no longer has to analyze the original document. The research results can be used at various stages of the design and operation of software products.
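Filling the dictionary with new records and adjusting existing ones can be pictured with a small sketch. The record layout (`frequency`, `documents`) and the function name `update_dictionary` are hypothetical; the abstract does not describe the actual data structure, and the inter-phrase frequency adjustment is not modeled here.

```python
def update_dictionary(dictionary, doc_terms, doc_id):
    """Merge the terms found in one document into the domain dictionary.

    dictionary: maps term -> {"frequency": int, "documents": list} (assumed layout)
    doc_terms:  maps term -> frequency of the term in the current document
    """
    for term, freq in doc_terms.items():
        # New terms get a fresh record; known terms are adjusted in place.
        record = dictionary.setdefault(term, {"frequency": 0, "documents": []})
        record["frequency"] += freq
        if doc_id not in record["documents"]:
            record["documents"].append(doc_id)
    return dictionary
```

Calling the function once per analyzed document accumulates frequencies across the whole collection while recording where each term was seen.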

Author Biographies

Oleksii Kungurtsev, Odessa National Polytechnic University Shevchenka ave., 1, Odessa, Ukraine, 65044

PhD, Associate Professor

Department of System Software

Svetlana Zinovatnaya, Odessa National Polytechnic University Shevchenka ave., 1, Odessa, Ukraine, 65044

PhD, Associate Professor

Department of System Software

Iana Potochniak, Odessa National Polytechnic University Shevchenka ave., 1, Odessa, Ukraine, 65044

Postgraduate student

Department of System Software

Maxim Kutasevych, Odessa National Polytechnic University Shevchenka ave., 1, Odessa, Ukraine, 65044

Department of System Software
