METHODS AND MEANS OF INTELLIGENT ANALYSIS OF TEXT DOCUMENTS
DOI:
https://doi.org/10.24025/2306-4412.2.2022.259408
Keywords:
keywords, text analysis, search, text documents, classifications
Abstract
The paper reviews methods for analyzing and processing electronic documents, with a focus on text analysis methods for determining the thematic affinity of texts. It surveys existing approaches to the classification problem, describes the main approaches used in text classification, defines the stages of the classification process, and considers the most common methods of classifying text documents. The main text pre-processing steps are examined, namely lowercasing, root correction, stemming, lemmatization, stop-word removal, and normalization, together with the advantages and disadvantages of each. Dimensionality reduction of the feature set is also considered, divided into two sub-processes: feature selection and feature extraction.
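As a rough illustration of the pre-processing and feature selection steps named in the abstract, the following minimal Python sketch chains lowercasing, stop-word removal, and stemming, then keeps only terms that occur in enough documents. It is not the authors' implementation. The stop-word list, the suffix list, the naive stemmer, the `min_df` threshold, and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; real systems use larger ones.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "are"}

# Assumed suffix list for a naive stemmer (this is NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on ASCII letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def select_features(docs: list[list[str]], min_df: int = 2) -> list[str]:
    """Feature selection by document frequency: keep terms found in at
    least `min_df` documents (hypothetical threshold)."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(term for term, count in doc_freq.items() if count >= min_df)


# Hypothetical two-document corpus.
docs = [preprocess(t) for t in (
    "The methods of classifying text documents",
    "Classifying documents is a text analysis task",
)]
vocabulary = select_features(docs)
print(vocabulary)
```

The naive stemmer over-strips some words (for example, "analysis" becomes "analysi"), which is the kind of trade-off between stemming and lemmatization that the paper weighs.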
License
Copyright (c) 2022 D.O. Yakymenko, Ye.Yu. Kataieva

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The authors who publish in this journal agree to the following terms:
The authors reserve the right to authorship of their work and give the journal the right to first publish this work under the terms of the Creative Commons Attribution License CC BY-NC, which allows other persons to freely distribute published work with a mandatory reference to authors of the original work and the first publication of the work in this journal.
Authors have the right to conclude separate additional agreements for the non-exclusive distribution of the paper in the form in which it was published by this journal (for example, posting work in electronic repository or publishing as part of a monograph), provided that the link to the first publication in this journal is maintained.
The journal policy allows and encourages authors to post on the Internet (for example, in repositories of institutions or on personal websites) the manuscript of work, both before the submission of this manuscript to the editorial staff, and during its editorial work, as it contributes to the emergence of productive scientific discussion and positively affects the efficiency and dynamics of published work citation (see The Effect of Open Access).