METHODS AND MEANS OF INTELLIGENT ANALYSIS OF TEXT DOCUMENTS

Authors

D.O. Yakymenko, Ye.Yu. Kataieva

DOI:

https://doi.org/10.24025/2306-4412.2.2022.259408

Keywords:

keywords, text analysis, search, text documents, classifications

Abstract

The paper reviews methods for analyzing and processing electronic documents. It analyzes methods of text document analysis for determining the thematic affinity of texts and surveys existing approaches to the classification problem. The main approaches to text classification are described, the stages of the classification process are identified, and the most common methods of classifying text documents are considered. The main text pre-processing steps are examined: lowercasing, root correction, stemming, lemmatization, stop word removal, and normalization, together with the advantages and disadvantages of each. The procedure for reducing the dimensionality of a feature set is considered as two sub-processes: feature selection and feature extraction.
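As an illustration of the pre-processing steps named above, the following minimal Python sketch chains lowercasing, tokenization, stop word removal, and stemming. It is not taken from the paper: the stop word list, the suffix list, and the function names `naive_stem` and `preprocess` are assumptions made for this example, and the stemmer is a deliberately naive suffix stripper rather than an established algorithm such as Porter's.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in"}

# Hypothetical suffix list for a naive stemmer (not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on letters, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The methods of classifying text documents are considered"))
# prints ['method', 'classify', 'text', 'document', 'consider']
```

The order of the steps matters in this sketch: stop words are removed before stemming, so the stop word list has to contain surface forms rather than stems.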

Author Biographies

D.O. Yakymenko, Cherkasy State Technological University

Postgraduate student (applicant)

Ye.Yu. Kataieva, Cherkasy State Technological University

PhD, Associate Professor

Published

2022-06-27

How to Cite

Yakymenko, D., & Kataieva, Y. (2022). METHODS AND MEANS OF INTELLIGENT ANALYSIS OF TEXT DOCUMENTS. Bulletin of Cherkasy State Technological University, (2), 43–52. https://doi.org/10.24025/2306-4412.2.2022.259408
