RESEARCH OF THE TEXT PROCESSING METHODS IN ORGANIZATION OF ELECTRONIC STORAGES OF INFORMATION OBJECTS

Authors

  • Olesia Barkovska Kharkiv National University of Radio Electronics, Ukraine https://orcid.org/0000-0001-7496-4353
  • Viktor Khomych Kharkiv National University of Radio Electronics, Ukraine
  • Oleksandr Nastenko Kharkiv National University of Radio Electronics, Ukraine

DOI:

https://doi.org/10.30837/ITSSI.2022.19.005

Keywords:

information system; parallelism; word processing; linguistic programming; library; acceleration; method

Abstract

The subject matter of the article is the electronic storage of information objects (IO), ordered by specified rules, at the stage of accumulating the qualification theses and scientific works that contributors submit to the proposed knowledge exchange system in different formats (text, graphic, audio). The classified works of the system's contributors are the basis for organizing thematic discussion rooms in which to disseminate scientific achievements, adopt new ideas, exchange knowledge, and look for employers or mentors in different countries. The goal of the work is to study libraries for text processing and analysis in order to speed up and improve the accuracy of classifying scanned text documents when organizing a serialized electronic storage of information objects. The tasks are the following: to study text processing methods on the basis of the proposed generalized model of the scanned document classification system, with the location of the text processing and analysis block specified; to investigate how the execution time of the developed parallel modification of the word processing module's methods changes on a shared-memory system for collections of text documents of different sizes; and to analyze the results. The methods used are the following: parallel digital sorting methods, methods of mathematical statistics, and linguistic methods of text analysis. The following results were obtained: a generalized model of the scanned document classification system was proposed, consisting of an image processing unit and a text processing unit, which together include the following units: preliminary processing of the scanned image; text detection; preliminary text processing; compilation of the frequency dictionary; and text proximity detection. Conclusions: the proposed parallel modification of the preliminary text processing unit gives a speedup of up to 3.998 times. However, at a very high computational load (a collection of 18144 files, about 1100 MB), the resources of an ordinary shared-memory multiprocessor computer are clearly not enough to solve such problems in a mode close to real time.
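The text processing stages named in the abstract (preliminary text processing, compilation of the frequency dictionary, text proximity detection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the tiny stop-word list, and the choice of cosine similarity as the proximity measure are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use fuller lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def preprocess(text):
    """Preliminary text processing: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def frequency_dictionary(tokens):
    """Frequency dictionary: map each token to its number of occurrences."""
    return Counter(tokens)


def cosine_proximity(freq_a, freq_b):
    """Cosine similarity of two frequency dictionaries (assumed proximity measure)."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[t] * freq_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical sample documents standing in for recognized scanned texts.
docs = [
    "Parallel sorting of text documents in shared memory",
    "Sorting text documents with parallel methods",
    "Image processing of scanned pages",
]
freqs = [frequency_dictionary(preprocess(d)) for d in docs]
```

Because each document is preprocessed independently of the others, the per-document step is a natural candidate for distribution across the cores of a shared-memory machine, which is the kind of parallel modification the abstract evaluates.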

Author Biographies

Olesia Barkovska, Kharkiv National University of Radio Electronics

Ph.D. (Engineering Sciences), Docent

Viktor Khomych, Kharkiv National University of Radio Electronics

Student

Oleksandr Nastenko, Kharkiv National University of Radio Electronics

Student

Published

2022-03-31

How to Cite

Barkovska, O., Khomych, V., & Nastenko, O. (2022). RESEARCH OF THE TEXT PROCESSING METHODS IN ORGANIZATION OF ELECTRONIC STORAGES OF INFORMATION OBJECTS. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, 1(19), 5–12. https://doi.org/10.30837/ITSSI.2022.19.005