RESEARCH OF THE TEXT PROCESSING METHODS IN ORGANIZATION OF ELECTRONIC STORAGES OF INFORMATION OBJECTS
DOI:
https://doi.org/10.30837/ITSSI.2022.19.005
Keywords:
information system; parallelism; word processing; linguistic programming; library; acceleration; method
Abstract
The subject matter of the article is the electronic storage of information objects (IO), ordered by specified rules, at the stage of accumulating the qualification theses and scientific works that contributors submit to the proposed knowledge exchange system in different formats (text, graphic, audio). The classified works of the system's contributors form the basis for organizing thematic discussion rooms in order to disseminate scientific achievements, adopt new ideas, exchange knowledge, and find employers or mentors in different countries. The goal of the work is to study text processing and analysis libraries in order to speed up the classification of scanned text documents, and to increase its accuracy, when organizing a serialized electronic storage of information objects. The tasks are: to study text processing methods on the basis of the proposed generalized model of a scanned document classification system, with the location of the text processing and analysis block specified; to investigate the statistics of change in the execution time of the developed parallel modification of the methods of the word processing module, for a shared-memory system and for collections of text documents of different sizes; and to analyze the results. The methods used are: parallel digital sorting methods, methods of mathematical statistics, and linguistic methods of text analysis. The following results were obtained: a generalized model of the scanned document classification system was proposed, consisting of an image processing unit and a text processing unit, which together include the units for scanned image preprocessing, text detection, text preprocessing, compilation of the frequency dictionary, and text proximity detection. Conclusions: the proposed parallel modification of the text preprocessing unit gives a speedup of up to 3.998 times.
However, at a very high computational load (a collection of 18144 files, about 1100 MB), the resources of an ordinary multiprocessor computer with shared memory are clearly not enough to solve such problems in a mode close to real time.
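The text processing stages named in the abstract (text preprocessing, compilation of the frequency dictionary, text proximity detection) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the tokenization rule, the choice of cosine similarity as the proximity measure, and all function names are assumptions made for illustration. A thread pool is used only to show how documents can be processed independently of one another; in CPython a process pool or a native library would be needed to obtain a real speedup on CPU-bound work.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, deliberately tiny stop-word list; not taken from the article.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

# Assumption: a token is a run of lowercase Latin letters.
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(text):
    """Lowercase the text, split it into tokens, and drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def frequency_dictionary(text):
    """Map each remaining token to its number of occurrences."""
    return Counter(preprocess(text))


def cosine_proximity(freq_a, freq_b):
    """Cosine similarity between two frequency dictionaries."""
    dot = sum(count * freq_b[token] for token, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def build_dictionaries(texts, workers=4):
    """Build one frequency dictionary per document, one task per document."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(frequency_dictionary, texts))


# Usage: three short made-up documents standing in for recognized scans.
docs = [
    "Parallel sorting of text documents",
    "Sorting text documents in parallel",
    "Image processing of scanned pages",
]
dicts = build_dictionaries(docs)
close = cosine_proximity(dicts[0], dicts[1])    # same vocabulary
distant = cosine_proximity(dicts[0], dicts[2])  # no shared tokens
```

Because each document is handled without reference to the others, the per-document step is a natural unit of parallel work on a shared-memory machine, which is the setting the abstract describes.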
License
Copyright (c) 2022 Olesia Barkovska, Viktor Khomych, Oleksandr Nastenko
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.