A comparative analysis of text data classification accuracy and speed using neural networks, Bloom filter and naive Bayes

Olena Hryshchenko; Vadym Yaremenko

doi:10.15587/2706-5448.2021.237767

Authors

Olena Hryshchenko National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine https://orcid.org/0000-0001-6888-8665
Vadym Yaremenko National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine https://orcid.org/0000-0001-8557-6938

Keywords:

text data classification, Bloom filter, naive Bayes, neural network, classification time and accuracy

Abstract

The object of this research is fast methods for classifying text data. The study is motivated by the rapid growth of textual data in both digital and printed form. Such data must be processed by software, since people cannot process this volume in full.

Many data classification approaches have been developed. This research applies three of them to a set of text data in order to sort the texts into categories: the Bloom filter, the naive Bayes classifier, and neural networks. Each method has advantages and disadvantages, and this paper shows the strengths and weaknesses of each on a specific example. The algorithms were compared with one another in terms of speed and efficiency, where efficiency means the accuracy with which a text is assigned to its class. Each method was run on the same data sets while varying the amount of training and test data and the number of classification groups. The dataset used contains the following classes: world, business, sports, and science and technology. In real-world classification of such data, the number of categories is much larger than considered in this work, and categories may contain subcategories.
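Of the three methods, the Bloom filter is the least commonly used as a classifier, so a small illustration may help. The sketch below is not the authors' implementation: it assumes one filter per class, filled with the words of that class's training texts, and assigns a new text to the class whose filter matches the most of its words. The filter size, the number of hash functions, the SHA-256-based hashing, and the toy training sentences are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash functions (sizes are illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k positions by salting a SHA-256 digest with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(item))


def train(labelled_texts):
    """Build one filter per class from (text, label) pairs."""
    filters = {}
    for text, label in labelled_texts:
        bloom = filters.setdefault(label, BloomFilter())
        for word in text.lower().split():
            bloom.add(word)
    return filters


def classify(filters, text):
    """Return the class whose filter matches the most words of the text."""
    words = text.lower().split()
    scores = {
        label: sum(word in bloom for word in words)
        for label, bloom in filters.items()
    }
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
TRAIN = [
    ("team wins the championship match", "sports"),
    ("striker scores twice in the final", "sports"),
    ("stocks fall as markets react to earnings", "business"),
    ("company shares rise after merger deal", "business"),
]

FILTERS = train(TRAIN)
print(classify(FILTERS, "the striker scores in the match"))
```

Because a Bloom filter stores only bits, membership checks are fast and memory use is fixed, at the cost of occasional false positives. A practical version would likely also remove stop words, which this sketch does not do.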

In the course of this study, each method was analyzed with different parameter values to obtain its best result. Of the three methods, the neural network gave the best results for text data classification.
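The kind of parameter analysis described here can be illustrated with a naive Bayes classifier. The sketch below is an assumption-laden example and not the procedure used in the paper: it implements multinomial naive Bayes with additive smoothing, then measures accuracy and classification time for several values of the smoothing parameter alpha. The function names, the alpha values, and the toy data are hypothetical.

```python
import math
import time
from collections import Counter, defaultdict


def train_nb(labelled_texts):
    """Collect per-class word counts, per-class document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_texts:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def predict_nb(model, text, alpha=1.0):
    """Pick the class with the highest log posterior (alpha must be positive)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + alpha)
                / (total_words + alpha * len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(model, test_set, alpha):
    """Return (accuracy, elapsed seconds) for one smoothing value."""
    start = time.perf_counter()
    correct = sum(
        predict_nb(model, text, alpha) == label for text, label in test_set
    )
    elapsed = time.perf_counter() - start
    return correct / len(test_set), elapsed


# Toy data, invented for illustration only.
TRAIN = [
    ("team wins the championship match", "sports"),
    ("striker scores twice in the final", "sports"),
    ("stocks fall as markets react to earnings", "business"),
    ("company shares rise after merger deal", "business"),
]
TEST = [
    ("markets fall after earnings", "business"),
    ("team scores in the final match", "sports"),
]

MODEL = train_nb(TRAIN)
for alpha in (0.1, 1.0, 10.0):
    accuracy, seconds = evaluate(MODEL, TEST, alpha)
    print(f"alpha={alpha}: accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

On data this small the timings are not meaningful. The point is only the shape of the experiment: fix the data, vary one parameter, and record both accuracy and time for each setting.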

Author Biographies

Olena Hryshchenko, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”

Department of System Design

Institute for Applied Systems Analysis

Vadym Yaremenko, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”

Postgraduate Student, Assistant

Department of System Design

Institute for Applied Systems Analysis
