A COMPARATIVE ANALYSIS OF TEXT DATA CLASSIFICATION ACCURACY AND SPEED USING NEURAL NETWORKS, BLOOM FILTER AND NAIVE BAYES

The object of research is methods of fast classification for text data. The study is motivated by the rapid growth of textual data in both digital and printed form. Such data must be processed by software, since human resources cannot process this volume in full. Many data classification approaches have been developed. This research applies three of them, the Bloom filter, the naive Bayesian classifier and neural networks, to a set of text data in order to classify the texts into categories. Each method has both advantages and disadvantages, and this paper shows the strengths and weaknesses of each on a specific example. The algorithms were compared with one another in terms of speed and efficiency, that is, the accuracy with which a text is assigned to a class. Each method was run on the same data sets while varying the amount of training and test data and the number of classification groups. The dataset used contains the following classes: world, business, sports, and science and technology. In real conditions, the number of categories is much larger than that considered in this work, and categories may contain subcategories. Each method was analyzed with different parameter values to obtain the best result. According to the results obtained, the best classification of text data was achieved with a neural network.


Introduction
The amount of textual information is growing every day. Most of it is digital, but printed texts also exist. These texts come from many sources: social networks, web pages, e-mails, articles, messages, books, customer support, telephone conversations, and more. At such volumes, manual processing becomes unrealistic, so automated processing is needed.
For such tasks, methods of text classification and clustering are used. To date, a large number of text classification methods and their variations have been developed. Each group of methods has its advantages and disadvantages, areas of application, features and limitations [1]. The object of research is the methods of classification of text data. The aim of research is a comparative analysis of classification methods to determine the best one for the problem of text data classification. Therefore, it is important to assess the accuracy and time of text classification by different methods depending on the amount of input data.

Research methodology
The study is based on the following classification methods: Bloom filter [2, 3], naive Bayesian classifier [4] and neural network [5-7]. The classification of text data was divided into the following steps: (1) text preprocessing; (2) model training; (3) data set classification; (4) assessment of the accuracy of the results. The total amount of text data used in this research is as follows: 30,000 texts for each of the 4 categories for training and 1,900 texts for testing. An overview of the data set is presented in Table 1.
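The four steps above can be illustrated with a minimal sketch of a multinomial naive Bayesian classifier. This is an illustration under stated assumptions and not the implementation used in the study: the function names, the Laplace smoothing and the toy data are all hypothetical, and the tokens are assumed to be already preprocessed.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Fit word and class counts from a list of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "class_counts": class_counts,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
        "total": sum(class_counts.values()),
    }


def classify(model, tokens):
    """Return the label with the highest log-posterior (Laplace smoothing assumed)."""
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / model["total"])  # log prior
        for token in tokens:
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Share of (tokens, label) pairs that are classified correctly."""
    correct = sum(1 for tokens, label in samples if classify(model, tokens) == label)
    return correct / len(samples)


# Toy data invented for illustration; it is not taken from the study's data set.
train = [
    (["team", "wins", "match"], "sports"),
    (["player", "scores", "goal"], "sports"),
    (["stocks", "market", "rise"], "business"),
    (["company", "profit", "market"], "business"),
]
test = [(["team", "scores"], "sports"), (["market", "profit"], "business")]

model = train_naive_bayes(train)
print(accuracy(model, test))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied for long texts.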
Data preprocessing includes removal of punctuation marks, cleaning of stop words, lemmatization, stemming and cutting the words [8]. The data can then be used to train or test the classification model; preprocessing results are shown in Table 2 [9]. While this processing is sufficient for the Bloom filter and the naive Bayesian classifier, it is not enough for the neural network. For neural networks, the input text must also be converted into numerical form so that mathematical and statistical calculations can be applied [10].
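As a rough illustration of the preprocessing and the conversion to numerical form, the sketch below lowercases a text, drops punctuation and stop words, applies a crude suffix-stripping stand-in for stemming, and then encodes tokens as a bag-of-words count vector. The stop-word list, the stemming rule and the bag-of-words encoding are assumptions made for this example; the study does not specify which tools or encoding it used.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (assumed); a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_vocabulary(token_lists, max_words):
    """Map the max_words most frequent tokens to integer indices."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_words))}


def to_vector(tokens, vocabulary):
    """Bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


tokens = preprocess("The players are scoring goals!")
print(tokens)
```

The hypothetical `max_words` parameter caps the vocabulary size, in the same spirit as limiting the input to a fixed number of unique words.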
The operation of each method was investigated using different parameter values. For example, for the neural network, different numbers of epochs and neurons and different activation functions were considered. For each method, the optimal parameter values were determined, i.e. the values that gave the best results with different amounts of data, their completeness and the number of categories. The comparative analysis was based on the measured operating time and accuracy: the faster and more accurate the algorithm, the better it is. The following activation functions for the neural network were considered: rectified linear unit, sigmoid, tanh and exponential linear unit. The best result with optimal parameter values was obtained using the last of these, the exponential linear unit. This experiment is a continuation of the research started in [7], where neural networks and naive Bayesian classifiers were compared.
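For reference, the four activation functions named above can be written out using their standard definitions. This is a minimal sketch: the scaling parameter alpha = 1.0 for the exponential linear unit is a common default assumed here, not a value reported in the study.

```python
import math


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    """Hyperbolic tangent."""
    return math.tanh(x)


def elu(x, alpha=1.0):
    """Exponential linear unit; alpha=1.0 is an assumed default."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for f in (relu, sigmoid, tanh, elu):
    print(f.__name__, f(-1.0), f(1.0))
```

Unlike the rectified linear unit, the exponential linear unit returns small negative values for negative inputs instead of zero, so its gradient does not vanish there.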

Obtained results and discussion
Using the data from Table 1 and the optimal parameter values obtained for each method, the dependence of classification accuracy on the amount of data and the number of classification groups was determined (Fig. 1-3). For the neural network, the parameters were as follows: the exponential linear unit activation function, 10 neurons, 10 epochs and 3,000 unique words. For the Bloom filter, the false positive probability was set to 0.6. Testing was performed on a personal computer with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz processor, 8 GB RAM and a 256 GB SSD.
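The study does not describe how the Bloom filter was turned into a classifier, so the following is only one possible sketch. It assumes one filter per class holding that class's training words, the standard sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, and classification by counting how many tokens each filter reports as present. The class and function names and the SHA-256 based hashing are hypothetical choices.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n, p = expected_items, false_positive_rate
        self.size = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / n * math.log(2)))
        self.bits = [False] * self.size

    def _positions(self, item):
        # Derive k bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def train_filters(samples, false_positive_rate):
    """Build one filter per class from (tokens, label) pairs (assumed scheme)."""
    words_by_class = {}
    for tokens, label in samples:
        words_by_class.setdefault(label, set()).update(tokens)
    filters = {}
    for label, words in words_by_class.items():
        bloom = BloomFilter(len(words), false_positive_rate)
        for word in words:
            bloom.add(word)
        filters[label] = bloom
    return filters


def classify(filters, tokens):
    """Pick the class whose filter reports the most token hits."""
    return max(filters, key=lambda label: sum(t in filters[label] for t in tokens))


sizing = BloomFilter(1000, 0.6)
print(sizing.size, sizing.num_hashes)
```

Under these formulas, a false positive probability of 0.6 gives roughly one bit per stored word and a single hash function, so the filter is very compact and fast to query but reports many words that were never added.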
Before presenting the results, the following research limitations need to be mentioned: the classification was done only for 2-4 classes; the testing was performed on only one PC, so the timing may differ in another environment; words in the texts were converted into numeric format without considering synonyms. The last limitation means that the classification accuracy could be improved by using methods for finding the similarity of words.
The speed of data classification depending on the amount of text data and the number of categories was also investigated. The results are presented in Table 3. Based on these results, further research should proceed in the following directions: improving the accuracy of neural networks by changing their architecture and parameters, and improving the Bloom filter to increase the quality of classification. Another possible way to improve the result is to combine two classification methods.
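Classification time of the kind reported here can be measured with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the procedure used in the study; `time_classification` and its parameters are hypothetical names.

```python
import time


def time_classification(classify_fn, texts, repeats=3):
    """Return the best wall-clock time in seconds over several full runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            classify_fn(text)
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder classifier used only to demonstrate the harness.
elapsed = time_classification(lambda text: len(text) % 4, ["sample text"] * 1000)
print(f"{elapsed:.6f} s")
```

Taking the minimum over several repeats reduces the influence of background load on a single machine, which matters when all timings come from one PC.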

Conclusions
In this paper, methods of fast classification of text data are considered and compared. The methods were analyzed with respect to two important parameters: speed and efficiency. The accuracy and time of classification change with the number of classification groups. As the number of categories increases, so does the amount of training data needed to reach the accuracy achieved for a small number of classes.
The neural network has the best accuracy: 97.3 % for 2 classes, 85.21 % for 3 classes and 75.54 % for 4 classes. By comparison, the results for the naive Bayesian classifier are 94.66 %, 90.28 % and 84.94 %, respectively, and for the Bloom filter 42 %, 23.1 % and 19.12 %, respectively.
In terms of time, the best results are obtained with the Bloom filter, as shown in Table 3. In problems where both accuracy and speed matter, the neural network is best suited. The naive Bayesian classifier also gives fairly high accuracy, but it is very slow compared to the other methods. The Bloom filter, however, has the great advantage of being fast: for problems that require rapid classification of data, it is better than the other methods considered in this paper. Before choosing a method, it is necessary to define the problem to be solved and to select the algorithm accordingly.