COMPARISON OF DATASET OVERSAMPLING ALGORITHMS AND THEIR APPLICABILITY TO THE CATEGORIZATION PROBLEM
DOI: https://doi.org/10.30837/ITSSI.2023.24.161
Keywords: categorization; machine learning; methods of balancing; data generation methods; dataset; unbalanced datasets
Abstract
The subject of research in this article is the classification problem in machine learning in the presence of imbalanced classes in datasets. The purpose of the work is to analyze existing solutions and algorithms for addressing dataset imbalance of different types and from different industries, and to compare these algorithms experimentally. The article solves the following tasks: analyze approaches to the problem – preprocessing methods, learning methods, hybrid methods, and algorithmic approaches; define and describe the oversampling algorithms most often used to balance datasets; select classification algorithms that serve as a tool for assessing the quality of balancing by checking the applicability of the datasets obtained after oversampling; determine classification quality metrics for the comparison; and conduct experiments according to the proposed methodology. For clarity, datasets with varying degrees of imbalance were considered (the number of minority-class instances equaled 15, 30, 45, and 60% of the number of majority-class samples). The following methods are used: analytical and inductive methods to determine the necessary set of experiments and to build hypotheses about their results, and experimental and graphical methods to obtain a visual comparative characterization of the selected algorithms. The following results were obtained: using the quality metrics, an experiment was conducted for all algorithms on two different datasets – the Titanic passenger dataset and a dataset for detecting fraudulent transactions in bank accounts. The results indicated that SMOTE and SVM SMOTE were the most applicable and that Borderline SMOTE and k-means SMOTE performed worst, while the behavior of each algorithm and the potential for its use were also described. Conclusions: the combined analytical and experimental method provided a comprehensive comparative description of the existing balancing algorithms. The superiority of oversampling algorithms over undersampling algorithms was demonstrated. The selected algorithms were compared using different classification algorithms. The results were presented in graphs and tables, and summarized using heat maps. The conclusions can be used when choosing the optimal balancing algorithm in machine learning applications.
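Since the article centres on comparing SMOTE-family oversamplers at fixed minority-to-majority ratios, the following minimal sketch may help illustrate the general workflow. It assumes the imbalanced-learn library as one common implementation of these algorithms; the synthetic dataset, the 60% target ratio, and the random forest classifier are illustrative assumptions, not the article's exact experimental setup.

```python
# Sketch: compare several SMOTE-family oversamplers (imbalanced-learn) on a
# synthetic imbalanced dataset, scoring a classifier on an untouched test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE

# Synthetic stand-in for the article's datasets: roughly 15% minority class.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# sampling_strategy=0.6 requests a 60% minority-to-majority ratio after
# resampling, mirroring one of the imbalance levels used in the experiments.
oversamplers = {
    "SMOTE": SMOTE(sampling_strategy=0.6, random_state=42),
    "SVM SMOTE": SVMSMOTE(sampling_strategy=0.6, random_state=42),
    "Borderline SMOTE": BorderlineSMOTE(sampling_strategy=0.6, random_state=42),
    "k-means SMOTE": KMeansSMOTE(sampling_strategy=0.6, random_state=42,
                                 cluster_balance_threshold=0.1),
}

for name, sampler in oversamplers.items():
    try:
        # Oversample the training split only; the test split stays imbalanced.
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    except RuntimeError as exc:
        # k-means SMOTE can fail when no cluster contains enough minority samples.
        print(f"{name}: skipped ({exc})")
        continue
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    print(f"{name}: balanced accuracy={balanced_accuracy_score(y_test, y_pred):.3f}, "
          f"minority-class F1={f1_score(y_test, y_pred):.3f}")
```

In the same spirit as the article, the classifier and the metrics can be swapped (e.g., logistic regression with ROC AUC) to check how sensitive the ranking of oversamplers is to the downstream model.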
References
Mary, A. J., Claret, A. (2021), "Imbalanced Classification Problems: Systematic Study and Challenges in Healthcare Insurance Fraud Detection", 5th International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, P. 1049–1055. DOI: 10.1109/ICOEI51242.2021.9452828
Srinilta, C., Kanharattanachai, S. (2021), "Application of Natural Neighbor-based Algorithm on Oversampling SMOTE Algorithms", 7th International Conference on Engineering, Applied Sciences and Technology (ICEAST), Pattaya, Thailand, P. 217–220. DOI: 10.1109/ICEAST52143.2021.9426310
Das, R., Biswas, S. K., Devi, D., Sarma, B. (2020), "An Oversampling Technique by Integrating Reverse Nearest Neighbor in SMOTE: Reverse-SMOTE", International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, P. 1239–1244. DOI: 10.1109/ICOSEC49089.2020.9215387
Feng, L. (2022), "Research on Customer Churn Intelligent Prediction Model based on Borderline-SMOTE and Random Forest", IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, P. 803–807. DOI: 10.1109/ICPICS55264.2022.9873702
Dudjak, M., Martinović, G. (2021), "An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult", Expert Systems with Applications, Vol. 182. DOI: 10.1016/j.eswa.2021.115297
Liu, C., Jin, S., Wang, D. (2020), "Constrained Oversampling: An Oversampling Approach to Reduce Noise Generation in Imbalanced Datasets with Class Overlapping", IEEE Access, Vol. 10, P. 91452–91465. DOI: 10.1109/ACCESS.2020.3018911
Ali, H., Mohd Salleh, M., Saedudin, R., Hussain, K., Mushtaq, M. (2019), "Imbalance class problems in data mining: a review", Indonesian Journal of Electrical Engineering and Computer Science, Vol. 14, No. 3, P. 1552–1563. DOI: 10.11591/ijeecs.v14.i3.pp1552-1563
Medium (2022), "Undersampling and oversampling: An old and a new approach", available at: https://medium.com/analytics-vidhya/undersampling-and-oversampling-an-old-and-a-new-approach-4f984a0e8392 (last accessed: 10.05.2023)
Sandeep Kini, M., Devidas, Pai, S. N., Kolekar, S., Pai, V., Balasubramani, R. (2022), "Use of Machine Learning and Random OverSampling in Stroke Prediction", International Conference on Artificial Intelligence and Data Engineering (AIDE), Karkala, India, P. 331–337. DOI: 10.1109/AIDE57180.2022.10060313
Blagus, R., Lusa, L. (2012), "Evaluation of SMOTE for High-Dimensional Class-Imbalanced Microarray Data", 11th International Conference on Machine Learning and Applications, Boca Raton, FL, USA, P. 89–94. DOI: 10.1109/ICMLA.2012.183
Sáez, J., Luengo, J., Stefanowski, J., Herrera, F. (2015), "SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering", Information Sciences, Vol. 291, P. 184–203. DOI: 10.1016/j.ins.2014.08.051
Mahalakshmi, M., Ramkumar, M. P., Emil Selvan, G. S. R. (2022), "SCADA Intrusion Detection System using Cost Sensitive Machine Learning and SMOTE-SVM", 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, P. 332–337. DOI: 10.1109/ICAC3N56670.2022.10074251
Puri, A., Gupta, M. (2020), "Improved Hybrid Bag-Boost Ensemble With K-Means-SMOTE–ENN Technique for Handling Noisy Class Imbalanced Data", The Computer Journal, Oxford University Press, Vol. 65, No. 1, P. 124–138. DOI: 10.1093/comjnl/bxab039
"Titanic – Machine Learning from Disaster" (2022), available at: https://www.kaggle.com/competitions/titanic/data?select=gender_submission.csv (last accessed: 10.05.2023)
"Salary Prediction Classification" (2022), available at: https://www.kaggle.com/datasets/ayessa/salary-prediction-classification (last accessed: 10.05.2023)
Ni, N., Wu, H., Zhang, L. (2022), "Deformable Alignment and Scale-Adaptive Feature Extraction Network for Continuous-Scale Satellite Video Super-Resolution", IEEE International Conference on Image Processing (ICIP), Bordeaux, France, P. 2746–2750. DOI: 10.1109/ICIP46576.2022.9897998
Yu, L., Zhou, R., Chen, R., Lai, K. K. (2020), "Missing data preprocessing in credit classification: One-hot encoding or imputation?", Emerging Markets Finance and Trade, Vol. 58, No. 2, P. 472–482. DOI: 10.1080/1540496X.2020.1825935
Dahouda, M. K., Joe, I. (2021), "A Deep-Learned Embedding Technique for Categorical Features Encoding", IEEE Access, Vol. 9, P. 114381–114391. DOI: 10.1109/ACCESS.2021.3104357
License
Copyright (c) 2023 Денис Тесленко, Анна Сорокіна, Артем Ховрат, Нурал Гулієв, Валентина Кирій
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Our journal abides by the Creative Commons copyright and permissions policy for open access journals.
Authors who publish with this journal agree to the following terms:
Authors hold the copyright without restrictions and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-commercial and non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their published work online (e.g., in institutional repositories or on their website) as it can lead to productive exchanges, as well as earlier and greater citation of published work.