COMPARISON OF DATASET OVERSAMPLING ALGORITHMS AND THEIR APPLICABILITY TO THE CATEGORIZATION PROBLEM
DOI: https://doi.org/10.30837/ITSSI.2023.24.161
Keywords: categorization; machine learning; methods of balancing; data generation methods; dataset; unbalanced datasets
Abstract
The subject of research in this article is the classification problem in machine learning in the presence of imbalanced classes in datasets. The purpose of the work is to analyze existing solutions and algorithms for addressing dataset imbalance of different types and from different industries, and to compare these algorithms experimentally. The article solves the following tasks: analyze approaches to the problem – preprocessing methods, learning methods, hybrid methods, and algorithmic approaches; define and describe the oversampling algorithms most often used to balance datasets; select classification algorithms that serve as a tool for assessing the quality of balancing by checking the applicability of the datasets obtained after oversampling; determine classification quality metrics for the comparison; and conduct experiments according to the proposed methodology. For clarity, datasets with varying degrees of imbalance were considered (the number of minority-class instances equaled 15, 30, 45, and 60% of the number of majority-class samples). The following methods are used: analytical and inductive methods to determine the necessary set of experiments and to build hypotheses about their results, and experimental and graphical methods to obtain a visual comparative characterization of the selected algorithms. The following results were obtained: using the quality metrics, an experiment was conducted for all algorithms on two different datasets – the Titanic passenger dataset and a dataset for detecting fraudulent transactions in bank accounts. The results indicated that SMOTE and SVM SMOTE were the most applicable and that Borderline SMOTE and k-means SMOTE performed worst, while the behavior of each algorithm and the potential for its use were also described. Conclusions: the combined analytical and experimental method provided a comprehensive comparative description of the existing balancing algorithms. The superiority of oversampling algorithms over undersampling algorithms was demonstrated. The selected algorithms were compared using different classification algorithms. The results were presented in graphs and tables, and summarized using heat maps. The conclusions can be used when choosing the optimal balancing algorithm in machine learning applications.
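Since the article centres on comparing SMOTE-family oversamplers at fixed minority-to-majority ratios, the following minimal sketch may help illustrate the general workflow. It assumes the imbalanced-learn library as one common implementation of these algorithms; the synthetic dataset, the 60% target ratio, and the random forest classifier are illustrative assumptions, not the article's exact experimental setup.

```python
# Sketch: compare several SMOTE-family oversamplers (imbalanced-learn) on a
# synthetic imbalanced dataset, scoring a classifier on an untouched test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE

# Synthetic stand-in for the article's datasets: roughly 15% minority class.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# sampling_strategy=0.6 requests a 60% minority-to-majority ratio after
# resampling, mirroring one of the imbalance levels used in the experiments.
oversamplers = {
    "SMOTE": SMOTE(sampling_strategy=0.6, random_state=42),
    "SVM SMOTE": SVMSMOTE(sampling_strategy=0.6, random_state=42),
    "Borderline SMOTE": BorderlineSMOTE(sampling_strategy=0.6, random_state=42),
    "k-means SMOTE": KMeansSMOTE(sampling_strategy=0.6, random_state=42,
                                 cluster_balance_threshold=0.1),
}

for name, sampler in oversamplers.items():
    try:
        # Oversample the training split only; the test split stays imbalanced.
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    except RuntimeError as exc:
        # k-means SMOTE can fail when no cluster contains enough minority samples.
        print(f"{name}: skipped ({exc})")
        continue
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    print(f"{name}: balanced accuracy={balanced_accuracy_score(y_test, y_pred):.3f}, "
          f"minority-class F1={f1_score(y_test, y_pred):.3f}")
```

In the same spirit as the article, the classifier and the metrics can be swapped (e.g., logistic regression with ROC AUC) to check how sensitive the ranking of oversamplers is to the downstream model.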
References
Mary, A. J., Claret, A. (2021), "Imbalanced Classification Problems: Systematic Study and Challenges in Healthcare Insurance Fraud Detection", 5th International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, P. 1049–1055. DOI: 10.1109/ICOEI51242.2021.9452828
Srinilta, C., Kanharattanachai, S. (2021), "Application of Natural Neighbor-based Algorithm on Oversampling SMOTE Algorithms", 7th International Conference on Engineering, Applied Sciences and Technology (ICEAST), Pattaya, Thailand, P. 217–220. DOI: 10.1109/ICEAST52143.2021.9426310
Das, R., Biswas, S. K., Devi, D., Sarma, B. (2020), "An Oversampling Technique by Integrating Reverse Nearest Neighbor in SMOTE: Reverse-SMOTE", International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, P. 1239–1244. DOI: 10.1109/ICOSEC49089.2020.9215387
Feng, L. (2022), "Research on Customer Churn Intelligent Prediction Model based on Borderline-SMOTE and Random Forest", IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, P. 803–807. DOI: 10.1109/ICPICS55264.2022.9873702
Dudjak, M., Martinović, G. (2021), "An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult", Expert Systems with Applications, Vol. 182. DOI: 10.1016/j.eswa.2021.115297
Liu, C., Jin, S., Wang, D. (2020), "Constrained Oversampling: An Oversampling Approach to Reduce Noise Generation in Imbalanced Datasets with Class Overlapping", IEEE Access, Vol. 10, P. 91452–91465. DOI: 10.1109/ACCESS.2020.3018911
Ali, H., Mohd Salleh, M., Saedudin, R., Hussain, K., Mushtaq, M. (2019), "Imbalance class problems in data mining: a review", Indonesian Journal of Electrical Engineering and Computer Science, Vol. 14, No. 3, P. 1552–1563. DOI: 10.11591/ijeecs.v14.i3.pp1552-1563
Medium (2022), "Undersampling and oversampling: An old and a new approach", available at: https://medium.com/analytics-vidhya/undersampling-and-oversampling-an-old-and-a-new-approach-4f984a0e8392 (last accessed: 10.05.2023)
Sandeep Kini, M., Devidas, Pai, S. N., Kolekar, S., Pai, V., Balasubramani, R. (2022), "Use of Machine Learning and Random OverSampling in Stroke Prediction", International Conference on Artificial Intelligence and Data Engineering (AIDE), Karkala, India, P. 331–337. DOI: 10.1109/AIDE57180.2022.10060313
Blagus, R., Lusa, L. (2012), "Evaluation of SMOTE for High-Dimensional Class-Imbalanced Microarray Data", 11th International Conference on Machine Learning and Applications, Boca Raton, FL, USA, P. 89–94. DOI: 10.1109/ICMLA.2012.183
Sáez, J., Luengo, J., Stefanowski, J., Herrera, F. (2015), "SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering", Information Sciences, Vol. 291, P. 184–203. DOI: 10.1016/j.ins.2014.08.051
Mahalakshmi, M., Ramkumar, M. P., Emil Selvan, G. S. R. (2022), "SCADA Intrusion Detection System using Cost Sensitive Machine Learning and SMOTE-SVM", 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, P. 332–337. DOI: 10.1109/ICAC3N56670.2022.10074251
Puri, A., Gupta, M. (2020), "Improved Hybrid Bag-Boost Ensemble With K-Means-SMOTE–ENN Technique for Handling Noisy Class Imbalanced Data", The Computer Journal, Oxford University Press, Vol. 65, No. 1, P. 124–138. DOI: 10.1093/comjnl/bxab039
"Titanic – Machine Learning from Disaster" (2022), available at: https://www.kaggle.com/competitions/titanic/data?select=gender_submission.csv (last accessed: 10.05.2023)
"Salary Prediction Classification" (2022), available at: https://www.kaggle.com/datasets/ayessa/salary-prediction-classification (last accessed: 10.05.2023)
Ni, N., Wu, H., Zhang, L. (2022), "Deformable Alignment and Scale-Adaptive Feature Extraction Network for Continuous-Scale Satellite Video Super-Resolution", IEEE International Conference on Image Processing (ICIP), Bordeaux, France, P. 2746–2750. DOI: 10.1109/ICIP46576.2022.9897998
Yu, L., Zhou, R., Chen, R., Lai, K. K. (2020), "Missing data preprocessing in credit classification: One-hot encoding or imputation?", Emerging Markets Finance and Trade, Vol. 58, No. 2, P. 472–482. DOI: 10.1080/1540496X.2020.1825935
Dahouda, M. K., Joe, I. (2021), "A Deep-Learned Embedding Technique for Categorical Features Encoding", IEEE Access, Vol. 9, P. 114381–114391. DOI: 10.1109/ACCESS.2021.3104357
License
Copyright (c) 2023 Денис Тесленко, Анна Сорокіна, Артем Ховрат, Нурал Гулієв, Валентина Кирій
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Our journal abides by the Creative Commons copyright and permissions policy for open access journals.
Authors who publish with this journal agree to the following terms:
Authors hold the copyright without restrictions and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-commercial and non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their published work online (e.g., in institutional repositories or on their website) as it can lead to productive exchanges, as well as earlier and greater citation of published work.