Unbalanced credit fraud modeling based on bagging and Bayesian optimization

Authors

DOI:

https://doi.org/10.15587/1729-4061.2023.279936

Keywords:

unbalanced data, Bayesian optimization, random forest, majority and minority class

Abstract

Credit fraud modeling is a crucial area of research that is highly relevant to the credit loan industry. Effective risk management is a key factor in providing quality credit services and directly impacts the profitability and bad-debt ratio of lending organizations in this sector. However, when the distribution of credit fraud data is highly unbalanced, training can suffer from noise errors caused by information distortion, periodic statistical errors, and model biases. This can produce unfair results for the minority (target) class and increase the risk of overfitting. While traditional data balancing methods can reduce a model's bias towards the majority class on moderately unbalanced data, they may not be effective in highly unbalanced scenarios. To address this challenge, this paper proposes using bagging algorithms, namely Random Forest and Bagging, to model highly unbalanced credit fraud data. Bayesian optimization is used to tune the hyperparameters, with the accuracy on the minority class serving as the optimization objective, and the model is tested on real European credit card fraud data. The results of the proposed bagging algorithms are compared with traditional data balancing methods such as Balanced Bagging and Balanced Random Forest. The study found that traditional data balancing methods may not be compatible with excessively unbalanced data, whereas bagging algorithms show promise as a solution for modeling such data. The proposed method for finding hyperparameters deals effectively with highly unbalanced data, achieving precision, recall, and F1-score for the minority class of 0.94, 0.81, and 0.87, respectively. The study emphasizes the importance of addressing the challenges associated with unbalanced credit fraud data to improve the accuracy and fairness of credit fraud models.
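To make the described pipeline concrete, the following is a minimal sketch of how bagging with Bayesian hyperparameter optimization over a minority-class objective could be assembled. It is not the authors' implementation: the libraries (scikit-learn, scikit-optimize, pandas), the search space, and the local file name creditcard.csv (the Kaggle dataset in [36]) are assumptions made for illustration only.

```python
# Sketch: Random Forest tuned with Bayesian optimization, scored on the
# minority (fraud) class of the Kaggle credit card fraud dataset [36].
# Assumes scikit-learn, scikit-optimize and pandas are installed and that
# creditcard.csv has been downloaded locally (hypothetical path).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer

# "Class" is 1 for fraud (minority class) and 0 for legitimate transactions.
data = pd.read_csv("creditcard.csv")
X, y = data.drop(columns=["Class"]), data["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Bayesian optimization over a small hyperparameter space; the objective is
# the F1-score of the positive (minority) class, mirroring the paper's focus
# on minority-class accuracy rather than overall accuracy.
search = BayesSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    {
        "n_estimators": Integer(50, 500),
        "max_depth": Integer(3, 30),
        "min_samples_leaf": Integer(1, 10),
        "max_features": Categorical(["sqrt", "log2"]),
    },
    n_iter=25,
    scoring="f1",   # binary F1 computed for the minority class (pos_label=1)
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

# Per-class precision, recall and F1 on the held-out split.
print(classification_report(y_te, search.predict(X_te), digits=2))
```

The same search could be wrapped around a BaggingClassifier, or around the BalancedBaggingClassifier and BalancedRandomForestClassifier from imbalanced-learn, to reproduce the comparison between plain bagging and traditional data balancing reported in the abstract.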

Supporting Agency

  • The authors acknowledge University of Al-Hamdaniya for making laboratories and experimental materials available for this research project.

Author Biographies

Mohammed Alaaalden Kashmoola, University of Al-Hamdaniya

Master, Lecturer

Department of Computer Science

Samah Fakhri Aziz, University of Al-Hamdaniya

Master, Lecturer

Department of Computer Science

Hasan Mudhafar Qays, University of Al-Hamdaniya

Master, Lecturer

Department of Computer Science

Naors Y. Anad Alsaleem, University of Al-Hamdaniya

PhD, Assistant Professor

Department of Computer Science

References

  1. Patil, A. (2018). Digital Transformation in Financial Services and Challenges and Opportunities. International Journal of Trend in Scientific Research and Development, Special Issue (Special Issue-ICDEBI2018), 10–12. doi: https://doi.org/10.31142/ijtsrd18661
  2. McNulty, D., Milne, A. (2021). Bigger Fish to Fry: FinTech and the Digital Transformation of Financial Services. Disruptive Technology in Banking and Finance, 263–281. doi: https://doi.org/10.1007/978-3-030-81835-7_10
  3. Lokanan, M. E., Sharma, K. (2022). Fraud prediction using machine learning: The case of investment advisors in Canada. Machine Learning with Applications, 8, 100269. doi: https://doi.org/10.1016/j.mlwa.2022.100269
  4. Roberts, P. (2022). Risk Management Systems. Simplifying Risk Management, 43–66. doi: https://doi.org/10.4324/9781003225157-3
  5. Meng, C., Zhou, L., Liu, B. (2020). A Case Study in Credit Fraud Detection With SMOTE and XGBoost. Journal of Physics: Conference Series, 1601 (5), 052016. doi: https://doi.org/10.1088/1742-6596/1601/5/052016
  6. Lin, T.-H., Jiang, J.-R. (2021). Credit Card Fraud Detection with Autoencoder and Probabilistic Random Forest. Mathematics, 9 (21), 2683. doi: https://doi.org/10.3390/math9212683
  7. Shamsudin, H., Yusof, U. K., Jayalakshmi, A., Akmal Khalid, M. N. (2020). Combining oversampling and undersampling techniques for imbalanced classification: A comparative study using credit card fraudulent transaction dataset. 2020 IEEE 16th International Conference on Control & Automation (ICCA). doi: https://doi.org/10.1109/icca51439.2020.9264517
  8. Ahammad, J., Hossain, N., Alam, M. S. (2020). Credit Card Fraud Detection using Data Pre-processing on Imbalanced Data - both Oversampling and Undersampling. Proceedings of the International Conference on Computing Advancements. doi: https://doi.org/10.1145/3377049.3377113
  9. Praveen Mahesh, K., Ashar Afrouz, S., Shaju Areeckal, A. (2022). Detection of fraudulent credit card transactions: A comparative analysis of data sampling and classification techniques. Journal of Physics: Conference Series, 2161 (1), 012072. doi: https://doi.org/10.1088/1742-6596/2161/1/012072
  10. Wang, S., Dai, Y., Shen, J., Xuan, J. (2021). Research on expansion and classification of imbalanced data based on SMOTE algorithm. Scientific Reports, 11 (1). doi: https://doi.org/10.1038/s41598-021-03430-5
  11. Klikowski, J., Woźniak, M. (2022). Deterministic Sampling Classifier with weighted Bagging for drifted imbalanced data stream classification. Applied Soft Computing, 122, 108855. doi: https://doi.org/10.1016/j.asoc.2022.108855
  12. Kim, M., Hwang, K.-B. (2022). An empirical evaluation of sampling methods for the classification of imbalanced data. PLOS ONE, 17 (7), e0271260. doi: https://doi.org/10.1371/journal.pone.0271260
  13. Liu, Z., Cao, W., Gao, Z., Bian, J., Chen, H., Chang, Y., Liu, T.-Y. (2020). Self-paced Ensemble for Highly Imbalanced Massive Data Classification. 2020 IEEE 36th International Conference on Data Engineering (ICDE). doi: https://doi.org/10.1109/icde48307.2020.00078
  14. Park, S., Lee, H.-W., Im, J. (2022). Resampling-Based Relabeling & Raking Algorithm to One-Class Classification. doi: https://doi.org/10.36227/techrxiv.17712122.v2
  15. Juez-Gil, M., Arnaiz-González, Á., Rodríguez, J. J., García-Osorio, C. (2021). Experimental evaluation of ensemble classifiers for imbalance in Big Data. Applied Soft Computing, 108, 107447. doi: https://doi.org/10.1016/j.asoc.2021.107447
  16. Wang, Q., Luo, Z., Huang, J., Feng, Y., Liu, Z. (2017). A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM. Computational Intelligence and Neuroscience, 2017, 1–11. doi: https://doi.org/10.1155/2017/1827016
  17. Singh, G., Chu, L., Wang, L., Pei, J., Tian, Q., Zhang, Y. (2022). Mining Minority-Class Examples with Uncertainty Estimates. Lecture Notes in Computer Science, 258–271. doi: https://doi.org/10.1007/978-3-030-98358-1_21
  18. Rayhan, F., Ahmed, S., Mahbub, A., Jani, Md. R., Shatabda, S., Farid, D. Md., Rahman, C. M. (2017). MEBoost: Mixing estimators with boosting for imbalanced data classification. 2017 11th International Conference on Software, Knowledge, Information Management and Applications (SKIMA). doi: https://doi.org/10.1109/skima.2017.8294128
  19. Meng, Q. (2022). Credit Card Fraud Detection Using Feature Fusion-based Machine Learning Model. Highlights in Science, Engineering and Technology, 23, 111–116. doi: https://doi.org/10.54097/hset.v23i.3208
  20. Zhou, Z.-H. (2020). Ensemble Learning: Foundations and Algorithms. Beijing: Electronic Industry Press.
  21. Chen, Z., Duan, J., Kang, L., Qiu, G. (2021). A hybrid data-level ensemble to enable learning from highly imbalanced dataset. Information Sciences, 554, 157–176. doi: https://doi.org/10.1016/j.ins.2020.12.023
  22. Bader-El-Den, M., Teitei, E., Perry, T. (2019). Biased Random Forest For Dealing With the Class Imbalance Problem. IEEE Transactions on Neural Networks and Learning Systems, 30 (7), 2163–2172. doi: https://doi.org/10.1109/tnnls.2018.2878400
  23. Borah, P., Gupta, D. (2021). Robust twin bounded support vector machines for outliers and imbalanced data. Applied Intelligence, 51 (8), 5314–5343. doi: https://doi.org/10.1007/s10489-020-01847-5
  24. Gupta, D., Richhariya, B. (2018). Entropy based fuzzy least squares twin support vector machine for class imbalance learning. Applied Intelligence, 48 (11), 4212–4231. doi: https://doi.org/10.1007/s10489-018-1204-4
  25. Husejinovic, A. (2020). Credit card fraud detection using naive Bayesian and C4.5 decision tree classifiers. Periodicals of Engineering and Natural Sciences, 8 (1), 1–5.
  26. Kalid, S. N., Ng, K.-H., Tong, G.-K., Khor, K.-C. (2020). A Multiple Classifiers System for Anomaly Detection in Credit Card Data With Unbalanced and Overlapped Classes. IEEE Access, 8, 28210–28221. doi: https://doi.org/10.1109/access.2020.2972009
  27. Lee, T.-H., Ullah, A., Wang, R. (2019). Bootstrap Aggregating and Random Forest. Advanced Studies in Theoretical and Applied Econometrics, 389–429. doi: https://doi.org/10.1007/978-3-030-31150-6_13
  28. Denuit, M., Hainaut, D., Trufin, J. (2020). Bagging Trees and Random Forests. Effective Statistical Learning Methods for Actuaries II, 107–130. doi: https://doi.org/10.1007/978-3-030-57556-4_4
  29. Genuer, R., Poggi, J.-M. (2020). Random Forests. Random Forests with R, 33–55. doi: https://doi.org/10.1007/978-3-030-56485-8_3
  30. Syam, N., Kaul, R. (2021). Random Forest, Bagging, and Boosting of Decision Trees. Machine Learning and Artificial Intelligence in Marketing and Sales, 139–182. doi: https://doi.org/10.1108/978-1-80043-880-420211006
  31. Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. doi: https://doi.org/10.1023/a:1010933404324
  32. Hatwell, J., Gaber, M. M., Azad, R. M. A. (2020). CHIRPS: Explaining random forest classification. Artificial Intelligence Review, 53 (8), 5747–5788. doi: https://doi.org/10.1007/s10462-020-09833-6
  33. Lin, W., Gao, J., Wang, B., Hong, Q. (2021). An Improved Random Forest Classifier for Imbalanced Learning. 2021 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA). doi: https://doi.org/10.1109/icaica52286.2021.9497933
  34. Sudha, C., Akila, D. (2021). Credit Card Fraud Detection System based on Operational & Transaction features using SVM and Random Forest Classifiers. 2021 2nd International Conference on Computation, Automation and Knowledge Management (ICCAKM). doi: https://doi.org/10.1109/iccakm50778.2021.9357709
  35. Aung, M. H., Seluka, P. T., Fuata, J. T. R., Tikoisuva, M. J., Cabealawa, M. S., Nand, R. (2020). Random Forest Classifier for Detecting Credit Card Fraud based on Performance Metrics. 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). doi: https://doi.org/10.1109/csde50874.2020.9411563
  36. Credit Card Fraud Detection. Available at: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
  37. Carrington, A. M., Manuel, D. G., Fieguth, P. W., Ramsay, T., Osmani, V., Wernly, B. et al. (2023). Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45 (1), 329–341. doi: https://doi.org/10.1109/tpami.2022.3145392
  38. Kamble, S., Desai, A., Vartak, P. (2014). Evaluation and Performance Analysis of Machine Learning Algorithms. International Journal of Engineering Sciences & Research Technology, 3 (5), 789–794.
  39. Glisic, S. G., Lorenzo, B. (2022). Machine Learning Algorithms. Artificial Intelligence and Quantum Computing for Advanced Wireless Networks, 17–54. doi: https://doi.org/10.1002/9781119790327.ch2
  40. Piegorsch, W. W. (2020). Confusion Matrix. Wiley StatsRef: Statistics Reference Online, 1–4. doi: https://doi.org/10.1002/9781118445112.stat08244

Published

2023-06-30

How to Cite

Kashmoola, M. A., Aziz, S. F., Qays, H. M., & Alsaleem, N. Y. A. (2023). Unbalanced credit fraud modeling based on bagging and Bayesian optimization. Eastern-European Journal of Enterprise Technologies, 3 (4 (123)), 47–53. https://doi.org/10.15587/1729-4061.2023.279936

Issue

Section

Mathematics and Cybernetics - applied aspects