Detection and classification of threats and vulnerabilities on hacker forums based on machine learning

Saken Mambetov; Ihor Ilhe; Vitalina Babenko; Bakytzhan Kulambayev; Olena Fridman; Serik Joldasbayev; Hanna Doroshenko; Oleksandr Gurko; Yenlik Begimbayeva; Serhii Neronov

doi:10.15587/1729-4061.2024.306522

Authors

Saken Mambetov Al-Farabi Kazakh National University, Kazakhstan https://orcid.org/0000-0002-7249-5378
Ihor Ilhe Kharkiv National Automobile and Highway University, Ukraine https://orcid.org/0000-0002-0585-8685
Vitalina Babenko Kharkiv National Automobile and Highway University, Ukraine; Daugavpils University, Latvia https://orcid.org/0000-0002-4816-4579
Bakytzhan Kulambayev Turan University, Kazakhstan https://orcid.org/0009-0002-9279-6239
Olena Fridman V. N. Karazin Kharkiv National University, Ukraine https://orcid.org/0000-0002-7437-6372
Serik Joldasbayev International IT University, Kazakhstan https://orcid.org/0000-0002-8689-1822
Hanna Doroshenko V. N. Karazin Kharkiv National University, Ukraine https://orcid.org/0000-0002-5535-8494
Oleksandr Gurko Kharkiv National Automobile and Highway University, Ukraine https://orcid.org/0000-0001-9905-8584
Yenlik Begimbayeva AUPET named after Gumarbek Daukeyev, Kazakhstan https://orcid.org/0000-0002-4907-3345
Serhii Neronov Kharkiv National Automobile and Highway University, Ukraine https://orcid.org/0000-0003-2381-1271

DOI:

https://doi.org/10.15587/1729-4061.2024.306522

Keywords:

cybersecurity, hacker forum, threat identification, data classification, machine learning

Abstract

The object of this study is the process of detecting threats and vulnerabilities in hacker forums, which are a well-known source of potential dangers for Internet users. Analyzing and classifying data from these forums is difficult because participants use specific slang and jargon, so modern text-processing tools are required. This paper explores the application of machine learning to devise an effective method for analyzing sentiment and trends in hacker forums in order to identify potential threats and vulnerabilities in cyberspace. All necessary stages of the detection process have been developed, from data collection and preprocessing to the training of a model capable of processing “raw” unstructured data from hacker forums. Six popular machine learning algorithms, namely k-Nearest Neighbors (kNN), Random Forest, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), and Decision Tree, have been implemented and compared in terms of their efficiency in detecting and classifying threats and vulnerabilities. The experiments were conducted on real data (150,000 messages). The Random Forest algorithm performed best (accuracy=0.89, recall=0.84, precision=0.91, F1-score=0.87, and ROC-AUC=0.89). The proposed machine-learning-based tool not only collects data that poses a potential threat but also processes and classifies it according to specified keywords, which allows threats and vulnerabilities to be detected at high speed. The results of the study make it possible to identify potential trends in threats and vulnerabilities. This will contribute to the improvement of cybersecurity systems and ensure more reliable protection of information resources.
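The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary labeling (1 = threat-related message, 0 = benign), which the abstract does not state explicitly. The function names and the toy labels are hypothetical and are not taken from the study. ROC-AUC is omitted because it needs ranked classifier scores rather than hard labels.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # threat messages correctly flagged
    fp: int  # benign messages wrongly flagged
    fn: int  # threat messages missed
    tn: int  # benign messages correctly left alone


def confusion(y_true, y_pred):
    """Count outcomes for binary labels (assumed: 1 = threat, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
    )


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def metrics(c):
    """Accuracy, precision, recall and F1 from confusion counts."""
    flagged = c.tp + c.fp
    actual = c.tp + c.fn
    total = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / flagged if flagged else 0.0
    recall = c.tp / actual if actual else 0.0
    return {
        "accuracy": (c.tp + c.tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = metrics(confusion(y_true, y_pred))

# Consistency check using the precision and recall reported for Random Forest.
reported_f1 = f1_from(0.91, 0.84)
```

With the reported precision of 0.91 and recall of 0.84, the harmonic mean is about 0.874, which agrees with the reported F1-score of 0.87.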

Author Biographies

Saken Mambetov, Al-Farabi Kazakh National University

PhD Student

Department of Information Systems

Ihor Ilhe, Kharkiv National Automobile and Highway University

Associate Professor

Department of Automation and Computer-Aided Technologies

Vitalina Babenko, Kharkiv National Automobile and Highway University; Daugavpils University

Doctor of Economic Sciences, PhD, Professor, Head of Department

Department of Computer Systems

Department of Law, Management & Economics

Bakytzhan Kulambayev, Turan University

Candidate in Technical Sciences

Department of Radio Engineering, Electronics and Telecommunications

Olena Fridman, V. N. Karazin Kharkiv National University

Associate Professor

Department of Economics and Management

Serik Joldasbayev, International IT University

Master of Science

Department of Computer Engineering

Hanna Doroshenko, V. N. Karazin Kharkiv National University

Doctor of Economic Sciences, Professor, Head of Department

Department of Economics and Management

Oleksandr Gurko, Kharkiv National Automobile and Highway University

Doctor of Technical Sciences, Head of Department

Department of Automation and Computer-Aided Technologies

Yenlik Begimbayeva, AUPET named after Gumarbek Daukeyev

PhD, Head of Department

Department of Cybersecurity

Serhii Neronov, Kharkiv National Automobile and Highway University

Senior Lecturer

Department of Computer Systems
