Detection and classification of threats and vulnerabilities on hacker forums based on machine learning
DOI:
https://doi.org/10.15587/1729-4061.2024.306522Keywords:
cybersecurity, hacker forum, threats identification, data classification, machine learningAbstract
The object of this study is the process of detecting threats and vulnerabilities in hacker forums, which are a well-known source of potential dangers for Internet users. However, the problem of analyzing and classifying data from these forums is its complexity due to such features of the participants' language as specific slang, jargon, etc., which requires the use of modern tools of their processing. This paper explores the application of machine learning to devise an effective method for analyzing sentiment and trends in hacker forums to identify potential threats and vulnerabilities in cyberspace. All necessary stages of the process of detecting threats and vulnerabilities have been developed, ranging from data collection and preprocessing to the training of a model that is capable of processing “raw” unstructured data from hacker forums. The implementation of six popular machine learning algorithms, namely k Nearest Neighbors (kNN), Random Forest, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), and Decision Tree algorithms have been studied with a view to determining their efficiency of threat and vulnerability detection and classification. The experiments have been conducted on real data (150,000 messengers). It has been determined that the Random Forest algorithm coped with the task the best (accuracy=0.89, recall=0.84, precision=0.91, F1-score=0.87 and ROC-AUC=0.89). The proposed tool based on machine learning not only collects data that poses a potential threat but also processes and classifies it according to the specified keywords. This allows detecting threats and vulnerabilities at a high speed. The results of the study make it possible to identify potential trends in threats and vulnerabilities. This will contribute to the improvement of cybersecurity systems and ensure more reliable protection of information resources
References
- Mambetov, S., Begimbayeva, Y., Joldasbayev, S., Kazbekova, G. (2023). Internet threats and ways to protect against them: A brief review. 2023 13th International Conference on Cloud Computing, Data Science & Engineering (Confluence). https://doi.org/10.1109/confluence56041.2023.10048858
- Dhake, B., Shetye, C., Borhade, P., Gawas, D., Nerurkar, A. (2023). Stratification of Hacker Forums and Predicting Cyber Assaults for Proactive Cyber Threat Intelligence. 2023 2nd International Conference on Paradigm Shifts in Communications Embedded Systems, Machine Learning and Signal Processing (PCEMS). https://doi.org/10.1109/pcems58491.2023.10136033
- Leukfeldt, E. R., Kleemans, E. R., Stol, W. P. (2016). Cybercriminal Networks, Social Ties and Online Forums: Social Ties Versus Digital Ties within Phishing and Malware Networks. British Journal of Criminology, azw009. https://doi.org/10.1093/bjc/azw009
- Shakarian, J., Gunn, A. T., Shakarian, P. (2016). Exploring Malicious Hacker Forums. Cyber Deception, 259–282. https://doi.org/10.1007/978-3-319-32699-3_11
- Mikhaylov, A., Frank, R. (2016). Cards, Money and Two Hacking Forums: An Analysis of Online Money Laundering Schemes. 2016 European Intelligence and Security Informatics Conference (EISIC). https://doi.org/10.1109/eisic.2016.021
- Abbasi, A., Li, W., Benjamin, V., Hu, S., Chen, H. (2014). Descriptive Analytics: Examining Expert Hackers in Web Forums. 2014 IEEE Joint Intelligence and Security Informatics Conference. https://doi.org/10.1109/jisic.2014.18
- Zhang, X., Li, C. (2013). Survival analysis on hacker forums. SIGBPS workshop on business processes and service, 106–110.
- Tariq, E., Akour, I., Al-Shanableh, N., Alquqa, E. K., Alzboun, N., Al-Hawary, S. I. S., Alshurideh, M. T. (2024). How cybersecurity influences fraud prevention: An empirical study on Jordanian commercial banks. International Journal of Data and Network Science, 8 (1), 69–76. https://doi.org/10.5267/j.ijdns.2023.10.016
- Karuna, P., Purohit, H., Jajodia, S., Ganesan, R., Uzuner, O. (2021). Fake Document Generation for Cyber Deception by Manipulating Text Comprehensibility. IEEE Systems Journal, 15 (1), 835–845. https://doi.org/10.1109/jsyst.2020.2980177
- Rebafka, T. (2023). Model-based clustering of multiple networks with a hierarchical algorithm. Statistics and Computing, 34 (1). https://doi.org/10.1007/s11222-023-10329-w
- Fu, T., Abbasi, A., Chen, H. (2010). A focused crawler for Dark Web forums. Journal of the American Society for Information Science and Technology, 61 (6), 1213–1231. https://doi.org/10.1002/asi.21323
- McAlaney, J., Kimpton, E., Thackeray, H. (2019). Fifty shades of grey hat: A socio-psychological analysis of conversations on hacking forums. CyPsy24: Annual CyberPsychology, CyberTherapy & Social Networking Conference. Available at: https://eprints.bournemouth.ac.uk/32495
- McAlaney, J., Hambidge, S., Kimpton, E., Thackray, H. (2020). Knowledge is power: An analysis of discussions on hacking forums. 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). https://doi.org/10.1109/eurospw51379.2020.00070
- Lacey, D., Salmon, P. M. (2015). It’s Dark in There: Using Systems Analysis to Investigate Trust and Engagement in Dark Web Forums. Lecture Notes in Computer Science, 117–128. https://doi.org/10.1007/978-3-319-20373-7_12
- Benjamin, V., Valacich, J. S., Chen, H. (2019). DICE-E: A Framework for Conducting Darknet Identification, Collection, Evaluation with Ethics. MIS Quarterly, 43 (1), 1–22. https://doi.org/10.25300/misq/2019/13808
- Zhang, Y., Fan, Y., Ye, Y., Zhao, L., Wang, J., Xiong, Q., Shao, F. (2018). KADetector: Automatic Identification of Key Actors in Online Hack Forums Based on Structured Heterogeneous Information Network. 2018 IEEE International Conference on Big Knowledge (ICBK). https://doi.org/10.1109/icbk.2018.00028
- Park, A. J., Frank, R., Mikhaylov, A., Thomson, M. (2018). Hackers Hedging Bets: A Cross-Community Analysis of Three Online Hacking Forums. 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). https://doi.org/10.1109/asonam.2018.8508613
- Macdonald, M., Frank, R., Mei, J., Monk, B. (2015). Identifying Digital Threats in a Hacker Web Forum. Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. https://doi.org/10.1145/2808797.2808878
- Frank, R., Macdonald, M., Monk, B. (2016). Location, Location, Location: Mapping Potential Canadian Targets in Online Hacker Discussion Forums. 2016 European Intelligence and Security Informatics Conference (EISIC). https://doi.org/10.1109/eisic.2016.012
- Du, P.-Y., Zhang, N., Ebrahimi, M., Samtani, S., Lazarine, B., Arnold, N. et al. (2018). Identifying, Collecting, and Presenting Hacker Community Data: Forums, IRC, Carding Shops, and DNMs. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). https://doi.org/10.1109/isi.2018.8587327
- Joldasbayev, S., Sapakova, S., Zhaksylyk, A., Kulambayev, B., Armankyzy, R., Bolysbek, A. (2023). Development of an Intelligent Service Delivery System to Increase Efficiency of Software Defined Networks. International Journal of Advanced Computer Science and Applications, 14 (12). https://doi.org/10.14569/ijacsa.2023.0141267
- Balakayeva, G., Ezhilchelvan, P., Makashev, Y., Phillips, C., Darkenbayev, D., Nurlybayeva, K. (2023). Digitalization of enterprise with ensuring stability and reliability. Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska, 13 (1), 54–57. https://doi.org/10.35784/iapgos.3295
- Balakayeva, G., Zhanuzakov, M., Kalmenova, G. (2023). Development of a digital employee rating evaluation system (DERES) based on machine learning algorithms and 360-degree method. Journal of Intelligent Systems, 32 (1). https://doi.org/10.1515/jisys-2023-0008
- Balakayeva, G., Darkenbayev, D., Zhanuzakov, M. (2023). Development of a software system for predicting employee ratings. Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska, 13 (3), 121–124. https://doi.org/10.35784/iapgos.3723
- Joldasbayev, S., Balakayeva, G., Joldasbayev, O. (2020). Application of load balancing algorithms to improve the quality of service delivery using modifications of the least connections algorithm. Journal of Theoretical and Applied Information Technology, 98 (12), 2063–2077. Available at: http://www.jatit.org/volumes/Vol98No12/7Vol98No12.pdf
- Huang, C., Guo, Y., Guo, W., Li, Y. (2021). HackerRank: Identifying key hackers in underground forums. International Journal of Distributed Sensor Networks, 17 (5), 155014772110151. https://doi.org/10.1177/15501477211015145
- Samtani, S., Chinn, R., Chen, H. (2015). Exploring hacker assets in underground forums. 2015 IEEE International Conference on Intelligence and Security Informatics (ISI). https://doi.org/10.1109/isi.2015.7165935
- Benjamin, V., Li, W., Holt, T., Chen, H. (2015). Exploring threats and vulnerabilities in hacker web: Forums, IRC and carding shops. 2015 IEEE International Conference on Intelligence and Security Informatics (ISI). https://doi.org/10.1109/isi.2015.7165944
- Deliu, I., Leichter, C., Franke, K. (2018). Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process using Support Vector Machines and Latent Dirichlet Allocation. 2018 IEEE International Conference on Big Data (Big Data). https://doi.org/10.1109/bigdata.2018.8622469
- Anand, M., Sahay, K. B., Ahmed, M. A., Sultan, D., Chandan, R. R., Singh, B. (2023). Deep learning and natural language processing in computation for offensive language detection in online social networks by feature selection and ensemble classification techniques. Theoretical Computer Science, 943, 203–218. https://doi.org/10.1016/j.tcs.2022.06.020
- Sultan, D., Omarov, B., Kozhamkulova, Z., Kazbekova, G., Alimzhanova, L., Dautbayeva, A. et al. (2023). A Review of Machine Learning Techniques in Cyberbullying Detection. Computers, Materials & Continua, 74 (3), 5625–5640. https://doi.org/10.32604/cmc.2023.033682
- Biswas, B., Mukhopadhyay, A., Bhattacharjee, S., Kumar, A., Delen, D. (2022). A text-mining based cyber-risk assessment and mitigation framework for critical analysis of online hacker forums. Decision Support Systems, 152, 113651. https://doi.org/10.1016/j.dss.2021.113651
- Williams, R., Samtani, S., Patton, M., Chen, H. (2018). Incremental Hacker Forum Exploit Collection and Classification for Proactive Cyber Threat Intelligence: An Exploratory Study. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). https://doi.org/10.1109/isi.2018.8587336
- Benjamin, G. (2021). What we do with data: a performative critique of data “collection.” Internet Policy Review, 10 (4). https://doi.org/10.14763/2021.4.1588
- Jain, S., de Buitleir, A., Fallon, E. (2020). A Review of Unstructured Data Analysis and Parsing Methods. 2020 International Conference on Emerging Smart Computing and Informatics (ESCI). https://doi.org/10.1109/esci48226.2020.9167588
- Thivaharan., S., Srivatsun., G., Sarathambekai., S. (2020). A Survey on Python Libraries Used for Social Media Content Scraping. 2020 International Conference on Smart Electronics and Communication (ICOSEC). https://doi.org/10.1109/icosec49089.2020.9215357
- Sarkar, S., Almukaynizi, M., Shakarian, J., Shakarian, P. (2019). Predicting enterprise cyber incidents using social network analysis on dark web hacker forums. The Cyber Defense Review, 87–102. Available at: https://www.jstor.org/stable/26846122
- Ampel, B., Samtani, S., Zhu, H., Ullman, S., Chen, H. (2020). Labeling Hacker Exploits for Proactive Cyber Threat Intelligence: A Deep Transfer Learning Approach. 2020 IEEE International Conference on Intelligence and Security Informatics (ISI). https://doi.org/10.1109/isi49825.2020.9280548
- Ampel, B., Chen, H. (2021). Distilling Contextual Embeddings Into A Static Word Embedding For Improving Hacker Forum Analytics. 2021 IEEE International Conference on Intelligence and Security Informatics (ISI). https://doi.org/10.1109/isi53945.2021.9624848
- Samtani, S., Zhu, H., Chen, H. (2020). Proactively Identifying Emerging Hacker Threats from the Dark Web. ACM Transactions on Privacy and Security, 23 (4), 1–33. https://doi.org/10.1145/3409289
- Sen, P. C., Hajra, M., Ghosh, M. (2019). Supervised Classification Algorithms in Machine Learning: A Survey and Review. Emerging Technology in Modelling and Graphics, 99–111. https://doi.org/10.1007/978-981-13-7403-6_11
- Sokolova, M., Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45 (4), 427–437. https://doi.org/10.1016/j.ipm.2009.03.002
Downloads
Published
How to Cite
Issue
Section
License
Copyright (c) 2024 Saken Mambetov, Yenlik Begimbayeva, Oleksandr Gurko, Наnna Doroshenko, Serik Joldasbayev, Olena Fridman, Bakytzhan Kulambayev, Vitalina Babenko, Ihor Ilhe, Serhii Neronov
This work is licensed under a Creative Commons Attribution 4.0 International License.
The consolidation and conditions for the transfer of copyright (identification of authorship) is carried out in the License Agreement. In particular, the authors reserve the right to the authorship of their manuscript and transfer the first publication of this work to the journal under the terms of the Creative Commons CC BY license. At the same time, they have the right to conclude on their own additional agreements concerning the non-exclusive distribution of the work in the form in which it was published by this journal, but provided that the link to the first publication of the article in this journal is preserved.
A license agreement is a document in which the author warrants that he/she owns all copyright for the work (manuscript, article, etc.).
The authors, signing the License Agreement with TECHNOLOGY CENTER PC, have all rights to the further use of their work, provided that they link to our edition in which the work was published.
According to the terms of the License Agreement, the Publisher TECHNOLOGY CENTER PC does not take away your copyrights and receives permission from the authors to use and dissemination of the publication through the world's scientific resources (own electronic resources, scientometric databases, repositories, libraries, etc.).
In the absence of a signed License Agreement or in the absence of this agreement of identifiers allowing to identify the identity of the author, the editors have no right to work with the manuscript.
It is important to remember that there is another type of agreement between authors and publishers – when copyright is transferred from the authors to the publisher. In this case, the authors lose ownership of their work and may not use it in any way.