MALWARE DETECTION MODEL BASED ON MACHINE LEARNING
DOI:
https://doi.org/10.24025/2306-4412.3.2023.286374
Keywords:
intrusion detection, PE format, feature extraction, disassembled instructions, support vector machine
Abstract
Every year, malware authors create increasingly sophisticated malware that can harm computers. Traditional methods based on searching for program signatures are no longer effective for malware detection. They are being replaced by automated file analysis, a more promising approach to detecting suspicious files. Machine learning methods are increasingly used to detect malicious programs. However, such solutions may require substantial computing resources. This raises the task of creating a machine learning model that is optimal in terms of training speed and malware detection accuracy. In addition, a single method of data representation is usually not sufficient to detect the malicious features of files. Therefore, this paper describes two different methods: one based on the binary information of the file, the other based on the disassembled code of executable files. The purpose of this work is to improve the efficiency of malware detection by optimising feature extraction methods and applying machine learning. The main tasks of the study are extracting features from .exe files, creating several machine learning models, and comparing them to determine the most effective one. The dataset used in this study was collected from various online sources and consists of 12824 executable files in .exe format, of which 11844 are malicious and 980 are benign. This paper presents recommended methods of feature extraction and input data generation for machine learning models based on the support vector machine algorithm. These methods make it possible to find the best way to process the features describing a malicious file. Six machine learning models were created, each of which performed well in terms of F-score, precision, and recall. The model based on the binary data representation achieved the highest results on all metrics.
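The class counts reported in the abstract are heavily skewed towards malicious files, which is one plausible reason to report precision, recall, and F-score rather than accuracy alone. The sketch below is illustrative and not taken from the paper: the helper `precision_recall_f1` and the "label everything as malicious" baseline are assumptions introduced here, and only the three file counts come from the abstract.

```python
from __future__ import annotations


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one class from its confusion counts.

    Hypothetical helper for illustration; undefined ratios are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Class counts reported in the abstract.
MALICIOUS = 11844
BENIGN = 980
TOTAL = MALICIOUS + BENIGN

# Assumed baseline (not a model from the paper): label every file as malicious.
# Every malicious file is a true positive, every benign file a false positive.
baseline_accuracy = MALICIOUS / TOTAL
malicious_metrics = precision_recall_f1(tp=MALICIOUS, fp=BENIGN, fn=0)

# The same baseline seen from the benign class: no benign file is ever found.
benign_metrics = precision_recall_f1(tp=0, fp=0, fn=BENIGN)

if __name__ == "__main__":
    print(f"files in dataset:      {TOTAL}")
    print(f"baseline accuracy:     {baseline_accuracy:.4f}")
    print("malicious (P, R, F1):  " + ", ".join(f"{m:.4f}" for m in malicious_metrics))
    print("benign (P, R, F1):     " + ", ".join(f"{m:.4f}" for m in benign_metrics))
```

Under these assumptions the trivial baseline already exceeds 92% accuracy while never recognising a benign file, so per-class precision and recall give a clearer picture of a detector on this dataset than accuracy would.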
License
Copyright (c) 2023 Alan Nafiiev, Dmytro Lande

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The authors who publish in this journal agree to the following terms:
The authors reserve the right to authorship of their work and give the journal the right to first publish this work under the terms of the Creative Commons Attribution License CC BY-NC, which allows other persons to freely distribute published work with a mandatory reference to authors of the original work and the first publication of the work in this journal.
Authors have the right to conclude separate additional agreements for the non-exclusive distribution of the paper in the form in which it was published by this journal (for example, posting work in electronic repository or publishing as part of a monograph), provided that the link to the first publication in this journal is maintained.
The journal policy allows and encourages authors to post on the Internet (for example, in repositories of institutions or on personal websites) the manuscript of work, both before the submission of this manuscript to the editorial staff, and during its editorial work, as it contributes to the emergence of productive scientific discussion and positively affects the efficiency and dynamics of published work citation (see The Effect of Open Access).