Architecture of an automated program complex based on a multiple kernel SVM classifier for analyzing malicious executable files
DOI:
https://doi.org/10.30837/2522-9818.2024.29.039
Keywords:
cybersecurity; malware detection; automated program complex; static analysis; dynamic analysis; Drakvuf; IDA Pro; multiple kernel.
Abstract
Subject matter. This article presents the architecture and development of an automated program complex that identifies and analyzes malicious executable files using a classifier based on a multiple kernel support vector machine (SVM).
Goal. The aim of the work is to create an automated system that improves the accuracy and efficiency of malware detection by combining static and dynamic analysis in a single framework capable of processing large volumes of data in acceptable time.
Tasks. To achieve this goal, the following tasks were carried out: developing a program complex that automates the collection of static and dynamic data from executable files using IDA Pro, IDAPython, and Drakvuf; integrating a multiple kernel SVM classifier to analyze the collected heterogeneous data; validating the system's effectiveness on a dataset of 1,389 executable samples; and demonstrating the system's scalability and practical applicability in real-world conditions.
Methods. The work uses a hybrid approach that combines static analysis (extracting byte code, disassembled instructions, and control flow graphs using IDA Pro and IDAPython) with dynamic analysis (monitoring runtime behavior using Drakvuf). The multiple kernel SVM classifier assigns a separate kernel to each data representation and combines them, so that both linear and nonlinear relationships are taken into account during classification.
Results. The system achieves a high level of accuracy and completeness, as evidenced by an F-score of 0.93 together with the reported ROC AUC and PR AUC values. The automated program complex reduces the analysis time of a single file from an average of 11 minutes to approximately 5 minutes, roughly doubling throughput (11/5 ≈ 2.2) compared to previous methods. This reduction in processing time matters for deployment in environments where rapid and accurate malware detection is required. The system's scalability also allows large data volumes to be processed efficiently, making it suitable for real-world applications.
Conclusions. The automated program complex developed in this study improves both the accuracy and the efficiency of malware detection. By integrating multiple kernel SVM classification with static and dynamic analysis, the system shows potential for real-time malware detection and analysis. Its scalability and practical applicability indicate that it could become an important tool against modern cyber threats, giving organizations an effective means of strengthening their cybersecurity.
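The multiple kernel idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature views are synthetic stand-ins for static and dynamic features, the kernel weights are fixed by hand rather than learned, and the names `combined_kernel`, `static_feats`, and `dynamic_feats` are hypothetical. It only shows how one kernel per data representation can be summed and passed to an SVM through scikit-learn's `SVC(kernel="precomputed")`.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC


def combined_kernel(static_a, dynamic_a, static_b, dynamic_b,
                    weights=(0.5, 0.5), gamma=0.5):
    """Fixed-weight sum of one kernel per feature view.

    A non-negative weighted sum of valid kernels is itself a valid kernel,
    so the result can be handed to an SVM as a precomputed Gram matrix.
    """
    k_static = linear_kernel(static_a, static_b)                # linear view
    k_dynamic = rbf_kernel(dynamic_a, dynamic_b, gamma=gamma)   # nonlinear view
    return weights[0] * k_static + weights[1] * k_dynamic


# Synthetic data (assumption): label 1 = malicious, 0 = benign.
rng = np.random.default_rng(0)
n_samples = 200
labels = rng.integers(0, 2, size=n_samples)
static_feats = rng.normal(size=(n_samples, 5)) + 1.5 * labels[:, None]
dynamic_feats = rng.normal(size=(n_samples, 3)) + 1.5 * labels[:, None]

# Simple hold-out split: first 150 samples for training, last 50 for testing.
train, test = slice(0, 150), slice(150, n_samples)

# Gram matrix between training samples: shape (n_train, n_train).
k_train = combined_kernel(static_feats[train], dynamic_feats[train],
                          static_feats[train], dynamic_feats[train])
# Kernel between test and training samples: shape (n_test, n_train).
k_test = combined_kernel(static_feats[test], dynamic_feats[test],
                         static_feats[train], dynamic_feats[train])

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(k_train, labels[train])
predictions = clf.predict(k_test)
score = f1_score(labels[test], predictions)
print(f"F-score on synthetic hold-out data: {score:.2f}")
```

The score printed here describes only the synthetic data and says nothing about the 0.93 reported in the abstract. In a full multiple kernel learning setup the weights would typically be optimized together with the classifier rather than fixed in advance.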
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.