Comparative analysis of approaches to source code vulnerability detection based on deep learning methods

Yevhenii Kubiuk; Gennadiy Kyselov

doi:10.15587/2706-5448.2021.233534

Authors

Yevhenii Kubiuk, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», Ukraine, https://orcid.org/0000-0002-7086-0976
Gennadiy Kyselov, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», Ukraine, https://orcid.org/0000-0003-2682-3593

Keywords:

AST-based approaches, program dependence graph-based approaches, code analysis

Abstract

The object of research in this work is deep learning methods for source code vulnerability detection. One of the most problematic aspects is that the code analysis process typically relies on only one approach: either the approach based on the abstract syntax tree (AST) or the approach based on the program dependence graph (PDG).

This paper presents a comparative analysis of two approaches to source code vulnerability detection: approaches based on the AST and approaches based on the PDG.

Various neural network topologies used in the AST-based and PDG-based approaches were analyzed. The comparison identified the advantages and disadvantages of each approach, and the results are summarized in the corresponding comparison tables. The analysis determined that BLSTM (Bidirectional Long Short-Term Memory) and BGRU (Bidirectional Gated Recurrent Unit) networks give the best results for source code vulnerability detection. The analysis also showed that the most effective approach for source code vulnerability detection systems is a method that uses an intermediate representation of the code, which makes it possible to obtain a language-independent tool.

This work also proposes the authors' own algorithm for a source code analysis system that can perform the following operations: predict a source code vulnerability, classify it, and generate a corresponding patch for the vulnerability found. A detailed analysis of the proposed system's unresolved issues is provided; these are planned for investigation in future research. The proposed system could help speed up the software development process and reduce the number of vulnerabilities in software code. Software developers and cybersecurity specialists are potential stakeholders of the proposed system.

Author Biographies

Yevhenii Kubiuk, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute»

Department of System Design

Gennadiy Kyselov, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute»

PhD

Department of System Design

