Analysis of machine learning methods in the task of searching duplicates in the software code

Tetiana Kaliuzhna; Yevhenii Kubiuk

doi:10.15587/2706-5448.2022.263235

Authors

Tetiana Kaliuzhna National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», Ukraine https://orcid.org/0000-0002-0937-8988
Yevhenii Kubiuk National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», Ukraine https://orcid.org/0000-0002-7086-0976

DOI:

https://doi.org/10.15587/2706-5448.2022.263235

Keywords:

clone detection, machine learning methods, decision tree, Support Vector Machine, TECCD, dataset

Abstract

The object of the study is code written in the Python programming language, which is analyzed with machine learning methods to identify clones.

This work studies machine learning methods for finding clones in program code and implements a decision tree model for that task. It also analyzes existing machine learning approaches to detecting duplicates in program code. The comparison identified the advantages and disadvantages of each algorithm, and the results are summarized in comparison tables. The analysis showed that the decision tree method gives the best result in the clone detection task and is the preferred choice in terms of both accuracy and implementation.
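To make the idea of tree-based clone classification concrete, the following is a minimal sketch and not the authors' model. It assumes a single hand-picked feature (the overlap of AST node-type histograms of two Python snippets) and fits a decision tree of depth one (a decision stump) by minimizing Gini impurity. The abstract does not describe the feature set or the dataset, so all function names, snippets, and labels here are hypothetical and chosen only for illustration.

```python
import ast
from collections import Counter


def node_histogram(source):
    """Count AST node types in a Python snippet; identifier names are ignored."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def pair_similarity(a, b):
    """Overlap of two node-type histograms, in [0, 1] (1.0 = same histogram)."""
    ha, hb = node_histogram(a), node_histogram(b)
    shared = sum((ha & hb).values())  # per-type minimum counts
    total = sum((ha | hb).values())   # per-type maximum counts
    return shared / total if total else 1.0


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(xs, ys):
    """Depth-one decision tree: find threshold t for the rule x >= t -> clone."""
    best_t, best_score = None, float("inf")
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate split halfway between neighbours
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# Hypothetical toy snippets (assumption: not taken from the paper's dataset).
ORIGINAL = "def add(a, b):\n    return a + b\n"
RENAMED = "def plus(x, y):\n    return x + y\n"
LOOP = "for i in range(10):\n    print(i * 2)\n"
LOOP_RENAMED = "for k in range(10):\n    print(k * 2)\n"

# Labelled pairs: 1 = clone, 0 = not a clone.
PAIRS = [
    (ORIGINAL, RENAMED, 1),
    (LOOP, LOOP_RENAMED, 1),
    (ORIGINAL, LOOP, 0),
    (RENAMED, LOOP_RENAMED, 0),
]

features = [pair_similarity(a, b) for a, b, _ in PAIRS]
labels = [label for _, _, label in PAIRS]
THRESHOLD = fit_stump(features, labels)


def is_clone(a, b):
    """Apply the fitted stump to a new pair of snippets."""
    return pair_similarity(a, b) >= THRESHOLD


print(is_clone(ORIGINAL, RENAMED), is_clone(ORIGINAL, LOOP))
```

A practical model would use many features and a deeper tree, for example through a library such as scikit-learn, which appears in the reference list. The stump is used here only because it keeps the split criterion visible in a few lines.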

The result of the work is a model that classifies cloned and non-cloned code with an accuracy of more than 99 % on an automatically generated dataset, in a minimal amount of time. Several questions remain open for future research, and they are listed in this work. The proposed model can be developed further in the following directions:

– recognition of clones rewritten from one programming language to another;

– detection of vulnerabilities in the code;

– improvement of model performance by creating more universal datasets.

The prospect of this work lies in training a decision tree model for accurate and fast detection of code clones, which could be widely used for plagiarism detection in both educational institutions and IT companies.

Author Biographies

Tetiana Kaliuzhna, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute»

Department of System Design

Yevhenii Kubiuk, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute»

Department of System Design
