Development of the combined method of identification of near duplicates in electronic scientific works

Petro Lizunov; Andrii Biloshchytskyi; Alexander Kuchansky; Yurii Andrashko; Svitlana Biloshchytska; Oleg Serbin

doi:10.15587/1729-4061.2021.238318

Authors

Petro Lizunov Kyiv National University of Construction and Architecture, Ukraine https://orcid.org/0000-0003-2924-3025
Andrii Biloshchytskyi Astana IT University; Taras Shevchenko National University of Kyiv, Kazakhstan https://orcid.org/0000-0001-9548-1959
Alexander Kuchansky Taras Shevchenko National University of Kyiv, Ukraine https://orcid.org/0000-0003-1277-8031
Yurii Andrashko Uzhhorod National University, Ukraine https://orcid.org/0000-0003-2306-8377
Svitlana Biloshchytska Taras Shevchenko National University of Kyiv, Ukraine https://orcid.org/0000-0002-0856-5474
Oleg Serbin Taras Shevchenko National University of Kyiv, Ukraine https://orcid.org/0000-0003-3119-690X

DOI:

https://doi.org/10.15587/1729-4061.2021.238318

Keywords:

near-duplicate, electronic scientific paper, antiplagiarism system, locality-sensitive hashing

Abstract

Methods are described for identifying near-duplicates in electronic scientific papers within content of a single type, for example text data, mathematical formulas, or numerical data. For text data, the method of locality-sensitive hashing is formalized, with the Hamming distance computed between the elements of the indices of electronic scientific papers. If the Hamming distance exceeds a fixed numerical threshold, the scientific paper is considered to contain a near-duplicate. For numerical data, sub-sequences are formed for each scientific paper, and the proximity between papers is determined as the Euclidean distance between the vectors made up of the numbers in these sub-sequences. To compare mathematical formulas, a method for comparing samples of formulas is used, and the names of variables are compared. To identify near-duplicates in graphic information, two approaches are distinguished: finding key points in the image and applying locality-sensitive hashing to individual pixels of the image. Since scientific papers often include objects such as schemes and diagrams, their captions are examined separately using the methods for comparing text information. A combined method for identifying near-duplicates in electronic scientific papers is proposed, which brings together the methods for identifying near-duplicates in the various types of data. To implement the combined method, an information-analytical system was devised that processes scientific materials according to their content type. This makes it possible to identify near-duplicates effectively and to detect as wide a range as possible of potential abuse and plagiarism in electronic scientific papers: scientific articles, dissertations, monographs, conference materials, etc.
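
The two distance measures named in the abstract can be illustrated with a short sketch. The abstract does not say which hash family, shingle size, or fingerprint length the authors use, so everything specific here is an assumption made for illustration: SimHash is used as one common locality-sensitive hash for text, and the names `shingles`, `simhash`, `hamming`, and `euclidean`, the 3-word shingles, and the 64-bit fingerprints are all hypothetical choices. This is not the authors' implementation.

```python
import hashlib
import math
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an assumed value)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, bits=64, k=3):
    """Compute a SimHash fingerprint (assumed stand-in for the paper's LSH)."""
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.sha256(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

The sketch only computes the distances and leaves the threshold comparison to the caller. In the usual SimHash convention a smaller Hamming distance means more similar texts, so the direction and value of the threshold should be taken from the full paper rather than from this example. For numerical data, `euclidean` would be applied to vectors built from the numeric sub-sequences of two papers; how those sub-sequences are formed is not specified in the abstract.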

Author Biographies

Petro Lizunov, Kyiv National University of Construction and Architecture

Doctor of Technical Sciences, Professor, Head of Department

Department of Fundamentals of Informatics

Andrii Biloshchytskyi, Astana IT University; Taras Shevchenko National University of Kyiv

Doctor of Technical Sciences, Professor

Department of Information Systems and Technologies

Alexander Kuchansky, Taras Shevchenko National University of Kyiv

Doctor of Technical Sciences, Associate Professor

Department of Information Systems and Technologies

Yurii Andrashko, Uzhhorod National University

PhD, Associate Professor

Department of System Analysis and Optimization Theory

Svitlana Biloshchytska, Taras Shevchenko National University of Kyiv

Doctor of Technical Sciences, Associate Professor

Department of Intelligent and Information Systems

Oleg Serbin, Taras Shevchenko National University of Kyiv

Doctor of Science in Social Communications, Senior Researcher, Director of Library

Maksymovych Scientific Library

