Development of the combined method of identification of near duplicates in electronic scientific works

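Authors

Petro Lizunov, Andrii Biloshchytskyi, Alexander Kuchansky, Yurii Andrashko, Svitlana Biloshchytska, Oleg Serbin
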
DOI:

https://doi.org/10.15587/1729-4061.2021.238318

Keywords:

near-duplicate, electronic scientific paper, antiplagiarism system, locality-sensitive hashing

Abstract

Methods are described for identifying near-duplicates in electronic scientific papers that contain content of a single type, for example text data, mathematical formulas, or numerical data. For text data, the method of locality-sensitive hashing is formalized, with the Hamming distance computed between the elements of the indices of electronic scientific papers. If the Hamming distance exceeds a fixed numerical threshold, the scientific paper is considered to contain a near-duplicate. For numerical data, sub-sequences are formed for each scientific work, and the proximity between papers is determined as the Euclidean distance between the vectors composed of the numbers in these sub-sequences. To compare mathematical formulas, a method for comparing samples of formulas is used, and the names of variables are compared. To identify near-duplicates in graphic information, two approaches are distinguished: finding key points in the image and applying locality-sensitive hashing to individual pixels of the image. Since scientific papers often include objects such as schemes and diagrams, their captions are examined separately using the methods for comparing text information. A combined method for identifying near-duplicates in electronic scientific papers is proposed, which brings together the methods for identifying near-duplicates in the various types of data. To implement the combined method, an information-analytical system was devised that processes scientific materials according to content type. This makes it possible to identify near-duplicates with high quality and to detect possible abuse and plagiarism as broadly as possible in electronic scientific papers: scientific articles, dissertations, monographs, conference materials, etc.
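
The two distance measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name a specific hashing scheme, so SimHash over word 3-grams is assumed here as one common locality-sensitive fingerprint, and the function names, the 64-bit width, and the shingle size are all hypothetical choices. How a distance is compared against a threshold is left to the caller, since the threshold value and its interpretation depend on the paper's formalization.

```python
import hashlib
import math


def simhash(text, bits=64):
    """Illustrative SimHash fingerprint over word 3-grams (assumed scheme)."""
    words = text.lower().split()
    # Overlapping word 3-grams; very short texts yield a single shingle.
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # Bits with a positive vote total are set in the fingerprint.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


if __name__ == "__main__":
    original = "the method of hashing was formalized for text data in papers"
    edited = "the method of hashing was formalized for text data in articles"
    print(hamming(simhash(original), simhash(edited)))
    print(euclidean([1.0, 2.5, 4.0], [1.0, 2.0, 4.5]))
```

Under this scheme, texts that share most of their shingles tend to receive fingerprints that differ in few bits, which is why the Hamming distance between fingerprints can stand in for a direct text comparison. The numeric vectors passed to `euclidean` correspond to the number sub-sequences extracted from each paper.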

Author Biographies

Petro Lizunov, Kyiv National University of Construction and Architecture

Doctor of Technical Sciences, Professor, Head of Department

Department of Fundamentals of Informatics

Andrii Biloshchytskyi, Astana IT University; Taras Shevchenko National University of Kyiv

Doctor of Technical Sciences, Professor

Department of Information Systems and Technologies

Alexander Kuchansky, Taras Shevchenko National University of Kyiv

Doctor of Technical Sciences, Associate Professor

Department of Information Systems and Technologies

Yurii Andrashko, Uzhhorod National University

PhD, Associate Professor

Department of System Analysis and Optimization Theory

Svitlana Biloshchytska, Taras Shevchenko National University of Kyiv

Doctor of Technical Sciences, Associate Professor

Department of Intelligent and Information Systems

Oleg Serbin, Taras Shevchenko National University of Kyiv

Doctor of Science in Social Communications, Senior Researcher, Director of Library

Maksymovych Scientific Library

Published

2021-08-25

How to Cite

Lizunov, P., Biloshchytskyi, A., Kuchansky, A., Andrashko, Y., Biloshchytska, S., & Serbin, O. (2021). Development of the combined method of identification of near duplicates in electronic scientific works. Eastern-European Journal of Enterprise Technologies, 4(4(112)), 57–63. https://doi.org/10.15587/1729-4061.2021.238318

Section

Mathematics and Cybernetics - applied aspects