Development of the combined method of identification of near duplicates in electronic scientific works
DOI:
https://doi.org/10.15587/1729-4061.2021.238318
Keywords:
near-duplicate, electronic scientific paper, antiplagiarism system, locality-sensitive hashing
Abstract
Methods for identifying near-duplicates in electronic scientific papers that contain content of a single type (for example, text data, mathematical formulas, or numerical data) are described. For text data, the method of locality-sensitive hashing is formalized, with the Hamming distance computed between the elements of the indices of electronic scientific papers. If the Hamming distance exceeds a fixed numerical threshold, the scientific paper is considered to contain a near-duplicate. For numerical data, sub-sequences are formed for each scientific paper, and the proximity between papers is determined as the Euclidean distance between the vectors composed of the numbers in these sub-sequences. To compare mathematical formulas, a method for comparing samples of formulas is used, and the names of variables are compared. For graphic information, two directions for identifying near-duplicates are distinguished: finding key points in the image and applying locality-sensitive hashing to individual pixels of the image. Since scientific papers often include objects such as schemes and diagrams, their captions are examined separately using the methods for comparing text information. A combined method for identifying near-duplicates in electronic scientific papers is proposed; it brings together the methods for identifying near-duplicates in the various types of data. To implement the combined method, an information-analytical system that processes scientific materials according to their content type was devised. This makes it possible to identify near-duplicates effectively and to detect possible abuses and plagiarism as broadly as possible in electronic scientific papers: scientific articles, dissertations, monographs, conference materials, etc.
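The text-comparison step can be illustrated with a minimal sketch. The abstract does not specify which hash family is used, so the example below assumes a SimHash-style fingerprint built over word shingles; the function names `shingles`, `simhash`, and `hamming_distance`, the 64-bit fingerprint size, and the shingle length of 3 are all illustrative assumptions rather than the authors' implementation. The sketch only produces the Hamming distance between two fingerprints, which is the quantity that the method compares against a fixed numerical threshold.

```python
import hashlib
import re


def shingles(text, k=3):
    """Split text into overlapping k-word shingles (k is an assumed parameter)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return []
    if len(words) < k:
        return [" ".join(words)]
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text, bits=64, k=3):
    """Build a SimHash-style fingerprint (an assumed LSH variant, not the paper's)."""
    counts = [0] * bits
    for s in shingles(text, k):
        # Take the first bits/8 bytes of an MD5 digest as the shingle hash.
        digest = hashlib.md5(s.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the votes for that bit are positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "The combined method identifies near duplicates in scientific papers"
    doc_b = "The combined method identifies near duplicates in scientific articles"
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

In this sketch, identical texts always yield a distance of 0, and the distance can be at most the fingerprint length (64 bits here). How the threshold is chosen and applied is left to the method described in the paper.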
License
Copyright (c) 2021 Petro Lizunov, Andrii Biloshchytskyi, Alexander Kuchansky, Yurii Andrashko, Svitlana Biloshchytska, Oleg Serbin
This work is licensed under a Creative Commons Attribution 4.0 International License.