Detection of near duplicates in tables based on the locality-sensitive hashing method and the nearest neighbor method

Petro Lizunov; Andrii Biloshchytskyi; Alexander Kuchansky; Svitlana Biloshchytska; Larysa Chala

doi:10.15587/1729-4061.2016.86243

Authors

Petro Lizunov, Kyiv National University of Construction and Architecture, Povitroflotskyi ave., 31, Kyiv, Ukraine, 03037, https://orcid.org/0000-0003-2924-3025
Andrii Biloshchytskyi, Taras Shevchenko National University of Kyiv, Volodymyrska str., 60, Kyiv, Ukraine, 01033, https://orcid.org/0000-0001-9548-1959
Alexander Kuchansky, Kyiv National University of Construction and Architecture, Povitroflotskyi ave., 31, Kyiv, Ukraine, 03037, https://orcid.org/0000-0003-1277-8031
Svitlana Biloshchytska, Kyiv National University of Construction and Architecture, Povitroflotskyi ave., 31, Kyiv, Ukraine, 03037, https://orcid.org/0000-0002-0856-5474
Larysa Chala, Kharkiv National University of Radio Electronics, Nauky ave., 14, Kharkiv, Ukraine, 61166, https://orcid.org/0000-0002-9890-4790

DOI:

https://doi.org/10.15587/1729-4061.2016.86243

Keywords:

near duplicate, similarity, locality-sensitive hashing method, nearest neighbor method

Abstract

A hybrid method for the detection of near duplicates in tables is proposed.

The method identifies similarities in the text data and in the numeric data of tables separately and then generalizes the results. For text data, sequences of words are formed in canonized form, and bit sequences are constructed from them by locality-sensitive hashing. Similarity between text data is then determined by the Hamming distance at an assigned threshold value. Similarities between numeric data in tables are identified by the nearest neighbor method with assigned distance metrics. The method makes it possible to identify near duplicates in the data of an input table relative to a set of tables selected from scientific publications, dissertations, and theses. The method is designed for finding near duplicates in tables that contain only text and numeric data. If the examined tables also contain pictures or formulas, these objects are examined separately using specific methods.
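The two components described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name a specific hash family, so SimHash is used here as one common locality-sensitive hashing scheme that yields bit sequences comparable by Hamming distance. The function names, the 2-word shingles, the 64-bit fingerprint, the threshold of 3, and the Euclidean metric are all assumptions chosen for illustration.

```python
import hashlib
import math
import re

BITS = 64  # fingerprint length (assumed, not taken from the paper)


def canonize(text):
    """Lowercase the text and keep word tokens only (a simple stand-in for canonization)."""
    return re.findall(r"\w+", text.lower())


def shingles(words, k=2):
    """Overlapping k-word sequences; shorter inputs yield a single shingle."""
    if not words:
        return []
    if len(words) < k:
        return [" ".join(words)]
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text, k=2):
    """Build a 64-bit SimHash fingerprint from the word shingles of a text cell."""
    counts = [0] * BITS
    for sh in shingles(canonize(text), k):
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def text_near_duplicate(a, b, threshold=3):
    """Texts count as near duplicates if their fingerprints differ in at most `threshold` bits."""
    return hamming(simhash(a), simhash(b)) <= threshold


def nearest_neighbor(query, rows):
    """Return (index, distance) of the numeric row closest to `query` (Euclidean metric assumed)."""
    best = min(range(len(rows)), key=lambda i: math.dist(query, rows[i]))
    return best, math.dist(query, rows[best])
```

In a hybrid setting, the text cells of an input table would be compared through `text_near_duplicate` and the numeric rows through `nearest_neighbor`, with the two results then combined by whatever generalization rule the method assigns; that combination step is not specified in the abstract and is left out here.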

The proposed method may be implemented in systems intended for the intelligent analysis of information represented by text and tables, in order to identify similarities and detect near duplicates, in particular in anti-plagiarism systems.

Author Biographies

Petro Lizunov, Kyiv National University of Construction and Architecture Povitroflotskyi ave., 31, Kyiv, Ukraine, 03037

Doctor of Technical Sciences, Professor

Andrii Biloshchytskyi, Taras Shevchenko National University of Kyiv Volodymyrska str., 60, Kyiv, Ukraine, 01033

Doctor of Technical Sciences, Professor

Alexander Kuchansky, Kyiv National University of Construction and Architecture Povitroflotskyi ave., 31, Kyiv, Ukraine, 03037

PhD, Associate Professor

Department of Information Technologies

Svitlana Biloshchytska, Kyiv National University of Construction and Architecture Povitroflotskyi ave., 31, Kyiv, Ukraine, 03037

PhD, Associate Professor

Department of Information Technology Designing and Applied Mathematics

Larysa Chala, Kharkiv National University of Radio Electronics Nauky ave., 14, Kharkiv, Ukraine, 61166

PhD, Associate Professor

Department of Artificial Intelligence
