Detection of near duplicates in tables based on the locality-sensitive hashing method and the nearest neighbor method
DOI:
https://doi.org/10.15587/1729-4061.2016.86243
Keywords:
near duplicate, similarity, locality-sensitive hashing method, nearest neighbor method
Abstract
A hybrid method for the detection of near duplicates in tables is proposed.
This method identifies similarities between the text data and the numeric data of tables separately, and then generalizes the results obtained. For the text data, word sequences are formed in canonized form, and bit sequences are constructed from them by the locality-sensitive hashing method. Similarity between data is then determined by the Hamming distance against an assigned threshold value. Similarities between numeric data in tables are identified by the nearest neighbor method with assigned distance metrics. The method makes it possible to identify near duplicates present in the data of an input table relative to a set of tables selected from scientific publications, dissertations, and theses. It should be noted that the method is designed for finding near duplicates in tables that contain only text and numeric data. If the examined tables also contain pictures or formulas, these objects are examined separately using specialized methods.
The proposed method might be implemented in systems intended for intelligent analysis of information represented by text and tables, to identify similarities and detect near duplicates, in particular in anti-plagiarism systems.
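The two comparison steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: SimHash over word shingles is assumed here as one common locality-sensitive hashing scheme, Euclidean distance is assumed as the metric for numeric data, and the names `canonize`, `simhash`, `hamming`, and `nearest_neighbor` are hypothetical. How the text and numeric results are generalized into a single decision is not specified in the abstract, so that step is left out.

```python
import hashlib
import math
import re


def canonize(text):
    """Lowercase the text and keep only word tokens.

    Assumption: a simple stand-in for the canonization step.
    """
    return re.findall(r"\w+", text.lower())


def simhash(words, bits=64, shingle=2):
    """Build a bit sequence from word shingles with SimHash.

    Assumption: SimHash is used as the LSH scheme; bits must be <= 64
    because only the first 8 bytes of the MD5 digest are used.
    """
    n = max(1, len(words) - shingle + 1)
    shingles = [" ".join(words[i:i + shingle]) for i in range(n)]
    counts = [0] * bits
    for s in shingles:
        digest = hashlib.md5(s.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set when the majority of shingle hashes set it.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def nearest_neighbor(query, candidates):
    """Return (index, distance) of the candidate row closest to query.

    Assumption: Euclidean distance; rows must have equal length.
    """
    distances = [math.dist(query, row) for row in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances[best]


if __name__ == "__main__":
    a = simhash(canonize("Mean value of the sample, in kg"))
    b = simhash(canonize("Mean value of the samples in kg"))
    # Text cells count as near duplicates when the Hamming distance
    # does not exceed an assigned threshold (the value is an assumption).
    print("Hamming distance:", hamming(a, b))
    print(nearest_neighbor([1.0, 2.0], [[10.0, 10.0], [1.0, 2.1], [5.0, 5.0]]))
```

In such a sketch, a text cell would be flagged when its Hamming distance to a stored fingerprint falls at or below the assigned threshold, and a numeric row when the distance to its nearest neighbor falls below a separately assigned bound. Both thresholds would need tuning on real table data.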
License
Copyright (c) 2016 Petro Lizunov, Andrii Biloshchytskyi, Alexander Kuchansky, Svitlana Biloshchytska, Larysa Chala
This work is licensed under a Creative Commons Attribution 4.0 International License.