Development of fuzzy search method for creating an efficient information search system in text data
DOI:
https://doi.org/10.15587/2706-5448.2024.298425
Keywords:
fuzzy search, Damerau-Levenshtein distance, editing distance, character similarity table, text data processing
Abstract
The object of research is the process of effectively searching for information in a set of textual data. The subject of the research is a fuzzy search method that effectively solves the problem of searching for information in such a set. The paper considers the development of a fuzzy search method that consists of 9 consecutive steps and is needed to quickly find matches in a large set of text data. Based on this method, it is proposed to create a fuzzy search system that solves the problem of finding the most relevant documents in a document collection.
The proposed fuzzy search method combines the advantages of algorithms based on deterministic finite automata and algorithms based on dynamic programming for calculating the Damerau-Levenshtein distance. This combination makes it possible to implement the symbol similarity table in an optimal way. As part of the work, an approach for creating a symbol similarity table was proposed, and an example of such a table was created for symbols of the English alphabet. The table makes it possible to find the degree of similarity between two symbols in constant time and to convert the current symbol into its basic counterpart. For document filtering, a metric was developed to evaluate how well text data corresponds to a search phrase; it simultaneously takes into account the number of found and not found characters and the number of found and not found words.
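A symbol similarity table of the kind described above can be sketched as a hash map from a look-alike symbol to its basic counterpart and a similarity weight. This is a minimal illustration only: the entries, the weights, and the names `SIMILARITY_TABLE`, `base_symbol`, `similarity`, and `normalize` are assumptions made for the example and are not taken from the paper.

```python
# Hypothetical table: look-alike symbol -> (basic counterpart, similarity in [0, 1]).
# Entries and weights are illustrative assumptions, not the paper's values.
SIMILARITY_TABLE = {
    "0": ("o", 0.8),
    "1": ("l", 0.8),
    "@": ("a", 0.7),
    "$": ("s", 0.7),
    "\u00e0": ("a", 0.9),  # a with grave accent
    "\u00e9": ("e", 0.9),  # e with acute accent
}


def base_symbol(ch: str) -> str:
    """Return the basic counterpart of ch; unknown symbols map to themselves."""
    ch = ch.lower()
    entry = SIMILARITY_TABLE.get(ch)
    return entry[0] if entry else ch


def similarity(a: str, b: str) -> float:
    """Degree of similarity of two symbols using dictionary lookups (O(1) on average)."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    base_a, weight_a = SIMILARITY_TABLE.get(a, (a, 1.0))
    base_b, weight_b = SIMILARITY_TABLE.get(b, (b, 1.0))
    # Symbols are related only if they share the same basic counterpart.
    return weight_a * weight_b if base_a == base_b else 0.0


def normalize(text: str) -> str:
    """Convert every symbol of text into its basic counterpart."""
    return "".join(base_symbol(ch) for ch in text)
```

For instance, `normalize("p@$$w0rd")` maps each look-alike symbol to its base letter, and `similarity("0", "o")` returns the weight stored for that pair. A flat array indexed by code point would give the same constant-time behavior for a small fixed alphabet.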
The Damerau-Levenshtein algorithm finds the edit distance between two words, taking into account the following types of errors: substitution, insertion, deletion, and transposition of characters. The work proposes a modification of this algorithm that uses a similarity table to estimate the editing distance between two words more accurately.
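The idea of weighting substitutions by symbol similarity can be sketched with the dynamic programming recurrence below. It uses the restricted variant of the Damerau-Levenshtein distance (optimal string alignment), and the substitution cost `1 - similarity` is an assumption chosen for illustration; the paper's exact cost model may differ. The helper `symbol_similarity` and its look-alike pairs are hypothetical stand-ins for a real similarity table.

```python
def symbol_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for a similarity table lookup; values are illustrative."""
    if a == b:
        return 1.0
    look_alikes = {frozenset("0o"): 0.8, frozenset("1l"): 0.8, frozenset("5s"): 0.7}
    return look_alikes.get(frozenset((a, b)), 0.0)


def weighted_osa_distance(s: str, t: str) -> float:
    """Optimal string alignment distance with similarity-weighted substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete all i symbols of s
    for j in range(cols):
        d[0][j] = float(j)  # insert all j symbols of t
    for i in range(1, rows):
        for j in range(1, cols):
            # Similar symbols are cheap to substitute; identical ones cost 0.
            sub_cost = 1.0 - symbol_similarity(s[i - 1], t[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1.0,           # deletion
                d[i][j - 1] + 1.0,           # insertion
                d[i - 1][j - 1] + sub_cost,  # substitution (or match)
            )
            # Transposition of two adjacent symbols.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[-1][-1]
```

With these illustrative weights, `weighted_osa_distance("w0rd", "word")` is close to 0.2 rather than 1, so a query typed with a look-alike symbol ranks nearer to the intended word than an unrelated one-letter substitution would.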
The developed method makes it possible to create a fuzzy search system that helps find the desired results faster and increases the relevance of the obtained results by sorting them according to the values of the proposed text data similarity metric.
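Sorting results by a match metric can be illustrated as follows. The abstract does not give the metric's formula, so `match_score` below is a purely hypothetical stand-in: it averages the share of query words found in a document and the share of query characters belonging to those found words. Exact word matching is used to keep the sketch short; a fuzzy system would presumably count a word as found when its edit distance falls under a threshold.

```python
def match_score(query: str, document: str) -> float:
    """Hypothetical relevance score in [0, 1] (not the paper's formula)."""
    query_words = query.lower().split()
    if not query_words:
        return 0.0
    doc_words = set(document.lower().split())
    found = [w for w in query_words if w in doc_words]
    # Share of query words that were found in the document.
    word_ratio = len(found) / len(query_words)
    # Share of query characters that belong to the found words.
    char_ratio = sum(len(w) for w in found) / sum(len(w) for w in query_words)
    return (word_ratio + char_ratio) / 2


def rank_documents(query: str, documents: list[str]) -> list[tuple[float, str]]:
    """Return (score, document) pairs sorted from most to least relevant."""
    scored = [(match_score(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = ["fuzzy search in text data", "exact search only", "unrelated note"]
ranked = rank_documents("fuzzy search", docs)
```

Here the document containing both query words is ranked first, the one containing only "search" second, and the unrelated one last with a score of zero.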
License
Copyright (c) 2024 Kyrylo Kleshch
This work is licensed under a Creative Commons Attribution 4.0 International License.
The consolidation and conditions for the transfer of copyright (identification of authorship) are set out in the License Agreement. In particular, the authors retain the right to the authorship of their manuscript and transfer the right of first publication of this work to the journal under the terms of the Creative Commons CC BY license. They also have the right to independently conclude additional agreements concerning the non-exclusive distribution of the work in the form in which it was published by this journal, provided that the link to the first publication of the article in this journal is preserved.