Comparison of algorithms for estimating the distance between words for finding similar sentences

Authors

  • T.D. Goncharenko State Higher Education Institution "Priazovskyi state technical university", Dnipro, Ukraine
  • O.I. Pronina State Higher Education Institution "Priazovskyi state technical university", Dnipro, Ukraine https://orcid.org/0000-0001-7085-8027

DOI:

https://doi.org/10.31498/2225-6733.47.2023.299974

Keywords:

text analysis, search system, similar sentences, distance between words, comparison of text fragments, cosine similarity, Levenshtein distance, diploma project

Abstract

The article describes a system for finding similar sentences. The system measures the distance between words using algorithms that recognize inexact matches. Such algorithms let search engines interpret queries in context: they tolerate discrepancies and variations in spelling, which matters when users phrase the same idea in different ways. As a result, a search engine can identify the intent of a query and return relevant results even when the input contains spelling or typographical errors. The resulting software can be applied in information retrieval, natural language processing, plagiarism detection, and genomics, among other fields, all of which demand precise interpretation and comparison of text. In information retrieval, these algorithms improve result quality by giving users more accurate answers regardless of spelling or grammatical errors in the query. In natural language processing, they support the analysis and interpretation of human language, which underlies chatbots, machine translation systems, and digital assistants. In plagiarism detection, they estimate the degree of similarity between texts, which is valuable in academic and research settings.
In genomics, the same methods are used to map genetic sequences, an important task in bioinformatics research. In conclusion, the developed software is a versatile tool for analyzing and comparing textual data, with relevance across a range of scientific and technological fields.
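
The keywords name Levenshtein distance and cosine similarity as the measures being compared, but this page does not reproduce the authors' implementation. The following is only a minimal Python sketch of the two textbook measures, assuming whitespace tokenization and lowercasing; the function names `levenshtein` and `cosine_similarity` are illustrative and do not come from the article.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def cosine_similarity(s1: str, s2: str) -> float:
    """Cosine of the angle between bag-of-words count vectors.
    Assumption: words are split on whitespace and lowercased."""
    v1 = Counter(s1.lower().split())
    v2 = Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = (sqrt(sum(c * c for c in v1.values()))
            * sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(cosine_similarity("the cat sat on the mat",
                            "the cat lay on the mat"))
```

The two measures work at different levels: the edit distance compares words character by character, so it tolerates typos, while the cosine measure compares whole sentences by the words they share and ignores word order. A system of the kind described could plausibly combine them, for example by treating two words as equal when their edit distance is small, but that combination is an assumption and is not stated in the abstract.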

Author Biographies

T.D. Goncharenko, State Higher Education Institution "Priazovskyi state technical university", Dnipro

Master's student

O.I. Pronina, State Higher Education Institution "Priazovskyi state technical university", Dnipro

PhD (Engineering), associate professor

References

Cohen W.W., Ravikumar P. A comparison of string distance metrics for name-matching tasks. Proceedings of the 2003 International Conference on Information Integration on the Web, Acapulco, Mexico, 9-10 August 2003. Pp. 73-78.

Levenshtein V.I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady. 1966. Vol. 10. № 8. Pp. 707-710.

Navarro G. A guided tour to approximate string matching. ACM Computing Surveys. 2001. Vol. 33. Iss. 1. Pp. 31-88. DOI: https://doi.org/10.1145/375360.375365.

Ukkonen E. Algorithms for approximate string matching. Information and Control. 1985. Vol. 64. Iss. 1-3. Pp. 100-118. DOI: https://doi.org/10.1016/S0019-9958(85)80046-2.

Jaro M.A. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association. 1989. Vol. 84. No. 406. Pp. 414-420. DOI: https://doi.org/10.2307/2289924.

Landau G.M., Vishkin U. Fast string matching with k differences. Journal of Computer and System Sciences. 1988. Vol. 37. Iss. 1. Pp. 63-78. DOI: https://doi.org/10.1016/0022-0000(88)90045-1.

Moffat A., Zobel J. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems. 1996. Vol. 14. № 4. Pp. 349-379. DOI: https://doi.org/10.1145/237496.237497.

Myers G. An O(ND) difference algorithm and its variations. Algorithmica. 1986. Vol. 1. Pp. 251-266. DOI: https://doi.org/10.1007/BF01840446.

Monge A.E., Elkan C. The field matching problem: Algorithms and applications. KDD'96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, USA, 2-4 August 1996. Pp. 267-270.

Wu S., Manber U. Fast text searching allowing errors. Communications of the ACM. 1992. Vol. 35. Iss. 10. Pp. 83-91. DOI: https://doi.org/10.1145/135239.135244.

Published

2023-12-28

How to Cite

Goncharenko, T., & Pronina, O. (2023). Comparison of algorithms for estimating the distance between words for finding similar sentences. Reporter of the Priazovskyi State Technical University. Section: Technical Sciences, (47), 32–38. https://doi.org/10.31498/2225-6733.47.2023.299974