Latent semantic method for extracting information from Internet resources

Александр Африканович Стенин; Юрий Афанасиевич Тимошин; Екатерина Юрьевна Мелкумян; В. В. Курбанов

doi:10.15587/1729-4061.2013.16387

Authors

Александр Африканович Стенин, National Technical University of Ukraine “Kyiv Polytechnic Institute”, Peremogy 37, Kyiv, Ukraine, 03056
Юрий Афанасиевич Тимошин, National Technical University of Ukraine “Kyiv Polytechnic Institute”, Peremogy 37, Kyiv, Ukraine, 03056, https://orcid.org/0000-0001-9332-3228
Екатерина Юрьевна Мелкумян, National Technical University of Ukraine “Kyiv Polytechnic Institute”, Peremogy 37, Kyiv, Ukraine, 03056
В. В. Курбанов, National Technical University of Ukraine “Kyiv Polytechnic Institute”, Peremogy 37, Kyiv, Ukraine, 03056

Keywords:

internet resources, information retrieval, intelligent agents, descriptors, Zipf’s law

Abstract

Unlike traditional information retrieval systems (IRS), the Internet has the following features: as an information repository it originally lacked a search function and is therefore decentralized; the network is social and heterogeneous, combining both current and earlier versions of systems; access time to its various parts is unequal; and its volume of information exceeds that of the largest IRS. The main task of an IRS on the Internet is to provide methods for the semantic analysis of natural-language text, which entails the ability to extract specific pieces of information from given HTML documents.

The paper proposes a latent semantic method of weighted descriptors, which makes it possible to extract the most meaningful documents that are close to the subject area of the search, together with a search algorithm. The method assumes that the conceptual descriptors in sentences, based on Zipf’s law, carry an underlying “latent” meaning obscured by the use of different words. The interpretation of Zipf’s law is based on the correlation properties of additive Markov chains with a step-like memory function.
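As a rough illustration of how Zipf’s law relates to descriptor weighting, the sketch below builds a rank-frequency table for a text and assigns each remaining word a weight. It is not the authors’ algorithm, which the abstract does not spell out: the function names `rank_frequency` and `descriptor_weights`, the stop-word filter, and the choice of relative frequency as the weight are all assumptions made for this example.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # \w+ also matches Cyrillic letters, so the same code works for Russian text.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def descriptor_weights(text, stop_words=frozenset()):
    """Hypothetical weighting: relative frequency of each non-stop word."""
    kept = [(word, count)
            for _, word, count in rank_frequency(text)
            if word not in stop_words]
    total = sum(count for _, count in kept)
    return {word: count / total for word, count in kept} if total else {}


if __name__ == "__main__":
    sample = "the agent finds the page and the agent reads the page"
    for rank, word, count in rank_frequency(sample):
        # Under Zipf's law, rank * count stays roughly constant on large corpora.
        print(rank, word, count, rank * count)
    print(descriptor_weights(sample, stop_words={"the", "and"}))
```

On a corpus of realistic size the product of rank and frequency is expected to stay roughly constant; a toy sentence like the one above is far too short to show this and only demonstrates the mechanics.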

The paper also describes latent semantic analysis (LSA), a method for processing natural-language information that analyzes the relationship between a collection of documents and the terms they contain. LSA can be compared to a simple neural network consisting of three layers.
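For orientation, the standard textbook formulation of LSA can be written as a truncated singular value decomposition of a term-document matrix. The notation below ($A$, $U$, $\Sigma$, $V$, $k$, $e_j$) is the conventional one and is assumed here; the abstract does not state the exact formulation used in the paper.

```latex
% A: m x n term-document matrix, a_{ij} = weight of term i in document j
\[
A = U \Sigma V^{\mathsf{T}}, \qquad
A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \quad k \ll \min(m, n)
\]
% Document j in the k-dimensional latent space (e_j is the j-th unit vector of R^n)
\[
\hat{d}_j = \Sigma_k V_k^{\mathsf{T}} e_j, \qquad
\operatorname{sim}(d_i, d_j) =
\frac{\hat{d}_i^{\mathsf{T}} \hat{d}_j}
     {\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
\]
```

Read this way, the three-layer analogy has the $m$ terms as the input layer, the $k$ latent concepts as the hidden layer, and the $n$ documents as the output layer, with $U_k$ and $V_k$ playing the role of the connection weights.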

Author Biographies

Александр Африканович Стенин, National Technical University of Ukraine “Kyiv Polytechnic Institute” Peremogy 37, Kyiv, Ukraine, 03056

Professor

Department of Technical Cybernetics

Юрий Афанасиевич Тимошин, National Technical University of Ukraine “Kyiv Polytechnic Institute” Peremogy 37, Kyiv, Ukraine, 03056

Docent

Department of Technical Cybernetics

Екатерина Юрьевна Мелкумян, National Technical University of Ukraine “Kyiv Polytechnic Institute” Peremogy 37, Kyiv, Ukraine, 03056

Ph.D.

Department of Technical Cybernetics

В. В. Курбанов, National Technical University of Ukraine “Kyiv Polytechnic Institute” Peremogy 37, Kyiv, Ukraine, 03056

Postgraduate Student

Department of Technical Cybernetics

