Extracting information from the semistructured web pages

Authors

DOI:

https://doi.org/10.15587/1729-4061.2014.19496

Keywords:

web page, internet, information, semistructured, extraction

Abstract

In most cases, information openly available on the Internet has no strictly defined structure. Web pages, the Internet's information sources, can have a non-uniform layout even within a single resource. Processing such sources therefore raises the problem of extracting useful information from semistructured data.

The paper proposes a method for solving this problem based on the so-called “web scraping” approach. Its essence lies in simulating a human's interaction with a web resource using the low-level Hypertext Transfer Protocol (HTTP). This approach makes it possible to work with any data structure once it becomes known through a preliminary analysis of the data source. The examples of extracting information from web pages of scientometric of a subsequent result has been developed. Further studies include the possibility of intelligent processing of the extracted information to filter out irrelevant data.
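The web scraping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `fetch` and `extract_fields`, the class names `title` and `cited-by`, the user-agent string, and the sample page are all assumptions made for illustration. The sketch issues a plain HTTP GET request, as a browser would, and then pulls out the text of elements whose layout is assumed to be known from a preliminary analysis of the page.

```python
import http.client
from html.parser import HTMLParser
from urllib.parse import urlsplit


def fetch(url, user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Request a page over HTTP(S) the way a browser would and return its text."""
    parts = urlsplit(url)
    secure = parts.scheme == "https"
    conn_cls = http.client.HTTPSConnection if secure else http.client.HTTPConnection
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        headers = {"User-Agent": user_agent, "Accept": "text/html"}
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        body = response.read()
        charset = response.headers.get_content_charset() or "utf-8"
        return body.decode(charset, errors="replace")
    finally:
        conn.close()


class FieldExtractor(HTMLParser):
    """Collects the text of elements whose class attribute is in a known set."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.fields = {}      # class name -> list of extracted strings
        self._field = None    # class currently being captured, if any
        self._tag = None      # tag name of the element being captured
        self._depth = 0       # nesting depth of same-named tags

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._field, self._tag, self._depth = name, tag, 1
                self.fields.setdefault(name, []).append("")
                break

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.fields[self._field][-1] += data


def extract_fields(html, wanted_classes):
    """Return {class name: [text, ...]} for the classes found in the page."""
    parser = FieldExtractor(wanted_classes)
    parser.feed(html)
    parser.close()
    return {k: [v.strip() for v in vals] for k, vals in parser.fields.items()}


# Invented sample page standing in for a real, previously analysed source.
SAMPLE_PAGE = """
<table>
  <tr><td class="title">First <b>paper</b></td><td class="cited-by"> 12 </td></tr>
  <tr><td class="title">Second paper</td><td class="cited-by">7</td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE_PAGE, ["title", "cited-by"]))
```

In practice, the string returned by `fetch(url)` for a real page would be passed to `extract_fields` in place of `SAMPLE_PAGE`. The set of class names plays the role of the structure learned during the preliminary analysis, so a change in page layout only requires updating that set.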

Author Biographies

Андрей Сергеевич Коляда, Odessa National Polytechnic University, Shevchenko Ave 1, Odessa, Ukraine, 65044

Graduate student

Department of Systems Management Life Safety

Виктор Дмитриевич Гогунский, Odessa National Polytechnic University, Shevchenko Ave 1, Odessa, Ukraine, 65044

Doctor of Technical Sciences, Professor

Department of Systems Management Life Safety

Published

2014-02-05

How to Cite

Коляда, А. С., & Гогунский, В. Д. (2014). Extracting information from the semistructured web pages. Eastern-European Journal of Enterprise Technologies, 1(9(67)), 51–54. https://doi.org/10.15587/1729-4061.2014.19496

Issue

Section

Information and controlling system