Development and research of the efficiency of a website parsing information system using the Selenide framework
DOI:
https://doi.org/10.31498/2225-6733.49.1.2024.321179
Keywords:
web parsing, web parser, web site, web application, information search system, data, Selenium, Selenide
Abstract
The article studies methods for automating data collection from websites using parsing technologies. It describes the main advantages of parsing over manual data collection and classifies existing parsers by their capabilities, limitations, and use in real projects. Popular commercial and free parsers (Import.io, Webhose.io, Dexi.io, Scraperhub, ParseHub, Visual Scraper, Spinn3r, 80legs, Scraper, and OutWit Hub) are analyzed in detail to determine their advantages and disadvantages in various usage scenarios. Particular attention is paid to a comparison of the Selenium and Selenide frameworks, which are widely used to automate interaction with web browsers. The authors conclude that Selenide is the more suitable choice because of its simplified syntax, its ability to work with dynamic content, and its support for intelligent waiting. The article presents the development of a custom parser based on Selenide, aimed at small and medium-sized enterprises with limited budgets. The system is built on a modern technology stack that includes Java 11, Python, PostgreSQL, Angular 12, Docker, Gradle, Kafka, and Node.js. The program architecture, the interaction between modules, and the relational database model for storing the collected data are described in detail. The proposed approach allows the parser to be configured for different types of sites, provides high-speed collection and processing of information, and offers flexibility in configuring sampling parameters. The tool can be containerized, which simplifies deployment and maintenance of the application. The results can be used to implement effective information search systems and to automate routine data collection, which is especially important for companies that seek to optimize their business processes and reduce costs.
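The "intelligent waiting" mentioned above can be illustrated with a minimal sketch. It does not use Selenide itself and is not the authors' implementation: it is plain Java that shows the underlying idea, which is to poll a condition until it holds or a timeout expires instead of sleeping for a fixed time. The class name `SmartWait`, the method `waitUntil`, and the simulated element text are hypothetical names introduced only for this illustration.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class SmartWait {

    /**
     * Polls {@code probe} until {@code condition} accepts its value or the
     * timeout expires. Returns the accepted value, otherwise throws.
     * (Hypothetical helper; it only mimics the idea of a polling wait.)
     */
    static <T> T waitUntil(Supplier<T> probe, Predicate<T> condition,
                           Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = probe.get();
            if (condition.test(value)) {
                return value; // condition met, stop waiting immediately
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("Condition not met within " + timeout);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated dynamic page: the text is empty until the third poll.
        AtomicInteger polls = new AtomicInteger();
        Supplier<String> fakeElementText =
                () -> polls.incrementAndGet() < 3 ? "" : "Price: 42";

        String text = waitUntil(fakeElementText, s -> !s.isEmpty(),
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(text);
    }
}
```

In a real parser the probe would read an element from the browser rather than a simulated supplier. The benefit of polling is that the wait ends as soon as the dynamic content appears, while a missing element still fails within a bounded time.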
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
The journal «Reporter of the Priazovskyi State Technical University. Section: Technical sciences» is published under the CC BY license (Attribution License).
This license allows for the distribution, editing, modification, and use of the work as a basis for derivative works, even for commercial purposes, provided that proper attribution is given. It is the most flexible of all available licenses and is recommended for maximum dissemination and use of non-restricted materials.
Authors who publish in this journal agree to the following terms:
1. Authors retain the copyright of their work and grant the journal the right of first publication under the terms of the Creative Commons Attribution License (CC BY). This license allows others to freely distribute the published work, provided that proper attribution is given to the original authors and the first publication of the work in this journal is acknowledged.
2. Authors are allowed to enter into separate, additional agreements for non-exclusive distribution of the work in the same form as published in this journal (e.g., depositing it in an institutional repository or including it in a monograph), provided that a reference to the first publication in this journal is maintained.







