Development of a method for determining the keywords in the Slavic language texts based on the technology of web mining
DOI:
https://doi.org/10.15587/1729-4061.2017.98750
Keywords:
Web Mining, NLP, content, content monitoring, keywords, content analysis, Porter stemmer, linguistic analysis
Abstract
The authors developed algorithmic support for content monitoring processes in order to determine the keywords of a Slavic-language text using Web Mining technology. The specifics of applying this technology to determine the keywords and subject headings of text content are substantiated. Web Mining technology makes it possible to apply a text content monitoring method based on the Porter stemmer to the keyword determination problem. The modification of stemming relies on the well-known classification of the morphemic and word-formation structure of Ukrainian derivatives, on the identified patterns of affix combination, and on modeling the structural organization of verbs and suffixed nouns. Algorithms for morphonological modifications during verb inflection, adjective inflection, and word formation in Ukrainian were used. The method for determining the keywords of text content was decomposed; its distinguishing feature is the adaptation of the morphological and syntactic analysis of lexical units to the structure of Ukrainian words and texts. Algorithmic support for the main structural components of the method was developed. It covers the convolution and analysis of nominal and verb groups and the construction of a parse tree for each sentence, taking into account sentence structure in Slavic-language texts. A formal approach to implementing stemming for Ukrainian-language content was proposed, aimed at the automatic detection of notional keywords in a Ukrainian text. Ways to improve the efficiency of keyword search, in particular keyword density in the text, were identified theoretically.
They are based on analyzing not the words themselves (nouns, noun sequences, and adjectives with nouns; other parts of speech are ignored) but the word stems in Slavic-language texts. The stem separation rules cover not only the removal of inflections but also of suffixes, and they account for letter alternation during the declension of nouns and adjectives. Using the developed software, we obtained experimental results for the proposed content monitoring method for determining keywords in Slavic-language scientific texts in technical fields, based on Web Mining technology. For the selected experimental base of 100 works, the best results by the density criterion are achieved by analyzing the article without the mandatory front matter and without the reference list. This is attained by training the system and by checking against the refined list of blocked words and the refined thematic dictionary. Specifically, for the technical scientific texts of the experimental base, the best results are reached when the article is analyzed without its beginning (title, authors, UDC, abstracts in two languages, author's keywords in two languages, authors' place of work) and without the reference list, with a check against the refined blocked words and the refined thematic dictionary. In this case the average keyword density in the text reaches 0.34, which is 81 % higher than the corresponding density for the original text, 0.19. Statistical analysis of the data showed that tuning the system parameters almost doubles the number of keywords identified without reducing accuracy or reliability. Testing the proposed method on other categories of texts, such as scientific texts in the humanities, fiction, and journalism, requires further experimental research.
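The idea of counting word stems rather than word forms, and then measuring keyword density, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the suffix list, the blocked-word list, the minimum stem length, and the density definition (the share of all tokens whose stem is among the selected keywords) are hypothetical and are not taken from the authors' software. The sketch also omits the letter-alternation rules, part-of-speech filtering, and the thematic dictionary that the abstract mentions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix/inflection list (longest first).
# It is NOT the authors' rule set and ignores letter alternation.
SUFFIXES = sorted(
    ["ами", "ями", "ого", "ому", "ий", "ій", "ої", "ів", "ах", "ях",
     "ом", "ем", "а", "я", "у", "ю", "и", "і", "е", "о"],
    key=len,
    reverse=True,
)

# Hypothetical blocked (stop) words; a real list would be much longer.
BLOCKED_WORDS = {"і", "та", "в", "на", "для", "що", "це", "з", "до"}

MIN_STEM_LENGTH = 3  # assumption: never cut a word below three letters


def stem(word):
    """Strip the longest matching suffix, keeping at least MIN_STEM_LENGTH letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and return runs of letters (apostrophes split words)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def keyword_density(text, top_n=3):
    """Return the top_n most frequent stems and their combined density.

    Density here is an assumed definition: occurrences of the selected
    stems divided by the total number of tokens in the text.
    """
    tokens = tokenize(text)
    stems = [stem(t) for t in tokens if t not in BLOCKED_WORDS]
    keywords = Counter(stems).most_common(top_n)
    total = len(tokens)
    density = sum(count for _, count in keywords) / total if total else 0.0
    return keywords, density


if __name__ == "__main__":
    sample = "Система аналізує текст. Системи та тексти."
    keywords, density = keyword_density(sample, top_n=2)
    print(keywords, round(density, 2))
```

In this toy example the forms "Система" and "Системи" collapse to one stem, as do "текст" and "тексти", so each stem is counted twice instead of each form once. That is the effect the abstract attributes to analyzing stems instead of words.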
License
Copyright (c) 2017 Vasyl Lytvyn, Victoria Vysotska, Petro Pukach, Oksana Brodyak, Dmytro Ugryn

This work is licensed under a Creative Commons Attribution 4.0 International License.