DOI: https://doi.org/10.15587/2312-8372.2015.37274

Method of determining keywords for English texts based on DKPro Core

Олег Володимирович Бісікало, Олександр Вікторович Яхимович

Abstract


Approaches to keyword search in text are considered; they fall into two categories, linguistic and statistical. Linguistic methods rely on the meaning of words, in particular on ontologies and semantic information about words. These methods are resource-intensive in the early stages: developing an ontology, for example, is a very time-consuming process.

A new method for determining keywords is proposed. It is based on finding connections between word forms of an English text using the tools of the DKPro Core package. The method, illustrated with examples of analysis, is aimed at the efficient processing of text documents: indexing, abstracting, clustering and classification.
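The abstract does not spell out how connections between word forms are turned into a ranking. A minimal sketch of one possible reading follows, assuming that a syntactic parser (for example, one wrapped by DKPro Core) has already produced (governor, dependent) pairs and that a word form is more important the more links it takes part in. The function name `rank_by_links`, the scoring rule and the sample pairs are illustrative assumptions, not the authors' algorithm.

```python
from collections import Counter


def rank_by_links(dependencies):
    """Rank word forms by the number of dependency links they take part in.

    `dependencies` is an iterable of (governor, dependent) string pairs.
    This scoring rule is an assumption made for illustration only.
    """
    counts = Counter()
    for governor, dependent in dependencies:
        counts[governor.lower()] += 1
        counts[dependent.lower()] += 1
    # Highest link count first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hand-written pairs for the sentence "The parser finds links between words."
# A real pipeline would obtain them from a dependency parser.
deps = [
    ("finds", "parser"),
    ("parser", "The"),
    ("finds", "links"),
    ("links", "words"),
    ("words", "between"),
]

ranking = rank_by_links(deps)
for word, links in ranking:
    print(word, links)
```

In this toy input the function words "the" and "between" each take part in a single link and therefore fall to the bottom of the ranking, which is the kind of effect a link-based score might be expected to have.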

Theoretical and experimental studies show that the developed method finds more of the keywords specified by the author of the text than comparable methods do. In addition, without any additional filters, the proposed method reduces the number of stop words among the ten most important (key) words by a factor of at least 5. The results can be used to improve the accuracy of a site's content analysis and to raise the site's position in search results.
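The two comparison criteria named above (author keywords recovered, and stop words among the top ten) can be computed as in the sketch below. The helper names, the stop-word list and both rankings are invented for illustration and do not reproduce the experimental data of the paper.

```python
def stop_words_in_top(ranked_words, stop_words, n=10):
    """Count how many of the first n ranked words are stop words."""
    return sum(1 for word in ranked_words[:n] if word.lower() in stop_words)


def author_keywords_found(ranked_words, author_keywords, n=10):
    """Count how many author-specified keywords appear among the first n ranked words."""
    top = {word.lower() for word in ranked_words[:n]}
    return sum(1 for keyword in author_keywords if keyword.lower() in top)


# Hypothetical data, made up for this example.
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is"}
author_keywords = ["method", "keywords", "analysis"]
baseline_top = ["the", "of", "method", "and", "text",
                "in", "keywords", "to", "a", "is"]
proposed_top = ["method", "keywords", "text", "analysis", "package",
                "the", "sentence", "links", "words", "forms"]

for name, top in (("baseline", baseline_top), ("proposed", proposed_top)):
    print(name,
          "stop words:", stop_words_in_top(top, STOP_WORDS),
          "author keywords:", author_keywords_found(top, author_keywords))
```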

Unlike existing methods, the proposed method of determining keywords uses additional information about complex relationships between the members of an English sentence. The popular linguistic package DKPro Core was selected for the functional implementation of the text analyzer. Experimental studies confirmed the theoretical substantiation of the method and demonstrated its quality advantages over known analogues.


Keywords


method; keywords; English; linguistic package; DKPro Core; syntactic analysis



Copyright (c) 2016 Олег Володимирович Бісікало, Олександр Вікторович Яхимович

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

ISSN (print) 2226-3780, ISSN (on-line) 2312-8372