Development of methods for pre-clustering and virtual merging of short documents for building domain dictionaries
DOI:
https://doi.org/10.15587/1729-4061.2020.215190
Keywords:
domain dictionary, short document, clustering, document proximity coefficient, virtual union
Abstract
The aim of the research is to improve the quality of domain dictionaries by expanding the corpus of documents under study to include short documents. A document model is proposed that makes it possible to identify a short document and to determine whether it must be combined with other documents in order to extract multi-word ("verbose") terms. An algorithm for extracting the substantive part of a document has been developed, since the heading and closing parts of a short document usually contain terms unrelated to the studied domain. A method for preliminary clustering of short documents, intended to support the extraction of multi-word terms, has been developed. The method is based on extracting nouns (one-word terms) and counting their occurrences across all analyzed documents. The concept of document proximity is introduced; it is determined by a combination of two criteria: the relative number of matching terms and the relative frequency of occurrence of matching terms. The way documents are grouped at the customer's site often does not correspond to the grouping needed to build a domain dictionary. In a short document, it is usually impossible to isolate a multi-word term because terms are repeated too rarely. A method has been developed for virtually merging short documents, based on the principle of achieving the necessary repetition of one-word terms. The merged document has the highest possible term frequency for the cluster it belongs to. The original text of each document is preserved, as is the ability to associate an extracted multi-word term with the documents in which it occurs. The experiment made it possible to find the best ratio between the elements of the document proximity coefficient and confirmed the effectiveness of the proposed preliminary clustering method.
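As a rough illustration of the two central ideas, the Python sketch below computes a proximity coefficient from the two stated criteria and represents a virtual union as a list of document identifiers plus summed term counts, without copying any text. The exact formula is not given on this page, so the weighted sum, the `weight` parameter, the normalization over the union of terms, and the names `proximity` and `VirtualUnion` are all assumptions made for illustration and should not be read as the authors' method.

```python
from collections import Counter
from dataclasses import dataclass, field


def proximity(terms_a: Counter, terms_b: Counter, weight: float = 0.5) -> float:
    """Hypothetical proximity coefficient of two documents.

    Each argument maps a one-word term (noun) to its number of occurrences.
    `weight` is an assumed mixing ratio between the two criteria.
    """
    shared = terms_a.keys() & terms_b.keys()
    all_terms = terms_a.keys() | terms_b.keys()
    if not all_terms:
        return 0.0

    # Criterion 1: relative number of matching terms.
    share_of_terms = len(shared) / len(all_terms)

    # Criterion 2: relative frequency of occurrence of matching terms.
    total_occurrences = sum(terms_a.values()) + sum(terms_b.values())
    shared_occurrences = sum(terms_a[t] + terms_b[t] for t in shared)
    share_of_occurrences = shared_occurrences / total_occurrences

    return weight * share_of_terms + (1 - weight) * share_of_occurrences


@dataclass
class VirtualUnion:
    """Hypothetical merged view of several short documents.

    Only document identifiers and summed term counts are stored, so the
    original texts stay untouched and every term can be traced back to
    the documents that contributed to the union.
    """
    doc_ids: list = field(default_factory=list)
    term_counts: Counter = field(default_factory=Counter)

    def add(self, doc_id: str, terms: Counter) -> None:
        self.doc_ids.append(doc_id)
        self.term_counts.update(terms)  # adds counts term by term


# Made-up term counts for two short documents.
doc1 = Counter({"contract": 2, "invoice": 1})
doc2 = Counter({"contract": 1, "payment": 1})

union = VirtualUnion()
union.add("doc1", doc1)
union.add("doc2", doc2)
print(proximity(doc1, doc2), union.term_counts.most_common(1))
```

In this sketch, a clustering step would compare `proximity` against some threshold, and documents would be added to a `VirtualUnion` until the one-word terms repeat often enough; both the threshold and the stopping rule are left open because the page does not specify them.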
License
Copyright (c) 2020 Oleksii Kungurtsev, Svitlana Zinovatna, Iana Potochniak, Nataliia Novikova
This work is licensed under a Creative Commons Attribution 4.0 International License.