Development of methods for pre-clustering and virtual merging of short documents for building domain dictionaries

DOI:

https://doi.org/10.15587/1729-4061.2020.215190

Keywords:

domain dictionary, short document, clustering, document proximity coefficient, virtual union

Abstract

The aim of the research is to improve the quality of domain dictionaries by expanding the corpus of analyzed documents with short documents. A document model is proposed that makes it possible to identify a short document and to determine whether it needs to be combined with other documents in order to extract verbose (multi-word) terms. An algorithm for extracting the substantive part of a document has been developed, since in a short document the heading and closing parts usually contain terms unrelated to the studied domain. A method for preliminary clustering of short documents for extracting verbose terms has been developed. The method is based on extracting nouns (one-word terms) and counting their occurrences across all analyzed documents. The concept of document proximity is introduced; it is determined by a combination of two criteria: the relative number of matching terms and the relative frequency of occurrence of matching terms. The principle by which documents are grouped at the customer's site often does not correspond to the grouping principles required for building a domain dictionary. In a short document, it is usually impossible to isolate a verbose term because terms are rarely repeated. A method has been developed for virtually combining short documents based on the principle of achieving the necessary repeatability of one-word terms. The merged document has the highest possible term frequency for the cluster it belongs to. At the same time, the original text of the documents is preserved, as is the ability to associate each extracted verbose term with the documents in which it occurs. The experiment made it possible to find the best ratio between the elements of the document proximity coefficient and confirmed the effectiveness of the proposed preliminary clustering method.
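
The two proximity criteria and the idea of a virtual union can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the weighted sum, the default `weight` of 0.5, the normalization by the union of terms, and the function names `proximity` and `virtual_merge` are all assumptions made for illustration, not the authors' method. Each document is represented by hypothetical noun counts.

```python
from collections import Counter


def proximity(doc_a, doc_b, weight=0.5):
    """Illustrative proximity coefficient (assumed form, not the paper's formula).

    doc_a, doc_b: Counter mapping one-word terms (nouns) to occurrence counts.
    weight: assumed balance between the two criteria named in the abstract.
    """
    shared = set(doc_a) & set(doc_b)
    all_terms = set(doc_a) | set(doc_b)
    if not all_terms:
        return 0.0
    # Criterion 1: relative number of matching terms.
    term_share = len(shared) / len(all_terms)
    # Criterion 2: relative frequency of occurrence of matching terms.
    total = sum(doc_a.values()) + sum(doc_b.values())
    freq_share = sum(doc_a[t] + doc_b[t] for t in shared) / total
    return weight * term_share + (1 - weight) * freq_share


def virtual_merge(docs):
    """Combine term counts of clustered documents without touching their text.

    docs: dict mapping a document id to its Counter of one-word terms.
    Returns the merged counts and an index from each term to the ids of the
    documents that contain it, so a term can be traced back to its sources.
    """
    merged = Counter()
    sources = {}
    for doc_id, counts in docs.items():
        merged.update(counts)
        for term in counts:
            sources.setdefault(term, []).append(doc_id)
    return merged, sources


if __name__ == "__main__":
    # Hypothetical noun counts for two short documents.
    d1 = Counter({"invoice": 2, "payment": 1})
    d2 = Counter({"invoice": 1, "customer": 1})
    print(round(proximity(d1, d2), 3))
    merged, sources = virtual_merge({"d1": d1, "d2": d2})
    print(merged["invoice"], sources["invoice"])
```

In this sketch the merged document exists only as aggregated counts plus a term-to-document index, which mirrors the stated goal of preserving the original texts while raising term repeatability within a cluster. The weight between the two criteria would be tuned experimentally, as the abstract indicates.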

Author Biographies

Oleksii Kungurtsev, Odessa National Polytechnic University, Shevchenka ave., 1, Odessa, Ukraine, 65044

PhD, Professor

Department of System Software

Svitlana Zinovatna, Odessa National Polytechnic University, Shevchenka ave., 1, Odessa, Ukraine, 65044

PhD, Associate Professor

Department of System Software

Iana Potochniak, Software Development Company "The Product Engine", Marazlievska str., 7, Odessa, Ukraine, 65078

PhD, Engineer

Nataliia Novikova, Odessa National Maritime University, Mechnikova str., 34, Odessa, Ukraine, 65029

Senior Lecturer

Department of Technical Cybernetics and Information Technology named after Prof. R. V. Merkt

Published

2020-10-31

How to Cite

Kungurtsev, O., Zinovatna, S., Potochniak, I., & Novikova, N. (2020). Development of methods for pre-clustering and virtual merging of short documents for building domain dictionaries. Eastern-European Journal of Enterprise Technologies, 5(2 (107)), 39–47. https://doi.org/10.15587/1729-4061.2020.215190