The use of probabilistic latent semantic analysis to identify scientific subject spaces and to evaluate the completeness of covering the results of dissertation studies

DOI:

https://doi.org/10.15587/1729-4061.2020.209886

Keywords:

probabilistic latent semantic analysis, clustering, scientific subject space, thematic model

Abstract

This study examines how latent semantic analysis can be used to identify scientific subject spaces and to evaluate how completely degree candidates’ publications cover the results of their dissertation research.

A probabilistic thematic model was built that clusters scholars’ publications by scientific area while taking the citation network into account, an important step toward identifying scientific subject spaces. Constructing the model solved the problem that clustering of the citation graph becomes increasingly unstable as the number of clusters decreases. This problem arises when clusters obtained from the citation graph are merged on the basis of the similarity of the publications’ abstracts.
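The basic idea of a probabilistic thematic model can be illustrated with plain probabilistic latent semantic analysis fitted by expectation-maximization. The sketch below is only an illustration under stated assumptions: it omits the citation network and any regularization, and the function name `plsa`, its parameters, and the toy count matrix are hypothetical and do not come from the article.

```python
import random


def _normalize(row, fallback):
    """Scale a non-negative row to sum to 1; use fallback if the row is all zeros."""
    total = sum(row)
    if total <= 0:
        return list(fallback)
    return [x / total for x in row]


def plsa(counts, num_topics, iterations=200, seed=0):
    """Fit plain PLSA by EM on a document-term count matrix (list of lists).

    Returns phi (phi[t][w] = p(term w | topic t)) and
    theta (theta[d][t] = p(topic t | document d)).
    """
    rng = random.Random(seed)
    num_docs, num_terms = len(counts), len(counts[0])
    uniform_w = [1.0 / num_terms] * num_terms
    uniform_t = [1.0 / num_topics] * num_topics
    # Random positive initialization breaks the symmetry between topics.
    phi = [_normalize([rng.random() + 0.01 for _ in range(num_terms)], uniform_w)
           for _ in range(num_topics)]
    theta = [_normalize([rng.random() + 0.01 for _ in range(num_topics)], uniform_t)
             for _ in range(num_docs)]
    for _ in range(iterations):
        n_wt = [[0.0] * num_terms for _ in range(num_topics)]
        n_td = [[0.0] * num_topics for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(num_terms):
                if counts[d][w] == 0:
                    continue
                # E-step: p(t | d, w) is proportional to phi[t][w] * theta[d][t].
                joint = [phi[t][w] * theta[d][t] for t in range(num_topics)]
                z = sum(joint)
                if z <= 0:
                    continue
                for t in range(num_topics):
                    share = counts[d][w] * joint[t] / z
                    n_wt[t][w] += share
                    n_td[d][t] += share
        # M-step: re-estimate both distributions from the expected counts.
        phi = [_normalize(n_wt[t], phi[t]) for t in range(num_topics)]
        theta = [_normalize(n_td[d], theta[d]) for d in range(num_docs)]
    return phi, theta


# Toy corpus (hypothetical): two documents per vocabulary block.
toy_counts = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]]
toy_phi, toy_theta = plsa(toy_counts, num_topics=2)
```

Each row of `toy_theta` is a topic distribution for one document, so documents can be grouped by their dominant topic. A production model would add regularizers and further data sources such as citations.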

The article describes the representation of text documents based on a probabilistic thematic model that uses n-grams. A probabilistic thematic model was built for determining how completely the materials of an author’s dissertation research are covered in scientific publications. Approximate values of the threshold coefficients were calculated to evaluate whether an author’s articles include the research provisions reflected in the text of the author’s dissertation abstract. The probabilistic thematic model for an author’s publications was implemented with the BigARTM tool. Using the constructed model and a special regularizer, a matrix was obtained that evaluates the relevance of the topics defined by segments of the author’s dissertation abstract to the documents formed by the author’s publications.
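One way such a relevance matrix and a threshold could be turned into a completeness estimate is sketched below. This is an assumed decision rule, not the procedure from the article: a segment of the dissertation abstract is treated as covered if at least one publication reaches the threshold. The function name `coverage_report`, the example matrix, and the threshold value 0.3 are all hypothetical.

```python
def coverage_report(relevance, threshold):
    """Flag covered abstract segments and compute the covered fraction.

    relevance[s][p] is the relevance of abstract segment s to publication p.
    Assumed rule: a segment is covered if any publication scores >= threshold.
    """
    covered = [any(score >= threshold for score in row) for row in relevance]
    completeness = sum(covered) / len(covered) if covered else 0.0
    return covered, completeness


# Hypothetical relevance matrix: 3 abstract segments x 3 publications.
example_relevance = [
    [0.62, 0.05, 0.10],
    [0.08, 0.12, 0.09],
    [0.03, 0.41, 0.55],
]
covered_flags, completeness = coverage_report(example_relevance, threshold=0.3)
```

Here `covered_flags` marks which segments have at least one sufficiently relevant publication, and `completeness` is the share of such segments. Segments left uncovered point to research provisions that the publications may not reflect.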

Important aspects of using latent semantic analysis were studied for the tasks of identifying scientific subject spaces and determining how completely the results of degree candidates’ dissertation research are covered.

Author Biographies

Petro Lizunov, Kyiv National University of Construction and Architecture, Povitroflotskyi ave., 31, Kyiv, Ukraine, 03680

Doctor of Technical Sciences, Professor

Department of Fundamentals of Informatics

Andrii Biloshchytskyi, Taras Shevchenko National University of Kyiv, Volodymyrska str., 60, Kyiv, Ukraine, 01033; Astana IT University, Turkestan str., Nur-Sultan, Kazakhstan, 020000

Doctor of Technical Sciences, Professor

Department of Information Systems and Technologies

Alexander Kuchansky, Taras Shevchenko National University of Kyiv, Volodymyrska str., 60, Kyiv, Ukraine, 01033

PhD, Associate Professor

Department of Information Systems and Technologies

Yurii Andrashko, Uzhhorod National University, Narodna sq., 3, Uzhhorod, Ukraine, 88000

PhD, Associate Professor

Department of System Analysis and Optimization Theory

Svitlana Biloshchytska, Taras Shevchenko National University of Kyiv, Volodymyrska str., 60, Kyiv, Ukraine, 01033

PhD, Associate Professor

Department of Intellectual Technologies

Published

2020-08-31

How to Cite

Lizunov, P., Biloshchytskyi, A., Kuchansky, A., Andrashko, Y., & Biloshchytska, S. (2020). The use of probabilistic latent semantic analysis to identify scientific subject spaces and to evaluate the completeness of covering the results of dissertation studies. Eastern-European Journal of Enterprise Technologies, 4(4 (106)), 21–28. https://doi.org/10.15587/1729-4061.2020.209886

Section

Mathematics and Cybernetics - applied aspects