The use of probabilistic latent semantic analysis to identify scientific subject spaces and to evaluate the completeness of covering the results of dissertation studies
DOI:
https://doi.org/10.15587/1729-4061.2020.209886
Keywords:
probabilistic latent semantic analysis, clustering, scientific subject space, thematic model
Abstract
The study considers how latent semantic analysis can be used to identify scientific subject spaces and to evaluate how completely the results of dissertation research by degree seekers are covered in their publications.
A probabilistic thematic model was built to cluster scholars' publications by scientific area while taking the citation network into account, an important step toward identifying scientific subject spaces. Constructing the model solved the problem of the clustering of the citation graph becoming increasingly unstable as the number of clusters decreases. This problem arises when clusters obtained from the citation graph are merged on the basis of the similarity of the abstracts of scientific publications.
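The abstract does not give the model's equations. As a rough illustration of the underlying idea only, the sketch below fits plain probabilistic latent semantic analysis by EM on a toy word-by-document count matrix and assigns each document to its dominant topic. It is not the authors' model: it ignores the citation network and uses no regularizers, and the function name `plsa`, the toy counts, and the choice of two topics are all assumptions made for the example.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit plain PLSA by EM on a (words x documents) count matrix.

    Returns phi (words x topics) and theta (topics x documents);
    every column of both matrices sums to 1.
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = counts.shape
    # Random stochastic initialisation of p(w | t) and p(t | d).
    phi = rng.random((n_words, n_topics))
    phi /= phi.sum(axis=0, keepdims=True)
    theta = rng.random((n_topics, n_docs))
    theta /= theta.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # p(w | d) under the current model, shape (words, documents).
        p_wd = phi @ theta
        p_wd[p_wd == 0] = 1e-12  # guard against division by zero
        ratio = counts / p_wd
        # Expected counts; the E-step posterior p(t | d, w) is folded in.
        n_wt = phi * (ratio @ theta.T)
        n_td = theta * (phi.T @ ratio)
        # M-step: renormalise the expected counts.
        phi = n_wt / n_wt.sum(axis=0, keepdims=True)
        theta = n_td / n_td.sum(axis=0, keepdims=True)
    return phi, theta

# Hypothetical toy corpus: 4 terms (rows) in 4 documents (columns).
counts = np.array([
    [4, 3, 0, 0],
    [3, 5, 0, 1],
    [0, 0, 4, 3],
    [0, 1, 5, 4],
], dtype=float)

phi, theta = plsa(counts, n_topics=2)
# A simple clustering rule: each document goes to its most probable topic.
clusters = theta.argmax(axis=0)
print(clusters)
```

In this sketch the number of topics plays the role of the number of clusters; how the authors combine it with the citation graph is not stated in the abstract.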
The article describes a representation of text documents based on a probabilistic thematic model that uses n-grams. A probabilistic thematic model was built for determining how completely the materials of an author's dissertation research are covered in scientific publications. Approximate values of the threshold coefficients were calculated for evaluating whether an author's articles include the research provisions reflected in the text of the author's dissertation abstract. The probabilistic thematic model for an author's publications was trained using the BigARTM tool. Using the constructed model and a special regularizer, a matrix was obtained that evaluates the relevance of the topics specified by the segments of the author's dissertation abstract to the documents represented by the author's publications.
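One way to read the threshold step is sketched below, under stated assumptions. Given a relevance matrix whose rows are segments of the dissertation abstract and whose columns are the author's publications, a segment is treated as covered when at least one publication reaches a threshold. The function name `coverage_report`, the matrix values, and the threshold of 0.30 are hypothetical; the abstract reports neither the actual coefficients nor the exact decision rule.

```python
import numpy as np

def coverage_report(relevance, threshold):
    """Flag which abstract segments are covered by at least one publication.

    relevance[i, j] is the relevance of segment i to publication j.
    Returns a boolean array (one entry per segment) and the covered share.
    """
    relevance = np.asarray(relevance, dtype=float)
    covered = (relevance >= threshold).any(axis=1)
    return covered, covered.mean()

# Hypothetical relevance matrix: 3 segments (rows) x 3 publications (columns).
relevance = np.array([
    [0.62, 0.05, 0.10],
    [0.08, 0.12, 0.09],
    [0.03, 0.55, 0.41],
])

# Illustrative threshold, not a value taken from the article.
covered, share = coverage_report(relevance, threshold=0.30)
print(covered.tolist(), round(share, 3))
```

With these made-up numbers the second segment has no publication at or above the threshold, so it would be reported as not covered.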
Important aspects of using latent semantic analysis were studied for the tasks of identifying scientific subject spaces and determining how completely the results of dissertation research by degree seekers are covered.
License
Copyright (c) 2020 Petro Lizunov, Andrii Biloshchytskyi, Alexander Kuchansky, Yurii Andrashko, Svitlana Biloshchytska
This work is licensed under a Creative Commons Attribution 4.0 International License.