Development of a computer system for generating semantic template of a group of documents by using latent semantic analysis
DOI:
https://doi.org/10.15587/1729-4061.2016.73551
Keywords:
computer system, latent semantic analysis, resolution, frequency matrix, semantic distance
Abstract
A computer system (CS) was developed in the Python programming language to generate a semantic template of a group of documents by the latent semantic analysis (LSA) method. The system contains eight software modules, each performing one stage of the LSA. Two of the modules are unique: the control module for the word-document frequency matrix and the module that measures the semantic distance between the template documents. The CS is adjusted to the content and structure of the document templates by changing the set of modules.
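As an illustration of the modular design described above, the following minimal sketch chains a few stage functions and builds a word-document frequency matrix. The stage names (`tokenize`, `drop_stop_words`, `frequency_matrix`), the stop-word list, and the `run_pipeline` helper are hypothetical and are not the eight modules of the CS itself; they only show how swapping entries in a list of stages changes the processing.

```python
from collections import Counter


def tokenize(docs):
    """Split each document into lowercase words (hypothetical stage)."""
    return [doc.lower().split() for doc in docs]


def drop_stop_words(token_lists, stop_words=frozenset({"of", "the", "a"})):
    """Remove stop words; the default list is an assumption for this sketch."""
    return [[w for w in tokens if w not in stop_words] for tokens in token_lists]


def frequency_matrix(token_lists):
    """Return (words, matrix) with matrix[i][j] = count of words[i] in document j."""
    words = sorted({w for tokens in token_lists for w in tokens})
    counts = [Counter(tokens) for tokens in token_lists]
    return words, [[c[w] for c in counts] for w in words]


def run_pipeline(data, stages):
    """Apply the stages in order; changing the list changes the processing."""
    for stage in stages:
        data = stage(data)
    return data


docs = ["Latent semantic analysis of the text", "Semantic template of a text"]
words, matrix = run_pipeline(docs, [tokenize, drop_stop_words, frequency_matrix])
for word, row in zip(words, matrix):
    print(word, row)
```

Leaving `drop_stop_words` out of the list keeps "of", "the" and "a" as rows of the matrix, which is the kind of adjustment by module selection that the abstract refers to.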
According to the research, normalizing the frequency matrix enhances the resolution of the semantic template generated by the LSA. It is shown that removing individual words improves the resolution of the generated semantic template without affecting its semantic content. Evaluating the semantic proximity of documents by the cosine of the difference of angles between the vector of a group of basic words and the vectors of the documents increases the resolution of the generated semantic template. To ensure the continuity of the LSA, a frequency matrix analysis module was introduced in the CS; it checks that the number of words is greater than or equal to the number of documents. If this condition is not met, the module removes the inappropriate document and its related words and restarts the LSA process with the new set of words and documents.
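The matrix check and the cosine-based proximity measure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the "inappropriate" document is chosen, so the sketch drops the document with the fewest word occurrences, and it uses the plain cosine of the angle between two vectors as the proximity measure. In a full LSA the vectors would come from the reduced space after singular value decomposition, which is omitted here. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enforce_words_ge_documents(words, matrix):
    """Drop documents until the number of words (rows) >= number of documents (columns).

    Assumption: the document removed at each step is the one with the fewest
    word occurrences; words left with no occurrences are removed with it.
    """
    words = list(words)
    matrix = [row[:] for row in matrix]
    while matrix and len(matrix) < len(matrix[0]):
        n_docs = len(matrix[0])
        col_totals = [sum(row[j] for row in matrix) for j in range(n_docs)]
        worst = col_totals.index(min(col_totals))
        for row in matrix:
            del row[worst]
        # Keep only words that still occur in at least one remaining document.
        kept = [(w, row) for w, row in zip(words, matrix) if any(row)]
        words = [w for w, _ in kept]
        matrix = [row for _, row in kept]
    return words, matrix


# Two words but three documents: the check fails and one document is removed.
words_in = ["analysis", "text"]
matrix_in = [[1, 1, 0],
             [1, 0, 1]]
words_out, matrix_out = enforce_words_ge_documents(words_in, matrix_in)

# Document vectors are the columns of the checked matrix.
doc_vectors = [list(col) for col in zip(*matrix_out)]
basic_words_vector = [1, 1]  # hypothetical vector for the group of basic words
similarities = [cosine(basic_words_vector, d) for d in doc_vectors]
print(words_out, matrix_out, similarities)
```

The loop mirrors the restart described above: each removal yields a new set of words and documents, and the check is repeated before any further processing.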
License
Copyright (c) 2016 Yuriy Taranenko, Maryna Kabanova
This work is licensed under a Creative Commons Attribution 4.0 International License.
The assignment of copyright and the conditions for its transfer (identification of authorship) are set out in the License Agreement. In particular, the authors retain the right of authorship of their manuscript and grant the journal the right of first publication of the work under the terms of the Creative Commons CC BY license. They may also conclude additional agreements on their own for the non-exclusive distribution of the work in the form in which it was published by this journal, provided that the link to the first publication of the article in this journal is preserved.
A license agreement is a document in which the author warrants that he/she owns all copyright for the work (manuscript, article, etc.).
The authors, signing the License Agreement with TECHNOLOGY CENTER PC, have all rights to the further use of their work, provided that they link to our edition in which the work was published.
According to the terms of the License Agreement, the Publisher TECHNOLOGY CENTER PC does not take away your copyrights and receives permission from the authors to use and disseminate the publication through the world's scientific resources (own electronic resources, scientometric databases, repositories, libraries, etc.).
In the absence of a signed License Agreement, or if the agreement lacks identifiers that allow the author to be identified, the editors have no right to work with the manuscript.
It is important to remember that there is another type of agreement between authors and publishers – when copyright is transferred from the authors to the publisher. In this case, the authors lose ownership of their work and may not use it in any way.