Development of the algorithm of keyword search in the Kazakh language text corpus

Akerke Akanova; Nazira Ospanova; Yevgeniya Kukharenko; Gulmira Abildinova

doi:10.15587/1729-4061.2019.179036

Authors

Akerke Akanova, Saken Seifullin Kazakh Agro Technical University, Zhenis ave., 62, Nur-Sultan, Kazakhstan, 010000, https://orcid.org/0000-0002-7178-2121
Nazira Ospanova, S. Toraighyrov Pavlodar State University, Lomova str., 62, Pavlodar, Kazakhstan, 140008, https://orcid.org/0000-0003-0100-1008
Yevgeniya Kukharenko, M. Kozybayev North Kazakhstan State University, Pushkin str., 86, Petropavlovsk, Kazakhstan, 150000, https://orcid.org/0000-0002-9107-2119
Gulmira Abildinova, L. N. Gumilyov Eurasian National University, Satpaev str., 2, Nur-Sultan, Kazakhstan, 010008, https://orcid.org/0000-0002-4262-6197

DOI:

https://doi.org/10.15587/1729-4061.2019.179036

Keywords:

keyword, Porter algorithm, semantic analysis, neural network

Abstract

Semantic text analysis occupies a special place in computational linguistics. Researchers in this field are increasingly interested in algorithms that improve the quality of text corpus processing and the probabilistic determination of text content. A review of the methods, approaches and algorithms for semantic text analysis used in international and Kazakhstani computational linguistics led to the development of an algorithm for keyword search in Kazakh text.

The first step of the algorithm is to compile a reference dictionary of keywords for the Kazakh language text corpus. This was done by applying the Porter algorithm (a stemmer) to the corpus. The stemmer made it possible to extract unique word stems and obtain a reference dictionary, which was subsequently indexed. The next step is to collect training data from the text corpus. To calculate the degree of semantic proximity between words, each word is assigned a vector of the corresponding word forms from the reference dictionary, which yields pairs of a keyword and a vector. The last step is neural network training. Training uses the error backpropagation method, which enables semantic analysis of the text corpus and yields a probabilistic number of words close to the expected number of keywords.

This process automates the processing of text material by creating digital learning models of keywords. The algorithm is used to develop a neurocomputer system that will automatically check the written work of online learners. The uniqueness of the keyword search algorithm lies in its use of neural network training for texts in the Kazakh language.

In Kazakhstan, researchers in computational linguistics have conducted a number of studies based on morphological analysis, lemmatization and other approaches, and have implemented linguistic tools (mainly translation dictionaries). The use of neural network learning for parsing the Kazakh language remains an open issue in Kazakhstani science.

The developed algorithm addresses one of the problems of effective semantic analysis of text in the Kazakh language.
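
The first two steps described in the abstract (stemming, building an indexed reference dictionary, and assigning each text a vector over that dictionary) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the suffix list, the function names `stem`, `build_reference_dictionary` and `to_vector`, and the minimum stem length are all hypothetical, and a real Porter-style stemmer for Kazakh would need a far larger rule set. The neural network training step is not covered here.

```python
# Hypothetical, heavily abbreviated list of Kazakh suffixes (assumption for
# illustration only; it is not taken from the paper).
SUFFIXES = ("лар", "лер", "дар", "дер", "тар", "тер", "да", "де")


def stem(word, min_stem=3):
    """Repeatedly strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


def build_reference_dictionary(tokens):
    """Map each unique stem to an integer index, in order of first appearance."""
    index = {}
    for token in tokens:
        index.setdefault(stem(token), len(index))
    return index


def to_vector(tokens, index):
    """Count how often each reference stem occurs among the given tokens."""
    vector = [0] * len(index)
    for token in tokens:
        position = index.get(stem(token))
        if position is not None:  # stems outside the dictionary are ignored
            vector[position] += 1
    return vector


# Toy corpus: word forms of "кітап" (book) and "мектеп" (school).
corpus = "кітаптар мектептер кітаптарда мектеп".split()
reference = build_reference_dictionary(corpus)
print(reference)
print(to_vector(["кітаптарда", "кітап", "қала"], reference))
```

Vectors of this kind, paired with their keywords, are the sort of training data the abstract describes feeding into the network; how the authors actually encode word forms is not specified on this page.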

Author Biographies

Akerke Akanova, Saken Seifullin Kazakh Agro Technical University Zhenis ave., 62, Nur-Sultan, Kazakhstan, 010000

Master of Informatics

Department of Computer Engineering and Software

Nazira Ospanova, S. Toraighyrov Pavlodar State University Lomova str., 62, Pavlodar, Kazakhstan, 140008

PhD, Associate Professor, Head of Department

Department of Information Technology

Yevgeniya Kukharenko, M. Kozybayev North Kazakhstan State University Pushkin str., 86, Petropavlovsk, Kazakhstan, 150000

PhD, Associate Professor, Head of Department

Department of Information Communication Technology

Gulmira Abildinova, L. N. Gumilyov Eurasian National University Satpaev str., 2, Nur-Sultan, Kazakhstan, 010008

PhD

Department of Information Technology

