STT, text processing, summary, audio file, model, neural networks


The subject matter of the article is the module for converting the speaker's speech into text within the proposed model of automatic annotation of the speaker's speech, which has become increasingly popular in Ukraine over the last two years due to the active transition to online forms of communication and education, as well as to conducting workshops, interviews, and discussions of urgent issues. Furthermore, users of personal educational platforms are not always able to join online meetings on time for various reasons (one example being a blackout), which explains the need to save speakers' presentations as audio files. The goal of the work is the elimination of false or corrupt data in the process of converting the audio sequence into the relevant text for further semantic analysis. To achieve this goal, the following tasks were solved: a generalized model of incoming audio data summarization was proposed; the existing STT models (for turning audio data into text) were analyzed; the ability of the STT module to operate in Ukrainian was studied; and the efficiency and timing of the STT module were evaluated for both English and Ukrainian. The proposed model of automatic annotation of the speaker's speech has two major functional modules: the speech-to-text (STT) module and the summarization (SUM) module. For the STT module, the following models of linguistic text analysis were researched and improved: for English, wav2vec2-xls-r-1b, and for Ukrainian, the Ukrainian STT model (wav2vec2-xls-r-1b-uk-with-lm). Artificial neural networks were used as the mathematical apparatus in the models under consideration. The following results were obtained: the improved module demonstrates a reduction in the word error rate (WER) by almost 1.5 times, which influences the quality of word recognition from the audio and may potentially lead to higher-quality output text data.
In order to estimate the timing of STT module operation, three English and Ukrainian audio recordings of various lengths (~5 s, ~60 s, and ~240 s) were analyzed. The results demonstrated a clear trend: for the longest recording, the output file is obtained faster when the computational power of an NVIDIA Tesla T4 graphics accelerator is applied. Conclusions: the use of a deep neural network at the noise-reduction stage for the input file is justified, as it improves the WER metric by almost 25%, while increasing the computing power of the graphics processor and the number of stream processors provides acceleration only for large input audio files. The authors' further research focuses on studying the efficiency of the summarization module applied to the obtained text.
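The evaluation above relies on the word error rate (WER), the standard metric for speech recognition quality. As a point of reference, WER is the word-level Levenshtein (edit) distance between the reference transcript and the recognized hypothesis, divided by the number of reference words. The sketch below is an illustrative implementation of that standard definition, not the authors' own evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A lower WER means better recognition, so "a reduction in WER by almost 1.5 times" corresponds to a proportionally smaller share of misrecognized words.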

Author Biography

Olesia Barkovska, Kharkiv National University of Radio Electronics

Ph.D (Engineering Sciences), Docent

