Analysis of the influence of selected audio pre-processing stages on accuracy of speaker language recognition




mel-frequency cepstral coefficients (MFCC), spectrogram, time mask, frequency mask, normalization, neural network, voice, audio sequence, speech


The subject matter of this study is the analysis of the influence of audio pre-processing stages on the accuracy of speaker language recognition. The importance of audio pre-processing has grown significantly in recent years due to its key role in applications such as data reduction, filtering, and denoising. Given the growing demand for accurate and efficient audio classification methods, the evaluation and comparison of different audio pre-processing methods becomes an important part of determining optimal solutions. The goal of this study is to select the best sequence of audio pre-processing stages for subsequent training of a neural network, for two ways of converting the signal into features: spectrograms and mel-frequency cepstral coefficients (MFCC). To achieve this goal, the following tasks were solved: ways of transforming the signal into features were analyzed, along with mathematical models for analyzing the audio sequence by the selected features. A generalized model of real-time translation of the speaker's speech was then developed, and experiments were planned depending on the selected audio pre-processing stages. Finally, the neural network was trained and tested on the planned experiments. The following methods were used: mel-frequency cepstral coefficients, spectrogram, time mask, frequency mask, and normalization. The following results were obtained: depending on the selected pre-processing stages and the way the signal is converted into features, speech recognition accuracy of up to 93% can be achieved. The practical significance of this work lies in increasing the accuracy of subsequent transcription of audio information and translation of the resulting text into the chosen language, including artificial languages.
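To make the MFCC feature extraction named above concrete, the standard pipeline (framing, power spectrum, mel filterbank, log, DCT-II) can be sketched in plain numpy. This is an illustrative sketch, not the exact implementation used in the study; the frame length, hop size, and filter counts are assumed values.

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_coeffs=13):
    """MFCCs: windowed frames -> power spectrum -> mel filterbank -> log -> DCT-II."""
    window = np.hanning(n_fft)
    frames = np.array([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_fft=n_fft, sr=sr).T
    log_mel = np.log(mel_energy + 1e-10)
    # DCT-II over the filterbank axis; keep only the lowest n_coeffs coefficients.
    n = log_mel.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(n_coeffs)[:, None])
    return log_mel @ basis.T  # shape: (time_frames, n_coeffs)

# Example: one second of a 440 Hz tone at 16 kHz yields a (61, 13) feature matrix.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
print(mfcc(tone).shape)
```

The resulting matrix of coefficients per frame is the kind of input that would be fed to the neural network discussed in the paper.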
Conclusions: In the course of the work, the best sequence of audio pre-processing stages was selected for subsequent training of the neural network under different ways of converting signals into features. Mel-frequency cepstral coefficients proved better suited for the problem at hand. Since the results strongly depend on the network structure, they may change as the volume of input data and the number of languages grow; at this stage, however, it was decided to use only mel-frequency cepstral coefficients with normalization.
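The remaining pre-processing stages named in the abstract (spectrogram, normalization, time mask, frequency mask) can likewise be sketched with numpy. The mask widths and STFT parameters below are assumptions for illustration; the masking follows the common SpecAugment-style scheme of zeroing random contiguous blocks, which may differ in detail from the authors' pipeline.

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(signal[s:s + n_fft] * window))
              for s in range(0, len(signal) - n_fft + 1, hop)]
    return np.array(frames).T  # shape: (freq_bins, time_frames)

def normalize(spec):
    """Zero-mean, unit-variance normalization over the whole spectrogram."""
    return (spec - spec.mean()) / (spec.std() + 1e-8)

def time_mask(spec, max_width=20, rng=None):
    """Zero out a random contiguous block of time frames."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    width = rng.integers(1, max_width + 1)
    start = rng.integers(0, max(1, spec.shape[1] - width))
    spec[:, start:start + width] = 0.0
    return spec

def freq_mask(spec, max_width=10, rng=None):
    """Zero out a random contiguous block of frequency bins."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    width = rng.integers(1, max_width + 1)
    start = rng.integers(0, max(1, spec.shape[0] - width))
    spec[start:start + width, :] = 0.0
    return spec

# Example pipeline: spectrogram -> normalization -> time mask -> frequency mask.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
spec = freq_mask(time_mask(normalize(spectrogram(tone))))
print(spec.shape)
```

Reordering or omitting these calls is exactly the kind of experiment-planning variable the study evaluates.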

Author Biographies

Olesia Barkovska, Kharkiv National University of Radio Electronics

PhD (Engineering Sciences), Associate Professor, Associate Professor at the Department of Electronic Computers

Anton Havrashenko, Kharkiv National University of Radio Electronics

PhD student at the Department of Electronic Computers







How to Cite

Barkovska, O., & Havrashenko, A. (2023). Analysis of the influence of selected audio pre-processing stages on accuracy of speaker language recognition. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, 4(26), 16–23.