Analysis of the influence of selected audio pre-processing stages on accuracy of speaker language recognition

Olesia Barkovska; Anton Havrashenko

doi:10.30837/ITSSI.2023.26.016

Authors

Olesia Barkovska Kharkiv National University of Radio Electronics, Ukraine https://orcid.org/0000-0001-7496-4353
Anton Havrashenko Kharkiv National University of Radio Electronics, Ukraine https://orcid.org/0000-0002-8802-0529

DOI:

https://doi.org/10.30837/ITSSI.2023.26.016

Keywords:

mel-cepstral characteristic coefficients, spectrogram, time mask, frequency mask, normalization, neural network, voice, audio series, speech

Abstract

The subject matter of the study is the influence of audio pre-processing stages on the accuracy of speaker language recognition. The importance of audio pre-processing has grown significantly in recent years because of its key role in applications such as data reduction, filtering, and denoising. Given the growing demand for accurate and efficient methods of classifying audio information, evaluating and comparing audio pre-processing methods is an important part of determining optimal solutions. The goal of this study is to select the best sequence of audio pre-processing stages for subsequent training of a neural network, for different ways of converting signals into features, namely spectrograms and mel-cepstral characteristic coefficients. To achieve this goal, the following tasks were solved: the ways of transforming a signal into features were analyzed, together with the mathematical models for analyzing an audio series by the selected features; a generalized model of real-time translation of the speaker's speech was developed; the experiment was planned according to the selected audio pre-processing stages; and the neural network was trained and tested for the planned experiments. The following methods were used: mel-cepstral characteristic coefficients, spectrogram, time mask, frequency mask, normalization. The following results were obtained: depending on the selected pre-processing stages for voice information and the way the signal is converted into features, speech recognition accuracy of up to 93% can be achieved. The practical significance of this work is an increase in the accuracy of subsequent transcription of audio information and translation of the resulting text into the chosen language, including artificial languages.
Conclusions: In the course of the work, the best sequence of audio pre-processing stages was selected for subsequent training of the neural network, for different ways of converting signals into features. Mel-cepstral characteristic coefficients proved better suited to the problem. Since the results of the neural network depend strongly on its structure, they may change as the volume of input data and the number of languages increase. At this stage, however, it was decided to use only mel-cepstral characteristic coefficients with normalization.
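
The three pre-processing stages named in the abstract (normalization, time mask, frequency mask) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the zero fill value, the mask widths, and the order of stages are assumptions made for illustration, and the feature matrix is a toy stand-in for a spectrogram or a matrix of mel-cepstral coefficients, which would be computed elsewhere.

```python
import random
import statistics


def normalize(features):
    """Scale a feature matrix to zero mean and unit variance.

    `features` is a list of rows; each row is one frequency bin (or
    cepstral coefficient) and each column is one time frame.
    """
    values = [v for row in features for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # constant input: avoid dividing by zero
    return [[(v - mean) / std for v in row] for row in features]


def time_mask(features, max_width, rng, fill=0.0):
    """Overwrite a random run of consecutive time frames (columns) with `fill`."""
    n_frames = len(features[0])
    width = rng.randint(0, min(max_width, n_frames))
    start = rng.randint(0, n_frames - width)
    return [
        [fill if start <= t < start + width else v for t, v in enumerate(row)]
        for row in features
    ]


def frequency_mask(features, max_width, rng, fill=0.0):
    """Overwrite a random run of consecutive rows (frequency bins) with `fill`."""
    n_bins = len(features)
    width = rng.randint(0, min(max_width, n_bins))
    start = rng.randint(0, n_bins - width)
    return [
        [fill] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(features)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    # Toy 4 x 8 matrix standing in for real spectrogram or MFCC features.
    toy = [[float(f * 10 + t) for t in range(8)] for f in range(4)]
    # Assumed order: normalize first, so the zero fill equals the mean value.
    out = frequency_mask(time_mask(normalize(toy), 2, rng), 1, rng)
    for row in out:
        print(" ".join(f"{v:6.2f}" for v in row))
```

Each function returns a new matrix and leaves its input unchanged, so the stages can be switched on or off independently, which is the kind of comparison the planned experiments describe.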

Author Biographies

Olesia Barkovska, Kharkiv National University of Radio Electronics

PhD (Engineering Sciences), Associate Professor, Associate Professor at the Department of Electronic Computers

Anton Havrashenko, Kharkiv National University of Radio Electronics

PhD student at the Department of Electronic Computers

