Analysis of the influence of selected audio pre-processing stages on accuracy of speaker language recognition
DOI: https://doi.org/10.30837/ITSSI.2023.26.016

Keywords: mel-cepstral characteristic coefficients, spectrogram, time mask, frequency mask, normalization, neural network, voice, audio series, speech

Abstract
The subject matter of the study is the analysis of the influence of audio pre-processing stages on the accuracy of speaker language recognition. The importance of audio pre-processing has grown significantly in recent years owing to its key role in applications such as data reduction, filtering, and denoising. Given the growing demand for accurate and efficient classification of audio information, the evaluation and comparison of different audio pre-processing methods becomes an important part of determining optimal solutions. The goal of this study is to select the best sequence of audio pre-processing stages for subsequent training of a neural network, for different ways of converting the signal into features, namely spectrograms and mel-cepstral characteristic coefficients. To achieve this goal, the following tasks were solved: the ways of transforming the signal into features were analysed, as were the mathematical models used to analyse the audio series based on the selected features. A generalized model of real-time translation of the speaker's speech was then developed, and the experiment was planned according to the selected audio pre-processing stages. Finally, the neural network was trained and tested on the planned experiments. The following methods were used: mel-cepstral characteristic coefficients, spectrogram, time mask, frequency mask, and normalization. The following results were obtained: depending on the selected stages of pre-processing of voice information and the way the signal is converted into features, speech recognition accuracy of up to 93% can be achieved. The practical significance of this work lies in increasing the accuracy of subsequent transcription of audio information and translation of the resulting text into the chosen language, including artificial languages. Conclusions: in the course of the work, the best sequence of audio pre-processing stages was selected for use in further training of the neural network for different ways of converting the signal into features. Mel-cepstral characteristic coefficients proved better suited to the problem considered here. Since the performance of the neural network depends strongly on its structure, the results may change as the volume of input data and the number of languages grow, but at this stage it was decided to use only mel-cepstral characteristic coefficients with normalization.
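The paper itself does not publish code; as an illustration of the pre-processing steps named above, the Python sketch below (using the librosa library) shows how an audio clip can be converted into either an MFCC matrix or a log-mel spectrogram, normalized per coefficient, and augmented with a time mask and a frequency mask. The sampling rate, number of coefficients, mel-band count, mask widths, and file name are assumptions made for illustration, not values taken from the study.

# Sketch of the pre-processing stages named in the abstract: feature extraction
# (MFCC vs. log-mel spectrogram), per-coefficient normalization, and time/frequency
# masking. All parameter values are illustrative assumptions, not the paper's settings.
import numpy as np
import librosa


def extract_features(path, kind="mfcc", sr=16000):
    """Load an audio file and convert it to a (features, frames) matrix."""
    y, sr = librosa.load(path, sr=sr)
    if kind == "mfcc":
        feat = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    else:  # log-mel spectrogram
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        feat = librosa.power_to_db(mel, ref=np.max)
    return feat


def normalize(feat, eps=1e-8):
    """Per-coefficient mean/variance normalization over the time axis."""
    mean = feat.mean(axis=1, keepdims=True)
    std = feat.std(axis=1, keepdims=True)
    return (feat - mean) / (std + eps)


def apply_masks(feat, freq_width=8, time_width=20, rng=None):
    """Zero out one random frequency band (frequency mask) and one random
    span of frames (time mask); used only during training as augmentation."""
    rng = rng or np.random.default_rng()
    feat = feat.copy()
    n_freq, n_time = feat.shape
    f0 = rng.integers(0, max(1, n_freq - freq_width))
    t0 = rng.integers(0, max(1, n_time - time_width))
    feat[f0:f0 + freq_width, :] = 0.0
    feat[:, t0:t0 + time_width] = 0.0
    return feat


# Example pipeline for one training clip (file name is hypothetical):
# feat = apply_masks(normalize(extract_features("clip.wav", kind="mfcc")))

The conclusion of the abstract corresponds to the "mfcc" branch followed by normalize(), without the masking step, applied before the features are fed to the neural network.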