Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level
DOI: https://doi.org/10.15587/1729-4061.2022.252801

Keywords
ASR, transfer learning, end-to-end, low-resource language, connectionist temporal classification, attention

Abstract
Ensuring the best quality and performance of modern speech technologies today relies on the widespread use of machine learning methods. The aim of this project is to study and implement an end-to-end automatic speech recognition system using machine learning methods, as well as to develop new mathematical models and algorithms for automatic speech recognition of agglutinative (Turkic) languages.
Many research papers have shown that deep learning methods simplify the training of automatic speech recognition systems built on an end-to-end approach. Such a system can also be trained directly on raw signals, without manual feature engineering. Despite good recognition quality, this approach has a significant drawback: it requires a large amount of training data. This is a serious problem for low-resource languages, especially Turkic languages such as Kazakh and Azerbaijani. Various methods can be applied to address it; some exploit the fact that the languages belong to the same (agglutinative) language family. For low-resource languages, the suitable method is transfer learning; for large resources, multi-task learning. To increase efficiency and quickly overcome the limited-resource problem, transfer learning was applied to the end-to-end model. The transfer learning method made it possible to adapt a model trained on the Kazakh dataset to the Azerbaijani dataset; thereby, the two language corpora were trained together. Experiments on the two corpora show that transfer learning reduces the phoneme error rate (PER) by 14.23 % compared to baseline models (DNN+HMM, WaveNet, and CTC+LM). Therefore, the implemented model with transfer learning can be used to recognize other low-resource languages.
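As an illustration only (not code from the paper), the transfer-learning recipe described in the abstract can be sketched as follows: the encoder of a model trained on the source language (Kazakh) is reused, while the language-specific output layer is re-initialized for the target language (Azerbaijani). All class names, dimensions, and alphabet sizes below are hypothetical toy values, not the authors' actual architecture.

```python
import random

class TinyASRModel:
    """Toy stand-in for an end-to-end ASR network (illustrative only)."""
    def __init__(self, feat_dim, num_symbols, seed=0):
        rng = random.Random(seed)
        # "Encoder": a single weight matrix mapping acoustic features
        # to an 8-dimensional hidden vector.
        self.encoder = [[rng.uniform(-1, 1) for _ in range(feat_dim)]
                        for _ in range(8)]
        # Language-specific output layer: hidden vector -> symbol scores.
        self.output = [[rng.uniform(-1, 1) for _ in range(8)]
                       for _ in range(num_symbols)]

def transfer(source_model, target_num_symbols, seed=1):
    """Reuse the source-language encoder; re-initialize only the output layer."""
    rng = random.Random(seed)
    target = TinyASRModel.__new__(TinyASRModel)
    # Copy the pretrained encoder weights (shared representation).
    target.encoder = [row[:] for row in source_model.encoder]
    # Fresh output layer sized for the target-language alphabet.
    target.output = [[rng.uniform(-1, 1) for _ in range(8)]
                     for _ in range(target_num_symbols)]
    return target

# Source model "trained" on Kazakh, then transferred to Azerbaijani
# (symbol-set sizes are made up for the example).
kazakh = TinyASRModel(feat_dim=40, num_symbols=42)
azerbaijani = transfer(kazakh, target_num_symbols=32)

assert azerbaijani.encoder == kazakh.encoder   # pretrained weights reused
assert len(azerbaijani.output) == 32           # new target alphabet
```

In practice the copied encoder would then be fine-tuned on the small target-language corpus rather than kept frozen; the sketch only shows the weight-reuse step that makes training with limited data feasible.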
Supporting Agency
- This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP09259309).
License
Copyright (c) 2022 Orken Mamyrbayev, Keylan Alimhan, Dina Oralbekova, Akbayan Bekarystankyzy, Bagashar Zhumazhanov
This work is licensed under a Creative Commons Attribution 4.0 International License.
The consolidation of copyright and the conditions for its transfer (identification of authorship) are set out in the License Agreement. In particular, the authors retain authorship of their manuscript and grant the journal the right of first publication of this work under the terms of the Creative Commons CC BY license. They also have the right to conclude, on their own, additional agreements for the non-exclusive distribution of the work in the form in which it was published by this journal, provided that a link to the first publication of the article in this journal is preserved.
A license agreement is a document in which the author warrants that he/she owns all copyright to the work (manuscript, article, etc.).
By signing the License Agreement with TECHNOLOGY CENTER PC, the authors retain all rights to the further use of their work, provided that they link to the edition in which the work was published.
Under the terms of the License Agreement, the Publisher TECHNOLOGY CENTER PC does not take away the authors' copyrights; it receives permission from the authors to use and disseminate the publication through the world's scientific resources (its own electronic resources, scientometric databases, repositories, libraries, etc.).
In the absence of a signed License Agreement, or in the absence in this agreement of identifiers allowing the author's identity to be established, the editors have no right to work with the manuscript.
It is important to remember that there is another type of agreement between authors and publishers, in which copyright is transferred from the authors to the publisher. In that case, the authors lose ownership of their work and may not use it in any way.