Development of audio-visual speech recognition system

DOI:

https://doi.org/10.15587/2313-8416.2017.118212

Keywords:

audiovisual system, hidden Markov models, viseme, coupled hidden Markov models

Abstract

A model of an audio-visual speech recognition system based on hidden Markov models is proposed, which allows speech to be recognized in real time. The model provides a speech recognition tool that can be used in conditions where other means may not be applicable, for example, in the absence of an audio component. The model is studied and tested on the task of digit recognition, and the expected results are obtained.
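The idea of recognizing digits from two streams, and of falling back to the visual stream when audio is absent, can be illustrated with a minimal sketch. This is not the authors' implementation: the keywords indicate coupled hidden Markov models, whereas the sketch below assumes a simpler decision-level fusion of two independent discrete HMMs per digit. All names (`forward_log_likelihood`, `classify`, `audio_weight`) and all toy parameters are hypothetical.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (scaled forward algorithm)."""
    n = len(start)
    # Initialise with the first observation symbol.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        # Rescale to avoid underflow; the log of the scales sums to log P(obs).
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob


def classify(audio_obs, visual_obs, models, audio_weight=0.7):
    """Pick the label with the highest weighted sum of stream log-likelihoods.

    Assumption: when audio_obs is None or empty, only the visual stream is scored.
    """
    best_label, best_score = None, float("-inf")
    for label, (audio_hmm, visual_hmm) in models.items():
        visual_ll = forward_log_likelihood(visual_obs, *visual_hmm)
        if audio_obs:
            audio_ll = forward_log_likelihood(audio_obs, *audio_hmm)
            score = audio_weight * audio_ll + (1.0 - audio_weight) * visual_ll
        else:
            score = visual_ll
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy two-state left-to-right HMMs over a two-symbol alphabet (made-up values).
TRANS = [[0.5, 0.5], [0.0, 1.0]]
HMM_A = ([1.0, 0.0], TRANS, [[0.9, 0.1], [0.1, 0.9]])  # favours 0 ... 1
HMM_B = ([1.0, 0.0], TRANS, [[0.1, 0.9], [0.9, 0.1]])  # favours 1 ... 0
MODELS = {"one": (HMM_A, HMM_A), "two": (HMM_B, HMM_B)}

print(classify([0, 0, 1, 1], [0, 1, 1], MODELS))  # both streams available
print(classify(None, [1, 1, 0], MODELS))          # audio component absent
```

In a real system the observation symbols would come from quantized acoustic features and viseme-related lip features, and a coupled HMM would model the two streams jointly rather than score them independently as done here.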

Author Biographies

Alexandr Gornostal, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute» Peremohy ave., 37, Kyiv, Ukraine, 03056

Department of Automation and Control in Technical Systems

Yaroslaw Dorogyy, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute» Peremohy ave., 37, Kyiv, Ukraine, 03056

PhD, Associate Professor

Department of Automation and Control in Technical Systems

Published

2017-12-30

Section

Technical Sciences