Enhancing skeleton-based action recognition with hybrid real and GAN-generated datasets
DOI:
https://doi.org/10.15587/1729-4061.2024.317092
Keywords:
action recognition, convolutional neural network, generative adversarial networks, LSTM
Abstract
This research addresses the critical challenge of recognizing mutual actions involving multiple individuals, an important task for applications such as video surveillance, human-computer interaction, autonomous systems, and behavioral analysis. Identifying these actions from 3D skeleton motion sequences is difficult because it requires accurately capturing intricate spatial and temporal patterns in diverse, dynamic, and often unpredictable environments. To tackle this, a robust neural network framework was developed that combines Convolutional Neural Networks (CNNs) for efficient spatial feature extraction with Long Short-Term Memory (LSTM) networks to model temporal dependencies over extended sequences. A distinguishing feature of this study is the creation of a hybrid dataset that combines real-world skeleton motion data with synthetic samples produced by Generative Adversarial Networks (GANs). This dataset enriches variability, enhances generalization, and mitigates data scarcity. Experimental findings across three different network architectures demonstrate that our method significantly improves recognition accuracy, mainly due to the integration of CNNs and LSTMs alongside the broadened dataset. Our approach successfully identifies complex interactions and delivers consistent performance across different viewpoints and environmental conditions. This improved reliability indicates that the framework can be effectively deployed in practical applications such as security systems, crowd monitoring, and other areas where precise detection of mutual actions is critical, particularly in real-time and dynamic environments.
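The abstract's pipeline (per-frame spatial feature extraction with a CNN, followed by an LSTM over the frame features and a classifier head) can be sketched in miniature with NumPy. This is an illustrative sketch only, not the authors' architecture: the dimensions (25 joints per person, 2 persons, 30 frames), the single conv layer, and the single LSTM cell are all assumptions made for the example, with random untrained weights.

```python
import numpy as np

# Assumed dimensions (not from the paper): 2 persons x 25 joints x 3 coords
# flattened to 150 features per frame, over a T = 30 frame sequence.
T, F, H = 30, 150, 64

rng = np.random.default_rng(0)
seq = rng.standard_normal((T, F))  # one skeleton motion sequence

# --- CNN stage: a 1-D temporal convolution extracting local features ---
K = 3  # kernel width over time
W_conv = rng.standard_normal((K, F, H)) * 0.01

def conv1d(x):
    Tn = x.shape[0] - K + 1  # "valid" convolution: output length T - K + 1
    out = np.zeros((Tn, H))
    for t in range(Tn):
        for k in range(K):
            out[t] += x[t + k] @ W_conv[k]
    return np.maximum(out, 0.0)  # ReLU

feats = conv1d(seq)  # shape (28, 64)

# --- LSTM stage: one LSTM cell unrolled over the feature sequence ---
Wx = rng.standard_normal((H, 4 * H)) * 0.01
Wh = rng.standard_normal((H, 4 * H)) * 0.01
b = np.zeros(4 * H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.zeros(H)
c = np.zeros(H)
for x_t in feats:
    gates = x_t @ Wx + h @ Wh + b
    i, f, g, o = np.split(gates, 4)          # input, forget, cell, output gates
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c + i * np.tanh(g)               # cell state update
    h = o * np.tanh(c)                       # hidden state update

# --- classifier head over the final hidden state ---
n_classes = 11  # e.g. the NTU RGB+D mutual-action classes (assumption)
W_out = rng.standard_normal((H, n_classes)) * 0.01
logits = h @ W_out
pred = int(np.argmax(logits))
```

In a real system the convolution would operate over the joint graph as well as time, the weights would be trained end to end, and the GAN-generated sequences would simply be mixed into the training batches alongside real ones.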
References
- Pareek, P., Thakkar, A. (2020). A survey on video-based Human Action Recognition: recent updates, datasets, challenges, and applications. Artificial Intelligence Review, 54 (3), 2259–2322. https://doi.org/10.1007/s10462-020-09904-8
- Cermeño, E., Pérez, A., Sigüenza, J. A. (2018). Intelligent video surveillance beyond robust background modeling. Expert Systems with Applications, 91, 138–149. https://doi.org/10.1016/j.eswa.2017.08.052
- Fang, M., Chen, Z., Przystupa, K., Li, T., Majka, M., Kochan, O. (2021). Examination of Abnormal Behavior Detection Based on Improved YOLOv3. Electronics, 10 (2), 197. https://doi.org/10.3390/electronics10020197
- Hejazi, S. M., Abhayaratne, C. (2022). Handcrafted localized phase features for human action recognition. Image and Vision Computing, 123, 104465. https://doi.org/10.1016/j.imavis.2022.104465
- Yan, S., Xiong, X., Arnab, A., Lu, Z., Zhang, M., Sun, C., Schmid, C. (2022). Multiview Transformers for Video Recognition. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3323–3333. https://doi.org/10.1109/cvpr52688.2022.00333
- Tong, Z., Song, Y., Wang, J., Wang, L. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. arXiv. https://arxiv.org/abs/2203.12602
- Hochreiter, S., Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9 (8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Soltau, H., Liao, H., Sak, H. (2017). Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition. Interspeech 2017. https://doi.org/10.21437/interspeech.2017-1566
- Kozhirbayev, Z., Yessenbayev, Z., Karabalayeva, M. (2017). Kazakh and Russian Languages Identification Using Long Short-Term Memory Recurrent Neural Networks. 2017 IEEE 11th International Conference on Application of Information and Communication Technologies (AICT), 1–5. https://doi.org/10.1109/icaict.2017.8687095
- Lu, Y., Lu, C., Tang, C.-K. (2017). Online Video Object Detection Using Association LSTM. 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2017.257
- Huang, R., Zhang, W., Kundu, A., Pantofaru, C., Ross, D. A., Funkhouser, T., Fathi, A. (2020). An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds. Computer Vision – ECCV 2020, 266–282. https://doi.org/10.1007/978-3-030-58523-5_16
- Yuan, Y., Liang, X., Wang, X., Yeung, D.-Y., Gupta, A. (2017). Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection. 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2017.200
- Zhang, B., Yu, J., Fifty, C., Han, W., Dai, A. M., Pang, R., Sha, F. (2021). Co-training Transformer with Videos and Images Improves Action Recognition. arXiv. https://doi.org/10.48550/arxiv.2112.07175
- Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z. et al. (2022). InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv. https://doi.org/10.48550/arXiv.2212.03191
- Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S. et al. (2017). The Kinetics Human Action Video Dataset. arXiv. https://doi.org/10.48550/arXiv.1705.06950
- Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A. (2018). A short note about kinetics-600. arXiv. https://doi.org/10.48550/arXiv.1808.01340
- Carreira, J., Noland, E., Hillier, C., Zisserman, A. (2019). A short note on the kinetics-700 human action dataset. arXiv. https://doi.org/10.48550/arXiv.1907.06987
- Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H. et al. (2017). The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. 2017 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2017.622
- Heilbron, F. C., Escorcia, V., Ghanem, B., Niebles, J. C. (2015). ActivityNet: A large-scale video benchmark for human activity understanding. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2015.7298698
- Zhao, H., Torralba, A., Torresani, L., Yan, Z. (2019). HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 8667–8677. https://doi.org/10.1109/iccv.2019.00876
- Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T. (2011). HMDB: A large video database for human motion recognition. 2011 International Conference on Computer Vision, 2556–2563. https://doi.org/10.1109/iccv.2011.6126543
- Shahroudy, A., Liu, J., Ng, T.-T., Wang, G. (2016). NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.115
- Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., Kot, A. C. (2020). NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42 (10), 2684–2701. https://doi.org/10.1109/tpami.2019.2916873
- Degardin, B., Neves, J., Lopes, V., Brito, J., Yaghoubi, E., Proenca, H. (2022). Generative Adversarial Graph Convolutional Networks for Human Action Synthesis. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2753–2762. https://doi.org/10.1109/wacv51458.2022.00281
- Caetano, C., Sena, J., Bremond, F., Dos Santos, J. A., Schwartz, W. R. (2019). SkeleMotion: A New Representation of Skeleton Joint Sequences based on Motion Information for 3D Action Recognition. 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). https://doi.org/10.1109/avss.2019.8909840
- Groleau, G. A., Tso, E. L., Olshaker, J. S., Barish, R. A., Lyston, D. J. (1993). Baseball bat assault injuries. The Journal of Trauma: Injury, Infection, and Critical Care, 34 (3), 366–372. https://doi.org/10.1097/00005373-199303000-00010
- DegardinBruno/Kinetic-GAN. GitHub. Available at: https://github.com/DegardinBruno/Kinetic-GAN
- Li, C., Zhong, Q., Xie, D., Pu, S. (2017). Skeleton-based action recognition with convolutional neural networks. 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 597–600. https://doi.org/10.1109/icmew.2017.8026285
- Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A. C., Bengio, Y. (2013). Maxout networks. arXiv. https://doi.org/10.48550/arXiv.1302.4389
- Zheng, W., Li, L., Zhang, Z., Huang, Y., Wang, L. (2019). Relational Network for Skeleton-Based Action Recognition. 2019 IEEE International Conference on Multimedia and Expo (ICME), 826–831. https://doi.org/10.1109/icme.2019.00147
License
Copyright (c) 2024 Talgat Islamgozhayev, Beibut Amirgaliyev, Zhanibek Kozhirbayev
This work is licensed under a Creative Commons Attribution 4.0 International License.
The consolidation and conditions for the transfer of copyright (identification of authorship) are set out in the License Agreement. In particular, the authors retain authorship of their manuscript and grant the journal the right of first publication of the work under the terms of the Creative Commons CC BY license. They may also enter into separate, additional agreements for the non-exclusive distribution of the work in the form in which it was published by this journal, provided that a link to the first publication of the article in this journal is preserved.
A license agreement is a document in which the author warrants that he or she owns all copyright in the work (manuscript, article, etc.).
By signing the License Agreement with TECHNOLOGY CENTER PC, the authors retain all rights to the further use of their work, provided that they link to the edition in which the work was published.
Under the terms of the License Agreement, the Publisher TECHNOLOGY CENTER PC does not take over the authors' copyright; it receives permission from the authors to use and disseminate the publication through the world's scientific resources (its own electronic resources, scientometric databases, repositories, libraries, etc.).
In the absence of a signed License Agreement, or of identifiers allowing the author's identity to be established, the editors may not work with the manuscript.
It is important to remember that there is another type of agreement between authors and publishers, under which copyright is transferred from the authors to the publisher. In that case, the authors lose ownership of their work and may not use it in any way.