Comparative analysis of modality alignment algorithms in multimodal transformers for sound synthesis

Authors

Vadym Mukhin, Yaroslav Khablo

DOI:

https://doi.org/10.30837/2522-9818.2025.2.049

Keywords:

multimodal transformers; modality alignment; feature projection; contrastive learning; cross-attention.

Abstract

Subject matter: This research focuses on the use of multimodal transformers for high-quality sound synthesis. By integrating heterogeneous data sources such as audio, text, images, and video, it aims to address the inherent challenges of accurate modality alignment.

Goal: The primary goal is to conduct a comprehensive analysis of various modality alignment algorithms in order to assess their effectiveness, computational efficiency, and practical applicability in sound synthesis tasks.

Tasks: The core tasks include investigating feature projection, contrastive learning, cross-attention mechanisms, and dynamic time warping for modality alignment; evaluating alignment accuracy, computational overhead, and robustness under diverse operational conditions; and benchmarking performance using standardized datasets and metrics such as Cross-Modal Retrieval Accuracy (CMRA), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG).

Methods: The study adopts both quantitative and qualitative approaches. Quantitative methods entail empirical evaluations of alignment precision and computational cost, whereas qualitative analysis focuses on the perceptual quality of synthesized audio. Standardized data preprocessing and evaluation protocols ensure the reliability and reproducibility of the findings.

Results: The analysis reveals that contrastive learning and cross-attention mechanisms achieve high alignment precision but demand considerable computational resources. Feature projection and dynamic time warping offer greater efficiency at the expense of some fine-grained detail. Hybrid approaches, combining the strengths of these methods, show potential for balanced performance across varied use cases.

Conclusions: This research deepens understanding of how multimodal transformers can advance robust and efficient sound synthesis. By clarifying the benefits and limitations of each alignment strategy, it provides a foundation for developing adaptive systems that tailor alignment methods to specific data characteristics. Future work could extend these insights by exploring real-time applications and broadening the range of input modalities.
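
For readers who want a concrete picture of the alignment strategies compared above, the following is a minimal PyTorch sketch of feature projection combined with a CLIP-style contrastive loss between audio and text embeddings. All module names, dimensionalities, and the temperature value are assumptions chosen for illustration; this is a sketch of the general technique, not the authors' implementation.

```python
# Minimal sketch: feature projection + CLIP-style contrastive alignment
# between audio and text embeddings. Dimensions and the temperature value
# are illustrative assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    def __init__(self, audio_dim=512, text_dim=768, shared_dim=256, temperature=0.07):
        super().__init__()
        # Feature projection: map each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.temperature = temperature

    def forward(self, audio_feats, text_feats):
        # L2-normalize so the dot product equals cosine similarity.
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        # Similarity of every audio clip to every text in the batch.
        logits = a @ t.T / self.temperature
        # Matched pairs sit on the diagonal; contrast them against the rest.
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

aligner = ContrastiveAligner()
loss = aligner(torch.randn(8, 512), torch.randn(8, 768))  # dummy batch of 8 pairs
loss.backward()
```

The retrieval metrics named in the abstract can likewise be stated compactly. The sketch below computes MRR from the 1-based rank of the correct item per query and NDCG@k from graded relevance scores of the retrieved list; the input conventions are assumptions for illustration.

```python
# Hedged sketch of two retrieval metrics from the abstract: MRR and NDCG@k.
# Input conventions (1-based ranks, graded relevance lists) are assumptions.
import numpy as np

def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct item for each query."""
    return float(np.mean([1.0 / r for r in ranks]))

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    # Log-discounted gain: positions 1, 2, 3, ... are divided by
    # log2(2), log2(3), log2(4), ...
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))

def ndcg_at_k(relevances, k=10):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(mean_reciprocal_rank([1, 2, 5]))     # ~0.567
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6))  # ~0.961
```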

Author Biographies

Vadym Mukhin, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

Doctor of Sciences (Engineering), Professor, Chair of the System Design Department

Yaroslav Khablo, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

PhD Student at the Department of System Design

References

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need". NeurIPS. DOI: https://doi.org/10.48550/arXiv.1706.03762

Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). "Rethinking Attention with Performers". ICLR. DOI: https://doi.org/10.48550/arXiv.2009.14794

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL. DOI: https://doi.org/10.48550/arXiv.1810.04805

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models from Natural Language Supervision". ICML. DOI: https://doi.org/10.48550/arXiv.2103.00020

Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2022). "AudioCLIP: Extending CLIP to Image, Text and Audio". ICASSP. DOI: https://doi.org/10.48550/arXiv.2106.13043

Mahmud, T., Mo, S., Tian, Y., & Marculescu, D. (2024). "MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers". CVPR Workshops, pp. 7996–8005. DOI: https://doi.org/10.48550/arXiv.2406.04930

Zhang, Q.-L., & Yang, Y.-B. (2021). "ResT: An Efficient Transformer for Visual Recognition". NeurIPS. DOI: https://doi.org/10.48550/arXiv.2105.13677

Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2018). "Multimodal Machine Learning: A Survey and Taxonomy". IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), pp. 423–443. DOI: https://doi.org/10.1109/TPAMI.2018.2798607

Sachidananda, V., Tseng, S.-Y., Marchi, E., Kajarekar, S., & Georgiou, P. (2022). "CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations". arXiv preprint. DOI: https://doi.org/10.48550/arXiv.2202.03587

Akbari, H., Yuan, L., Qian, R., et al. (2021). "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio, and Text". NeurIPS. DOI: https://doi.org/10.48550/arXiv.2104.11178

Ye, H., Huang, D.-A., Lu, Y., Yu, Z., Ping, W., Tao, A., Kautz, J., Han, S., Xu, D., Molchanov, P., & Yin, H. (2024). "X-VILA: Cross-Modality Alignment for Large Language Model". arXiv preprint. DOI: https://doi.org/10.48550/arXiv.2405.19335

Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations". NeurIPS. DOI: https://doi.org/10.48550/arXiv.2006.11477

Alayrac, J.-B., Donahue, J., Luc, P., et al. (2022). "Flamingo: A Visual Language Model for Few-Shot Learning". NeurIPS. DOI: https://doi.org/10.48550/arXiv.2204.14198

Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers". arXiv preprint. DOI: https://doi.org/10.48550/arXiv.1904.10509

Zaheer, M., Guruganesh, G., Dubey, K. A., et al. (2020). "Big Bird: Transformers for Longer Sequences". NeurIPS. DOI: https://doi.org/10.48550/arXiv.2007.14062

Wang, S., Li, B., Khabsa, M., et al. (2020). "Linformer: Self-Attention with Linear Complexity". arXiv preprint. DOI: https://doi.org/10.48550/arXiv.2006.04768

Published

2025-07-08

How to Cite

Mukhin, V., & Khablo, Y. (2025). Comparative analysis of modality alignment algorithms in multimodal transformers for sound synthesis. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, (2(32)), 49–57. https://doi.org/10.30837/2522-9818.2025.2.049