Extending the ImageNET dataset for multimodal text and image learning

Authors

Dmytro Dashenkov, Kirill Smelyakov

DOI:

https://doi.org/10.30837/2522-9818.2025.1.020

Keywords:

multimodal machine learning; image classification; natural language processing; datasets; text metadata.

Abstract

Subject matter: image processing methods for classification and other computer vision tasks using multimodal data, including text descriptions of classes and images.

Goal: development of a multimodal dataset for image classification using textual meta-information analysis. The resulting dataset consists of image data; image classes, namely the 1000 classes of objects depicted in photos from the ImageNet set; textual descriptions of individual images; and textual descriptions of image classes as a whole.

Tasks: 1) based on the images of the ImageNet dataset, compile a dataset for training classifier models, with text descriptions of image classes and of individual images; 2) using the obtained dataset, conduct an experiment on training a language neural network to confirm the effectiveness of this approach for the classification problem.

Methods: manual dataset compilation and training of language neural networks based on the RoBERTa architecture. The neural network was trained by fine-tuning, namely adding a neural network layer to an existing model to obtain a new machine learning model capable of performing the selected task.

Results: the result of the work is a dataset that combines image data with text data. The dataset is useful for establishing a connection between the information a machine learning model can extract from photos and the information it can extract from text data. The multimodal approach can be applied to a wide range of problems, as demonstrated by the example of training a language neural network: the trained language model processes the image descriptions contained in the dataset and predicts the class of the image to which each description belongs. The model is designed to filter out irrelevant text metadata, improving the quality of the dataset.
Conclusions: datasets that combine multiple types of data can provide a broader context for problems typically associated with only one type of data, allowing machine learning methods to be applied more effectively.
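The fine-tuning approach described in the abstract — freezing a pretrained language encoder and training a newly added classification layer that maps an image description to one of the 1000 ImageNet classes — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the RoBERTa encoder is stood in for by a small random embedding so the sketch stays self-contained, and all sizes (vocabulary of 1000, hidden size of 64) are toy values.

```python
import torch
import torch.nn as nn

class DescriptionClassifier(nn.Module):
    """Toy stand-in for RoBERTa fine-tuning: frozen encoder + new trainable head."""

    def __init__(self, vocab_size=1000, hidden_size=64, num_classes=1000):
        super().__init__()
        # Stand-in for the pretrained encoder; kept frozen during fine-tuning.
        self.encoder = nn.Embedding(vocab_size, hidden_size)
        self.encoder.weight.requires_grad = False
        # The newly added layer — the only part that is trained.
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)   # (batch, seq_len, hidden)
        pooled = hidden.mean(dim=1)        # mean-pool over tokens
        return self.head(pooled)           # (batch, num_classes) logits

model = DescriptionClassifier()
token_ids = torch.randint(0, 1000, (2, 16))   # two dummy tokenized descriptions
logits = model(token_ids)
print(logits.shape)   # one score per ImageNet class for each description
```

In an actual setup one would load a pretrained RoBERTa checkpoint in place of the embedding stub and train only the head (or the head plus the top encoder layers) with a cross-entropy loss over the 1000 classes.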

Author Biographies

Dmytro Dashenkov, Kharkiv National University of Radio Electronics

PhD Student at the Software Engineering department

Kirill Smelyakov, Kharkiv National University of Radio Electronics

Doctor of Sciences (Engineering), Professor at the Software Engineering department

References

Mensink T., Verbeek J., Perronnin F., Csurka G. Distance-based image classification: generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013. № 35 (11). P. 2624–2637. DOI: https://doi.org/10.1109/tpami.2013.83

Xu Z., Sun K., Mao J. Research on ResNet101 network chemical reagent label image classification based on transfer learning. IEEE Xplore. 2020. DOI: https://doi.org/10.1109/ICCASIT50869.2020.9368658

Tang X., Zhou C., Chen L., Wen Y. Enhancing medical image classification via augmentation-based pre-training. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2021. DOI: https://doi.org/10.1109/bibm52615.2021.9669817

Dao H.N., Nguyen T., Mugisha C., Paik I. A multimodal transfer learning approach using PubMedCLIP for medical image classification. IEEE Access. 2024. № 12. P. 75496–75507. DOI: https://doi.org/10.1109/access.2024.3401777

Ma M., Ma W., Jiao L., Liu X., Liu F., Li L., Yang S. MBSI-Net: multimodal balanced self-learning interaction network for image classification. IEEE Transactions on Circuits and Systems for Video Technology. 2024. № 34 (5). P. 3819–3833. DOI: https://doi.org/10.1109/tcsvt.2023.3322470

Chen Q., Shi Z., Zuo Z., Fu J., Sun Y. Two-stream hybrid attention network for multimodal classification. IEEE International Conference on Image Processing (ICIP). 2021. DOI: https://doi.org/10.1109/icip42928.2021.9506177

"ImageNet". URL: https://www.image-net.org (accessed: 10.10.2024).

Liu Y., Ott M., Goyal N., Du J., Joshi M.S., Chen D., Levy O., Lewis M., Zettlemoyer L., Stoyanov V. RoBERTa: a robustly optimized BERT pretraining approach. arXiv. 2019. DOI: https://doi.org/10.48550/arxiv.1907.11692

Satheesh Kumar N.J., CH A. DRCNN-WS: a novel approach for high-resolution video using recurrent neural networks and walrus search. International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO). 2024. P. 1–6. DOI: https://doi.org/10.1109/icrito61523.2024.10522118

Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., Zhai X., Unterthiner T., Dehghani M., Minderer M., Heigold G., Gelly S., Uszkoreit J., Houlsby N. An image is worth 16x16 words: transformers for image recognition at scale. arXiv. 2021. DOI: https://doi.org/10.48550/arXiv.2010.11929

Bansal A., Kumar K., Singla S. Multimodal deep learning: integrating text and image embeddings with attention mechanism. IEEE Xplore. 2024. DOI: https://doi.org/10.1109/aiiot58432.2024.10574665

Radford A., Kim J.W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., Krueger G., Sutskever I. Learning transferable visual models from natural language supervision. arXiv. 2021. DOI: https://doi.org/10.48550/arXiv.2103.00020

Peng L., Jian S., Li D., Shen S. MRML: multimodal rumor detection by deep metric learning. IEEE Xplore. 2023. DOI: https://doi.org/10.1109/ICASSP49357.2023.10096188

Guo W., Wang J., Wang S. Deep multimodal representation learning: a survey. IEEE Access. 2019. № 7. P. 63373–63394. DOI: https://doi.org/10.1109/access.2019.2916887

Karpathy A., Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017. № 39 (4). P. 664–676. DOI: https://doi.org/10.1109/tpami.2016.2598339

Shin S., Jang J., Jung M., Kim J., Jung Y., Jung H. Construction of a machine learning dataset for multiple AI tasks using Korean commercial multimodal video clips. ICTC. 2020. P. 1264–1266. DOI: https://doi.org/10.1109/ictc49870.2020.9289319

Chen B., Liu J., Li Z., Yang M. Seeking the sufficiency and necessity causal features in multimodal representation learning. arXiv. 2024. DOI: https://doi.org/10.48550/arxiv.2408.16577

Srinivasan K., Raman K., Chen J., Bendersky M., Najork M. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021. P. 2443–2449. DOI: https://doi.org/10.1145/3404835.3463257

Young P., Lai A., Hodosh M., Hockenmaier J. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics. 2014. № 2. P. 67–78. DOI: https://doi.org/10.1162/tacl_a_00166

"Papers with Code – ImageNet Benchmark (Image Classification)". URL: https://paperswithcode.com/sota/image-classification-on-imagenet (accessed: 01.11.2024).

Fang A., Ilharco G., Wortsman M., Wan Y., Shankar V., Dave A., Schmidt L. Data determines distributional robustness in contrastive language image pre-training (CLIP). arXiv. 2022. DOI: https://doi.org/10.48550/arxiv.2205.01397

Chen J., Hu H., Wu H., Jiang Y., Wang C. Learning the best pooling strategy for visual semantic embedding. arXiv. 2020. DOI: https://doi.org/10.48550/arxiv.2011.04305

Published

2025-03-31

How to Cite

Dashenkov, D., & Smelyakov, K. (2025). Extending the ImageNET dataset for multimodal text and image learning. INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES, 1(31), 20–31. https://doi.org/10.30837/2522-9818.2025.1.020