Extending the ImageNet dataset for multimodal text and image learning
DOI: https://doi.org/10.30837/2522-9818.2025.1.020

Keywords: multimodal machine learning; image classification; natural language processing; datasets; text metadata.

Abstract
Subject matter: image processing methods for classification and other computer vision tasks using multimodal data, including textual descriptions of classes and of individual images.

Goal: to develop a multimodal dataset for image classification based on the analysis of textual meta-information. The resulting dataset should combine image data; image classes, namely the 1000 object classes depicted in photos from the ImageNet set; textual descriptions of individual images; and textual descriptions of image classes as a whole.

Tasks: 1) based on the images of the ImageNet dataset, compile a dataset for training classifier models that includes text descriptions of image classes and of individual images; 2) using the obtained dataset, conduct an experiment on training a language neural network to confirm the effectiveness of this approach for the classification problem.

Methods: manual dataset compilation; training of language neural networks based on the RoBERTa architecture. Training was carried out by fine-tuning, namely adding a neural network layer to an existing model to obtain a new machine learning model capable of performing the selected task.

Results: the result of the work is a dataset that combines image data with text data. The dataset is useful for establishing a connection between the information a machine learning model can extract from photos and the information it can extract from text. The multimodal approach can be applied to a wide range of problems, as demonstrated by the example of training a language neural network: the trained model processes the image descriptions contained in the dataset and predicts the class of the image with which each description is associated. The model is designed to filter out irrelevant text metadata, improving the quality of the dataset.

Conclusions: datasets that combine multiple types of data can provide a broader context for solving problems typically associated with only one type of data, allowing machine learning methods to be applied more effectively.
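The fine-tuning method named in the abstract (adding a classification layer on top of a pre-trained language model so it maps an image description to one of the 1000 ImageNet classes) can be illustrated with a minimal sketch. The sketch below assumes the Hugging Face transformers library and the roberta-base checkpoint; the paper does not specify its tooling, and the example description, label index, and hyperparameters are illustrative, not taken from the study.

```python
# Hedged sketch: fine-tuning RoBERTa to predict an ImageNet class from a
# textual image description. Library choice (Hugging Face transformers)
# and all specifics below are assumptions, not the paper's own code.
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

NUM_CLASSES = 1000  # the 1000 ImageNet object classes

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
# RobertaForSequenceClassification attaches a freshly initialized
# classification layer to the pre-trained encoder, matching the
# "add a layer to an existing model" fine-tuning method described above.
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_CLASSES
)

# One illustrative (hypothetical) training example: an image description
# paired with the integer index of its ImageNet-1k class.
batch = tokenizer(
    ["A tabby cat sitting on a windowsill in the sun."],
    return_tensors="pt", padding=True, truncation=True,
)
labels = torch.tensor([281])  # 281 = "tabby, tabby cat" in ImageNet-1k

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss over 1000 classes
outputs.loss.backward()                  # one gradient step (optimizer omitted)
```

Once trained this way, the same model can score arbitrary metadata strings against the class vocabulary, and low-confidence predictions can be discarded, which is one plausible reading of how such a classifier filters out irrelevant text metadata from the dataset.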
