What Makes Multimodal In-Context Learning Work?
Conference paper (Year: 2024)


Abstract

Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, for instance through In-Context Learning (ICL) from a few demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When combined with advanced ICL strategies such as RICES, M-ICL performs no better than a simple majority-voting baseline over the context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at gitlab.com/folbaeni/multimodal-icl
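To make finding (2) concrete, the majority-voting baseline the abstract compares against can be sketched as below. This is an illustrative sketch only, assuming answers are short label-style strings; the function name and data layout are hypothetical and not taken from the paper's code.

```python
from collections import Counter

def majority_vote_baseline(context_examples):
    """Illustrative baseline: ignore the query image entirely and
    predict the most frequent answer among the in-context
    demonstration examples (e.g., those retrieved by RICES)."""
    labels = [answer for _, answer in context_examples]
    # most_common(1) returns [(label, count)]; take the label
    return Counter(labels).most_common(1)[0][0]

# Hypothetical example: three retrieved demonstrations, two answer "cat"
demos = [("img1.jpg", "cat"), ("img2.jpg", "dog"), ("img3.jpg", "cat")]
print(majority_vote_baseline(demos))  # -> cat
```

The point of the comparison is that such a query-independent heuristic matches retrieval-augmented M-ICL, suggesting the model exploits label statistics of the context rather than the multimodal content.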
File under embargo until Wednesday, 19 November 2025.

Dates and versions

hal-04791285 , version 1 (19-11-2024)

License

Copyright (All rights reserved)

Identifiers

Cite

Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski. What Makes Multimodal In-Context Learning Work?. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2024, Seattle, United States. pp.1539-1550, ⟨10.1109/CVPRW63382.2024.00161⟩. ⟨hal-04791285⟩