MMIU

Multimodal Multi-image Understanding for Evaluating Multimodal Large Language Models

Fanqing Meng*,2,1, Jin Wang*,3,1, Chuanhao Li*,1, Quanfeng Lu1,2, Hao Tian4, Jiaqi Liao1, Xizhou Zhu5,1,4, Jifeng Dai5,1, Yu Qiao1, Ping Luo3,1, Kaipeng Zhang†,1, Wenqi Shao†,1

1OpenGVLab, Shanghai AI Laboratory, 2Shanghai Jiao Tong University,
3The University of Hong Kong, 4SenseTime Research, 5Tsinghua University

*Equal contribution
†Corresponding Authors: shaowenqi@pjlab.org.cn, zhangkaipeng@pjlab.org.cn
MMIU

Visualization of MMIU. MMIU contains 77,659 images spanning 7 types of image relationships and 5 image modalities, along with 11,698 multiple-choice questions, providing a comprehensive evaluation of 52 multi-image understanding tasks. Each example is drawn from a task under one of the multi-image relationships. We construct MMIU with a top-down hierarchy: the image relationships of interest are enumerated first, and multiple tasks are then associated with each relationship. The number of tasks for each relationship is denoted in the figure.

🔔News

🔥[2024-08-07] We released the technical report.

Introduction

The capability to process multiple images is crucial for Multimodal Large Language Models, as a single image captures information from a specific angle and moment, limiting the model's ability to understand and reason about the entire scene. Recent multi-image Multimodal Large Language Models (MLLMs) have begun to address this need. However, their evaluation has not kept pace with their development. To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess MLLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind. Our evaluation of 24 popular MLLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension, particularly in tasks involving spatial understanding. Even the most advanced models, such as GPT-4o, achieve only 55.7% accuracy on MMIU. Through multi-faceted analytical experiments, we identify key performance gaps and limitations, providing valuable insights for future model and data improvements. We aim for MMIU to advance the frontier of LVLM research and development, moving us toward achieving sophisticated multimodal multi-image user interactions.

MMIU

Overview

pipeline

An illustration of our data collection process. First, we refine multi-image tasks and collect task data based on cognitive psychology. Then, we standardize these datasets into a uniform format, which we call metadata. Next, we generate multiple-choice samples with both answerable and unanswerable questions from the metadata, using either manually designed rules or GPT-4o. Our benchmark covers capability evaluations across various image types.
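To make the metadata and question-generation step concrete, the snippet below sketches one plausible shape for a standardized record and a multiple-choice sample derived from it for a temporal-ordering task. All field names and the example question are hypothetical illustrations, not the exact MMIU schema.

    # Hypothetical metadata record and derived multiple-choice sample for an
    # MMIU-style temporal-ordering task. Field names and the question text are
    # illustrative assumptions, not the exact MMIU format.

    metadata_record = {
        "task": "temporal_ordering",                      # one of the 52 tasks
        "relation": "temporal",                           # one of the 7 relationships
        "images": ["clip/frame_0.jpg", "clip/frame_1.jpg",
                   "clip/frame_2.jpg", "clip/frame_3.jpg"],
        "annotation": [2, 0, 3, 1],                       # ground-truth chronological order
    }

    mcq_sample = {
        "task": metadata_record["task"],
        "images": metadata_record["images"],
        "question": "Which option gives the correct chronological order of the images?",
        "options": {
            "A": "Image 3, Image 1, Image 4, Image 2",    # matches the annotation above
            "B": "Image 1, Image 2, Image 3, Image 4",
            "C": "Image 4, Image 3, Image 2, Image 1",
            "D": "Image 2, Image 4, Image 1, Image 3",
        },
        "answer": "A",
    }
    # Unanswerable variants are built the same way, with the correct choice
    # marking the question as unanswerable from the given images.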

Comparisons with Existing Benchmarks

The comparison between MMIU and existing multi-image evaluation benchmarks, including Video-MME, MIRB, MUIRBENCH, and MileBench. We summarize the image relationships in previous benchmarks according to the seven categories defined in MMIU. 'Y&N' indicates that MMIU comprises both answerable and unanswerable questions. I, T, V, D, and P represent image, text, video, depth map, and point cloud, respectively. Compared with prior datasets, MMIU involves massive test samples spanning 52 multimodal tasks and 5 modalities, together with comprehensive multi-image analyses based on image relationships, the task map, and supervised fine-tuning (SFT).

comparison

Experiment Results

Leaderboard

Quantitative results for 24 LVLMs across 52 tasks are summarized below. Accuracy is the evaluation metric, and the Overall score is computed across all tasks. The maximum value for each task is bolded in the original table. Note that although InternVL1.5-chat supports multiple image inputs, its training phase did not incorporate multi-image data. The full terms of the task abbreviations can be found in the paper.

Models are grouped into Baseline, Adequate Multi-Image SFT LVLMs, Multi-Image input LVLMs, Single-Image input LVLMs, and Closed-source LVLMs. Because of the number of task columns, the leaderboard is split into two parts: Part 1 covers the Overall score and tasks CR through VidCap, and Part 2 covers tasks GuAR through 3DIR.

Part 1 (Overall and tasks CR through VidCap):

| Model | Overall | CR | ER | FD | FC | SC | VCor | VQA | VGR | FR | HR | I2IR | MIC | PR | S2IR | STD | STS | T2IR | VR | AQA | GAR | MVU | MEV | NIP | TL | TO | VidCap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 31.5 | 32.0 | 27.7 | 27.3 | 30.0 | 30.2 | 29.6 | 49.0 | 76.5 | 29.0 | 28.0 | 27.5 | 29.0 | 30.0 | 37.0 | 51.5 | 50.0 | 26.5 | 31.0 | 32.0 | 30.0 | 29.0 | 30.0 | 28.5 | 30.1 | 29.0 | 27.5 |
| Random | 27.4 | 19.0 | 23.0 | 22.3 | 26.4 | 24.7 | 29.1 | 45.0 | 50.0 | 23.0 | 26.0 | 24.0 | 20.0 | 24.5 | 37.5 | 51.0 | 55.0 | 27.5 | 28.0 | 28.0 | 26.5 | 24.0 | 27.5 | 23.0 | 26.9 | 24.5 | 23.0 |
| GPT-4o | 55.7 | 67.8 | 46.5 | 88.8 | 42.6 | 41.5 | 72.6 | 79.2 | 61.3 | 76.0 | 42.0 | 59.5 | 93.5 | 61.5 | 67.0 | 11.0 | 84.0 | 70.5 | 68.0 | 33.5 | 91.5 | 71.5 | 35.0 | 26.5 | 50.8 | 28.0 | 92.5 |
| Gemini1.5 | 53.4 | 71.0 | 31.8 | 73.5 | 24.3 | 34.9 | 47.3 | 78.8 | 61.0 | 88.0 | 80.0 | 74.0 | 89.0 | 70.5 | 81.5 | 74.0 | 80.0 | 60.5 | 68.0 | 35.5 | 88.0 | 75.0 | 25.0 | 21.0 | 45.6 | 26.5 | 84.0 |
| Claude3.5 | 53.4 | 70.2 | 38.5 | 76.6 | 31.3 | 34.9 | 57.0 | 77.8 | 54.5 | 92.0 | 79.0 | 62.0 | 85.5 | 77.5 | 68.0 | 80.0 | 57.5 | 65.5 | 79.0 | 26.0 | 80.5 | 75.0 | 33.5 | 10.5 | 43.5 | 23.0 | 91.0 |
| Gemini1.0 | 40.2 | 63.2 | 26.5 | 36.6 | 27.5 | 28.3 | 30.3 | 60.8 | 71.0 | 25.0 | 24.5 | 28.0 | 84.0 | 21.0 | 44.0 | 71.0 | 48.0 | 27.0 | 31.5 | 34.5 | 89.0 | 73.5 | 29.0 | 21.5 | 37.3 | 23.5 | 90.0 |
| Mantis | 45.6 | 61.5 | 31.8 | 57.0 | 24.3 | 28.1 | 30.9 | 59.8 | 65.2 | 66.5 | 54.0 | 63.5 | 71.0 | 57.5 | 64.5 | 96.0 | 65.5 | 46.5 | 70.5 | 17.5 | 81.0 | 58.5 | 28.5 | 26.0 | 23.8 | 27.0 | 85.0 |
| Llava-interleave | 32.4 | 29.5 | 24.8 | 26.3 | 23.2 | 26.4 | 25.1 | 48.8 | 49.8 | 23.5 | 25.0 | 28.0 | 57.0 | 21.5 | 33.0 | 63.5 | 54.5 | 25.0 | 26.0 | 24.0 | 27.0 | 49.5 | 29.0 | 23.0 | 25.4 | 27.5 | 32.5 |
| InternVL2 | 50.3 | 77.8 | 41.5 | 62.8 | 24.6 | 25.3 | 35.3 | 82.5 | 59.8 | 93.5 | 47.0 | 85.5 | 92.5 | 82.0 | 73.0 | 19.0 | 77.0 | 54.5 | 83.5 | 22.0 | 86.5 | 68.5 | 33.0 | 20.5 | 26.9 | 25.0 | 88.0 |
| InternVL1.5-chat | 37.4 | 63.7 | 31.0 | 22.6 | 20.3 | 16.3 | 28.3 | 63.2 | 38.5 | 21.0 | 28.0 | 26.5 | 82.5 | 20.5 | 31.5 | 6.0 | 45.5 | 26.5 | 29.5 | 29.5 | 85.0 | 65.0 | 32.0 | 23.5 | 29.0 | 18.5 | 89.0 |
| idefics2-8b | 27.8 | 28.0 | 25.8 | 26.4 | 26.7 | 24.6 | 28.6 | 58.5 | 30.8 | 3.5 | 9.5 | 4.0 | 82.0 | 5.0 | 27.5 | 98.5 | 70.5 | 12.5 | 7.0 | 16.0 | 24.5 | 12.0 | 19.0 | 23.5 | 22.3 | 18.0 | 19.5 |
| deepseek-vl-7b | 24.6 | 2.2 | 22.2 | 29.1 | 23.3 | 28.2 | 29.0 | 49.0 | 65.5 | 20.5 | 25.0 | 25.5 | 72.5 | 21.0 | 30.5 | 65.0 | 54.5 | 25.5 | 31.0 | 0.0 | 6.0 | 0.0 | 0.0 | 27.5 | 31.1 | 15.5 | 2.0 |
| XComposer2-1.8b | 23.5 | 24.5 | 23.0 | 19.1 | 16.4 | 18.4 | 10.0 | 27.8 | 27.5 | 13.0 | 12.0 | 26.0 | 55.5 | 19.5 | 33.5 | 17.0 | 54.0 | 10.5 | 1.5 | 25.0 | 59.5 | 37.0 | 25.5 | 0.0 | 24.4 | 13.0 | 68.5 |
| deepseek-vl-1.3b | 23.2 | 1.2 | 27.5 | 21.4 | 23.1 | 26.7 | 30.0 | 45.2 | 54.8 | 20.5 | 25.0 | 25.5 | 46.0 | 21.0 | 30.5 | 89.0 | 0.0 | 23.0 | 31.0 | 0.0 | 1.0 | 2.5 | 0.0 | 23.0 | 26.4 | 20.0 | 1.0 |
| flamingov2 | 22.3 | 25.5 | 25.8 | 24.6 | 21.6 | 25.0 | 28.2 | 34.5 | 49.0 | 14.5 | 19.0 | 13.5 | 22.5 | 17.5 | 26.0 | 39.0 | 49.0 | 20.0 | 27.5 | 10.0 | 13.5 | 16.5 | 30.0 | 20.0 | 18.7 | 24.5 | 22.5 |
| XComposer2 | 21.9 | 24.0 | 21.0 | 10.8 | 5.8 | 0.0 | 0.0 | 34.2 | 24.0 | 14.5 | 2.5 | 23.0 | 63.5 | 19.0 | 26.0 | 14.5 | 31.0 | 9.5 | 28.5 | 31.5 | 59.5 | 44.0 | 30.0 | 4.5 | 15.5 | 12.0 | 66.0 |
| qwen-chat | 15.9 | 20.5 | 2.5 | 13.3 | 2.5 | 9.9 | 5.9 | 31.2 | 23.8 | 10.5 | 19.5 | 12.5 | 41.0 | 5.5 | 13.5 | 29.5 | 45.0 | 3.0 | 12.0 | 10.0 | 52.5 | 18.5 | 16.5 | 2.5 | 3.6 | 5.5 | 47.0 |
| idefics-9b-instruct | 12.8 | 10.8 | 0.2 | 0.2 | 0.8 | 0.0 | 9.4 | 23.0 | 13.0 | 2.5 | 22.0 | 14.0 | 70.0 | 3.0 | 14.5 | 40.5 | 34.5 | 3.5 | 2.0 | 4.0 | 1.5 | 20.0 | 3.0 | 15.5 | 0.5 | 3.0 | 10.0 |
| qwen-base | 5.2 | 9.2 | 0.5 | 5.7 | 5.8 | 0.5 | 1.0 | 5.0 | 4.5 | 0.0 | 1.0 | 0.0 | 20.5 | 0.0 | 2.5 | 1.0 | 43.0 | 1.0 | 0.0 | 0.0 | 4.5 | 8.5 | 0.5 | 0.0 | 0.0 | 0.0 | 7.5 |
| glm-4v-9b | 27.0 | 32.8 | 16.0 | 31.8 | 8.7 | 9.0 | 4.7 | 59.0 | 55.8 | 31.0 | 7.5 | 19.5 | 82.0 | 23.5 | 24.5 | 81.0 | 67.0 | 25.0 | 30.0 | 7.0 | 59.5 | 53.5 | 10.5 | 5.0 | 25.9 | 10.0 | 76.0 |
| llava-next-vicuna_7b | 22.2 | 22.2 | 9.2 | 11.0 | 9.1 | 7.7 | 10.5 | 37.0 | 23.2 | 7.0 | 16.5 | 8.0 | 66.0 | 5.0 | 23.5 | 88.0 | 42.5 | 13.0 | 14.5 | 5.5 | 51.0 | 42.5 | 9.5 | 10.0 | 17.1 | 6.5 | 66.0 |
| MiniCPM-Llama3-V-2-5 | 21.6 | 41.1 | 11.8 | 13.2 | 8.7 | 5.0 | 11.3 | 47.8 | 38.5 | 7.0 | 3.0 | 6.5 | 77.0 | 7.5 | 18.5 | 41.5 | 41.5 | 10.0 | 5.0 | 0.5 | 70.5 | 51.0 | 13.5 | 4.5 | 17.6 | 5.0 | 83.5 |
| LLaVA-v1.5-7B | 19.2 | 14.1 | 4.2 | 13.7 | 5.8 | 1.9 | 6.9 | 27.3 | 35.0 | 6.5 | 12.5 | 12.5 | 53.0 | 10.0 | 25.5 | 66.5 | 43.0 | 19.0 | 3.5 | 2.5 | 23.5 | 36.5 | 12.0 | 16.5 | 6.7 | 7.0 | 28.0 |
| ShareGPT4V-7B | 18.5 | 16.4 | 5.0 | 10.8 | 6.2 | 9.0 | 2.7 | 34.2 | 28.5 | 4.5 | 10.5 | 3.5 | 57.0 | 4.0 | 12.5 | 55.5 | 44.5 | 13.5 | 5.0 | 5.0 | 26.0 | 38.0 | 14.0 | 15.5 | 10.9 | 6.0 | 25.0 |
| SharedCaptioner | 16.1 | 20.7 | 22.2 | 27.2 | 10.2 | 9.1 | 21.0 | 39.5 | 37.0 | 7.0 | 5.0 | 6.0 | 47.0 | 5.0 | 17.0 | 25.0 | 35.5 | 12.5 | 13.0 | 5.5 | 14.5 | 4.5 | 3.0 | 6.0 | 18.1 | 5.5 | 21.5 |
| Monkey-Chat | 13.7 | 8.4 | 8.0 | 5.9 | 9.2 | 6.7 | 8.1 | 23.5 | 25.3 | 4.5 | 6.0 | 1.5 | 34.5 | 2.0 | 9.0 | 40.5 | 40.5 | 12.0 | 2.5 | 6.5 | 16.5 | 14.5 | 10.0 | 12.5 | 18.1 | 6.5 | 19.5 |

Part 2 (tasks GuAR through 3DIR):

| Model | GuAR | GNAP | TC | VClz | VCo | VO | EVQA | HE | IQASC | ICSC | ISTE | ITRSC | MAR | MR | JPS | 3DE | 3DOD | 3DOT | 3DPE | 3DSR | 3DQA | PT | RPM | SOT | 3DCR | 3DIR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 31.5 | 28.0 | 28.5 | 27.5 | 30.5 | 31.0 | 27.5 | 27.5 | 41.5 | 27.5 | 30.0 | 18.0 | 27.6 | 55.6 | 29.0 | 26.5 | 29.0 | 28.0 | 26.5 | 28.5 | 29.5 | 30.5 | 18.0 | 28.0 | 26.0 | 27.0 |
| Random | 21.0 | 12.5 | 24.0 | 27.5 | 20.5 | 27.0 | 32.0 | 31.5 | 38.5 | 27.0 | 26.0 | 14.0 | 24.6 | 50.4 | 23.5 | 25.5 | 24.5 | 22.5 | 31.0 | 23.5 | 24.5 | 25.5 | 10.5 | 22.5 | 27.0 | 27.0 |
| GPT-4o | 78.0 | 46.5 | 62.5 | 43.5 | 97.5 | 21.5 | 57.5 | 29.5 | 88.0 | 58.5 | 35.0 | 17.5 | 81.9 | 46.6 | 23.5 | 24.0 | 40.5 | 94.5 | 85.0 | 22.0 | 39.0 | 55.0 | 12.5 | 56.0 | 69.0 | 49.0 |
| Gemini1.5 | 93.0 | 39.5 | 59.0 | 30.0 | 60.0 | 43.5 | 53.5 | 22.5 | 91.0 | 64.5 | 24.0 | 13.0 | 68.8 | 51.1 | 34.5 | 20.0 | 32.0 | 48.5 | 37.5 | 28.5 | 35.5 | 66.5 | 13.0 | 61.0 | 55.0 | 43.0 |
| Claude3.5 | 88.5 | 55.0 | 56.0 | 26.5 | 67.5 | 38.5 | 53.5 | 23.0 | 78.5 | 52.0 | 32.0 | 4.0 | 64.8 | 42.1 | 31.5 | 23.5 | 41.0 | 32.0 | 99.5 | 21.5 | 28.5 | 78.5 | 10.5 | 67.5 | 53.5 | 36.5 |
| Gemini1.0 | 87.0 | 35.5 | 62.5 | 24.5 | 42.0 | 23.0 | 45.5 | 17.0 | 53.0 | 55.0 | 22.5 | 16.0 | 71.9 | 43.6 | 28.0 | 22.0 | 28.0 | 36.0 | 7.0 | 24.5 | 39.0 | 17.0 | 12.0 | 47.0 | 53.0 | 33.5 |
| Mantis | 73.5 | 34.0 | 51.5 | 31.0 | 14.0 | 20.0 | 54.5 | 23.0 | 66.0 | 48.0 | 23.5 | 13.0 | 71.4 | 47.4 | 27.5 | 23.5 | 24.0 | 26.0 | 22.5 | 25.0 | 50.5 | 76.0 | 13.5 | 50.0 | 59.0 | 40.5 |
| Llava-interleave | 43.0 | 34.0 | 49.0 | 29.5 | 32.0 | 26.0 | 30.0 | 21.5 | 42.0 | 47.5 | 22.5 | 14.0 | 23.6 | 32.3 | 17.5 | 28.5 | 23.0 | 17.5 | 3.0 | 31.0 | 36.0 | 79.0 | 15.0 | 60.5 | 34.5 | 42.5 |
| InternVL2 | 91.5 | 40.5 | 52.0 | 25.5 | 78.0 | 35.0 | 63.0 | 28.5 | 77.5 | 41.5 | 26.0 | 20.0 | 78.4 | 55.6 | 27.5 | 25.5 | 28.0 | 20.0 | 26.0 | 41.0 | 43.0 | 48.5 | 13.5 | 59.5 | 51.5 | 31.0 |
| InternVL1.5-chat | 90.5 | 35.5 | 56.5 | 23.5 | 31.0 | 24.5 | 53.0 | 26.0 | 40.0 | 49.0 | 25.5 | 15.5 | 59.3 | 43.6 | 19.5 | 22.5 | 23.5 | 15.0 | 33.5 | 28.0 | 39.0 | 71.0 | 9.5 | 46.5 | 50.5 | 39.5 |
| idefics2-8b | 23.5 | 22.5 | 21.0 | 26.5 | 21.5 | 22.5 | 14.5 | 21.5 | 31.0 | 50.5 | 25.5 | 13.5 | 15.1 | 55.6 | 27.5 | 26.0 | 21.5 | 9.0 | 21.5 | 23.0 | 11.5 | 61.0 | 18.0 | 52.5 | 44.5 | 40.5 |
| deepseek-vl-7b | 10.0 | 14.0 | 5.5 | 17.0 | 30.5 | 21.5 | 0.0 | 23.0 | 45.5 | 42.0 | 24.5 | 0.0 | 2.0 | 44.4 | 20.5 | 24.5 | 24.5 | 0.0 | 7.5 | 0.5 | 1.5 | 78.0 | 0.5 | 62.5 | 40.5 | 38.5 |
| XComposer2-1.8b | 59.0 | 28.0 | 34.0 | 25.0 | 28.5 | 17.0 | 17.5 | 0.5 | 29.5 | 48.0 | 6.0 | 7.5 | 33.2 | 41.4 | 7.0 | 0.0 | 15.5 | 17.0 | 28.0 | 2.0 | 29.0 | 33.5 | 9.0 | 27.5 | 11.5 | 3.0 |
| deepseek-vl-1.3b | 6.5 | 13.0 | 3.5 | 11.5 | 33.0 | 20.0 | 0.5 | 25.0 | 44.5 | 38.0 | 24.0 | 1.0 | 0.0 | 55.6 | 31.0 | 26.0 | 31.0 | 0.0 | 19.5 | 0.0 | 1.5 | 66.5 | 3.0 | 61.5 | 45.5 | 29.0 |
| flamingov2 | 25.0 | 21.5 | 25.5 | 25.0 | 14.5 | 13.5 | 15.5 | 27.5 | 4.0 | 25.5 | 23.0 | 7.0 | 22.1 | 3.0 | 1.5 | 26.5 | 22.0 | 35.0 | 17.0 | 28.5 | 20.5 | 23.5 | 11.5 | 31.0 | 25.0 | 23.5 |
| XComposer2 | 55.0 | 35.0 | 42.5 | 22.5 | 2.5 | 19.0 | 20.0 | 8.0 | 15.5 | 45.0 | 0.0 | 0.0 | 20.6 | 0.0 | 16.5 | 0.0 | 7.0 | 0.0 | 4.5 | 0.0 | 33.5 | 63.0 | 1.5 | 38.5 | 42.0 | 33.0 |
| qwen-chat | 29.0 | 23.0 | 18.0 | 6.0 | 6.0 | 6.0 | 32.0 | 9.0 | 13.5 | 17.0 | 15.5 | 3.5 | 40.2 | 15.8 | 16.5 | 16.5 | 22.5 | 17.5 | 13.0 | 14.5 | 14.0 | 8.0 | 3.0 | 8.5 | 1.5 | 0.5 |
| idefics-9b-instruct | 37.0 | 27.5 | 48.5 | 23.0 | 0.0 | 5.5 | 5.0 | 3.0 | 9.0 | 16.0 | 0.0 | 0.0 | 6.5 | 12.8 | 1.0 | 15.5 | 10.5 | 0.5 | 36.5 | 5.5 | 2.5 | 44.5 | 1.5 | 35.0 | 0.0 | 0.0 |
| qwen-base | 24.5 | 8.0 | 29.5 | 5.0 | 5.5 | 6.5 | 2.0 | 2.0 | 8.5 | 11.5 | 0.0 | 0.0 | 0.5 | 5.3 | 0.0 | 0.5 | 7.0 | 0.0 | 21.5 | 0.0 | 5.5 | 2.5 | 0.0 | 0.5 | 0.0 | 0.0 |
| glm-4v-9b | 55.5 | 19.0 | 34.0 | 5.0 | 11.5 | 14.5 | 26.0 | 11.5 | 35.5 | 41.5 | 16.0 | 6.5 | 25.1 | 29.3 | 9.0 | 14.0 | 14.5 | 7.0 | 0.5 | 5.5 | 27.0 | 35.0 | 7.5 | 26.0 | 48.5 | 23.5 |
| llava-next-vicuna_7b | 50.5 | 14.5 | 38.0 | 9.0 | 9.5 | 8.5 | 31.0 | 5.0 | 28.5 | 27.0 | 8.5 | 5.0 | 22.6 | 29.3 | 6.5 | 4.0 | 4.0 | 6.0 | 8.0 | 9.5 | 32.5 | 72.0 | 1.0 | 38.0 | 42.0 | 25.0 |
| MiniCPM-Llama3-V-2-5 | 46.0 | 24.5 | 26.0 | 4.5 | 20.5 | 12.0 | 43.0 | 0.0 | 25.0 | 44.5 | 0.0 | 1.5 | 34.2 | 38.3 | 6.0 | 8.5 | 5.5 | 9.5 | 20.0 | 4.5 | 24.5 | 14.5 | 0.5 | 22.0 | 32.5 | 15.0 |
| LLaVA-v1.5-7B | 24.5 | 17.5 | 40.0 | 15.0 | 21.5 | 4.0 | 26.0 | 7.5 | 26.5 | 17.5 | 5.0 | 4.5 | 25.6 | 27.1 | 8.5 | 8.0 | 4.0 | 6.0 | 6.0 | 14.5 | 29.5 | 66.0 | 2.0 | 35.0 | 34.5 | 28.5 |
| ShareGPT4V-7B | 26.5 | 19.0 | 42.0 | 7.5 | 14.0 | 7.5 | 31.5 | 7.0 | 29.0 | 18.0 | 5.0 | 1.5 | 28.1 | 23.3 | 9.5 | 3.0 | 7.0 | 6.0 | 2.0 | 8.0 | 27.5 | 65.5 | 0.0 | 44.0 | 36.5 | 31.0 |
| SharedCaptioner | 17.0 | 22.5 | 18.5 | 12.0 | 14.5 | 11.0 | 23.5 | 7.0 | 25.5 | 22.0 | 5.5 | 2.0 | 16.1 | 43.6 | 9.0 | 2.5 | 1.5 | 1.5 | 5.5 | 8.0 | 26.5 | 47.0 | 2.0 | 28.0 | 16.5 | 9.0 |
| Monkey-Chat | 10.0 | 8.5 | 17.0 | 8.0 | 13.0 | 7.5 | 15.5 | 7.0 | 27.5 | 17.0 | 5.5 | 3.0 | 10.6 | 22.6 | 9.0 | 5.5 | 8.0 | 6.0 | 5.5 | 7.5 | 34.5 | 51.0 | 1.5 | 17.0 | 36.0 | 8.5 |
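Each entry in the leaderboard is a plain multiple-choice accuracy. As a minimal illustration of how such per-task and Overall scores can be computed from per-sample predictions (the record format and the aggregation choice are assumptions for illustration, not the official evaluation code):

    from collections import defaultdict

    def score_predictions(predictions):
        """Per-task accuracy plus an overall score aggregated across all tasks.

        `predictions` is assumed to be a list of dicts such as
        {"task": "TO", "answer": "A", "prediction": "B"}; this record format is
        illustrative, not the official MMIU evaluation API.
        """
        correct, total = defaultdict(int), defaultdict(int)
        for p in predictions:
            total[p["task"]] += 1
            correct[p["task"]] += int(p["prediction"] == p["answer"])

        per_task = {t: 100.0 * correct[t] / total[t] for t in total}
        # Overall here is the sample-weighted accuracy over every task; a simple
        # mean of the per-task scores is the other common convention.
        overall = 100.0 * sum(correct.values()) / sum(total.values())
        return per_task, overall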

Performance across Image Relationships

We find that models exhibit varying capabilities across different image relationships. More detailed visualizations can be found in the paper. In general, LVLMs excel at understanding semantic content in multi-image scenarios, perform moderately on temporal tasks, and perform worst at comprehending spatial relationships in multi-image contexts.

1) Semantic relationships: models generally perform well on multi-image semantic tasks involving low-level relationships. However, they struggle with high-level tasks. Subjective tasks such as causality reasoning and emotion recognition require identifying and reasoning about implicit visual information, highlighting a gap between model performance and human visual cognition. As for objective tasks such as retrieval, most models fail to tackle them.

2) Temporal relationships: models handle discrete and continuous temporal relationships relatively well but show mediocre performance on reasoning-intensive multi-image tasks. For instance, in sorting tasks, GPT-4o achieves only 28% and 21.5% accuracy on temporal ordering and visual ordering, respectively.

3) Spatial relationships: models struggle to understand both 2D and 3D positional relations. This is consistent with observations from previous single-image evaluation benchmarks, which find that LVLMs fall short on localization and detection tasks requiring spatial reasoning. Spatial tasks in MMIU are even more challenging because models must gather spatial information from multiple images and reason over it.

Performance across Image Relationships

(a): The average performance comparison of 24 LVLMs on the three main image relationships. (b): The average performance comparison of representative models such as GPT-4o on the seven specific image relationships.

Taskonomy Analysis

The task map is an effective tool for multi-task analysis. Thanks to the extensive coverage of multi-image tasks in MMIU, we build a task map to analyze the relationships between different tasks, allowing us to identify in-domain and out-of-domain tasks for current LVLMs. Following MMT-Bench, we use QwenVL-chat to construct a task map in which the distance between two tasks reflects how closely they are related. The detailed construction process of the task map can be found in the paper.
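As a rough sketch of this style of analysis, one can represent each task by a feature vector and cluster the pairwise distances hierarchically. The feature choice below (a vector of per-model accuracies) and the eight-cluster cut are illustrative assumptions; MMIU's actual task features come from QwenVL-chat as described in the paper.

    # Hypothetical sketch of a task map via hierarchical clustering. The task
    # features here are an illustrative stand-in for the QwenVL-chat-based
    # features used in the paper.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist, squareform

    def build_task_map(task_features, n_clusters=8):
        """task_features: dict mapping task name -> 1-D numpy feature vector."""
        names = list(task_features)
        feats = np.stack([task_features[t] for t in names])
        condensed = pdist(feats, metric="euclidean")    # pairwise task distances
        distance_matrix = squareform(condensed)         # the "task map"
        tree = linkage(condensed, method="average")     # hierarchical clustering
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        return names, distance_matrix, dict(zip(names, labels))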

Tasks involving recognition or captioning are in-domain tasks that most current multimodal large models can handle. On multi-image tasks in general, models struggle to achieve satisfactory results, obtaining good performance on only a limited number of tasks. Specifically, models perform relatively well on tasks in clusters 7 and 8 and some tasks in cluster 2, which involve recognition or captioning (e.g., video captioning, action recognition). This is because these multi-image tasks focus on overall image perception and require less comparison and reasoning between images.

Tasks involving temporal ordering and 3D spatial reasoning are out-of-domain tasks on which most models perform poorly. Specifically, models struggle with tasks in clusters 4, 5, and 6. Clusters 4 and 6 involve modeling semantic relationships or sequential order among multiple images, which requires memorizing detailed long-context content and strong reasoning skills; most LVLMs underperform on these tasks, such as temporal ordering. Tasks in cluster 5 pertain to 3D vision, such as 3D detection and tracking; the poor performance may be due to the lack of 3D vision-language data in LVLM training.

error distribution

(a): Visualization of the task map and the hierarchical clustering derived from it. (b): Visualization of model performance across various tasks. Different colors represent the categories formed through clustering, arranged sequentially from left to right, from the first category to the eighth. Note that although InternVL1.5-chat supports multiple image inputs, its training phase did not incorporate multi-image data.

Task Learning Difficulty

We analyze task learning difficulty by performing SFT with all evaluation samples in MMIU used as instruction-tuning data. In this way, we can identify tasks that cannot be improved by simple SFT. To this end, we fine-tune QwenVL-chat on each task for 20 epochs and obtain its accuracy on each task, denoted AccSFT. A lower accuracy reflects greater fitting difficulty for that task. Meanwhile, we also compute the average accuracy of all tested models on each task, denoted AccModel, which reflects the difficulty current models face in handling these tasks.

We find that the Spearman correlation coefficient between AccSFT and AccModel is 0.66, indicating a strong correlation. This suggests that both measures reflect task difficulty to some extent. More importantly, we need to focus on tasks where both AccSFT and AccModel are low. A low AccSFT indicates that the task is difficult to overfit even with SFT, suggesting that additional pre-training data or new training techniques might be necessary. These tasks include: 1) ordering and retrieval tasks, which require strong memory and reasoning abilities, capabilities that are generally weak in current large multimodal models; and 2) tasks involving a large number of images, such as EVQA, MEV, and GNAP, which require models to support longer context lengths and possess strong memory capabilities. This indicates that future multimodal model designs should consider the ability to handle long contexts and emphasize the inclusion of multi-image data during the pre-training phase.
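For reference, the reported correlation can be computed from the two per-task accuracy vectors with a standard Spearman test; the dictionary-based inputs below are an assumption for illustration.

    from scipy.stats import spearmanr

    def task_difficulty_correlation(acc_sft, acc_model):
        """Spearman rank correlation between AccSFT and AccModel.

        acc_sft[t]  : QwenVL-chat accuracy on task t after per-task SFT.
        acc_model[t]: mean accuracy of all evaluated models on task t.
        (These dictionary inputs are illustrative; the paper reports rho = 0.66.)
        """
        tasks = sorted(set(acc_sft) & set(acc_model))
        rho, p_value = spearmanr([acc_sft[t] for t in tasks],
                                 [acc_model[t] for t in tasks])
        return rho, p_value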

error distribution

The performance of AccModel and AccSFT across different tasks, sorted by AccModel in descending order, with AccSFT scaled to the same magnitude as AccModel for easy comparison.

Error Examples

BibTeX


@misc{meng2024mmiumultimodalmultiimageunderstanding,
  title={MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models},
  author={Fanqing Meng and Jin Wang and Chuanhao Li and Quanfeng Lu and Hao Tian and Jiaqi Liao and Xizhou Zhu and Jifeng Dai and Yu Qiao and Ping Luo and Kaipeng Zhang and Wenqi Shao},
  year={2024},
  eprint={2408.02718},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.02718},
}