Commit: update radar graph with gemini

lupantech committed Dec 8, 2023
1 parent 5fd8a25 commit 83ccb4c
Showing 2 changed files with 30 additions and 20 deletions.
README.md (30 additions & 20 deletions)

With **MathVista**, we have conducted **a comprehensive, quantitative evaluation** …
<img src="assets/score_leaderboard_gpt4v.png" width="70%"> <br>
Accuracy scores on the <b>testmini</b> set (1,000 examples) of <b>MathVista</b>.
</p>

We further explore new abilities of GPT-4V, such as **self-verification**, the use of **self-consistency**, and **goal-directed multi-turn human-AI dialogues**, highlighting its promising potential for future research.

<p align="center">
<img src="assets/tease_scores_version4_gemini.png" width="80%"> <br>
Accuracy scores of one leading LLM (i.e., PoT GPT-4), four primary LMMs, random chance, and human performance on <b>MathVista</b>.
</p>

Accuracy scores on the **testmini** subset (1,000 examples):

| **#** | **Model** | **Method** | **Source** | **Date** | **ALL** | **FQA** | **GPS** | **MWP** | **TQA** | **VQA** | **ALG** | **ARI** | **GEO** | **LOG** | **NUM** | **SCI** | **STA** |
| ----- | ------------------------------ | ---------- | ------------------------------------------------------------ | ---------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| - | **Human Performance\*** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **60.3** | 59.7 | 48.4 | 73.0 | 63.2 | 55.9 | 50.9 | 59.2 | 51.4 | 40.7 | 53.8 | 64.9 | 63.9 |
| 1 | **Gemini Ultra🥇** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **53.0** | 49.1 | 56.3 | 53.8 | 69.0 | 40.2 | 58.4 | 45.9 | 55.7 | 21.6 | 38.9 | 62.3 | 59.5 |
| 2 | **GPT-4V (Playground)🥈** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-15 | **49.9** | 43.1 | 50.5 | 57.5 | 65.2 | 38.0 | 53.0 | 49.0 | 51.0 | 21.6 | 20.1 | 63.1 | 55.8 |
| 3 | **Gemini Pro 🥉** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **45.2** | 47.6 | 40.4 | 39.3 | 61.4 | 39.1 | 45.2 | 38.8 | 41.0 | 10.8 | 32.6 | 54.9 | 56.8 |
| 4 | **SPHINX (V2)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-17 | **36.7** | 54.7 | 16.4 | 23.1 | 41.8 | 43.0 | 20.6 | 33.4 | 17.6 | 24.3 | 21.5 | 43.4 | 51.5 |
| 5 | **Multimodal Bard** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **34.8** | 26.0 | 47.1 | 29.6 | 48.7 | 26.8 | 46.5 | 28.6 | 47.8 | 13.5 | 14.9 | 47.5 | 33.0 |
| 6 | **PoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.9** | 30.1 | 39.4 | 30.6 | 39.9 | 31.3 | 37.4 | 31.7 | 41.0 | 18.9 | 20.1 | 44.3 | 37.9 |
| 7 | **CoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.9 | 31.7 | 31.2 | 51.9 | 28.5 | 33.5 | 30.9 | 32.2 | 13.5 | 12.5 | 58.2 | 37.9 |
| 8 | **CoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.5 | 29.3 | 36.0 | 49.4 | 29.1 | 31.0 | 32.9 | 31.0 | 16.2 | 17.4 | 50.8 | 37.2 |
| 9 | **CoT Claude-2 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 26.0 | 31.7 | 35.5 | 48.1 | 30.2 | 32.4 | 32.3 | 33.0 | 16.2 | 17.4 | 54.9 | 36.2 |
| 10 | **Gemini Nano 2** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **30.6** | 28.6 | 23.6 | 30.7 | 41.8 | 31.8 | 27.1 | 29.8 | 26.8 | 10.8 | 20.8 | 40.2 | 33.6 |
| 11 | **SPHINX (V1)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/SPHINX/SPHINX_paper.pdf) | 2023-11-09 | **27.5** | 23.4 | 23.1 | 21.5 | 39.9 | 34.1 | 25.6 | 28.1 | 23.4 | 16.2 | 17.4 | 40.2 | 23.6 |
| 12 | **Gemini Nano 1** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **27.3** | 30.9 | 21.6 | 23.7 | 29.1 | 30.7 | 23.8 | 25.5 | 21.3 | 13.5 | 20.8 | 27.9 | 30.9 |
| 13 | **PoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.8** | 24.5 | 26.4 | 23.7 | 33.5 | 27.9 | 27.8 | 26.1 | 28.0 | 18.9 | 13.2 | 33.6 | 29.9 |
| 14 | **LLaVA (LLaMA-2-13B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.1** | 26.8 | 29.3 | 16.1 | 32.3 | 26.3 | 27.3 | 20.1 | 28.8 | 24.3 | 18.3 | 37.3 | 25.1 |
| 15 | **InstructBLIP (Vicuna-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.3** | 23.1 | 20.7 | 18.3 | 32.3 | 35.2 | 21.8 | 27.1 | 20.7 | 18.9 | 20.4 | 33.0 | 23.1 |
| 16 | **LLaVAR** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.2** | 21.9 | 25.0 | 16.7 | 34.8 | 30.7 | 24.2 | 22.1 | 23.0 | 13.5 | 15.3 | 42.6 | 21.9 |
| 17 | **LLaMA-Adapter-V2 (7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.9** | 21.2 | 25.5 | 11.3 | 32.3 | 31.8 | 26.3 | 20.4 | 24.3 | 24.3 | 13.9 | 29.5 | 18.3 |
| 18 | **miniGPT4 (LLaMA-2-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.1** | 18.6 | 26.0 | 13.4 | 30.4 | 30.2 | 28.1 | 21.0 | 24.7 | 16.2 | 16.7 | 25.4 | 17.9 |
| 19 | **mPLUG-Owl (LLaMA-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **22.2** | 22.7 | 23.6 | 10.2 | 27.2 | 27.9 | 23.6 | 19.2 | 23.9 | 13.5 | 12.7 | 26.3 | 21.4 |
| 20 | **IDEFICS (9B-Instruct)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **19.8** | 21.6 | 21.1 | 6.5 | 25.9 | 24.0 | 22.1 | 15.0 | 19.8 | 18.9 | 9.9 | 24.6 | 18.1 |
| 21 | **Random Chance** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **17.9** | 18.2 | 21.6 | 3.8 | 19.6 | 26.3 | 21.7 | 14.7 | 20.1 | 13.5 | 8.3 | 17.2 | 16.3 |

Some notations in the table:

- **Human Performance\***: average human performance from AMT annotators who hold at least a high school diploma.
- **Gemini**: the overall and fine-grained scores are reported by the Gemini Team, Google, in the [technical report](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf).
- **GPT-4V (Playground)**: the launched playground at https://chat.openai.com/?model=gpt-4; experiment dates range from Oct 7, 2023, to Oct 15, 2023.
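The radar graph updated in this commit visualizes the per-category scores from the table above. As a minimal sketch of how such a chart can be generated (assuming `matplotlib` and `numpy` are available; the model selection and category subset here are illustrative, not the exact script used for the figure):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt

# A subset of the fine-grained testmini scores from the table above.
categories = ["FQA", "GPS", "MWP", "TQA", "VQA"]
scores = {
    "Human": [59.7, 48.4, 73.0, 63.2, 55.9],
    "Gemini Ultra": [49.1, 56.3, 53.8, 69.0, 40.2],
    "GPT-4V": [43.1, 50.5, 57.5, 65.2, 38.0],
}

# One spoke per category; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
closed_angles = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, vals in scores.items():
    closed_vals = vals + vals[:1]
    ax.plot(closed_angles, closed_vals, label=name)
    ax.fill(closed_angles, closed_vals, alpha=0.1)
ax.set_xticks(angles)
ax.set_xticklabels(categories)
ax.set_ylim(0, 100)  # accuracy is a percentage
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
fig.savefig("mathvista_radar.png", dpi=150, bbox_inches="tight")
```

Closing the polygon by repeating the first angle/value pair is what makes each model's outline wrap around cleanly; the full figure in the repository simply uses all twelve fine-grained categories.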

Binary file added assets/tease_scores_version4_gemini.png

4 comments on commit 83ccb4c

@EwoutH commented on 83ccb4c, Aug 18, 2024:

Would it be possible to update the radar graph again, based on the latest models?

@lupantech (Owner, Author) replied:

Thank you for pointing that out. Most of the latest models typically provide only overall scores rather than the fine-grained scores needed for radar graphs. Given the fast-paced evolution of these models, it's challenging for us to keep the radar graph continuously updated. However, we are actively maintaining the leaderboard, which you can find here: Leaderboard.

@EwoutH commented on 83ccb4c, Aug 26, 2024:

Thanks for getting back! I indeed noticed that many models provide only an overall score. Is there a specific reason for this? It would be nice to have complete per-category scores.

@lupantech (Owner, Author) replied:

One possible reason is that these models are evaluated on a large number of datasets; reporting only overall scores makes it easier to compare different models.
