Commit: update radar graph with gemini

lupantech committed Dec 8, 2023
1 parent 5fd8a25 commit 83ccb4c
Showing 2 changed files with 30 additions and 20 deletions.
README.md (30 additions & 20 deletions)

With **MathVista**, we have conducted **a comprehensive, quantitative evaluation** …
<img src="assets/score_leaderboard_gpt4v.png" width="70%"> <br>
Accuracy scores on the <b>testmini</b> set (1,000 examples) of <b>MathVista</b>.
</p>

We further explore new abilities of GPT-4V, such as **self-verification**, the use of **self-consistency**, and **goal-directed multi-turn human-AI dialogues**, highlighting its promising potential for future research.

<p align="center">
<img src="assets/tease_scores_version4_gemini.png" width="80%"> <br>
Accuracy scores of one leading LLM (i.e., PoT GPT-4), four primary LMMs, random chance, and human performance on <b>MathVista</b>.
</p>

Accuracy scores on the **testmini** subset (1,000 examples):

| **#** | **Model** | **Method** | **Source** | **Date** | **ALL** | **FQA** | **GPS** | **MWP** | **TQA** | **VQA** | **ALG** | **ARI** | **GEO** | **LOG** | **NUM** | **SCI** | **STA** |
| ----- | ------------------------------ | ---------- | ------------------------------------------------------------ | ---------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| - | **Human Performance\*** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **60.3** | 59.7 | 48.4 | 73.0 | 63.2 | 55.9 | 50.9 | 59.2 | 51.4 | 40.7 | 53.8 | 64.9 | 63.9 |
| 1 | **Gemini Ultra🥇** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **53.0** | 49.1 | 56.3 | 53.8 | 69.0 | 40.2 | 58.4 | 45.9 | 55.7 | 21.6 | 38.9 | 62.3 | 59.5 |
| 2 | **GPT-4V (Playground)🥈** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-15 | **49.9** | 43.1 | 50.5 | 57.5 | 65.2 | 38.0 | 53.0 | 49.0 | 51.0 | 21.6 | 20.1 | 63.1 | 55.8 |
| 3 | **Gemini Pro 🥉** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **45.2** | 47.6 | 40.4 | 39.3 | 61.4 | 39.1 | 45.2 | 38.8 | 41.0 | 10.8 | 32.6 | 54.9 | 56.8 |
| 4 | **SPHINX (V2)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-17 | **36.7** | 54.7 | 16.4 | 23.1 | 41.8 | 43.0 | 20.6 | 33.4 | 17.6 | 24.3 | 21.5 | 43.4 | 51.5 |
| 5 | **Multimodal Bard** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **34.8** | 26.0 | 47.1 | 29.6 | 48.7 | 26.8 | 46.5 | 28.6 | 47.8 | 13.5 | 14.9 | 47.5 | 33.0 |
| 6 | **PoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.9** | 30.1 | 39.4 | 30.6 | 39.9 | 31.3 | 37.4 | 31.7 | 41.0 | 18.9 | 20.1 | 44.3 | 37.9 |
| 7 | **CoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.9 | 31.7 | 31.2 | 51.9 | 28.5 | 33.5 | 30.9 | 32.2 | 13.5 | 12.5 | 58.2 | 37.9 |
| 8 | **CoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.5 | 29.3 | 36.0 | 49.4 | 29.1 | 31.0 | 32.9 | 31.0 | 16.2 | 17.4 | 50.8 | 37.2 |
| 9 | **CoT Claude-2 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 26.0 | 31.7 | 35.5 | 48.1 | 30.2 | 32.4 | 32.3 | 33.0 | 16.2 | 17.4 | 54.9 | 36.2 |
| 10 | **Gemini Nano 2** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **30.6** | 28.6 | 23.6 | 30.7 | 41.8 | 31.8 | 27.1 | 29.8 | 26.8 | 10.8 | 20.8 | 40.2 | 33.6 |
| 11 | **SPHINX (V1)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/SPHINX/SPHINX_paper.pdf) | 2023-11-09 | **27.5** | 23.4 | 23.1 | 21.5 | 39.9 | 34.1 | 25.6 | 28.1 | 23.4 | 16.2 | 17.4 | 40.2 | 23.6 |
| 12 | **Gemini Nano 1** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **27.3** | 30.9 | 21.6 | 23.7 | 29.1 | 30.7 | 23.8 | 25.5 | 21.3 | 13.5 | 20.8 | 27.9 | 30.9 |
| 13 | **PoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.8** | 24.5 | 26.4 | 23.7 | 33.5 | 27.9 | 27.8 | 26.1 | 28.0 | 18.9 | 13.2 | 33.6 | 29.9 |
| 14 | **LLaVA (LLaMA-2-13B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.1** | 26.8 | 29.3 | 16.1 | 32.3 | 26.3 | 27.3 | 20.1 | 28.8 | 24.3 | 18.3 | 37.3 | 25.1 |
| 15 | **InstructBLIP (Vicuna-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.3** | 23.1 | 20.7 | 18.3 | 32.3 | 35.2 | 21.8 | 27.1 | 20.7 | 18.9 | 20.4 | 33.0 | 23.1 |
| 16 | **LLaVAR** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.2** | 21.9 | 25.0 | 16.7 | 34.8 | 30.7 | 24.2 | 22.1 | 23.0 | 13.5 | 15.3 | 42.6 | 21.9 |
| 17 | **LLaMA-Adapter-V2 (7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.9** | 21.2 | 25.5 | 11.3 | 32.3 | 31.8 | 26.3 | 20.4 | 24.3 | 24.3 | 13.9 | 29.5 | 18.3 |
| 18 | **miniGPT4 (LLaMA-2-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.1** | 18.6 | 26.0 | 13.4 | 30.4 | 30.2 | 28.1 | 21.0 | 24.7 | 16.2 | 16.7 | 25.4 | 17.9 |
| 19 | **mPLUG-Owl (LLaMA-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **22.2** | 22.7 | 23.6 | 10.2 | 27.2 | 27.9 | 23.6 | 19.2 | 23.9 | 13.5 | 12.7 | 26.3 | 21.4 |
| 20 | **IDEFICS (9B-Instruct)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **19.8** | 21.6 | 21.1 | 6.5 | 25.9 | 24.0 | 22.1 | 15.0 | 19.8 | 18.9 | 9.9 | 24.6 | 18.1 |
| 21 | **Random Chance** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **17.9** | 18.2 | 21.6 | 3.8 | 19.6 | 26.3 | 21.7 | 14.7 | 20.1 | 13.5 | 8.3 | 17.2 | 16.3 |

Some notations in the table:

- **Human Performance\***: average human performance from AMT annotators who hold at least a high school diploma.
- **Gemini**: the overall and fine-grained scores are reported by the Gemini Team, Google, in the [technical report](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf).
- **GPT-4V (Playground)**: the launched playground at https://chat.openai.com/?model=gpt-4; experiment dates range from Oct 7, 2023, to Oct 15, 2023.
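The radar graph updated in this commit visualizes the per-category scores from the table above. As a minimal sketch of how such a chart can be generated (assuming `matplotlib` and `numpy` are available; the model selection and category subset here are illustrative, not the exact script used for the figure):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt

# A subset of the fine-grained testmini scores from the table above.
categories = ["FQA", "GPS", "MWP", "TQA", "VQA"]
scores = {
    "Human": [59.7, 48.4, 73.0, 63.2, 55.9],
    "Gemini Ultra": [49.1, 56.3, 53.8, 69.0, 40.2],
    "GPT-4V": [43.1, 50.5, 57.5, 65.2, 38.0],
}

# One spoke per category; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
closed_angles = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, vals in scores.items():
    closed_vals = vals + vals[:1]
    ax.plot(closed_angles, closed_vals, label=name)
    ax.fill(closed_angles, closed_vals, alpha=0.1)
ax.set_xticks(angles)
ax.set_xticklabels(categories)
ax.set_ylim(0, 100)  # accuracy is a percentage
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
fig.savefig("mathvista_radar.png", dpi=150, bbox_inches="tight")
```

Closing the polygon by repeating the first angle/value pair is what makes each model's outline wrap around cleanly; the full figure in the repository simply uses all twelve fine-grained categories.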

Binary file added assets/tease_scores_version4_gemini.png

4 comments on commit 83ccb4c

@EwoutH commented on 83ccb4c, Aug 18, 2024:

Would it be possible to update the radar graph again, based on the latest models?

@lupantech (Owner, Author) replied:

Thank you for pointing that out. Most of the latest models typically provide only overall scores rather than the fine-grained scores needed for radar graphs. Given the fast-paced evolution of these models, it's challenging for us to keep the radar graph continuously updated. However, we are actively maintaining the leaderboard, which you can find here: Leaderboard.

@EwoutH commented on 83ccb4c, Aug 26, 2024:

Thanks for getting back! I indeed noticed that many models provide only an overall score. Is there a specific reason for this? It would be nice to have complete per-category scores.

@lupantech (Owner, Author) replied:

One possible reason is that these models are evaluated on a large number of datasets; reporting only overall scores makes it easier to compare different models.
