diff --git a/README.md b/README.md
index 70e43b1..376dba8 100644
--- a/README.md
+++ b/README.md
@@ -53,8 +53,14 @@ With **MathVista**, we have conducted **a comprehensive, quantitative evaluation
Accuracy scores the testmini set (1,000 examples) of MathVista.

+ We further explore the new ability of **self-verification**, the use of **self-consistency**, and the **goal-directed multi-turn human-AI dialogues**, highlighting the promising potential of GPT-4V for future research.
+

+
+ Accuracy scores of one leading LLM (i.e., PoT GPT-4), four primary LMMs, random chance, and human performance on MathVista.
+

+


Accuracy scores of one leading LLM (i.e., PoT GPT-4), four primary LMMs, random chance, and human performance on MathVista.
@@ -70,30 +76,34 @@ Accuracy scores on the **testmini** subset (1,000 examples):
| **#** | **Model** | **Method** | **Source** | **Date** | **ALL** | **FQA** | **GPS** | **MWP** | **TQA** | **VQA** | **ALG** | **ARI** | **GEO** | **LOG** | **NUM** | **SCI** | **STA** |
| ----- | ------------------------------ | ---------- | ------------------------------------------------------------ | ---------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
-| - | **Human** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **60.3** | 59.7 | 48.4 | 73.0 | 63.2 | 55.9 | 50.9 | 59.2 | 51.4 | 40.7 | 53.8 | 64.9 | 63.9 |
-| 1 | **Gemini Ultra🥇** | LMM 🖼️ | [Link](https://blog.google/technology/ai/google-gemini-ai/#performance) | 2023-12-06 | **53.0** | - | - | - | - | - | - | - | - | - | - | - | - |
+| - | **Human Performance\*** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **60.3** | 59.7 | 48.4 | 73.0 | 63.2 | 55.9 | 50.9 | 59.2 | 51.4 | 40.7 | 53.8 | 64.9 | 63.9 |
+| 1 | **Gemini Ultra🥇** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **53.0** | 49.1 | 56.3 | 53.8 | 69.0 | 40.2 | 58.4 | 45.9 | 55.7 | 21.6 | 38.9 | 62.3 | 59.5 |
| 2 | **GPT-4V (Playground)🥈** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-15 | **49.9** | 43.1 | 50.5 | 57.5 | 65.2 | 38.0 | 53.0 | 49.0 | 51.0 | 21.6 | 20.1 | 63.1 | 55.8 |
-| 3 | **SPHINX (V2) 🥉** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-17 | **36.7** | 54.7 | 16.4 | 23.1 | 41.8 | 43.0 | 20.6 | 33.4 | 17.6 | 24.3 | 21.5 | 43.4 | 51.5 |
-| 4 | **Multimodal Bard** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **34.8** | 26.0 | 47.1 | 29.6 | 48.7 | 26.8 | 46.5 | 28.6 | 47.8 | 13.5 | 14.9 | 47.5 | 33.0 |
-| 5 | **PoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.9** | 30.1 | 39.4 | 30.6 | 39.9 | 31.3 | 37.4 | 31.7 | 41.0 | 18.9 | 20.1 | 44.3 | 37.9 |
-| 6 | **CoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.9 | 31.7 | 31.2 | 51.9 | 28.5 | 33.5 | 30.9 | 32.2 | 13.5 | 12.5 | 58.2 | 37.9 |
-| 7 | **CoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.5 | 29.3 | 36.0 | 49.4 | 29.1 | 31.0 | 32.9 | 31.0 | 16.2 | 17.4 | 50.8 | 37.2 |
-| 8 | **CoT Claude-2 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 26.0 | 31.7 | 35.5 | 48.1 | 30.2 | 32.4 | 32.3 | 33.0 | 16.2 | 17.4 | 54.9 | 36.2 |
-| 9 | **SPHINX (V1)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/SPHINX/SPHINX_paper.pdf) | 2023-11-09 | **27.5** | 23.4 | 23.1 | 21.5 | 39.9 | 34.1 | 25.6 | 28.1 | 23.4 | 16.2 | 17.4 | 40.2 | 23.6 |
-| 10 | **PoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.8** | 24.5 | 26.4 | 23.7 | 33.5 | 27.9 | 27.8 | 26.1 | 28.0 | 18.9 | 13.2 | 33.6 | 29.9 |
-| 11 | **LLaVA (LLaMA-2-13B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.1** | 26.8 | 29.3 | 16.1 | 32.3 | 26.3 | 27.3 | 20.1 | 28.8 | 24.3 | 18.3 | 37.3 | 25.1 |
-| 12 | **InstructBLIP (Vicuna-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.3** | 23.1 | 20.7 | 18.3 | 32.3 | 35.2 | 21.8 | 27.1 | 20.7 | 18.9 | 20.4 | 33.0 | 23.1 |
-| 13 | **LLaVAR** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.2** | 21.9 | 25.0 | 16.7 | 34.8 | 30.7 | 24.2 | 22.1 | 23.0 | 13.5 | 15.3 | 42.6 | 21.9 |
-| 14 | **LLaMA-Adapter-V2 (7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.9** | 21.2 | 25.5 | 11.3 | 32.3 | 31.8 | 26.3 | 20.4 | 24.3 | 24.3 | 13.9 | 29.5 | 18.3 |
-| 15 | **miniGPT4 (LLaMA-2-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.1** | 18.6 | 26.0 | 13.4 | 30.4 | 30.2 | 28.1 | 21.0 | 24.7 | 16.2 | 16.7 | 25.4 | 17.9 |
-| 16 | **mPLUG-Owl (LLaMA-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **22.2** | 22.7 | 23.6 | 10.2 | 27.2 | 27.9 | 23.6 | 19.2 | 23.9 | 13.5 | 12.7 | 26.3 | 21.4 |
-| 17 | **IDEFICS (9B-Instruct)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **19.8** | 21.6 | 21.1 | 6.5 | 25.9 | 24.0 | 22.1 | 15.0 | 19.8 | 18.9 | 9.9 | 24.6 | 18.1 |
-| 18 | **Random Chance** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **17.9** | 18.2 | 21.6 | 3.8 | 19.6 | 26.3 | 21.7 | 14.7 | 20.1 | 13.5 | 8.3 | 17.2 | 16.3 |
-
+| 3 | **Gemini Pro 🥉** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **45.2** | 47.6 | 40.4 | 39.3 | 61.4 | 39.1 | 45.2 | 38.8 | 41.0 | 10.8 | 32.6 | 54.9 | 56.8 |
+| 4 | **SPHINX (V2)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-17 | **36.7** | 54.7 | 16.4 | 23.1 | 41.8 | 43.0 | 20.6 | 33.4 | 17.6 | 24.3 | 21.5 | 43.4 | 51.5 |
+| 5 | **Multimodal Bard** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **34.8** | 26.0 | 47.1 | 29.6 | 48.7 | 26.8 | 46.5 | 28.6 | 47.8 | 13.5 | 14.9 | 47.5 | 33.0 |
+| 6 | **PoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.9** | 30.1 | 39.4 | 30.6 | 39.9 | 31.3 | 37.4 | 31.7 | 41.0 | 18.9 | 20.1 | 44.3 | 37.9 |
+| 7 | **CoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.9 | 31.7 | 31.2 | 51.9 | 28.5 | 33.5 | 30.9 | 32.2 | 13.5 | 12.5 | 58.2 | 37.9 |
+| 8 | **CoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.5 | 29.3 | 36.0 | 49.4 | 29.1 | 31.0 | 32.9 | 31.0 | 16.2 | 17.4 | 50.8 | 37.2 |
+| 9 | **CoT Claude-2 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 26.0 | 31.7 | 35.5 | 48.1 | 30.2 | 32.4 | 32.3 | 33.0 | 16.2 | 17.4 | 54.9 | 36.2 |
+| 10 | **Gemini Nano 2** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **30.6** | 28.6 | 23.6 | 30.7 | 41.8 | 31.8 | 27.1 | 29.8 | 26.8 | 10.8 | 20.8 | 40.2 | 33.6 |
+| 11 | **SPHINX (V1)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/SPHINX/SPHINX_paper.pdf) | 2023-11-09 | **27.5** | 23.4 | 23.1 | 21.5 | 39.9 | 34.1 | 25.6 | 28.1 | 23.4 | 16.2 | 17.4 | 40.2 | 23.6 |
+| 12 | **Gemini Nano 1** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | 2023-12-06 | **27.3** | 30.9 | 21.6 | 23.7 | 29.1 | 30.7 | 23.8 | 25.5 | 21.3 | 13.5 | 20.8 | 27.9 | 30.9 |
+| 13 | **PoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.8** | 24.5 | 26.4 | 23.7 | 33.5 | 27.9 | 27.8 | 26.1 | 28.0 | 18.9 | 13.2 | 33.6 | 29.9 |
+| 14 | **LLaVA (LLaMA-2-13B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.1** | 26.8 | 29.3 | 16.1 | 32.3 | 26.3 | 27.3 | 20.1 | 28.8 | 24.3 | 18.3 | 37.3 | 25.1 |
+| 15 | **InstructBLIP (Vicuna-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.3** | 23.1 | 20.7 | 18.3 | 32.3 | 35.2 | 21.8 | 27.1 | 20.7 | 18.9 | 20.4 | 33.0 | 23.1 |
+| 16 | **LLaVAR** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.2** | 21.9 | 25.0 | 16.7 | 34.8 | 30.7 | 24.2 | 22.1 | 23.0 | 13.5 | 15.3 | 42.6 | 21.9 |
+| 17 | **LLaMA-Adapter-V2 (7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.9** | 21.2 | 25.5 | 11.3 | 32.3 | 31.8 | 26.3 | 20.4 | 24.3 | 24.3 | 13.9 | 29.5 | 18.3 |
+| 18 | **miniGPT4 (LLaMA-2-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.1** | 18.6 | 26.0 | 13.4 | 30.4 | 30.2 | 28.1 | 21.0 | 24.7 | 16.2 | 16.7 | 25.4 | 17.9 |
+| 19 | **mPLUG-Owl (LLaMA-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **22.2** | 22.7 | 23.6 | 10.2 | 27.2 | 27.9 | 23.6 | 19.2 | 23.9 | 13.5 | 12.7 | 26.3 | 21.4 |
+| 20 | **IDEFICS (9B-Instruct)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **19.8** | 21.6 | 21.1 | 6.5 | 25.9 | 24.0 | 22.1 | 15.0 | 19.8 | 18.9 | 9.9 | 24.6 | 18.1 |
+| 21 | **Random Chance** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **17.9** | 18.2 | 21.6 | 3.8 | 19.6 | 26.3 | 21.7 | 14.7 | 20.1 | 13.5 | 8.3 | 17.2 | 16.3 |
Some notations in the table:
-- **Gemini**: the result is reported in the [technical report](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) authored by th Gemini Team.
+- **Human Performance\*:** Average human performance from AMT annotators who have high school diplomas or above.
+
+- **Gemini**: the fine-grained scores are from **the Gemini Team, Google**.
- **GPT-4V (Playgroud)**: the launched playground at https://chat.openai.com/?model=gpt-4; experimental dates range from Oct 7, 2023, to Oct 15, 2023
diff --git a/assets/tease_scores_version4_gemini.png b/assets/tease_scores_version4_gemini.png
new file mode 100644
index 0000000..0fd9534
Binary files /dev/null and b/assets/tease_scores_version4_gemini.png differ