diff --git a/README.md b/README.md
index 229759d..c8471a7 100644
--- a/README.md
+++ b/README.md
@@ -78,47 +78,71 @@ For more details, you can find our project page [here](https://mathvista.github.
 
 ## 🏆 Leaderboard 🏆
 
-🚨🚨 The leaderboard is continuously being updated. To submit your results to the leaderboard, please send to [this email](mailto:lupantech@gmail.com) with your result json file, referring to the template files available [here](https://github.com/lupantech/MathVista/tree/main/results/leaderboad_submission_template).
+### Contributing to the Leaderboard
+
+🚨🚨 The leaderboard is continuously being updated. The evaluation instructions are available [here](https://github.com/lupantech/MathVista?tab=readme-ov-file#-evaluations-on-mathvista).
+
+To submit your results to the leaderboard on the **testmini** subset, please send your result JSON file and score JSON file to [this email](mailto:lupantech@gmail.com), following the template files below:
+
+- [output_testmini_template_for_leaderboard_submission.json](https://github.com/lupantech/MathVista/blob/main/results/leaderboad_submission_template/output_testmini_template_for_leaderboard_submission.json)
+- [scores_testmini_template_for_leaderboard_submission.json](https://github.com/lupantech/MathVista/blob/main/results/leaderboad_submission_template/scores_testmini_template_for_leaderboard_submission.json)
+
+To submit your results to the leaderboard on the **test** subset, please send your result file to [this email](mailto:lupantech@gmail.com) (**we will generate the score file for you**), following the template file below:
+
+- [output_test_template_for_leaderboard_submission.json](https://github.com/lupantech/MathVista/blob/main/results/leaderboad_submission_template/output_test_template_for_leaderboard_submission.json)
+
+### Leaderboard on the testmini subset
 
 Accuracy scores on the **testmini** subset (1,000 examples):
 
-| **#** | **Model** | **Method** | **Source** | **Date** | **ALL** | 
**FQA** | **GPS** | **MWP** | **TQA** | **VQA** | **ALG** | **ARI** | **GEO** | **LOG** | **NUM** | **SCI** | **STA** | -| ----- | ----------------------------- | ---------- | ------------------------------------------------------------ | ---------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | -| - | **Human Performance\*** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **60.3** | 59.7 | 48.4 | 73.0 | 63.2 | 55.9 | 50.9 | 59.2 | 51.4 | 40.7 | 53.8 | 64.9 | 63.9 | -| 1 | **Gemini Ultra 🥇** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.11805) | 2023-12-06 | **53.0** | 49.1 | 56.2 | 53.8 | 69.0 | 40.2 | 58.4 | 45.9 | 55.6 | 21.6 | 38.9 | 62.3 | 59.5 | -| 2 | **GPT-4V (Playground) 🥈** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-15 | **49.9** | 43.1 | 50.5 | 57.5 | 65.2 | 38.0 | 53.0 | 49.0 | 51.0 | 21.6 | 20.1 | 63.1 | 55.8 | -| 3 | **Gemini Pro 🥉** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.11805) | 2023-12-06 | **45.2** | 47.6 | 40.4 | 39.2 | 61.4 | 39.1 | 45.2 | 38.8 | 41.0 | 10.8 | 32.6 | 54.9 | 56.8 | -| 4 | **Qwen-VL-Plus** | LMM 🖼️ | [Link](https://github.com/QwenLM/Qwen-VL) | 2023-12-21 | **43.3** | 54.6 | 38.5 | 31.2 | 55.1 | 34.1 | 39.1 | 32.0 | 39.3 | 18.9 | 26.4 | 59.0 | 56.1 | -| 5 | **SPHINX-MoE** | MoE 🤖 | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2024-01-12 | **42.3** | 49.8 | 31.2 | 42.5 | 46.8 | 39.7 | 31.7 | 41.6 | 30.5 | 16.2 | 27.1 | 50.8 | 50.8 | -| 6 | **SPHINX (V2)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-17 | **36.7** | 54.6 | 16.4 | 23.1 | 41.8 | 43.0 | 20.6 | 33.4 | 17.6 | 24.3 | 21.5 | 43.4 | 51.5 | -| 7 | **Multimodal Bard** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **34.8** | 26.0 | 47.1 | 29.6 | 48.7 | 26.8 | 46.5 | 28.6 | 47.8 | 13.5 | 14.9 | 47.5 | 33.0 | -| 8 | **PoT GPT-4 (Caption+OCR)** | Tool 🛠️ | 
[Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.9** | 30.1 | 39.4 | 30.6 | 39.9 | 31.3 | 37.4 | 31.7 | 41.0 | 18.9 | 20.1 | 44.3 | 37.9 | -| 9 | **CoT Claude (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.5 | 29.3 | 36.0 | 49.4 | 29.1 | 31.0 | 32.9 | 31.0 | 16.2 | 17.4 | 50.8 | 37.2 | -| 10 | **CoT GPT4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.9 | 31.7 | 31.2 | 51.9 | 28.5 | 33.5 | 30.9 | 32.2 | 13.5 | 12.5 | 58.2 | 37.9 | -| 11 | **CoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 26.0 | 31.7 | 35.5 | 48.1 | 30.2 | 32.4 | 32.3 | 33.0 | 16.2 | 17.4 | 54.9 | 36.2 | -| 12 | **Gemini Nano 2** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.11805) | 2023-12-06 | **30.6** | 28.6 | 23.6 | 30.6 | 41.8 | 31.8 | 27.1 | 29.8 | 26.8 | 10.8 | 20.8 | 40.2 | 33.5 | -| 13 | **SPHINX (V1)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-09 | **27.5** | 23.4 | 23.1 | 21.5 | 39.9 | 34.1 | 25.6 | 28.1 | 23.4 | 16.2 | 17.4 | 40.2 | 23.6 | -| 14 | **Gemini Nano 1** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.11805) | 2023-12-06 | **27.3** | 30.9 | 21.6 | 23.7 | 29.1 | 30.7 | 23.8 | 25.5 | 21.3 | 13.5 | 20.8 | 27.9 | 30.9 | -| 15 | **PoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.8** | 24.5 | 26.4 | 23.7 | 33.5 | 27.9 | 27.8 | 26.1 | 28.0 | 18.9 | 13.2 | 33.6 | 29.9 | -| 16 | **LLaVA (LLaMA-2-13B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.1** | 26.8 | 29.3 | 16.1 | 32.3 | 26.3 | 27.3 | 20.1 | 28.8 | 24.3 | 18.3 | 37.3 | 25.1 | -| 17 | **InstructBLIP (Vicuna-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.3** | 23.1 | 20.7 | 18.3 | 32.3 | 35.2 | 21.8 | 27.1 | 20.7 | 18.9 | 20.4 | 33.0 | 23.1 | -| 18 | **LLaVAR** | LMM 🖼️ | 
[Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.2** | 21.9 | 25.0 | 16.7 | 34.8 | 30.7 | 24.2 | 22.1 | 23.0 | 13.5 | 15.3 | 42.6 | 21.9 | -| 19 | **LLaMA-Adapter-V2 (7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.9** | 21.2 | 25.5 | 11.3 | 32.3 | 31.8 | 26.3 | 20.4 | 24.3 | 24.3 | 13.9 | 29.5 | 18.3 | -| 20 | **miniGPT4 (LLaMA-2-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.1** | 18.6 | 26.0 | 13.4 | 30.4 | 30.2 | 28.1 | 21.0 | 24.7 | 16.2 | 16.7 | 25.4 | 17.9 | -| 21 | **mPLUG-Owl (LLaMA-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **22.2** | 22.7 | 23.6 | 10.2 | 27.2 | 27.9 | 23.6 | 19.2 | 23.9 | 13.5 | 12.7 | 26.3 | 21.4 | -| 22 | **IDEFICS (9B-Instruct)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **19.8** | 21.6 | 21.1 | 6.5 | 25.9 | 24.0 | 22.1 | 15.0 | 19.8 | 18.9 | 9.9 | 24.6 | 18.1 | -| 23 | **Random Chance** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **17.9** | 15.5 | 24.1 | 4.5 | 23.4 | 24.3 | 25.8 | 13.8 | 22.7 | 13.4 | 8.8 | 15.8 | 14.3 | +| **#** | **Model** | **Method** | **Source** | **Date** | **ALL** | **FQA** | **GPS** | **MWP** | **TQA** | **VQA** | **ALG** | **ARI** | **GEO** | **LOG** | **NUM** | **SCI** | **STA** | +| ----- | ------------------------------- | ---------- | ------------------------------------------------------------ | ---------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | +| - | **Human Performance\*** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **60.3** | 59.7 | 48.4 | 73.0 | 63.2 | 55.9 | 50.9 | 59.2 | 51.4 | 40.7 | 53.8 | 64.9 | 63.9 | +| 1 | **InternVL-Chat-V1.2-Plus 🥇** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.14238) | 2024-02-22 | **59.9** | 51.7 | 61.1 | 79.6 | 52.5 | 57.0 | 54.5 | 63.2 | 61.1 | 16.2 | 48.6 | 55.7 | 60.8 | +| 2 | **InternLM-XComposer2-VL-7B 🥈** | LMM 
🖼️ | [Link](https://github.com/InternLM/InternLM-XComposer) | 2024-01-22 | **57.6** | 55.0 | 63.0 | 73.7 | 56.3 | 39.7 | 56.6 | 52.4 | 62.3 | 8.1 | 42.4 | 59.0 | 64.1 | +| 3 | **Gemini 1.0 Ultra 🥉** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.11805) | 2023-12-06 | **53.0** | 49.1 | 56.2 | 53.8 | 69.0 | 40.2 | 58.4 | 45.9 | 55.6 | 21.6 | 38.9 | 62.3 | 59.5 | +| 4 | **Gemini 1.5 Pro** | LMM 🖼️ | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) | 2024-02-15 | **52.1** | - | - | - | - | - | - | - | - | - | - | - | - | +| 5 | **GPT-4V (Playground)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-15 | **49.9** | 43.1 | 50.5 | 57.5 | 65.2 | 38.0 | 53.0 | 49.0 | 51.0 | 21.6 | 20.1 | 63.1 | 55.8 | +| 6 | **InternVL-Chat-V1.2** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.14238) | 2024-02-22 | **47.7** | 50.9 | 61.1 | 30.6 | 48.1 | 44.7 | 52.3 | 36.5 | 58.2 | 18.9 | 30.6 | 54.9 | 51.8 | +| 7 | **LLaVA-1.6-34B** | LMM 🖼️ | [Link](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/) | 2024-01-30 | **46.5** | - | - | - | - | - | - | - | - | - | - | - | - | +| 8 | **Gemini 1.0 Pro** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.11805) | 2023-12-06 | **45.2** | 47.6 | 40.4 | 39.2 | 61.4 | 39.1 | 45.2 | 38.8 | 41.0 | 10.8 | 32.6 | 54.9 | 56.8 | +| 9 | **Qwen-VL-Plus** | LMM 🖼️ | [Link](https://github.com/QwenLM/Qwen-VL) | 2023-12-21 | **43.3** | 54.6 | 38.5 | 31.2 | 55.1 | 34.1 | 39.1 | 32.0 | 39.3 | 18.9 | 26.4 | 59.0 | 56.1 | +| 10 | **SPHINX-MoE** | MoE 🤖 | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2024-01-12 | **42.3** | 49.8 | 31.2 | 42.5 | 46.8 | 39.7 | 31.7 | 41.6 | 30.5 | 16.2 | 27.1 | 50.8 | 50.8 | +| 11 | **SPHINX (V2)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-17 | **36.7** | 54.6 | 16.4 | 23.1 | 41.8 | 43.0 | 20.6 | 33.4 | 17.6 | 24.3 | 21.5 | 43.4 | 51.5 | +| 12 | **OmniLMM-12B** | LMM 🖼️ | [Link](https://github.com/OpenBMB/OmniLMM) | 
2024-02-01 | **34.9** | 45.0 | 17.8 | 26.9 | 44.9 | 39.1 | 23.1 | 32.3 | 20.9 | 18.9 | 27.8 | 45.9 | 44.2 | +| 13 | **Multimodal Bard** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **34.8** | 26.0 | 47.1 | 29.6 | 48.7 | 26.8 | 46.5 | 28.6 | 47.8 | 13.5 | 14.9 | 47.5 | 33.0 | +| 14 | **PoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.9** | 30.1 | 39.4 | 30.6 | 39.9 | 31.3 | 37.4 | 31.7 | 41.0 | 18.9 | 20.1 | 44.3 | 37.9 | +| 15 | **CoT Claude (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.5 | 29.3 | 36.0 | 49.4 | 29.1 | 31.0 | 32.9 | 31.0 | 16.2 | 17.4 | 50.8 | 37.2 | +| 16 | **CoT GPT4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 27.9 | 31.7 | 31.2 | 51.9 | 28.5 | 33.5 | 30.9 | 32.2 | 13.5 | 12.5 | 58.2 | 37.9 | +| 17 | **CoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **33.2** | 26.0 | 31.7 | 35.5 | 48.1 | 30.2 | 32.4 | 32.3 | 33.0 | 16.2 | 17.4 | 54.9 | 36.2 | +| 18 | **Gemini 1.0 Nano 2** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.11805) | 2023-12-06 | **30.6** | 28.6 | 23.6 | 30.6 | 41.8 | 31.8 | 27.1 | 29.8 | 26.8 | 10.8 | 20.8 | 40.2 | 33.5 | +| 19 | **LLaVA-1.5-13B** | LMM 🖼️ | [Link](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/) | 2024-01-30 | **27.6** | - | - | - | - | - | - | - | - | - | - | - | - | +| 20 | **SPHINX (V1)** | LMM 🖼️ | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-09 | **27.5** | 23.4 | 23.1 | 21.5 | 39.9 | 34.1 | 25.6 | 28.1 | 23.4 | 16.2 | 17.4 | 40.2 | 23.6 | +| 21 | **Gemini 1.0 Nano 1** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.11805) | 2023-12-06 | **27.3** | 30.9 | 21.6 | 23.7 | 29.1 | 30.7 | 23.8 | 25.5 | 21.3 | 13.5 | 20.8 | 27.9 | 30.9 | +| 22 | **PoT ChatGPT (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.8** | 24.5 | 26.4 
| 23.7 | 33.5 | 27.9 | 27.8 | 26.1 | 28.0 | 18.9 | 13.2 | 33.6 | 29.9 | +| 23 | **LLaVA (LLaMA-2-13B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **26.1** | 26.8 | 29.3 | 16.1 | 32.3 | 26.3 | 27.3 | 20.1 | 28.8 | 24.3 | 18.3 | 37.3 | 25.1 | +| 24 | **InstructBLIP (Vicuna-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.3** | 23.1 | 20.7 | 18.3 | 32.3 | 35.2 | 21.8 | 27.1 | 20.7 | 18.9 | 20.4 | 33.0 | 23.1 | +| 25 | **LLaVAR** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.2** | 21.9 | 25.0 | 16.7 | 34.8 | 30.7 | 24.2 | 22.1 | 23.0 | 13.5 | 15.3 | 42.6 | 21.9 | +| 26 | **LLaMA-Adapter-V2 (7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.9** | 21.2 | 25.5 | 11.3 | 32.3 | 31.8 | 26.3 | 20.4 | 24.3 | 24.3 | 13.9 | 29.5 | 18.3 | +| 27 | **miniGPT4 (LLaMA-2-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **23.1** | 18.6 | 26.0 | 13.4 | 30.4 | 30.2 | 28.1 | 21.0 | 24.7 | 16.2 | 16.7 | 25.4 | 17.9 | +| 28 | **mPLUG-Owl (LLaMA-7B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **22.2** | 22.7 | 23.6 | 10.2 | 27.2 | 27.9 | 23.6 | 19.2 | 23.9 | 13.5 | 12.7 | 26.3 | 21.4 | +| 29 | **IDEFICS (9B-Instruct)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **19.8** | 21.6 | 21.1 | 6.5 | 25.9 | 24.0 | 22.1 | 15.0 | 19.8 | 18.9 | 9.9 | 24.6 | 18.1 | +| 30 | **Random Chance** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **17.9** | 15.5 | 24.1 | 4.5 | 23.4 | 24.3 | 25.8 | 13.8 | 22.7 | 13.4 | 8.8 | 15.8 | 14.3 | + +### Leaderboard on the test subset Accuracy scores on the **test** subset (5,141 examples): -| **#** | **Model** | **Method** | **Source** | **Date** | **ALL** | **FQA** | **GPS** | **MWP** | **TQA** | **VQA** | **ALG** | **ARI** | **GEO** | **LOG** | **NUM** | **SCI** | **STA** | -| ----- | ----------------------------- | ---------- | 
------------------------------------------------------------ | ---------- | --------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | -| 1 | **Qwen-VL-Plus 🥇** | LMM 🖼️ | [Link](https://github.com/QwenLM/Qwen-VL) | 2023-12-26 | **44.33** | 55.9 | 34.7 | 29.7 | 58.8 | 42.4 | 40.7 | 35.4 | 36.6 | 21.6 | 30.4 | 55.9 | 56.3 | -| 2 | **SPHINX-MoE 🥈** | MoE 🤖 | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2024-01-13 | **42.68** | 50.3 | 29.7 | 40.9 | 49.3 | 43.3 | 33.9 | 43.0 | 29.1 | 14.4 | 26.3 | 46.9 | 51.2 | -| 3 | **PoT GPT-4 (Caption+OCR) 🥉** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **31.74** | 27.6 | 37.4 | 23.9 | 43.0 | 30.3 | 37.1 | 27.9 | 37.5 | 22.7 | 15.8 | 44.5 | 31.9 | -| 4 | **CoT GPT4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **30.50** | 27.2 | 35.9 | 21.3 | 43.1 | 28.2 | 35.7 | 25.2 | 35.8 | 24.7 | 15.4 | 47.3 | 31.3 | -| 5 | **LLaVA (LLaMA-2-13B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.40** | 22.9 | 24.6 | 18.1 | 35.8 | 29.7 | 26.9 | 22.5 | 24.4 | 19.1 | 19.1 | 34.7 | 21.6 | -| * | **Random Chance** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **17.86** | 15.5 | 24.1 | 4.5 | 23.4 | 24.3 | 25.8 | 13.8 | 22.7 | 13.4 | 8.8 | 15.8 | 14.3 | +| **#** | **Model** | **Method** | **Source** | **Date** | **ALL** | **FQA** | **GPS** | **MWP** | **TQA** | **VQA** | **ALG** | **ARI** | **GEO** | **LOG** | **NUM** | **SCI** | **STA** | +| ----- | ------------------------------- | ---------- | ------------------------------------------------------------ | ---------- | --------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | +| 1 | **InternVL-Chat-V1.2-Plus 🥇** | LMM 🖼️ | [Link](https://arxiv.org/abs/2312.14238) | 2024-02-22 | **60.18** | 52.2 | 56.2 | 78.3 | 61.6 | 55.5 | 
56.0 | 64.4 | 57.6 | 21.6 | 46.1 | 60.0 | 60.1 | +| 2 | **InternLM-XComposer2-VL-7B 🥈** | LMM 🖼️ | [Link](https://github.com/InternLM/InternLM-XComposer) | 2024-01-22 | **57.93** | 53.9 | 56.4 | 77.1 | 58.4 | 43.2 | 54.8 | 57.6 | 58.0 | 16.5 | 47.6 | 59.1 | 62.5 | +| 3 | **Qwen-VL-Plus 🥉** | LMM 🖼️ | [Link](https://github.com/QwenLM/Qwen-VL) | 2023-12-26 | **44.33** | 55.9 | 34.7 | 29.7 | 58.8 | 42.4 | 40.7 | 35.4 | 36.6 | 21.6 | 30.4 | 55.9 | 56.3 | +| 4 | **SPHINX-MoE** | MoE 🤖 | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2024-01-13 | **42.68** | 50.3 | 29.7 | 40.9 | 49.3 | 43.3 | 33.9 | 43.0 | 29.1 | 14.4 | 26.3 | 46.9 | 51.2 | +| 5 | **PoT GPT-4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **31.74** | 27.6 | 37.4 | 23.9 | 43.0 | 30.3 | 37.1 | 27.9 | 37.5 | 22.7 | 15.8 | 44.5 | 31.9 | +| 6 | **CoT GPT4 (Caption+OCR)** | Tool 🛠️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **30.50** | 27.2 | 35.9 | 21.3 | 43.1 | 28.2 | 35.7 | 25.2 | 35.8 | 24.7 | 15.4 | 47.3 | 31.3 | +| 7 | **LLaVA (LLaMA-2-13B)** | LMM 🖼️ | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **25.40** | 22.9 | 24.6 | 18.1 | 35.8 | 29.7 | 26.9 | 22.5 | 24.4 | 19.1 | 19.1 | 34.7 | 21.6 | +| * | **Random Chance** | - | [Link](https://arxiv.org/abs/2310.02255) | 2023-10-03 | **17.86** | 15.5 | 24.1 | 4.5 | 23.4 | 24.3 | 25.8 | 13.8 | 22.7 | 13.4 | 8.8 | 15.8 | 14.3 | Some notations in the table: @@ -393,7 +417,7 @@ Additionally, ensure that the API keys for ChatGPT, GPT-4, Claude-2, and Bard ar If you have setted Multimodal Bard, you can run the following commands: -Generate the response: +Generate the response on the **testmini** subset: ```sh cd evaluation @@ -404,7 +428,7 @@ python generate_response.py \ --output_file output_bard.json ``` -Extract the short answer text for score calculation: +Extract the short answer text for score calculation on the **testmini** subset: ```sh python extract_answer.py \ @@ 
-412,7 +436,7 @@ python extract_answer.py \
 --output_file output_bard.json
 ```
 
-Calculate the final score:
+Calculate the final score on the **testmini** subset:
 
 ```sh
 python calculate_score.py \
@@ -421,9 +445,27 @@ python calculate_score.py \
 --score_file scores_bard.json
 ```
 
+Generate the response on the **test** subset:
+
+```sh
+python generate_response.py \
+--model bard \
+--input_file test.json \
+--output_dir ../results/bard \
+--output_file output_bard_test.json
+```
+
+Extract the short answer text for score calculation on the **test** subset:
+
+```sh
+python extract_answer.py \
+--output_dir ../results/bard \
+--output_file output_bard_test.json
+```
+
 ### Evaluating Chain-of-Thought GPT-4
 
-Generate the response:
+Generate the response on the **testmini** subset:
 
 ```sh
 cd evaluation
@@ -440,7 +482,7 @@ python generate_response.py \
 --ocr_file ../data/texts/ocrs_easyocr.json
 ```
 
-Extract the short answer text for score calculation:
+Extract the short answer text for score calculation on the **testmini** subset:
 
 ```sh
 python extract_answer.py \
@@ -448,7 +490,7 @@ python extract_answer.py \
 --output_file output_gpt4_2shot_solution_use_caption_ocr.json
 ```
 
-Calculate the final score:
+Calculate the final score on the **testmini** subset:
 
 ```sh
 python calculate_score.py \
@@ -457,9 +499,33 @@ python calculate_score.py \
 --score_file scores_gpt4_2shot_solution_use_caption_ocr.json
 ```
 
+Generate the response on the **test** subset:
+
+```sh
+python generate_response.py \
+--model gpt-4-0613 \
+--input_file test.json \
+--output_dir ../results/gpt4 \
+--output_file output_test_gpt4_2shot_solution_use_caption_ocr.json \
+--shot_num 2 \
+--shot_type solution \
+--use_caption \
+--use_ocr \
+--caption_file ../data/texts/captions_bard.json \
+--ocr_file ../data/texts/ocrs_easyocr.json
+```
+
+Extract the short answer text for score calculation on the **test** subset:
+
+```sh
+python extract_answer.py \
+--output_dir ../results/gpt4 \
+--output_file output_test_gpt4_2shot_solution_use_caption_ocr.json
+```
+
 ### Evaluating Program-of-Thought GPT-4
 
-Generate the response:
+Generate the response on the **testmini** subset:
 
 ```sh
 cd evaluation
@@ -476,7 +542,7 @@ python generate_response.py \
 --ocr_file ../data/texts/ocrs_easyocr.json
 ```
 
-Extract the short answer text for score calculation:
+Extract the short answer text for score calculation on the **testmini** subset:
 
 ```sh
 python extract_answer.py \
@@ -485,7 +551,7 @@ python extract_answer.py \
 --response_label execution
 ```
 
-Calculate the final score:
+Calculate the final score on the **testmini** subset:
 
 ```sh
 python calculate_score.py \
@@ -494,6 +560,31 @@ python calculate_score.py \
 --score_file scores_gpt4_2shot_code_use_caption_ocr.json
 ```
 
+Generate the response on the **test** subset:
+
+```sh
+python generate_response.py \
+--model gpt-4-0613 \
+--input_file test.json \
+--output_dir ../results/gpt4 \
+--output_file output_test_gpt4_2shot_code_use_caption_ocr.json \
+--shot_num 2 \
+--shot_type code \
+--use_caption \
+--use_ocr \
+--caption_file ../data/texts/captions_bard.json \
+--ocr_file ../data/texts/ocrs_easyocr.json
+```
+
+Extract the short answer text for score calculation on the **test** subset:
+
+```sh
+python extract_answer.py \
+--output_dir ../results/gpt4 \
+--output_file output_test_gpt4_2shot_code_use_caption_ocr.json \
+--response_label execution
+```
+
 ### Evaluating More Settings
 
 For additional settings for large language models and other baselines, please refer to the running scripts available in the [`scripts`](https://github.com/lupantech/MathVista/tree/main/scripts) directory.
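Note on the test-subset commands in this README change: `extract_answer.py` needs the same `--output_dir` and `--output_file` that `generate_response.py` wrote to. The following is a minimal Python sketch, not part of the repository, of a helper that builds both argument lists from a single pair of values so the two steps cannot drift apart. The helper name `build_test_commands` and its `extra_args` parameter are hypothetical; only the script names and flags are taken from the commands in this diff.

```python
from typing import List, Sequence


def build_test_commands(
    model: str,
    output_dir: str,
    output_file: str,
    extra_args: Sequence[str] = (),
) -> List[List[str]]:
    """Build argv lists for the two test-subset steps (hypothetical helper)."""
    # Step 1: generate responses on the test subset.
    generate = [
        "python", "generate_response.py",
        "--model", model,
        "--input_file", "test.json",
        "--output_dir", output_dir,
        "--output_file", output_file,
        *extra_args,  # e.g. the --shot_num / --shot_type flags
    ]
    # Step 2: extract short answers from the file written in step 1,
    # reusing exactly the same directory and file name.
    extract = [
        "python", "extract_answer.py",
        "--output_dir", output_dir,
        "--output_file", output_file,
    ]
    return [generate, extract]


if __name__ == "__main__":
    for cmd in build_test_commands(
        "bard", "../results/bard", "output_bard_test.json"
    ):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` from the `evaluation` directory. The Program-of-Thought extraction step additionally takes `--response_label execution`, which this sketch does not add.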
diff --git a/results/leaderboad_submission_template/output_test_template_for_leaderboard_submission_response_only.json b/results/leaderboad_submission_template/output_test_template_for_leaderboard_submission_response_only.json deleted file mode 100644 index bd54378..0000000 --- a/results/leaderboad_submission_template/output_test_template_for_leaderboard_submission_response_only.json +++ /dev/null @@ -1,5143 +0,0 @@ -{ - "1001": "2", - "1002": "No", - "1003": "2", - "1004": "2", - "1005": "80\u00b0", - "1006": "2", - "1007": "2", - "1008": "2", - "1009": "2", - "1010": "quarter", - "1011": "No", - "1012": "No", - "1013": "2", - "1014": "No", - "1015": "25", - "1016": "12\u221a{3}\u6d77\u91cc", - "1017": "2", - "1018": "2", - "1019": "Nine", - "1020": "neither; white and pink are equally likely", - "1021": "No", - "1022": "No", - "1023": "no", - "1024": "No", - "1025": "2", - "1026": "2", - "1027": "2", - "1028": "2", - "1029": "2", - "1030": "2", - "1031": "No", - "1032": "no", - "1033": "2", - "1034": "2", - "1035": "40", - "1036": "no", - "1037": "B", - "1038": "2", - "1039": "72", - "1040": "12.5", - "1041": "2", - "1042": "2", - "1043": "2", - "1044": "2", - "1045": "no", - "1046": "can't predict", - "1047": "No", - "1048": "No", - "1049": "2", - "1050": "0.21", - "1051": "C", - "1052": "2", - "1053": "6\u221a{3}cm", - "1054": "2", - "1055": "2", - "1056": "No", - "1057": "2", - "1058": "65\u00b0", - "1059": "65\u00b0", - "1060": "2", - "1061": "0.21", - "1062": "2", - "1063": "No", - "1064": "11", - "1065": "C", - "1066": "2", - "1067": "2", - "1068": "2", - "1069": "1.2", - "1070": "Does Heather's snowboard slide down a hill in less time when it has a thin layer of wax or a thick layer of wax?", - "1071": "2", - "1072": "equal to", - "1073": "no", - "1074": "paperclips", - "1075": "18", - "1076": "40\u00b0", - "1077": "1.2", - "1078": "2", - "1079": "2", - "1080": "no", - "1081": "2", - "1082": "No", - "1083": "2", - "1084": "2", - "1085": "2", - "1086": "2", 
- "1087": "2", - "1088": "150", - "1089": "2", - "1090": "no", - "1091": "195", - "1092": "\\frac{4}{5}", - "1093": "2", - "1094": "30\u00b0", - "1095": "2", - "1096": "2", - "1097": "III only", - "1098": "no", - "1099": "2", - "1100": "2", - "1101": "4 \\sqrt { 3 }", - "1102": "2", - "1103": "2", - "1104": "2", - "1105": "2", - "1106": "2", - "1107": "2", - "1108": "No", - "1109": "no", - "1110": "2", - "1111": "2", - "1112": "60\u00b0", - "1113": "no", - "1114": "2", - "1115": "2", - "1116": "50\u00b0", - "1117": "blue and orange", - "1118": "2", - "1119": "red", - "1120": "2", - "1121": "0.21", - "1122": "June", - "1123": "No", - "1124": "2", - "1125": "No", - "1126": "12:30 P.M.", - "1127": "2", - "1128": "2", - "1129": "1.2", - "1130": "No", - "1131": "2", - "1132": "2", - "1133": "2cm", - "1134": "No", - "1135": "2", - "1136": "No", - "1137": "7/9", - "1138": "2", - "1139": "0.21", - "1140": "2", - "1141": "No", - "1142": "0.21", - "1143": "Seattle-Bremerton", - "1144": "2", - "1145": "2", - "1146": "2", - "1147": "45", - "1148": "100\u00b0", - "1149": "1:2", - "1150": "No", - "1151": "2", - "1152": "0.21", - "1153": "plant", - "1154": "2", - "1155": "2", - "1156": "1.2", - "1157": "20\u00b0", - "1158": "65\u00b0", - "1159": "No", - "1160": "4cm", - "1161": "segments AB and BC", - "1162": "2", - "1163": "No", - "1164": "December, January, February, and March", - "1165": "2", - "1166": "2", - "1167": "6:8", - "1168": "Mice", - "1169": "60", - "1170": "2", - "1171": "no", - "1172": "2", - "1173": "2", - "1174": "21 \\pi", - "1175": "Death of lizards", - "1176": "40\u00b0", - "1177": "6", - "1178": "6.10\u7c73", - "1179": "2", - "1180": "no", - "1181": "Population will decrease", - "1182": "a lower elevation than", - "1183": "No", - "1184": "120\u00b0", - "1185": "2", - "1186": "2", - "1187": "9:00", - "1188": "2", - "1189": "No", - "1190": "2", - "1191": "2", - "1192": "2", - "1193": "2", - "1194": "2", - "1195": "no", - "1196": "No", - "1197": "2", - "1198": 
"0.21", - "1199": "30", - "1200": "2", - "1201": "2", - "1202": "2", - "1203": "w-o Processor", - "1204": "56\u00b0", - "1205": "2", - "1206": "20", - "1207": "62B-c", - "1208": "2", - "1209": "No", - "1210": "quarter", - "1211": "2", - "1212": "2", - "1213": "C", - "1214": "2", - "1215": "Their population remains constant.", - "1216": "1.2", - "1217": "2", - "1218": "No", - "1219": "5", - "1220": "0.21", - "1221": "wheat will increase", - "1222": "2", - "1223": "65\u00b0", - "1224": "2", - "1225": "2", - "1226": "No", - "1227": "quarter", - "1228": "2", - "1229": "2", - "1230": "2", - "1231": "70\u00b0", - "1232": "no change", - "1233": "Jupiter", - "1234": "No", - "1235": "No", - "1236": "2", - "1237": "4\u221a{2}", - "1238": "2", - "1239": "2", - "1240": "No", - "1241": "2", - "1242": "2", - "1243": "2", - "1244": "2", - "1245": "12", - "1246": "2", - "1247": "\\frac{4}{5}", - "1248": "no", - "1249": "2", - "1250": "2", - "1251": "can't predict", - "1252": "35\u00b0", - "1253": "2", - "1254": "no", - "1255": "45\u00b0", - "1256": "[0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 0, 6]", - "1257": "2", - "1258": "2", - "1259": "2", - "1260": "No", - "1261": "2", - "1262": "3", - "1263": "Riemann sum", - "1264": "2", - "1265": "No", - "1266": "0.21", - "1267": "(1, 1)", - "1268": "2", - "1269": "2", - "1270": "h(x)", - "1271": "45\u00b0", - "1272": "no", - "1273": "2", - "1274": "nothing", - "1275": "Earth", - "1276": "No", - "1277": "Computer Data Company", - "1278": "Otter", - "1279": "2", - "1280": "2", - "1281": "No", - "1282": "2", - "1283": "8cm", - "1284": "2", - "1285": "1.2", - "1286": "2", - "1287": "50\u00b0", - "1288": "48\u00b0", - "1289": "45", - "1290": "No", - "1291": "59\u00b0", - "1292": "No", - "1293": "no", - "1294": "18", - "1295": "17", - "1296": "180", - "1297": "2", - "1298": "2", - "1299": "55\u00b0", - "1300": "2", - "1301": "2", - "1302": ".40\u00b0", - "1303": "2", - "1304": "7", - "1305": "2", - "1306": "10.24", - "1307": "B", - "1308": "no", - "1309": 
"no", - "1310": "55\u00b0", - "1311": "552", - "1312": "No", - "1313": "2", - "1314": "2", - "1315": "4.5", - "1316": "7.9", - "1317": "an exponential function", - "1318": "deep deposit feeders", - "1319": "2", - "1320": "389.5", - "1321": "2", - "1322": "2", - "1323": "2", - "1324": "60\u00b0", - "1325": "no", - "1326": "2", - "1327": "50\u00b0", - "1328": "2", - "1329": "2", - "1330": "2", - "1331": "blue, orange, and grey", - "1332": "No", - "1333": "20\u00b0", - "1334": "1.2", - "1335": "no", - "1336": "equal to", - "1337": "\\frac{2}{3}", - "1338": "2", - "1339": "No", - "1340": "2", - "1341": "No", - "1342": "paperclips", - "1343": "1", - "1344": "5:1", - "1345": "2", - "1346": "2", - "1347": "No", - "1348": "2", - "1349": "2", - "1350": "No", - "1351": "100\u00b0", - "1352": "2", - "1353": "4", - "1354": "2", - "1355": "2", - "1356": "2", - "1357": "No", - "1358": "2", - "1359": "2", - "1360": "20", - "1361": "2", - "1362": "2", - "1363": "4", - "1364": "2", - "1365": "2", - "1366": "ants", - "1367": "0.21", - "1368": "108", - "1369": "surplus", - "1370": "2", - "1371": "2", - "1372": "2", - "1373": "Make-an-Audio 2", - "1374": "2", - "1375": "2", - "1376": "2", - "1377": "All the plankton will get destroyed", - "1378": "2", - "1379": "No", - "1380": "25\u00b0", - "1381": "2", - "1382": "54", - "1383": "2", - "1384": "No", - "1385": "22.8", - "1386": "2", - "1387": "2", - "1388": "60", - "1389": "2", - "1390": "2", - "1391": "9*\\pi", - "1392": "4.5", - "1393": "both", - "1394": "RFT k=12", - "1395": "LLaVA", - "1396": "no", - "1397": "2", - "1398": "2", - "1399": "blue", - "1400": "The small fish population would not be affected.", - "1401": "2", - "1402": "2", - "1403": "2", - "1404": "decrease in algae", - "1405": "no", - "1406": "quarter", - "1407": "2", - "1408": "2", - "1409": "2", - "1410": "2", - "1411": "2", - "1412": "20", - "1413": "50", - "1414": "orange", - "1415": "0.21", - "1416": "2", - "1417": "no", - "1418": "no", - "1419": "B", - "1420": 
"60", - "1421": "0.21", - "1422": "2", - "1423": "2", - "1424": "2", - "1425": "No", - "1426": "2", - "1427": "2", - "1428": "2", - "1429": "2", - "1430": "29.8", - "1431": "2", - "1432": "22", - "1433": "2", - "1434": "4", - "1435": "2", - "1436": "2", - "1437": "1\u7c73", - "1438": "2", - "1439": "scissors", - "1440": "2", - "1441": "B", - "1442": "\\frac{3}{5}", - "1443": "150\u00b0", - "1444": "1.8", - "1445": "2", - "1446": "15", - "1447": "2", - "1448": "120\u00b0", - "1449": "none", - "1450": "2", - "1451": "2", - "1452": "102\u00b0", - "1453": "2", - "1454": "120\u00b0", - "1455": "45\u00b0", - "1456": "2", - "1457": "3444.7", - "1458": "0.88", - "1459": "2", - "1460": "54\u00b0", - "1461": "No", - "1462": "6m", - "1463": "32", - "1464": "12", - "1465": "2", - "1466": "1.2", - "1467": "7.4", - "1468": "2", - "1469": "2", - "1470": "2", - "1471": "no", - "1472": "2", - "1473": "2", - "1474": "2", - "1475": "3", - "1476": "C", - "1477": "deer", - "1478": "No", - "1479": "12 \\sqrt { 3 }", - "1480": "No", - "1481": "1.2", - "1482": "2", - "1483": "Sea horse population will increase", - "1484": "No", - "1485": "2", - "1486": "2", - "1487": "14.25%", - "1488": "2", - "1489": "0.21", - "1490": "no", - "1491": "8", - "1492": "65\u00b0", - "1493": "2", - "1494": "2", - "1495": "Cod", - "1496": "darters", - "1497": "56.3", - "1498": "Jupiter", - "1499": "0.21", - "1500": "No", - "1501": "no", - "1502": "40%", - "1503": "continuous", - "1504": "no", - "1505": "no", - "1506": "2", - "1507": "2", - "1508": "D", - "1509": "0.21", - "1510": "12", - "1511": "Grizzly bear", - "1512": "Third", - "1513": "1.2", - "1514": "2", - "1515": "It will remain the same", - "1516": "No", - "1517": "48cm2", - "1518": "0.21", - "1519": "no", - "1520": "No", - "1521": "15", - "1522": "2", - "1523": "2", - "1524": "0.21", - "1525": "quarter", - "1526": "2", - "1527": "2", - "1528": "75", - "1529": "no", - "1530": "2", - "1531": "7", - "1532": "2", - "1533": "2", - "1534": "45 degrees", - 
"1535": "40\u00b0", - "1536": "2", - "1537": "14", - "1538": "2", - "1539": "No", - "1540": "No", - "1541": "2", - "1542": "2", - "1543": "2", - "1544": "No", - "1545": "2", - "1546": "2", - "1547": "21", - "1548": "rat", - "1549": "No", - "1550": "2", - "1551": "55*\\degree", - "1552": "2", - "1553": "2", - "1554": "20.6", - "1555": "No", - "1556": "1.2", - "1557": "90\u00b0", - "1558": "2", - "1559": "2", - "1560": "even", - "1561": "2", - "1562": "2", - "1563": "2", - "1564": "2", - "1565": "2", - "1566": "less than", - "1567": "2", - "1568": "purple", - "1569": "60", - "1570": "2", - "1571": "2", - "1572": "2", - "1573": "115*\\degree", - "1574": "2", - "1575": "D", - "1576": "2", - "1577": "False", - "1578": "45\u00b0", - "1579": "Egg", - "1580": "41\u00b0", - "1581": "50\u00b0", - "1582": "2", - "1583": "2", - "1584": "5.65", - "1585": "2", - "1586": "No", - "1587": "Timothy seed", - "1588": "10*\\pi", - "1589": "2", - "1590": "2", - "1591": "x = v", - "1592": "2", - "1593": "52", - "1594": "2", - "1595": "0.21", - "1596": "2", - "1597": "2", - "1598": "2", - "1599": "8\u00b0", - "1600": "[-1, 1]", - "1601": "2", - "1602": "2", - "1603": "No", - "1604": "no", - "1605": "2", - "1606": "2", - "1607": "2", - "1608": "35\u00b0", - "1609": "no", - "1610": "more otters", - "1611": "No", - "1612": "can't tell", - "1613": "No", - "1614": "C", - "1615": "2", - "1616": "2", - "1617": "no", - "1618": "36 \\sqrt { 11 }", - "1619": "100\u00b0", - "1620": "2", - "1621": "2", - "1622": "10", - "1623": "N(t) decays exponentially", - "1624": "2", - "1625": "No", - "1626": "30", - "1627": "116", - "1628": "10.8", - "1629": "2", - "1630": "9*\\pi", - "1631": "No", - "1632": "2", - "1633": "2", - "1634": "2", - "1635": "2", - "1636": "2", - "1637": "no", - "1638": "No", - "1639": "GPU SynJax Eisner", - "1640": "50", - "1641": "no", - "1642": "50\u00b0", - "1643": "2", - "1644": "2", - "1645": "2", - "1646": "2", - "1647": "2", - "1648": "it would stay the same", - "1649": "2", - 
"1650": "2", - "1651": "no", - "1652": "2", - "1653": "2", - "1654": "2", - "1655": "6.75", - "1656": "2", - "1657": "No", - "1658": "2", - "1659": "no", - "1660": "1-6,3-4,5-2", - "1661": "9.5%", - "1662": "2", - "1663": "2", - "1664": "65\u00b0", - "1665": "2", - "1666": "2", - "1667": "2", - "1668": "no", - "1669": "2", - "1670": "90\u00b0", - "1671": "35\u00b0", - "1672": "No", - "1673": "12m", - "1674": "2", - "1675": "No", - "1676": "4", - "1677": "2", - "1678": "The magnitude of the magnetic force is the same in both pairs.", - "1679": "2", - "1680": "12", - "1681": "no", - "1682": "2", - "1683": "1.2", - "1684": "2", - "1685": "24", - "1686": "24\u00b0", - "1687": "130\u00b0", - "1688": "60.4", - "1689": "no", - "1690": "2", - "1691": "2", - "1692": "1 & 3", - "1693": "2", - "1694": "0.21", - "1695": "2", - "1696": "no", - "1697": "60", - "1698": "2", - "1699": "4", - "1700": "2", - "1701": "35\u00b0", - "1702": "66.4", - "1703": "Increase", - "1704": "2", - "1705": "no", - "1706": "quarter", - "1707": "2", - "1708": "2", - "1709": "No", - "1710": "2", - "1711": "2", - "1712": "2", - "1713": "2", - "1714": "2", - "1715": "0.21", - "1716": "no", - "1717": "2", - "1718": "72", - "1719": "2", - "1720": "No", - "1721": "0.21", - "1722": "2", - "1723": "N", - "1724": "42", - "1725": "2", - "1726": "quarter", - "1727": "no", - "1728": "2", - "1729": "No", - "1730": "No", - "1731": "2", - "1732": "2", - "1733": "no", - "1734": "2", - "1735": "2", - "1736": "2", - "1737": "2", - "1738": "2", - "1739": "5%-7%", - "1740": "55\u00b0", - "1741": "quarter", - "1742": "2", - "1743": "2", - "1744": "8cm", - "1745": "No", - "1746": "2", - "1747": "2", - "1748": "2", - "1749": "8", - "1750": "129.9", - "1751": "2", - "1752": "6.8", - "1753": "2", - "1754": "2", - "1755": "55\u00b0", - "1756": "2", - "1757": "No", - "1758": "45\u00b0", - "1759": "2", - "1760": "2", - "1761": "25", - "1762": "krill and birds", - "1763": "2", - "1764": "No", - "1765": "C", - "1766": "2.5", - 
"1767": "2", - "1768": "2", - "1769": "2", - "1770": "No", - "1771": "2", - "1772": "2", - "1773": "All of them", - "1774": "20\u00b0", - "1775": "2", - "1776": "2", - "1777": "Banana", - "1778": "GCC", - "1779": "CIFAR10", - "1780": "no", - "1781": "2", - "1782": "70", - "1783": "No", - "1784": "27\u00b0", - "1785": "No", - "1786": "No", - "1787": "2", - "1788": "2", - "1789": "No", - "1790": "2", - "1791": "no", - "1792": "12", - "1793": "11:00 A.M.", - "1794": "2", - "1795": "10:30 A.M.", - "1796": "2", - "1797": "No", - "1798": "no", - "1799": "180", - "1800": "No", - "1801": "No", - "1802": "No", - "1803": "80\u00b0", - "1804": "Does Wanda's snowboard slide down a hill in less time when it has a layer of wax or when it does not have a layer of wax?", - "1805": "It would remain unchanged", - "1806": "70cm", - "1807": "2", - "1808": "No", - "1809": "2", - "1810": "2", - "1811": "2", - "1812": "22\u00b0", - "1813": "2", - "1814": "No", - "1815": "60\u00b0", - "1816": "[-1, 1]", - "1817": "2", - "1818": "2", - "1819": "6", - "1820": "2", - "1821": "no", - "1822": "Loggerhead Turtle", - "1823": "2", - "1824": "2", - "1825": "12cm", - "1826": "grasshopper", - "1827": "No", - "1828": "2", - "1829": "deer", - "1830": "no", - "1831": "2", - "1832": "2:30", - "1833": "increase", - "1834": "2", - "1835": "2", - "1836": "3 \\sqrt { 3 }", - "1837": "no", - "1838": "50\u00b0", - "1839": "0.21", - "1840": "No", - "1841": "2", - "1842": "\\frac{3}{2}", - "1843": "2", - "1844": "2", - "1845": "2", - "1846": "Stay the same", - "1847": "No", - "1848": "No", - "1849": "2", - "1850": "1.2", - "1851": "6", - "1852": "Caterpillar", - "1853": "no", - "1854": "2", - "1855": "2", - "1856": "2", - "1857": "2", - "1858": "2\u03c0", - "1859": "2", - "1860": "598", - "1861": "2", - "1862": "no", - "1863": "surplus", - "1864": "2", - "1865": "No", - "1866": "10 + 5/8*\\pi", - "1867": "10", - "1868": "2", - "1869": "No", - "1870": "2", - "1871": "2", - "1872": "[0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 
0, 6]", - "1873": "5, pi/2)", - "1874": "No", - "1875": "no", - "1876": "$k_1 = 8$", - "1877": "45\u00b0", - "1878": "No", - "1879": "No", - "1880": "45", - "1881": "90", - "1882": "no", - "1883": "15", - "1884": "6", - "1885": "12cm", - "1886": "2", - "1887": "2", - "1888": "2", - "1889": "0.21", - "1890": "Does water freeze more quickly than apple juice?", - "1891": "No", - "1892": "0.21", - "1893": "No", - "1894": "2", - "1895": "No", - "1896": "2 hours", - "1897": "2", - "1898": "20", - "1899": "2", - "1900": "2", - "1901": "2", - "1902": "no", - "1903": "2", - "1904": "November", - "1905": "4cm", - "1906": "No", - "1907": "2", - "1908": "60\u00b0", - "1909": "2", - "1910": "equal to", - "1911": "2", - "1912": "No", - "1913": "no", - "1914": "130\u00b0", - "1915": "2*r", - "1916": "2", - "1917": "12", - "1918": "no", - "1919": "No", - "1920": "2", - "1921": "sheep decrease", - "1922": "60\u00b0", - "1923": "35\u00b0", - "1924": "2", - "1925": "2", - "1926": "2", - "1927": "2", - "1928": "C", - "1929": "No", - "1930": "130\u00b0", - "1931": "1.2", - "1932": "No", - "1933": "No", - "1934": "40\u00b0", - "1935": "2", - "1936": "105\u00b0", - "1937": "2", - "1938": "2", - "1939": "2", - "1940": "detritus, only", - "1941": "2", - "1942": "90", - "1943": "No", - "1944": "2", - "1945": "2", - "1946": "2", - "1947": "2", - "1948": "No", - "1949": "2", - "1950": "quarter", - "1951": "nonlinear", - "1952": "4", - "1953": "2", - "1954": "60\u00b0", - "1955": "No", - "1956": "8", - "1957": "2", - "1958": "No", - "1959": "7", - "1960": "2", - "1961": "5", - "1962": "80\u00b0", - "1963": "2", - "1964": "2", - "1965": "2", - "1966": "0.21", - "1967": "1.2", - "1968": "no", - "1969": "no", - "1970": "No", - "1971": "2", - "1972": "2", - "1973": "2:41", - "1974": "less perch", - "1975": "G", - "1976": "2", - "1977": "4", - "1978": "0.21", - "1979": "2", - "1980": "no", - "1981": "wolf will increase", - "1982": "2", - "1983": "2", - "1984": "no", - "1985": "2.4", - "1986": "2", 
- "1987": "No", - "1988": "2", - "1989": "2", - "1990": "2", - "1991": "2", - "1992": "vs. 04", - "1993": "8", - "1994": "48.2", - "1995": "2", - "1996": "0.21", - "1997": "1.2", - "1998": "5", - "1999": "Plants", - "2000": "0.21", - "2001": "No", - "2002": "no", - "2003": "2", - "2004": "9.2", - "2005": "0.21", - "2006": "No", - "2007": "80", - "2008": "5", - "2009": "2", - "2010": "2", - "2011": "8 \\sqrt { 3 }", - "2012": "0.21", - "2013": "6cm", - "2014": "false", - "2015": "no", - "2016": "no", - "2017": "70\u00b0", - "2018": "2", - "2019": "8", - "2020": "7", - "2021": "surplus", - "2022": "0.21", - "2023": "2", - "2024": "2", - "2025": "49", - "2026": "0.21", - "2027": "2", - "2028": "69", - "2029": "No", - "2030": "123", - "2031": "About the same amount of precipitation falls each month between May and October.", - "2032": "no", - "2033": "No", - "2034": "blue", - "2035": "no", - "2036": "No", - "2037": "frog", - "2038": "2", - "2039": "6", - "2040": "2", - "2041": "49", - "2042": "45 minutes", - "2043": "2", - "2044": "2", - "2045": "10.2", - "2046": "2", - "2047": "\\frac{\u221a2}{2}", - "2048": "2", - "2049": "No", - "2050": "B", - "2051": "2", - "2052": "2", - "2053": "2", - "2054": "6cm", - "2055": "no", - "2056": "No", - "2057": "2", - "2058": "No", - "2059": "110\u00b0", - "2060": "No", - "2061": "about 30", - "2062": "No", - "2063": "No", - "2064": "2", - "2065": "2", - "2066": "increase in turtles", - "2067": "No", - "2068": "2.5", - "2069": "no", - "2070": "1.2", - "2071": "0.21", - "2072": "16cm", - "2073": "2", - "2074": "No", - "2075": "D", - "2076": "2", - "2077": "2", - "2078": "The amount of pine available would double.", - "2079": "3\u221a{3}\u7c73", - "2080": "semionotus", - "2081": "no", - "2082": "No", - "2083": "2", - "2084": "2", - "2085": "\\frac{3}{2}\u221a{10}", - "2086": "2", - "2087": "No", - "2088": "2", - "2089": "2", - "2090": "2", - "2091": "2", - "2092": "2", - "2093": "2", - "2094": "2", - "2095": "20\u00b0", - "2096": "ross 
sea.", - "2097": "2", - "2098": "2", - "2099": "85*\\degree", - "2100": "2", - "2101": "2", - "2102": "decrease", - "2103": "2", - "2104": "19", - "2105": "2", - "2106": "2", - "2107": "2", - "2108": "2", - "2109": "2", - "2110": "2", - "2111": "2", - "2112": "No", - "2113": "2", - "2114": "no", - "2115": "2", - "2116": "remains the same", - "2117": "No", - "2118": "2", - "2119": "1440", - "2120": "even", - "2121": "8", - "2122": "Wild cat", - "2123": "probably will decrease", - "2124": "70", - "2125": "14", - "2126": "D", - "2127": "11cm", - "2128": "2", - "2129": "2", - "2130": "60\u00b0", - "2131": "0.21", - "2132": "2", - "2133": "2", - "2134": "0.21", - "2135": "2", - "2136": "30cm", - "2137": "60\u00b0", - "2138": "60\u00b0", - "2139": "2", - "2140": "16", - "2141": "seabirds", - "2142": "2", - "2143": "nonlinear", - "2144": "D", - "2145": "2", - "2146": "2", - "2147": "quarter", - "2148": "no", - "2149": "2", - "2150": "2", - "2151": "no", - "2152": "1.2", - "2153": "2", - "2154": "2", - "2155": "3.8%", - "2156": "2", - "2157": "no", - "2158": "2", - "2159": "0.21", - "2160": "3", - "2161": "2", - "2162": "3*\\sqrt{2}", - "2163": "2", - "2164": "6", - "2165": "B", - "2166": "60\u00b0", - "2167": "34\u00b0", - "2168": "2", - "2169": "2.5", - "2170": "an exponential function", - "2171": "\\frac{2\u221a{13}}{13}", - "2172": "2", - "2173": "0.21", - "2174": "53\u00b0", - "2175": "2", - "2176": "16 \\pi", - "2177": "2", - "2178": "2", - "2179": "No", - "2180": "0.21", - "2181": "no", - "2182": "surplus", - "2183": "No", - "2184": "2", - "2185": "about 40", - "2186": "quarter", - "2187": "2", - "2188": "2", - "2189": "2", - "2190": "No", - "2191": "2", - "2192": "2", - "2193": "1.2", - "2194": "eagle", - "2195": "2", - "2196": "122\u00b0", - "2197": "7m", - "2198": "2", - "2199": "2", - "2200": "336", - "2201": "2", - "2202": "2", - "2203": "14", - "2204": "2", - "2205": "119\u00b0", - "2206": "5", - "2207": "16\u03c0", - "2208": "60\u00b0", - "2209": "2", - 
"2210": "19.5", - "2211": "50\u00b0", - "2212": "2", - "2213": "30\u00b0", - "2214": "61.8%-63.1%", - "2215": "No", - "2216": "2", - "2217": "No", - "2218": "2", - "2219": "No", - "2220": "OF-9B", - "2221": "2", - "2222": "More precipitation falls in April than in August.", - "2223": "The tiny algae population would increase.", - "2224": "45\u00b0", - "2225": "2", - "2226": "8", - "2227": "2", - "2228": "D", - "2229": "2", - "2230": "1.2", - "2231": "108", - "2232": "2", - "2233": "65", - "2234": "2", - "2235": "2", - "2236": "3", - "2237": "no", - "2238": "2", - "2239": "122", - "2240": "7:30", - "2241": "22", - "2242": "1.2", - "2243": "8 \\sqrt { 5 }", - "2244": "2", - "2245": "2", - "2246": "2", - "2247": "2", - "2248": "11*\\sqrt{2}", - "2249": "6", - "2250": "2", - "2251": "increase", - "2252": "C", - "2253": "Crabs", - "2254": "2", - "2255": "No", - "2256": "62", - "2257": "2", - "2258": "2", - "2259": "10", - "2260": "No", - "2261": "2", - "2262": "no", - "2263": "DTD", - "2264": "163", - "2265": "60", - "2266": "2", - "2267": "40\u00b0", - "2268": "No", - "2269": "100\u00b0", - "2270": "0.21", - "2271": "No", - "2272": "2", - "2273": "2", - "2274": "2", - "2275": "2", - "2276": "2", - "2277": "2", - "2278": "2", - "2279": "2", - "2280": "2", - "2281": "It would increase", - "2282": "increase", - "2283": "2", - "2284": "2", - "2285": "2", - "2286": "Argentina", - "2287": "zero", - "2288": "5", - "2289": "Lanceolate", - "2290": "2", - "2291": "[0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 0, 6]", - "2292": "2", - "2293": "b", - "2294": "2", - "2295": "68\u00b0", - "2296": "purple", - "2297": "no", - "2298": "2", - "2299": "2", - "2300": "2", - "2301": "no", - "2302": "3", - "2303": "no", - "2304": "2", - "2305": "10", - "2306": "3", - "2307": "45\u00b0", - "2308": "2", - "2309": "No", - "2310": "2", - "2311": "0.21", - "2312": "No", - "2313": "2", - "2314": "5", - "2315": "12", - "2316": "27.5\u00b0", - "2317": "zero", - "2318": "2", - "2319": "2", - "2320": "2", - "2321": 
"21", - "2322": "No", - "2323": "2", - "2324": "2", - "2325": "2", - "2326": "2", - "2327": "2", - "2328": "decaying", - "2329": "50\u00b0", - "2330": "4cm", - "2331": "grass", - "2332": "2", - "2333": "No", - "2334": "2", - "2335": "1.2", - "2336": "2", - "2337": "12", - "2338": "Elliptical", - "2339": "5*\\sqrt{2}", - "2340": "2", - "2341": "2", - "2342": "2", - "2343": "5", - "2344": "no", - "2345": "No", - "2346": "2", - "2347": "No", - "2348": "72\u00b0", - "2349": "No", - "2350": "No", - "2351": "115\u00b0", - "2352": "12*\\sqrt{3}", - "2353": "2", - "2354": "quarter", - "2355": "2", - "2356": "1:55 P.M.", - "2357": "2", - "2358": "0.6", - "2359": "1.2", - "2360": "D", - "2361": "2", - "2362": "Mountain lions would decrease", - "2363": "9", - "2364": "quarter", - "2365": "78\u00b0", - "2366": "60", - "2367": "0.21", - "2368": "2", - "2369": "No", - "2370": "2", - "2371": "2", - "2372": "2", - "2373": "2", - "2374": "36", - "2375": "2", - "2376": "no", - "2377": "3", - "2378": "27\u00b0", - "2379": "2", - "2380": "20\u00b0", - "2381": "1:4", - "2382": "2", - "2383": "\\frac{3}{2}", - "2384": "2", - "2385": "2", - "2386": "no", - "2387": "2", - "2388": "No", - "2389": "c", - "2390": "2", - "2391": "2", - "2392": "45\u00b0", - "2393": "No", - "2394": "2", - "2395": "0.21", - "2396": "2", - "2397": "no", - "2398": "60\u00b0", - "2399": "2", - "2400": "quarter", - "2401": "1200\u03c0cm^{2}", - "2402": "2", - "2403": "2", - "2404": "12cm", - "2405": "60", - "2406": "9", - "2407": "2", - "2408": "24", - "2409": "No", - "2410": "2", - "2411": "No", - "2412": "2", - "2413": "2", - "2414": "2", - "2415": "2", - "2416": "No", - "2417": "2", - "2418": "24", - "2419": "2", - "2420": "No", - "2421": "9", - "2422": "no", - "2423": "No", - "2424": "54.8", - "2425": "0.21", - "2426": "2", - "2427": "65\u00b0", - "2428": "2", - "2429": "No", - "2430": "6", - "2431": "97\u00b0", - "2432": "2", - "2433": "2", - "2434": "no", - "2435": "110", - "2436": "decrease", - "2437": 
"increase in deer", - "2438": "60", - "2439": "Stay the same", - "2440": "1.2", - "2441": "2", - "2442": "1.2", - "2443": "0.21", - "2444": "5", - "2445": "77", - "2446": "quarter", - "2447": "B", - "2448": "2", - "2449": "2", - "2450": "2", - "2451": "no", - "2452": "2", - "2453": "2", - "2454": "145\u00b0", - "2455": "28", - "2456": "60\u00b0", - "2457": "decrease", - "2458": "55\u00b0", - "2459": "GPT-4 (3-shot) ", - "2460": "4", - "2461": "The ladybug population will increase.", - "2462": "1.2", - "2463": "2", - "2464": "2", - "2465": "26\u00b0", - "2466": "2", - "2467": "2", - "2468": "65\u00b0", - "2469": "D", - "2470": "2", - "2471": "a decrease in racoons", - "2472": "No", - "2473": "No", - "2474": "2", - "2475": "No", - "2476": "2", - "2477": "2", - "2478": "No", - "2479": "2", - "2480": "quarter", - "2481": "6", - "2482": "2", - "2483": "255,600-1,600,800", - "2484": "3", - "2485": "2", - "2486": "3\u03c0", - "2487": "105\u00b0", - "2488": "85\u00b0", - "2489": "2", - "2490": "2", - "2491": "Phytoplankton", - "2492": "no", - "2493": "0.21", - "2494": "2", - "2495": "Population of squids and fishes would increase.", - "2496": "2", - "2497": "No", - "2498": "2", - "2499": "30", - "2500": "quarter", - "2501": "no", - "2502": "2", - "2503": "no", - "2504": "77\u00b0", - "2505": "4", - "2506": "2", - "2507": "49", - "2508": "2", - "2509": "Chipmunks population would decrease.", - "2510": "No", - "2511": "3", - "2512": "No", - "2513": "2", - "2514": "no", - "2515": "22.5", - "2516": "9", - "2517": "MusicLDM w/. 
BAM ", - "2518": "88", - "2519": "2", - "2520": "nonlinear", - "2521": "\\frac{1}{3}", - "2522": "No", - "2523": "No", - "2524": "No", - "2525": "2", - "2526": "0.21", - "2527": "2", - "2528": "2", - "2529": "2", - "2530": "The rosebush population would increase.", - "2531": "2", - "2532": "surplus", - "2533": "No", - "2534": "2", - "2535": "2", - "2536": "2", - "2537": "2", - "2538": "odd", - "2539": "B", - "2540": "50\u00b0", - "2541": "2", - "2542": "No", - "2543": "2", - "2544": "2", - "2545": "No", - "2546": "No", - "2547": "50\u00b0", - "2548": "2", - "2549": "2", - "2550": "Lizard", - "2551": "B", - "2552": "2", - "2553": "The snake population will increase. (D)The mouse population will decrease. (A) The plant populations will decrease (B) The bird population will decrease", - "2554": "increase", - "2555": "2", - "2556": "2", - "2557": "2", - "2558": "no", - "2559": "2", - "2560": "No", - "2561": "1.2", - "2562": "0.21", - "2563": "10", - "2564": "8", - "2565": "9", - "2566": "11", - "2567": "Increase in number of cottontails", - "2568": "2", - "2569": "14cm", - "2570": "6", - "2571": "2", - "2572": "2", - "2573": "No", - "2574": "no", - "2575": "no", - "2576": "0.21", - "2577": "C", - "2578": "2", - "2579": "0.21", - "2580": "No", - "2581": "Phytoplankton", - "2582": "no", - "2583": "no", - "2584": "w/o $L_{cm}$", - "2585": "no", - "2586": "2", - "2587": "3\u221a{14}", - "2588": "There will be more grass", - "2589": "no", - "2590": "2", - "2591": "11:50 A.M.", - "2592": "35\u00b0", - "2593": "No", - "2594": "2", - "2595": "5 \\sqrt { 3 }", - "2596": "sinuate", - "2597": "The beetle and grasshopper populations would multiply exponentially.", - "2598": "2", - "2599": "y = 2", - "2600": "0.48", - "2601": "3", - "2602": "2", - "2603": "No", - "2604": "no", - "2605": "Tiny shrimps", - "2606": "40\u00b0", - "2607": "2", - "2608": "2", - "2609": "2", - "2610": "No", - "2611": "40\u00b0", - "2612": "aphid will increase", - "2613": "53", - "2614": "2", - "2615": 
"18", - "2616": "2", - "2617": "14 \\sqrt { 3 }", - "2618": "2", - "2619": "B", - "2620": "2", - "2621": "No", - "2622": "2", - "2623": "2", - "2624": "no", - "2625": "2", - "2626": "2", - "2627": "20", - "2628": "2.5", - "2629": "35\u00b0", - "2630": "60\u00b0", - "2631": "105*\\degree", - "2632": "no", - "2633": "decaying", - "2634": "2", - "2635": "2", - "2636": "1.2", - "2637": "2", - "2638": "2", - "2639": "5", - "2640": "9", - "2641": "75", - "2642": "No", - "2643": "2", - "2644": "21", - "2645": "No", - "2646": "square", - "2647": "65\u00b0", - "2648": "2", - "2649": "70", - "2650": "0.21", - "2651": "No", - "2652": "2", - "2653": "2", - "2654": "4.5%", - "2655": "2", - "2656": "2", - "2657": "2", - "2658": "equal to", - "2659": "2", - "2660": "\\frac{\u221a6}{3}", - "2661": "no", - "2662": "0.21", - "2663": "2", - "2664": "2", - "2665": "no", - "2666": "2", - "2667": "120", - "2668": "2", - "2669": "110\u00b0", - "2670": "No", - "2671": "it would increase", - "2672": "2", - "2673": "2", - "2674": "2", - "2675": "2", - "2676": "2", - "2677": "2", - "2678": "The magnitude of the magnetic force is the same in both pairs.", - "2679": "36", - "2680": "Precipitation does not change much from month to month.", - "2681": "2", - "2682": "May through September", - "2683": "No", - "2684": "1