first commit
lupantech committed Oct 4, 2023
1 parent 80c8128 commit f036866
Showing 121 changed files with 2,049,478 additions and 1 deletion.
12 changes: 12 additions & 0 deletions .gitignore
@@ -0,0 +1,12 @@
env/
# data/scienceqa/images/

checkpoints/*.pt

*.zip
*.idea
*.vscode
*.DS_Store
*.ipynb_checkpoints
*.pyc
*__pycache__
437 changes: 437 additions & 0 deletions LICENSE


207 changes: 206 additions & 1 deletion README.md
@@ -1 +1,206 @@
# MathVista
# <img src="https://mathvista.github.io/static/images/mathvista.png" alt="Logo" style="zoom:10%;" /> MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts



Code for the Paper "[MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts](https://arxiv.org/abs/2310.02255)".

:bell: If you have any questions or suggestions, please don't hesitate to let us know. You can email [Pan Lu](https://lupantech.github.io/) directly at [email protected], reach out on [Twitter](https://twitter.com/lupantech), or post an issue on this repository.

[[Project Page](https://mathvista.github.io/)] [[Paper](https://arxiv.org/abs/2310.02255)]

<p align="center">
<img src="assets/mathvista.png" width="15%"> <br>
Tentative logo for <b>MathVista</b>.
</p>




## 💥 News 💥

- **[2023.10.03]** Our paper is now accessible at https://arxiv.org/abs/2310.02255.



## About MathVista

Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive skills in various domains, their ability to reason mathematically in visual contexts has not been formally examined. Equipping LLMs and LMMs with this capability is vital for general-purpose AI assistants and holds promising potential in education, data analysis, and scientific discovery.

To bridge this gap, we present **MathVista**, a benchmark designed to amalgamate challenges from **diverse mathematical and visual tasks**. We first taxonomize the key task types, reasoning skills, and visual contexts from the literature to guide our selection from **28 existing math-focused and visual question answering datasets**. Then, **we construct three new datasets, IQTest, FunctionQA, and PaperQA**, to cover the missing types of visual contexts. The featured problems often require deep visual understanding beyond OCR or image captioning, as well as compositional reasoning with rich domain-specific tools, thus posing a notable challenge to existing models.

We conduct **a comprehensive evaluation of 11 prominent open-source and proprietary foundation models** (LLMs, LLMs augmented with tools, and LMMs), and **early experiments with GPT-4V**. The best-performing model, Multimodal Bard, achieves only **58%** of human performance (34.8% vs 60.3%), indicating ample room for further improvement. Given this significant gap, **MathVista** fuels future research in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. Preliminary tests show that **MathVista** also presents challenges to GPT-4V, underscoring the benchmark's importance.

For more details, you can find our project page [here](https://mathvista.github.io/) and our paper [here](https://arxiv.org/pdf/2310.02255.pdf).



## Download the MathVista Dataset

You can download the MathVista dataset from [Google Drive](https://drive.google.com/file/d/1jX_nKaoDALEttiN1IR0dr89qLVt8yBkO/view) and store the data files in the `data` folder as follows:

```sh
├── data
│   ├── annot_testmini.json
│   ├── images
│   ├── pids_UniGeo.json
│   ├── query.json
│   ├── README.md
│   ├── source.json
│   ├── test.json
│   ├── testmini.json
│   └── texts
```
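
You can sanity-check the download with a few lines of Python (a minimal sketch; it assumes only the file names listed above, not any particular record schema):

```python
import json

# Load the testmini split from the data folder described above.
with open("data/testmini.json") as f:
    testmini = json.load(f)

# The release may store problems as a list or as a dict keyed by problem id;
# handle both and print the fields of one record rather than assuming them.
records = list(testmini.values()) if isinstance(testmini, dict) else testmini
print(f"{len(records)} problems in testmini")
print(sorted(records[0].keys()))
```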

The MathVista dataset will be available at [HuggingFace Datasets](https://huggingface.co/datasets/lupantech/MathVista) shortly! Stay tuned~
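
Once the HuggingFace release is live, loading it should be a one-liner. A sketch, assuming the dataset is published under the `lupantech/MathVista` id linked above and that the split names mirror the JSON files in the Google Drive release:

```python
from datasets import load_dataset

# Assumed repository id and split name -- check the dataset card once it is up.
dataset = load_dataset("lupantech/MathVista", split="testmini")
print(dataset)
```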



## 🐙 Requirements

- [OpenAI API key](https://platform.openai.com/account/api-keys)
- [Claude API Key](https://docs.anthropic.com/claude/reference/getting-started-with-the-api)
- [Bard API Key](https://bard.google.com/)

Install the Python dependencies if you would like to reproduce our results:

```sh
pip install openai
pip install anthropic
pip install bardapi
```
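
The evaluation scripts need these keys at run time. How each script reads them is implementation-specific, but a common pattern is environment variables; the variable names below are an assumption, not the scripts' documented interface:

```python
import os

# Hypothetical variable names -- check the scripts under `evaluation/` for
# the names they actually expect before relying on these.
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "BARD_API_KEY"):
    if not os.environ.get(key):
        print(f"warning: {key} is not set")
```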



## Run Experiments on MathVista

### Multimodal Bard

If you have set up Multimodal Bard, you can run the following commands:

Generate the response:

```sh
cd evaluation

python generate_response.py \
    --model bard \
    --output_dir ../results/bard \
    --output_file output_bard.json
```

Extract the short answer text for score calculation:

```sh
python extract_answer.py \
    --output_dir ../results/bard \
    --output_file output_bard.json
```
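
Extraction normalizes a verbose model response into a short answer string that can be compared against the ground truth. A rule-based sketch of the idea (illustrative only; `extract_answer.py` may use different heuristics or an LLM for this step):

```python
import re

def extract_short_answer(response: str) -> str:
    """Pull a short answer out of a verbose response (illustrative heuristic)."""
    # Prefer an explicit "the answer is X" statement.
    match = re.search(r"[Tt]he answer is\s*:?\s*([^\n.]+)", response)
    if match:
        return match.group(1).strip()
    # Otherwise fall back to the last number mentioned.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else response.strip()

print(extract_short_answer("Counting the bars, the answer is 42."))  # -> 42
```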

Calculate the final score:

```sh
python calculate_score.py \
    --output_dir ../results/bard \
    --output_file output_bard.json \
    --score_file scores_bard.json
```
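
The final score is essentially exact-match accuracy between the extracted answers and the ground truth. A simplified sketch of that idea (the field names `extraction` and `answer` are assumptions, not the actual `calculate_score.py` schema):

```python
import json

with open("results/bard/output_bard.json") as f:
    results = json.load(f)

# Compare normalized extracted answers against ground truth.
records = list(results.values()) if isinstance(results, dict) else results
correct = sum(
    str(r.get("extraction", "")).strip().lower()
    == str(r.get("answer", "")).strip().lower()
    for r in records
)
print(f"accuracy: {correct / len(records):.1%}")
```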

### Chain-of-Thought GPT-4

Generate the response:

```sh
cd evaluation

python generate_response.py \
    --model gpt-4-0613 \
    --output_dir ../results/gpt4 \
    --output_file output_gpt4_2shot_solution_use_caption_ocr.json \
    --shot_num 2 \
    --shot_type solution \
    --use_caption \
    --use_ocr \
    --caption_file ../data/texts/captions_bard.json \
    --ocr_file ../data/texts/ocrs_easyocr.json
```
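
Conceptually, `--shot_num`/`--shot_type solution` prepend worked solutions as few-shot examples, while `--use_caption` and `--use_ocr` stand in for the image by injecting a Bard-generated caption and EasyOCR text into the prompt of a text-only LLM. A rough sketch of how such a prompt could be assembled (illustrative only; `generate_response.py` defines the real template):

```python
def build_cot_prompt(question: str, caption: str, ocr: str, shots: list[str]) -> str:
    """Assemble a chain-of-thought prompt from image-derived text.

    The exact layout is an illustrative assumption, not the repo's template.
    """
    context = "\n\n".join(shots)  # worked solution examples (--shot_num of them)
    return (
        f"{context}\n\n"
        f"Image caption: {caption}\n"
        f"Detected text (OCR): {ocr}\n"
        f"Question: {question}\n"
        f"Solution: Let's think step by step."
    )
```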

Extract the short answer text for score calculation:

```sh
python extract_answer.py \
    --output_dir ../results/gpt4 \
    --output_file output_gpt4_2shot_solution_use_caption_ocr.json
```

Calculate the final score:

```sh
python calculate_score.py \
    --output_dir ../results/gpt4 \
    --output_file output_gpt4_2shot_solution_use_caption_ocr.json \
    --score_file scores_gpt4_2shot_solution_use_caption_ocr.json
```

### Program-of-Thought GPT-4

Generate the response:

```sh
cd evaluation

python generate_response.py \
    --model gpt-4-0613 \
    --output_dir ../results/gpt4 \
    --output_file output_gpt4_2shot_code_use_caption_ocr.json \
    --shot_num 2 \
    --shot_type code \
    --use_caption \
    --use_ocr \
    --caption_file ../data/texts/captions_bard.json \
    --ocr_file ../data/texts/ocrs_easyocr.json
```

Extract the short answer text for score calculation:

```sh
python extract_answer.py \
    --output_dir ../results/gpt4 \
    --output_file output_gpt4_2shot_code_use_caption_ocr.json \
    --response_label execution
```
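
With Program-of-Thought prompting, the model emits Python code rather than a written solution, and the code is executed to obtain the answer; the `--response_label execution` flag presumably points the extractor at that executed result. A minimal sketch of the execute-and-read-back step (illustrative; the repo's actual error handling and sandboxing will differ):

```python
def run_generated_program(code: str) -> str:
    """Execute model-generated code and read back an `answer` variable.

    Assumes the few-shot code examples teach the model to store its final
    result in a variable named `answer` (an assumption about the prompt).
    """
    namespace: dict = {}
    try:
        # Never exec untrusted code outside a sandbox in real evaluation.
        exec(code, namespace)
        return str(namespace.get("answer", ""))
    except Exception as err:
        return f"error: {err}"

print(run_generated_program("answer = 3 * (4 + 5)"))  # -> 27
```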

Calculate the final score:

```sh
python calculate_score.py \
    --output_dir ../results/gpt4 \
    --output_file output_gpt4_2shot_code_use_caption_ocr.json \
    --score_file scores_gpt4_2shot_code_use_caption_ocr.json
```

### More Models

To run more models, please check out the running scripts at [`scripts`](https://github.com/lupantech/MathVista/tree/main/scripts).



## :coffee: Stay Connected!

Fantastic! I'm always open to engaging discussions, collaborations, or even just sharing a virtual coffee. To get in touch, visit [Pan Lu](https://lupantech.github.io/)'s homepage for contact information.




## :white_check_mark: Cite

If you find **MathVista** useful for your research and applications, please kindly cite using this BibTeX:

```latex
@article{lu2023mathvista,
title={MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts},
author={Lu, Pan and Bansal, Hritik and Xia, Tony and Liu, Jiacheng and Li, Chunyuan and Hajishirzi, Hannaneh and Cheng, Hao and Chang, Kai-Wei and Galley, Michel and Gao, Jianfeng},
journal={arXiv preprint arXiv:2310.02255},
year={2023}
}
```

Binary file added assets/mathvista.png
Binary file added data/assets/mathvista.png