Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr
UC Berkeley, University of Michigan
COLM 2024 / MAR Workshop CVPR 2024 Best Paper
In this study, we design and use evaluation models to both evaluate and autonomously refine the performance of digital agents that browse the web or control mobile devices.
The evaluator and evaluation code are provided in the ./agent_eval/ folder. You can use these models, either open-weight or GPT-4V-based, to evaluate the performance of digital agents. Please refer to the Evaluate Agent Trajectories section for more details.
The refinement and iOS/Android emulator code is provided in the ./exps/ folder. It provides examples for executing and improving a variety of agents on WebArena, Android, and iOS. Notably:
- A Reflexion + GPT-4 agent that achieves 20.2% success rate on WebArena, the current state of the art.
- A refined CogAgent model that achieves a 75% relative improvement in success rate on iOS.
- A Python binding for the iOS and Android emulators to facilitate refinement and end-to-end evaluation of digital agents.
Please refer to the Refinement section for more details.
We release all models, agent trajectories, and datasets on the Hugging Face Hub.
News
- July 10, 2024: The paper was accepted at COLM 2024. We also won the Best Paper Award at the MAR Workshop at CVPR 2024.
- June 14, 2024: We released DigiRL. Our 2B VLM, when post-trained with an autonomous evaluator (reward model), improves its success rate on Android device-control tasks from 17% to 67%.
Setup
First, install the agent_eval package:
cd agent_eval
pip install -e .
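As an optional sanity check, you can confirm the editable install resolves from Python (a minimal snippet; the printed path should point into your local checkout):

# Optional sanity check that the editable install worked.
import agent_eval
print(agent_eval.__file__)  # should point into your local checkout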
If you want to run inference with the captioner model, you additionally need to revert the transformers package to an older version:
pip install transformers==4.32.0
Evaluate Agent Trajectories
You can evaluate agent trajectories with the scripts below. You can download all agent trajectories used in the paper from this link.
Please edit the configuration in the following files, set up an OpenAI API key (for GPT-4) or an Anyscale API key (for Mixtral), and run the appropriate command to evaluate the agent trajectories.
cd ./agent_eval/agent_eval/scripts
# Select the right command according to the domain
python run_eval_web.py # for evaluating webarena agents
python run_eval_android.py # for evaluating android agents
python annotate_ios_dense.py # for providing dense annotations to iOS agents, later used as rewards in filtered behavior cloning (filtered BC)
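If you prefer to launch an evaluation from Python, a minimal sketch is below. It assumes the scripts read the standard OPENAI_API_KEY / ANYSCALE_API_KEY environment variables; confirm the exact names in each script's configuration.

# Minimal sketch: export API keys, then run one of the eval scripts.
# OPENAI_API_KEY / ANYSCALE_API_KEY follow the usual client conventions;
# verify in the script configs that these are the variables actually read.
import os
import subprocess

os.environ["OPENAI_API_KEY"] = "sk-..."         # GPT-4-based evaluator
os.environ["ANYSCALE_API_KEY"] = "esecret_..."  # Mixtral-based evaluator

# Run from inside ./agent_eval/agent_eval/scripts
subprocess.run(["python", "run_eval_web.py"], check=True)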
Inspect/Annotate Agent Trajectories
We define a shared UnifiedTrajectory format to store agent trajectories; it is defined in ./agent_eval/agent_eval/domains/unified.py. To transform raw agent trajectories into the UnifiedTrajectory format, you can use the corresponding notebooks under the ./agent_eval/agent_eval/domains/ folder.
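For intuition, here is a hypothetical sketch of the kind of fields a unified trajectory might carry. The field names are illustrative assumptions, not the actual schema; see unified.py for the real definition.

# Illustrative sketch only -- see ./agent_eval/agent_eval/domains/unified.py
# for the actual UnifiedTrajectory definition; these field names are assumed.
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Step:
    screenshot_path: str  # observation captured before the action
    action: str           # the action the agent emitted at this step

@dataclass
class Trajectory:
    instruction: str                  # natural-language task description
    steps: List[Step] = field(default_factory=list)
    annotation: Optional[Any] = None  # evaluator or human judgment, if any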
You can inspect the agent trajectories or add human annotations by running the following command:
python -m agent_eval.eval.annotate_app --dataset <path-to-dataset> --log_name <log-name>
Captioner
The captioner VLM is used in the modular evaluator to provide dense descriptions of the screenshots, which are then fed into an LM to reason about the agent's behavior. We provide a demo, its weights, and training data on the Hugging Face Hub.
You can start the captioner server by running the following command:
python -m agent_eval.captioner.captioner_server --port <PORT_NUMBER>
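Once the server is running, you can request captions over HTTP. The sketch below is a guess at the interface: the /caption route and the JSON payload are assumptions for illustration, not the server's documented API; check captioner_server.py for the actual request schema.

# Hypothetical client -- the endpoint path and JSON fields are assumed,
# not taken from the repo; adjust to match captioner_server.py.
import base64
import requests

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8000/caption",  # assumed route on your chosen port
    json={"image": image_b64},        # assumed payload schema
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # dense description of the screenshot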
The ./agent_eval/agent_eval/captioner folder also includes:
- annotate_screenshots.py, code to annotate screenshots with GPT-4V
- gen_captions.sh, a script to generate captions for a large number of screenshots
Refinement
You can download all agent trajectories used in the experiments from this link.
WebArena
- Please refer to exps/webarena_exp/README.md for more details on how to reproduce the results.
iOS
- The tasks we used are listed in exps/ios_exp/train_tasks.txt and exps/ios_exp/eval_tasks.txt.
- Please refer to exps/ios_exp/README.md for more details on how to reproduce the results.
Android
- The tasks we used are listed in exps/android_exp/assets/instructions.txt.
- Please refer to exps/android_exp/README.md for more details on how to reproduce the results.
- We share the codebase with DigiRL for this part of the experiments.
Citation
Please consider citing our paper if you find this project helpful for your research:
@misc{pan2024autonomous,
title={Autonomous Evaluation and Refinement of Digital Agents},
author={Jiayi Pan and Yichi Zhang and Nicholas Tomlin and Yifei Zhou and Sergey Levine and Alane Suhr},
year={2024},
eprint={2404.06474},
archivePrefix={arXiv},
primaryClass={cs.AI}
}