A First-in-Class Benchmark for Evaluating the Ability of Web Agents to Identify and Interact with Web Components
DOMe and DOMer-2 focuses on testing a model's ability to interact with web elements (clicking buttons, typing text, etc.) without requiring complex planning or reasoning. The benchmark provides:
- Simple, single-action tasks
- Real websites with diverse DOM structures
- Ground truth screenshots for validation
- GPT-4V based evaluation
- Support for both serial and parallel execution
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/DOMe-and-DOMer-2.git
  cd DOMe-and-DOMer-2
  ```

- Install dependencies using pip:

  ```bash
  pip install -e .
  ```
Required dependencies:
- selenium
- webdriver-manager
- Pillow
- numpy
- requests
- beautifulsoup4
- openai
- anthropic
- google-generativeai
- python-dotenv
- Set up your API keys in a `.env` file (see the loading sketch below):

  ```bash
  OPENAI_API_KEY=your_openai_key_here
  ANTHROPIC_API_KEY=your_anthropic_key_here
  GOOGLE_API_KEY=your_google_key_here
  ```
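Since `python-dotenv` is among the required dependencies, the keys are presumably read from the environment at runtime. A minimal sketch of loading them in your own scripts (the variable names come from the example above; everything else is illustrative):

```python
# Minimal sketch: load the API keys from .env using python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

keys = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
    "ANTHROPIC_API_KEY": os.getenv("ANTHROPIC_API_KEY"),
    "GOOGLE_API_KEY": os.getenv("GOOGLE_API_KEY"),
}
missing = [name for name, value in keys.items() if not value]
if missing:
    raise RuntimeError(f"Missing API keys in .env: {', '.join(missing)}")
```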
The benchmark currently supports the following models:
- GPT-4 Turbo (OpenAI)
  - Default model for both task execution and evaluation
  - High accuracy but subject to rate limits (3500 RPM)
- Claude 3 Haiku (Anthropic)
  - Fast and efficient for task execution
  - Subject to stricter rate limits (5 RPM)
  - Use the `--serial` flag for best results
- Gemini 1.5 Pro (Google)
  - Latest version of Google's Gemini model
  - Good balance of speed and accuracy
The benchmark can be run in either serial or parallel mode:
```bash
# Run with GPT-4
python -m benchmark --model gpt4 --tasks data/test_tasks.jsonl --output-dir results

# Run with Claude
python -m benchmark --model claude --tasks data/test_tasks.jsonl --output-dir results --serial

# Run with Gemini
python -m benchmark --model gemini --tasks data/test_tasks.jsonl --output-dir results

# General form (add --serial to force serial execution)
python -m benchmark --model [gpt4|claude|gemini] --tasks data/test_tasks.jsonl --output-dir results --serial
```
Results are automatically evaluated using GPT-4V for visual comparison and GPT-4 for HTML structure matching:
```bash
python -m evaluate --tasks data/test_tasks.jsonl --results-dir results --output results/evaluation.json
```
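The actual prompts and scoring live in the evaluation module; purely as an illustration of the visual-comparison step (not the benchmark's real implementation), a GPT-4V-style screenshot check could look roughly like this, using the OpenAI Python SDK:

```python
# Illustrative sketch only: ask a vision-capable OpenAI model whether a result
# screenshot matches the ground-truth screenshot. The benchmark's real prompts
# and scoring are defined in its evaluation code.
import base64

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def as_data_url(path):
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")


def compare_screenshots(result_png, ground_truth_png):
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed vision-capable model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Do these two screenshots show the same "
                 "post-interaction page state? Answer YES or NO, then explain briefly."},
                {"type": "image_url", "image_url": {"url": as_data_url(result_png)}},
                {"type": "image_url", "image_url": {"url": as_data_url(ground_truth_png)}},
            ],
        }],
    )
    return response.choices[0].message.content
```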
Tasks are defined in JSONL format, one JSON object per line, with the following structure (a loading sketch follows the example):
```json
{
  "web_name": "Website Name",
  "id": "unique_task_id",
  "task": "Description of the interaction task",
  "web": "https://website.url",
  "element_type": "button|input|link",
  "interaction": "click|type|hover",
  "target_element": {
    "type": "id|class|xpath",
    "value": "selector_value"
  },
  "input_text": "Text to type (for type interactions)",
  "target_html": "HTML of target element",
  "ground_truth": {
    "screenshot": "path/to/screenshot.png",
    "description": "Expected result description"
  }
}
```
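Since the file is JSONL, each line holds one task object like the one above. A small, self-contained sketch of loading and filtering tasks (field names follow the structure shown):

```python
# Minimal sketch: load tasks from the JSONL file described above.
import json


def load_tasks(path):
    tasks = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                tasks.append(json.loads(line))
    return tasks


tasks = load_tasks("data/test_tasks.jsonl")
click_tasks = [t for t in tasks if t["interaction"] == "click"]
print(f"Loaded {len(tasks)} tasks; {len(click_tasks)} are click interactions")
```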
Different models have different rate limits:
- GPT-4: 3500 requests per minute
- Claude: 5 requests per minute
- Gemini: 60 requests per minute
Use the `--serial` flag for models with strict rate limits (e.g., Claude) to avoid hitting limits.
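If you call the provider APIs directly rather than through the supplied runners, the limits above translate into a minimum delay between requests. A rough, generic throttling sketch (the wrapped `send_claude_request` callable is hypothetical):

```python
# Rough sketch: keep any callable under a requests-per-minute budget by sleeping
# between calls. The wrapped function is whatever issues the API request.
import time


def throttled(call, rpm):
    min_interval = 60.0 / rpm  # e.g. 5 RPM -> at most one call every 12 seconds
    last_call = 0.0

    def wrapper(*args, **kwargs):
        nonlocal last_call
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        return call(*args, **kwargs)

    return wrapper


# Example (hypothetical): claude_call = throttled(send_claude_request, rpm=5)
```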
The repository includes two task sets:
- `data/test_tasks.jsonl`: Full test set with 100+ tasks
- `data/test_tasks_10.jsonl`: Smaller set of 10 tasks for quick testing
- Environment Configuration: Copy `.env.example` to `.env` and fill in your API keys.
- Dependencies: Install dependencies using `pip install -r requirements.txt`.
- Virtual Environment: (Optional) Set up a virtual environment using `venv`.
- Main Script: Use `run.py` to execute benchmarks. Example: `python run.py --tasks data/test_tasks.jsonl --output results --model gpt4`
- Parallel and Serial Execution: Use `parallel_runner.py` or `serial_runner.py` for specific execution modes (see the sketch after this list).
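The real logic lives in `parallel_runner.py` and `serial_runner.py`; conceptually the two modes differ only in whether tasks are processed one at a time or handed to a worker pool. A minimal sketch, where `run_task` stands in for whatever executes a single task:

```python
# Conceptual sketch of serial vs. parallel execution. `run_task` is a hypothetical
# callable that executes one benchmark task and returns its result.
from concurrent.futures import ThreadPoolExecutor


def run_serial(tasks, run_task):
    # One task at a time: slower, but safe for strict rate limits (--serial).
    return [run_task(task) for task in tasks]


def run_parallel(tasks, run_task, max_workers=4):
    # Fan tasks out to a small thread pool; results come back in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, tasks))
```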
- Model Class: Create a new class in `models/` inheriting from `BaseModel` (a hypothetical sketch follows this list).
- Integration: Implement required methods and integrate with `run.py`.
- Testing: Validate the new model with existing task sets.
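The exact `BaseModel` interface is defined inside `models/`; the sketch below is a hypothetical shape for a new provider class, so the import path, constructor arguments, and method name are assumptions to adapt to the real base class:

```python
# Hypothetical sketch of a new model wrapper. The real BaseModel in models/ may
# use different method names and signatures; adapt accordingly.
from models.base import BaseModel  # assumed import path


class MyNewModel(BaseModel):
    def __init__(self, api_key, model_name="my-model-v1"):
        super().__init__()
        self.api_key = api_key
        self.model_name = model_name

    def predict_action(self, task, page_html):
        """Given a task dict and the page HTML, return the action to perform,
        e.g. {"interaction": "click", "selector": "#submit-button"}."""
        raise NotImplementedError("Call your provider's API here")
```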
- Results Directory: Check the `results/` directory for output files and logs.
- Evaluation: Use `evaluate.py` to assess model performance.
- Logs: Review logs for insights into model behavior and errors.
- Reference Scores: Baseline results are available in `results/baseline_results/`.
- Comparison: Use these scores to evaluate new models or configurations (a comparison sketch follows this list).
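The schema of the evaluation output is defined by `evaluate.py`; assuming (purely for illustration) that both the baseline and your run produce a JSON object mapping task IDs to numeric scores, a comparison could look like this (adjust paths and field access to the actual files):

```python
# Illustrative only: diff per-task scores between a baseline run and a new run.
# Assumes each evaluation file maps task IDs to numeric scores; adapt to the
# real schema written by evaluate.py.
import json


def load_scores(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


baseline = load_scores("results/baseline_results/evaluation.json")  # assumed filename
current = load_scores("results/evaluation.json")

for task_id in sorted(set(baseline) & set(current)):
    old, new = baseline[task_id], current[task_id]
    if old != new:
        print(f"{task_id}: {old} -> {new} ({new - old:+})")
```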
- Scripts: Explore the `scripts/` directory for additional utilities.
- Examples: Check the `examples/` directory for example usage and configurations.
- Utilities: Use `utils.py` and other scripts in `utils/` for common tasks.
- Setup: Ensure all dependencies are installed and API keys are configured in the `.env` file.
- Running Tests: Use the `benchmark` module to run tests on different models. Specify the model and task set.
- Serial vs Parallel: Use `--serial` for models with strict rate limits.
- Model Integration: Implement a new model class inheriting from `BaseModel`.
- Configuration: Configure API keys and model parameters in the new class.
- Testing: Add the new model to the benchmark script and test with existing task sets.
- Output Files: Check the `results/` directory for detailed logs and evaluation scores.
- Error Handling: Review logs for any errors or skipped tasks.
- Baseline Comparison: Compare results against baseline scores provided in the `baseline_results/` directory.
- Baseline results for each model are available for comparison.
- Use these results to gauge the performance of new models or configurations.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.