
DOM and DOMer-2

A First-in-Class Benchmark for Evaluating the Ability of Web Agents to Identify and Interact with Web Components

Overview

DOM and DOMer-2 focuses on testing a model's ability to interact with web elements (clicking buttons, typing text, etc.) without requiring complex planning or reasoning. The benchmark provides:

  1. Simple, single-action tasks
  2. Real websites with diverse DOM structures
  3. Ground truth screenshots for validation
  4. GPT-4V based evaluation
  5. Support for both serial and parallel execution
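
To make the interaction style concrete, here is a minimal sketch of the kind of single-action step the benchmark measures, written with selenium and webdriver-manager (both listed dependencies). It is an illustration only, not code from this repository; the URL and selector are hypothetical.

# Minimal sketch of a single-action task: load a page and click one element.
# The URL and CSS selector are hypothetical; real tasks come from the JSONL task files.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
try:
    driver.get("https://example.com")                                 # task["web"]
    element = driver.find_element(By.CSS_SELECTOR, "#cart-button")    # task["target_element"]
    element.click()                                                   # task["interaction"] == "click"
    driver.save_screenshot("results/after_click.png")                 # compared against the ground-truth screenshot
finally:
    driver.quit()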

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/DOMe-and-DOMer-2.git
cd DOMe-and-DOMer-2
  2. Install dependencies using pip:
pip install -e .

Required dependencies:

  • selenium
  • webdriver-manager
  • Pillow
  • numpy
  • requests
  • beautifulsoup4
  • openai
  • anthropic
  • google-generativeai
  • python-dotenv
  3. Set up your API keys in a .env file:
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GOOGLE_API_KEY=your_google_key_here
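
At runtime these keys are typically loaded with python-dotenv (a listed dependency); a minimal sketch, assuming the variable names above:

# Load API keys from the .env file in the working directory.
from dotenv import load_dotenv
import os

load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
google_key = os.getenv("GOOGLE_API_KEY")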

Supported Models

The benchmark currently supports the following models:

  1. GPT-4 Turbo (OpenAI)

    • Default model for both task execution and evaluation
    • High accuracy but subject to rate limits (3500 RPM)
  2. Claude 3 Haiku (Anthropic)

    • Fast and efficient for task execution
    • Subject to stricter rate limits (5 RPM)
    • Use --serial flag for best results
  3. Gemini 1.5 Pro (Google)

    • Latest version of Google's Gemini model
    • Good balance of speed and accuracy

Usage

The benchmark can be run in either serial or parallel mode:

Parallel Mode (Default)

# Run with GPT-4
python -m benchmark --model gpt4 --tasks data/test_tasks.jsonl --output-dir results

# Run with Claude (--serial recommended because of its strict rate limit)
python -m benchmark --model claude --tasks data/test_tasks.jsonl --output-dir results --serial

# Run with Gemini
python -m benchmark --model gemini --tasks data/test_tasks.jsonl --output-dir results

Serial Mode

python -m benchmark --model [gpt4|claude|gemini] --tasks data/test_tasks.jsonl --output-dir results --serial

Evaluation

Results are automatically evaluated using GPT-4V for visual comparison and GPT-4 for HTML structure matching:

python -m evaluate --tasks data/test_tasks.jsonl --results-dir results --output results/evaluation.json
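
The GPT-4V comparison above is what produces the official scores. If you only want a rough local sanity check that a run's screenshot resembles the ground truth, a simple pixel-difference sketch using Pillow and numpy (both listed dependencies) could look like this; the paths are hypothetical and this is not part of the benchmark's evaluation:

from PIL import Image
import numpy as np

def rough_match(result_path, truth_path, tolerance=0.05):
    """Return True if the mean per-pixel difference is below `tolerance` (0-1 scale)."""
    result = np.asarray(Image.open(result_path).convert("RGB"), dtype=np.float32)
    truth = np.asarray(Image.open(truth_path).convert("RGB"), dtype=np.float32)
    if result.shape != truth.shape:
        return False
    return float(np.abs(result - truth).mean()) / 255.0 < tolerance

print(rough_match("results/task_001.png", "ground_truth/task_001.png"))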

Task Format

Tasks are defined in JSONL format with the following structure:

{
    "web_name": "Website Name",
    "id": "unique_task_id",
    "task": "Description of the interaction task",
    "web": "https://website.url",
    "element_type": "button|input|link",
    "interaction": "click|type|hover",
    "target_element": {
        "type": "id|class|xpath",
        "value": "selector_value"
    },
    "input_text": "Text to type (for type interactions)",
    "target_html": "HTML of target element",
    "ground_truth": {
        "screenshot": "path/to/screenshot.png",
        "description": "Expected result description"
    }
}
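
For example, a hypothetical click task (the values below are illustrative, not drawn from the shipped task sets) occupies a single line of the JSONL file:

{"web_name": "Example Shop", "id": "example_shop_click_cart", "task": "Click the cart button in the page header", "web": "https://example.com", "element_type": "button", "interaction": "click", "target_element": {"type": "id", "value": "cart-button"}, "input_text": "", "target_html": "<button id=\"cart-button\">Cart</button>", "ground_truth": {"screenshot": "ground_truth/example_shop_click_cart.png", "description": "The cart panel is open"}}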

Rate Limits

Different models have different rate limits:

  • GPT-4: 3500 requests per minute
  • Claude: 5 requests per minute
  • Gemini: 60 requests per minute

Use the --serial flag for models with strict rate limits (e.g., Claude) to avoid exceeding them.
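
In serial mode the runner simply has to keep calls under the per-minute budget; a minimal throttling sketch (not the repository's implementation) looks like this:

import time

def throttle(requests_per_minute):
    """Return a wait() function that sleeps enough between calls to stay under the RPM budget."""
    min_interval = 60.0 / requests_per_minute
    last_call = 0.0
    def wait():
        nonlocal last_call
        elapsed = time.monotonic() - last_call
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        last_call = time.monotonic()
    return wait

wait_for_claude = throttle(5)   # Claude: 5 requests per minute
# call wait_for_claude() before each API request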

Test Tasks

The repository includes two task sets:

  • data/test_tasks.jsonl: Full test set with 100+ tasks
  • data/test_tasks_10.jsonl: Smaller set of 10 tasks for quick testing

Detailed Setup Instructions

  • Environment Configuration: Copy .env.example to .env and fill in your API keys.
  • Dependencies: Install them with pip install -r requirements.txt.
  • Virtual Environment: (Optional) Create a virtual environment with venv; a typical setup is shown after this list.
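
A typical setup on macOS/Linux (standard Python tooling, not specific to this repository; on Windows activate with .venv\Scripts\activate instead):

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then edit .env to add your API keys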

Running Benchmarks

  • Main Script: Use run.py to execute benchmarks. Example:
    python run.py --tasks data/test_tasks.jsonl --output results --model gpt4
  • Parallel and Serial Execution: Use parallel_runner.py or serial_runner.py for specific execution modes.

Adding New Models

  • Model Class: Create a new class in models/ inheriting from BaseModel (see the sketch after this list).
  • Integration: Implement required methods and integrate with run.py.
  • Testing: Validate the new model with existing task sets.
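
A hypothetical skeleton is shown below; the import path, constructor, and run_task method name are assumptions made for illustration, so check BaseModel in models/ for the actual interface:

# Hypothetical sketch of a new model class. The import path, constructor
# arguments, and run_task signature are assumptions -- consult the real
# BaseModel definition in models/ for the required interface.
from models import BaseModel

class MyNewModel(BaseModel):
    def __init__(self, api_key):
        super().__init__()
        self.api_key = api_key

    def run_task(self, task):
        """Given one task dict (see Task Format), return the action the agent takes."""
        # Call your model's API here and map its output to a benchmark action.
        raise NotImplementedError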

Interpreting Results

  • Results Directory: Check the results/ directory for output files and logs.
  • Evaluation: Use evaluate.py to assess model performance.
  • Logs: Review logs for insights into model behavior and errors.
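
A short sketch for summarizing an evaluation file is shown below; it assumes evaluation.json is a list of per-task records with a boolean success field, which may not match the actual output of evaluate.py:

import json

# Assumes a list of per-task records each carrying a boolean "success" field;
# inspect the real file produced by evaluate.py and adjust accordingly.
with open("results/evaluation.json") as f:
    records = json.load(f)

passed = sum(1 for r in records if r.get("success"))
print(f"{passed}/{len(records)} tasks passed")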

Baseline Results

  • Reference Scores: Baseline results are available in results/baseline_results/.
  • Comparison: Use these scores to evaluate new models or configurations.

Additional Resources

  • Scripts: Explore the scripts/ directory for additional utilities.
  • Examples: Check the examples/ directory for example usage and configurations.
  • Utilities: Use utils.py and other scripts in utils/ for common tasks.

Documentation

Using the Benchmark

  • Setup: Ensure all dependencies are installed and API keys are configured in the .env file.
  • Running Tests: Use the benchmark module to run tests on different models. Specify the model and task set.
  • Serial vs Parallel: Use --serial for models with strict rate limits.

Adding New Agents

  • Model Integration: Implement a new model class inheriting from BaseModel.
  • Configuration: Configure API keys and model parameters in the new class.
  • Testing: Add the new model to the benchmark script and test with existing task sets.

Interpreting Results

  • Output Files: Check the results directory for detailed logs and evaluation scores.
  • Error Handling: Review logs for any errors or skipped tasks.
  • Baseline Comparison: Compare results against baseline scores provided in the baseline_results directory.

Baseline Results

  • Baseline results for each model are available for comparison.
  • Use these results to gauge the performance of new models or configurations.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
