
Commit

Model adapters
dhruvahuja19 committed Dec 17, 2024
1 parent 1a40c5a commit 078ebdf
Showing 14 changed files with 706 additions and 144 deletions.
13 changes: 13 additions & 0 deletions .env.example
@@ -0,0 +1,13 @@
# OpenAI API Key for GPT-4 model
OPENAI_API_KEY=your-openai-api-key-here

# Anthropic API Key for Claude model
ANTHROPIC_API_KEY=your-anthropic-api-key-here

# Optional: Model configurations
GPT4_MODEL=gpt-4-turbo-preview # or gpt-4
CLAUDE_MODEL=claude-3-opus-20240229

# Optional: Execution settings
MAX_WORKERS=4
TIMEOUT_SECONDS=30
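
These variables are loaded at runtime with python-dotenv (listed as a project dependency). Below is a minimal sketch of how they might be read; exactly how run.py consumes the optional settings is not shown in this commit, so treat the handling as illustrative only.

```python
# Minimal sketch: reading the settings above with python-dotenv.
# How run.py actually consumes the optional values is not shown in this
# commit, so treat the handling below as illustrative only.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
gpt4_model = os.getenv("GPT4_MODEL", "gpt-4-turbo-preview")
claude_model = os.getenv("CLAUDE_MODEL", "claude-3-opus-20240229")
max_workers = int(os.getenv("MAX_WORKERS", "4"))
timeout_seconds = int(os.getenv("TIMEOUT_SECONDS", "30"))

if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is required for the GPT-4 model and evaluation")
```
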
222 changes: 86 additions & 136 deletions README.md
@@ -10,175 +10,125 @@ DOM and DOMer-2 focuses on testing a model's ability to interact with web elements
2. Real websites with diverse DOM structures
3. Ground truth screenshots for validation
4. GPT-4V based evaluation
5. Support for both serial and parallel execution

## Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/DOMe-and-DOMer-2.git
cd DOMe-and-DOMer-2
```

2. Install dependencies using pip:
```bash
pip install -e .
```

Required dependencies:
- selenium
- webdriver-manager
- Pillow
- numpy
- requests
- beautifulsoup4
- openai
- python-dotenv

3. Set up your API keys in a `.env` file (see `.env.example` above; the Anthropic key is only needed for the Claude adapter):
```bash
OPENAI_API_KEY=your_api_key_here
ANTHROPIC_API_KEY=your_api_key_here
```

The benchmark also requires Python 3.8+ and a Chrome/Chromium browser, which Selenium drives via webdriver-manager.

## Usage

The benchmark can be run in either serial or parallel mode (a sketch of the parallel dispatch pattern follows the argument reference below):

### Parallel Mode (Default)
```bash
python run.py --tasks data/dom_tasks.jsonl --output results --max-workers 4 --evaluate
```

### Serial Mode
```bash
python run.py --tasks data/dom_tasks.jsonl --output results --mode serial --evaluate
```

### Key Arguments
- `--tasks`: Path to JSONL file containing tasks
- `--output`: Output directory for results
- `--mode`: Run tasks in 'serial' or 'parallel' mode (default: parallel)
- `--max-workers`: Number of parallel workers (default: 4)
- `--evaluate`: Run GPT-4V evaluation after tasks complete
- `--evaluate-mode`: Run evaluations in 'serial' or 'parallel' mode (default: parallel)
- `--save-accessibility-tree`: Save accessibility trees for each task
- `--wait-time`: Wait time between actions in seconds (default: 2.0)
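
As referenced above, here is a rough, hypothetical sketch of the dispatch pattern behind parallel mode, using Python's `concurrent.futures`. It is not the repository's `parallel_runner.py`; `run_single_task` stands in for the real Selenium-driven execution.

```python
# Hypothetical sketch of parallel dispatch (not the actual parallel_runner.py).
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_single_task(task):
    # Stand-in: the real runner drives Selenium, captures screenshots,
    # and records an accessibility tree for each task.
    return {"id": task.get("id"), "success": True}

def run_parallel(tasks_path, max_workers=4):
    with open(tasks_path) as f:
        tasks = [json.loads(line) for line in f if line.strip()]

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_single_task, t): t for t in tasks}
        for future in as_completed(futures):
            results.append(future.result())
    return results

if __name__ == "__main__":
    print(run_parallel("data/dom_tasks.jsonl", max_workers=4))
```
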

## Directory Structure

```
DOMe-and-DOMer-2/
├── data/
│ ├── dom_tasks.jsonl # Task definitions
│ └── task_schema.json # JSON schema for tasks
├── evaluation/
│ ├── auto_eval.py # Evaluation orchestrator
│ ├── parallel_eval.py # Parallel evaluation implementation
│ ├── image_match.py # GPT-4V image comparison
│ └── fuzzy_match.py # HTML structure comparison
├── parallel_runner.py # Parallel task execution
├── serial_runner.py # Serial task execution
├── utils.py # Shared utilities
├── run.py # Main entry point
└── pyproject.toml          # Project configuration and dependencies
```

## Output Structure
Results are saved in the specified output directory:
```
output_dir/
├── results.json # Task execution results
├── evaluation.json # GPT-4V evaluation results
├── benchmark.log # Execution logs
├── *_before.png # Screenshots before interaction
├── *_after.png # Screenshots after interaction
└── *_tree.json # Accessibility trees (if enabled)
```
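
A small sketch of consuming these files after a run; the exact schemas of `results.json` and `evaluation.json` are not spelled out here, so the field names below are assumptions.

```python
# Sketch: summarizing a run from its output directory.
# The schemas of results.json / evaluation.json are assumptions here.
import json
from pathlib import Path

output_dir = Path("results")
results = json.loads((output_dir / "results.json").read_text())
evaluation = json.loads((output_dir / "evaluation.json").read_text())

# Assumes results.json is a list of per-task records with a "success" flag.
succeeded = sum(1 for r in results if r.get("success"))
print(f"{succeeded}/{len(results)} tasks reported success")

# Assumes evaluation.json carries a list of per-task evaluations.
print(f"{len(evaluation.get('evaluations', []))} GPT-4V evaluations recorded")
```
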
## Task Format

Tasks are defined in `data/dom_tasks.jsonl`:

```json
{
  "id": "task_id",
  "task": "Click the search box and type 'hello'",
  "web": "https://example.com",
  "interaction": "type",
  "target_element": {
    "type": "css",
    "value": "#searchbox"
  },
  "input_text": "hello",
  "ground_truth": {
    "screenshot": "path/to/ground_truth.png"
  }
}
```

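Since the directory layout above includes `data/task_schema.json`, tasks can plausibly be validated before a run. A sketch using the `jsonschema` package (an assumption; it is not listed among the dependencies above):

```python
# Sketch: validating each task line against data/task_schema.json.
# Assumes the jsonschema package is available (not listed above).
import json
from jsonschema import validate

with open("data/task_schema.json") as f:
    schema = json.load(f)

with open("data/dom_tasks.jsonl") as f:
    for line in f:
        if line.strip():
            validate(instance=json.loads(line), schema=schema)

print("All tasks conform to the schema.")
```
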
## Evaluation

The benchmark uses GPT-4V to evaluate task success by comparing:
1. Before/after screenshots with ground truth
2. DOM structure changes
3. Task completion criteria

Evaluation can be run in parallel or serial mode and produces detailed scoring and reasoning for each task.

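For orientation, here is a minimal sketch of the kind of GPT-4V comparison request that `evaluation/image_match.py` might issue through the `openai` client. The model name, prompt, and response handling are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch only: a GPT-4V style screenshot comparison request.
# The real prompt, model name, and scoring logic live in evaluation/image_match.py
# and may differ from what is shown here.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def compare(after_png, ground_truth_png):
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption; the benchmark may pin a different vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does the first screenshot match the expected state shown in the second? Answer YES or NO with a short reason."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(after_png)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(ground_truth_png)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```
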
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
Binary file added evaluation/ground_truth/task_2_gt.png
Binary file added evaluation/ground_truth/task_3_gt.png
Binary file added evaluation/ground_truth/task_4_gt.png
Binary file added evaluation/ground_truth/task_5_gt.png
52 changes: 52 additions & 0 deletions examples/model_usage.py
@@ -0,0 +1,52 @@
"""Example usage of different models in the DOM benchmark."""

import os
from dotenv import load_dotenv
from models import GPT4Model, ClaudeModel
from utils import TaskExecutor

# Load environment variables
load_dotenv()

def run_example_task(model, task):
    """Run a single task with the given model and print results."""
    executor = TaskExecutor()
    print(f"\nRunning task with {model.__class__.__name__}:")
    print(f"Task: {task['task']}")

    result = model.run_task(task, executor)

    print(f"Success: {result.success}")
    if result.error:
        print(f"Error: {result.error}")
    print(f"Time taken: {result.time_taken:.2f}s")
    return result

def main():
    # Initialize models
    gpt4_model = GPT4Model(api_key=os.getenv("OPENAI_API_KEY"))
    claude_model = ClaudeModel(api_key=os.getenv("ANTHROPIC_API_KEY"))

    # Example task
    task = {
        "task": "Click the 'Sign In' button",
        "target_element": {
            "type": "css",
            "value": "#signin-button"
        },
        "interaction": "click"
    }

    # Run with both models
    gpt4_result = run_example_task(gpt4_model, task)
    claude_result = run_example_task(claude_model, task)

    # Compare results
    print("\nComparison:")
    print(f"GPT-4 success: {gpt4_result.success}")
    print(f"Claude success: {claude_result.success}")
    print(f"GPT-4 time: {gpt4_result.time_taken:.2f}s")
    print(f"Claude time: {claude_result.time_taken:.2f}s")

if __name__ == "__main__":
    main()
5 changes: 5 additions & 0 deletions models/__init__.py
@@ -0,0 +1,5 @@
from .base import BaseModel, WebInteraction, TaskResult
from .gpt4 import GPT4Model
from .claude import ClaudeModel

__all__ = ['BaseModel', 'WebInteraction', 'TaskResult', 'GPT4Model', 'ClaudeModel']
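
Judging from how `examples/model_usage.py` above calls `model.run_task(task, executor)` and reads `success`, `error`, and `time_taken` off the result, the adapter interface plausibly looks like the sketch below. The authoritative definitions live in `models/base.py`, which is not included in this excerpt.

```python
# Plausible shape of the adapter interface, inferred from the example usage above.
# The real classes live in models/base.py and may carry additional fields/methods.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class WebInteraction:
    action: str                      # e.g. "click" or "type"
    selector_type: str               # e.g. "css" or "id"
    selector_value: str
    input_text: Optional[str] = None

@dataclass
class TaskResult:
    success: bool
    error: Optional[str] = None
    time_taken: float = 0.0

class BaseModel(ABC):
    @abstractmethod
    def run_task(self, task: dict, executor) -> TaskResult:
        """Plan the interaction for a task and execute it via the executor."""
```
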
