
Commit

just need a small model fix
dhruvahuja19 committed Dec 17, 2024
1 parent 4dc5de5 commit 0ce8955
Showing 14 changed files with 589 additions and 651 deletions.
63 changes: 50 additions & 13 deletions README.md
@@ -69,25 +69,62 @@ Located in `data/ground_truth/`, each task has:
- `[task_id]_gt.png`: Screenshot of successful interaction
- Description in task JSON explaining expected changes

## Environment Setup

1. Create a virtual environment and install dependencies:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

2. Set up environment variables in `.env`:
```bash
OPENAI_API_KEY=your_openai_api_key
```
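
The rest of the scripts read this key at runtime. A minimal sketch of how that might look, assuming `python-dotenv` is among the installed requirements (check `requirements.txt`):

```python
# Minimal sketch: load OPENAI_API_KEY from .env; python-dotenv is an assumption here.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if the key is missing
```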

## Running the Benchmark

1. Run tasks:
```bash
python run.py --tasks data/dom_tasks.jsonl --output results --evaluate
```

This will:
- Execute each task in the tasks file
- Save screenshots and results to the output directory
- Run GPT-4V evaluation if --evaluate is specified
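
For orientation, the core loop of such a runner could look roughly like the sketch below; `execute_task` is a caller-supplied placeholder, not a function from this repository.

```python
# Hypothetical sketch of the benchmark loop; see run.py for the actual implementation.
import json
from pathlib import Path
from typing import Callable

def run_benchmark(tasks_path: str, output_dir: str,
                  execute_task: Callable[[dict, Path], dict],
                  evaluate: bool = False) -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    with open(tasks_path) as f:
        for line in f:
            task = json.loads(line)                  # one task per JSONL line
            results.append(execute_task(task, out))  # drive the browser, save screenshots
    (out / "results.json").write_text(json.dumps(results, indent=2))
    if evaluate:
        pass  # GPT-4V evaluation (evaluation/auto_eval.py) would be invoked here
```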

## Ground Truth Management

Ground truth images are stored in `evaluation/ground_truth/` with a consistent naming scheme:
```
evaluation/ground_truth/
├── task_1_gt.png
├── task_2_gt.png
└── ...
```

The tasks file references these images using relative paths:
```json
{
  "id": 1,
  "ground_truth": {
    "screenshot": "evaluation/ground_truth/task_1_gt.png"
  }
}
```
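
A quick illustrative check (not part of the repository) that every referenced screenshot actually exists:

```python
# Illustrative sanity check: confirm each task's ground-truth screenshot is present.
import json
from pathlib import Path

with open("data/dom_tasks.jsonl") as f:
    for line in f:
        task = json.loads(line)
        screenshot = Path(task["ground_truth"]["screenshot"])
        if not screenshot.exists():
            print(f"Missing ground truth for task {task['id']}: {screenshot}")
```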

## Testing

Run environment tests:
```bash
python test_env.py
```

Run OpenAI API connection test:
```bash
python test_openai.py
```
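
For a sense of what such a check involves, a minimal version might look like this (assuming the `openai` v1 Python client; `test_openai.py` may differ):

```python
# Minimal connectivity check; assumes openai>=1.0 and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically
models = client.models.list()
print(f"API reachable; {len(models.data)} models visible")
```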

## Evaluation Process
81 changes: 1 addition & 80 deletions data/dom_tasks.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions data/evaluation_output.jsonl
@@ -0,0 +1,5 @@
{
"total_tasks": 80,
"successful_tasks": 0,
"evaluations": []
}
1 change: 1 addition & 0 deletions data/ground_truth.jsonl
@@ -0,0 +1 @@
{"task_id": 1, "target_html": "<button class='primary-button'>Click me</button>", "screenshot": "evaluation/ground_truth/task_1_gt.png"}
121 changes: 51 additions & 70 deletions evaluation/README.md
@@ -8,89 +8,70 @@ The evaluation system combines two approaches:
1. Visual Validation (60% of score): Using GPT-4V to analyze screenshots
2. HTML Element Validation (40% of score): Comparing actual HTML elements

## Directory Structure

```
evaluation/
├── ground_truth/ # Ground truth screenshots
│ └── task_1_gt.png # Named consistently as task_{id}_gt.png
├── auto_eval.py # Main evaluation script
├── image_match.py # GPT-4V based image comparison
└── fuzzy_match.py # HTML element comparison
```
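
For context, the GPT-4V comparison in `image_match.py` presumably sends both screenshots to the vision model along these lines; the model name, prompt, and function names here are assumptions, not the script's actual values.

```python
# Hedged sketch of a GPT-4V screenshot comparison; model name and prompt are assumptions.
import base64
from openai import OpenAI

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def compare_screenshots(result_png: str, ground_truth_png: str, task_description: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable GPT-4 model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task_description}\nDo these screenshots show the same successful interaction?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(result_png)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(ground_truth_png)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```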

## Environment Setup

1. Ensure you have the OpenAI API key in your `.env` file:
```bash
OPENAI_API_KEY=your_openai_api_key
```

## Running Evaluation

The evaluation is typically run through the main benchmark script:
```bash
python ../run.py --tasks data/tasks.jsonl --output data/results --evaluate
```

Or can be run separately:
```bash
python auto_eval.py \
    --tasks-file data/tasks.jsonl \
    --results-dir data/results.json \
    --output-file data/evaluation.json
```

## Evaluation Process

1. **Visual Validation (GPT-4V)**
   - Compares before/after screenshots with ground truth
   - Considers task-specific requirements
   - Returns a score and detailed reasoning

2. **HTML Element Validation**
   - Compares the target HTML with the actual interaction
   - Uses fuzzy matching for robustness
   - Considers element attributes and structure

The final score is a weighted average, and a task counts as successful when the combined score is ≥ 0.9:
- Visual Score: 60%
- HTML Score: 40%
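
A one-line worked example of the weighting, using the illustrative scores from the output sample below:

```python
visual_score, html_score = 0.95, 0.90
final_score = 0.6 * visual_score + 0.4 * html_score  # 0.93, above the 0.9 success threshold
```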

## Output Format

```json
{
  "total_tasks": 10,
  "successful_tasks": 8,
  "evaluations": [
    {
      "task_id": 1,
      "success": true,
      "visual_score": 0.95,
      "html_score": 0.90,
      "final_score": 0.93,
      "reasoning": "..."
    }
  ]
}
```
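
A small illustrative snippet (not part of the repository) for summarizing such an output file:

```python
# Illustrative: print the overall success rate from an evaluation output file.
import json

with open("data/evaluation.json") as f:
    report = json.load(f)

rate = report["successful_tasks"] / max(report["total_tasks"], 1)
print(f"{report['successful_tasks']}/{report['total_tasks']} tasks succeeded ({rate:.0%})")
for e in report["evaluations"]:
    print(e["task_id"], e["final_score"], "PASS" if e["success"] else "FAIL")
```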

## Scoring Details

### Visual Score (60%)
- Element visibility and positioning
- State changes (hover effects, expansions)
- Content updates and transitions
- Overall visual accuracy

### HTML Score (40%)
1. **Structure (40% of HTML score)**
- Correct tag name
- Parent-child relationships
- Sibling context

2. **Attributes (30% of HTML score)**
- ID and class matching
- ARIA attributes
- Event handlers
- Custom data attributes

3. **Content (30% of HTML score)**
- Inner HTML similarity
- Text content matching
- Nested element structure
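
One way to picture how these weights combine is the illustrative approximation below; the field names are placeholders and `fuzzy_match.py` may compute the sub-scores differently.

```python
# Illustrative approximation of the HTML score using the weights above; not the actual fuzzy_match.py code.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def html_score(predicted: dict, truth: dict) -> float:
    structure = similarity(predicted["tag_path"], truth["tag_path"])       # e.g. "div > form > button"
    attributes = similarity(predicted["attributes"], truth["attributes"])  # serialized id/class/aria string
    content = similarity(predicted["inner_html"], truth["inner_html"])
    return 0.4 * structure + 0.3 * attributes + 0.3 * content
```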

## Requirements

- OpenAI API key with GPT-4V access
- Python 3.8+
- Required packages in `requirements.txt`
