Fixed json schema to allow for html elements
dhruvahuja19 committed Dec 16, 2024
1 parent 4d13582 commit f7b6c6f
Showing 11 changed files with 707 additions and 206 deletions.
83 changes: 83 additions & 0 deletions data/README.md
@@ -0,0 +1,83 @@
# DOM Task Format

This document describes the format for DOM interaction tasks in our benchmark.

## Schema

Tasks are defined in JSONL format, where each line is a valid JSON object following the schema in `task_schema.json`.
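Before running the benchmark, each line of the task file can be given a quick structural lint. This is a minimal stdlib-only sketch (a full check would validate against `task_schema.json` with the `jsonschema` package); the `lint_tasks` helper name is hypothetical:

```python
import json
import re
from pathlib import Path

# Required top-level fields, mirroring task_schema.json
REQUIRED = [
    "web_name", "id", "task", "web", "element_type",
    "interaction", "target_element", "target_html", "ground_truth",
]
ID_PATTERN = re.compile(r"^[a-z0-9_]+$")

def lint_tasks(path: str) -> list:
    """Collect structural errors for a dom_tasks.jsonl file.

    Checks JSON validity, required fields, the id pattern, and
    duplicate ids; returns a list of human-readable error strings.
    """
    errors, seen_ids = [], set()
    for lineno, line in enumerate(Path(path).read_text().splitlines(), 1):
        if not line.strip():
            continue
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc})")
            continue
        for field in REQUIRED:
            if field not in task:
                errors.append(f"line {lineno}: missing field '{field}'")
        task_id = task.get("id", "")
        if task_id and not ID_PATTERN.match(task_id):
            errors.append(f"line {lineno}: id '{task_id}' violates ^[a-z0-9_]+$")
        if task_id in seen_ids:
            errors.append(f"line {lineno}: duplicate id '{task_id}'")
        seen_ids.add(task_id)
    return errors
```

An empty return value means the file passed these basic checks.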

## Example Task

```json
{
  "web_name": "Cambridge Dictionary",
  "id": "cambridge_lookup_1",
  "task": "Click the search box and type 'hello'",
  "web": "https://dictionary.cambridge.org/",
  "element_type": "input",
  "interaction": "type",
  "target_element": {
    "type": "id",
    "value": "searchword"
  },
  "input_text": "hello",
  "target_html": "<input type=\"text\" id=\"searchword\" class=\"search-input\" ...>",
  "ground_truth": {
    "screenshot": "cambridge_lookup_1_gt.png",
    "description": "The word 'hello' has been entered in the search box",
    "visual_changes": [
      "Text 'hello' appears in search box",
      "Text cursor visible at end of input",
      "Search suggestions may appear"
    ],
    "success_criteria": [
      "Input text matches 'hello' exactly",
      "Text is visible in search box",
      "Search box maintains focus"
    ]
  }
}
```

## Field Descriptions

### Basic Information
- `web_name`: Name of the website
- `id`: Unique identifier for the task
- `task`: Human-readable task description
- `web`: Website URL

### Element and Interaction
- `element_type`: Type of HTML element (input, button, link, etc.)
- `interaction`: Type of interaction (click, type, hover)
- `target_element`: How to find the element
  - `type`: Selector type (`id`, `class`, or `text`)
  - `value`: Selector value
- `input_text`: Text to type (only for type interactions)
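A `target_element` descriptor can be mapped to a concrete locator before driving the browser. The `to_locator` helper below is a hypothetical sketch: `id` and `class` translate to CSS selectors, while `text` has no CSS equivalent and falls back to an XPath query:

```python
def to_locator(target: dict) -> tuple:
    """Map a target_element descriptor to a (strategy, locator) pair.

    'id' and 'class' become CSS selectors; 'text' falls back to XPath
    because CSS cannot match on text content.
    """
    kind, value = target["type"], target["value"]
    if kind == "id":
        return ("css", f"#{value}")
    if kind == "class":
        return ("css", f".{value}")
    if kind == "text":
        return ("xpath", f'//*[normalize-space(text())="{value}"]')
    raise ValueError(f"unknown selector type: {kind!r}")
```

For the example task above, `to_locator({"type": "id", "value": "searchword"})` yields a CSS locator for the search box.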

### Validation
- `target_html`: The actual HTML element for structural validation
- `ground_truth`: Validation data
  - `screenshot`: Reference screenshot filename
  - `description`: What should happen
  - `visual_changes`: List of expected visual changes
  - `success_criteria`: Specific conditions for success

## Validation Process

Tasks are validated using two methods:
1. **Visual Validation** (60% of score)
   - Compares screenshots before/after interaction
   - Verifies visual changes match ground truth

2. **HTML Validation** (40% of score)
   - Matches the HTML element the model interacted with
   - Checks structure, attributes, and content

## Adding New Tasks

1. Follow the schema in `task_schema.json`
2. Ensure unique task IDs
3. Provide clear success criteria
4. Include reference screenshots
5. Fill in the `target_html` field with the actual HTML element
160 changes: 80 additions & 80 deletions data/dom_tasks.jsonl

Large diffs are not rendered by default.

108 changes: 108 additions & 0 deletions data/task_schema.json
@@ -0,0 +1,108 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "DOM Task Schema",
  "description": "Schema for DOM interaction tasks in the benchmark",
  "type": "object",
  "required": [
    "web_name",
    "id",
    "task",
    "web",
    "element_type",
    "interaction",
    "target_element",
    "target_html",
    "ground_truth"
  ],
  "properties": {
    "web_name": {
      "type": "string",
      "description": "Name of the website"
    },
    "id": {
      "type": "string",
      "description": "Unique identifier for the task",
      "pattern": "^[a-z0-9_]+$"
    },
    "task": {
      "type": "string",
      "description": "Human-readable task description"
    },
    "web": {
      "type": "string",
      "description": "Website URL",
      "format": "uri"
    },
    "element_type": {
      "type": "string",
      "description": "Type of HTML element to interact with",
      "enum": ["input", "button", "link", "div", "span"]
    },
    "interaction": {
      "type": "string",
      "description": "Type of interaction to perform",
      "enum": ["click", "type", "hover"]
    },
    "target_element": {
      "type": "object",
      "description": "How to find the element",
      "required": ["type", "value"],
      "properties": {
        "type": {
          "type": "string",
          "description": "Type of selector to use",
          "enum": ["id", "class", "text"]
        },
        "value": {
          "type": "string",
          "description": "Value of the selector"
        }
      }
    },
    "input_text": {
      "type": "string",
      "description": "Text to type (only required for type interactions)"
    },
    "target_html": {
      "type": "string",
      "description": "The actual HTML element to match against for validation"
    },
    "ground_truth": {
      "type": "object",
      "description": "Validation data",
      "required": [
        "screenshot",
        "description",
        "visual_changes",
        "success_criteria"
      ],
      "properties": {
        "screenshot": {
          "type": "string",
          "description": "Filename of the ground truth screenshot",
          "pattern": "^[a-z0-9_]+\\.png$"
        },
        "description": {
          "type": "string",
          "description": "Description of the expected outcome"
        },
        "visual_changes": {
          "type": "array",
          "description": "List of expected visual changes",
          "items": {
            "type": "string"
          },
          "minItems": 1
        },
        "success_criteria": {
          "type": "array",
          "description": "List of specific conditions that must be met for success",
          "items": {
            "type": "string"
          },
          "minItems": 1
        }
      }
    }
  }
}
67 changes: 51 additions & 16 deletions evaluation/README.md
@@ -4,10 +4,9 @@ This directory contains the evaluation tools for the DOM and DOMer-2 benchmark.

## Overview

-The evaluation uses GPT-4V to assess web interactions by analyzing:
-1. Before/After screenshots of the webpage
-2. Accessibility tree information
-3. Task descriptions and expected outcomes
+The evaluation system combines two approaches:
+1. Visual Validation (60% of score): Using GPT-4V to analyze screenshots
+2. HTML Element Validation (40% of score): Comparing actual HTML elements

## Usage

@@ -21,20 +20,22 @@ python auto_eval.py \

## Evaluation Process

-1. **Screenshot Analysis**
-   - Compare before/after states
+1. **Visual Validation (60%)**
+   - Compare before/after screenshots
    - Verify visual changes match expected interaction
    - Check element visibility and state changes
+   - Uses GPT-4V for intelligent visual comparison

-2. **Accessibility Tree Verification**
-   - Validate correct element was targeted
-   - Check element attributes and relationships
-   - Verify element state changes
+2. **HTML Element Validation (40%)**
+   - Compare model's selected HTML element with ground truth
+   - Structure score (40%): Tag hierarchy and relationships
+   - Attributes score (30%): Element properties and identifiers
+   - Content score (30%): Inner HTML and text content

 3. **Success Criteria**
-   - Correct element identified and interacted with
-   - Expected visual changes occurred
-   - No unintended side effects
+   - Visual score ≥ 0.9 for visual validation
+   - HTML similarity score ≥ 0.9 for element validation
+   - Combined weighted score ≥ 0.9 for overall success
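The weighting and thresholds described in this README can be sketched as a small scoring function (the 60/40 split and the 0.9 gates come from the text above; the function name is hypothetical):

```python
def combined_score(visual: float, html: float) -> tuple:
    """Combine visual (60%) and HTML (40%) scores into a final score.

    A task succeeds only when the visual score, the HTML similarity
    score, and the weighted final score are all at least 0.9.
    """
    final = 0.6 * visual + 0.4 * html
    success = visual >= 0.9 and html >= 0.9 and final >= 0.9
    return final, success
```

Note that the per-component gates matter: a very high visual score cannot compensate for an HTML score below 0.9, even if the weighted sum clears the threshold.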

## Output Format

@@ -45,15 +46,49 @@ python auto_eval.py \
"evaluations": [
{
"task_id": "task_001",
"visual_evaluation": {
"score": 0.95,
"details": "Detailed visual evaluation..."
},
"html_evaluation": {
"score": 0.92,
"structure_score": 0.95,
"attributes_score": 0.90,
"content_score": 0.89
},
"final_score": 0.94,
"success": true,
"evaluation": "Detailed evaluation text...",
"timestamp": 1234567890
},
...
}
]
}
```
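A results file in this shape is easy to aggregate. This sketch assumes only the `evaluations`, `success`, and `final_score` fields shown above; the `summarize` helper name is hypothetical:

```python
import json

def summarize(results_path: str) -> dict:
    """Compute pass rate and mean final score from an evaluation results file."""
    with open(results_path) as f:
        evals = json.load(f)["evaluations"]
    passed = sum(1 for e in evals if e["success"])
    total = len(evals)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "mean_final_score": sum(e["final_score"] for e in evals) / total if total else 0.0,
    }
```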

## Scoring Details

### Visual Score (60%)
- Element visibility and positioning
- State changes (hover effects, expansions)
- Content updates and transitions
- Overall visual accuracy

### HTML Score (40%)
1. **Structure (40% of HTML score)**
- Correct tag name
- Parent-child relationships
- Sibling context

2. **Attributes (30% of HTML score)**
- ID and class matching
- ARIA attributes
- Event handlers
- Custom data attributes

3. **Content (30% of HTML score)**
- Inner HTML similarity
- Text content matching
- Nested element structure
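The three HTML sub-scores combine with the 40/30/30 weighting listed above; a one-line sketch:

```python
def html_score(structure: float, attributes: float, content: float) -> float:
    """Combine HTML sub-scores with the documented 40/30/30 weighting."""
    return 0.4 * structure + 0.3 * attributes + 0.3 * content
```

With the sub-scores from the sample output (0.95, 0.90, 0.89) this yields 0.917, which matches the sample's overall HTML score of 0.92 once rounded.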

## Requirements

- OpenAI API key with GPT-4V access
