Fixed json schema to allow for html elements
dhruvahuja19 committed Dec 16, 2024
1 parent 4d13582 commit f7b6c6f
Showing 11 changed files with 707 additions and 206 deletions.
83 changes: 83 additions & 0 deletions data/README.md
@@ -0,0 +1,83 @@
# DOM Task Format

This document describes the format for DOM interaction tasks in our benchmark.

## Schema

Tasks are defined in JSONL format, where each line is a valid JSON object following the schema in `task_schema.json`.
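Before running the benchmark, each line of the task file can be given a quick structural lint. This is a minimal stdlib-only sketch (a full check would validate against `task_schema.json` with the `jsonschema` package); the `lint_tasks` helper name is hypothetical:

```python
import json
import re
from pathlib import Path

# Required top-level fields, mirroring task_schema.json
REQUIRED = [
    "web_name", "id", "task", "web", "element_type",
    "interaction", "target_element", "target_html", "ground_truth",
]
ID_PATTERN = re.compile(r"^[a-z0-9_]+$")

def lint_tasks(path: str) -> list:
    """Collect structural errors for a dom_tasks.jsonl file.

    Checks JSON validity, required fields, the id pattern, and
    duplicate ids; returns a list of human-readable error strings.
    """
    errors, seen_ids = [], set()
    for lineno, line in enumerate(Path(path).read_text().splitlines(), 1):
        if not line.strip():
            continue
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc})")
            continue
        for field in REQUIRED:
            if field not in task:
                errors.append(f"line {lineno}: missing field '{field}'")
        task_id = task.get("id", "")
        if task_id and not ID_PATTERN.match(task_id):
            errors.append(f"line {lineno}: id '{task_id}' violates ^[a-z0-9_]+$")
        if task_id in seen_ids:
            errors.append(f"line {lineno}: duplicate id '{task_id}'")
        seen_ids.add(task_id)
    return errors
```

An empty return value means the file passed these basic checks.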

## Example Task

```json
{
  "web_name": "Cambridge Dictionary",
  "id": "cambridge_lookup_1",
  "task": "Click the search box and type 'hello'",
  "web": "https://dictionary.cambridge.org/",
  "element_type": "input",
  "interaction": "type",
  "target_element": {
    "type": "id",
    "value": "searchword"
  },
  "input_text": "hello",
  "target_html": "<input type=\"text\" id=\"searchword\" class=\"search-input\" ...>",
  "ground_truth": {
    "screenshot": "cambridge_lookup_1_gt.png",
    "description": "The word 'hello' has been entered in the search box",
    "visual_changes": [
      "Text 'hello' appears in search box",
      "Text cursor visible at end of input",
      "Search suggestions may appear"
    ],
    "success_criteria": [
      "Input text matches 'hello' exactly",
      "Text is visible in search box",
      "Search box maintains focus"
    ]
  }
}
```

## Field Descriptions

### Basic Information
- `web_name`: Name of the website
- `id`: Unique identifier for the task
- `task`: Human-readable task description
- `web`: Website URL

### Element and Interaction
- `element_type`: Type of HTML element (input, button, link, etc.)
- `interaction`: Type of interaction (click, type, hover)
- `target_element`: How to find the element
  - `type`: Selector type (`id`, `class`, or `text`)
  - `value`: Selector value
- `input_text`: Text to type (only for type interactions)
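A `target_element` descriptor can be mapped to a concrete locator before driving the browser. The `to_locator` helper below is a hypothetical sketch: `id` and `class` translate to CSS selectors, while `text` has no CSS equivalent and falls back to an XPath query:

```python
def to_locator(target: dict) -> tuple:
    """Map a target_element descriptor to a (strategy, locator) pair.

    'id' and 'class' become CSS selectors; 'text' falls back to XPath
    because CSS cannot match on text content.
    """
    kind, value = target["type"], target["value"]
    if kind == "id":
        return ("css", f"#{value}")
    if kind == "class":
        return ("css", f".{value}")
    if kind == "text":
        return ("xpath", f'//*[normalize-space(text())="{value}"]')
    raise ValueError(f"unknown selector type: {kind!r}")
```

For the example task above, `to_locator({"type": "id", "value": "searchword"})` yields a CSS locator for the search box.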

### Validation
- `target_html`: The actual HTML element for structural validation
- `ground_truth`: Validation data
  - `screenshot`: Reference screenshot filename
  - `description`: What should happen
  - `visual_changes`: List of expected visual changes
  - `success_criteria`: Specific conditions for success

## Validation Process

Tasks are validated using two methods:
1. **Visual Validation** (60% of score)
   - Compares screenshots before/after interaction
   - Verifies visual changes match ground truth

2. **HTML Validation** (40% of score)
   - Matches the HTML element the model interacted with
   - Checks structure, attributes, and content

## Adding New Tasks

1. Follow the schema in `task_schema.json`
2. Ensure unique task IDs
3. Provide clear success criteria
4. Include reference screenshots
5. Fill in the `target_html` field with the actual HTML element
160 changes: 80 additions & 80 deletions data/dom_tasks.jsonl

Large diffs are not rendered by default.

108 changes: 108 additions & 0 deletions data/task_schema.json
@@ -0,0 +1,108 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "DOM Task Schema",
  "description": "Schema for DOM interaction tasks in the benchmark",
  "type": "object",
  "required": [
    "web_name",
    "id",
    "task",
    "web",
    "element_type",
    "interaction",
    "target_element",
    "target_html",
    "ground_truth"
  ],
  "properties": {
    "web_name": {
      "type": "string",
      "description": "Name of the website"
    },
    "id": {
      "type": "string",
      "description": "Unique identifier for the task",
      "pattern": "^[a-z0-9_]+$"
    },
    "task": {
      "type": "string",
      "description": "Human-readable task description"
    },
    "web": {
      "type": "string",
      "description": "Website URL",
      "format": "uri"
    },
    "element_type": {
      "type": "string",
      "description": "Type of HTML element to interact with",
      "enum": ["input", "button", "link", "div", "span"]
    },
    "interaction": {
      "type": "string",
      "description": "Type of interaction to perform",
      "enum": ["click", "type", "hover"]
    },
    "target_element": {
      "type": "object",
      "description": "How to find the element",
      "required": ["type", "value"],
      "properties": {
        "type": {
          "type": "string",
          "description": "Type of selector to use",
          "enum": ["id", "class", "text"]
        },
        "value": {
          "type": "string",
          "description": "Value of the selector"
        }
      }
    },
    "input_text": {
      "type": "string",
      "description": "Text to type (only required for type interactions)"
    },
    "target_html": {
      "type": "string",
      "description": "The actual HTML element to match against for validation"
    },
    "ground_truth": {
      "type": "object",
      "description": "Validation data",
      "required": [
        "screenshot",
        "description",
        "visual_changes",
        "success_criteria"
      ],
      "properties": {
        "screenshot": {
          "type": "string",
          "description": "Filename of the ground truth screenshot",
          "pattern": "^[a-z0-9_]+\\.png$"
        },
        "description": {
          "type": "string",
          "description": "Description of the expected outcome"
        },
        "visual_changes": {
          "type": "array",
          "description": "List of expected visual changes",
          "items": {
            "type": "string"
          },
          "minItems": 1
        },
        "success_criteria": {
          "type": "array",
          "description": "List of specific conditions that must be met for success",
          "items": {
            "type": "string"
          },
          "minItems": 1
        }
      }
    }
  }
}
67 changes: 51 additions & 16 deletions evaluation/README.md
@@ -4,10 +4,9 @@ This directory contains the evaluation tools for the DOM and DOMer-2 benchmark.

## Overview

-The evaluation uses GPT-4V to assess web interactions by analyzing:
-1. Before/After screenshots of the webpage
-2. Accessibility tree information
-3. Task descriptions and expected outcomes
+The evaluation system combines two approaches:
+1. Visual Validation (60% of score): Using GPT-4V to analyze screenshots
+2. HTML Element Validation (40% of score): Comparing actual HTML elements

## Usage

@@ -21,20 +20,22 @@ python auto_eval.py \

## Evaluation Process

-1. **Screenshot Analysis**
-   - Compare before/after states
+1. **Visual Validation (60%)**
+   - Compare before/after screenshots
    - Verify visual changes match expected interaction
    - Check element visibility and state changes
+   - Uses GPT-4V for intelligent visual comparison

-2. **Accessibility Tree Verification**
-   - Validate correct element was targeted
-   - Check element attributes and relationships
-   - Verify element state changes
+2. **HTML Element Validation (40%)**
+   - Compare model's selected HTML element with ground truth
+   - Structure score (40%): Tag hierarchy and relationships
+   - Attributes score (30%): Element properties and identifiers
+   - Content score (30%): Inner HTML and text content

 3. **Success Criteria**
-   - Correct element identified and interacted with
-   - Expected visual changes occurred
-   - No unintended side effects
+   - Visual score ≥ 0.9 for visual validation
+   - HTML similarity score ≥ 0.9 for element validation
+   - Combined weighted score ≥ 0.9 for overall success
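The weighting and thresholds described in this README can be sketched as a small scoring function (the 60/40 split and the 0.9 gates come from the text above; the function name is hypothetical):

```python
def combined_score(visual: float, html: float) -> tuple:
    """Combine visual (60%) and HTML (40%) scores into a final score.

    A task succeeds only when the visual score, the HTML similarity
    score, and the weighted final score are all at least 0.9.
    """
    final = 0.6 * visual + 0.4 * html
    success = visual >= 0.9 and html >= 0.9 and final >= 0.9
    return final, success
```

Note that the per-component gates matter: a very high visual score cannot compensate for an HTML score below 0.9, even if the weighted sum clears the threshold.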

## Output Format

@@ -45,15 +46,49 @@ python auto_eval.py \
"evaluations": [
{
"task_id": "task_001",
"visual_evaluation": {
"score": 0.95,
"details": "Detailed visual evaluation..."
},
"html_evaluation": {
"score": 0.92,
"structure_score": 0.95,
"attributes_score": 0.90,
"content_score": 0.89
},
"final_score": 0.94,
"success": true,
"evaluation": "Detailed evaluation text...",
"timestamp": 1234567890
},
...
}
]
}
```
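A results file in this shape is easy to aggregate. This sketch assumes only the `evaluations`, `success`, and `final_score` fields shown above; the `summarize` helper name is hypothetical:

```python
import json

def summarize(results_path: str) -> dict:
    """Compute pass rate and mean final score from an evaluation results file."""
    with open(results_path) as f:
        evals = json.load(f)["evaluations"]
    passed = sum(1 for e in evals if e["success"])
    total = len(evals)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "mean_final_score": sum(e["final_score"] for e in evals) / total if total else 0.0,
    }
```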

## Scoring Details

### Visual Score (60%)
- Element visibility and positioning
- State changes (hover effects, expansions)
- Content updates and transitions
- Overall visual accuracy

### HTML Score (40%)
1. **Structure (40% of HTML score)**
- Correct tag name
- Parent-child relationships
- Sibling context

2. **Attributes (30% of HTML score)**
- ID and class matching
- ARIA attributes
- Event handlers
- Custom data attributes

3. **Content (30% of HTML score)**
- Inner HTML similarity
- Text content matching
- Nested element structure
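The three HTML sub-scores combine with the 40/30/30 weighting listed above; a one-line sketch:

```python
def html_score(structure: float, attributes: float, content: float) -> float:
    """Combine HTML sub-scores with the documented 40/30/30 weighting."""
    return 0.4 * structure + 0.3 * attributes + 0.3 * content
```

With the sub-scores from the sample output (0.95, 0.90, 0.89) this yields 0.917, which matches the sample's overall HTML score of 0.92 once rounded.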

## Requirements

- OpenAI API key with GPT-4V access
