Initial Setup

dhruvahuja19 · Dec 16, 2024 · commit 23d4aa5

Showing 16 changed files with 1,151 additions and 0 deletions.
diff --git a/.env.template b/.env.template
@@ -0,0 +1,15 @@
+# OpenAI API Key for GPT-4V evaluation
+OPENAI_API_KEY=your_openai_api_key_here
+
+# Chrome WebDriver Settings
+CHROME_BINARY_PATH=/path/to/chrome/binary  # Optional
+CHROME_DRIVER_PATH=/path/to/chromedriver   # Optional
+
+# Benchmark Settings
+HEADLESS=true                   # Run browser in headless mode
+FORCE_DEVICE_SCALE=true        # Force consistent device scaling
+IMAGE_MATCH_THRESHOLD=0.95     # Threshold for image similarity matching
+
+# Output Settings
+SAVE_ACCESSIBILITY_TREE=true   # Save accessibility tree for each task
+LOG_LEVEL=INFO                 # Logging level (DEBUG, INFO, WARNING, ERROR)
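Review note on `.env.template`: the commit does not show how these keys are consumed. The following is a minimal sketch, assuming the runner reads them from the process environment after the `.env` file has been loaded by some other means. `load_settings` and `_as_bool` are hypothetical names that do not appear in this commit, and the defaults simply mirror the template values.

```python
import os
from typing import Any, Dict, Mapping, Optional


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Hypothetical helper: read the .env.template keys, falling back to template defaults."""
    env = os.environ if env is None else env
    threshold = float(env.get("IMAGE_MATCH_THRESHOLD", "0.95"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("IMAGE_MATCH_THRESHOLD must be between 0 and 1")
    return {
        "openai_api_key": env.get("OPENAI_API_KEY"),
        # Optional paths: an empty or missing value becomes None (assumed convention).
        "chrome_binary_path": env.get("CHROME_BINARY_PATH") or None,
        "chrome_driver_path": env.get("CHROME_DRIVER_PATH") or None,
        "headless": _as_bool(env.get("HEADLESS", "true")),
        "force_device_scale": _as_bool(env.get("FORCE_DEVICE_SCALE", "true")),
        "image_match_threshold": threshold,
        "save_accessibility_tree": _as_bool(env.get("SAVE_ACCESSIBILITY_TREE", "true")),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
    }
```

One thing to watch: the template puts comments on the same line as the values. If whatever loads the file does not strip inline comments, a value such as `true   # Run browser in headless mode` would reach `_as_bool` unchanged and be read as false.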
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,44 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual Environment
+venv/
+env/
+ENV/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+
+# Benchmark specific
+results/
+data/ground_truth/*.png
+*.log
+
+# Environment variables
+.env
+
+# OS specific
+.DS_Store
+Thumbs.db
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Dhruv Ahuja
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/Makefile b/Makefile
@@ -0,0 +1,49 @@
+.PHONY: install test lint format clean run evaluate
+
+# Environment setup
+install:
+	pip install -e .
+	pip install -r requirements.txt
+
+# Testing
+test:
+	pytest
+
+# Code quality
+lint:
+	flake8 .
+	mypy .
+	black . --check
+	isort . --check
+
+format:
+	black .
+	isort .
+
+# Cleaning
+clean:
+	rm -rf build/
+	rm -rf dist/
+	rm -rf *.egg-info
+	find . -type d -name __pycache__ -exec rm -r {} +
+	find . -type f -name "*.pyc" -delete
+	find . -type f -name "*.pyo" -delete
+	find . -type f -name "*.pyd" -delete
+	find . -type f -name ".coverage" -delete
+	find . -type d -name "*.egg-info" -exec rm -r {} +
+	find . -type d -name "*.egg" -exec rm -r {} +
+
+# Benchmark commands
+run:
+	python run.py \
+		--tasks data/dom_tasks.jsonl \
+		--output results/run_001 \
+		--headless \
+		--save-accessibility-tree
+
+evaluate:
+	python evaluation/auto_eval.py \
+		--tasks data/dom_tasks.jsonl \
+		--results results/run_001 \
+		--ground-truth data/ground_truth \
+		--output results/run_001/evaluation.json
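Review note on the `run` target: `run.py` is not included in this part of the diff, so its argument handling is unknown. As a sketch only, the four flags used by `make run` could be parsed with `argparse` as shown here. `build_parser` and `parse_args` are hypothetical names, and marking `--tasks` and `--output` as required is an assumption.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags passed by the Makefile's run target."""
    parser = argparse.ArgumentParser(description="Benchmark runner (sketch)")
    parser.add_argument("--tasks", required=True, help="Path to the JSONL task file")
    parser.add_argument("--output", required=True, help="Directory for this run's results")
    parser.add_argument("--headless", action="store_true", help="Run the browser headless")
    parser.add_argument(
        "--save-accessibility-tree",
        action="store_true",
        help="Save an accessibility tree per task",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv when argv is None)."""
    return build_parser().parse_args(argv)
```

With this shape, `--save-accessibility-tree` is exposed as `args.save_accessibility_tree`, and both boolean flags default to off when omitted.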
diff --git a/README.md b/README.md
@@ -0,0 +1,140 @@
+# DOM and DOMer-2
+
+A benchmark for evaluating language models' ability to execute web element interactions.
+
+## Overview
+
+DOM and DOMer-2 focuses on testing a model's ability to interact with web elements (clicking buttons, typing text, etc.) without requiring complex planning or reasoning. The benchmark provides:
+
+1. Simple, single-action tasks
+2. Real websites with diverse DOM structures
+3. Ground truth screenshots for validation
+4. GPT-4V-based evaluation
+
+## Directory Structure
+
+```
+DOMe-and-DOMer-2/
+├── data/
+│   ├── dom_tasks.jsonl         # Task definitions
+│   └── ground_truth/          # Ground truth screenshots
+│       ├── amazon_search_1_gt.png
+│       └── ...
+├── evaluation/
+│   ├── auto_eval.py           # GPT-4V evaluation script
+│   └── README.md              # Evaluation documentation
+├── results/                   # Results for each run
+│   └── run_001/
+│       ├── before_*.png       # Screenshots before interaction
+│       ├── after_*.png        # Screenshots after interaction
+│       ├── accessibility_*.json  # Accessibility trees
+│       ├── results.json       # Raw results
+│       ├── evaluation.json    # GPT-4V evaluations
+│       └── benchmark.log      # Detailed logs
+├── prompts.py                # LLM system prompts
+├── run.py                    # Main benchmark runner
+├── utils.py                 # Utility functions
+└── requirements.txt         # Dependencies
+```
+## Task Format
+
+Tasks are defined in `data/dom_tasks.jsonl`:
+
+```json
+{
+    "web_name": "Amazon",
+    "id": "amazon_search_1",
+    "task": "Click the search button",
+    "web": "https://www.amazon.com",
+    "element_type": "button",
+    "interaction": "click",
+    "target_element": {
+        "type": "id",
+        "value": "nav-search-submit-button"
+    },
+    "ground_truth": {
+        "screenshot": "amazon_search_1_gt.png",
+        "description": "The search button has been clicked, showing search results"
+    }
+}
+```
+
+## Ground Truth
+
+Ground truth is provided in two forms:
+1. **Screenshots**: Visual state after successful interaction
+2. **Descriptions**: Text description of expected changes
+
+Each task has:
+- `data/ground_truth/[task_id]_gt.png`: Screenshot of the page after a successful interaction
+- `ground_truth.description` in the task JSON: Text describing the expected changes
+
+## Running the Benchmark
+
+1. **Run Benchmark**:
+```bash
+python run.py \
+    --tasks data/dom_tasks.jsonl \
+    --output results/run_001 \
+    --headless \
+    --save-accessibility-tree
+```
+
+2. **Evaluate Results**:
+```bash
+python evaluation/auto_eval.py \
+    --tasks data/dom_tasks.jsonl \
+    --results results/run_001 \
+    --ground-truth data/ground_truth \
+    --output results/run_001/evaluation.json \
+    --openai-key YOUR_API_KEY
+```
+
+## Evaluation Process
+
+1. **Technical Validation**:
+   - Element found and interacted with
+   - No errors during interaction
+   - Accessibility tree verification
+
+2. **Visual Validation**:
+   - Compare after screenshot with ground truth
+   - Verify expected visual changes
+   - Check for unintended side effects
+
+3. **GPT-4V Analysis**:
+   - Compare before/after/ground-truth screenshots
+   - Verify interaction success
+   - Check visual state matches expectations
+
+## Output Format
+
+```json
+{
+    "total_tasks": 10,
+    "successful_tasks": 8,
+    "evaluations": [
+        {
+            "task_id": "amazon_search_1",
+            "success": true,
+            "evaluation": "Detailed evaluation text...",
+            "timestamp": 1234567890
+        }
+    ]
+}
+```
+
+## Requirements
+
+- Python 3.8+
+- Chrome/Chromium browser
+- OpenAI API key (for evaluation)
+- Required packages in `requirements.txt`
+
+## Contributing
+
+[Contributing guidelines will be added]
+
+## License
+
+Released under the MIT License. See `LICENSE`.
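Review note on the README's task format: the `target_element` object pairs a lookup `type` with a `value`, but the README does not say how the runner resolves it. A minimal sketch follows, assuming a WebDriver-style lookup. The strategy strings are chosen to match Selenium's `By` constants (for example `By.ID == "id"` and `By.CLASS_NAME == "class name"`); the use of Selenium is itself an assumption, `to_locator` is a hypothetical name, and only the `id` and `class` types appear in this commit.

```python
from typing import Dict, Tuple

# Maps the task file's target_element "type" to a WebDriver locator strategy.
# "id" and "class" appear in this commit; the remaining entries are assumptions.
_STRATEGIES: Dict[str, str] = {
    "id": "id",
    "class": "class name",
    "name": "name",
    "css": "css selector",
    "xpath": "xpath",
}


def to_locator(target_element: Dict[str, str]) -> Tuple[str, str]:
    """Hypothetical helper: turn a target_element object into a (strategy, value) pair."""
    kind = target_element["type"]
    try:
        strategy = _STRATEGIES[kind]
    except KeyError:
        raise ValueError(f"Unsupported target_element type: {kind!r}") from None
    return strategy, target_element["value"]
```

Under the Selenium assumption, the pair could be passed straight through as `driver.find_element(*to_locator(task["target_element"]))`. Rejecting unknown types early keeps a typo in the task file from surfacing later as a confusing element-not-found failure.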
diff --git a/data/dom_tasks.jsonl b/data/dom_tasks.jsonl
@@ -0,0 +1,2 @@
+{"web_name": "Cambridge Dictionary", "id": "cambridge_lookup_1", "task": "Click the search box and type 'hello'", "web": "https://dictionary.cambridge.org/", "element_type": "input", "interaction": "type", "target_element": {"type": "id", "value": "searchword"}, "input_text": "hello", "ground_truth": {"screenshot": "cambridge_lookup_1_gt.png", "description": "The word 'hello' has been entered in the search box", "visual_changes": ["Text 'hello' appears in search box", "Text cursor visible at end of input", "Search suggestions may appear"], "accessibility_changes": ["Search box aria-value updates to 'hello'", "Search suggestions list may become visible"], "success_criteria": ["Input text matches 'hello' exactly", "Text is visible in search box", "Search box maintains focus"]}}
+{"web_name": "Cambridge Dictionary", "id": "cambridge_search_1", "task": "Click the search button", "web": "https://dictionary.cambridge.org/", "element_type": "button", "interaction": "click", "target_element": {"type": "class", "value": "cdo-search-button"}, "ground_truth": {"screenshot": "cambridge_search_1_gt.png", "description": "The search results for 'hello' are displayed", "visual_changes": ["Search button appears pressed", "Page transitions to search results", "Definition of 'hello' is displayed"], "accessibility_changes": ["Search results region becomes visible", "Page title updates to include 'hello'", "Search results are announced to screen readers"], "success_criteria": ["Search button responds to click", "Results page loads with 'hello' definition", "No error messages are displayed"]}}
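Review note on `data/dom_tasks.jsonl`: each line is a standalone JSON object, and the `type` task carries an extra `input_text` field that the `click` task lacks. A minimal sketch of a loader that checks this shape is given here. `parse_tasks` and `REQUIRED_KEYS` are hypothetical names, and the set of required keys is inferred from the two tasks in this commit rather than from any schema.

```python
import json
from typing import Any, Dict, Iterable, List

# Keys present in both tasks in this commit (inferred, not an official schema).
REQUIRED_KEYS = (
    "web_name", "id", "task", "web",
    "element_type", "interaction", "target_element", "ground_truth",
)


def parse_tasks(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Hypothetical loader: parse JSONL task lines and validate their basic shape."""
    tasks: List[Dict[str, Any]] = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        task = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in task]
        if missing:
            raise ValueError(f"line {number}: missing keys {missing}")
        if task["interaction"] == "type" and "input_text" not in task:
            raise ValueError(f"line {number}: a 'type' task needs input_text")
        if task["id"] in seen_ids:
            raise ValueError(f"line {number}: duplicate id {task['id']!r}")
        seen_ids.add(task["id"])
        tasks.append(task)
    return tasks
```

Because the function accepts any iterable of strings, an open file handle works directly: `with open("data/dom_tasks.jsonl", encoding="utf-8") as f: tasks = parse_tasks(f)`.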
diff --git a/evaluation/README.md b/evaluation/README.md
@@ -0,0 +1,61 @@
+# DOM and DOMer-2 Evaluation
+
+This directory contains the evaluation tools for the DOM and DOMer-2 benchmark.
+
+## Overview
+
+The evaluation uses GPT-4V to assess web interactions by analyzing:
+1. Before/After screenshots of the webpage
+2. Accessibility tree information
+3. Task descriptions and expected outcomes
+
+## Usage
+
+```bash
+python auto_eval.py \
+    --tasks ../data/dom_tasks.jsonl \
+    --results ../results/run_001 \
+    --output ../results/run_001/evaluation.json \
+    --openai-key YOUR_API_KEY
+```
+
+## Evaluation Process
+
+1. **Screenshot Analysis**
+   - Compare before/after states
+   - Verify visual changes match expected interaction
+   - Check element visibility and state changes
+
+2. **Accessibility Tree Verification**
+   - Validate correct element was targeted
+   - Check element attributes and relationships
+   - Verify element state changes
+
+3. **Success Criteria**
+   - Correct element identified and interacted with
+   - Expected visual changes occurred
+   - No unintended side effects
+
+## Output Format
+
+```json
+{
+    "total_tasks": 10,
+    "successful_tasks": 8,
+    "evaluations": [
+        {
+            "task_id": "task_001",
+            "success": true,
+            "evaluation": "Detailed evaluation text...",
+            "timestamp": 1234567890
+        },
+        ...
+    ]
+}
+```
+
+## Requirements
+
+- OpenAI API key with GPT-4V access
+- Python 3.8+
+- Required packages in `requirements.txt`
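Review note on the evaluation output format: the report stores aggregate counts alongside per-task entries, so the two can drift apart. A minimal sketch of a consumer that computes the success rate and cross-checks the counts follows. `summarize` is a hypothetical name, and the field names are taken from the sample output in the README.

```python
import json
from typing import Any, Dict


def summarize(report: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: derive a success rate and sanity-check the counts."""
    evaluations = report.get("evaluations", [])
    failed = [entry["task_id"] for entry in evaluations if not entry["success"]]
    passed = len(evaluations) - len(failed)
    total = report.get("total_tasks", len(evaluations))
    return {
        "success_rate": report["successful_tasks"] / total if total else 0.0,
        "failed_task_ids": failed,
        # True only when the stored counts agree with the per-task entries.
        "counts_consistent": (
            passed == report["successful_tasks"] and total == len(evaluations)
        ),
    }
```

Usage would look like `summarize(json.load(open("results/run_001/evaluation.json")))`. Note that the abbreviated sample in the README (10 tasks, one entry shown) would report `counts_consistent` as false, since its `evaluations` list is truncated.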