
Commit

just need a small model fix
dhruvahuja19 committed Dec 17, 2024
1 parent 4dc5de5 commit 0ce8955
Showing 14 changed files with 589 additions and 651 deletions.
63 changes: 50 additions & 13 deletions README.md
@@ -69,25 +69,62 @@ Located in `data/ground_truth/`, each task has:
- `[task_id]_gt.png`: Screenshot of successful interaction
- Description in task JSON explaining expected changes

## Environment Setup

1. Create a virtual environment and install dependencies:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

2. Set up environment variables in `.env`:
```bash
OPENAI_API_KEY=your_openai_api_key
```
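
The rest of the scripts read this key at runtime. A minimal sketch of how that might look, assuming `python-dotenv` is among the installed requirements (check `requirements.txt`):

```python
# Minimal sketch: load OPENAI_API_KEY from .env; python-dotenv is an assumption here.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if the key is missing
```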

## Running the Benchmark

1. Run tasks:
```bash
python run.py --tasks data/dom_tasks.jsonl --output results --evaluate
```

This will:
- Execute each task in the tasks file
- Save screenshots and results to the output directory
- Run GPT-4V evaluation if --evaluate is specified
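
For orientation, the core loop of such a runner could look roughly like the sketch below; `execute_task` is a caller-supplied placeholder, not a function from this repository.

```python
# Hypothetical sketch of the benchmark loop; see run.py for the actual implementation.
import json
from pathlib import Path
from typing import Callable

def run_benchmark(tasks_path: str, output_dir: str,
                  execute_task: Callable[[dict, Path], dict],
                  evaluate: bool = False) -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    with open(tasks_path) as f:
        for line in f:
            task = json.loads(line)                  # one task per JSONL line
            results.append(execute_task(task, out))  # drive the browser, save screenshots
    (out / "results.json").write_text(json.dumps(results, indent=2))
    if evaluate:
        pass  # GPT-4V evaluation (evaluation/auto_eval.py) would be invoked here
```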

## Ground Truth Management

Ground truth images are stored in `evaluation/ground_truth/` with a consistent naming scheme:
```
evaluation/ground_truth/
├── task_1_gt.png
├── task_2_gt.png
└── ...
```

The tasks file references these images using relative paths:
```json
{
  "id": 1,
  "ground_truth": {
    "screenshot": "evaluation/ground_truth/task_1_gt.png"
  }
}
```
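
A quick illustrative check (not part of the repository) that every referenced screenshot actually exists:

```python
# Illustrative sanity check: confirm each task's ground-truth screenshot is present.
import json
from pathlib import Path

with open("data/dom_tasks.jsonl") as f:
    for line in f:
        task = json.loads(line)
        screenshot = Path(task["ground_truth"]["screenshot"])
        if not screenshot.exists():
            print(f"Missing ground truth for task {task['id']}: {screenshot}")
```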

## Testing

Run environment tests:
```bash
python test_env.py
```

Run OpenAI API connection test:
```bash
python test_openai.py
```
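
For a sense of what such a check involves, a minimal version might look like this (assuming the `openai` v1 Python client; `test_openai.py` may differ):

```python
# Minimal connectivity check; assumes openai>=1.0 and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically
models = client.models.list()
print(f"API reachable; {len(models.data)} models visible")
```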

## Evaluation Process
81 changes: 1 addition & 80 deletions data/dom_tasks.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions data/evaluation_output.jsonl
@@ -0,0 +1,5 @@
{
"total_tasks": 80,
"successful_tasks": 0,
"evaluations": []
}
1 change: 1 addition & 0 deletions data/ground_truth.jsonl
@@ -0,0 +1 @@
{"task_id": 1, "target_html": "<button class='primary-button'>Click me</button>", "screenshot": "evaluation/ground_truth/task_1_gt.png"}
121 changes: 51 additions & 70 deletions evaluation/README.md
@@ -8,89 +8,70 @@ The evaluation system combines two approaches:
1. Visual Validation (60% of score): Using GPT-4V to analyze screenshots
2. HTML Element Validation (40% of score): Comparing actual HTML elements

## Directory Structure

```
evaluation/
├── ground_truth/ # Ground truth screenshots
│ └── task_1_gt.png # Named consistently as task_{id}_gt.png
├── auto_eval.py # Main evaluation script
├── image_match.py # GPT-4V based image comparison
└── fuzzy_match.py # HTML element comparison
```
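
For context, the GPT-4V comparison in `image_match.py` presumably sends both screenshots to the vision model along these lines; the model name, prompt, and function names here are assumptions, not the script's actual values.

```python
# Hedged sketch of a GPT-4V screenshot comparison; model name and prompt are assumptions.
import base64
from openai import OpenAI

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def compare_screenshots(result_png: str, ground_truth_png: str, task_description: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable GPT-4 model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task_description}\nDo these screenshots show the same successful interaction?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(result_png)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(ground_truth_png)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```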

## Environment Setup

1. Ensure you have the OpenAI API key in your `.env` file:
```bash
OPENAI_API_KEY=your_openai_api_key
```

## Running Evaluation

The evaluation is typically run through the main benchmark script:
```bash
python ../run.py --tasks data/tasks.jsonl --output data/results --evaluate
```

Or can be run separately:
```bash
python auto_eval.py \
    --tasks-file data/tasks.jsonl \
    --results-dir data/results.json \
    --output-file data/evaluation.json
```

## Evaluation Process

1. **Visual Validation (GPT-4V)**
   - Compares before/after screenshots with ground truth
   - Considers task-specific requirements
   - Returns a score and detailed reasoning

2. **HTML Element Validation**
   - Compares the target HTML with the actual interaction
   - Uses fuzzy matching for robustness
   - Considers element attributes and structure

The final score is a weighted average, and a task counts as successful when the combined score is ≥ 0.9:
- Visual Score: 60%
- HTML Score: 40%
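
A one-line worked example of the weighting, using the illustrative scores from the output sample below:

```python
visual_score, html_score = 0.95, 0.90
final_score = 0.6 * visual_score + 0.4 * html_score  # 0.93, above the 0.9 success threshold
```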

## Output Format

```json
{
  "total_tasks": 10,
  "successful_tasks": 8,
  "evaluations": [
    {
      "task_id": 1,
      "success": true,
      "visual_score": 0.95,
      "html_score": 0.90,
      "final_score": 0.93,
      "reasoning": "..."
    }
  ]
}
```
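
A small illustrative snippet (not part of the repository) for summarizing such an output file:

```python
# Illustrative: print the overall success rate from an evaluation output file.
import json

with open("data/evaluation.json") as f:
    report = json.load(f)

rate = report["successful_tasks"] / max(report["total_tasks"], 1)
print(f"{report['successful_tasks']}/{report['total_tasks']} tasks succeeded ({rate:.0%})")
for e in report["evaluations"]:
    print(e["task_id"], e["final_score"], "PASS" if e["success"] else "FAIL")
```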

## Scoring Details

### Visual Score (60%)
- Element visibility and positioning
- State changes (hover effects, expansions)
- Content updates and transitions
- Overall visual accuracy

### HTML Score (40%)
1. **Structure (40% of HTML score)**
- Correct tag name
- Parent-child relationships
- Sibling context

2. **Attributes (30% of HTML score)**
- ID and class matching
- ARIA attributes
- Event handlers
- Custom data attributes

3. **Content (30% of HTML score)**
- Inner HTML similarity
- Text content matching
- Nested element structure
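
One way to picture how these weights combine is the illustrative approximation below; the field names are placeholders and `fuzzy_match.py` may compute the sub-scores differently.

```python
# Illustrative approximation of the HTML score using the weights above; not the actual fuzzy_match.py code.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def html_score(predicted: dict, truth: dict) -> float:
    structure = similarity(predicted["tag_path"], truth["tag_path"])       # e.g. "div > form > button"
    attributes = similarity(predicted["attributes"], truth["attributes"])  # serialized id/class/aria string
    content = similarity(predicted["inner_html"], truth["inner_html"])
    return 0.4 * structure + 0.3 * attributes + 0.3 * content
```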

## Requirements

- OpenAI API key with GPT-4V access
- Python 3.8+
- Required packages in `requirements.txt`
