
Commit

Model adapters
dhruvahuja19 committed Dec 17, 2024
1 parent 1a40c5a commit 078ebdf
Showing 14 changed files with 706 additions and 144 deletions.
13 changes: 13 additions & 0 deletions .env.example
@@ -0,0 +1,13 @@
# OpenAI API Key for GPT-4 model
OPENAI_API_KEY=your-openai-api-key-here

# Anthropic API Key for Claude model
ANTHROPIC_API_KEY=your-anthropic-api-key-here

# Optional: Model configurations
GPT4_MODEL=gpt-4-turbo-preview # or gpt-4
CLAUDE_MODEL=claude-3-opus-20240229

# Optional: Execution settings
MAX_WORKERS=4
TIMEOUT_SECONDS=30
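
These variables are loaded at runtime with python-dotenv (listed as a project dependency). Below is a minimal sketch of how they might be read; exactly how run.py consumes the optional settings is not shown in this commit, so treat the handling as illustrative only.

```python
# Minimal sketch: reading the settings above with python-dotenv.
# How run.py actually consumes the optional values is not shown in this
# commit, so treat the handling below as illustrative only.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
gpt4_model = os.getenv("GPT4_MODEL", "gpt-4-turbo-preview")
claude_model = os.getenv("CLAUDE_MODEL", "claude-3-opus-20240229")
max_workers = int(os.getenv("MAX_WORKERS", "4"))
timeout_seconds = int(os.getenv("TIMEOUT_SECONDS", "30"))

if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is required for the GPT-4 model and evaluation")
```
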
222 changes: 86 additions & 136 deletions README.md
@@ -10,175 +10,125 @@ DOM and DOMer-2 focuses on testing a model's ability to interact with web elements
2. Real websites with diverse DOM structures
3. Ground truth screenshots for validation
4. GPT-4V based evaluation
5. Support for both serial and parallel execution

## Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/DOMe-and-DOMer-2.git
cd DOMe-and-DOMer-2
```

2. Install dependencies using pip:
```bash
pip install -e .
```

Required dependencies:
- selenium
- webdriver-manager
- Pillow
- numpy
- requests
- beautifulsoup4
- openai
- python-dotenv

3. Set up your API keys in a `.env` file (see `.env.example` above; the Anthropic key is only needed for the Claude adapter):
```bash
OPENAI_API_KEY=your_api_key_here
ANTHROPIC_API_KEY=your_api_key_here
```

The benchmark also requires Python 3.8+ and a Chrome/Chromium browser, which Selenium drives via webdriver-manager.

## Usage

The benchmark can be run in either serial or parallel mode (a sketch of the parallel dispatch pattern follows the argument reference below):

### Parallel Mode (Default)
```bash
python run.py --tasks data/dom_tasks.jsonl --output results --max-workers 4 --evaluate
```

### Serial Mode
```bash
python run.py --tasks data/dom_tasks.jsonl --output results --mode serial --evaluate
```

### Key Arguments
- `--tasks`: Path to JSONL file containing tasks
- `--output`: Output directory for results
- `--mode`: Run tasks in 'serial' or 'parallel' mode (default: parallel)
- `--max-workers`: Number of parallel workers (default: 4)
- `--evaluate`: Run GPT-4V evaluation after tasks complete
- `--evaluate-mode`: Run evaluations in 'serial' or 'parallel' mode (default: parallel)
- `--save-accessibility-tree`: Save accessibility trees for each task
- `--wait-time`: Wait time between actions in seconds (default: 2.0)
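
As referenced above, here is a rough, hypothetical sketch of the dispatch pattern behind parallel mode, using Python's `concurrent.futures`. It is not the repository's `parallel_runner.py`; `run_single_task` stands in for the real Selenium-driven execution.

```python
# Hypothetical sketch of parallel dispatch (not the actual parallel_runner.py).
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_single_task(task):
    # Stand-in: the real runner drives Selenium, captures screenshots,
    # and records an accessibility tree for each task.
    return {"id": task.get("id"), "success": True}

def run_parallel(tasks_path, max_workers=4):
    with open(tasks_path) as f:
        tasks = [json.loads(line) for line in f if line.strip()]

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_single_task, t): t for t in tasks}
        for future in as_completed(futures):
            results.append(future.result())
    return results

if __name__ == "__main__":
    print(run_parallel("data/dom_tasks.jsonl", max_workers=4))
```
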

## Directory Structure

```
DOMe-and-DOMer-2/
├── data/
│ ├── dom_tasks.jsonl # Task definitions
│ └── task_schema.json # JSON schema for tasks
├── evaluation/
│ ├── auto_eval.py # Evaluation orchestrator
│ ├── parallel_eval.py # Parallel evaluation implementation
│ ├── image_match.py # GPT-4V image comparison
│ └── fuzzy_match.py # HTML structure comparison
├── parallel_runner.py # Parallel task execution
├── serial_runner.py # Serial task execution
├── utils.py # Shared utilities
├── run.py # Main entry point
└── pyproject.toml          # Project configuration and dependencies
```

## Output Structure
Results are saved in the specified output directory:
```
output_dir/
├── results.json # Task execution results
├── evaluation.json # GPT-4V evaluation results
├── benchmark.log # Execution logs
├── *_before.png # Screenshots before interaction
├── *_after.png # Screenshots after interaction
└── *_tree.json # Accessibility trees (if enabled)
```
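
A small sketch of consuming these files after a run; the exact schemas of `results.json` and `evaluation.json` are not spelled out here, so the field names below are assumptions.

```python
# Sketch: summarizing a run from its output directory.
# The schemas of results.json / evaluation.json are assumptions here.
import json
from pathlib import Path

output_dir = Path("results")
results = json.loads((output_dir / "results.json").read_text())
evaluation = json.loads((output_dir / "evaluation.json").read_text())

# Assumes results.json is a list of per-task records with a "success" flag.
succeeded = sum(1 for r in results if r.get("success"))
print(f"{succeeded}/{len(results)} tasks reported success")

# Assumes evaluation.json carries a list of per-task evaluations.
print(f"{len(evaluation.get('evaluations', []))} GPT-4V evaluations recorded")
```
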
## Task Format

Tasks are defined in `data/dom_tasks.jsonl`:

```json
{
  "id": "task_id",
  "task": "Click the search box and type 'hello'",
  "web": "https://example.com",
  "interaction": "type",
  "target_element": {
    "type": "css",
    "value": "#searchbox"
  },
  "input_text": "hello",
  "ground_truth": {
    "screenshot": "path/to/ground_truth.png"
  }
}
```

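Since the directory layout above includes `data/task_schema.json`, tasks can plausibly be validated before a run. A sketch using the `jsonschema` package (an assumption; it is not listed among the dependencies above):

```python
# Sketch: validating each task line against data/task_schema.json.
# Assumes the jsonschema package is available (not listed above).
import json
from jsonschema import validate

with open("data/task_schema.json") as f:
    schema = json.load(f)

with open("data/dom_tasks.jsonl") as f:
    for line in f:
        if line.strip():
            validate(instance=json.loads(line), schema=schema)

print("All tasks conform to the schema.")
```
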
## Evaluation

The benchmark uses GPT-4V to evaluate task success by comparing:
1. Before/after screenshots with ground truth
2. DOM structure changes
3. Task completion criteria

Evaluation can be run in parallel or serial mode and produces detailed scoring and reasoning for each task.

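For orientation, here is a minimal sketch of the kind of GPT-4V comparison request that `evaluation/image_match.py` might issue through the `openai` client. The model name, prompt, and response handling are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch only: a GPT-4V style screenshot comparison request.
# The real prompt, model name, and scoring logic live in evaluation/image_match.py
# and may differ from what is shown here.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def compare(after_png, ground_truth_png):
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption; the benchmark may pin a different vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does the first screenshot match the expected state shown in the second? Answer YES or NO with a short reason."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(after_png)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(ground_truth_png)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```
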
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
Binary file added evaluation/ground_truth/task_2_gt.png
Binary file added evaluation/ground_truth/task_3_gt.png
Binary file added evaluation/ground_truth/task_4_gt.png
Binary file added evaluation/ground_truth/task_5_gt.png
52 changes: 52 additions & 0 deletions examples/model_usage.py
@@ -0,0 +1,52 @@
"""Example usage of different models in the DOM benchmark."""

import os
from dotenv import load_dotenv
from models import GPT4Model, ClaudeModel
from utils import TaskExecutor

# Load environment variables
load_dotenv()

def run_example_task(model, task):
    """Run a single task with the given model and print results."""
    executor = TaskExecutor()
    print(f"\nRunning task with {model.__class__.__name__}:")
    print(f"Task: {task['task']}")

    result = model.run_task(task, executor)

    print(f"Success: {result.success}")
    if result.error:
        print(f"Error: {result.error}")
    print(f"Time taken: {result.time_taken:.2f}s")
    return result

def main():
    # Initialize models
    gpt4_model = GPT4Model(api_key=os.getenv("OPENAI_API_KEY"))
    claude_model = ClaudeModel(api_key=os.getenv("ANTHROPIC_API_KEY"))

    # Example task
    task = {
        "task": "Click the 'Sign In' button",
        "target_element": {
            "type": "css",
            "value": "#signin-button"
        },
        "interaction": "click"
    }

    # Run with both models
    gpt4_result = run_example_task(gpt4_model, task)
    claude_result = run_example_task(claude_model, task)

    # Compare results
    print("\nComparison:")
    print(f"GPT-4 success: {gpt4_result.success}")
    print(f"Claude success: {claude_result.success}")
    print(f"GPT-4 time: {gpt4_result.time_taken:.2f}s")
    print(f"Claude time: {claude_result.time_taken:.2f}s")

if __name__ == "__main__":
    main()
5 changes: 5 additions & 0 deletions models/__init__.py
@@ -0,0 +1,5 @@
from .base import BaseModel, WebInteraction, TaskResult
from .gpt4 import GPT4Model
from .claude import ClaudeModel

__all__ = ['BaseModel', 'WebInteraction', 'TaskResult', 'GPT4Model', 'ClaudeModel']
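
Judging from how `examples/model_usage.py` above calls `model.run_task(task, executor)` and reads `success`, `error`, and `time_taken` off the result, the adapter interface plausibly looks like the sketch below. The authoritative definitions live in `models/base.py`, which is not included in this excerpt.

```python
# Plausible shape of the adapter interface, inferred from the example usage above.
# The real classes live in models/base.py and may carry additional fields/methods.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class WebInteraction:
    action: str                      # e.g. "click" or "type"
    selector_type: str               # e.g. "css" or "id"
    selector_value: str
    input_text: Optional[str] = None

@dataclass
class TaskResult:
    success: bool
    error: Optional[str] = None
    time_taken: float = 0.0

class BaseModel(ABC):
    @abstractmethod
    def run_task(self, task: dict, executor) -> TaskResult:
        """Plan the interaction for a task and execute it via the executor."""
```
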
