Final Commit

dhruvahuja19 committed Dec 20, 2024
1 parent 8fbea63 commit 117a792
Showing 11 changed files with 705 additions and 445 deletions.
119 changes: 61 additions & 58 deletions README.md
@@ -33,97 +33,100 @@ Required dependencies:
- requests
- beautifulsoup4
- openai
- anthropic
- google-generativeai
- python-dotenv

3. Set up your API keys in a `.env` file:

```bash
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GOOGLE_API_KEY=your_google_key_here
```
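
The keys can then be loaded at runtime with `python-dotenv`; a minimal sketch (illustrative only, not code from this repository):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

openai_key = os.environ["OPENAI_API_KEY"]       # required
anthropic_key = os.getenv("ANTHROPIC_API_KEY")  # only needed for Claude runs
google_key = os.getenv("GOOGLE_API_KEY")        # only needed for Gemini runs
```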

## Supported Models

The benchmark currently supports the following models:

1. **GPT-4 Turbo** (OpenAI)
   - Default model for both task execution and evaluation
   - High accuracy but subject to rate limits (3500 RPM)

2. **Claude 3 Haiku** (Anthropic)
   - Fast and efficient for task execution
   - Subject to stricter rate limits (5 RPM)
   - Use `--serial` flag for best results

3. **Gemini 1.5 Pro** (Google)
   - Latest version of Google's Gemini model
   - Good balance of speed and accuracy
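
For reference, here is a rough sketch of how the three providers' SDKs are typically called. It is illustrative only: the `ask_model` helper and the exact model IDs are assumptions, not the benchmark's actual model wrapper.

```python
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

def ask_model(model: str, prompt: str) -> str:
    """Route a prompt to the chosen provider and return the text reply."""
    if model == "gpt4":
        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if model == "claude":
        client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        resp = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    if model == "gemini":
        genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
        resp = genai.GenerativeModel("gemini-1.5-pro").generate_content(prompt)
        return resp.text
    raise ValueError(f"Unknown model: {model}")
```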

## Usage

The benchmark can be run in either serial or parallel mode:

### Parallel Mode (Default)
```bash
# Run with GPT-4
python -m benchmark --model gpt4 --tasks data/test_tasks.jsonl --output-dir results

# Run with Claude
python -m benchmark --model claude --tasks data/test_tasks.jsonl --output-dir results --serial

# Run with Gemini
python -m benchmark --model gemini --tasks data/test_tasks.jsonl --output-dir results
```

### Serial Mode
```bash
python -m benchmark --model [gpt4|claude|gemini] --tasks data/test_tasks.jsonl --output-dir results --serial
```
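
Parallel mode presumably fans tasks out to a worker pool (see `parallel_runner.py` in the directory structure below). A minimal sketch of that pattern with `concurrent.futures`, assuming a `run_one(task)` callable; this is not the repository's actual runner:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(tasks, run_one, max_workers: int = 4):
    """Run run_one(task) over all tasks with a thread pool, collecting results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # record failures instead of aborting the whole run
                results.append({"task_id": task.get("id"), "success": False, "error": str(exc)})
    return results
```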

### Key Arguments
- `--tasks`: Path to JSONL file containing tasks
- `--output`: Output directory for results
- `--mode`: Run tasks in 'serial' or 'parallel' mode (default: parallel)
- `--max-workers`: Number of parallel workers (default: 4)
- `--evaluate`: Run GPT-4V evaluation after tasks complete
- `--evaluate-mode`: Run evaluations in 'serial' or 'parallel' mode (default: parallel)
- `--save-accessibility-tree`: Save accessibility trees for each task
- `--wait-time`: Wait time between actions in seconds (default: 2.0)

### Evaluation

Results are automatically evaluated using GPT-4V for visual comparison and GPT-4 for HTML structure matching:

```bash
python -m evaluate --tasks data/test_tasks.jsonl --results-dir results --output results/evaluation.json
```

## Directory Structure

```
DOMe-and-DOMer-2/
├── data/
│   ├── dom_tasks.jsonl       # Task definitions
│   └── task_schema.json      # JSON schema for tasks
├── evaluation/
│   ├── auto_eval.py          # Evaluation orchestrator
│   ├── parallel_eval.py      # Parallel evaluation implementation
│   ├── image_match.py        # GPT-4V image comparison
│   └── fuzzy_match.py        # HTML structure comparison
├── parallel_runner.py        # Parallel task execution
├── serial_runner.py          # Serial task execution
├── utils.py                  # Shared utilities
├── run.py                    # Main entry point
└── pyproject.toml            # Project configuration and dependencies
```

## Output Structure

Results are saved in the specified output directory:

```
output_dir/
├── results.json       # Task execution results
├── evaluation.json    # GPT-4V evaluation results
├── benchmark.log      # Execution logs
├── *_before.png       # Screenshots before interaction
├── *_after.png        # Screenshots after interaction
└── *_tree.json        # Accessibility trees (if enabled)
```
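
For illustration, this is roughly what a GPT-4V screenshot comparison call looks like with the OpenAI SDK. It is a sketch of the general approach only, not the repository's `image_match.py`:

```python
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Base64-encode a screenshot so it can be sent inline."""
    return base64.b64encode(Path(path).read_bytes()).decode()

def compare_screenshots(after_png: str, ground_truth_png: str, task: str) -> str:
    """Ask a vision-capable model whether the 'after' screenshot matches the ground truth."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}\nDoes the first screenshot show the same end state as the ground truth? Answer yes or no and explain."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(after_png)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(ground_truth_png)}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content
```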

## Task Format

Tasks are defined in JSONL format with the following structure:

```json
{
    "web_name": "Website Name",
    "id": "unique_task_id",
    "task": "Description of the interaction task",
    "web": "https://website.url",
    "element_type": "button|input|link",
    "interaction": "click|type|hover",
    "target_element": {
        "type": "id|class|xpath",
        "value": "selector_value"
    },
    "input_text": "Text to type (for type interactions)",
    "target_html": "HTML of target element",
    "ground_truth": {
        "screenshot": "path/to/screenshot.png",
        "description": "Expected result description"
    }
}
```
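
A small sketch of reading tasks in this format; the `load_tasks` helper is hypothetical and not part of the repository:

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "task", "web", "interaction", "target_element"}

def load_tasks(path: str) -> list[dict]:
    """Parse one JSON object per line and check that the core fields are present."""
    tasks = []
    for line_no, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        task = json.loads(line)
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        tasks.append(task)
    return tasks

tasks = load_tasks("data/test_tasks_10.jsonl")
print(f"Loaded {len(tasks)} tasks; first: {tasks[0]['task']}")
```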

## Evaluation

The benchmark uses GPT-4V to evaluate task success by comparing:
1. Before/after screenshots with ground truth
2. DOM structure changes
3. Task completion criteria

Evaluation can be run in parallel or serial mode and produces detailed scoring and reasoning for each task.

## Rate Limits

Different models have different rate limits:
- GPT-4: 3500 requests per minute
- Claude: 5 requests per minute
- Gemini: 60 requests per minute

Use the `--serial` flag for models with strict rate limits (e.g., Claude) to avoid hitting limits.
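
A client-side throttle is one way to stay under these limits when not running serially. A minimal sketch, not the benchmark's implementation:

```python
import time

class RateLimiter:
    """Very small client-side throttle: at most `rpm` calls per minute."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self._last_call = 0.0

    def wait(self) -> None:
        """Sleep just long enough to respect the configured requests-per-minute budget."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

# e.g. Claude's 5 RPM limit means at most one request every 12 seconds
claude_limiter = RateLimiter(rpm=5)
```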
## Test Tasks

The repository includes two task sets:
- `data/test_tasks.jsonl`: Full test set with 100+ tasks
- `data/test_tasks_10.jsonl`: Smaller set of 10 tasks for quick testing

## Contributing

109 changes: 109 additions & 0 deletions analyze_insights.py
@@ -0,0 +1,109 @@
import json
from collections import defaultdict
from typing import Dict, List, Any

def load_results() -> List[Dict[str, Any]]:
    with open('results/results.json') as f:
        return json.load(f)

def analyze_results(results: List[Dict[str, Any]]) -> None:
    total_tasks = len(results)
    successes = [r for r in results if r.get('success', False)]
    failures = [r for r in results if not r.get('success', False)]

    print("\n=== Overall Statistics ===")
    print(f"Total Tasks: {total_tasks}")
    print(f"Success Rate: {len(successes)/total_tasks*100:.2f}% ({len(successes)} successes, {len(failures)} failures)")

    # Error Analysis
    error_types = defaultdict(int)
    for task in failures:
        error = task.get('error', 'Unknown error')
        if isinstance(error, str):
            # Simplify error messages to group similar errors
            if 'has no attribute' in error:
                error = "Missing attribute error"
            elif 'timeout' in error.lower():
                error = "Timeout error"
            elif 'not found' in error.lower():
                error = "Element not found"
            elif 'failed evaluation' in error.lower():
                error = "Failed evaluation checks"
        error_types[error] += 1

    print("\n=== Error Analysis ===")
    print("Common failure reasons:")
    for error, count in sorted(error_types.items(), key=lambda x: x[1], reverse=True):
        percentage = (count / len(failures)) * 100
        print(f"{error}: {percentage:.1f}% ({count} tasks)")

    # Task Type Analysis
    def categorize_task(task_desc: str) -> str:
        desc = task_desc.lower()
        if 'click' in desc:
            return 'Click'
        elif 'type' in desc or 'enter' in desc:
            return 'Type/Input'
        elif 'search' in desc:
            return 'Search'
        elif 'hover' in desc:
            return 'Hover'
        return 'Other'

    task_types = defaultdict(lambda: {'success': 0, 'fail': 0})
    for task in results:
        task_type = categorize_task(task.get('task_description', ''))
        if task.get('success', False):
            task_types[task_type]['success'] += 1
        else:
            task_types[task_type]['fail'] += 1

    print("\n=== Task Type Analysis ===")
    for task_type, stats in task_types.items():
        total = stats['success'] + stats['fail']
        success_rate = (stats['success']/total*100) if total > 0 else 0
        print(f"{task_type}: {success_rate:.1f}% success rate ({stats['success']}/{total} tasks)")

    # Website Analysis
    def extract_website(task_id: str) -> str:
        return task_id.split('_')[0] if '_' in task_id else 'unknown'

    website_stats = defaultdict(lambda: {'success': 0, 'fail': 0})
    for task in results:
        website = extract_website(task.get('task_id', 'unknown'))
        if task.get('success', False):
            website_stats[website]['success'] += 1
        else:
            website_stats[website]['fail'] += 1

    print("\n=== Website Performance ===")
    for website, stats in sorted(website_stats.items(),
                                 key=lambda x: (x[1]['success'] + x[1]['fail']),
                                 reverse=True):
        total = stats['success'] + stats['fail']
        if total < 2:  # Skip websites with very few tasks
            continue
        success_rate = (stats['success']/total*100)
        print(f"{website}: {success_rate:.1f}% success rate ({stats['success']}/{total} tasks)")

    # Example Analysis
    print("\n=== Example Cases ===")
    print("\nSuccessful Tasks:")
    for task in successes[:3]:
        print(f"✓ {task.get('task_description', '')}")
        print(f"  ID: {task.get('task_id', '')}")
        if task.get('error'):
            print(f"  Note: {task['error']}")
        print()

    print("\nFailed Tasks:")
    for task in failures[:3]:
        print(f"✗ {task.get('task_description', '')}")
        print(f"  ID: {task.get('task_id', '')}")
        if task.get('error'):
            print(f"  Error: {task['error']}")
        print()

if __name__ == "__main__":
    results = load_results()
    analyze_results(results)
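
Usage note (inferred from the hard-coded path in `load_results`): run `python analyze_insights.py` from the repository root after a benchmark pass has written `results/results.json`.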
2 changes: 1 addition & 1 deletion analyze_results.py
@@ -8,7 +8,7 @@

# Calculate success percentage
total_tasks = len(results)
successful_tasks = [result for result in results if result.get('final_score', 0) == 1]
successful_tasks = [result for result in results if result.get('final_score', 0) >= .8]
success_percentage = (len(successful_tasks) / total_tasks) * 100 if total_tasks > 0 else 0

print(f"\nResults Analysis:")
118 changes: 118 additions & 0 deletions dataset_cleaner.py
@@ -0,0 +1,118 @@
import json
import os
from pathlib import Path
from typing import Dict, List, Any, Optional
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

class DatasetCleaner:
    def __init__(self, results_file: str, api_key: Optional[str] = None):
        """Initialize the dataset cleaner.
        Args:
            results_file: Path to results.json file
            api_key: OpenAI API key (optional, will use environment variable if not provided)
        """
        self.results_file = Path(results_file)
        self.client = OpenAI(api_key=api_key)

    def analyze_result(self, result: Dict[str, Any]) -> Dict[str, Any]:
        """Analyze a single result entry to determine if it's valid."""
        response = self.client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "system",
                    "content": """You are an expert at analyzing web automation test results to determine if a test case is invalid.
A test case should be considered invalid if it encounters issues that make it unsuitable for benchmarking, such as:
1. CAPTCHA or verification challenges
2. Network or connection issues
3. Page timeouts or loading failures
4. Security blocks or authentication requirements
5. Missing or broken page elements
6. Browser crashes
7. Rate limiting or API errors
8. Geolocation restrictions"""
                },
                {
                    "role": "user",
                    "content": f"""Analyze this test result and determine if it should be excluded from benchmarking:
Task ID: {result['task_id']}
Success: {result['success']}
Error: {result.get('error', 'None')}
Task Description: {result['task_description']}
HTML Element: {result.get('html_element', 'None')}
Respond with a JSON object containing:
{{
    "is_valid": boolean,
    "reason": string explaining why the test case is invalid (if applicable),
    "confidence": float between 0 and 1
}}"""
                }
            ],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)

    def clean_dataset(self, min_confidence: float = 0.8) -> Dict[str, List[str]]:
        """Clean the dataset by analyzing results.json entries.
        Args:
            min_confidence: Minimum confidence threshold for filtering (default: 0.8)
        Returns:
            Dictionary containing lists of valid and invalid test cases
        """
        results = {
            "valid": [],
            "invalid": []
        }

        # Load and process results.json
        with open(self.results_file) as f:
            test_results = json.load(f)

        for result in test_results:
            analysis = self.analyze_result(result)

            if analysis["is_valid"] or analysis["confidence"] < min_confidence:
                results["valid"].append(result["task_id"])
            else:
                results["invalid"].append({
                    "task_id": result["task_id"],
                    "reason": analysis["reason"],
                    "confidence": analysis["confidence"]
                })

        # Save results
        output_path = self.results_file.parent / "dataset_cleaning_results.json"
        with open(output_path, "w") as f:
            json.dump(results, f, indent=2)

        print(f"Dataset cleaning results saved to {output_path}")
        print(f"Valid test cases: {len(results['valid'])}")
        print(f"Invalid test cases: {len(results['invalid'])}")
        print("\nInvalid test cases and reasons:")
        for invalid in results["invalid"]:
            print(f"- {invalid['task_id']}: {invalid['reason']} (confidence: {invalid['confidence']:.2f})")

        return results

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Clean benchmark dataset by filtering invalid test cases")
    parser.add_argument("results_file", help="Path to results.json file")
    parser.add_argument("--min-confidence", type=float, default=0.8,
                        help="Minimum confidence threshold for filtering (default: 0.8)")
    parser.add_argument("--api-key", help="OpenAI API key (optional)")

    args = parser.parse_args()

    # Use the --api-key flag if given, otherwise fall back to the environment variable
    cleaner = DatasetCleaner(args.results_file, args.api_key or os.getenv("OPENAI_API_KEY"))
    results = cleaner.clean_dataset(min_confidence=args.min_confidence)
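
Usage example (the input path mirrors the default output location used elsewhere in the repo): `python dataset_cleaner.py results/results.json --min-confidence 0.9` writes `dataset_cleaning_results.json` next to the input file.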
