Final Commit

dhruvahuja19 committed Dec 20, 2024
1 parent 8fbea63 commit 117a792
Showing 11 changed files with 705 additions and 445 deletions.
119 changes: 61 additions & 58 deletions README.md
@@ -33,97 +33,100 @@ Required dependencies:
- requests
- beautifulsoup4
- openai
- anthropic
- google-generativeai
- python-dotenv

3. Set up your API keys in a `.env` file:

```bash
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GOOGLE_API_KEY=your_google_key_here
```
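
The keys can then be loaded at runtime with `python-dotenv`; a minimal sketch (illustrative only, not code from this repository):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

openai_key = os.environ["OPENAI_API_KEY"]       # required
anthropic_key = os.getenv("ANTHROPIC_API_KEY")  # only needed for Claude runs
google_key = os.getenv("GOOGLE_API_KEY")        # only needed for Gemini runs
```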

## Supported Models

The benchmark currently supports the following models:

1. **GPT-4 Turbo** (OpenAI)
   - Default model for both task execution and evaluation
   - High accuracy but subject to rate limits (3500 RPM)

2. **Claude 3 Haiku** (Anthropic)
   - Fast and efficient for task execution
   - Subject to stricter rate limits (5 RPM)
   - Use `--serial` flag for best results

3. **Gemini 1.5 Pro** (Google)
   - Latest version of Google's Gemini model
   - Good balance of speed and accuracy
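
For reference, here is a rough sketch of how the three providers' SDKs are typically called. It is illustrative only: the `ask_model` helper and the exact model IDs are assumptions, not the benchmark's actual model wrapper.

```python
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

def ask_model(model: str, prompt: str) -> str:
    """Route a prompt to the chosen provider and return the text reply."""
    if model == "gpt4":
        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if model == "claude":
        client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        resp = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    if model == "gemini":
        genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
        resp = genai.GenerativeModel("gemini-1.5-pro").generate_content(prompt)
        return resp.text
    raise ValueError(f"Unknown model: {model}")
```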

## Usage

The benchmark can be run in either serial or parallel mode:

### Parallel Mode (Default)
```bash
# Run with GPT-4
python -m benchmark --model gpt4 --tasks data/test_tasks.jsonl --output-dir results

# Run with Claude
python -m benchmark --model claude --tasks data/test_tasks.jsonl --output-dir results --serial

# Run with Gemini
python -m benchmark --model gemini --tasks data/test_tasks.jsonl --output-dir results
```

### Serial Mode
```bash
python -m benchmark --model [gpt4|claude|gemini] --tasks data/test_tasks.jsonl --output-dir results --serial
```
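
Parallel mode presumably fans tasks out to a worker pool (see `parallel_runner.py` in the directory structure below). A minimal sketch of that pattern with `concurrent.futures`, assuming a `run_one(task)` callable; this is not the repository's actual runner:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(tasks, run_one, max_workers: int = 4):
    """Run run_one(task) over all tasks with a thread pool, collecting results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # record failures instead of aborting the whole run
                results.append({"task_id": task.get("id"), "success": False, "error": str(exc)})
    return results
```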

### Key Arguments
- `--tasks`: Path to JSONL file containing tasks
- `--output`: Output directory for results
- `--mode`: Run tasks in 'serial' or 'parallel' mode (default: parallel)
- `--max-workers`: Number of parallel workers (default: 4)
- `--evaluate`: Run GPT-4V evaluation after tasks complete
- `--evaluate-mode`: Run evaluations in 'serial' or 'parallel' mode (default: parallel)
- `--save-accessibility-tree`: Save accessibility trees for each task
- `--wait-time`: Wait time between actions in seconds (default: 2.0)

### Evaluation

Results are automatically evaluated using GPT-4V for visual comparison and GPT-4 for HTML structure matching:

```bash
python -m evaluate --tasks data/test_tasks.jsonl --results-dir results --output results/evaluation.json
```

## Directory Structure

```
DOMe-and-DOMer-2/
├── data/
│   ├── dom_tasks.jsonl       # Task definitions
│   └── task_schema.json      # JSON schema for tasks
├── evaluation/
│   ├── auto_eval.py          # Evaluation orchestrator
│   ├── parallel_eval.py      # Parallel evaluation implementation
│   ├── image_match.py        # GPT-4V image comparison
│   └── fuzzy_match.py        # HTML structure comparison
├── parallel_runner.py        # Parallel task execution
├── serial_runner.py          # Serial task execution
├── utils.py                  # Shared utilities
├── run.py                    # Main entry point
└── pyproject.toml            # Project configuration and dependencies
```

## Output Structure

Results are saved in the specified output directory:

```
output_dir/
├── results.json       # Task execution results
├── evaluation.json    # GPT-4V evaluation results
├── benchmark.log      # Execution logs
├── *_before.png       # Screenshots before interaction
├── *_after.png        # Screenshots after interaction
└── *_tree.json        # Accessibility trees (if enabled)
```
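
For illustration, this is roughly what a GPT-4V screenshot comparison call looks like with the OpenAI SDK. It is a sketch of the general approach only, not the repository's `image_match.py`:

```python
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Base64-encode a screenshot so it can be sent inline."""
    return base64.b64encode(Path(path).read_bytes()).decode()

def compare_screenshots(after_png: str, ground_truth_png: str, task: str) -> str:
    """Ask a vision-capable model whether the 'after' screenshot matches the ground truth."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}\nDoes the first screenshot show the same end state as the ground truth? Answer yes or no and explain."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(after_png)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(ground_truth_png)}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content
```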

## Task Format

Tasks are defined in JSONL format with the following structure:

```json
{
    "web_name": "Website Name",
    "id": "unique_task_id",
    "task": "Description of the interaction task",
    "web": "https://website.url",
    "element_type": "button|input|link",
    "interaction": "click|type|hover",
    "target_element": {
        "type": "id|class|xpath",
        "value": "selector_value"
    },
    "input_text": "Text to type (for type interactions)",
    "target_html": "HTML of target element",
    "ground_truth": {
        "screenshot": "path/to/screenshot.png",
        "description": "Expected result description"
    }
}
```
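
A small sketch of reading tasks in this format; the `load_tasks` helper is hypothetical and not part of the repository:

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "task", "web", "interaction", "target_element"}

def load_tasks(path: str) -> list[dict]:
    """Parse one JSON object per line and check that the core fields are present."""
    tasks = []
    for line_no, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        task = json.loads(line)
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        tasks.append(task)
    return tasks

tasks = load_tasks("data/test_tasks_10.jsonl")
print(f"Loaded {len(tasks)} tasks; first: {tasks[0]['task']}")
```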

## Evaluation

The benchmark uses GPT-4V to evaluate task success by comparing:
1. Before/after screenshots with ground truth
2. DOM structure changes
3. Task completion criteria

Evaluation can be run in parallel or serial mode and produces detailed scoring and reasoning for each task.

## Rate Limits

Different models have different rate limits:
- GPT-4: 3500 requests per minute
- Claude: 5 requests per minute
- Gemini: 60 requests per minute

Use the `--serial` flag for models with strict rate limits (e.g., Claude) to avoid hitting limits.
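
A client-side throttle is one way to stay under these limits when not running serially. A minimal sketch, not the benchmark's implementation:

```python
import time

class RateLimiter:
    """Very small client-side throttle: at most `rpm` calls per minute."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self._last_call = 0.0

    def wait(self) -> None:
        """Sleep just long enough to respect the configured requests-per-minute budget."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

# e.g. Claude's 5 RPM limit means at most one request every 12 seconds
claude_limiter = RateLimiter(rpm=5)
```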
## Test Tasks

The repository includes two task sets:
- `data/test_tasks.jsonl`: Full test set with 100+ tasks
- `data/test_tasks_10.jsonl`: Smaller set of 10 tasks for quick testing

## Contributing

109 changes: 109 additions & 0 deletions analyze_insights.py
@@ -0,0 +1,109 @@
import json
from collections import defaultdict
from typing import Dict, List, Any

def load_results() -> List[Dict[str, Any]]:
    with open('results/results.json') as f:
        return json.load(f)

def analyze_results(results: List[Dict[str, Any]]) -> None:
    total_tasks = len(results)
    successes = [r for r in results if r.get('success', False)]
    failures = [r for r in results if not r.get('success', False)]

    print("\n=== Overall Statistics ===")
    print(f"Total Tasks: {total_tasks}")
    print(f"Success Rate: {len(successes)/total_tasks*100:.2f}% ({len(successes)} successes, {len(failures)} failures)")

    # Error Analysis
    error_types = defaultdict(int)
    for task in failures:
        error = task.get('error', 'Unknown error')
        if isinstance(error, str):
            # Simplify error messages to group similar errors
            if 'has no attribute' in error:
                error = "Missing attribute error"
            elif 'timeout' in error.lower():
                error = "Timeout error"
            elif 'not found' in error.lower():
                error = "Element not found"
            elif 'failed evaluation' in error.lower():
                error = "Failed evaluation checks"
        error_types[error] += 1

    print("\n=== Error Analysis ===")
    print("Common failure reasons:")
    for error, count in sorted(error_types.items(), key=lambda x: x[1], reverse=True):
        percentage = (count / len(failures)) * 100
        print(f"{error}: {percentage:.1f}% ({count} tasks)")

    # Task Type Analysis
    def categorize_task(task_desc: str) -> str:
        desc = task_desc.lower()
        if 'click' in desc:
            return 'Click'
        elif 'type' in desc or 'enter' in desc:
            return 'Type/Input'
        elif 'search' in desc:
            return 'Search'
        elif 'hover' in desc:
            return 'Hover'
        return 'Other'

    task_types = defaultdict(lambda: {'success': 0, 'fail': 0})
    for task in results:
        task_type = categorize_task(task.get('task_description', ''))
        if task.get('success', False):
            task_types[task_type]['success'] += 1
        else:
            task_types[task_type]['fail'] += 1

    print("\n=== Task Type Analysis ===")
    for task_type, stats in task_types.items():
        total = stats['success'] + stats['fail']
        success_rate = (stats['success']/total*100) if total > 0 else 0
        print(f"{task_type}: {success_rate:.1f}% success rate ({stats['success']}/{total} tasks)")

    # Website Analysis
    def extract_website(task_id: str) -> str:
        return task_id.split('_')[0] if '_' in task_id else 'unknown'

    website_stats = defaultdict(lambda: {'success': 0, 'fail': 0})
    for task in results:
        website = extract_website(task.get('task_id', 'unknown'))
        if task.get('success', False):
            website_stats[website]['success'] += 1
        else:
            website_stats[website]['fail'] += 1

    print("\n=== Website Performance ===")
    for website, stats in sorted(website_stats.items(),
                                 key=lambda x: (x[1]['success'] + x[1]['fail']),
                                 reverse=True):
        total = stats['success'] + stats['fail']
        if total < 2:  # Skip websites with very few tasks
            continue
        success_rate = (stats['success']/total*100)
        print(f"{website}: {success_rate:.1f}% success rate ({stats['success']}/{total} tasks)")

    # Example Analysis
    print("\n=== Example Cases ===")
    print("\nSuccessful Tasks:")
    for task in successes[:3]:
        print(f"✓ {task.get('task_description', '')}")
        print(f"  ID: {task.get('task_id', '')}")
        if task.get('error'):
            print(f"  Note: {task['error']}")
        print()

    print("\nFailed Tasks:")
    for task in failures[:3]:
        print(f"✗ {task.get('task_description', '')}")
        print(f"  ID: {task.get('task_id', '')}")
        if task.get('error'):
            print(f"  Error: {task['error']}")
        print()

if __name__ == "__main__":
    results = load_results()
    analyze_results(results)
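
Usage note (inferred from the hard-coded path in `load_results`): run `python analyze_insights.py` from the repository root after a benchmark pass has written `results/results.json`.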
2 changes: 1 addition & 1 deletion analyze_results.py
@@ -8,7 +8,7 @@

# Calculate success percentage
total_tasks = len(results)
successful_tasks = [result for result in results if result.get('final_score', 0) == 1]
successful_tasks = [result for result in results if result.get('final_score', 0) >= .8]
success_percentage = (len(successful_tasks) / total_tasks) * 100 if total_tasks > 0 else 0

print(f"\nResults Analysis:")
118 changes: 118 additions & 0 deletions dataset_cleaner.py
@@ -0,0 +1,118 @@
import json
import os
from pathlib import Path
from typing import Dict, List, Any, Optional
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

class DatasetCleaner:
    def __init__(self, results_file: str, api_key: Optional[str] = None):
        """Initialize the dataset cleaner.
        Args:
            results_file: Path to results.json file
            api_key: OpenAI API key (optional, will use environment variable if not provided)
        """
        self.results_file = Path(results_file)
        self.client = OpenAI(api_key=api_key)

    def analyze_result(self, result: Dict[str, Any]) -> Dict[str, Any]:
        """Analyze a single result entry to determine if it's valid."""
        response = self.client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "system",
                    "content": """You are an expert at analyzing web automation test results to determine if a test case is invalid.
A test case should be considered invalid if it encounters issues that make it unsuitable for benchmarking, such as:
1. CAPTCHA or verification challenges
2. Network or connection issues
3. Page timeouts or loading failures
4. Security blocks or authentication requirements
5. Missing or broken page elements
6. Browser crashes
7. Rate limiting or API errors
8. Geolocation restrictions"""
                },
                {
                    "role": "user",
                    "content": f"""Analyze this test result and determine if it should be excluded from benchmarking:
Task ID: {result['task_id']}
Success: {result['success']}
Error: {result.get('error', 'None')}
Task Description: {result['task_description']}
HTML Element: {result.get('html_element', 'None')}
Respond with a JSON object containing:
{{
    "is_valid": boolean,
    "reason": string explaining why the test case is invalid (if applicable),
    "confidence": float between 0 and 1
}}"""
                }
            ],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)

    def clean_dataset(self, min_confidence: float = 0.8) -> Dict[str, List[str]]:
        """Clean the dataset by analyzing results.json entries.
        Args:
            min_confidence: Minimum confidence threshold for filtering (default: 0.8)
        Returns:
            Dictionary containing lists of valid and invalid test cases
        """
        results = {
            "valid": [],
            "invalid": []
        }

        # Load and process results.json
        with open(self.results_file) as f:
            test_results = json.load(f)

        for result in test_results:
            analysis = self.analyze_result(result)

            if analysis["is_valid"] or analysis["confidence"] < min_confidence:
                results["valid"].append(result["task_id"])
            else:
                results["invalid"].append({
                    "task_id": result["task_id"],
                    "reason": analysis["reason"],
                    "confidence": analysis["confidence"]
                })

        # Save results
        output_path = self.results_file.parent / "dataset_cleaning_results.json"
        with open(output_path, "w") as f:
            json.dump(results, f, indent=2)

        print(f"Dataset cleaning results saved to {output_path}")
        print(f"Valid test cases: {len(results['valid'])}")
        print(f"Invalid test cases: {len(results['invalid'])}")
        print("\nInvalid test cases and reasons:")
        for invalid in results["invalid"]:
            print(f"- {invalid['task_id']}: {invalid['reason']} (confidence: {invalid['confidence']:.2f})")

        return results

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Clean benchmark dataset by filtering invalid test cases")
    parser.add_argument("results_file", help="Path to results.json file")
    parser.add_argument("--min-confidence", type=float, default=0.8,
                        help="Minimum confidence threshold for filtering (default: 0.8)")
    parser.add_argument("--api-key", help="OpenAI API key (optional)")

    args = parser.parse_args()

    # Use the --api-key flag if given, otherwise fall back to the environment variable
    cleaner = DatasetCleaner(args.results_file, args.api_key or os.getenv("OPENAI_API_KEY"))
    results = cleaner.clean_dataset(min_confidence=args.min_confidence)
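
Usage example (the input path mirrors the default output location used elsewhere in the repo): `python dataset_cleaner.py results/results.json --min-confidence 0.9` writes `dataset_cleaning_results.json` next to the input file.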
