A tool that analyzes GitHub issues to identify and categorize API contract violations using large language models (LLMs).
- Advanced Contract Analysis: Leverages LLMs to analyze GitHub issues for potential API contract violations
- Multiple Storage Options:
  - JSON storage for detailed analysis results
  - CSV export functionality for data analysis
  - MongoDB integration for scalable data storage
- Robust Data Processing:
  - Support for both direct GitHub API fetching and CSV file input
  - Automatic checkpointing for long-running analyses
  - Intermediate results saving
  - Graceful shutdown handling
- Modular Architecture:
  - Pluggable storage backends
  - Extensible analyzer framework
  - Configurable LLM clients
- Progress Tracking: Real-time progress monitoring with customizable trackers
- Clone the repository:
git clone https://github.com/thromel/llm-contracts-research.git
cd llm-contracts-research
- Install dependencies:
pip install -r requirements.txt
- Create a .env file with your configuration:
# API Keys
GITHUB_TOKEN=your_github_token
OPENAI_API_KEY=your_openai_key
# OpenAI Settings
OPENAI_MODEL=your_model_name
OPENAI_BASE_URL=your_api_base_url
OPENAI_TEMPERATURE=0.7
OPENAI_MAX_TOKENS=2000
OPENAI_TOP_P=1.0
OPENAI_FREQUENCY_PENALTY=0.0
OPENAI_PRESENCE_PENALTY=0.0
# MongoDB Settings (Optional)
MONGODB_URI=your_mongodb_uri
MONGODB_DB=your_database_name
MONGODB_ENABLED=true
# Analysis Settings
BATCH_SIZE=50
MAX_COMMENTS_PER_ISSUE=10
DEFAULT_LOOKBACK_DAYS=1000
SAVE_INTERMEDIATE=true
JSON_EXPORT=true
CSV_EXPORT=true
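As a rough illustration of how these variables might be consumed, the sketch below reads a few of them from an environment mapping with typed defaults. The `Settings` dataclass and the `env_bool` and `load_settings` helpers are hypothetical names, not necessarily what `src/config/settings.py` contains. It assumes the `.env` values have already been exported into the process environment, and the fallback values simply mirror the sample above (with MongoDB assumed off when unset).

```python
import os
from dataclasses import dataclass
from typing import Mapping


def env_bool(env: Mapping[str, str], key: str, default: bool) -> bool:
    """Interpret common truthy strings ("true", "1", "yes", "on") as True."""
    raw = env.get(key)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field names follow the .env keys above.
    openai_temperature: float
    openai_max_tokens: int
    batch_size: int
    max_comments_per_issue: int
    mongodb_enabled: bool
    json_export: bool
    csv_export: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, falling back to the sample values."""
    return Settings(
        openai_temperature=float(env.get("OPENAI_TEMPERATURE", "0.7")),
        openai_max_tokens=int(env.get("OPENAI_MAX_TOKENS", "2000")),
        batch_size=int(env.get("BATCH_SIZE", "50")),
        max_comments_per_issue=int(env.get("MAX_COMMENTS_PER_ISSUE", "10")),
        mongodb_enabled=env_bool(env, "MONGODB_ENABLED", False),  # assumed default
        json_export=env_bool(env, "JSON_EXPORT", True),
        csv_export=env_bool(env, "CSV_EXPORT", True),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to exercise in tests without touching the real environment.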
src/
├── analysis/
│ ├── core/
│ │ ├── analyzers/
│ │ │ ├── contract_analyzer.py # Core contract analysis logic
│ │ │ ├── github.py # GitHub-specific analysis
│ │ │ └── orchestrator.py # Analysis orchestration
│ │ ├── clients/
│ │ │ ├── github.py # GitHub API client
│ │ │ └── openai.py # OpenAI API client
│ │ ├── processors/
│ │ │ ├── cleaner.py # Response cleaning
│ │ │ ├── validator.py # Analysis validation
│ │ │ └── checkpoint.py # Checkpoint management
│ │ ├── storage/
│ │ │ ├── json_storage.py # JSON storage implementation
│ │ │ ├── csv_storage.py # CSV storage implementation
│ │ │ └── mongodb/ # MongoDB integration
│ │ └── dto/ # Data transfer objects
│ └── main.py # Main entry point
├── config/
│ └── settings.py # Configuration settings
└── utils/
  └── logger.py # Logging utilities
- Analyzing issues from a GitHub repository:
python -m src.analysis.main --repo owner/repo --issues 100
- Analyzing issues from a CSV file:
python -m src.analysis.main --input-csv path/to/issues.csv
- --resume: Resume from the last checkpoint if available
- --checkpoint-interval N: Create checkpoints every N issues (default: 5)
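For readers who want to see how such a command line could be wired up, here is a minimal `argparse` sketch covering the flags documented above. The flag names and the checkpoint default of 5 come from this README. Everything else is an assumption: the `build_parser` helper is hypothetical, `--repo` and `--input-csv` are treated as mutually exclusive, and `--issues` is left without a default because none is documented.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(prog="src.analysis.main")

    # Assumption: exactly one issue source must be chosen.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--repo", help="GitHub repository in owner/repo form")
    source.add_argument("--input-csv", help="Path to a CSV file of issues")

    parser.add_argument("--issues", type=int, default=None,
                        help="Number of issues to analyze")
    parser.add_argument("--resume", action="store_true",
                        help="Resume from the last checkpoint if available")
    parser.add_argument("--checkpoint-interval", type=int, default=5,
                        help="Create checkpoints every N issues")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```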
The analyzer supports multiple storage backends that can be configured in your .env file:
- JSON Storage: Enable with JSON_EXPORT=true
- CSV Storage: Enable with CSV_EXPORT=true
- MongoDB Storage: Enable with MONGODB_ENABLED=true and configure the connection settings (MONGODB_URI, MONGODB_DB)
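The flag-driven backend selection described above can be sketched as a small factory. This is an illustration under stated assumptions, not the project's code: `Storage`, `JsonStorage`, `CsvStorage`, and `create_storages` are hypothetical names, the backends return strings instead of writing files, and MongoDB is left out so the example needs only the standard library.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, List, Mapping, Protocol

Record = Dict[str, Any]


class Storage(Protocol):
    """Common interface every backend is assumed to expose."""

    def save(self, records: List[Record]) -> str: ...


class JsonStorage:
    def save(self, records: List[Record]) -> str:
        return json.dumps(records, indent=2)


class CsvStorage:
    def save(self, records: List[Record]) -> str:
        if not records:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()


# Maps each .env flag to the backend it enables (hypothetical registry).
_BACKENDS: Dict[str, Callable[[], Storage]] = {
    "JSON_EXPORT": JsonStorage,
    "CSV_EXPORT": CsvStorage,
}


def create_storages(env: Mapping[str, str]) -> List[Storage]:
    """Instantiate every backend whose flag is set to "true"."""
    return [
        factory()
        for flag, factory in _BACKENDS.items()
        if env.get(flag, "").strip().lower() == "true"
    ]
```

Keeping the flag-to-backend mapping in one registry means a new backend only needs a class with a `save` method and one extra entry.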
- Analyze 50 issues with custom checkpoint interval:
python -m src.analysis.main --repo openai/openai-python --issues 50 --checkpoint-interval 10
- Resume a previously interrupted analysis:
python -m src.analysis.main --repo openai/openai-python --issues 50 --resume
- Analyze issues from a CSV file:
python -m src.analysis.main --input-csv data/raw/github_issues.csv
The analyzer generates several output files in the data/analyzed directory:
- JSON Output:
  - github_issues_analysis_TIMESTAMP_raw.json: Raw analysis data
  - github_issues_analysis_TIMESTAMP_final.json: Final analysis results
- CSV Output:
  - github_issues_analysis_TIMESTAMP_final.csv: Tabular format of analysis results
- Checkpoints:
  - analysis_checkpoint.json: Temporary checkpoint file
  - intermediate/: Directory containing intermediate analysis results
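To make the checkpoint file concrete, the sketch below shows one way a resumable run could persist and reload its state. The payload layout (`processed_ids`, `results`) and the `save_checkpoint` and `load_checkpoint` helpers are assumptions for illustration; the actual format of analysis_checkpoint.json is not documented here. The write goes through a temporary file followed by `os.replace`, so an interrupted run should not leave a half-written checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


def save_checkpoint(path: Path, processed_ids: List[int],
                    results: List[Dict[str, Any]]) -> None:
    """Write the checkpoint to a temp file, then atomically swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"processed_ids": processed_ids, "results": results}
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> Optional[Dict[str, Any]]:
    """Return the saved state, or None when no checkpoint exists yet."""
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

With `--resume`, a run could call `load_checkpoint` first and skip any issue whose ID already appears in `processed_ids`.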
- Analyzers:
  - ContractAnalyzer: Core analysis logic for contract violations
  - GitHubIssuesAnalyzer: GitHub-specific implementation
  - AnalysisOrchestrator: Coordinates the analysis process
- Storage:
  - Modular storage system with support for multiple backends
  - Factory pattern for storage creation
  - Adapter pattern for consistent interface
- Processors:
  - Response cleaning and validation
  - Checkpoint management
  - Progress tracking
- Factory Pattern: Used for storage backend creation
- Strategy Pattern: Used for different analysis strategies
- Adapter Pattern: Used for storage implementations
- Observer Pattern: Used for progress tracking
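As one example of the patterns listed above, here is a minimal sketch of the observer pattern applied to progress tracking. The class names (`ProgressObserver`, `PercentLog`, `ProgressSubject`) are hypothetical and not taken from the codebase. The idea is that the analysis loop only calls `advance`, while any number of attached trackers decide how to report progress.

```python
from typing import List, Protocol


class ProgressObserver(Protocol):
    """Anything that wants to be told about progress implements update()."""

    def update(self, done: int, total: int) -> None: ...


class PercentLog:
    """Example tracker that records a human-readable line per update."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def update(self, done: int, total: int) -> None:
        percent = done * 100 // total if total else 100
        self.messages.append(f"{done}/{total} ({percent}%)")


class ProgressSubject:
    """Holds the progress state and notifies every attached observer."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.done = 0
        self._observers: List[ProgressObserver] = []

    def attach(self, observer: ProgressObserver) -> None:
        self._observers.append(observer)

    def advance(self, step: int = 1) -> None:
        # Clamp so the count never exceeds the total.
        self.done = min(self.total, self.done + step)
        for observer in self._observers:
            observer.update(self.done, self.total)


if __name__ == "__main__":
    log = PercentLog()
    subject = ProgressSubject(total=4)
    subject.attach(log)
    subject.advance()
    subject.advance(2)
    print(log.messages)
```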
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests
- Submit a pull request
- Follow PEP 8 style guide
- Add type hints to all functions
- Write unit tests for new features
- Update documentation for significant changes
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions from the community! If you'd like to contribute improvements, fixes, or new features, please follow these guidelines:
- Fork the repository and clone your fork.
- Create a new branch for your changes (e.g., feature/your-feature or fix/issue-number).
- Make your changes with clear, concise commit messages.
- Ensure that your code adheres to the project's coding style (PEP 8).
- Write tests for your changes where applicable.
- Push your branch and open a pull request describing your changes.
- Consult the issue tracker before making major changes to avoid duplicated efforts.
Thank you for your interest in contributing to GitHub Issues Contract Violation Analyzer!