How was our dataset curated? How are benchmark instances created?
To obtain testable real-world repositories from GitHub, we propose a fully automated curation pipeline that utilizes GitHub Actions CI and LLM assistance, eliminating the need for human involvement in benchmark construction.
python -m dibench.curate.crawling --help
- Searches GitHub for repositories within `star_range` for `language`, in 10-star batches.
- Checks each repo for CI workflows; if any are found, dumps the repo instance into a JSONL file (see the sketch below).
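The crawling logic can be pictured with a short sketch against the GitHub REST API using `requests`. The endpoints are real, but the batching loop, the qualification check, and the JSONL fields are illustrative assumptions, not the exact implementation behind `dibench.curate.crawling`.

```python
import json
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add an auth token for real crawls

def crawl(language: str, star_lo: int, star_hi: int, out_path: str) -> None:
    """Search repos with stars in [star_lo, star_hi] in 10-star batches; keep those with CI workflows."""
    with open(out_path, "a", encoding="utf-8") as out:
        for lo in range(star_lo, star_hi + 1, 10):
            query = f"language:{language} stars:{lo}..{min(lo + 9, star_hi)}"
            resp = requests.get(
                f"{GITHUB_API}/search/repositories",
                params={"q": query, "per_page": 100},
                headers=HEADERS,
                timeout=30,
            )
            resp.raise_for_status()
            for repo in resp.json().get("items", []):
                # A repo qualifies only if it ships GitHub Actions workflow files.
                wf = requests.get(
                    f"{GITHUB_API}/repos/{repo['full_name']}/contents/.github/workflows",
                    headers=HEADERS,
                    timeout=30,
                )
                if wf.status_code == 200:
                    out.write(json.dumps({"repo": repo["full_name"],
                                          "stars": repo["stargazers_count"]}) + "\n")

if __name__ == "__main__":
    crawl("python", 100, 200, "instances.jsonl")  # hypothetical parameters
```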
python -m dibench.curate.curate --help
- Locate the CI file that runs tests
- Locate the test job within that CI file
- Derive the `act` command for that job
- Sanitize & mask the dependency declarations
- Get the gold patch (a sketch of these steps follows)
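A rough sketch of how the first three steps might look, assuming PyYAML for parsing workflow files; the test-keyword heuristic and the helper names (`find_test_job`, `act_command`) are hypothetical, and the masking and gold-patch steps are omitted.

```python
from pathlib import Path
import yaml  # PyYAML

# Heuristic markers for a "test" step; the real pipeline may rely on LLM assistance instead.
TEST_KEYWORDS = ("pytest", "npm test", "cargo test", "go test", "dotnet test")

def find_test_job(repo_root: str):
    """Return (workflow_path, job_id) for the first CI job whose steps run tests."""
    for wf_path in Path(repo_root, ".github", "workflows").glob("*.y*ml"):
        workflow = yaml.safe_load(wf_path.read_text(encoding="utf-8"))
        for job_id, job in (workflow.get("jobs") or {}).items():
            for step in job.get("steps", []):
                if any(kw in (step.get("run") or "") for kw in TEST_KEYWORDS):
                    return wf_path, job_id
    return None, None

def act_command(wf_path: Path, job_id: str) -> list[str]:
    """Build the local `act` invocation that replays only the selected test job."""
    return ["act", "-W", str(wf_path), "-j", job_id]

wf, job = find_test_job(".")
if job:
    print(" ".join(act_command(wf, job)))
```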
python -m dibench.curate.verify --help
Expected:
- Tests pass when dependencies are unmasked
- Tests fail when dependencies are masked (see the verification sketch below)
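A minimal sketch of this two-sided check, assuming a per-instance `act` command and hypothetical `apply_gold` / `apply_mask` callables that restore or strip the dependency declarations.

```python
import subprocess

def run_ci(act_cmd: list[str], repo_dir: str) -> bool:
    """Run the CI test job locally via act; True iff it exits successfully."""
    return subprocess.run(act_cmd, cwd=repo_dir, capture_output=True).returncode == 0

def verify(act_cmd: list[str], repo_dir: str, apply_gold, apply_mask) -> bool:
    """An instance is kept only if tests pass unmasked and fail once masked."""
    apply_gold(repo_dir)    # restore the original dependency declarations
    passes_unmasked = run_ci(act_cmd, repo_dir)
    apply_mask(repo_dir)    # strip the dependency declarations
    fails_masked = not run_ci(act_cmd, repo_dir)
    return passes_unmasked and fails_masked
```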