This is an automated benchmark for solving hard backend systems problems.
Each task asks the agent to implement an API specification against a specific backend stack. Currently, we support Convex and Python/FastAPI/Postgres on Modal.
The implementation is then graded on a few categories:
- Unit tests: Each task specifies some direct correctness tests.
- Model testing: We use Elle (Jepsen's transactional consistency checker) to verify that the implementation behaves correctly under high concurrency.
- Performance testing (TODO): Maximum throughput before congestion collapse.
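For context, Elle works on a recorded history of transaction invocations and completions. The exact event format this benchmark emits is not shown here, but a minimal list-append history, in the newline-delimited JSON form elle-cli accepts, could be assembled roughly like this (the `history_entry` helper and `history.json` file name are illustrative, not part of the repo):

```python
import json

def history_entry(index, process, typ, txn):
    # One Elle history event: an invocation or completion of a transaction.
    # Micro-ops are [op, key, value] triples, e.g. ["append", "x", 1].
    return {"index": index, "process": process, "type": typ, "f": "txn", "value": txn}

# A single process appends 1 to key "x", then reads the key back.
# Reads are recorded as None on invoke and filled in on completion.
history = [
    history_entry(0, 0, "invoke", [["append", "x", 1], ["r", "x", None]]),
    history_entry(1, 0, "ok",     [["append", "x", 1], ["r", "x", [1]]]),
]

# elle-cli consumes one JSON event per line.
with open("history.json", "w") as f:
    for event in history:
        f.write(json.dumps(event) + "\n")
```

Elle then searches histories like this for cycles that violate the chosen consistency model.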
Install dependencies:

```shell
pdm install
```
Also, put `OPENAI_API_KEY`, `BRAINTRUST_API_KEY`, and `ANTHROPIC_API_KEY` in `.env`.
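A minimal `.env` might look like the following; the values are placeholders for your own keys:

```shell
OPENAI_API_KEY=your-openai-key
BRAINTRUST_API_KEY=your-braintrust-key
ANTHROPIC_API_KEY=your-anthropic-key
```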
We also depend on elle-cli (TODO: find a better way to consume this dependency). Build it in a sibling directory:
```shell
brew install leiningen
cd ..
git clone https://github.com/ligurio/elle-cli
cd elle-cli
lein deps
lein uberjar
```
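Once built, the checker can be invoked directly against a recorded history. The invocation below follows elle-cli's own README; the jar version in the path depends on what `lein uberjar` produced, and `history.json` stands in for whatever history file the benchmark records:

```shell
# Check a list-append history against Elle's list-append model.
# Adjust the version in the jar name to match your build output.
java -jar target/elle-cli-*-standalone.jar --model elle-list-append history.json
```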