This is an automated benchmark for solving hard backend systems problems.
Each task asks the agent to implement an API specification against a specific backend stack. Currently, we support Convex and Python/FastAPI/Postgres on Modal.
The implementation is then graded on a few categories:
- Unit tests: Each task specifies some direct correctness tests.
- Model testing: We use Elle (Jepsen's transactional consistency checker) to verify that the implementation behaves correctly under high concurrency.
- Performance testing (TODO): Maximum throughput before congestion collapse.
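For context, Elle works on a recorded history of transaction invocations and completions. The exact event format this benchmark emits is not shown here, but a minimal list-append history, in the newline-delimited JSON form elle-cli accepts, could be assembled roughly like this (the `history_entry` helper and `history.json` file name are illustrative, not part of the repo):

```python
import json

def history_entry(index, process, typ, txn):
    # One Elle history event: an invocation or completion of a transaction.
    # Micro-ops are [op, key, value] triples, e.g. ["append", "x", 1].
    return {"index": index, "process": process, "type": typ, "f": "txn", "value": txn}

# A single process appends 1 to key "x", then reads the key back.
# Reads are recorded as None on invoke and filled in on completion.
history = [
    history_entry(0, 0, "invoke", [["append", "x", 1], ["r", "x", None]]),
    history_entry(1, 0, "ok",     [["append", "x", 1], ["r", "x", [1]]]),
]

# elle-cli consumes one JSON event per line.
with open("history.json", "w") as f:
    for event in history:
        f.write(json.dumps(event) + "\n")
```

Elle then searches histories like this for cycles that violate the chosen consistency model.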
Install dependencies:

```shell
pdm install
```
Also, put `OPENAI_API_KEY`, `BRAINTRUST_API_KEY`, and `ANTHROPIC_API_KEY` in `.env`.
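A minimal `.env` might look like the following; the values are placeholders for your own keys:

```shell
OPENAI_API_KEY=your-openai-key
BRAINTRUST_API_KEY=your-braintrust-key
ANTHROPIC_API_KEY=your-anthropic-key
```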
We also depend on elle-cli (TODO: find a better way to consume this dependency). Build it in a sibling directory:
```shell
brew install leiningen
cd ..
git clone https://github.com/ligurio/elle-cli
cd elle-cli
lein deps
lein uberjar
```
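Once built, the checker can be invoked directly against a recorded history. The invocation below follows elle-cli's own README; the jar version in the path depends on what `lein uberjar` produced, and `history.json` stands in for whatever history file the benchmark records:

```shell
# Check a list-append history against Elle's list-append model.
# Adjust the version in the jar name to match your build output.
java -jar target/elle-cli-*-standalone.jar --model elle-list-append history.json
```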