Experiments with Llemma and hypertree proof search (HTPS) for Lean4.
Relies on
Run bash scripts/eval_llemma7b.sh
to run Llemma 7b (with awq quantization)
on the minif2f validation data.
Compute the accuracy with python compute_metrics.py
.
This should get around 26% accuracy.