The neural theorem proving toolkit transforms Lean repositories into datasets for training and evaluating machine learning models.
The toolkit is originally a fork of Kim Morrison's lean-training-data and developed in miniCTX.
To run the full pipeline on all repositories in a config file in the configs
directory:
python scripts/extract_repos.py --cwd {filepath_of_this_repo} --config {filepath_of_config_file} [flags to indicate which processes to run]
The flags that can be set to indicate which processes to run are:
--training_data
: This outputs to theTacticPrediction
directory--full_proof_training_data
: This outputs to theFullProof
directory--premises
: This outputs to thePremises
directory--state_comments
: This outputs to theStateComments
directory--full_proof_training_data_states
: This outputs to theFullProofWithStates
directory--training_data_with_premises
: This outputs to theTrainingDataWithPremises
directory--declarations
: This outputs to theDeclarations
directory--imports
: This outputs to theImports
directory
At least one of the above flags must be set in order for the script to run (but there should be no issue with setting multiple or even all of the above flags)
On a Macbook Pro (M3 Max, 14 CPU) it takes around 2 hours to run the extractions on mathlib.
To run a tool individually, use lake exe <tool>
.
The run_pipeline.py
script uses Python to call tools in this way and organize the resulting files.
This produces a .jsonl
file where each line is an example of the following form:
{
"state": "{tactic state}",
"nextTactic" : "{pretty-printed next tactic}",
"srcUpToTactic" : "{source code in the file up to the tactic invocation}",
"declUpToTactic" : "{source code in the declaration up to the tactic invocation}",
"decl": "{declaration without proof (e.g., statement of a theorem)}",
"declId": "{unique identifier of the declaration}"
}
This produces a .jsonl
file where each line is an example of the following form:
{
"srcUpToDecl":"{source code in the file up to the declaration}",
"decl": "{declaration without proof (e.g., statement of a theorem)}",
"declId": "{unique identifier of the declaration}",
"proof":"{proof}"
}
This produces Lean source files with proof states interleaved as comments after each tactic.
This produces premises used by each constant in a module.
This produces information that pretty-prints each declaration in a module. The resulting format is
{
"name": "pow_two",
"kind": "theorem",
"args": ["{M : Type u_2}", "[Monoid M]", "(a : M)"],
"type": "a ^ 2 = a * a",
"doc": "Note that most of the lemmas about powers of two refer to it as `sq`.",
"decl": "/-- Note that most of the lemmas about powers of two refer to it as `sq`. -/\ntheorem pow_two {M : Type u_2} [Monoid M] (a : M) : a ^ 2 = a * a",
"line": 810,
"column": 0
}
This outputs the imports of each module (both transitively imported modules and directly imported modules). The resulting format is
{
"name": "{imported module name}",
"isDirect": "{whether it is explicitly imported by the module (otherwise it is transitively imported)}"
}
This produces a .jsonl
file where each line contains all of the information in training_data
plus the following fields:
{
...
"nextTacticIsSimpOrRwVariant": "{A boolean indicating whether the next tactic is one of several `simp` or `rw` variants}",
"nextTacticTermPremises": "{Prop-typed constants that appear in the proof term generated by the tactic}",
"nextTacticExplicitConstants": "{Constants that appear in tactic syntax}",
"nextTacticExplicitPremises": "{Prop-typed constants that appear in tactic syntax}",
"nextTacticExplicitLocalHypotheses": "{Local context Prop-typed hypotheses that appear in tactic syntax}",
"nextTacticAllPremises": "{The union of nextTacticTermPremises and nextTacticExplicitPremises}",
"nextTacticHammerRecommendation": "{Best guess of which constants would help a hammer based on the constants that appear in the next tactic}",
"declTermPremises": "{Prop-typed constants that appear in the overall declaration's proof term}",
"declExplicitConstants": "{Constants that appear in the tactic syntax for any tactic used to create the overall declaration's proof}",
"declExplicitPremises": "{Prop-typed constants that appear in the tactic syntax for any tactic used to create the overall declaration's proof}",
"declAllPremises": "{The union of declTermPremises and declExplicitPremises}",
"declHammerRecommendation": "{The union of each nextTacticHammerRecommendation for every tactic used to create the overall declaration's proof}",
"termPremisesToEndOfGoal": "{Prop-typed constants that appear in the sum of all proof terms used to close the current goal}",
"explicitConstantsToEndOfGoal": "{Constants that appear in the tactic syntax for any tactic used to close the current goal}",
"explicitPremisesToEndOfGoal": "{Prop-typed constants that appear in the tactic syntax for any tactic used to close the current goal}",
"allPremisesToEndOfGoal": "{The union of termPremisesToEndOfGoal and explicitPremisesToEndOfGoal}",
"hammerRecommendationToEndOfGoal": "{The union of each nextTacticHammerRecommendation for every tactic used to close the current goal}"
}
Note: Of the above additional fields, all ...ToEndOfGoal
fields are potentially approximations. They currently appear to be accurate for the patterns I have tested, but it's hard to check whether there may be edge cases for which they are inaccurate.
You can add a new extraction task, by:
- Add a relevant flag to
scripts/extract_repos.py
. Also change the_lakefile
method accordingly, and add the flag toflags
. - Then add an entry in
TASKS
toscripts/run_pipeline.py
After extraction, you can generate various forms of (prompt, completion) examples for fine-tuning language models.
To do so, run:
python scripts/instruction_tuning.py --prompt context_state_tactic
See python scripts/instruction_tuning.py -h
for other options for --prompt
or other settings.
The prompt includes a natural language description of the task, commonly referred to as an "instruction" (hence the name instruction tuning data).
You may find these useful during setup.
- Install
elan
by running
curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh
- Run
lake exe cache get
(this downloads precompiled binaries forMathlib
). - Run
lake build
- Run
lake exe <tool>
, where<tool>
is one of the programs documented below.
Projects that use or build upon ntp-toolkit
:
- miniCTX: Neural Theorem Proving with (Long-)Contexts
- ImProver: Agent-Based Automated Proof Optimization
Submit a PR to add your project or paper here!
The toolkit is originally a fork of Kim Morrison's lean-training-data:
@misc{lean-training-data,
author = {Kim Morrison},
title = {lean-training-data},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/semorrison/lean-training-data}},
}
The ntp-toolkit
was initially developed in miniCTX:
@misc{hu2024minictxneuraltheoremproving,
title={miniCTX: Neural Theorem Proving with (Long-)Contexts},
author={Jiewen Hu and Thomas Zhu and Sean Welleck},
year={2024},
eprint={2408.03350},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2408.03350},
}