ntp-toolkit

The neural theorem proving toolkit transforms Lean repositories into datasets for training and evaluating machine learning models.

The toolkit is originally a fork of Kim Morrison's lean-training-data and developed in miniCTX.

Running extraction

To run the full pipeline on all repositories in a config file in the configs directory:

python scripts/extract_repos.py --cwd {filepath_of_this_repo} --config {filepath_of_config_file} [flags to indicate which processes to run]

The flags that can be set to indicate which processes to run are:

--training_data: This outputs to the TacticPrediction directory
--full_proof_training_data: This outputs to the FullProof directory
--premises: This outputs to the Premises directory
--state_comments: This outputs to the StateComments directory
--full_proof_training_data_states: This outputs to the FullProofWithStates directory
--training_data_with_premises: This outputs to the TrainingDataWithPremises directory
--declarations: This outputs to the Declarations directory
--imports: This outputs to the Imports directory

At least one of the above flags must be set in order for the script to run (but there should be no issue with setting multiple or even all of the above flags)

On a Macbook Pro (M3 Max, 14 CPU) it takes around 2 hours to run the extractions on mathlib.

To run a tool individually, use lake exe <tool>.
The run_pipeline.py script uses Python to call tools in this way and organize the resulting files.

Extraction tools:

`training_data`

This produces a .jsonl file where each line is an example of the following form:

{
   "state": "{tactic state}",
   "nextTactic" : "{pretty-printed next tactic}",
   "srcUpToTactic" : "{source code in the file up to the tactic invocation}",
   "declUpToTactic" : "{source code in the declaration up to the tactic invocation}",
   "decl": "{declaration without proof (e.g., statement of a theorem)}",
   "declId": "{unique identifier of the declaration}"
}

`full_proof_training_data`

This produces a .jsonl file where each line is an example of the following form:

{
   "srcUpToDecl":"{source code in the file up to the declaration}",
   "decl": "{declaration without proof (e.g., statement of a theorem)}",
   "declId": "{unique identifier of the declaration}",
   "proof":"{proof}"
}

`state_comments`

This produces Lean source files with proof states interleaved as comments after each tactic.

`premises`

This produces premises used by each constant in a module.

`declarations`

This produces information that pretty-prints each declaration in a module. The resulting format is

{
   "name": "pow_two",
   "kind": "theorem",
   "args": ["{M : Type u_2}", "[Monoid M]", "(a : M)"],
   "type": "a ^ 2 = a * a",
   "doc": "Note that most of the lemmas about powers of two refer to it as `sq`.",
   "decl": "/-- Note that most of the lemmas about powers of two refer to it as `sq`. -/\ntheorem pow_two {M : Type u_2} [Monoid M] (a : M) : a ^ 2 = a * a",
   "line": 810,
   "column": 0
}

`imports`

This outputs the imports of each module (both transitively imported modules and directly imported modules). The resulting format is

{
   "name": "{imported module name}",
   "isDirect": "{whether it is explicitly imported by the module (otherwise it is transitively imported)}"
}

`training_data_with_premises`

This produces a .jsonl file where each line contains all of the information in training_data plus the following fields:

{
   ...
   "nextTacticIsSimpOrRwVariant": "{A boolean indicating whether the next tactic is one of several `simp` or `rw` variants}",
   "nextTacticTermPremises": "{Prop-typed constants that appear in the proof term generated by the tactic}",
   "nextTacticExplicitConstants": "{Constants that appear in tactic syntax}",
   "nextTacticExplicitPremises": "{Prop-typed constants that appear in tactic syntax}",
   "nextTacticExplicitLocalHypotheses": "{Local context Prop-typed hypotheses that appear in tactic syntax}",
   "nextTacticAllPremises": "{The union of nextTacticTermPremises and nextTacticExplicitPremises}",
   "nextTacticHammerRecommendation": "{Best guess of which constants would help a hammer based on the constants that appear in the next tactic}",
   "declTermPremises": "{Prop-typed constants that appear in the overall declaration's proof term}",
   "declExplicitConstants": "{Constants that appear in the tactic syntax for any tactic used to create the overall declaration's proof}",
   "declExplicitPremises": "{Prop-typed constants that appear in the tactic syntax for any tactic used to create the overall declaration's proof}",
   "declAllPremises": "{The union of declTermPremises and declExplicitPremises}",
   "declHammerRecommendation": "{The union of each nextTacticHammerRecommendation for every tactic used to create the overall declaration's proof}",
   "termPremisesToEndOfGoal": "{Prop-typed constants that appear in the sum of all proof terms used to close the current goal}",
   "explicitConstantsToEndOfGoal": "{Constants that appear in the tactic syntax for any tactic used to close the current goal}",
   "explicitPremisesToEndOfGoal": "{Prop-typed constants that appear in the tactic syntax for any tactic used to close the current goal}",
   "allPremisesToEndOfGoal": "{The union of termPremisesToEndOfGoal and explicitPremisesToEndOfGoal}",
   "hammerRecommendationToEndOfGoal": "{The union of each nextTacticHammerRecommendation for every tactic used to close the current goal}"
}

Note: Of the above additional fields, all ...ToEndOfGoal fields are potentially approximations. They currently appear to be accurate for the patterns I have tested, but it's hard to check whether there may be edge cases for which they are inaccurate.

Adding new extraction tasks

You can add a new extraction task, by:

Add a relevant flag to scripts/extract_repos.py. Also change the _lakefile method accordingly, and add the flag to flags.
Then add an entry in TASKS to scripts/run_pipeline.py

Running instruction tuning data generation

After extraction, you can generate various forms of (prompt, completion) examples for fine-tuning language models.

To do so, run:

python scripts/instruction_tuning.py --prompt context_state_tactic

See python scripts/instruction_tuning.py -h for other options for --prompt or other settings.

The prompt includes a natural language description of the task, commonly referred to as an "instruction" (hence the name instruction tuning data).

Other setup docs from `lean-training-data`

You may find these useful during setup.

Install elan by running

curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh

Run lake exe cache get (this downloads precompiled binaries for Mathlib).
Run lake build
Run lake exe <tool>, where <tool> is one of the programs documented below.

Projects

Projects that use or build upon ntp-toolkit:

Submit a PR to add your project or paper here!

Citation

The toolkit is originally a fork of Kim Morrison's lean-training-data:

@misc{lean-training-data,
  author = {Kim Morrison},
  title = {lean-training-data},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/semorrison/lean-training-data}},
}

The ntp-toolkit was initially developed in miniCTX:

@misc{hu2024minictxneuraltheoremproving,
      title={miniCTX: Neural Theorem Proving with (Long-)Contexts},
      author={Jiewen Hu and Thomas Zhu and Sean Welleck},
      year={2024},
      eprint={2408.03350},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.03350},
}

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
TrainingData		TrainingData
configs		configs
scripts		scripts
.gitignore		.gitignore
Examples.lean		Examples.lean
README.md		README.md
TrainingData.lean		TrainingData.lean
lake-manifest.json		lake-manifest.json
lakefile.lean		lakefile.lean
lean-toolchain		lean-toolchain
temp.lean		temp.lean

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ntp-toolkit

Running extraction

Extraction tools:

`training_data`

`full_proof_training_data`

`state_comments`

`premises`

`declarations`

`imports`

`training_data_with_premises`

Adding new extraction tasks

Running instruction tuning data generation

Other setup docs from `lean-training-data`

Projects

Citation

About

Releases

Packages

Contributors 4

Languages

cmu-l3/ntp-toolkit

Folders and files

Latest commit

History

Repository files navigation

ntp-toolkit

Running extraction

Extraction tools:

training_data

full_proof_training_data

state_comments

premises

declarations

imports

training_data_with_premises

Adding new extraction tasks

Running instruction tuning data generation

Other setup docs from lean-training-data

Projects

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

`training_data`

`full_proof_training_data`

`state_comments`

`premises`

`declarations`

`imports`

`training_data_with_premises`

Other setup docs from `lean-training-data`

Packages