Evaluating Large Language Models on Formal Mathematics

First create a conda environment using environment.yml

conda create --name environment_name --file environment.yml

Building Updated IsarStep

cd build_isarstep

Replicating Results

Follow the instructions in the README of Build_Train_IsarStep to build the training set. Follow the instructions in the README of Build_Validation_IsarStep to build the validation set.

The above folders are pulled from here. These scripts were written by Wenda Li and modified by Jeffrey Ma to produce the desired splits.

Alternative Splits

To make an alternative split of the current AFP, just alter train_names.txt. All the names that appear in train_names.txt will be processed and built into the training dataset and all excluded names will be built into the validation set. There are helper functions in ../scripts/data/ to collect theory names and count lines of code.

Fine-tuning Models

cd scripts

Create a new config yml in configs/ based on config.yml. Make sure finetune.py uses the config file you want.

python finetune.py

Evaluating Models

Make sure gen.py and eval.py use the correct config files. Note that n specifies how many answers to produce for each question. If n > 1, only top-n accuracy will be evaluated. If n = 1, top-1 accuracy and BLEU score will be evaluated.

To generate model output run: python gen.py

To evaluate the output run: python eval.py

To visualize attention weights, use viz.ipynb and plugin your model.

Results

The 2019 IsarStep, 2022 IsarStep, and small IsarStep are available in datasets/ The trained models are available under models/ The outputs of the models are available under outputs/

Data Analysis

Scripts used for processing and analyzing the dataset are under scripts/data/ Some of the scripts were run on Yale's HPC Grace cluster due to the size of the datasets. The scripts to run these slurm jobs on the cluster are found under grace/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Evaluating Large Language Models on Formal Mathematics

Building Updated IsarStep

Replicating Results

Alternative Splits

Fine-tuning Models

Evaluating Models

Results

Data Analysis

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
build_isarstep		build_isarstep
grace		grace
scripts		scripts
README.md		README.md
environment.yml		environment.yml

jeffzma2000/Evaluating-LMs-on-Formal-Mathematics

Folders and files

Latest commit

History

Repository files navigation

Evaluating Large Language Models on Formal Mathematics

Building Updated IsarStep

Replicating Results

Alternative Splits

Fine-tuning Models

Evaluating Models

Results

Data Analysis

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages