For an up-to-date codebase, issues, and pull requests, please continue to the new repository. This repository is no longer maintained, and issues and pull requests may be ignored.
This repository contains the code accompanying the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation". We also recommend reading our blog post "How EQT Motherbrain uses LLMs to map companies to industry sectors".
After cloning this repository, the necessary packages can be installed with:
pip install -r requirements.txt
pip install -e .
# if using a vertex ai notebook with CUDA
pip3 install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 --no-cache-dir
All experiments, including hyperparameter search, can be reproduced by running the following batch files:
bash preprocessing/preprocessing.sh
bash sectors/experiments/run_experiments_gpu.sh
bash sectors/experiments/run_experiments_cpu.sh
The scripts can also be run individually:
The preprocessed data for the hatespeech dataset is already included in this repository. However, the preprocessing can be rerun with:
python preprocessing/get_dataset.py
python preprocessing/preprocess_data.py # this line will take ~10 min as it summarizes long descriptions and keyword lists
The preprocessed dataset can be augmented by applying paraphrasing with vicuna:
python preprocessing/paraphrase_augmentation.py
This will create a new dataset at data/[DATASET]/train_augmented.json.
For test runs, all of the following commands include the --model_name=bigscience/bloom-560m flag, as this model can easily be run on a CPU. It can also be replaced with other LLaMA or BLOOM models hosted on Hugging Face; by default, huggyllama/llama-7b is used. All experimental results will be saved as JSON files in the results/[DATASET]/ directory.
python sectors/experiments/nshot/nshot.py --model_name bigscience/bloom-560m
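Each run writes its metrics to the results/[DATASET]/ directory. A minimal sketch of collecting those files for comparison is shown below; the file contents (a dict with an "f1" key) and the helper name collect_results are illustrative assumptions, not the repository's actual schema.

```python
import json
import os
import tempfile
from glob import glob

def collect_results(results_dir):
    """Load every result JSON in a directory, keyed by file name."""
    results = {}
    for path in sorted(glob(os.path.join(results_dir, "*.json"))):
        with open(path) as f:
            results[os.path.basename(path)] = json.load(f)
    return results

# Demo with a temporary directory standing in for results/[DATASET]/
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "nshot.json"), "w") as f:
        json.dump({"f1": 0.42}, f)
    print(collect_results(d))  # {'nshot.json': {'f1': 0.42}}
```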
In order to use gpt-3.5-turbo as a model for n-shot prompting, a .env file with the OpenAI API credentials needs to be added to the root directory of this repository:
OPENAI_SECRET_KEY = "secret key"
OPENAI_ORGANIZATION_ID = "org id"
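For reference, a .env file in this format can be loaded into environment variables with a package such as python-dotenv, or with a minimal standard-library sketch like the one below (the parsing logic here is illustrative, not the repository's actual loading code):

```python
import os
import tempfile

def load_env(path):
    """Parse KEY = "value" lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# Demo with a temporary .env file
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write('OPENAI_SECRET_KEY = "secret key"\n')
    env_path = f.name
load_env(env_path)
print(os.environ["OPENAI_SECRET_KEY"])  # secret key
os.unlink(env_path)
```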
For these experiments, the embeddings first have to be generated by running the following commands:
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m
# for augmented data
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m --augmented augmented
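A common way to obtain a single embedding per text from an LLM is to mean-pool the final hidden states over the non-padding tokens. The NumPy sketch below illustrates that pooling step with random arrays standing in for model outputs; it is a conceptual illustration, not the actual code in generate_embeddings.py.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average hidden states over real (non-padding) tokens.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1e-9)  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))            # 2 texts, 4 tokens, dim 8
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # second text has 2 real tokens
embeddings = mean_pool(hidden, mask)
print(embeddings.shape)  # (2, 8)
```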
Then, the following code runs all embedding proximity experiments:
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m --augmented augmented
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m --augmented augmented
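Conceptually, the two retrieval modes above differ in how neighbors are selected: KNN takes the k most similar embeddings, while RadiusNN takes every embedding whose similarity exceeds a threshold. The NumPy sketch below illustrates both with cosine similarity; the actual interface and defaults of vector_similarity.py may differ.

```python
import numpy as np

def cosine_sim(queries, corpus):
    """Pairwise cosine similarity between two sets of row vectors."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return q @ c.T

def knn(queries, corpus, k):
    """Indices of the k most similar corpus vectors per query."""
    sims = cosine_sim(queries, corpus)
    return np.argsort(-sims, axis=1)[:, :k]

def radius_nn(queries, corpus, radius):
    """Indices of all corpus vectors within a similarity threshold."""
    sims = cosine_sim(queries, corpus)
    return [np.where(row >= radius)[0] for row in sims]

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[1.0, 0.1]])
print(knn(query, corpus, k=2))  # [[0 2]]
```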
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m --augmented augmented
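The classification-head approach places a small trainable layer on top of the frozen embeddings. The sketch below shows the inference side of such a head for multi-label prediction (a linear layer, independent sigmoids, and a 0.5 threshold); the weights are fixed for illustration, and the training details of classification_head.py are an assumption not shown here.

```python
import numpy as np

def predict_sectors(embeddings, weights, bias, threshold=0.5):
    """Multi-hot sector predictions from embeddings via a linear head."""
    logits = embeddings @ weights + bias       # (batch, num_labels)
    probs = 1.0 / (1.0 + np.exp(-logits))      # independent sigmoid per label
    return probs >= threshold                  # multiple labels may fire

emb = np.array([[1.0, -1.0]])                  # one embedding of dim 2
W = np.array([[2.0, -2.0, 0.0],
              [0.0, 2.0, -2.0]])               # dim 2 -> 3 labels
b = np.zeros(3)
print(predict_sectors(emb, W, b))  # [[ True False  True]]
```

Unlike a softmax classifier, the sigmoid-per-label formulation lets one company be assigned to several industry sectors at once.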
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01 --augmented augmented
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01 --augmented augmented
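The prompt-tuning runs above train only a small matrix of "soft prompt" vectors that is prepended to the frozen token embeddings of each input. The NumPy sketch below shows just this prepending step; the dimensions and the zero initialization are illustrative, not the repository's actual settings.

```python
import numpy as np

# The soft prompt is the only trainable component; the LLM stays frozen.
num_virtual_tokens, dim = 8, 16
soft_prompt = np.zeros((num_virtual_tokens, dim))

def prepend_prompt(token_embeddings):
    """Prepend soft-prompt vectors to a (seq_len, dim) embedding matrix."""
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

tokens = np.random.default_rng(0).normal(size=(12, dim))  # frozen embeddings
model_input = prepend_prompt(tokens)
print(model_input.shape)  # (20, 16)
```

During training, gradients flow back into soft_prompt only, which is why prompt tuning is far cheaper than full fine-tuning.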
For an example of applying Trie Search, see notebooks/constrained_beam_search.ipynb.
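The idea behind Trie Search is to restrict generation to valid label strings: the token-ID sequences of all labels are inserted into a trie, and at each decoding step only the children of the current trie node are allowed as next tokens. A minimal sketch (with placeholder token IDs, independent of the notebook's actual implementation):

```python
class Trie:
    """Prefix tree over token-ID sequences for constrained decoding."""

    def __init__(self):
        self.children = {}

    def insert(self, token_ids):
        """Add one valid label's token-ID sequence to the trie."""
        node = self
        for t in token_ids:
            node = node.children.setdefault(t, Trie())

    def allowed_next(self, prefix):
        """Token IDs that may legally follow the given generated prefix."""
        node = self
        for t in prefix:
            if t not in node.children:
                return []
            node = node.children[t]
        return sorted(node.children)

trie = Trie()
trie.insert([5, 7, 9])  # placeholder token IDs of one label
trie.insert([5, 8])     # placeholder token IDs of another label
print(trie.allowed_next([5]))  # [7, 8]
```

In beam search, this allowed-next set is used to mask the model's logits so every finished beam spells out an existing label.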
If you use or refer to this repository in your research, please cite our paper:
@inproceedings{buchner2023prompt,
title={Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation},
author={Buchner, V. L. and Cao, L. and Kalo, J.-C. and von Ehrenheim, V.},
booktitle={To appear in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL)},
year={2024}
}
Buchner, V. L., Cao, L., Kalo, J.-C., & von Ehrenheim, V. (2024). Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation. To appear in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).