Code release for the paper Inference-Time Language Model Alignment via Integrated Value Guidance.
In this work, we implement chunk-level beam search and emulator fine-tuning by extending the GenerationMixin
class. We provide the code and details for three specific tasks: controlled sentiment generation, summarization, and instruction following.
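For orientation, here is a minimal sketch of the emulator fine-tuning idea expressed as a transformers LogitsProcessor: the logit difference between a small tuned model and its untuned counterpart steers a larger base model's next-token logits. This is only an approximation of what the repository does by extending GenerationMixin; the guidance weight beta, the prompt, and the specific checkpoints below are illustrative choices, not the exact configuration used in the experiments.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)


class EmulatorGuidance(LogitsProcessor):
    """Adds the tuned-vs-untuned logit difference to the base model's logits."""

    def __init__(self, tuned, untuned, beta=1.0):
        self.tuned = tuned
        self.untuned = untuned
        self.beta = beta  # guidance strength (illustrative default)

    @torch.no_grad()
    def __call__(self, input_ids, scores):
        # scores: next-token logits of the base model, shape (batch, vocab)
        tuned_logits = self.tuned(input_ids).logits[:, -1, :]
        untuned_logits = self.untuned(input_ids).logits[:, -1, :]
        return scores + self.beta * (tuned_logits - untuned_logits)


tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-xl")
base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-xl")
tuned = AutoModelForCausalLM.from_pretrained("chadlzx/gpt2-imdb-dpo")
untuned = AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb")

inputs = tokenizer("The movie was", return_tensors="pt")
out = base.generate(
    **inputs,
    max_new_tokens=40,
    logits_processor=LogitsProcessorList([EmulatorGuidance(tuned, untuned)]),
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```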
We provide run.sh scripts for each task to facilitate inference. The scripts are located in the scripts directory.
We use the IMDB dataset for the controlled sentiment generation task. All models can be found on huggingface.co. The models used in this task are as follows (a chunk-level beam search sketch follows the list):
- GPT-2: openai-community/gpt2, openai-community/gpt2-large, openai-community/gpt2-xl
- Untuned Model: lvwerra/gpt2-imdb
- Tuned Model: chadlzx/gpt2-imdb-dpo
- Token Reward Model: chadlzx/gpt2-imdb-token-rm
- Golden Reward Model: lvwerra/distilbert-imdb
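Below is a rough sketch of chunk-level beam search for the sentiment task: candidate chunks sampled from lvwerra/gpt2-imdb are ranked with the lvwerra/distilbert-imdb sentiment classifier, which stands in for a value function here (the gpt2-imdb token reward model above may require its own loading code). Chunk length, beam width, sample counts, and the prompt are illustrative.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

gen_tok = AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
gen = AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb")
rm_tok = AutoTokenizer.from_pretrained("lvwerra/distilbert-imdb")
rm = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")


@torch.no_grad()
def sentiment_reward(texts):
    # Probability of the positive class (assumes label index 1 is POSITIVE).
    batch = rm_tok(texts, return_tensors="pt", padding=True, truncation=True)
    return rm(**batch).logits.softmax(-1)[:, 1]


@torch.no_grad()
def chunk_beam_search(prompt, num_chunks=4, chunk_len=16, num_beams=4, num_samples=4):
    beams = [prompt]
    for _ in range(num_chunks):
        # Sample several candidate continuations (chunks) per surviving beam.
        candidates = []
        for text in beams:
            ids = gen_tok(text, return_tensors="pt").input_ids
            out = gen.generate(
                ids,
                do_sample=True,
                max_new_tokens=chunk_len,
                num_return_sequences=num_samples,
                pad_token_id=gen_tok.eos_token_id,
            )
            candidates += [gen_tok.decode(o, skip_special_tokens=True) for o in out]
        # Keep the highest-reward candidates as the next set of beams.
        scores = sentiment_reward(candidates)
        keep = scores.topk(min(num_beams, len(candidates))).indices
        beams = [candidates[i] for i in keep]
    return beams[0]


print(chunk_beam_search("This movie was"))
```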
We implement the summarization task using the CarperAI/openai_summarize_comparisons dataset. The models used are as follows (see the data-loading sketch after this list):
- GPT-2: openai-community/gpt2, openai-community/gpt2-large, openai-community/gpt2-xl
- Untuned Model: chadlzx/gpt2-summarize
- Tuned Model: chadlzx/gpt2-summarize-dpo
- Token Reward Model: chadlzx/gpt2-summarize-token-rm
- Golden Reward Model: chadlzx/golden_rm_summarize
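As a usage sketch for this setup, the snippet below loads the comparison dataset and samples a summary from the untuned gpt2-summarize model. The split and column names ("test", "prompt") and the use of the plain GPT-2 tokenizer are assumptions about the dataset and checkpoint layout, not guarantees from this repository.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Split and column names are assumptions about the dataset layout.
dataset = load_dataset("CarperAI/openai_summarize_comparisons", split="test")
prompt = dataset[0]["prompt"]

# Assumes the fine-tuned checkpoint shares the standard GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("chadlzx/gpt2-summarize")

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated summary tokens.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```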
We implement the instruction-following task using the UltraFeedback dataset (chadlzx/ultrafeedback_with_rewards). The models used are as follows (see the prompt-formatting sketch after this list):
- LLaMA-2: meta-llama/Llama-2-7b-chat-hf, meta-llama/Llama-2-70b-chat-hf
- Mistral: mistralai/Mistral-7B-Instruct-v0.2, mistralai/Mixtral-8x7B-Instruct-v0.1
- Tulu-2: allenai/tulu-2-7b, allenai/tulu-2-dpo-7b
- Untuned Model: meta-llama/Llama-2-7b-hf
- Tuned Model: chadlzx/llama-ultrafeedback-dpo
- Token Reward Model: chadlzx/llama-ultrafeedback-token-rm
- GPT-4
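For the instruction-following models, the sketch below formats an UltraFeedback prompt with the Llama-2 chat template before generation. The split and column names for chadlzx/ultrafeedback_with_rewards are assumptions about its schema, and the meta-llama checkpoints require accepting the license on huggingface.co.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Split and column names are assumptions about the dataset schema.
dataset = load_dataset("chadlzx/ultrafeedback_with_rewards", split="train")
prompt = dataset[0]["prompt"]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
chat = [{"role": "user", "content": prompt}]
# Render the prompt with the model's chat template, ready for generate().
text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(text)
```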