Integrated Value Guidance

Code release for the paper Inference-Time Language Model Alignment via Integrated Value Guidance.

In this work, we implement chunk-level beam search and emulator fine-tuning by extending Hugging Face's GenerationMixin class. We provide the code and details for three tasks: controlled sentiment generation, summarization, and instruction following.
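
To make the decoding procedure concrete, here is a minimal sketch of how token-level implicit guidance and chunk-level explicit value search can be combined. All names here (ivg_decode, explicit_vf, beta, chunk_len) are illustrative, not the repository's actual API:

```python
import torch

@torch.no_grad()
def ivg_decode(base_model, tok, prompt, implicit_pair, explicit_vf,
               beta=1.0, chunk_len=16, num_beams=4, max_chunks=8):
    """Illustrative integrated value guidance (not the repo's exact API).

    implicit_pair: (untuned, tuned) small models whose logit difference
        steers each token (emulator-style implicit value function).
    explicit_vf:   callable mapping a candidate text to a scalar score,
        used to rank chunk-level beam-search candidates.
    Assumes all models share one tokenizer (true for, e.g., the GPT-2 family).
    """
    untuned, tuned = implicit_pair
    device = next(base_model.parameters()).device
    beams = [tok(prompt, return_tensors="pt").input_ids.to(device)]
    for _ in range(max_chunks):
        candidates = []
        for ids in beams:
            for _ in range(num_beams):          # sample several chunks per beam
                cur = ids
                for _ in range(chunk_len):      # token-level implicit guidance
                    logits = base_model(cur).logits[:, -1, :] + beta * (
                        tuned(cur).logits[:, -1, :]
                        - untuned(cur).logits[:, -1, :]
                    )
                    probs = torch.softmax(logits, dim=-1)
                    cur = torch.cat([cur, torch.multinomial(probs, 1)], dim=-1)
                candidates.append(cur)
        # Chunk-level beam search: keep the candidates that the explicit
        # value function scores highest.
        candidates.sort(key=lambda c: explicit_vf(tok.decode(c[0])), reverse=True)
        beams = candidates[:num_beams]
    return tok.decode(beams[0][0], skip_special_tokens=True)
```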

We provide a run.sh script for each task to facilitate inference. The scripts are located in the scripts directory.

Controlled Sentiment Generation

We use the IMDB dataset for controlled sentiment generation. All models can be found on huggingface.co; a minimal guidance example is sketched after the model list.

The models used in this task are as follows:

Base Model:

  • GPT-2: openai-community/gpt2, openai-community/gpt2-large, openai-community/gpt2-xl

Implicit Value Function:

  • Untuned Model: lvwerra/gpt2-imdb
  • Tuned Model: chadlzx/gpt2-imdb-dpo

Explicit Value Function:

  • Token Reward Model: chadlzx/gpt2-imdb-token-rm

Golden Reward Model:

  • lvwerra/distilbert-imdb
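
With the models above, the implicit value function steers the larger base model through the logit gap between the tuned and untuned guidance models. A minimal single-step sketch, assuming both guidance checkpoints load as causal LMs (the exact loading code lives in this repository):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2-large")
base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-large")
untuned = AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb")
tuned = AutoModelForCausalLM.from_pretrained("chadlzx/gpt2-imdb-dpo")

ids = tok("The movie was", return_tensors="pt").input_ids
with torch.no_grad():
    # The tuned-minus-untuned logit gap acts as an implicit value
    # function on top of the larger base model.
    guided = base(ids).logits[:, -1, :] + (
        tuned(ids).logits[:, -1, :] - untuned(ids).logits[:, -1, :]
    )
print(tok.decode(guided.argmax(-1)))  # most likely guided next token
```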

Summarization

We implement the summarization task using the CarperAI/openai_summarize_comparisons dataset. The models used are as follows (a scoring sketch follows the list):

Base Model:

  • GPT-2: openai-community/gpt2, openai-community/gpt2-large, openai-community/gpt2-xl

Implicit Value Function:

  • Untuned Model: chadlzx/gpt2-summarize
  • Tuned Model: chadlzx/gpt2-summarize-dpo

Explicit Value Function:

  • Token Reward Model: chadlzx/gpt2-summarize-token-rm

Golden Reward Model:

  • chadlzx/golden_rm_summarize
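
The explicit value function scores candidate chunks during beam search. Below is a hedged sketch of wrapping the token reward model as such a scorer; whether this checkpoint loads as a sequence classifier with a scalar head is an assumption, and the repository may use a custom wrapper:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "chadlzx/gpt2-summarize-token-rm"
rm_tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)

def explicit_vf(text: str) -> float:
    """Score a (partial) generation; higher means more promising."""
    inputs = rm_tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits.squeeze().item()  # assumes a scalar score
```

A callable like this is what the ivg_decode sketch above takes as its explicit_vf argument.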

Instruction Following

We implement the instruction-following task using the UltraFeedback dataset. The models used are as follows (an evaluation sketch follows the list):

Dataset:

  • chadlzx/ultrafeedback_with_rewards

Base Model:

  • LLaMA-2: meta-llama/Llama-2-7b-chat-hf, meta-llama/Llama-2-70b-chat-hf
  • Mistral: mistralai/Mistral-7B-Instruct-v0.2, mistralai/Mixtral-8x7B-Instruct-v0.1

Implicit Value Function:

Tulu Guidance:

  • allenai/tulu-2-7b
  • allenai/tulu-2-dpo-7b

Ultra Guidance:

  • meta-llama/Llama-2-7b-hf
  • chadlzx/llama-ultrafeedback-dpo

Explicit Value Function:

  • chadlzx/llama-ultrafeedback-token-rm

Golden Reward Model:

  • GPT-4
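
Since GPT-4 serves as the golden reward model here, evaluation amounts to a judge-style API call. A hedged sketch using the OpenAI client; the judging prompt below is illustrative, not the paper's actual template:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(instruction: str, response_a: str, response_b: str) -> str:
    """Ask GPT-4 which response better follows the instruction."""
    prompt = (
        "Which response follows the instruction better? Answer 'A' or 'B'.\n"
        f"Instruction: {instruction}\n\nA: {response_a}\n\nB: {response_b}"
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content.strip()
```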
