This repository is a landing page for tools related to prompt steering in language models. Our primary focus is to measure and analyze how controllable models are through prompting.
- `prompt-steerability`: A benchmark suite for measuring the extent to which models can be prompted to change behavior.
To install, set up a virtual environment using Python 3.10, navigate to `prompt_steerability/`, and run:

```bash
pip install -e .
```
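For reference, a minimal setup sketch using Python's built-in `venv` (any environment manager works; the exact Python binary name may differ on your system):

```bash
# create and activate a Python 3.10 virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# install the package in editable mode from the repo subdirectory
cd prompt_steerability/
pip install -e .
```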
Model access is enabled via HuggingFace and vLLM. For a simple guide, please refer to the quickstart, or follow these steps:
- Request access to models through HuggingFace:
  - visit the model page on HuggingFace
  - click the "Access" button and fill out the form if the model is gated
  - once approved, get your HuggingFace token from your account settings
  - log in using:

    ```bash
    huggingface-cli login
    ```
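    If you prefer a non-interactive login (e.g., on a remote GPU machine), the CLI also accepts a token flag; the environment variable name below is just an example:

    ```bash
    huggingface-cli login --token "$HF_TOKEN"
    ```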
- Install vLLM with GPU support in your virtual environment:

  ```bash
  pip install vllm[all]
  ```
- Start a vLLM server (note model access uses the OpenAI API format):

  ```bash
  python -m vllm.entrypoints.openai.api_server \
      --model <your-model-name> \
      --host 0.0.0.0 \
      --port 8000 \
      --device cuda
  ```
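  Once the server is up, you can sanity-check that the OpenAI-compatible endpoint is reachable (host and port here match the example command above):

  ```bash
  curl http://localhost:8000/v1/models
  ```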
Ensure that `<your-model-name>` appears as one of the `model-id`s under `model-config` in `config.yaml` (and matches the model's name on HuggingFace). For a complete list of parameter options (including options for serving across multiple GPUs), please see the vLLM documentation.
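For instance, vLLM supports tensor parallelism for sharding a model across multiple GPUs; the snippet below is illustrative, so check the vLLM documentation for the flags supported by your version:

```bash
# serve a model across 2 GPUs via tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model <your-model-name> \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2
```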
Configuration details are specified in `prompt_steerability/config.yaml`. Key parameters are listed below; an illustrative sketch of the file follows the list:
- `num-benchmark-trials`: the number of (outer) benchmark trials
- `persona-dimensions`: the persona dimensions you want to benchmark the models over
- `models`: the models you wish to benchmark (ensure these are hosted at the `base-url` locations specified in `model-config`)
- `steering` config:
  - `method`: the steering method (`principles`)
  - `params`:
    - `num-steering-trials`: the number of steering trials
    - `steering-budgets`: dictates the number of steering statements to include
- `profiling`:
  - `method`: the profiling method (`query-principles`)
  - `params`:
    - `num-questions`: how many profiling questions to ask per trial
    - `representation`: how to represent profiles
    - `inference`: how to parse outputs (`log-probs` vs `output-parsing`)
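The sketch below shows one way these keys could fit together; the values are placeholders and the exact nesting may differ, so treat the shipped `config.yaml` as the source of truth:

```yaml
# illustrative sketch only -- placeholder values; see config.yaml for the real structure
num-benchmark-trials: 5
persona-dimensions:
  - agreeableness                 # hypothetical persona dimension
models:
  - <your-model-name>             # must match a model-id in model-config (and the HuggingFace name)
model-config:
  <your-model-name>:
    base-url: http://localhost:8000/v1   # where the vLLM server is hosted
steering:
  method: principles
  params:
    num-steering-trials: 10
    steering-budgets: [0, 1, 3, 5]
profiling:
  method: query-principles
  params:
    num-questions: 20
    representation: <profile-representation>   # how to represent profiles
    inference: log-probs                        # or output-parsing
```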
After hosting the desired models, navigate to `persona/benchmark/` and run `run_persona_steerability.py` to run the benchmark.
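For example, assuming the vLLM server from the previous step is still running and the script needs no extra arguments:

```bash
cd persona/benchmark/
python run_persona_steerability.py
```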
```bibtex
@article{miehling2024evaluating,
  title={Evaluating the Prompt Steerability of Large Language Models},
  author={Miehling, Erik and Desmond, Michael and Ramamurthy, Karthikeyan Natesan and Daly, Elizabeth M. and Dognin, Pierre and Rios, Jesus and Bouneffouf, Djallel and Liu, Miao},
  journal={arXiv preprint arXiv:2411.12405},
  year={2024}
}
```