- model.forward()
  - ModelRunner.execute_model()
  - ModelRunner.profile_run()
    - kv_caches = [(None, None)] * num_layers
    - prefix_kv_caches = [(None, None)] * num_layers
  - ModelRunner.fill_prefix_kv_cache()
    - kv_caches = [(None, None)] * num_layers
    - prefix_kv_caches are the allocated buffers if enable_relay_attention
  - CUDAGraphRunner.capture()
    - kv_caches are allocated buffers
    - prefix_kv_caches are the allocated buffers if enable_relay_attention
  - Worker.execute_model()
    - kv_caches and prefix_kv_caches are both allocated buffers (prefix_kv_caches only if enable_relay_attention)
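A minimal sketch of the difference between the placeholder and allocated cases (the shapes and the helper below are illustrative assumptions, not the actual vLLM allocation code):

```python
import torch

num_layers = 32  # illustrative

# Placeholder case (profile_run / fill_prefix_kv_cache): every layer gets a
# (key_cache, value_cache) pair of (None, None), so no GPU memory is touched.
kv_caches = [(None, None)] * num_layers
prefix_kv_caches = [(None, None)] * num_layers

# Allocated case: real cache buffers per layer (the shape is only a stand-in).
def allocate_kv_caches(num_layers, num_blocks=1024, block_size=16,
                       num_heads=32, head_dim=128,
                       dtype=torch.float16, device="cuda"):
    shape = (num_blocks, block_size, num_heads, head_dim)
    return [(torch.empty(shape, dtype=dtype, device=device),
             torch.empty(shape, dtype=dtype, device=device))
            for _ in range(num_layers)]

enable_relay_attention = True
kv_caches = allocate_kv_caches(num_layers)
prefix_kv_caches = (allocate_kv_caches(num_layers) if enable_relay_attention
                    else [(None, None)] * num_layers)
```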
- finish implementation
- test in eager mode
- make the implementation work with CUDAGraph
  - use a static buffer to track the prefix cache length (see the sketch below)
  - fix a bug to make paged_attention_v2 work with CUDAGraph
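On the static-buffer item: CUDA graphs replay a fixed set of kernels on fixed device addresses, so a value that changes between steps (such as the prefix cache length) has to live in a tensor allocated before capture and be updated in place, never rebound to a new tensor. A minimal sketch of that pattern (`prefix_len_buf` and the toy computation are hypothetical):

```python
import torch

# Allocate the buffer once, before capture, so the captured kernels always
# read from the same device address.
prefix_len_buf = torch.zeros(1, dtype=torch.int32, device="cuda")
x = torch.randn(8, device="cuda")
out = torch.empty(8, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Stand-in for a kernel that reads the prefix length from the static buffer.
    out.copy_(x * prefix_len_buf)

# Update the value in place before each replay; assigning a new tensor to the
# name would leave the captured graph reading the old buffer.
prefix_len_buf.fill_(3)
g.replay()
```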
- optimize the implementation further
  - write a relay fusion kernel with triton
  - modify the paged attention kernel to return the log-sum-exp (used by the fusion merge sketched below)
  - use the native flash attention kernel to support MQA/GQA
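For context on the fusion items above: relay attention computes attention over the shared prefix and over the per-request suffix separately, then merges the two partial outputs using their log-sum-exp (LSE) values, which is why the paged attention kernel needs to return it. A PyTorch reference for the merge math (names and shapes are assumptions; a Triton kernel would fuse this):

```python
import torch

def relay_fuse(out_prefix, lse_prefix, out_suffix, lse_suffix):
    """Merge two partial attention outputs using their log-sum-exp (LSE).

    out_*: [num_tokens, num_heads, head_dim]  partial attention outputs
    lse_*: [num_tokens, num_heads]            log-sum-exp of the raw scores
    """
    # A softmax over the two LSEs gives each branch's weight; this is the same
    # numerically stable merge used by split/chunked attention schemes.
    lse = torch.stack([lse_prefix, lse_suffix], dim=-1)       # [T, H, 2]
    w = torch.softmax(lse, dim=-1).unsqueeze(-2)              # [T, H, 1, 2]
    outs = torch.stack([out_prefix, out_suffix], dim=-1)      # [T, H, D, 2]
    return (outs * w).sum(dim=-1)                             # [T, H, D]
```

The merge is exact: the fused result equals attention computed over the concatenated prefix-plus-suffix keys and values.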
- benchmark standalone relay attention (teaser)
  - script for latency, memory usage, and profiling; eager & CUDAGraph mode (see the sketch below)
  - run benchmark & profiling, plot figures
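A possible skeleton for the latency/memory part of such a script, where `run_step` is whatever executes one step in eager or CUDAGraph mode (a generic sketch, not the repo's actual benchmark):

```python
import torch

def benchmark(run_step, num_warmup: int = 10, num_iters: int = 50):
    """Return (mean latency in ms, peak GPU memory in GiB) for a CUDA callable."""
    for _ in range(num_warmup):
        run_step()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iters):
        run_step()
    end.record()
    torch.cuda.synchronize()

    latency_ms = start.elapsed_time(end) / num_iters
    peak_mem_gib = torch.cuda.max_memory_allocated() / 1024**3
    return latency_ms, peak_mem_gib
```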
- benchmark for non-interactive applications (exp group 1)
  - throughput & latency for synthetic workload, plot figures
    - (partially) fixed a bug of vLLM for OPT and LLaMA models
  - throughput & latency for real workload (ShareGPT dataset), plot figures
- benchmark for interactive applications (exp group 2)
  - throughput, latency to first token, latency to subsequent tokens w/ ShareGPT dataset
- check if we need to change the behavior of the tokenizer (e.g. avoid prepending the BOS token; see the sketch below)
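If a HuggingFace tokenizer is used, BOS insertion for the per-request text can likely be suppressed with `add_special_tokens=False`, so that only the shared prefix carries the BOS token (the model name and strings below are purely illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# The shared system prefix keeps its special tokens (BOS is prepended here).
prefix_ids = tokenizer("You are a helpful assistant.").input_ids
# The per-request suffix skips them, so BOS is not inserted a second time.
request_ids = tokenizer("What is relay attention?", add_special_tokens=False).input_ids

input_ids = prefix_ids + request_ids
```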
- adaptations for the cases where window attention is used and sequence length > window size
- adaptations to support ALiBi
- environment setup
- quantization
- model downloading
- relay attention does not work with CUDAGraph