- model.forward()
  - ModelRunner.execute_model()
  - ModelRunner.profile_run()
    - kv_caches = [(None, None)] * num_layers
    - prefix_kv_caches = [(None, None)] * num_layers
  - ModelRunner.fill_prefix_kv_cache()
    - kv_caches = [(None, None)] * num_layers
    - prefix_kv_caches are the allocated buffers if enable_relay_attention
  - CUDAGraphRunner.capture()
    - kv_caches are allocated buffers
    - prefix_kv_caches are the allocated buffers if enable_relay_attention
  - Worker.execute_model()
    - kv_caches and prefix_kv_caches are both allocated buffers (prefix_kv_caches only if enable_relay_attention)
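A minimal sketch of the difference between the placeholder and allocated cases (the shapes and the helper below are illustrative assumptions, not the actual vLLM allocation code):

```python
import torch

num_layers = 32  # illustrative

# Placeholder case (profile_run / fill_prefix_kv_cache): every layer gets a
# (key_cache, value_cache) pair of (None, None), so no GPU memory is touched.
kv_caches = [(None, None)] * num_layers
prefix_kv_caches = [(None, None)] * num_layers

# Allocated case: real cache buffers per layer (the shape is only a stand-in).
def allocate_kv_caches(num_layers, num_blocks=1024, block_size=16,
                       num_heads=32, head_dim=128,
                       dtype=torch.float16, device="cuda"):
    shape = (num_blocks, block_size, num_heads, head_dim)
    return [(torch.empty(shape, dtype=dtype, device=device),
             torch.empty(shape, dtype=dtype, device=device))
            for _ in range(num_layers)]

enable_relay_attention = True
kv_caches = allocate_kv_caches(num_layers)
prefix_kv_caches = (allocate_kv_caches(num_layers) if enable_relay_attention
                    else [(None, None)] * num_layers)
```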
- finish implementation
- test in eager mode
- make the implementation work with CUDAGraph
  - use a static buffer to track the prefix cache length (see the sketch below)
  - fix a bug to make paged_attention_v2 work with CUDAGraph
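On the static-buffer item: CUDA graphs replay a fixed set of kernels on fixed device addresses, so a value that changes between steps (such as the prefix cache length) has to live in a tensor allocated before capture and be updated in place, never rebound to a new tensor. A minimal sketch of that pattern (`prefix_len_buf` and the toy computation are hypothetical):

```python
import torch

# Allocate the buffer once, before capture, so the captured kernels always
# read from the same device address.
prefix_len_buf = torch.zeros(1, dtype=torch.int32, device="cuda")
x = torch.randn(8, device="cuda")
out = torch.empty(8, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Stand-in for a kernel that reads the prefix length from the static buffer.
    out.copy_(x * prefix_len_buf)

# Update the value in place before each replay; assigning a new tensor to the
# name would leave the captured graph reading the old buffer.
prefix_len_buf.fill_(3)
g.replay()
```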
- optimize the implementation further
  - write a relay fusion kernel with triton
  - modify the paged attention kernel to return the log-sum-exp (used by the fusion merge sketched below)
  - use the native flash attention kernel to support MQA/GQA
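For context on the fusion items above: relay attention computes attention over the shared prefix and over the per-request suffix separately, then merges the two partial outputs using their log-sum-exp (LSE) values, which is why the paged attention kernel needs to return it. A PyTorch reference for the merge math (names and shapes are assumptions; a Triton kernel would fuse this):

```python
import torch

def relay_fuse(out_prefix, lse_prefix, out_suffix, lse_suffix):
    """Merge two partial attention outputs using their log-sum-exp (LSE).

    out_*: [num_tokens, num_heads, head_dim]  partial attention outputs
    lse_*: [num_tokens, num_heads]            log-sum-exp of the raw scores
    """
    # A softmax over the two LSEs gives each branch's weight; this is the same
    # numerically stable merge used by split/chunked attention schemes.
    lse = torch.stack([lse_prefix, lse_suffix], dim=-1)       # [T, H, 2]
    w = torch.softmax(lse, dim=-1).unsqueeze(-2)              # [T, H, 1, 2]
    outs = torch.stack([out_prefix, out_suffix], dim=-1)      # [T, H, D, 2]
    return (outs * w).sum(dim=-1)                             # [T, H, D]
```

The merge is exact: the fused result equals attention computed over the concatenated prefix-plus-suffix keys and values.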
- benchmark standalone relay attention (teaser)
  - script for latency, memory usage, and profiling; eager & CUDAGraph mode (see the sketch below)
  - run benchmark & profiling, plot figures
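A possible skeleton for the latency/memory part of such a script, where `run_step` is whatever executes one step in eager or CUDAGraph mode (a generic sketch, not the repo's actual benchmark):

```python
import torch

def benchmark(run_step, num_warmup: int = 10, num_iters: int = 50):
    """Return (mean latency in ms, peak GPU memory in GiB) for a CUDA callable."""
    for _ in range(num_warmup):
        run_step()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iters):
        run_step()
    end.record()
    torch.cuda.synchronize()

    latency_ms = start.elapsed_time(end) / num_iters
    peak_mem_gib = torch.cuda.max_memory_allocated() / 1024**3
    return latency_ms, peak_mem_gib
```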
- benchmark for non-interactive applications (exp group 1)
  - throughput & latency for synthetic workload, plot figures
    - (partially) fixed a bug of vLLM for OPT and LLaMA models
  - throughput & latency for real workload (ShareGPT dataset), plot figures
- benchmark for interactive applications (exp group 2)
  - throughput, latency to first token, latency to subsequent tokens w/ ShareGPT dataset
- check if we need to change the behavior of the tokenizer (e.g. avoid prepending the BOS token; see the sketch below)
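If a HuggingFace tokenizer is used, BOS insertion for the per-request text can likely be suppressed with `add_special_tokens=False`, so that only the shared prefix carries the BOS token (the model name and strings below are purely illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# The shared system prefix keeps its special tokens (BOS is prepended here).
prefix_ids = tokenizer("You are a helpful assistant.").input_ids
# The per-request suffix skips them, so BOS is not inserted a second time.
request_ids = tokenizer("What is relay attention?", add_special_tokens=False).input_ids

input_ids = prefix_ids + request_ids
```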
- adaptations for the cases where window attention is used and sequence length > window size
- adaptations to support ALiBi
- environment setup
- quantization
- model downloading
- relay attention does not work with CUDAGraph