GQA models have not supported prefix caching #2873

toslunar · 2024-02-14T18:05:33Z

I found that a model using GQA returns wrong results with prefix_pos. After some investigation, it appears that the code to support MQA/GQA

vllm/vllm/model_executor/layers/attention.py

Lines 141 to 155 in 7e45107

    
    if self.num_kv_heads != self.num_heads:
        # As of Nov 2023, xformers only supports MHA. For MQA/GQA,
        # project the key and value tensors to the desired number of
        # heads.
        # TODO(woosuk): Use MQA/GQA kernels for higher performance.
        query = query.view(query.shape[0], self.num_kv_heads,
                           self.num_queries_per_kv, query.shape[-1])
        key = key[:, :,
                  None, :].expand(key.shape[0], self.num_kv_heads,
                                  self.num_queries_per_kv,
                                  key.shape[-1])
        value = value[:, :, None, :].expand(value.shape[0],
                                            self.num_kv_heads,
                                            self.num_queries_per_kv,
                                            value.shape[-1])

, which repeats the inputs, is not compatible with the current implementation of prefix caching (context_attention_fwd).
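To make the shapes concrete, here is a minimal standalone sketch in PyTorch. The sizes and variable names are made up for illustration and are not taken from vLLM. It shows that the expansion turns query and key into 4-D tensors of shape [num_tokens, num_kv_heads, num_queries_per_kv, head_size], that the expanded key is a stride-0 view rather than a copy, and which KV head each query head ends up paired with. A kernel written for 3-D [num_tokens, num_heads, head_size] inputs would presumably not handle this layout.

```python
import torch

# Illustrative sizes only (assumptions, not values from any real model).
num_tokens, num_heads, num_kv_heads, head_size = 4, 8, 2, 16
num_queries_per_kv = num_heads // num_kv_heads

query = torch.randn(num_tokens, num_heads, head_size)
key = torch.randn(num_tokens, num_kv_heads, head_size)

# Same operations as in the quoted snippet, with explicit sizes.
q4 = query.view(num_tokens, num_kv_heads, num_queries_per_kv, head_size)
k4 = key[:, :, None, :].expand(num_tokens, num_kv_heads,
                               num_queries_per_kv, head_size)

print(q4.shape, k4.shape)  # both are 4-D now
print(k4.stride())         # the expanded dimension has stride 0 (no copy)

# Flattening back to 3-D forces a copy and gives one key per query head:
# query head h is paired with KV head h // num_queries_per_kv.
k3 = k4.reshape(num_tokens, num_heads, head_size)
for h in range(num_heads):
    assert torch.equal(k3[:, h], key[:, h // num_queries_per_kv])
```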

To support MQA/GQA,

    if self.num_kv_heads != self.num_heads:
        query = query.view(batch_size * seq_len, self.num_heads, self.head_size)
        key = key.reshape(batch_size * seq_len, self.num_heads, self.head_size)
        value = value.reshape(batch_size * seq_len, self.num_heads, self.head_size)

is closer, but the KV of the prefix must also be expanded (after it is read from key_cache and value_cache).
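As a rough reference for what a correct result looks like, the following sketch computes GQA attention over a cached prefix in plain PyTorch. It is not the context_attention_fwd kernel and does not use vLLM's cache layout: the function name gqa_attention_with_prefix is hypothetical, and the prefix keys and values are assumed to have already been gathered from the cache into dense [prefix_len, num_kv_heads, head_size] tensors. The point is that the prefix KV and the new KV are expanded to num_heads in the same way.

```python
import torch

def gqa_attention_with_prefix(query, new_key, new_value,
                              prefix_key, prefix_value):
    """Hypothetical reference: causal GQA attention over prefix + new tokens.

    query:        [q_len, num_heads, head_size]
    new_key/val:  [q_len, num_kv_heads, head_size]
    prefix_*:     [prefix_len, num_kv_heads, head_size] (assumed dense)
    """
    q_len, num_heads, head_size = query.shape
    num_queries_per_kv = num_heads // new_key.shape[1]
    prefix_len = prefix_key.shape[0]

    # Join cached prefix KV with the new KV, then expand BOTH to num_heads.
    key = torch.cat([prefix_key, new_key], dim=0)
    value = torch.cat([prefix_value, new_value], dim=0)
    key = key.repeat_interleave(num_queries_per_kv, dim=1)
    value = value.repeat_interleave(num_queries_per_kv, dim=1)

    scores = torch.einsum("qhd,khd->hqk", query, key) * head_size ** -0.5

    # Query i sits at absolute position prefix_len + i and may attend to
    # every key at or before that position.
    q_pos = torch.arange(q_len)[:, None] + prefix_len
    k_pos = torch.arange(prefix_len + q_len)[None, :]
    scores = scores.masked_fill(k_pos > q_pos, float("-inf"))

    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, value)

# Usage: attention over the last tokens with a cached prefix should match
# the same rows of attention computed over the whole sequence at once.
torch.manual_seed(0)
total, prefix_len = 6, 4
q = torch.randn(total, 8, 16)
k = torch.randn(total, 2, 16)
v = torch.randn(total, 2, 16)
full = gqa_attention_with_prefix(q, k, v, k[:0], v[:0])
part = gqa_attention_with_prefix(q[prefix_len:], k[prefix_len:],
                                 v[prefix_len:], k[:prefix_len],
                                 v[:prefix_len])
print(torch.allclose(full[prefix_len:], part, atol=1e-5))
```

A comparison like this could serve as a test oracle for any kernel-level fix, since it does not depend on how the cache is stored.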

sighingnow · 2024-02-23T09:45:35Z

The issue was addressed by #3007

GQA models have not supported prefix caching

79fdfb3

WoosukKwon closed this Mar 2, 2024
