[Feature] Prefix sharing. #53
Signed-off-by: Duyi-Wang <[email protected]>
src/models/kvcache_manager.h
Outdated
@@ -24,27 +24,41 @@ class KVCacheManager {
this->layers = layers;
this->cachedKeys = new KVCacheTensor<KVCacheT>[layers];
this->cachedValues = new KVCacheTensor<KVCacheT>[layers];
this->cachedPrefixKeys = new KVCacheTensor<KVCacheT>[layers];
If prefix_sharing=false, there is no need to allocate this (even though the memory is small).
I suggest allocating it only when it is actually needed.
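To illustrate the suggestion, here is a minimal sketch of deferring the prefix allocation until the prefix-sharing path is actually taken. It is not the code in this PR: `LazyPrefixCacheManager`, `enablePrefixSharing`, and `hasPrefixCache` are hypothetical names, and `KVCacheTensor` is reduced to a stand-in so the example compiles on its own.

```cpp
#include <memory>

// Hypothetical stand-in for the real KVCacheTensor; only here so the sketch compiles.
template <typename T>
struct KVCacheTensor {
    T *data = nullptr;
};

// Hypothetical, simplified manager: the regular caches are allocated up front,
// while the prefix cache is created on first use.
template <typename KVCacheT>
class LazyPrefixCacheManager {
public:
    explicit LazyPrefixCacheManager(int layers)
        : layers(layers),
          cachedKeys(new KVCacheTensor<KVCacheT>[layers]),
          cachedValues(new KVCacheTensor<KVCacheT>[layers]) {}

    // Call this only when prefix sharing is enabled; repeated calls are no-ops.
    // A prefix value cache, if one exists, could follow the same pattern.
    void enablePrefixSharing() {
        if (!cachedPrefixKeys) {
            cachedPrefixKeys.reset(new KVCacheTensor<KVCacheT>[layers]);
        }
    }

    bool hasPrefixCache() const { return static_cast<bool>(cachedPrefixKeys); }

private:
    int layers;
    std::unique_ptr<KVCacheTensor<KVCacheT>[]> cachedKeys;
    std::unique_ptr<KVCacheTensor<KVCacheT>[]> cachedValues;
    std::unique_ptr<KVCacheTensor<KVCacheT>[]> cachedPrefixKeys; // empty until needed
};
```

With `std::unique_ptr<T[]>` the arrays are released automatically, so the destructor does not need a separate branch for the case where the prefix cache was never allocated.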
this->getPositionIds(prefixIDs, batchSize, pastSeqLen, 0);

free(prefixIDs);
ids = newIDs;
Is there any chance to free these IDs later, since they are dynamically allocated?
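One possible way to address the ownership question is to let a `std::vector<int>` own the trimmed IDs, so they are released when the vector goes out of scope. The sketch below is an assumption about how that could look, not the PR's code: `stripPrefix` is a hypothetical helper, and it assumes `ids` holds `batchSize` rows of `seqLen` tokens where the first `pastSeqLen` tokens of each row are the shared prefix.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the non-prefix tail of every sequence in the batch.
// The returned vector owns its storage, so no explicit free() is required.
std::vector<int> stripPrefix(const int *ids, int batchSize, int seqLen, int pastSeqLen) {
    const int inputSeqLen = seqLen - pastSeqLen;
    if (batchSize <= 0 || inputSeqLen <= 0) return {};

    std::vector<int> newIDs(static_cast<std::size_t>(batchSize) * inputSeqLen);
    for (int bs = 0; bs < batchSize; ++bs) {
        // Same offsets as the memcpy in the diff: skip pastSeqLen tokens in each row.
        std::copy_n(ids + seqLen * bs + pastSeqLen, inputSeqLen,
                    newIDs.begin() + static_cast<std::size_t>(inputSeqLen) * bs);
    }
    return newIDs;
}
```

The caller can keep the vector alive for as long as the decoder needs the IDs and pass `newIDs.data()` wherever an `int *` is expected.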
this->prepareAttnMask(prefixIDs, 0);

this->getPositionIds(prefixIDs, batchSize, pastSeqLen, 0);
Do we really need to call getPositionIds?
p[keyLen - 1] * ctx->attFactor);
p[2] * ctx->attFactor, p[strideC - 3] * ctx->attFactor, p[strideC - 2] * ctx->attFactor,
p[strideC - 1] * ctx->attFactor);
// for (int qki = 0; qki < queryLen; qki++) {
If it is not needed, please remove this commented-out code.
This is used to print the whole QK score and attention mask matrix.
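If the dump is worth keeping, one option is to move it into a small helper instead of leaving it commented out. The following is only a sketch under assumptions: `dumpScores` is a hypothetical function, and it assumes `p` points at a `queryLen` x `keyLen` score matrix with row stride `strideC`, scaled by `attFactor` as in the surrounding print statement.

```cpp
#include <ostream>
#include <sstream>

// Hypothetical debug helper: writes every scaled QK score, one query row per line.
inline void dumpScores(std::ostream &os, const float *p, int queryLen, int keyLen,
                       int strideC, float attFactor) {
    for (int qki = 0; qki < queryLen; qki++) {
        for (int k = 0; k < keyLen; k++) {
            if (k > 0) os << ' ';
            os << p[qki * strideC + k] * attFactor; // strideC may exceed keyLen (padding)
        }
        os << '\n';
    }
}
```

The call site could then be wrapped in whatever debug switch the project already uses, which keeps the hot path free of commented-out loops.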
src/models/common_decoder.h
Outdated
memcpy(newIDs + inputSeqLen * bs, ids + seqLen * bs + pastSeqLen, inputSeqLen * sizeof(int));
}

this->prepareAttnMask(prefixIDs, 0);
What is the purpose of this step?
Supports Llama, chatGLM2, Baichuan, and Opt. The chatGLM 1 model is not supported.