[API] Export MLP LLaMA api. #64
Conversation
changqi1 commented Nov 20, 2023 • edited
Force-pushed from 6bb26c1 to 8ce973b
void invokeMLPLLaMA(DataType dt, int numTokens, int hiddenSize, int intermediateSize, void *output, int outputStride,
        const void *input, int inputStride, const void *gateWeight, const void *upWeight, const void *downWeight) {
    static std::mutex mutex;
    std::lock_guard<std::mutex> lock(mutex);
Why do we need a lock here? Is it to protect the unordered_map access?
If so, I suggest narrowing down the scope.
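For illustration, a minimal sketch of the narrowed scope being suggested: the guard is held only while the shared map is searched or modified, not for the whole call. MLPEntry, hubMutex, and getOrCreateMLP are placeholder names, not the code in this PR.

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the cached per-weight-set MLP object; illustration only.
struct MLPEntry {};

static std::unordered_map<std::string, std::shared_ptr<MLPEntry>> llama_mlp_hub;
static std::mutex hubMutex;

// Narrowed scope: the lock protects only the map lookup/insert,
// not the forward computation that follows.
std::shared_ptr<MLPEntry> getOrCreateMLP(const std::string &key) {
    std::lock_guard<std::mutex> lock(hubMutex);
    auto it = llama_mlp_hub.find(key);
    if (it != llama_mlp_hub.end()) return it->second;
    auto entry = std::make_shared<MLPEntry>();
    llama_mlp_hub.emplace(key, entry);
    return entry;
}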
Narrowed it down in forward.
src/layers/mlp_llama.cpp
Outdated
// create hash key
std::stringstream weights_addr;
weights_addr << gateWeight << "_" << upWeight << "_" << downWeight;
std::string llama_mlp_key |
I think the address is enough; do we really need to add the size info to the key?
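A sketch of what an address-only key could look like; makeMLPKey is a hypothetical helper, and the stream-based formatting just follows the snippet above:

#include <sstream>
#include <string>

// Build the cache key from the weight addresses only; the pointers already
// identify a unique weight set, so no size information is appended.
std::string makeMLPKey(const void *gateWeight, const void *upWeight, const void *downWeight) {
    std::stringstream weights_addr;
    weights_addr << gateWeight << "_" << upWeight << "_" << downWeight;
    return weights_addr.str();
}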
Done
src/layers/mlp_llama.cpp
Outdated
auto it_created = llama_mlp_hub.find(llama_mlp_key);
if (it_created == llama_mlp_hub.end()) {
    // LlamaMLP<bfloat16_t> &llama_mlp = LlamaMLP<bfloat16_t>::getInstance();
    ctx = new DecoderContext(1, hiddenSize, 1, 1, intermediateSize, "silu", 1e-6, 0, 0, 0, 0, 0, 1);
How do we reuse the context? To ensure better performance, we need to reuse the context for every layer.
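One common way to reuse the context across layers is to cache a single instance and rebuild it only when the shapes change. A sketch under that assumption; Context and getSharedContext are hypothetical stand-ins for the real DecoderContext, and thread safety is left to the lock discussed below:

#include <memory>

// Hypothetical stand-in for the real DecoderContext; illustration only.
struct Context {
    int hiddenSize;
    int intermediateSize;
};

// Reuse pattern: one shared context is created lazily, handed to every layer,
// and rebuilt only when the requested shapes differ from the cached ones.
Context *getSharedContext(int hiddenSize, int intermediateSize) {
    static std::unique_ptr<Context> ctx;
    if (!ctx || ctx->hiddenSize != hiddenSize || ctx->intermediateSize != intermediateSize) {
        ctx.reset(new Context{hiddenSize, intermediateSize});
    }
    return ctx.get();
}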
Reused.
src/layers/mlp_llama.cpp
Outdated
}

// Unsupported: serving models of different data types simultaneously, because they share the same DecoderContext
std::lock_guard<std::mutex> lock(mutex);
Since we are now using a static variable ("static DecoderContext *ctx;"), if multiple threads call into invokeMLPLLaMA, they could modify the value of 'ctx' at the same time, or use the same intermediate buffer simultaneously, which would cause problems.
So this time I think we need to expand the lock scope. :)
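A sketch of the expanded scope being requested: take the guard at the top of the call so the shared static state and its buffers are never touched by two threads at once. SharedState and invokeMLPExample are illustrative stand-ins, not the merged code:

#include <mutex>

// Stand-in for the shared static context and its scratch buffers; illustration only.
struct SharedState {
    float *scratch = nullptr;
    int capacity = 0;
};

static SharedState state;
static std::mutex mlpMutex;

// Expanded scope: the guard covers both the (re)allocation of the shared state and
// the compute that uses it, so concurrent callers cannot race on the static context.
void invokeMLPExample(int numTokens, int hiddenSize) {
    std::lock_guard<std::mutex> lock(mlpMutex);
    int needed = numTokens * hiddenSize;
    if (needed > state.capacity) {
        delete[] state.scratch;
        state.scratch = new float[needed];
        state.capacity = needed;
    }
    // ... run the forward pass with state.scratch while still holding the lock ...
}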
OK, expanded the lock scope.