Added multiple model prompt formats to llama_cpp chat streaming. #3588
Conversation
case "chatML": | ||
messageText = `<|im_start|>user\n${message.content}<|im_end|>`; | ||
break; | ||
case "falcon": | ||
messageText = `User: ${message.content}`; | ||
break; | ||
case "general": | ||
messageText = `### Human\n${message.content}`; | ||
break; | ||
default: | ||
messageText = `[INST] ${message.content} [/INST]`; | ||
break; |
@nigel-daniels Why aren't you using a `LlamaChatSession` with the relevant `ChatPromptWrapper` for it? For example, there is a `ChatPromptWrapper` for ChatML under `ChatMLChatPromptWrapper`.

It's better to have only one centralized place for the handling of chat formats, especially as more will get supported in `node-llama-cpp` over time. Furthermore, the next major version of `node-llama-cpp` will figure out the right chat wrapper by default, so it won't require manual configuration most of the time.
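For reference, a minimal sketch of that approach, assuming the node-llama-cpp v2 API (the model path is a placeholder): the prompt wrapper applies the ChatML formatting, so the calling code never builds the `<|im_start|>` / `<|im_end|>` strings by hand.

```typescript
import {
  LlamaModel,
  LlamaContext,
  LlamaChatSession,
  ChatMLChatPromptWrapper,
} from "node-llama-cpp";

// Load a local model and wrap it in a chat session that formats prompts as ChatML.
const model = new LlamaModel({ modelPath: "path/to/model.gguf" });
const context = new LlamaContext({ model });
const session = new LlamaChatSession({
  context,
  promptWrapper: new ChatMLChatPromptWrapper(),
});

// The wrapper, not the caller, applies the chat format to the prompt.
const answer = await session.prompt("What is a llama?");
console.log(answer);
```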
This is because the call for streaming uses the `context.evaluate()` method:

```typescript
const stream = await this.caller.call(async () =>
  this._context.evaluate(this._context.encode(prompt), promptOptions)
);
```

This returns a generator, so I can use `yield` to pass back the tokens as they arrive.
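For context, a sketch of how that generator can then feed the stream, assuming the v2 `evaluate()` generator yields one token at a time; `stream`, `promptOptions`, `ChatGenerationChunk`, `AIMessageChunk`, and `runManager` all come from the surrounding `_streamResponseChunks` method.

```typescript
// Inside _streamResponseChunks: consume the token generator obtained above
// and decode each token into a text chunk as it arrives.
for await (const token of stream) {
  const chunkText = this._context.decode([token]);
  yield new ChatGenerationChunk({
    text: chunkText,
    message: new AIMessageChunk({ content: chunkText }),
  });
  await runManager?.handleLLMNewToken(chunkText);
}
```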
```typescript
    default:
      messageText = `[INST] ${message.content} [/INST]`;
      break;
  }
} else if (message._getType() === "ai") {
```
The next version will support generating a response from chat histories that can have "system", "user" and "model" roles, so I think most of this code could be replaced by the `node-llama-cpp` functionality for this.

I predict that the next beta that includes this functionality will be ready in about ~2 weeks if you're willing to wait for it, but in any case, I think this is better handled by `node-llama-cpp`, as it is its responsibility to handle the interface against a model.
Thanks for the review @giladgd - @nigel-daniels should we wait a bit then?
@jacoblee93 As I understood it, this would only work if I was using the `LlamaChatSession`.
@nigel-daniels If I understood the intention of `_streamResponseChunks` correctly, a good solution would be to start generating a response using a `LlamaChatSession` and yield the tokens as they arrive via the `onToken` callback. This is a general idea of how it can be implemented:

```typescript
async function *_streamResponseChunks(
input: BaseMessage[],
_options: this["ParsedCallOptions"],
runManager?: CallbackManagerForLLMRun
): AsyncGenerator<ChatGenerationChunk> {
// `this._parseInput` will convert a `BaseMessage[]` to the `conversationHistory` format that `LlamaChatSession` expects,
    // return the last human prompt text as `lastHumanPrompt`,
// and the system message as `sysMessage` or `undefined` if there is none, so the default system message of `node-llama-cpp` will be used.
const {sysMessage, conversationHistory, lastHumanPrompt} = this._parseInput(input);
const chatSession = new LlamaChatSession({
context: this._context,
conversationHistory,
systemPrompt: sysMessage,
});
let pendingTokens: Token[] = [];
let generationCompleted = false;
let onNewUpdate: (() => void) | undefined;
const responsePromise = chatSession.prompt(lastHumanPrompt, {
temperature: this?.temperature,
topK: this?.topK,
topP: this?.topP,
onToken(tokens: Token[]) {
pendingTokens = pendingTokens.concat(tokens);
onNewUpdate?.();
}
});
responsePromise
.finally(() => {
generationCompleted = true;
onNewUpdate?.();
});
while (!generationCompleted || pendingTokens.length > 0) {
if (pendingTokens.length === 0) {
await new Promise((accept) => {
onNewUpdate = accept;
});
continue;
}
const chunkText = this._context.decode(pendingTokens);
pendingTokens = [];
yield new ChatGenerationChunk({
text: chunkText,
message: new AIMessageChunk({
content: chunkText
}),
generationInfo: {},
});
await runManager?.handleLLMNewToken(chunkText);
}
await responsePromise; // in case of an error, this will throw
}
```

Note that the next major version of `node-llama-cpp` will simplify this further.
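For illustration only: the `this._parseInput` helper referenced in the comments above is hypothetical. A sketch of it could look like the following, where the `ConversationInteraction` shape of `{ prompt, response }` pairs is assumed from the node-llama-cpp v2 docs and the LangChain import path may differ by version.

```typescript
import type { BaseMessage } from "langchain/schema";

// Shape of LlamaChatSession's conversationHistory entries in the v2 API
// (assumed): prior prompt/response pairs.
type ConversationInteraction = { prompt: string; response: string };

// Hypothetical helper (not part of node-llama-cpp or this PR): splits the
// incoming messages into an optional system prompt, the prior prompt/response
// pairs, and the trailing human prompt to send to chatSession.prompt().
function parseInput(input: BaseMessage[]): {
  sysMessage: string | undefined;
  conversationHistory: ConversationInteraction[];
  lastHumanPrompt: string;
} {
  let sysMessage: string | undefined;
  const conversationHistory: ConversationInteraction[] = [];
  let pendingPrompt: string | undefined;
  let lastHumanPrompt = "";

  for (const message of input) {
    const text = typeof message.content === "string" ? message.content : "";
    const type = message._getType();

    if (type === "system") {
      sysMessage = text;
    } else if (type === "human") {
      pendingPrompt = text;
      lastHumanPrompt = text;
    } else if (type === "ai" && pendingPrompt !== undefined) {
      // Pair each AI response with the preceding human prompt.
      conversationHistory.push({ prompt: pendingPrompt, response: text });
      pendingPrompt = undefined;
    }
  }

  return { sysMessage, conversationHistory, lastHumanPrompt };
}
```

Pairing each AI message with the preceding human message keeps the history in the prompt/response form that `LlamaChatSession` replays before answering `lastHumanPrompt`.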
@giladgd Looks like a good workaround; it will also ensure the wrappers get used, so this module can remain better aligned with `node-llama-cpp`.
Given the previous comments about the upcoming update in a couple of weeks and the restructure happening in the repo, it makes sense to hold off on this for now.
This adds multiple prompt formats to the `llama_cpp` chat model to support models other than Llama2 running locally.