Added multiple model prompt formats to llama_cpp chat streaming. #3588

Closed
nigel-daniels wants to merge 2 commits

Conversation

nigel-daniels
Contributor

This adds multiple prompt formats to the llama_cpp chat model to support models other than Llama2 running locally.
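For illustration, this is the kind of per-format template the change introduces, based on the diff hunk quoted in the review below. A minimal standalone sketch; the formatHumanMessage and promptFormat names here are assumptions for illustration, not the PR's actual API.

// Illustrative sketch only: standalone version of the human-message formatting switch.
// `formatHumanMessage` and `promptFormat` are assumed names, not the PR's API.
function formatHumanMessage(content: string, promptFormat: string): string {
  switch (promptFormat) {
    case "chatML":
      return `<|im_start|>user\n${content}<|im_end|>`;
    case "falcon":
      return `User: ${content}`;
    case "general":
      return `### Human\n${content}`;
    default:
      // Llama 2 instruction format (the default case in the PR).
      return `[INST] ${content} [/INST]`;
  }
}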


Comment on lines +321 to +332
case "chatML":
messageText = `<|im_start|>user\n${message.content}<|im_end|>`;
break;
case "falcon":
messageText = `User: ${message.content}`;
break;
case "general":
messageText = `### Human\n${message.content}`;
break;
default:
messageText = `[INST] ${message.content} [/INST]`;
break;
@giladgd
@nigel-daniels Why aren't you using a LlamaChatSession with the relevant ChatPromptWrapper for it?
For example, you have a ChatPromptWrapper for ChatML under ChatMLChatPromptWrapper.
It's better to have only one centralized place for handling chat formats, especially as more will be supported in node-llama-cpp over time.

Furthermore, in the next major version of node-llama-cpp it will figure out the right chat wrapper by default, so it won't require manual configuration most of the time.
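A minimal sketch of what that could look like (illustrative only; the promptWrapper option name is assumed from node-llama-cpp v2, so double-check it against the version in use):

import { LlamaModel, LlamaContext, LlamaChatSession, ChatMLChatPromptWrapper } from "node-llama-cpp";

// Load a local model and create a context for it.
const model = new LlamaModel({ modelPath: "path/to/model.gguf" });
const context = new LlamaContext({ model });

// Let the session apply the ChatML wrapping instead of formatting messages by hand.
// `promptWrapper` is assumed to be the option name in node-llama-cpp v2.
const session = new LlamaChatSession({
  context,
  promptWrapper: new ChatMLChatPromptWrapper(),
});

const answer = await session.prompt("Hello!");
console.log(answer);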

@nigel-daniels
Contributor Author

This is because the call for streaming uses the context.evaluate() method:

const stream = await this.caller.call(async () =>
      this._context.evaluate(this._context.encode(prompt), promptOptions)
    );

This returns an async generator, so I can use yield to pass back the tokens as they arrive.
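A simplified, illustrative sketch of that pattern (not the PR's exact code): evaluate() yields tokens as they are generated, which can be decoded and re-yielded.

import { LlamaContext } from "node-llama-cpp";

// Illustrative only: stream text out of the low-level context by consuming
// the async generator returned by evaluate().
async function* streamCompletion(context: LlamaContext, prompt: string) {
  for await (const token of context.evaluate(context.encode(prompt))) {
    // Decode each token back to text as soon as it is produced.
    yield context.decode([token]);
  }
}

Note that decoding token-by-token can split multi-byte characters; batching tokens before decoding, as in the suggestion later in this thread, avoids that.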

default:
messageText = `[INST] ${message.content} [/INST]`;
break;
}
} else if (message._getType() === "ai") {
@giladgd
The next version will support the generation of a response based on chat histories that can have "system", "user" and "model" roles, so I think most of this code could be replaced by the node-llama-cpp functionality for this.

I expect the next beta that includes this functionality to be ready in about two weeks, if you're willing to wait for it; but in any case, I think this is better handled by node-llama-cpp, as it is its responsibility to handle the interface against a model.

@jacoblee93
Collaborator

Thanks for the review @giladgd - @nigel-daniels should we wait a bit then?

@nigel-daniels
Contributor Author

@jacoblee93 As I understood it, this would only work if I were using the session.prompt() method, but in order to get hold of a generator I was calling context.evaluate() and bypassing the session. It's the session that applies the different wrapper types.
@giladgd Do you have any thoughts on a better way to do this using the session?
AFAIK, if I use the context directly then there is no point in waiting.

@giladgd

giladgd commented Dec 13, 2023

@nigel-daniels If I understood the intention of _streamResponseChunks correctly, it is meant to stream the model's response, not necessarily to control the generation of the response itself.

A good solution would be to start generating a response using a LlamaChatSession and stream it without controlling the generation process.
This approach will also improve generation performance.

This is a general idea of how it can be implemented:

async *_streamResponseChunks(
    input: BaseMessage[],
    _options: this["ParsedCallOptions"],
    runManager?: CallbackManagerForLLMRun
): AsyncGenerator<ChatGenerationChunk> {
    // `this._parseInput` converts a `BaseMessage[]` into the `conversationHistory` format that `LlamaChatSession` expects,
    // returns the last human prompt text as `lastHumanPrompt`,
    // and returns the system message as `sysMessage`, or `undefined` if there is none, so the default system prompt of `node-llama-cpp` is used.
    const { sysMessage, conversationHistory, lastHumanPrompt } = this._parseInput(input);

    const chatSession = new LlamaChatSession({
        context: this._context,
        conversationHistory,
        systemPrompt: sysMessage,
    });

    let pendingTokens: Token[] = [];
    let generationCompleted = false;
    let onNewUpdate: (() => void) | undefined;

    // Start generating; tokens are buffered as they arrive and drained by the loop below.
    const responsePromise = chatSession.prompt(lastHumanPrompt, {
        temperature: this.temperature,
        topK: this.topK,
        topP: this.topP,
        onToken(tokens: Token[]) {
            pendingTokens = pendingTokens.concat(tokens);
            onNewUpdate?.();
        }
    });
    responsePromise
        .catch(() => {
            // Errors are surfaced by the `await responsePromise` at the end;
            // this catch only prevents an unhandled rejection on the chained promise.
        })
        .finally(() => {
            generationCompleted = true;
            onNewUpdate?.();
        });

    while (!generationCompleted || pendingTokens.length > 0) {
        if (pendingTokens.length === 0) {
            // Wait until new tokens arrive or generation completes.
            await new Promise<void>((accept) => {
                onNewUpdate = () => accept();
            });
            continue;
        }

        const chunkText = this._context.decode(pendingTokens);
        pendingTokens = [];

        yield new ChatGenerationChunk({
            text: chunkText,
            message: new AIMessageChunk({
                content: chunkText
            }),
            generationInfo: {},
        });
        await runManager?.handleLLMNewToken(chunkText);
    }

    await responsePromise; // in case of an error, this will throw
}

Note that the next major version of node-llama-cpp will provide a much more convenient interface that will make most of the _parseInput function redundant, but as it will take some time until it's ready, I suggest proceeding with this workaround for now.
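For completeness, here is a hypothetical sketch of the _parseInput helper described in the code comment above. It is not part of the PR or of node-llama-cpp, and the { prompt, response } shape of conversationHistory entries is an assumption to verify against the version in use.

// Hypothetical helper: split a BaseMessage[]-like input into the system prompt,
// the prior prompt/response pairs, and the trailing human message to send now.
type MessageLike = { _getType(): string; content: string };
type HistoryEntry = { prompt: string; response: string };

function parseInput(input: MessageLike[]): {
  sysMessage: string | undefined;
  conversationHistory: HistoryEntry[];
  lastHumanPrompt: string;
} {
  let sysMessage: string | undefined;
  const conversationHistory: HistoryEntry[] = [];
  let pendingPrompt: string | undefined;

  for (const message of input) {
    const type = message._getType();
    if (type === "system") {
      sysMessage = message.content;
    } else if (type === "human") {
      // Two human messages in a row: pair the earlier one with an empty response.
      if (pendingPrompt !== undefined) {
        conversationHistory.push({ prompt: pendingPrompt, response: "" });
      }
      pendingPrompt = message.content;
    } else if (type === "ai" && pendingPrompt !== undefined) {
      conversationHistory.push({ prompt: pendingPrompt, response: message.content });
      pendingPrompt = undefined;
    }
  }

  // The final human message with no AI response yet is the prompt to send.
  return { sysMessage, conversationHistory, lastHumanPrompt: pendingPrompt ?? "" };
}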

@nigel-daniels
Contributor Author

@giladgd That looks like a good workaround; it will also ensure the wrappers get used, so this module can remain better aligned with node-llama-cpp. Thanks!

@nigel-daniels
Contributor Author

Given the previous comments about the upcoming update in a couple of weeks, and the restructuring happening on the LangChain.js main branch, I'll close this PR and rebase on main once the node-llama-cpp updates are out.
