Support file based prompt caching #180
Comments
I think this function in llama.cpp might be the right one to call to try to implement this. But I've never done any kind of C++ to Node.js bindings before, so I'm doing my best to work through how that works and how to implement it here just by inferring from addon.cpp.
I really like the idea :) I've experimented with
If you like, you can try to add the ability to save and load only the evaluation cache of a context sequence to
I've used the oobabooga API for some batch tasks, and it is noticeably fast for sequential large prompts when just the start of the text is the same but the ending is different. It seems to be a feature of llama-cpp-python? Is that a different implementation of prefix caching? I was hoping to benefit from this feature too; I forgot that llama.cpp and the Python version are two different things.
@Madd0g The way it works is that it reuses the existing context state for the new evaluation; since the start of the current context state is the same, it can start evaluating the new prompt at the first token that differs from the existing context state. This feature already exists in the v3 beta; for example:
```js
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize)
});
const contextSequence = context.getSequence();
const session = new LlamaChatSession({
    contextSequence,
    autoDisposeSequence: false
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

// Dispose the session but keep the context sequence (and its state) alive
session.dispose();

// The new session reuses the same sequence, so the shared prefix of the
// next prompt doesn't have to be evaluated again
const session2 = new LlamaChatSession({
    contextSequence
});

const q1a = "Hi there";
console.log("User: " + q1a);

const a1a = await session2.prompt(q1a);
console.log("AI: " + a1a);
```
@giladgd - thanks, I played around today with the beta. I tried running on the CPU; I'm looping over an array of strings, and for me the evaluation takes longer only if I dispose of and fully recreate the session.
I'm resetting the history in the loop to keep only the system message:
```js
const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize),
});
const contextSequence = context.getSequence();
const session = new LlamaChatSession({
    systemPrompt,
    contextSequence,
    autoDisposeSequence: false,
});

// for of ....
// and in the loop, to keep only the system message:
session.setChatHistory(session.getChatHistory().slice(0, 1));
```
Am I doing something wrong?
I'm a little fuzzy on the difference between the "entire context state" and the "evaluation cache", because I don't have a super solid conceptual idea of how things work under the hood in llama.cpp for batching. It sounds to me like the existing prompt-based caching would only really be useful for single-user setups and for short-term caching. Is there a way to cache a context to disk on the Node side with the v3 beta? I'm assuming a naive attempt like this won't actually work:
```js
const context = await model.createContext({
    contextSize: 2048,
});
// Naive attempt: a context object can't simply be serialized to disk like this
fs.writeFileSync("context.bin", context);
```
@Madd0g It's not a good idea to manually truncate the chat history like that just to reset it; you'd be better off creating a new session instead. The next beta version should be released next week and will include a feature for this.
@StrangeBytesDev A context can have multiple sequences, and each sequence has its own state and history. Since every sequence is supposed to be independent and have its own state, there shouldn't be any functions with side effects that can affect other sequences when you only intend to affect a specific one. The problem with
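To make the sequence model more concrete, here is a minimal sketch of two independent chat sessions sharing one context, each on its own sequence. It assumes the v3 beta's `createContext` accepts a sequence-count option; the `sequences` option name and the model path are assumptions for illustration.
```js
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join("models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

// Assumed option name: one context backing two independent sequences
const context = await model.createContext({
    contextSize: 2048,
    sequences: 2
});

// Each sequence has its own state and history, so these sessions can't
// affect one another even though they share the same context
const sessionA = new LlamaChatSession({contextSequence: context.getSequence()});
const sessionB = new LlamaChatSession({contextSequence: context.getSequence()});

console.log(await sessionA.prompt("Hi, I'm planning a trip to Japan."));
console.log(await sessionB.prompt("Hi, help me debug a Node.js script."));
```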
Thanks, I initially couldn't get it to work without it retaining history; I was doing something wrong. Today I did manage to do it correctly, with something like this in a loop:
```js
if (session) {
    session.dispose();
}
session = new LlamaChatSession({contextSequence, systemPrompt, chatWrapper: "auto"});
console.log(session.getChatHistory());
```
I tried to get the chat history out of the session, and I correctly see only the system message in there.
Hey @giladgd, thanks for all of your work on the library. I have a couple of questions (some of them related to this issue), and I didn't know a better way to get in touch than to comment here. My questions:
Please let me know if there is a Discord or a better way of getting in touch. You can reach me at [email protected]. Excited to chat and potentially collaborate on this issue!
@dabs9 The file-based caching will be released as a non-breaking feature after the version 3 stable release. Contributions are welcome, but the file-based caching feature will have to wait a bit until the feature for using the GPUs of other machines lands first, so it can be implemented in a stable manner without breaking changes.
I prefer to use GitHub Discussions for communication since it makes it easier for people new to this library to search existing discussions, and relevant information shows up on Google, which is helpful when looking for things.
Feature Description
llama.cpp is able to cache prompts to a specific file via the `--prompt-cache` flag. I think that exposing this through node-llama-cpp would enable techniques for substantial performance improvements which are otherwise impossible.
For example, you could create a separate cache file for each conversation, and when you switch from one conversation to another, load its existing cache file instead of re-processing the conversation history.
You'd also be able to keep the cache available indefinitely, which is currently not possible via other caching mechanisms.
The Solution
Implement a config option to specify a prompt cache file.
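A hypothetical sketch of what such an option could look like; the `promptCacheFile` name and its save/load behavior are illustrative assumptions, not an existing node-llama-cpp API:
```js
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join("models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

// Hypothetical option: if the file exists, the evaluated prompt state is
// loaded from it; otherwise it's created and updated as the sequence
// evaluates new tokens. This option does not exist yet.
const context = await model.createContext({
    contextSize: 2048,
    promptCacheFile: path.join("caches", "conversation-42.bin")
});

const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// On a later run with the same cache file, the already-evaluated prefix of
// this conversation wouldn't need to be re-processed
const answer = await session.prompt("Where did we leave off last time?");
console.log(answer);
```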
Considered Alternatives
The llama.cpp server implements something similar with slots. With each request, you're able to specify a slot ID, and it will then utilize that slot's existing prompt cache for the request. This works pretty well, but since each slot is kept in memory, it limits the number of slots you can utilize at once and doesn't preserve the cache between server restarts.
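For reference, a rough sketch of driving that slot reuse from Node.js; the request fields used here (`cache_prompt`, `id_slot`) reflect my reading of the llama.cpp server documentation and may differ between server versions:
```js
// Rough sketch: pin a conversation to a llama.cpp server slot and reuse
// its in-memory prompt cache across requests (field names are assumptions
// based on the server docs and may vary between versions)
const sharedPrefix = "You are a helpful assistant.\n\n<conversation history>";
const newUserMessage = "\nUser: What about tomorrow?\nAssistant:";

const response = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({
        prompt: sharedPrefix + newUserMessage,
        n_predict: 256,
        cache_prompt: true, // reuse the matching prefix already in the slot's cache
        id_slot: 0 // keep this conversation on the same slot
    })
});
const {content} = await response.json();
console.log(content);
```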
Additional Context
I'm able to work on this feature with a little guidance.
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, but I don't know how to start. I would need guidance.