Beta version feedback (`3.0.0-beta`) #109

giladgd · 2023-12-06T11:52:43Z

giladgd
Dec 6, 2023
Maintainer

Please share here any feedback you have for the beta of version 3.0

danielmeloalencar · 2023-12-06T13:50:44Z

danielmeloalencar
Dec 6, 2023

In version 2.8.1, the response is in 8 seconds, in the beta version it took more than 20 minutes (I got tired of waiting).

The prompt was: Hi there, how are you?
The model was: openchat_3.5.Q4_K_S.gguf

The example used was the simple chat example from each version.

In the beta version, there is also an error: Cannot read properties of undefined (reading ‘disposed’)

if I do not provide a contextSequence.

2 replies

giladgd Dec 6, 2023
Maintainer Author

The TS types require you to pass contextSequence, there's no point in using a LlamaChatSession without it.
Have you used the example code?

danielmeloalencar Dec 6, 2023

I tested again with another model: zephyr-cucumber-instruct.gguf.q4_k_m.bin

This time it worked with a response time of 23 seconds.

danielmeloalencar · 2023-12-06T15:27:32Z

danielmeloalencar
Dec 6, 2023

But what about the slowness for such simple prompts with the beta version? I’ve been waiting for over an hour to run another test.

O código do teste é

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "", "openchat_3.5.Q4_K_S.gguf")
});
const context = new LlamaContext({
    model,
    contextSize: Math.min(4096, model.trainContextSize)
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);
console.time("teste")
const a1 = await session.prompt(q1);
console.log("AI: " + a1);
console.timeEnd("teste")

0 replies

carlosgalveias · 2023-12-06T20:17:46Z

carlosgalveias
Dec 6, 2023

On 2.8 If I was sending large pieces of text or talked for too long it would break because it was not having more kv slots.
On version 3, I needed to specify the context size as it was inheriting the context size of the model (?).
But after that it seems to work well as I've sent a lot larger pieces of text and it scales well. I only say the error when it actually reached physical memory limit.

Did not notice increase time of inference.

PS: Using CPU only

0 replies

stewartoallen · 2024-01-21T01:18:19Z

stewartoallen
Jan 21, 2024

prior to today, beta3 was able to load mixtral files. as of the latest update, it errors out with

error loading model: create_tensor: tensor 'blk.0.ffn_gate.weight' not found

I was mostly using: mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf

Can submit a bug report with sample code, but it happens simply during LlamaModel() instantiation

6 replies

stewartoallen Jan 21, 2024

I've tested llama.cpp directly with the models I'm using (including mixtral), and it's working (as of their latest code 5 minutes ago). with and without gpu.

giladgd Jan 21, 2024
Maintainer Author

Can you please attach a link to the specific model file you’ve used?
I’ll investigate it

stewartoallen Jan 21, 2024

mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf

any of those should elicit the behavior. thanks.

stewartoallen Jan 21, 2024

@giladgd ugh. I nuked, re-cloned, and rebuilt my local repo and the "tensor" error has disappeared 😕 sorry. no idea. however, gpuLayers is no longer enlisting the gpu. even though adding that option to LlamaModel() does reflect in the console output:

llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU

(setting gpuLayers to 0 results in "0 repeating layers to GPU")

when running llama.cpp from the cli with and without gpu shows a marked difference in the cpu/gpu load monitors and shortened run time. with node-llama-cpp (beta 4 now), the gpu is idle regardless of the settings.

stewartoallen Jan 21, 2024

when I download and build node-llama-cpp from src (have been using npm package) and use the node ./dist/cli/cli.js chat command, it's using the gpu properly. but running the cli.js packaged with the npm package does not (even when adding the --gl parameter).

looks like a bug in the published package. switching to using a local build of node-llama-cpp using npm link.

stewartoallen · 2024-01-22T01:06:24Z

stewartoallen
Jan 22, 2024

I'm trying to understand the difference in the handling of the batch size parameter between llama.cpp main and when node-llama-cpp is using llama.cpp as a library (possibly related to this issue which I thought the beta fixed). For this test, I'm providing an input (about 1.5k) as the initial prompt. To make it easier to debug fully within node-llama-cpp, I've submitted a PR that adds --batchSize, -b and --promptFile, -f cli parameters.

Hardware: Apple M2 Ultra
Node: v20.11.0
RAM: 192GB
Source: llama.cpp and node-llama-cpp are at latest/HEAD
to ensure node-llama-cpp is using the same version of llama.cpp, I've sym-linked it to llama/llama.cpp

tl;dr -- mixtral models appear to limit batch sizes to 512 unless you disable gpu layers. seems to be a llama.cpp bug. I can't work around this in node-llama-cpp by setting the batch size equal to the context size as a result. llama.cpp's main seems to be fine with batch < context, though.

testing with llama.cpp's main:

./main -f /tmp/prompt1k -m models/llama-2-13b-chat.Q6_K.gguf -c 4096 -b 512 << OK

./main -f /tmp/prompt1k -m models/llama-2-13b-chat.Q6_K.gguf -c 4096 -b 4096 << OK

./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 512 << OK

./main -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 4096 -b 4096 << FAIL

### Assistant:GGML_ASSERT: ggml-metal.m:1511: ne11 <= 512

It looks like mixtral won't work with larger batch sizes and gpu. disabling gpu works, but is incredibly slow.

testing with node-llama-cpp's cli:

node dist/cli/cli.js chat -f /tmp/prompt1k -m models/llama-2-13b-chat.Q6_K.gguf -w llamaChat -c 4096 -b 512 << FAIL

GGML_ASSERT: /Users/stewart/Code/node-llama-cpp-soa/llama/llama.cpp/llama.cpp:6186: n_tokens <= n_batch

node dist/cli/cli.js chat -f /tmp/prompt1k -m models/llama-2-13b-chat.Q6_K.gguf -w llamaChat -c 4096 -b 4096 << OK

node dist/cli/cli.js chat -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -w llamaChat -c 4096 -b 512 << FAIL

GGML_ASSERT: /Users/stewart/Code/node-llama-cpp-soa/llama/llama.cpp/llama.cpp:6186: n_tokens <= n_batch

node dist/cli/cli.js chat -f /tmp/prompt1k -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -w llamaChat -c 4096 -b 4096 << FAIL

GGML_ASSERT: /Users/stewart/Code/node-llama-cpp-soa/llama/llama.cpp/ggml-metal.m:1515: ne11 <= 512

I would expect the two cases where context=4096, batch=512 to work same as llama.cpp

2 replies

giladgd Jan 22, 2024
Maintainer Author

@stewartoallen The PR you've submitted is for the master branch, which corresponds to the stable version (2.8.5 at the time of writing this).
I've opened a PR with similar changes for the beta branch so you can use it for your tests against the beta version.

The n_tokens <= n_batch is a known issue in version 2 and was fixed in the version 3 beta.

I suspect you've encountered these issues because you've been using version 2 and not version 3 beta.

stewartoallen Jan 22, 2024

thanks for your patience as I stumble through this. user error, indeed.

bracesproul · 2024-01-23T17:32:56Z

bracesproul
Jan 23, 2024

Hey @stewartoallen, I'm one of the maintainers at LangChain for the JS repo and I noticed v3.0.0-beta.3 contains text embeddings. First off, thank you for getting this in, very exciting! We've been looking to add this to the library, but would ideally wait until a stable 3.0. Is there a timeline for when you expect 3.0 to exit beta?

Thanks again for adding this, and the rest of your work on this library!

3 replies

giladgd Jan 24, 2024
Maintainer Author

@bracesproul I'm the developer behind this library.
The stable version 3 will be released in ~1-2 months as there are a few features I plan to implement that will break the current API, so I'd want to finish them all before a stable release.

bracesproul Jan 24, 2024

@bracesproul I'm the developer behind this library.

The stable version 3 will be released in ~1-2 months as there are a few features I plan to implement that will break the current API, so I'd want to finish them all before a stable release.

Thanks for the estimate! When you say breaking changes, do you mean to the previous (2.0) version, or to the text embeddings API, or both? Asking because if you were confident the beta release for text embeddings contains an API which won't likely change, we could think about adding it now.

giladgd Jan 24, 2024
Maintainer Author

I mean both; the current beta 3 version is already a breaking change compared to version 2, and it'll likely break again very soon.
It wouldn't be hard to adapt to the planned breaking changes, but I don't think it's a good idea to release support for the beta version in a stable version of LangChain, at least not yet.

DennisKo · 2024-02-02T22:16:42Z

DennisKo
Feb 2, 2024

I am on a Apple M1 Max 32GB. 3.0.0-beta.5

With mistral-7b-instruct-v0.2.Q4_K_M.gguf, it crashes when I don't explicitly use contextSize: 2048. Similar to what stewartoallen mentioned. Anything bigger is crashing it as well. I do not think it's my machine reaching some limits - Ive ran bigger models/contexts via Ollama.

Seems like its because its setting batchSize = contextSize by default, which I think my system cant handle. Setting an explicit smaller batchSize also fixes it.

Works

const context = new LlamaContext({
    model,
    contextSize: Math.min(4096, model.trainContextSize),
    batchSize: 512
  })

Crashes

const context = new LlamaContext({
    model,
    contextSize: Math.min(4096, model.trainContextSize)
  })

2 replies

stewartoallen Feb 3, 2024

I don't think this is a system limitation. I tried it on a maxed out Studio w/ 192GB RAM. No luck. Batches over 512 w/ Mistral/Mixtral derivatives crashes. Also instantiating EmbeddingContext on those same models and their derivatives will SIGSEGV

giladgd Feb 4, 2024
Maintainer Author

@DennisKo Can you please share a link to the specific model file you used?
I tried running a mistral model I have with batch size of 4096 and it worked for me.
I also have an Apple M1 Max 32GB.

devoidfury · 2024-02-06T22:54:15Z

devoidfury
Feb 6, 2024

Heya, looking great so far! One little request I have: when using "auto" for the chatWrapper type, it'd be great to have a public API to see which wrapper was chosen. I'm currently using session._chat.chatWrapper?.wrapperName however it's clear that _chat is an internal property and subject to break.

1 reply

giladgd Feb 9, 2024
Maintainer Author

@devoidfury I'll add this in the next beta.
Thanks for the feedback!

mrddter · 2024-02-09T15:59:48Z

mrddter
Feb 9, 2024

hi @giladgd have you already planned when will be possible to use Grammar and Functions together?

4 replies

giladgd Feb 11, 2024
Maintainer Author

@mrddter Using both of them produces unstable results in the models I've tested, as instead of letting the model generate text between function calls, it forces it to generate a final "summary" message according to the grammar but without any relevant data from function calls.
The best way to use function calls and generate a response according to a JSON schema or any other grammar is to prompt the model again, asking it to repeat its previous answer while using a grammar.

I'm still experimenting with various ways to enable using grammar together with function calling, but I'll release it only when it works correctly.
I aim to make this library as easy to use as possible, and right now, enabling the ability to use grammar together with function calling will make it harder to use due to instability, as mentioned above.

mrddter Feb 12, 2024

Do you have an example? Especially, how can I ask to repeat its previous answer? Thanks for your time.

giladgd Feb 13, 2024
Maintainer Author

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaModel, LlamaContext, LlamaChatSession, LlamaJsonSchemaGrammar} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = new LlamaModel({
    llama
    modelPath: path.join(__dirname, "models", "functionary-small-v2.2.q4_0.gguf")
});
const context = new LlamaContext({
    model,
    contextSize: Math.min(4096, model.trainContextSize)
});
const functions = {
    getDate: defineChatSessionFunction({
        description: "Retrieve the current date",
        handler() {
            return new Date().toLocaleDateString();
        }
    }),
    getNthWord: defineChatSessionFunction({
        description: "Get an n-th word",
        params: {
            type: "object",
            properties: {
                n: {
                    enum: [1, 2, 3, 4]
                }
            }
        },
        handler(params) {
            return ["very", "secret", "this", "hello"][params.n - 1];
        }
    })
};
const responseSchema = new LlamaJsonSchemaGrammar(llama, {
    type: "object",
    properties: {
        word: {
            type: "string"
        }
    }
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "What is the second word?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);

const formattedResponse = await chatSession.prompt("Repeat your response", {
    grammar: responseSchema
});
const parsedResponse = res2SchemaGrammar.parse(formattedResponse);

console.log("Word: " + parsedResponse.word);

mrddter Feb 14, 2024

Thanks a lot, giving it a try today 🚀

mikjee · 2024-02-12T07:19:14Z

mikjee
Feb 12, 2024

Is the loading of model synchronous? So far it seems the model gets loaded when a context instance is initiated, and it happens synchronously.

Would it be possible to do this asynchronously?

6 replies

mikjee Feb 12, 2024

Not really, no. If you mean that the model does not load until you have sent your first prompt, then thats not the case with me. For me the model loads when I declare the context. Dunno if I am missing something in the APIs.

const embeddingContext = new LlamaEmbeddingContext({
	model: embeddingModelBeta,
	contextSize: 8192,
	batchSize: 512, 
	threads: 1,
});

I was wondering if there is a way to do this async, something like -

await embeddingContext.initModel((progress) => progressCallback(progress));

Because the context declaration blocks the main thread in Nodejs. I havent tried using worker threads, maybe that is a solution to this for now, but I would really like to be able to load the model explicitely.

Also, if there was a way to programmatically get the progress of the model being loaded, it would be useful for the purpose of a more responsive UI in my app. I want to be able to load the model and show a progress of the loading process, all without blocking the responsiveness of the UI. The console shows the progress of the loading process but I guess theres currently no way for the library to report that progress back to my code.

mrddter Feb 13, 2024

I got it now. I don't know a way to do it, sorry; we have to wait for Gilad.

giladgd Feb 13, 2024
Maintainer Author

I'm experimenting with ways to load a model asynchronously with llama.cpp's constraints, but since llama.cpp doesn't support multithreading properly it makes it more tricky to do it in a stable manner.
When I officially add it to node-llama-cpp I'll make sure to provide an option to pass a callback to get loading progress events.

mikjee Feb 16, 2024

Thank you, that'd be awesome 👍
I am using worker threads for this and it works fine for now.

giladgd Mar 16, 2024
Maintainer Author

@mikjee I've just released a new beta version with support for asynchronous model and context loading

stewartoallen · 2024-02-19T19:11:11Z

stewartoallen
Feb 19, 2024

the sample readme code no longer runs with the latest beta. looks like LlamaModel() now requires a { llama } parameter which is acquired from getLlama().

dev % node readme-test.mjs
file:///.../dev/build/node-llama-cpp/dist/evaluator/LlamaModel.js:28
        this._model = new this._llama._bindings.AddonModel(path.resolve(process.cwd(), modelPath), removeNullFields({
                                      ^

TypeError: Cannot read properties of undefined (reading '_bindings')
    at new LlamaModel (file:///.../dev/build/node-llama-cpp/dist/evaluator/LlamaModel.js:28:39)
    at file:///.../dev/test.mjs:7:15

1 reply

giladgd Feb 20, 2024
Maintainer Author

I haven't updated the README.md file or the documentation for version 3 beta yet since the API still changes frequently in the beta.
The README.md file and documentation will be updated with the stable release of version 3.
Until then, you can find up-to-date usage examples in the version 3 PR page.

DennisKo · 2024-02-28T08:06:55Z

DennisKo
Feb 28, 2024

initialising a model swallows errors

This code fails for me (its still something about the batchSize: 4096 but can be ignored for now). It does not reach the catch part though and just silently "does not work" - nothing happens.

const initModel = async () => {
  try {
    const { getLlama, LlamaModel, LlamaContext, LlamaChatSession } = await import('node-llama-cpp')
    const mPath = path.resolve(resourcesPath, 'models', 'mistral-7b-instruct-v0.2.Q4_K_M.gguf')
    const llama = await getLlama()
    model = new LlamaModel({
      modelPath: mPath,
      llama
    })
    context = new LlamaContext({
      model,
      contextSize: Math.min(4096, model.trainContextSize),
      batchSize: 4096
    })
    session = new LlamaChatSession({
      contextSequence: context.getSequence(),
      autoDisposeSequence: true
    })
  } catch (error) {
    console.log('error', error)
    throw new Error(`${error}`)
  }
}

"node-llama-cpp": "3.0.0-beta.12"

4 replies

giladgd Mar 2, 2024
Maintainer Author

You don't perform any evaluation against the model in this code, so it makes sense it won't do anything.
Try adding this:

const res = await session.prompt("How much is 6+6?");
console.log(res);

DennisKo Mar 3, 2024

ah sorry thats just a bad example. I am doing

await session.prompt(req.body.prompt, {
        onToken(chunk) {
          handleChunk(chunk) // basically res.write()
        }
      })

giladgd Mar 16, 2024
Maintainer Author

I tried running the code you attached here and couldn't reproduce the issue you described.
If the issue persists, please open an issue with full code and instructions to reproduce it, including a link to the model you used so I can investigate it.

juned-adenwalla Apr 12, 2024

Here's my complete code I have the same issue

import { fileURLToPath } from "url";
import path from "path";
import { getLlama, LlamaChatSession } from "node-llama-cpp";

try {
    const __dirname = path.dirname(fileURLToPath(import.meta.url));

    const llama = await getLlama({
        gpu: false
    });

    const model = await llama.loadModel({
        llama,
        modelPath: path.join(__dirname, "models", "phi-2-orange.Q2_K.gguf")
    });

    const context = await model.createContext({
        model,
        contextSize: Math.min(4096, model.trainContextSize),
        batchSize: 1
    });

    const session = new LlamaChatSession({
        contextSequence: context.getSequence()
    });

    const q1 = "Hi there, how are you?";
    console.log("User: " + q1);

    const a1 = await session.prompt(q1);
    console.log("AI: " + a1);

    const q2 = "Summarize what you said";
    console.log("User: " + q2);

    const a2 = await session.prompt(q2);
    console.log("AI: " + a2);

} catch (error) {
    console.error("Error occurred:", error);
}

StrangeBytesDev · 2024-03-16T19:33:54Z

StrangeBytesDev
Mar 16, 2024

How is the chat formatting chosen for a given model when using LlamaChat? From my limited poking around the code, and some experiments with some models, it looks like its looking at the model name to estimate which chat syntax to use. When I run openhermes-2.5-mistral-7b for example, it seems to be using the syntax of Mistral Instruct, even though this model uses ChatML.

More specifically, would it be possible to utilize tokenizer.chat_template from metadata to more accurately determine the formatting?

6 replies

giladgd Mar 16, 2024
Maintainer Author

You can manually pass a specific chat wrapper to a LlamaChatSession or LlamaChat to force it to use the correct format:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, ChatMLChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "openhermes-2.5-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize)
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new ChatMLChatWrapper()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summerize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

StrangeBytesDev Mar 16, 2024

Makes plenty of sense. Seems like that gives the user all the flexibility to implement a chat syntax however they like, including non standard ones, but also is a pretty simple way to get good defaults.
Have you considered using something like Huggingface's JS Jinja library for implementing Chat templating, so that you could directly use the template string provided by a model, rather than having to implement that logic yourself?

giladgd Mar 20, 2024
Maintainer Author

The Jinja library doesn't provide all of the functionality that a ChatWrapper can provide, so I don't plan to replace ChatWrapper just yet, but I plan to make it use a JinjaTemplateChatWrapper with the tokenizer.chat_template template string when there's no better chat wrapper available.
I'll also change the detection algorithm to try to find a better chat wrapper based on the tokenizer.chat_template template string if it exists, so it would increase the stability of the chat format handling significantly.

giladgd Apr 5, 2024
Maintainer Author

@StrangeBytesDev The latest beta version now includes support for JinjaTemplateChatWrapper and makes use of the tokenizer.chat_template metadata header

StrangeBytesDev Apr 6, 2024

This is awesome. I tested it out on a handful of my favorite models and it seems to be working great. Jinja seems to be becoming a pretty well standardized format with Huggingface promoting it's use, so being able to use it here is super useful.
Thank you Giladgd.

jrobinson01 · 2024-04-03T12:06:45Z

jrobinson01
Apr 3, 2024

Trying out the beta and running into an issue with promptWithMeta(prompt, {onToken: (tokens) => console.log(context.decode(tokens)}) where context.decode appears to no longer exist. Is there a replacement?

1 reply

brandon-e2e Apr 3, 2024

You should call detokenize which comes from the model, not context.

linonetwo · 2024-04-13T16:11:35Z

linonetwo
Apr 13, 2024

@giladgd Hi, may I ask what this mean?

{
  "type": "specialTokensText",
  "value": "system: ",
}

I'm reading code of ChatWrapper, and this is in the test of it.

And seems this.systemRoleName and this.userRoleName is not used by ChatWrapper?

I'm currently using a simple prompt template like this (not using ChatWrapper yet, as I'm still not sure how it works):

CONTEXT:
SYSTEM:
You know Tiddlywiki. You should answer the questions in wikitext format

* ! indicates a title
* # indicates a ordered list item
* * for an unordered list item
* The rest is similar to markdown

When using mermaid to draw mind maps, The generated mind map is wrapped in the form of

$$$text/vnd.tiddlywiki.mermaid
graph TD
    A –> B
$$$

You will play the role of a knowledge management expert, generating sensible and professional answers.USER: Hi
ASSISTANT:

And it's genreated result is very bad on qwen1_5-32b-chat-q4_k_m.gguf, I'm not sure what's the problem, maybe llama.cpp requires we using special format like specialTokensText?

Seems special token is auto generated based on model? So we must use ChatWrapper to ensure this. And this.systemRoleName and this.userRoleName will be read by the model later?

It generates some special character between Chinese character, I'm not sure if it is modelInstance.detokenize's bug?

Update:

I'm now using https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/openchat.jinja (which is said to compatable with qwen1.5 https://github.com/chujiezheng/chat_templates/blob/main/generation_configs/qwen2-chat.json ) But still get same messy result.

        const chatWrapper = new JinjaTemplateChatWrapper({
          template: "{{ 'System: ' + systemPrompt if systemPrompt else '' }}{{ 'User: ' + userInput if userInput else '' }}",
          ...templates,
        });
        const session = new LlamaChatSession({
          contextSequence: contextSequenceInstance,
          autoDisposeSequence: false,
          systemPrompt: sessionOptions?.systemPrompt,
          chatWrapper,
        });
        await session.prompt(completionOptions.prompt, {
          ...completionOptions,
          signal: abortController.signal,
          onToken: (tokens) => {
            if (modelInstance === undefined) {
              abortController.abort();
              runnerAbortControllers.delete(conversationID);
              subscriber.next({ type: 'result', token: texts.disposed, id: conversationID });
              subscriber.complete();
              return;
            }
            updateTimeout();
            subscriber.next({ type: 'result', token: modelInstance.detokenize(tokens), id: conversationID });
          },
        });

"node-llama-cpp": "3.0.0-beta.15" win11 gpu prebuild binary.

1 reply

giladgd Apr 19, 2024
Maintainer Author

When you use a chat wrapper, it transforms a chat history into a content (LlamaText) that should be tokenized and loaded into the sequence context state of the model.
This content usually consists of the raw text of system messages, user messages, and model responses with some more text around it that consists of the chat format - this added chat format text can contain special tokens that may have to be tokenized differently from regular text.
For example, tokenizing <s> on a Llama2 model with special tokens enabled would produce a BOS token, while tokenizing it as text will pass the text <s> to the model as-is.

A LlamaText is used to combine different things together (for convenience) before passing them down for tokenization, like holding regular text strings that will be tokenized as regular text without any special tokens, and SpecialTokensTexts that will be tokenized with special tokens enabled.
The {"type": "specialTokensText", "value": "system: " } you see in the tests mean that the text system: will be tokenized with special tokens enabled, so if special tokens will be found in this text, they would be tokenized differently than regular text.

brandon-e2e · 2024-04-24T01:19:19Z

brandon-e2e
Apr 24, 2024

Sorry if this has been asked before but do you have any plans of adding dynamic temperature into the beta? If not, is this something that can be contributed (and if so, do you have any recommendations)?

I've been using it for quite a while now and its impact is noticeable, especially on creativity.

Edit: Added two reference links

Ref: https://github.com/ggerganov/llama.cpp/pull/4972/files
Ref: https://github.com/ggerganov/llama.cpp/blob/4e96a812b3ce7322a29a3008db2ed73d9087b176/common/sampling.cpp#L153

3 replies

giladgd Apr 24, 2024
Maintainer Author

Thanks for the suggestion, I'll look into it.
Do you have a recommendation for a model and a prompt I can use to try it out where using it produces significantly better results than using temperature alone?

giladgd Apr 24, 2024
Maintainer Author

Contributions are always welcome, but I suggest waiting a bit for the stable release of version 3, since contributing will be much easier then due to significantly improved documentation

brandon-e2e Apr 25, 2024

Sure, I'll post an update here this weekend with one of my prompts and a benchmark once I'm back in town. Most of my usage is with Midnight-Miqu-70B-v1.5.

I noticed that EQ-Bench recently added a creativity with judging benchmark. I'm looking to run that using llama.cpp, with and without dynatemp, to get objective metrics for evaluating this sampler. That should at least help with determining if this is confirmation bias I would think. :)

Also, I found a good description of dynamic temperature from the creator for more background on how it works if you were curious.

giladgd · 2024-04-26T00:19:35Z

giladgd
Apr 26, 2024
Maintainer Author

Transferred @nathanlesage's commend from #105 (comment):

I have currently two issues preventing me from updating from beta.13 to beta.17.

Beginning with beta.14, loading models always fails with defaults that have worked in beta.13 with the error message that allegedly the settings require more VRAM than I have (given that the same settings work in beta.13, I believe this to be incorrect). I unfortunately don't know if the bug originates with this library or with llama.cpp

During development, the library itself will load fine, but after packed it will tell me that it requires the (old) ggml-meta.metal library which, as far as I can see, has been replaced with default.metallib – do you have an idea what might cause this? I am a bit confused that the library would complain about a missing library that appears to have been consciously replaced with a different target…?

It would be great if you could give me some pointers so that I can debug it!

(Also: You've mentioned in the changelog to beta.17 that the lib now supports Llama3, but I can confirm that beta.13 works fine with quantized Llama3-models!)

2 replies

giladgd Apr 26, 2024
Maintainer Author

@nathanlesage Can you please share the result of running this command?

npx --yes node-llama-cpp@beta inspect gpu

Also, please provide me with steps to reproduce this issue so I can investigate why it happens, such as the model file URL, parameters you used, etc.

Regarding 2, does this happen using the prebuilt binaries or only after you build llama.cpp from source using npx --no node-llama-cpp download?
It doesn't happen on my Mac machine.

The Llama 3 support was to refine some edge cases to make using Llama 3 more stable, as well as adding it to the suggested models when using the chat command with no model parameter.

giladgd Apr 26, 2024
Maintainer Author

On second thought, regarding 2, is there a chance that default.metallib isn't included on the same folder as llama-addon.node after packing?

You can also try using the LLAMA_METAL_EMBED_LIBRARY cmake option in your code.
Let me know whether it helped you.
To do so, make sure you add it to your getLlama usage on macOS:

const llama = await getLlama({
    cmakeOptions: process.platform === "darwin"
        ? {"LLAMA_METAL_EMBED_LIBRARY": "1"}
        : {}
});

And since I assume you're packing your code for Electron, make sure you build with this parameter beforehand by either running the above code in nodejs in your project directory beforehand or manually running this command:

NODE_LLAMA_CPP_CMAKE_OPTION_LLAMA_METAL_EMBED_LIBRARY=1 npx --no node-llama-cpp download

chadkirby · 2024-05-01T19:21:45Z

chadkirby
May 1, 2024

getGgufMetadataArchitectureData explodes when ggufMetadata.general is undefined. Which prevents the use of a "dummy" .gguf file as suggested here.

2 replies

giladgd May 2, 2024
Maintainer Author

Why would you want to load a model file that doesn't have a general header?
The general.architecture header (for example) is crucial for llama.cpp to function; without it, the file is not usable and cannot be loaded.

chadkirby May 3, 2024

For unit testing. I was hoping that a llama.cpp "dummy" .gguf file would let me test that my implementation can load/dispose models & manage contexts/sessions without having to figure out how to stub out node-llama-cpp.

iimez · 2024-05-06T14:50:03Z

iimez
May 6, 2024

Hello and thanks for your work Gilad!

I've been implementing an OpenAI compatible API on top of the beta during the last weekends. Some feedback/questions I collected on the way

Keeping cancelled completions in context
When aborting a signal passed into LlamaChatSession.promptWithMeta the user message (and partial response of the model) are not left in the context and chat history. That's a good default I think, but I noticed that in practice its sometimes more intuitive to the user that the aborted partial generation is kept in context (some would cancel and then answer "no thats not what i meant, [...]") so I'd like to build that on top. I managed to capture the partial output, catch the AbortError etc, but whats the right way to put the turn back on the context and into chat history?

Output issue on longer context / shift
When testing out long conversations with llama3 8b yesterday I noticed that the model would suddenly start to produce garbage once the context reaches its limit. I'm suspecting this is related to the context shift that's happening but I'm not sure. To reproduce, the test is here, the model is defined here and no context shift strategy is configured here. Should be able to clone and npm run test:llama to reproduce but I'm happy to provide a more minimal reproduction if needed.

Custom stop generation triggers plus chat wrappers
To make OpenAI-like stop work while utilizing node-llama-cpp's automatic chat wrapper resolution I had to resort to what feels like quite the hack. I needed a way to customize the stop trigger per generation (not per chat session) so I wrapped the resolved chat wrapper to inject a property + a setter that allows me to do that. I wonder if there's a better way to do it. See withCustomStopGenerationTrigger in the file linked above.

24 replies

giladgd Jun 30, 2024
Maintainer Author

I have released a new version that includes the fix for the max call stack

iimez Jun 30, 2024

The logspam is not too bad tho, nothing urgent. I just figured I should mention it because having CUDA installed but no device might be uncommon edge case.
On Vulkan issue: I found it curious that it seems to only happen on text completions for me, but not for chat? Thats why I suspected a relation to templates as well. Some form of precision/rounding error? I'll try later what happens if I use the chat template with text completions; and different gpu layer settings.

giladgd Jul 3, 2024
Maintainer Author

I agree that it's peculiar that it only happens with completions and not with a chat session, but this is an issue in the Vulkan code, as the generation output of a model is supposed to be the same across all compute layers (be it Vulkan or CUDA).
I think there might be a miscalculation somewhere in the Vulkan code, so explaining why some input can be completed properly while another can't would be hard and probably extremely technical.

iimez Sep 7, 2024

Small bug in beta.44: The lora property of LlamaModelOptions exists on the LlamaModel.ts version of the type, but not in the d.ts typedef version in LlamaModel.d.ts, latter seems to take precedence in type checking.

giladgd Sep 7, 2024
Maintainer Author

There's no lora option for loading a model anymore, it was moved to the context level, so you can reuse the same model with multiple contexts that each has a different lora configuration.

stewartoallen · 2024-06-20T14:28:04Z

stewartoallen
Jun 20, 2024

what is the proper way to unload/reload sessions/context/context-sequences? trying to do the simple thing of pausing and resuming chat sessions. is it simply using get/set chat history on a chat session where chat sessions are mapped to context sequences? working memory is the primary constraint so I'm also trying to understand lifecycle management and the intent for each of these abstractions.

3 replies

giladgd Jun 21, 2024
Maintainer Author

If you'd like to simply save the current chat session state so you can resume it later, you can get the chat history and set it again later to resume from where you left off:

import {fileURLToPath} from "url";
import path from "path";
import fs from "node:fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const chatHistoryFilePath = path.join(__dirname, "chatHistory.json");

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const chatHistory = chatSession.getChatHistory();
await fs.writeFile(chatHistoryFilePath, JSON.stringify(chatHistory));

Then, in a new instance of your code:

import {fileURLToPath} from "url";
import path from "path";
import fs from "node:fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const chatHistoryFilePath = path.join(__dirname, "chatHistory.json");

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const chatHistory = JSON.parse(await fs.readFile("chatHistory.json", "utf8"));
chatSession.setChatHistory(chatHistory);

const q2 = "Repeat your answer";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Every time you prompt the model using a LlamaChatSession, it "renders" the state that should be loaded into the context sequence and only evaluates the remaining state that has not already been loaded.
There are also more optimizations in place to attempt to utilize the current context sequence state as much as possible to avoid context shifts.

I plan to add support for saving a more comprehensive state to make the saving and loading of state even more efficient and optimized, but the current solution that already exists with LlamaChatSession would suffice until then (the new feature will be introduced as a non-breaking change after the stable version 3 release).

After you're done with a context sequence, instead of disposing it, you can reuse it for a different LlamaChatSession to utilize the optimization that uses the current context sequence state as much as possible, to make the response of the first prompt in the new chat begin sooner.
Note that it's important to make sure you don't use the same context sequence in more than one active chat session on the same time, as doing so would make both chat sessions conflict as they compete for the same resource.

I'm currently working on the documentation for version 3, and it'll have a more comprehensive explanation of how all of this works.

giladgd Jun 21, 2024
Maintainer Author

To unload something from the memory, you can call the .dispose() method on it; doing so will also dispose all dependent objects.
For example, disposing a context will also dispose all of its context sequences, and it'll then dispose all of the chat sessions that use those context sequences, etc.
Disposing a model will dispose all of its contexts, and disposing a Llama will dispose all of the models loaded with it.

stewartoallen Jun 21, 2024

Great. Thanks for the explanation. I look forward to the documentation and more detail on context sequences.

haldunanil · 2024-06-21T20:11:41Z

haldunanil
Jun 21, 2024

Hi! Great library, thanks for all the hard work. We have a few questions regarding the beta:

Can we create a model singleton to avoid loading the model each time?
What is the timeline for file based caching and how complex will it be to implement?
Why does it seem like in some runs a cache persists between them when running on Linux vs. MacOS? Is there some secret sauce to making sure the cache works as intended?

7 replies

giladgd Jun 22, 2024
Maintainer Author

You can create a context that has more sequences like this:

const context = await model.createContext({
    sequences: 2
})

Note that each sequence you specify increases the memory usage of the context, but this is still more memory-efficient and compute-efficient than creating multiple contexts.
I plan to add a new kind of context in the future that handles more memory management tasks automatically, so you won't have to specify how many sequences you want to allocate before you use them, and it'll also have some memory optimizations around sequences.

haldunanil Jun 22, 2024

That's helpful thanks, I tried that with this code (gguf file from here):

// Put together the file path
const modelPath = path.join(dirPath, "llama-2-7b-chat.Q4_K_M.gguf");

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath,
});
const context = await model.createContext({
  contextSize: Math.min(4096, model.trainContextSize),
  sequences: 2,
});

const systemPrompt = `You are a fair judge assistant tasked with providing clear, objective feedback based on a specific criterion, ensuring each assessment reflects the absolute standards set for performance.

You will be given a response to evaluate, a binary criterion to evaluate against, and an optional instruction (might include an Input inside it). You must provide feedback based on the given criterion and the response.

Please follow these guidelines:

1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing feedback, write a determination True or False about whether the response meets the binary criteria specified.
3. Do not generate any other opening, closing, and explanations.
4. Keep your feedback concise and clear, do not repeat yourself and do not exceed 280 characters for the feedback.
5. Only describe the boolean result as True/False, do not use any other words to describe the result.`;

const session = new LlamaChatSession({
  contextSequence: context.getSequence(),
  autoDisposeSequence: false,
  systemPrompt,
});

const q1 = `### Response to evaluate:

I am not a recipe.

### Score Rubrics:

[Is this a recipe?]:

- False: The response being evaluated does not meet the criterion described in the square brackets.
- True: The response being evaluated does meet the criterion described in the square brackets.

### Feedback:`;
// console.log(`User: ${q1}`);

console.time("a1");
const a1 = await session.prompt(q1);
console.timeEnd("a1");
// console.log(`AI: ${a1}`);

// session.dispose();

const session2 = new LlamaChatSession({
  contextSequence: context.getSequence(),
  autoDisposeSequence: false,
  systemPrompt,
});

const q1a = `${q1}\n(Return feedback as JSON)`;
// console.log(`User: ${q1a}`);

console.time("a1a");
const a1a = await session2.prompt(q1a);
console.timeEnd("a1a");

After running it three times, I get these outputs:

Run	a1	a1a
1	3.979s	2.475s
2	3.771s	2.677s
3	4.982s	2.730s

Is that about what I should expect perf-improvement wise?

On another note, I get this message every time I run this:

[node-llama-cpp] Using this model ("~/.../llama-2-7b-chat.Q4_K_M.gguf") to tokenize text with special tokens and then detokenize it resulted in a different text. There might be an issue with the model or the tokenizer implementation. Using this model may not work as intended

All of this is on Metal, btw. M1 Pro w/ 16GB RAM

giladgd Jun 22, 2024
Maintainer Author

In the code you shared, you can reuse the existing context sequence, which will then utilize less memory and generally be faster due to the aforementioned optimizations.
Using multiple sequences is useful when you need to keep an existing state in memory, or evaluate multiple inputs simultaneously. When doing synchronous work, it's usually preferable to only use a single context sequence.

I've revised your code a bit to reuse the same context sequence:

Revised code

// Put together the file path
const modelPath = path.join(dirPath, "llama-2-7b-chat.Q4_K_M.gguf");

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath,
});
const context = await model.createContext();

const systemPrompt = `You are a fair judge assistant tasked with providing clear, objective feedback based on a specific criterion, ensuring each assessment reflects the absolute standards set for performance.

You will be given a response to evaluate, a binary criterion to evaluate against, and an optional instruction (might include an Input inside it). You must provide feedback based on the given criterion and the response.

Please follow these guidelines:

1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing feedback, write a determination True or False about whether the response meets the binary criteria specified.
3. Do not generate any other opening, closing, and explanations.
4. Keep your feedback concise and clear, do not repeat yourself and do not exceed 280 characters for the feedback.
5. Only describe the boolean result as True/False, do not use any other words to describe the result.`;

const contextSequence = context.getSequence();
const session = new LlamaChatSession({
    contextSequence,
    autoDisposeSequence: false,
    systemPrompt,
});

const q1 = `### Response to evaluate:

I am not a recipe.

### Score Rubrics:

[Is this a recipe?]:

- False: The response being evaluated does not meet the criterion described in the square brackets.
- True: The response being evaluated does meet the criterion described in the square brackets.

### Feedback:`;
// console.log(`User: ${q1}`);

console.time("a1");
const a1 = await session.prompt(q1);
console.timeEnd("a1");
// console.log(`AI: ${a1}`);

// session.dispose();

const session2 = new LlamaChatSession({
    contextSequence,
    autoDisposeSequence: false,
    systemPrompt,
});

const q1a = `${q1}\n(Return feedback as JSON)`;
// console.log(`User: ${q1a}`);

console.time("a1a");
const a1a = await session2.prompt(q1a);
console.timeEnd("a1a");

About 2 months ago llama.cpp introduced a few changes in the tokenizer; generally, those changes are good and help to produce better generation results, but they may break some old GGUF files, so I've added a detection for when that may be the case, which prints out this warning message.
Your model may still work just fine, but if you notice low-quality generation, this may be the cause.
The solution is to either get a new GGUF file of the model you use (a newer conversion of the original model will result in a GGUF file that works better with the new llama.cpp implementation), or use GGUF-my-repo to convert the original model to GGUF.
Michael Radermacher converts a lot of models to GGUF, so his page is a good source of GGUF models.
Try to use Llama 3 Instruct instead of Llama 2, as it's significantly smarter and its tokenizer is more efficient, which makes it possible to process longer texts using the same context size.

haldunanil Jun 23, 2024

Excellent, this is a lot faster, thank you! Is there anything I should watch out for when reusing the contextSequence? Previously, I could have checked sequencesLeft on the context, but I'm guessing that doesn't apply here.

giladgd Jun 23, 2024
Maintainer Author

The main thing is to make sure you're not using the same context sequence for more than one evaluation in parallel.
When creating a context, it allocates memory for the number of sequences you specify (1 by default), which is why calling .getSequence() when all the sequences are already in use results in an error.
Calling .getSequence() once should be safe, or calling .dispose() on a context sequence before calling .getSequence() should also be safe.

SimpleVictor · 2024-06-28T16:28:41Z

SimpleVictor
Jun 28, 2024

Sorry in advance if this is not the right place to ask but Is there an active discord channel for this package/beta?

2 replies

giladgd Jun 28, 2024
Maintainer Author

I prefer to use GitHub Discussions for communications since it makes it easier for people new to this library to search for information in existing discussions, and relevant information shows up on Google, which is helpful when looking for stuff.
I contemplated opening a Discord server, but I think GitHub Discussions is good enough for now.

If you have a question, feel free to start a discussion.

debrisapron Jul 10, 2024

Strongly support this decision. Much better discoverability.

chadkirby · 2024-06-28T22:27:26Z

chadkirby
Jun 28, 2024

In 3.0.0-beta.32, we're crashing on windows (not macOS) when we try to dispose of an embedding context after creating a chat context. Here's a minimal repro. This exits cleanly on Mac, but the whole process is killed (Access Violation) in windows.

import path from "node:path";
import os from "node:os";
import { getLlama } from "node-llama-cpp";

const modelDir = ...;

// create a chat context as if we were going to chat with the model
const chatLlama = await getLlama();
const chatModel = await chatLlama.loadModel({
  modelPath: path.join(modelDir, "Phi-3-mini-4k-instruct-q4.gguf")
});
await chatModel.createContext();

// create an embedding context
const embedLlama = await getLlama();
const embedModel = await embedLlama.loadModel({
  modelPath: path.join(modelDir, "all-MiniLM-L6-v2-Q8_0.gguf")
});
const embeddingContext = await embedModel.createEmbeddingContext();

const embedding = await embeddingContext.getEmbeddingFor("Hello");
console.log(embedding.vector.slice(0, 10));

// not a problem in macOS, but in windows, disposing the chatLlama at
// this point will kill the entire process (Access Violation)
// %errorlevel% -1073741819
await embedLlama.dispose();

4 replies

giladgd Jun 29, 2024
Maintainer Author

You're supposed to call await getLlama() only once and use it for all of your needs throughout the process's life; although this might not be what causes the crash, it might help avoid it.

Can you please run this command and attach the output of it from the Windows machine you use?

npx --no node-llama-cpp inspect gpu

chadkirby Jun 29, 2024

Thanks for the information.

Here is the GPU inspection

OS: Windows 10.0.19045 (x64)
Node: 20.15.0 (x64)
node-llama-cpp: 3.0.0-beta.32

Vulkan: available

Vulkan device: Intel(R) UHD Graphics 620
Vulkan used VRAM: 0% (0B/7.92GB)
Vulkan free VRAM: 100% (7.92GB/7.92GB)

CPU model: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
Used RAM: 32.63% (5.17GB/15.84GB)
Free RAM: 67.36% (10.67GB/15.84GB)

I modified the repro to avoid calling getLlama() multiple times. This is still getting killed (Access Violation) in windows:

import path from "node:path";
import os from "node:os";
import { getLlama } from "node-llama-cpp";

const modelDir = ...;

// create a chat context as if we were going to chat with the model
const llama = await getLlama();
const chatModel = await llama.loadModel({
  modelPath: path.join(modelDir, "Phi-3-mini-4k-instruct-q4.gguf")
});
await chatModel.createContext();

// create an embedding context
const embedModel = await llama.loadModel({
  modelPath: path.join(modelDir, "all-MiniLM-L6-v2-Q8_0.gguf")
});
const embeddingContext = await embedModel.createEmbeddingContext();

const embedding = await embeddingContext.getEmbeddingFor("Hello");
console.log(embedding.vector.slice(0, 10));

// not a problem in macOS, but in windows, disposing the embedModel at
// this point will kill the entire process (Access Violation)
// %errorlevel% -1073741819
await embedModel.dispose();

giladgd Jun 29, 2024
Maintainer Author

There's an issue with the Vulkan backend of llama.cpp where using more than one context at the same time doesn't work as expected.
I've opened an issue for this on llama.cpp.

Until this issue is fixed, to workaround it, you can make sure you don't have more than one context loaded at the same time by disposing the previous context (using await for this is crucial) before creating a new one.

chadkirby Jul 1, 2024

many thanks

bqhuyy · 2024-07-01T09:53:33Z

bqhuyy
Jul 1, 2024

hi @giladgd , I've met a problem when using node-llama-cpp with cuda. This works normally with cpu, vulkan. The model keeps generating dummy token like this.
version: 3.0.0-beta.36

1 reply

giladgd Jul 3, 2024
Maintainer Author

I replied to you in the issue you opened for this

dabs9 · 2024-07-07T20:37:55Z

dabs9
Jul 7, 2024

@giladgd how would I take advantage of continuous batching in node llama? Is it on by default if I make multiple async calls to .prompt? If, so, do all of these calls need to share the same context in order for continuous batching to occur? If not, is there some configuration that needs to be toggled to enable it?

3 replies

giladgd Jul 7, 2024
Maintainer Author

If you create a context with multiple sequences (via model.createContext({sequences: 2}) for example), parallel evaluations from the sequences of the context will be automatically batched.
For example:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext({
    sequences: 2
});

const sequence1 = context.getSequence();
const sequence2 = context.getSequence();

const session1 = new LlamaChatSession({
    contextSequence: sequence1
});
const session2 = new LlamaChatSession({
    contextSequence: sequence2
});

const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";

const [
    a1,
    a2
] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);

console.log("User: " + q1);
console.log("AI: " + a1);

console.log("User: " + q2);
console.log("AI: " + a2);

SimpleVictor Jul 10, 2024

Speaking of sequences. @giladgd, what are some scenarios of when I should use multiple sequence?

In the electron demo app, you have auto generated pre-prompt that populates the input field as you type. Along with that, you can chat with the LLM too on hitting enter. Was this an example usage of using 2 sequences?

giladgd Jul 10, 2024
Maintainer Author

Context sequences are useful for holding multiple independent states at the same time (like independent chats, for example), since continuing evaluation on an existing state is much faster than loading a state from scratch (although I plan to also optimize this use case).
When you use multiple context sequences you also benefit from batching (a performance optimization) when you evaluate inputs on multiple sequences in parallel.

The demo Electron app is the Electron template used to generate a new Electron project when you run this command:

npm create --yes node-llama-cpp@beta

You can inspect the code of the generated project to see how this functionality is implemented, but essentially it uses the .completePrompt(...) method on a chat session to complete the partial prompt the user writes; this is more memory-efficient than creating another context sequence for this since it doesn't require allocating more resources and it uses the existing chat history as context for the completion.

haldunanil · 2024-07-13T18:58:00Z

haldunanil
Jul 13, 2024

I'm having some trouble reconciling llama-server and node-llama-cpp outputs, what are the things I should be looking out for? The outputs of node-llama-cpp are the ones I want and make sense, llama-server on the other hand has two issues, see below.

Using node-llama-cpp

I'm using node-llama-cpp 3.0.0-beta.38.

Input

const llama = await getLlama({
  vramPadding: 0,
  debug: false,
});

const model = await llama.loadModel({ 
  modelPath: <model-path>, 
  gpuLayers: 33
});

const context = await model.createContext({
  contextSize: 512,
  seed: 9,
});

const contextSequence = context.getSequence();

const session = new LlamaChatSession({
  contextSequence,
  systemPrompt,
  autoDisposeSequence: false,
});

const grammar = new LlamaJsonSchemaGrammar(this.llama, grammar);

const answer = await session.prompt(prompt, options);

console.log(answer)

Output

{ "feedback": "The response starts with 'Hello', which meets the criterion of saying hello. The rest of the text is irrelevant to this evaluation. Therefore, [Says hello] is True.","result": true }

This is what I expect, yay!

Using llama-server

The server is started using this command:

llama-server -m model/model.gguf --port 8080 --cont-batching --gpu-layers 33 --ctx-size 16384 --batch-size 16384 --parallel 32 --mlock

Here's the version for llama-server:

version: 3384 (4e24cffd)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0

Input

Here's the payload, via http:localhost:8080/completion:

{
    "prompt": "### Response to evaluate:\n\nHello, world!\n\n### Score Rubrics:\n\n[Says hello]:\n\n- False: The response being evaluated does not meet the criterion described in the square brackets.\n- True: The response being evaluated does meet the criterion described in the square brackets.\n\n### Feedback:",
    "temperature": 0,
    "seed": 9,
    "system_prompt": "You are a fair evaluation assistant tasked with providing clear, objective, self-consistent feedback based on a specific criterion.\n\nYou will be given a response to evaluate, a binary criterion to evaluate against, and (optionally) additional context to consider. You must provide feedback based on the given criterion and the response.\n\nPlease follow these guidelines:\n1. Write a detailed feedback that assess the quality of the response strictly based on the binary criterion.\n2. Your feedback should end by explicitly stating whether the criterion is met, explicitly using the words True or False.\n3. Keep your feedback concise and clear, do not repeat yourself and do not exceed 280 characters for the feedback.",
    "json_schema": {
        "type": "object",
        "properties": {
            "feedback": {
                "type": "string"
            },
            "result": {
                "type": "boolean"
            }
        }
    }
}

### Output

```json
{
    "content": "{\"feedback\": \"The response is a simple 'Hello, world!' message, which does not meet the criterion of saying hello. The response does not explicitly state 'hello' or any variation of it. Therefore, the criterion is not met. False.\"} ",
    "id_slot": 0,
    "stop": true,
    "model": "model/model.gguf",
    "tokens_predicted": 54,
    "tokens_evaluated": 56,
    "generation_settings": {
        "n_ctx": 512,
        "n_predict": -1,
        "model": "model/model.gguf",
        "seed": 9,
        "temperature": 0.0,
        "dynatemp_range": 0.0,
        "dynatemp_exponent": 1.0,
        "top_k": 40,
        "top_p": 0.949999988079071,
        "min_p": 0.05000000074505806,
        "tfs_z": 1.0,
        "typical_p": 1.0,
        "repeat_last_n": 64,
        "repeat_penalty": 1.0,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "penalty_prompt_tokens": [],
        "use_penalty_prompt_tokens": false,
        "mirostat": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.10000000149011612,
        "penalize_nl": false,
        "stop": [],
        "n_keep": 0,
        "n_discard": 0,
        "ignore_eos": false,
        "stream": false,
        "logit_bias": [],
        "n_probs": 0,
        "min_keep": 0,
        "grammar": "boolean ::= (\"true\" | \"false\") space\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nfeedback-kv ::= \"\\\"feedback\\\"\" space \":\" space string\nfeedback-rest ::= ( \",\" space result-kv )?\nresult-kv ::= \"\\\"result\\\"\" space \":\" space boolean\nroot ::= \"{\" space  (feedback-kv feedback-rest | result-kv )? \"}\" space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
        "samplers": [
            "top_k",
            "tfs_z",
            "typical_p",
            "top_p",
            "min_p",
            "temperature"
        ]
    },
    "prompt": "### Response to evaluate:\n\nHello, world!\n\n### Score Rubrics:\n\n[Says hello]:\n\n- False: The response being evaluated does not meet the criterion described in the square brackets.\n- True: The response being evaluated does meet the criterion described in the square brackets.\n\n### Feedback:",
    "truncated": false,
    "stopped_eos": true,
    "stopped_word": false,
    "stopped_limit": false,
    "stopping_word": "",
    "tokens_cached": 109,
    "timings": {
        "prompt_n": 56,
        "prompt_ms": 405.85,
        "prompt_per_token_ms": 7.247321428571429,
        "prompt_per_second": 137.98201305901193,
        "predicted_n": 54,
        "predicted_ms": 2668.323,
        "predicted_per_token_ms": 49.41338888888889,
        "predicted_per_second": 20.237430026274932
    }
}

Note that:

Unlike node-llama-cpp, this result outputs that it's false (even though it should be true)
The output result object doesn't even have a "result" boolean

Do you know what might be causing this discrepancy?

14 replies

giladgd Jul 16, 2024
Maintainer Author

Does that make sense to you or is there something we missed?

The setup of each file will load the model again and create a new context, which can take some time.
I've just tested it to make sure and it seems that console logs are hidden when printed in setup files for some reason, but if you write some code that creates a new file with a random name every time the setup file is run, you'll end up with a lot of files.

However, I presume the grammar stored in _grammarText is not the grammar that's actually used by llama.cpp right? I see here there are some parameters that I believe modify the input grammar somehow. It would be nice to be able to get the grammar actually used under the hood in some way

The .grammar on a LlamaGrammar is the entire grammar passed to llama.cpp, the other parameters you see there are for node-llama-cpp for the evaluation logic.

haldunanil Jul 20, 2024

Alright so as mentioned earlier, here are the steps to repro the issue we were having running it on a serverless pod. We've been trying to (unsuccessfully) deploy a serverless pod to Runpod.io that runs node-llama-cpp on the side.

Here's the setup:

a handler.py file that is run when the serverless endpoint is triggered, which does a simple HTTP request to...
an express server running a single endpoint that asks node-llama-cpp for a completion
both these functions are run inside a docker container using supervisor

The main issue seems to be that when enough concurrent requests (which don't have to be that many) are sent to the serverless endpoint, it simply seems to crash/restart the pod.

Admittedly, this is a bit of an esoteric use case, but I guess my main question might be: do you have any experience integrating node-llama-cpp into an Express.js server and have it handle multiple concurrent requests to an endpoint? If so, should the context be initialized on a per endpoint call basis or once at server start? Anything we should be thinking about/watching out for? Thanks!

giladgd Jul 21, 2024
Maintainer Author

Regarding the crashes that you experience, does the pod use Vulkan?
The Vulkan backend doesn't work well with multiple concurrent contexts yet (this is a limitation of llama.cpp).
Ideally, you'd want to have CUDA installed and configured properly so node-llama-cpp can use it, since its performance is significantly better.
You can run the npx --no node-llama-cpp inspect gpu command in the pod to see what compute layers are detected, or access the .gpu parameter on a Llama instance to see what compute layer is it using.

Regarding the Express server, you'd want to create as few contexts as possible (ideally only one) with as many sequences as you can fit in the pod (while having their context size still be at a reasonable size), and have a pool of those contexts and sequences to use for incoming requests.
The creation of a context takes some time, so you'd probably want to have one ready already for incoming requests.

Also, for improved performance, consider using a faster alternative to Express, such as tinyhttp or Hono.

haldunanil Jul 24, 2024

It uses CUDA, so I think we're good on that front. Do you have any suggested code snippets on how to do few contexts with multiple sequences in a way that can run reliably on a web server? Will check out the other suggested frameworks to see if they help!

giladgd Aug 3, 2024
Maintainer Author

Sorry for the late response, here's a basic example of how to reuse a single context with multiple sequences:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, LlamaContextSequence} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext({
    // set these to whatever you find works best for your use case
    sequences: 12,
    contextSize: {
        // ensure that each sequence will have a context size of at least 1024
        // or this will throw an error if the machine hardware cannot handle this
        min: 1024
    }
});

const queue: (() => void)[] = [];
const pool: LlamaContextSequence[] = [];

async function getSequence() {
    while (true) {
        if (pool.length > 0)
            return pool.shift()!;

        if (context.sequencesLeft > 0)
            return context.getSequence();

        await new Promise<void>(resolve => void queue.push(resolve));
    }

    throw new Error("unreachable");
}

function disposeSequence(sequence: LlamaContextSequence) {
    if (sequence.disposed)
        return;

    queue.shift()?.();
}

// use this in the handler of your http server
async function handleUserInput(input: string) {
    const contextSequence = await getSequence();

    try {
        const chatSession = new LlamaChatSession({
            contextSequence
        });

        return await chatSession.prompt(input);
    } finally {
        disposeSequence(contextSequence);
    }
}

dabs9 · 2024-07-15T01:17:24Z

dabs9
Jul 15, 2024

Hey @giladgd I'd like to create a ChatWrapper for the Phi-3 model. It seems that the implementation has gotten a lot more complex than what is in the docs / main. Do you have some guidance on how I could get started?

The prompt format is fairly simple: <|system|> {system_prompt}<|end|><|user|> {prompt}<|end|><|assistant|>

Also curious if this is the only adaptation necessary to use a model not already supported by this library or if any other work is necessary.

1 reply

giladgd Jul 15, 2024
Maintainer Author

The built-in Jinja chat template provided by the model is used automatically, and from my tests, it seems to work with the model you linked.
If you encountered an issue where there are missing spaces between words, it has to do with a recent change in llama.cpp's tokenizer implementation that I haven't yet added support for; this issue affects only the streaming of responses though, so if you detokenize an entire array of tokens it still works correctly.

I'm currently working on a completely new documentation for version 3 with explanations of how to create your own chat wrapper.
The explanation is longer than I can fit here without the relevant context, so in the meantime, you can use TemplateChatWrapper which is relatively easy to use or JinjaTemplateChatWrapper if you need something more advanced.

MitchellMonaghan · 2024-08-17T16:43:13Z

MitchellMonaghan
Aug 17, 2024

I was trying to use the new vulkan support. Here is the output of my npx --no node-llama-cpp inspect gpu

OS: Windows 10.0.22631 (x64)
Node: 20.11.1 (x64)
TypeScript: 5.5.4
node-llama-cpp: 3.0.0-beta.44

Vulkan: available

Vulkan devices: AMD Radeon RX 6650 XT, Intel(R) UHD Graphics 770
Vulkan used VRAM: 0% (512KB/23.62GB)
Vulkan free VRAM: 99.99% (23.62GB/23.62GB)

CPU model: 12th Gen Intel(R) Core(TM) i5-12600K
Used RAM: 46.31% (14.72GB/31.78GB)
Free RAM: 53.68% (17.06GB/31.78GB)

With a basic example I was getting.

ggml_vulkan: Device memory allocation of size 2141061120 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
[node-llama-cpp] llama_kv_cache_init: failed to allocate buffer for kv cache
[node-llama-cpp] llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache

I tried setting gpuLayers to 0 and 32 and still got the same issue.
The issue was resolved when I explicitly set a contextSize of 2048.

Here is the model I was attempting to use.
https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF

If this is expected or if you need more info let me know. I didn't see anything in the documentation about having to set the contextSize.

1 reply

giladgd Aug 18, 2024
Maintainer Author

Thanks for sharing!
It seems there are some memory constraints when using Vulkan, which affects the maximum possible context size llama.cpp can use.
I'll investigate it further and implement a fix for this so you won't have to specify a context size when using Vulkan.

scenaristeur · 2024-09-19T15:27:01Z

scenaristeur
Sep 19, 2024

with function calling #139
is it possible to call async function ? something like await axios.get(url) and to wait for the result ?

to get the same beahavior as this function call in python autogen https://github.com/scenaristeur/dady/blob/c239bdf9d8334e719730eb5b4f46ea3d844ca62b/llm/basic_functions_with%20results.py#L82

with many / optionals params ? what is the norme to define params ?
is it possible to import node_modules, like axios ?
i though something like

  httpRequest: defineChatSessionFunction({
    description: "perform an http request (GET, POST, PUT, DELETE,...)",
    params: {
      type: "object",
      properties: {
        url: "url",
        method: "text",
        headers: {
          "Content-Type" : "application/json",
          "Accept": ...
        }
      }
    },
    handler(params) {

      try {
        const response = await axios.get(params.url, {
          // params: {
          //   ID: 12345
          // }
        })
        // console.log(response)
        return {"status": "ok cool", response:response}
      } catch (error) {
        // console.error(error)
        return {"status": "ko", error: error}
      }

    }
  }),

1 reply

giladgd Sep 19, 2024
Maintainer Author

You can return a Promise and it'll wait for it to resolve before passing the result to the model.
I'll improve the documentation of defineChatSessionFunction to make it more clear that it's possible.

Please note that it's quite dangerous to let a model call whatever API it wants, so I recommend putting some restrictions on what it can call, like only allowing GET and HEAD requests, for example.
However, here's an example of how you can achieve what you described:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, defineChatSessionFunction, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();

const functions = {
    httpRequest: defineChatSessionFunction({
        description: "Perform an http request to a given url",
        params: {
            type: "object",
            properties: {
                method: {
                    enum: ["GET", "HEAD"]
                },
                url: {
                    type: "string"
                },
                headers: {
                    type: "array",
                    items: {
                        type: "object",
                        properties: {
                            key: {
                                type: "string"
                            },
                            value: {
                                type: "string"
                            }
                        }
                    }
                }
            }
        },
        async handler({method, url, headers}) {
            const headersMap = Object.fromEntries(headers.map(({key, value}) => [key, value]));

            try {
                console.log("Performing http request", {method, url, headers: headersMap});
                const res = await fetch(url, {
                    method,
                    headers: headersMap
                });

                const isJson = res.headers.get("content-type")?.includes("application/json");
                const response = isJson
                    ? await res.json()
                    : await res.text();

                return {
                    stauts: res.status,
                    response: response
                };
            } catch (err) {
                return "Error: " + String(err);
            }
        }
    })
};
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "What's the current USD to EUR exchange rate?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);

giladgd · 2024-09-24T18:53:25Z

giladgd
Sep 24, 2024
Maintainer Author

I'm closing this thread as version 3 is now released.
Feel free to open new threads or issues to give your feedback.

0 replies

Beta version feedback (3.0.0-beta) #109

giladgd Dec 6, 2023 Maintainer

Replies: 31 comments · 109 replies

giladgd Dec 6, 2023 Maintainer Author

giladgd Jan 21, 2024 Maintainer Author

giladgd Jan 22, 2024 Maintainer Author

giladgd Jan 24, 2024 Maintainer Author

giladgd Jan 24, 2024 Maintainer Author

giladgd Feb 4, 2024 Maintainer Author

giladgd Feb 9, 2024 Maintainer Author

giladgd Feb 11, 2024 Maintainer Author

giladgd Feb 13, 2024 Maintainer Author

giladgd Feb 13, 2024 Maintainer Author

giladgd Mar 16, 2024 Maintainer Author

giladgd Feb 20, 2024 Maintainer Author

giladgd Mar 2, 2024 Maintainer Author

giladgd Mar 16, 2024 Maintainer Author

giladgd Mar 16, 2024 Maintainer Author

giladgd Mar 20, 2024 Maintainer Author

giladgd Apr 5, 2024 Maintainer Author

giladgd Apr 19, 2024 Maintainer Author

giladgd Apr 24, 2024 Maintainer Author

giladgd Apr 24, 2024 Maintainer Author

giladgd Apr 26, 2024 Maintainer Author

Beta version feedback (`3.0.0-beta`) #109

giladgd
Dec 6, 2023
Maintainer

Replies: 31 comments 109 replies

giladgd Dec 6, 2023
Maintainer Author

giladgd Jan 21, 2024
Maintainer Author

giladgd Jan 22, 2024
Maintainer Author

giladgd Jan 24, 2024
Maintainer Author

giladgd Jan 24, 2024
Maintainer Author

giladgd Feb 4, 2024
Maintainer Author

giladgd Feb 9, 2024
Maintainer Author

giladgd Feb 11, 2024
Maintainer Author

giladgd Feb 13, 2024
Maintainer Author

giladgd Feb 13, 2024
Maintainer Author

giladgd Mar 16, 2024
Maintainer Author

giladgd Feb 20, 2024
Maintainer Author

giladgd Mar 2, 2024
Maintainer Author

giladgd Mar 16, 2024
Maintainer Author

giladgd Mar 16, 2024
Maintainer Author

giladgd Mar 20, 2024
Maintainer Author

giladgd Apr 5, 2024
Maintainer Author

giladgd Apr 19, 2024
Maintainer Author

giladgd Apr 24, 2024
Maintainer Author

giladgd Apr 24, 2024
Maintainer Author

giladgd
Apr 26, 2024
Maintainer Author