
Problem when running some models with cuda #261

Closed
1 of 3 tasks
bqhuyy opened this issue Jul 2, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@bqhuyy

bqhuyy commented Jul 2, 2024

Issue description

Models keep generating dummy results when running with CUDA.

Expected Behavior

Models should stop generating dummy output, as they do when running on CPU or Vulkan.

Actual Behavior

Models keep generating dummy results.

Steps to reproduce

I use this Qwen2 1.5B model, downloaded from here.
Run with gpu set to 'auto' or 'cuda':

const llama = await getLlama({gpu: 'cuda'})
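
For reference, a fuller sketch of how this reproduces (the model file name below is a placeholder; the rest follows the v3 beta API used elsewhere in this thread):

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// Force the CUDA compute layer (the same thing happens with gpu: "auto" on this machine)
const llama = await getLlama({gpu: "cuda"});
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Qwen2-1.5B-Instruct.Q4_K_M.gguf") // placeholder file name
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// With CUDA, the response here comes out as dummy/garbage text
console.log(await session.prompt("Hi there, how are you?"));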

My Environment

Dependency              Version
Operating System        Windows 10
CPU                     AMD Ryzen 7 3700X
GPU                     RTX 4090, RTX 3080
Node.js version         v20.11.1
TypeScript version      5.5.2
node-llama-cpp version  3.0.0-beta.36

Additional Context

Here is an example I ran using https://github.com/withcatai/node-llama-cpp/releases/download/v3.0.0-beta.36/node-llama-cpp-electron-example.Windows.3.0.0-beta.36.x64.exe
[Screenshot 2024-07-01 165036]

These models run normally with 'vulkan', 'cpu', and 'metal'.

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

@bqhuyy bqhuyy added bug Something isn't working requires triage Requires triaging labels Jul 2, 2024
@giladgd
Contributor

giladgd commented Jul 3, 2024

There seems to be an issue with Qwen models when using CUDA.
I've seen some suggestions to try other quantizations of a Qwen model since some may still work with CUDA, but I couldn't get any of the quantizations of the model you linked to work with CUDA.
I've done some tests and can confirm this is an issue with llama.cpp and is not something specific to node-llama-cpp.

I'll make it easier to disable the use of CUDA (or any other compute layer), so you can force node-llama-cpp not to use it when you know it has an issue with a model you want to use.
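
In the meantime, something along these lines should already let you pick the compute layer explicitly when creating the llama instance (a sketch; it assumes the gpu option of getLlama accepts false and backend names in the beta you're on):

import {getLlama} from "node-llama-cpp";

// Disable GPU acceleration entirely and fall back to CPU
const cpuLlama = await getLlama({gpu: false});

// Or prefer a different compute layer (e.g. Vulkan) instead of CUDA
const vulkanLlama = await getLlama({gpu: "vulkan"});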

@anunknowperson

Try using IQ quants with CUDA.

@bqhuyy
Author

bqhuyy commented Jul 5, 2024

@giladgd Thank you for your feedback. I found a suggestion to use FlashAttention to solve this problem. How can I enable it in node-llama-cpp?

@giladgd
Contributor

giladgd commented Jul 5, 2024

@bqhuyy I've released a new beta version that allows you to enable flash attention like this:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Qwen2-1.5B-Instruct.Q4_K_M.gguf"),
    defaultContextFlashAttention: true // it's best to enable it via this setting
});

// you can also pass {flashAttention: true} here to enable it for only this context
const context = await model.createContext();
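
And if you only want it for a single context, a sketch of that variant, continuing from the snippet above (the chat part just reuses the LlamaChatSession import shown there):

// Enable flash attention for this context only, instead of model-wide
const flashAttentionContext = await model.createContext({flashAttention: true});

const session = new LlamaChatSession({
    contextSequence: flashAttentionContext.getSequence()
});
console.log(await session.prompt("Hi there, how are you?"));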

Let me know whether you were able to use any Qwen models with CUDA without flash attention; if it's impossible to use Qwen models with CUDA without flash attention, then I'll enable flash attention by default for Qwen models when CUDA is used.
I haven't enabled flash attention by default for all models since it's still considered experimental and may not work well with all models, but if Qwen is unusable without it, then it's better to have it enabled by default in this case.

@bqhuyy
Author

bqhuyy commented Jul 11, 2024

@giladgd Hi, Qwen2 (CUDA) works with defaultContextFlashAttention: true.

@anunknowperson

anunknowperson commented Jul 13, 2024

@giladgd
You can use flash attention with Qwen2 as a workaround for this bug, but it only works with CUDA 12. With CUDA 11, flash attention will not change anything.

(source here: ggerganov/llama.cpp#8025)
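
A sketch of gating the workaround on the active backend (this assumes the llama.gpu property reports which compute layer was actually loaded; it cannot tell CUDA 11 and CUDA 12 apart, so that still has to be known out of band):

import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

// Assumed: llama.gpu reports the loaded compute layer ("cuda", "vulkan", "metal" or false)
const usingCuda = llama.gpu === "cuda";

const model = await llama.loadModel({
    modelPath: "Qwen2-1.5B-Instruct.Q4_K_M.gguf", // placeholder path
    // Enable the flash-attention workaround only when running on CUDA.
    // Per the linked llama.cpp issue, it only helps on CUDA 12, not CUDA 11.
    defaultContextFlashAttention: usingCuda
});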

@giladgd
Contributor

giladgd commented Jul 30, 2024

I found some Qwen2 models that work on CUDA 12 without flash attention, so enabling flash attention for Qwen2 models is not always necessary; since flash attention is still considered experimental, I won't make it the default.

I'm closing this issue for now since defaultContextFlashAttention: true seems to solve it, and I'll make sure to mention flash attention as a fix for this issue in the version 3 documentation.

@giladgd giladgd closed this as completed Jul 30, 2024
@giladgd giladgd removed the requires triage Requires triaging label Jul 30, 2024