
Problem when running some models with cuda #261

Closed
1 of 3 tasks
bqhuyy opened this issue Jul 2, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@bqhuyy

bqhuyy commented Jul 2, 2024

Issue description

Models keep generating dummy results when running with CUDA.

Expected Behavior

Models should stop generating dummy output, as they do when running on CPU or Vulkan.

Actual Behavior

Models keep generating dummy results.

Steps to reproduce

I use this Qwen2 1.5B model, downloaded from here.
Run with gpu set to 'auto' or 'cuda':

const llama = await getLlama({gpu: 'cuda'})
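
For reference, a fuller sketch of how this reproduces (the model file name below is a placeholder; the rest follows the v3 beta API used elsewhere in this thread):

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// Force the CUDA compute layer (the same thing happens with gpu: "auto" on this machine)
const llama = await getLlama({gpu: "cuda"});
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Qwen2-1.5B-Instruct.Q4_K_M.gguf") // placeholder file name
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// With CUDA, the response here comes out as dummy/garbage text
console.log(await session.prompt("Hi there, how are you?"));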

My Environment

Dependency              Version
Operating System        Windows 10
CPU                     AMD Ryzen 7 3700X
GPU                     RTX 4090, RTX 3080
Node.js version         v20.11.1
TypeScript version      5.5.2
node-llama-cpp version  3.0.0-beta.36

Additional Context

Here is an example I ran using https://github.com/withcatai/node-llama-cpp/releases/download/v3.0.0-beta.36/node-llama-cpp-electron-example.Windows.3.0.0-beta.36.x64.exe
[Screenshot 2024-07-01 165036]

These models run normally with 'vulkan', 'cpu', and 'metal'.

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

@bqhuyy bqhuyy added bug Something isn't working requires triage Requires triaging labels Jul 2, 2024
@giladgd
Contributor

giladgd commented Jul 3, 2024

There seems to be an issue with Qwen models when using CUDA.
I've seen some suggestions to try other quantizations of a Qwen model since some may still work with CUDA, but I couldn't get any of the quantizations of the model you linked to work with CUDA.
I've done some tests and can confirm this is an issue with llama.cpp and is not something specific to node-llama-cpp.

I'll make it easier to disable the use of CUDA (or any other compute layer), so you can force node-llama-cpp not to use it when you know it has an issue with a model you want to use.
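
In the meantime, something along these lines should already let you pick the compute layer explicitly when creating the llama instance (a sketch; it assumes the gpu option of getLlama accepts false and backend names in the beta you're on):

import {getLlama} from "node-llama-cpp";

// Disable GPU acceleration entirely and fall back to CPU
const cpuLlama = await getLlama({gpu: false});

// Or prefer a different compute layer (e.g. Vulkan) instead of CUDA
const vulkanLlama = await getLlama({gpu: "vulkan"});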

@anunknowperson

Try using IQ quants with CUDA.

@bqhuyy
Author

bqhuyy commented Jul 5, 2024

@giladgd Thank you for your feedback. I found a suggestion to use FlashAttention to solve this problem. How can I enable it in node-llama-cpp?

@giladgd
Contributor

giladgd commented Jul 5, 2024

@bqhuyy I've released a new beta version that allows you to enable flash attention like this:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Qwen2-1.5B-Instruct.Q4_K_M.gguf"),
    defaultContextFlashAttention: true // it's best to enable it via this setting
});

// you can also pass {flashAttention: true} here to enable it for only this context
const context = await model.createContext();
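
And if you only want it for a single context, a sketch of that variant, continuing from the snippet above (the chat part just reuses the LlamaChatSession import shown there):

// Enable flash attention for this context only, instead of model-wide
const flashAttentionContext = await model.createContext({flashAttention: true});

const session = new LlamaChatSession({
    contextSequence: flashAttentionContext.getSequence()
});
console.log(await session.prompt("Hi there, how are you?"));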

Let me know whether you were able to use any Qwen models with CUDA without flash attention; if it's impossible to use Qwen models with CUDA without flash attention, then I'll enable flash attention by default for Qwen models when CUDA is used.
I haven't enabled flash attention by default for all models since it's still considered experimental and may not work well with all models, but if Qwen is unusable without it, then it's better to have it enabled by default in this case.

@bqhuyy
Author

bqhuyy commented Jul 11, 2024

@giladgd Hi, Qwen2 (CUDA) works with defaultContextFlashAttention: true.

@anunknowperson

anunknowperson commented Jul 13, 2024

@giladgd
You can use flash attention with Qwen2 as a workaround for this bug, but it only works with CUDA 12. With CUDA 11, flash attention will not change anything.

(source here: ggerganov/llama.cpp#8025)
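
A sketch of gating the workaround on the active backend (this assumes the llama.gpu property reports which compute layer was actually loaded; it cannot tell CUDA 11 and CUDA 12 apart, so that still has to be known out of band):

import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

// Assumed: llama.gpu reports the loaded compute layer ("cuda", "vulkan", "metal" or false)
const usingCuda = llama.gpu === "cuda";

const model = await llama.loadModel({
    modelPath: "Qwen2-1.5B-Instruct.Q4_K_M.gguf", // placeholder path
    // Enable the flash-attention workaround only when running on CUDA.
    // Per the linked llama.cpp issue, it only helps on CUDA 12, not CUDA 11.
    defaultContextFlashAttention: usingCuda
});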

@giladgd
Contributor

giladgd commented Jul 30, 2024

I found some Qwen2 models that work on CUDA 12 without flash attention, so enabling flash attention for Qwen2 models is not always necessary; since flash attention is still considered experimental, I won't make it the default.

I'm closing this issue for now since defaultContextFlashAttention: true seems to solve it, and I'll make sure to mention flash attention as a fix for this issue in the version 3 documentation.

@giladgd giladgd closed this as completed Jul 30, 2024
@giladgd giladgd removed the requires triage Requires triaging label Jul 30, 2024