Problem when running some models with cuda #261
Comments
There seems to be an issue with Qwen models when using CUDA. I'll make it easier to disable the use of CUDA (or any other compute layer you want), so you can force the use of a different compute layer instead.
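For reference, a minimal sketch of what forcing a different compute layer looks like with the `gpu` option of `getLlama` (as it exists in the released v3 API; it may not have been available yet in the beta discussed here):

```typescript
import {getLlama} from "node-llama-cpp";

// force Vulkan instead of CUDA; pass `false` to disable GPU acceleration entirely
const llama = await getLlama({
    gpu: "vulkan"
});

console.log("GPU backend in use:", llama.gpu); // the compute layer that was actually resolved
```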
Try to use IQ quants with CUDA.
@giladgd Thank you for your feedback. I found that there is a suggestion of using flash attention.
@bqhuyy I've released a new beta version that allows you to enable flash attention like this:

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Qwen2-1.5B-Instruct.Q4_K_M.gguf"),
    defaultContextFlashAttention: true // it's best to enable it via this setting
});

// you can also pass {flashAttention: true} here to enable it for only this context
const context = await model.createContext();
```

Let me know whether you were able to use any Qwen models with CUDA without flash attention; if that turns out to be impossible, I'll enable flash attention by default for Qwen models when CUDA is used.
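For completeness, a minimal sketch of actually prompting the model with that context, continuing the snippet above (the `LlamaChatSession` usage assumes the v3 beta API; the prompt text is just an example):

```typescript
// `context` and `LlamaChatSession` come from the snippet above
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const answer = await session.prompt("Hi there, how are you?");
console.log("AI:", answer);
```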
@giladgd hi, Qwen2 (CUDA) works with flash attention enabled.
@giladgd (source here: ggerganov/llama.cpp#8025)
I found some Qwen2 models that worked on CUDA 12 without flash attention, so enabling flash attention for Qwen2 models is not always necessary. Since flash attention is still considered experimental, I won't make it the default. I'm closing this issue for now since enabling flash attention works around the problem.
Issue description
Models keep generating dummy results when running with CUDA.
Expected Behavior
Models stop generating dummy output, as they do when running with cpu or vulkan.
Actual Behavior
Models keep generating dummy results.
Steps to reproduce
I use this Qwen2 1.5B model, downloaded from here.
Running with gpu set to auto or cuda.
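For clarity, a minimal reproduction sketch of these steps, assuming the v3 beta API and a local copy of the downloaded model file (the path is hypothetical):

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// "auto" resolves to CUDA on a machine with an NVIDIA GPU, which is where the issue appears
const llama = await getLlama({gpu: "auto"});
const model = await llama.loadModel({
    // hypothetical local path to the downloaded Qwen2 1.5B GGUF
    modelPath: path.join(__dirname, "models", "Qwen2-1.5B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// under CUDA the model keeps producing dummy/garbled output instead of a real answer
console.log(await session.prompt("Hello"));
```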
My Environment
node-llama-cpp version
Additional Context
Here is an example I ran using https://github.com/withcatai/node-llama-cpp/releases/download/v3.0.0-beta.36/node-llama-cpp-electron-example.Windows.3.0.0-beta.36.x64.exe
These models run normally with 'vulkan', 'cpu', and 'metal'.
Relevant Features Used
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, but I don't know how to start. I would need guidance.