Add ReLU and SQR CUDA ops to fix Persimmon offloading #4041
Conversation
#define CUDA_RELU_BLOCK_SIZE 256
#define CUDA_SQR_BLOCK_SIZE 256
I'm not really sure what the optimal block sizes are; I just copied the value from the SiLU op.
Looks good to me.
Block sizes should not matter that much; a thread on the Nvidia forums from a decade ago suggests 128-256.
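For context, the two new ops follow the same element-wise pattern as the existing SiLU kernel: one thread per element, with the block size only controlling how the flat index space is tiled across blocks, which is why any value in the 128-256 range works fine. A sketch of that pattern (not necessarily the exact code in this PR):

```cuda
#define CUDA_RELU_BLOCK_SIZE 256
#define CUDA_SQR_BLOCK_SIZE  256

// Element-wise ReLU over a flat float buffer; one thread per element.
static __global__ void relu_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = fmaxf(x[i], 0.0f);
}

// Element-wise square: dst[i] = x[i] * x[i].
static __global__ void sqr_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = x[i] * x[i];
}

// Launch enough blocks of the chosen size to cover all k elements.
static void relu_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int num_blocks = (k + CUDA_RELU_BLOCK_SIZE - 1) / CUDA_RELU_BLOCK_SIZE;
    relu_f32<<<num_blocks, CUDA_RELU_BLOCK_SIZE, 0, stream>>>(x, dst, k);
}

static void sqr_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int num_blocks = (k + CUDA_SQR_BLOCK_SIZE - 1) / CUDA_SQR_BLOCK_SIZE;
    sqr_f32<<<num_blocks, CUDA_SQR_BLOCK_SIZE, 0, stream>>>(x, dst, k);
}
```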
The Persimmon graph seems to be doing some overly complicated stuff. I haven't looked deeply into the logic, but we should simplify it. If necessary, the convert script can output the data in a more convenient layout if that helps reduce the number and types of ops currently used in the attention.
Thanks for the response.
Unfortunately I don't really know anything about the model; I just downloaded it based on an issue about the CUDA offloading and was able to track down the problem. If you have the time to answer, is there any way we can limit the number of layers that get offloaded so it doesn't just crash?
No, it's better to let it crash. Otherwise we will forget about this problem and won't fix it.
I added an error message for when too many layers are offloaded on CUDA/ROCm.
Also checked CLBlast; it doesn't seem to work with anything more than the repeating layers, I just get garbage output. I was going to add a message for that as well, but I don't know whether it's an issue specific to my system or the same underlying cause.
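For reference, the kind of guard described above is conceptually along these lines (a minimal sketch only; the helper name, parameters, and exact message are assumptions, not the PR's actual code):

```cpp
#include <stdexcept>
#include <string>

// Sketch: refuse to offload more layers than the CUDA/ROCm Persimmon path
// currently supports (the repeating layers plus one), instead of crashing
// later with an unsupported-op error. Names here are illustrative.
static void check_persimmon_offload(int n_gpu_layers, int n_layer) {
    if (n_gpu_layers > n_layer + 1) {
        throw std::runtime_error(
            "Persimmon: offloading more than " + std::to_string(n_layer + 1) +
            " layers to CUDA/ROCm is not supported; reduce the value passed to -ngl");
    }
}
```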
Add ReLU and SQR CUDA ops to fix Persimmon offloading (ggerganov#4041)

* Add ReLU and SQR CUDA ops to fix Persimmon offloading
* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
See #4038: Persimmon uses ReLU and SQR, but those CUDA ops didn't exist. As a note, they already exist in Metal.
This pull adds those ops. This still isn't enough for full offloading. You can offload `n_layers + 1`. So for the 8B model with 36 layers, `-ngl 37` works but `-ngl 38` does not.

edit: ~~The next op it fails on seems to be `CPY`. Is the solution just to add that as well?~~ Actually, seems like `CPY` already exists, so the problem must be something else, like maybe it's not a combination of tensor types that can be copied. One of the operands is on CPU. I don't know how to fix that though.

edit: Well, the next issue after that is `CPY` only supports up to 3 dimensions but those tensors are 4D.
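To illustrate what the 3-dimension limit means, here is a hypothetical standalone copy kernel (not ggml's actual `CPY` implementation): a copy indexed only by three extents and three byte strides can reach every element of a tensor with up to 3 dimensions, but addressing a 4D tensor would additionally need an `i3` extent and `nb3` stride folded into the index math.

```cuda
#include <cstddef>

// Hypothetical 3D-indexed copy: without an i3 extent and nb3 stride,
// the outermost dimension of a 4D tensor is never visited.
static __global__ void cpy_f32_3d(const char * src, char * dst,
                                  int ne0, int ne1, int ne2,
                                  size_t nb0, size_t nb1, size_t nb2,
                                  size_t db0, size_t db1, size_t db2) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= ne0*ne1*ne2) {
        return;
    }
    // Decompose the flat index into (i0, i1, i2) coordinates.
    const int i2 =  i / (ne0*ne1);
    const int i1 = (i - i2*ne0*ne1) / ne0;
    const int i0 =  i - i2*ne0*ne1 - i1*ne0;

    // Strided load from src, strided store to dst.
    *(float *)(dst + i0*db0 + i1*db1 + i2*db2) =
        *(const float *)(src + i0*nb0 + i1*nb1 + i2*nb2);
}
```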