
cuda : do not use batched GEMM when tensor cores are not available #3882

Merged · 1 commit merged into master on Nov 2, 2023

Conversation

ggerganov (Owner)

fix #3869
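
As context for the change, here is a minimal sketch (not the actual llama.cpp diff) of the gating the PR title describes: the batched cuBLAS GEMM path is taken only when the device has tensor cores, i.e. compute capability 7.0 (Volta) or newer, while older cards such as Pascal (6.1) keep using the custom quantized MMQ kernels. The function name and the `force_mmq` parameter are illustrative; `CC_VOLTA` mirrors the threshold constant used in the CUDA backend.

```cpp
// Illustrative sketch; only the CC_VOLTA threshold reflects the real code.
// Tensor cores first appeared with Volta (compute capability 7.0).
#define CC_VOLTA 700

// Decide whether the batched cuBLAS GEMM path may be used on this device.
// Pre-Volta GPUs (e.g. Pascal, CC 6.1) have no tensor cores, so batched
// GEMM is slower there and the quantized MMQ kernels are used instead.
static bool use_batched_gemm(int min_compute_capability, bool force_mmq) {
    if (force_mmq) {
        return false; // build-time override: always take the MMQ path
    }
    return min_compute_capability >= CC_VOLTA;
}
```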

@askmyteapot

Can confirm the fix works on Pascal (SM 6.1).

@ggerganov added the performance (Speed related topics) and Nvidia GPU (Issues specific to Nvidia GPUs) labels on Nov 1, 2023
@cebtenzzre (Collaborator) commented on Nov 2, 2023

I can confirm that this brings pp512 on my Tesla P40 back to pre-#3749 speeds.

Now both #3749 and #3776 can be worked around via -DLLAMA_CUDA_FORCE_MMQ=ON on older cards.
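
For anyone reaching for that workaround, a sketch of how such a build flag typically feeds the gate above. This assumes the CMake option defines a `GGML_CUDA_FORCE_MMQ` preprocessor symbol for the CUDA backend, as llama.cpp does; the variable name is illustrative.

```cpp
// Assumed wiring: configuring with -DLLAMA_CUDA_FORCE_MMQ=ON defines
// GGML_CUDA_FORCE_MMQ when compiling the CUDA backend.
#ifdef GGML_CUDA_FORCE_MMQ
static const bool force_mmq = true;  // always use the quantized MMQ kernels
#else
static const bool force_mmq = false; // let compute capability decide
#endif
```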

@ggerganov merged commit 4d719a6 into master on Nov 2, 2023 (33 checks passed)
@ggerganov deleted the try-fix-3869 branch on Nov 2, 2023 at 06:35
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request on Nov 23, 2023
Labels: Nvidia GPU (Issues specific to Nvidia GPUs), performance (Speed related topics)
Development

Successfully merging this pull request may close these issues:

CTX Processing regression for Pascal - Commit 2b4ea35 (#3869)