qgemm: optimize avxvnni QGEMM inner kernel for M=1 #22952

Open: wants to merge 2 commits into main

r-devulap

Add a specialized path for the M=1 case that uses the additional available ymm registers to unroll the inner kernel loop more deeply.
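To make the idea concrete, here is a minimal scalar sketch of a 1xN quantized product whose K loop is unrolled across several independent accumulators. It is not the kernel from this PR: the function names (`dp4`, `qgemv_m1_unrolled`, `dot_reference`), the unroll factor of four, and the column-contiguous layout of B are all assumptions chosen for illustration, and each scalar accumulator only stands in for a ymm register that an M=1 kernel would have free.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: plain u8 x s8 dot product accumulated in 32 bits.
int32_t dot_reference(const uint8_t* a, const int8_t* b, std::size_t k) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < k; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}

// Scalar model of one 32-bit lane of a VNNI-style step:
// four unsigned-by-signed byte products are added to the accumulator.
inline int32_t dp4(int32_t acc, const uint8_t* a, const int8_t* b) {
    return acc + a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Hypothetical M=1 path: A is a single row of length k, B holds n columns,
// each stored contiguously (assumed layout, not the packed format MLAS uses).
// Four independent accumulators per column mimic the deeper unrolling that
// becomes possible when only one row of A needs accumulator registers.
void qgemv_m1_unrolled(const uint8_t* a, const int8_t* b,
                       std::size_t n, std::size_t k, int32_t* c) {
    for (std::size_t j = 0; j < n; ++j) {
        const int8_t* col = b + j * k;
        int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        std::size_t i = 0;
        // Main loop: 16 bytes of K per iteration, four independent chains.
        for (; i + 16 <= k; i += 16) {
            acc0 = dp4(acc0, a + i,      col + i);
            acc1 = dp4(acc1, a + i + 4,  col + i + 4);
            acc2 = dp4(acc2, a + i + 8,  col + i + 8);
            acc3 = dp4(acc3, a + i + 12, col + i + 12);
        }
        int32_t acc = acc0 + acc1 + acc2 + acc3;
        // Tail: remaining K elements that do not fill a 16-byte step.
        for (; i < k; ++i) {
            acc += int32_t(a[i]) * int32_t(col[i]);
        }
        c[j] = acc;
    }
}
```

Independent accumulators break the dependency chain on a single register, which is the usual reason deeper unrolling helps; how the actual kernel assigns its registers is not shown here.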

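Performance impact (measured on a 13th Gen Intel(R) Core(TM) i9-13900K):

  • Roughly 30% lower run time (geomean -0.289, about a 1.4x speedup) for single-threaded QGEMM kernels with M = 1
  • 7% reduction in average inference time on a small quantized model where all kernels have M = 1

Times below are in ns; the Time and CPU columns give the relative change, (new - old) / old.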
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| Benchmark                                                          | Time   | CPU     | Time Old | Time New | CPU Old | CPU New |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time   | -0.275 | -0.2756 | 4330     | 3137     | 4330    | 3136    |
| QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time  | -0.292 | -0.2927 | 9027     | 6385     | 9027    | 6385    |
| QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time | -0.300 | -0.3005 | 17867    | 12499    | 17866   | 12498   |
| OVERALL_GEOMEAN                                                    | -0.289 | -0.2897 |          |          |         |         |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|

QGEMM benchmarks with M = 1 on a 13th Gen Intel(R) Core(TM) i9-13900K
show a 1.4x improvement on a single thread.

r-devulap requested a review from a team as a code owner on November 26, 2024, 22:00
@r-devulap
Author

Posting raw qgemm benchmark numbers for clarity:

Before:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time         4330 ns         4330 ns       161969
QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time        9027 ns         9027 ns        77210
QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time      17867 ns        17866 ns        39329
-------------------------------------------------------------------------------------------------------------

After:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time         3137 ns         3136 ns       221932
QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time        6385 ns         6385 ns       109727
QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time      12499 ns        12498 ns        55934
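The summary figures can be re-derived from the raw real-time numbers above. The sketch below assumes the relative change is computed as (new - old) / old and that the overall figure is the geometric mean of the new/old ratios; the names `Sample`, `relative_change`, `geomean_ratio`, `speedup`, and `kRealTime` are illustrative only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One benchmark row: real time before and after the change, in ns.
struct Sample {
    double old_ns;
    double new_ns;
};

// Relative change in time; negative values mean the new kernel is faster.
double relative_change(const Sample& s) {
    return (s.new_ns - s.old_ns) / s.old_ns;
}

// Geometric mean of new/old ratios, computed in log space.
double geomean_ratio(const std::vector<Sample>& samples) {
    double log_sum = 0.0;
    for (const Sample& s : samples) {
        log_sum += std::log(s.new_ns / s.old_ns);
    }
    return std::exp(log_sum / static_cast<double>(samples.size()));
}

// Speedup factor implied by the geomean ratio.
double speedup(const std::vector<Sample>& samples) {
    return 1.0 / geomean_ratio(samples);
}

// Real-time values copied from the Before/After output above.
const std::vector<Sample> kRealTime = {
    {4330.0, 3137.0},    // M:1/N:512/K:512
    {9027.0, 6385.0},    // M:1/N:512/K:1024
    {17867.0, 12499.0},  // M:1/N:1024/K:1024
};
```

Under these assumptions, `geomean_ratio(kRealTime) - 1.0` comes out at about -0.29 and `speedup(kRealTime)` at about 1.4, in line with the figures quoted in the description.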
