sync : ggml #2573

ggerganov · 2024-11-19T17:09:56Z

TODO:

fix examples
start using backend registry
update Makefile

* ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: R0CKSTAR <[email protected]>

…a/9921) * backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels --------- Co-authored-by: Diego Devesa <[email protected]>

* sycl: Use syclcompat::dp4a
* Using the syclcompat version allows the compiler to optimize the operation with a native function
* Update news section
* Update CI Windows oneAPI version to 2025.0
* Reword doc
* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692d61360b46954a1c7f780bd2e569b73.
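For readers unfamiliar with the operation: dp4a treats each 32-bit operand as four packed 8-bit lanes, multiplies them lane by lane, and adds the four products to a 32-bit accumulator. Below is a minimal scalar sketch of the signed variant. `pack4` and `dp4a_ref` are hypothetical names used only for illustration; this is not the syclcompat or dpct implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack four signed 8-bit values into one 32-bit word,
// with lane 0 in the low byte.
inline uint32_t pack4(int8_t a, int8_t b, int8_t c, int8_t d) {
    return  (uint32_t)(uint8_t)a        |
           ((uint32_t)(uint8_t)b << 8)  |
           ((uint32_t)(uint8_t)c << 16) |
           ((uint32_t)(uint8_t)d << 24);
}

// Scalar reference for a signed dp4a: acc + sum over 4 lanes of a[i] * b[i].
// Assumes two's-complement narrowing when a byte is reinterpreted as int8_t.
inline int32_t dp4a_ref(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        const int8_t ai = (int8_t)(uint8_t)(a >> (8 * i));
        const int8_t bi = (int8_t)(uint8_t)(b >> (8 * i));
        acc += (int32_t)ai * (int32_t)bi;
    }
    return acc;
}
```

A hardware or compiler-provided dp4a performs the same arithmetic in a single instruction where the target supports it, which is presumably the optimization the commit is after.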

* use 128 bit loads (i've tried 256->128 to death and it's slower)
* double accumulator
* avx bf16 vec dot
* +3% q4_0 inference
* +7% tg +5% pp compared to master
* slower f16c version, kept for reference
* 256b version, also slow. i tried :)
* revert f16
* faster with madd
* split to functions
* Q8_0 and IQ4_NL, 5-7% faster
* fix potential overflow (performance reduced)
* 16 bit add for q4_0 only
* merge

ggml-ci

* ggml : remove duplicated sources from the last sync ggml-ci * cont : remove FindSIMD.cmake [no ci]

* ggml: new optimization interface remove test2.c, test3.c store adamw params in tensor move grads from tensor to graph * avoid segfault upon API misuse * add ggml-opt.h to public headers * remove dependence of ggml-opt.cpp on ggml-cpu.h

…ags (llama/10314)

Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops. Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before. Optimize the Q4_K shader to vectorize most loads and reduce the number of bit twiddling instructions.
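The row-pairing idea can be illustrated on the CPU. The sketch below is plain C++, not the Vulkan shader, and works on unquantized floats for simplicity. `matvec_two_rows` is a hypothetical function that computes two output rows per outer iteration, so each element of the vector is loaded once and used for both rows, with a bounds check for an odd final row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x for a row-major A (rows x cols), two rows per outer iteration.
std::vector<float> matvec_two_rows(const std::vector<float> & A,
                                   const std::vector<float> & x,
                                   size_t rows, size_t cols) {
    assert(A.size() == rows * cols && x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (size_t r = 0; r < rows; r += 2) {
        // Bounds check: on the last iteration there may be no row r + 1.
        const bool has_second = r + 1 < rows;
        const float * a0 = A.data() + r * cols;
        // When the second row is missing, alias the first so reads stay in bounds.
        const float * a1 = has_second ? a0 + cols : a0;
        float acc0 = 0.0f;
        float acc1 = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            const float b = x[c]; // loaded once, reused by both rows
            acc0 += a0[c] * b;
            acc1 += a1[c] * b;
        }
        y[r] = acc0;
        if (has_second) {
            y[r + 1] = acc1;
        }
    }
    return y;
}
```

For example, a 3 x 2 matrix exercises both the paired path (rows 0 and 1) and the bounds-checked tail (row 2).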

ggml-ci

ggerganov/llama.cpp#10352

* metal : add kernel arg structs (wip)
* metal : fattn args ggml-ci
* metal : cont + avoid potential int overflow [no ci]
* metal : mul mat struct (wip)
* cont : mul mat vec
* cont : pass by reference
* cont : args is first argument
* cont : use char ptr
* cont : shmem style
* cont : thread counters style
* cont : mul mm id ggml-ci
* cont : int safety + register optimizations ggml-ci
* metal : GGML_OP_CONCAT ggml-ci
* metal : GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV
* metal : GGML_OP_REPEAT
* metal : GGML_OP_CPY
* metal : GGML_OP_RMS_NORM
* metal : GGML_OP_NORM
* metal : add TODOs for rest of ops
* ggml : add ggml-metal-impl.h ggml-ci

* Vulkan: Fix device info output format specifiers * Vulkan: Use zu printf specifier for size_t instead of ld

-- While running StableDiffusion.cpp locally with Metal, some offsets overflow and result in incorrect calculations

Seems like this isn't working for vulkan-over-metal when the array is sized by a spec constant. Maybe a spirv-cross limitation?

* vulkan: Optimize soft_max

  Large soft_max could already saturate memory, but small/medium sizes were pretty slow. The bulk of the gains for them comes from using a smaller workgroup size, and making the workgroup size match the subgroup size also makes the barriers much cheaper. Cache some values in locals to avoid refetching/recomputing. And stamp out a few "template instantiations" so smaller cases will fully unroll. Add a missing early return for OOB rows. This happens when there are more than 512 rows and the dispatch is 512 x H.

* vulkan: Further soft_max optimizations

  Restore the workgroup size of 512 case, use it for >1024. Use unrollable loops for more iteration counts.

…/10266) * Add option to set the SYCL architecture for all targets * Convert GGML_SYCL_HIP_TARGET to the more generic GGML_SYCL_ARCH option * Document that setting GGML_SYCL_ARCH can improve the performance

ggerganov · 2024-11-20T12:13:21Z

@slaren I am getting the following assertion after the sync:

make -j && ./main -m models/ggml-base.bin -f samples/jfk.wav

whisper_backend_init: using BLAS backend
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating Metal buffer from size 0.00 MiB to 14.01 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.92 MiB
whisper_init_state: compute buffer (conv)   =   17.22 MB
Assertion failed: (src_backend_id != -1), function ggml_backend_sched_split_graph, file ggml-backend.cpp, line 1165.
Process 66825 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x0000000100048950 main`ggml_backend_sched_split_graph(sched=0x000000013890f400, graph=0x0000000148648020) at ggml-backend.cpp:1165:17
   1162	
   1163	               size_t src_id = hash_id(src);
   1164	               const int src_backend_id = sched->hv_tensor_backend_ids[src_id];
-> 1165	               assert(src_backend_id != -1); // all inputs should be assigned by now
   1166	
   1167	               if (src->flags & GGML_TENSOR_FLAG_INPUT && sched->n_copies > 1) {
   1168	                   if (tensor_id_copy(src_id, src_backend_id, 0) == NULL) {

The problem seems to be that the "encode" scheduler does not know about the embd_conv tensor, which is the result of the previous "conv" graph. What would be the recommended way to fix this? I think I can copy the embd_conv data to host memory after the "conv" graph and then copy it back to device memory before calling the "encode" graph. But I wonder if this copy can be avoided.

slaren · 2024-11-20T12:25:47Z

Should be fixed now, sorry about that.

ggerganov · 2024-11-20T12:32:44Z

Nice, thank you!

ggerganov · 2024-11-20T12:41:23Z

@KitaitiMakoto With this PR, the ggml source tree has changed a bit and the Ruby bindings need to be adapted accordingly. I'll leave them in a broken state for now, but you can make a PR either to this branch or later to master to resolve the build. Thanks.

KitaitiMakoto · 2024-11-20T13:28:15Z

Okay, I will make a pull request to master after this pull request is merged. Thank you for mentioning me.

ggerganov · 2024-11-20T13:58:59Z

I can't figure out why the whisper.objc CI is failing to use the correct include path. Locally, it runs successfully.

ggerganov and others added 30 commits November 19, 2024 18:59

scripts : update sync

06c86c0

ggml : build backends as libraries (llama/10256)

ce58be7

* ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: R0CKSTAR <[email protected]>

backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (llam…

41c9065

…a/9921) * backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels --------- Co-authored-by: Diego Devesa <[email protected]>

cmake : restore CMakeLists.txt (llama/10256)

1d49a2e

ggml-ci

sync : leftovers (ggml/0)

8dffd64

ggml-ci

ggml : fix some build issues

83c7739

ggml : remove duplicated sources from the last sync (ggml/1017)

f33c7ea

* ggml : remove duplicated sources from the last sync ggml-ci * cont : remove FindSIMD.cmake [no ci]

ggml: new optimization interface (ggml/988)

adf81dc

* ggml: new optimization interface remove test2.c, test3.c store adamw params in tensor move grads from tensor to graph * avoid segfault upon API misuse * add ggml-opt.h to public headers * remove dependence of ggml-opt.cpp on ggml-cpu.h

Make updates to fix issues with clang-cl builds while using AVX512 fl…

4b8ddfb

…ags (llama/10314)

ggml : optimize Q4_0 into Q4_0_X_Y repack (llama/10324)

49ca481

llamafile : fix include path (llama/0)

7caa6b2

ggml-ci

ggml : fix compile warnings (llama/0)

e726307

ggml-ci

ggml : adapt AMX to tensor->grad removal (llama/0)

600728e

ggml-ci

ggml : inttypes.h -> cinttypes (llama/0)

3f1a78d

ggml-ci

ggml : fix possible buffer use after free in sched reserve (llama/9930)

c96434f

CMake: default to -arch=native for CUDA build (llama/10320)

77ea626

CUDA: remove DMMV, consolidate F16 mult mat vec (llama/10318)

8bd8688

ggml : fix undefined reference to 'getcpu' (llama/10354)

a901ba0

ggerganov/llama.cpp#10352

llama : only use default buffer types for the KV cache (llama/10358)

6b4de57

CMake: fix typo in comment [no ci] (llama/10360)

fcd8ea6

CUDA: fix MMV kernel being used for FP16 src1 (llama/10357)

58b5fc4

metal : add GGML_UNARY_OP_ELU kernel (ggml/1018)

937684c

Vulkan: Fix device info output format specifiers (llama/10366)

748d633

* Vulkan: Fix device info output format specifiers * Vulkan: Use zu printf specifier for size_t instead of ld
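As a reminder of why the specifier matters: `%ld` expects a `long`, which is 32 bits on 64-bit Windows while `size_t` is 64 bits there, so passing a `size_t` to `%ld` is undefined behavior. `%zu` is the portable specifier for `size_t`. A minimal sketch follows; `format_size` is a hypothetical helper, not code from the Vulkan backend.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Format a byte count using %zu, the printf specifier defined for size_t.
std::string format_size(size_t n) {
    char buf[64];
    const int written = std::snprintf(buf, sizeof(buf), "%zu bytes", n);
    // snprintf returns the length it wanted to write; check it fit in buf.
    assert(written > 0 && (size_t)written < sizeof(buf));
    return std::string(buf);
}
```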

metal : fix offset integer overflows in im2col (ggml/1015)

c157f62

-- While running StableDiffusion.cpp locally with Metal, some offsets overflow and result in incorrect calculations
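This class of bug is easy to reproduce on the host. In the sketch below (hypothetical names, not the Metal im2col kernel) the same offset is computed once entirely in 32-bit arithmetic, where it wraps, and once with the operands widened to 64 bits before the multiply.

```cpp
#include <cassert>
#include <cstdint>

// Offset computed entirely in 32 bits: wraps modulo 2^32 for large tensors.
// Unsigned types are used so the wraparound is well defined in C++.
inline uint32_t offset_32(uint32_t row, uint32_t row_stride, uint32_t col) {
    return row * row_stride + col;
}

// Same offset with the first operand widened before the multiply,
// so the whole expression is evaluated in 64 bits.
inline uint64_t offset_64(uint32_t row, uint32_t row_stride, uint32_t col) {
    return (uint64_t)row * row_stride + col;
}
```

With `row = 70000`, `row_stride = 70000` and `col = 5`, the true offset is 4900000005, which exceeds 2^32, so the 32-bit version returns a much smaller, incorrect index.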

vulkan: remove use of null initializer (llama/10372)

c4f4639

Seems like this isn't working for vulkan-over-metal when the array is sized by a spec constant. Maybe a spirv-cross limitation?

cuda : only use native when supported by cmake (llama/10389)

761d310

Alcpz and others added 9 commits November 19, 2024 19:02

sycl: Revert MUL_MAT_OP support changes (llama/10385)

8d6e30f

sycl : Add option to set the SYCL architecture for all targets (llama…

d2aaf9e

…/10266) * Add option to set the SYCL architecture for all targets * Convert GGML_SYCL_HIP_TARGET to the more generic GGML_SYCL_ARCH option * Document that setting GGML_SYCL_ARCH can improve the performance

cuda : fix CUDA_FLAGS not being applied (llama/10403)

166237d

Add required ggml-base and backend libs to cmake pkg (llama/10407)

bfaf1fc

ggml : sync resolve (skip) (#0)

52799f9

sync : ggml

0eddc9f

talk-llama : sync llama.cpp

4e1f516

whisper : adapt to new ggml (wip)

8c24c64

ggerganov force-pushed the sync branch from 7393ba5 to 8c24c64 Compare November 20, 2024 12:03

ggml/sched : do not skip views in pre-assignments

c800966

ggerganov marked this pull request as ready for review November 20, 2024 13:57

whisper : use backend registry (#0)

e611417

ggerganov force-pushed the sync branch from fbae8dc to e611417 Compare November 20, 2024 18:54

ggerganov merged commit 37c8802 into master Nov 20, 2024
85 of 89 checks passed

ggerganov deleted the sync branch November 20, 2024 19:00

KitaitiMakoto mentioned this pull request Nov 21, 2024

ruby : Follow source tree change #2580

Merged
