
Refactor/online repacking #10446

Draft
wants to merge 2 commits into base: master
Conversation

@Djip007 (Contributor) commented Nov 21, 2024

This is WIP, not meant to be merged as-is.
Goal: consolidation of the CPU backend in preparation for reintegrating the AMX backend.

  • Commit 1: remove Q4_0_N_M from the ggml file tensor types; only the CPU backend knows about this type, and it does dynamic repacking for it only on "ggml_backend_cpu_aarch64_buffer_type".
  • Commit 2: "extract" the extra_buffer_type part (aarch64/hbm) and move most of it into its own .cpp/.h files (migrate aarch64 to C++).
  • TODO: a more general structure (class) for buffer->extra / extra_buffer
  • ...

I still have some questions, so this is only here for comments / ideas.

- remove from "file" tensor type
- allow only with dynamic repack
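For context, a rough sketch of what the dynamic repacking on "ggml_backend_cpu_aarch64_buffer_type" amounts to; the helper names repack_supported and repack_q4_0 are hypothetical stand-ins, not the actual functions in this PR:

#include <cstring>          // memcpy
#include "ggml-backend.h"   // ggml_backend_buffer_t, struct ggml_tensor

// hypothetical helpers standing in for the real repacking code
static bool repack_supported(const struct ggml_tensor * tensor);
static void repack_q4_0(struct ggml_tensor * tensor, const void * data, size_t size);

// Illustrative only: the repacking buffer type intercepts set_tensor and
// converts plain Q4_0 data into an interleaved layout before storing it,
// so the interleaved Q4_0_N_M types never need to exist in the GGUF file.
static void cpu_aarch64_buffer_set_tensor(ggml_backend_buffer_t buffer,
                                          struct ggml_tensor * tensor,
                                          const void * data, size_t offset, size_t size) {
    if (tensor->type == GGML_TYPE_Q4_0 && repack_supported(tensor)) {
        repack_q4_0(tensor, data, size);                    // repack into interleaved layout
    } else {
        memcpy((char *) tensor->data + offset, data, size); // default copy path
    }
    (void) buffer;
}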
@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Nov 21, 2024
@Djip007 (Contributor Author):

Not sure this is good; I cannot test it.
And it may not work/build on the master branch either.

Collaborator:

It could probably be removed; the normal CPU buffer type calls ggml_aligned_malloc, which already uses HBM. So at the moment this buffer type serves no purpose.

int64_t const matmul_num_cols = type_traits_cpu[type].ncols;
ggml_gemv_t const gemv = type_traits_cpu[type].gemv;
//int64_t const matmul_num_cols = type_traits_cpu[type].ncols;
//ggml_gemv_t const gemv = type_traits_cpu[type].gemv;
@Djip007 (Contributor Author):

It looks to me like this was not written for dynamic repacking, only for "native" Q4_0_N_M packing.
I left it commented out; it needs some work to be usable with dynamic repacking.

// move to ggml-cpu-traits...
static const struct ggml_cpu_tensor_traits * ggml_cpu_get_tensor_traits(
        const struct ggml_tensor * src0)
{
@Djip007 (Contributor Author):

Is it possible for "src0->extra" here to be something that is not part of the CPU backend?
I.e., can ggml_compute_zzzz be called with a weight that belongs to another backend/device?

@slaren (Collaborator) Nov 22, 2024:

I don't see the point. As I already told you, the way this is intended to be handled is with ggml_backend_sched.

To be clear, I am not opposed to making changes to the design if you can come up with a better way to do things, but I am not seeing an argument in favor of this. IMO there are clear advantages to keeping backends independent of each other, and more generally, in reducing coupling between the different components to a minimum.

@Djip007 (Contributor Author) Nov 22, 2024:

I think my question was not clear.
In ggml_compute_forward_mul_mat we test that the buffer is an "aarch64" buffer before using extra.
Can I remove this test and be sure that, if extra exists at this point, it was set by the CPU backend?

Is this test there to differentiate between the (future) different CPU extra buffers?

If so, and if we can have the same struct (or base class...) for all CPU buffer types, then we can remove this test.

My question was not about making it possible, but about whether we can simplify this function.
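For what it's worth, a minimal C++ sketch of the kind of shared base class meant here; purely illustrative, none of these names come from the PR:

// Hypothetical common interface for CPU "extra" data: every CPU extra buffer
// type (aarch64, AMX, ...) would attach an object derived from this to
// tensor->extra, so ggml_compute_forward_mul_mat would only need one check.
struct ggml_cpu_tensor_traits_base {
    virtual ~ggml_cpu_tensor_traits_base() = default;

    // return true if this traits object handled the op, false to fall back
    // to the generic CPU path
    virtual bool compute_forward(struct ggml_compute_params * params,
                                 struct ggml_tensor * op) = 0;
};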

Collaborator:

There are some reasons why we should not do this:

  • The RPC backend does not support backends that use extras. This is not completely unavoidable, but currently this is done because the cost of calling init_tensor on every tensor is too high since it requires a round trip between the server and the client, so the RPC backend skips these calls.
  • The CPU backend should be able to use buffers of iGPU backends that do not modify the weights to avoid extra copies. For example, the CPU backend can use Metal backend buffers since it uses host buffers. Making it require an extra would break that.
  • And the same applies the other way around. The BLAS backend can use CPU buffers without copies because it can assume that the tensors stored in the default CPU buffer are in standard ggml layout. If we just use one buffer type and use the extras, this would no longer be possible.

So I think it is better to keep tensor conversions in different buffer types.

@Djip007 (Contributor Author):

So I think it is better to keep tensor conversions in different buffer types.

Yes, I don't want to remove the fact that the tensor is in the "aarch64" buffer type; I just want to know if I can simplify this function like this:

static const struct ggml_cpu_tensor_traits* ggml_cpu_get_tensor_traits(const struct ggml_tensor * src0) {
    if (src0->extra != NULL) {
        return (struct ggml_cpu_tensor_traits*)src0->extra;
    }
    return NULL;
}

From what I see it is possible here, but I may have missed something.

@slaren (Collaborator) Nov 22, 2024:

No, you should check the buffer type before using the extra, because other backends that use host memory may want to set an extra. The CPU backend can use every buffer type that returns true from is_host, and the only requirements for a buffer to be considered a "host buffer" are that tensor->data points to an address in system memory and that the tensor data is stored in standard ggml layout.

But you can rename the aarch64 buffer type to some generic name like "ggml_cpu_repack_buffer_type" and reuse it for the AMX or other repackings.
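Put as code, the check being described is roughly this; a sketch that assumes the public ggml_backend_buffer_get_type accessor, the existing ggml_backend_cpu_aarch64_buffer_type(), and the traits struct from this PR:

static const struct ggml_cpu_tensor_traits * ggml_cpu_get_tensor_traits(const struct ggml_tensor * src0) {
    // only trust src0->extra when the tensor lives in the aarch64 (repack)
    // buffer type; other host buffers may set an extra for their own purposes
    if (src0->buffer != NULL
        && ggml_backend_buffer_get_type(src0->buffer) == ggml_backend_cpu_aarch64_buffer_type()
        && src0->extra != NULL) {
        return (const struct ggml_cpu_tensor_traits *) src0->extra;
    }
    return NULL;
}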

@Djip007 (Contributor Author):

  • The CPU backend should be able to use buffers of iGPU backends that do not modify the weights to avoid extra copies. For example, the CPU backend can use Metal backend buffers since it uses host buffers. Making it require an extra would break that.

Maybe that's what I missed. Is it possible for a weight to be initialized by the Metal backend, which adds an extra for its own needs, and then be used by the CPU backend?
But if that can happen, we would have two backends using the same weight on the same buffer_type, and both may want to register their own "extra"...

We may have to be more restrictive, like:

static const struct ggml_cpu_tensor_traits * ggml_cpu_get_tensor_traits(const struct ggml_tensor * src0) {
    if (src0->buffer
        && src0->buffer->usage == GGML_BACKEND_BUFFER_USAGE_WEIGHTS
        && src0->extra != NULL) {
        return (struct ggml_cpu_tensor_traits *) src0->extra;
    }
    return NULL;
}

@Djip007 (Contributor Author):

No, you should check the buffer type before using the extra because other backends that use host memory may want to set an extra. The CPU backend can use all buffer types that return true to is_host, and the only requirements for a buffer to be considered a "host buffer" is that tensor->data points to an address in system memory, and the tensor data is stored in standard ggml layout.

But you can rename the aarch64 buffer type to some generic name like "ggml_cpu_repack_buffer_type" and reuse it for the AMX or other repackings.

👍 OK, I see. (I did not read it before my last reply.)

@slaren (Collaborator) commented Nov 22, 2024:

Overall looks good. I am not sure about removing support for current Q4_0_x_x models, but I guess if we are going to do it, it is better to do it sooner than later.

@Djip007 (Contributor Author) commented Nov 22, 2024:

I am not sure about removing support for current Q4_0_x_x models, but I guess if we are going to do it, it is better to do it sooner than later.

Yes, that will be the main/difficult choice:

  • Allow weight repacking only at load time, which reduces the usefulness of mmap...
  • Allow adding new "block" types... and be prepared for lots of new types (AVX512 will want blocks of 16xN, AVX512BF16 of 2x16xN, AVX2 of 8xN, RDNA3 of 16x16 ...)

@Djip007 (Contributor Author) commented Nov 24, 2024:

@slaren I still need your expertise so as not to make too many mistakes.

I was looking for where params->wdata was created.

char * wdata = params->wdata;

To me it looks like it is in this function:
struct ggml_cplan ggml_graph_plan(
        const struct ggml_cgraph * cgraph,
        int                        n_threads,
        struct ggml_threadpool   * threadpool) {

Am I right?

If yes, it looks to me like the size is not calculated correctly for llamafile and Q4_0 repacking:

  • llamafile: we may compute a size for src[1] that may not be used.
  • Q4_0_M_N: the size may be computed with the wrong 'vec_dot_type'

case GGML_OP_MUL_MAT:
    {
        const enum ggml_type vec_dot_type = type_traits_cpu[node->src[0]->type].vec_dot_type;
        if (node->src[1]->type != vec_dot_type) {
            cur = ggml_row_size(vec_dot_type, ggml_nelements(node->src[1]));
        }
    } break;

Note: I'm trying to make this more generic to make it easier to reintegrate the AMX backend, so maybe it is not useful to fix it for now.
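For illustration only, a repack-aware version of that sizing could look something like this; hypothetical, since it assumes the traits struct exposes its own vec_dot_type, which the current code does not:

case GGML_OP_MUL_MAT:
    {
        // hypothetical: prefer the vec_dot_type of the repacked layout when
        // src0 was dynamically repacked, otherwise use the per-type table
        const struct ggml_cpu_tensor_traits * traits = ggml_cpu_get_tensor_traits(node->src[0]);
        const enum ggml_type vec_dot_type = traits != NULL
            ? traits->vec_dot_type  // hypothetical field on the traits struct
            : type_traits_cpu[node->src[0]->type].vec_dot_type;
        if (node->src[1]->type != vec_dot_type) {
            cur = ggml_row_size(vec_dot_type, ggml_nelements(node->src[1]));
        }
    } break;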

@ggerganov (Owner):

llamafile: we may compute size for src[1] that may not be used.

It's OK if we over-allocate a bit of memory for wdata even if it ends up not being needed. It would be best to add asserts in the different branches that validate wdata is big enough.
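For example, a minimal assert of that kind inside the mul_mat branch could be the following sketch, where src1 and vec_dot_type are the locals already available there:

// sketch: fail loudly if ggml_graph_plan under-sized the work buffer for
// the quantized copy of src1
GGML_ASSERT(params->wsize >= ggml_row_size(vec_dot_type, ggml_nelements(src1)));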

Q4_0_M_N: may be compute with wrong 'vec_dot_type'

Isn't vec_dot_type always GGML_TYPE_Q8_0 for the Q4_0_M_N?

@Djip007 (Contributor Author) commented Nov 24, 2024:

Q4_0_M_N: may be compute with wrong 'vec_dot_type'

Isn't vec_dot_type always GGML_TYPE_Q8_0 for the Q4_0_M_N?

Yes, that is the case for Q4_0_M_N, so it is not critical for now. Even if internally it is more of a Q8_0_N:

block_q8_0x4 * restrict y = (block_q8_0x4 *) vy;

But it may not work with other/future cases.

@slaren (Collaborator) commented Nov 24, 2024:

If we remove the old API and make the CPU backend accessible only through ggml-backend, then there will be a context that can be used to store the work buffer. Then the work buffer could simply be a std::vector in the context, and each operation that uses it only needs to resize it to the amount of memory it needs. Then we can remove ggml_cplan and related functions. However at this point this would break a lot of code.
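A rough sketch of that idea, assuming a hypothetical context layout (not the actual CPU backend context):

#include <cstddef>
#include <cstdint>
#include <vector>

// hypothetical CPU backend context that owns the work buffer, making
// ggml_cplan and the up-front sizing in ggml_graph_plan unnecessary
struct cpu_backend_context_sketch {
    std::vector<uint8_t> work_buffer;
};

// each op that needs scratch memory grows the buffer on demand
static void * cpu_get_work_buffer(cpu_backend_context_sketch & ctx, size_t size) {
    if (ctx.work_buffer.size() < size) {
        ctx.work_buffer.resize(size);
    }
    return ctx.work_buffer.data();
}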

@Djip007 (Contributor Author) commented Nov 24, 2024:

If we remove the old API and make the CPU backend accessible only through ggml-backend, then there will be a context that can be used to store the work buffer. Then the work buffer could simply be a std::vector in the context, and each operation that uses it only needs to resize it to the amount of memory it needs. Then we can remove ggml_cplan and related functions. However at this point this would break a lot of code.

So you confirm that for now this is where the size is calculated.

@slaren (Collaborator) commented Nov 24, 2024:

Yes, the size is calculated in the function ggml_graph_plan.
