Working around new int4wo weight packing #1389

Open
Jack-Khuu opened this issue Dec 7, 2024 · 4 comments · May be fixed by pytorch/torchchat#1404

Comments

@Jack-Khuu (Contributor)

Given the change in output shape/behavior from pytorch/pytorch#139611 + #1278:

Question: What is the recommended way of migrating to the new CPU implementations of

  • _weight_int4pack_mm_for_cpu
  • _convert_weight_to_int4pack_for_cpu

while maintaining the previous behavior?


Specifically, _convert_weight_to_int4pack:

        q, s, z = Q4_0.unpack(t)
        scales_and_zeros = pack_scales_and_zeros(s, z)
        q_uint8 = (q[::, ::2] << 4 | q[::, 1::2]).to(torch.uint8)
        weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(
            q_uint8, inner_k_tiles
        )
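
For reference, my guess at the migrated packing call (an assumption based on the new op name and the errors below, not a confirmed recipe): the CPU packer appears to take the unpacked int4 values as an int32 [N, K] tensor rather than the nibble-packed uint8 [N, K/2] tensor, and to return a 2D packed layout.

        q, s, z = Q4_0.unpack(t)
        scales_and_zeros = pack_scales_and_zeros(s, z)
        # Assumption: the CPU variant wants unpacked int4 values (0..15) as int32
        # of shape [N, K], so the manual uint8 nibble packing is no longer needed.
        weight_int4pack = torch.ops.aten._convert_weight_to_int4pack_for_cpu(
            q.to(torch.int32), inner_k_tiles
        )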

and _weight_int4pack_mm:

        c = torch.ops.aten._weight_int4pack_mm(
            input,
            weight_int4pack,
            groupsize,
            scales_and_zeros,
        )
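
On the matmul side, the call appears to keep the same argument order and only change the op name plus the expected packed-weight layout (again a sketch, not verified here):

        c = torch.ops.aten._weight_int4pack_mm_for_cpu(
            input,
            weight_int4pack,  # assumption: now the 2D uint8 [N, K/2] CPU layout
            groupsize,
            scales_and_zeros,
        )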

Tested: With no code changes

The following error is encountered:

Could not run 'aten::_convert_weight_to_int4pack' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend

Tested: Naive (just appending *_for_cpu)

A size mismatch was encountered (expected, since the signatures differ):

size mismatch for model.layers.0.attention.wq.weight: copying a param with shape torch.Size([2048, 1024]) from checkpoint, the shape in current model is torch.Size([256, 16, 32, 4]).
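
For context, the two shapes in that error are consistent with the new 2D CPU layout (in the checkpoint) versus the old 4D tensor-core-tiled layout (in the model), assuming N = K = 2048 and inner_k_tiles = 8 (my back-of-the-envelope check, not from the thread):

        N, K, inner_k_tiles = 2048, 2048, 8
        # Old layout produced by _convert_weight_to_int4pack (tensor-core tiled):
        old_shape = (N // 8, K // (inner_k_tiles * 16), 32, inner_k_tiles // 2)  # (256, 16, 32, 4)
        # New CPU layout, two int4 values packed per uint8 byte:
        new_shape = (N, K // 2)  # (2048, 1024)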

cc @yanbing-j @jerryzh168, who worked on the changes

@jerryzh168 (Contributor)

There is no change to the input shape, I believe, so the old code should work after you add _for_cpu.

size mismatch for model.layers.0.attention.wq.weight: copying a param with shape torch.Size([2048, 1024]) from checkpoint, the shape in current model is torch.Size([256, 16, 32, 4]).

This seems to be an error from loading an unquantized model state dict into a quantized model?

@jerryzh168 (Contributor)

@yanbing-j can you make the corresponding changes in torchchat (https://github.com/pytorch/torchchat/blob/main/torchchat/utils/gguf_loader.py#L609C17-L614C18) as well? It would also be helpful to add some docs for https://github.com/pytorch/pytorch/blob/7939b5f5f9b073984c26adef1446fa250a20bceb/aten/src/ATen/native/LinearAlgebra.cpp#L3457 and friends so that the expected input and output dimensions are clear.

@yanbing-j (Contributor)

@Jack-Khuu @jerryzh168

I followed https://github.com/pytorch/torchchat/blob/main/.github/workflows/pull.yml#L830-L874 to reproduce this issue.
With pytorch/torchchat#1404, it can now run on CPU. The root cause is that the weight in WeightOnlyInt4Linear needs to be updated.
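
Roughly, the kind of change that implies (a sketch of my understanding, not code from the PR; the flag and variable names here are assumptions): the packed-weight buffer that WeightOnlyInt4Linear registers has to use the 2D CPU layout instead of the old 4D tiled layout, so that a checkpoint packed with _convert_weight_to_int4pack_for_cpu no longer hits the size mismatch above.

        # Hypothetical sketch inside WeightOnlyInt4Linear.__init__; `use_cpu_kernel`
        # is an assumed flag selecting the *_for_cpu path.
        if use_cpu_kernel:
            packed = torch.empty(out_features, in_features // 2, dtype=torch.uint8)
        else:
            packed = torch.empty(
                out_features // 8,
                in_features // (inner_k_tiles * 16),
                32,
                inner_k_tiles // 2,
                dtype=torch.int32,
            )
        self.register_buffer("weight", packed)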

@Jack-Khuu (Contributor, Author)

Thanks @yanbing-j, I'll follow up in the other PR

there is no change of the input shape I believe

There is a change in input type and output shape, I believe?
