[SYCL] Add q3_s and q1_s #5886

Merged (9 commits, Mar 11, 2024)

Conversation

@abhilash1910 (Collaborator)

Support GGML_TYPE_IQ3_S and GGML_TYPE_IQ1_S in mul_mat/dequant.
cc @NeoZhangJianyu @airMeng @ggerganov

@airMeng (Collaborator) commented Mar 5, 2024

Have you run test-backend-ops or verified at the model level?

@NeoZhangJianyu (Collaborator) commented Mar 6, 2024

  1. Please update ggml_backend_sycl_supports_op() to support the new types,
    and run ci/run.sh to test it. (A sketch of such a change follows below.)

  2. How do you test the new types? With which model or test case?
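A minimal sketch of the change item 1 asks for, assuming the backend reports MUL_MAT support based on the quantized type of src0; the actual ggml_backend_sycl_supports_op() in ggml-sycl.cpp may be structured differently:

static bool ggml_backend_sycl_supports_op(ggml_backend_t backend, const ggml_tensor * op) {
    (void) backend;
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            // Report support based on the weight tensor's type.
            switch (op->src[0]->type) {
                case GGML_TYPE_F16:
                case GGML_TYPE_Q4_0:
                // ... other types the backend already handles ...
                case GGML_TYPE_IQ3_S: // newly added
                case GGML_TYPE_IQ1_S: // newly added
                    return true;
                default:
                    return false;
            }
        // ... remaining ops ...
        default:
            return false;
    }
}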

@AlexKoff88

@abhilash1910, do you have any performance estimates for the supported types in this PR?

@abhilash1910 (Collaborator, Author)

  1. Please update ggml_backend_sycl_supports_op() to support the new types,
    and run ci/run.sh to test it.
  2. How do you test the new types? With which model or test case?

For:

  1. Handled.
  2. I think q3_s should support Mistral along with llama v1/v2, and the same for the 1.5-bit q1_s. test-backend-ops passes.

@NeoZhangJianyu (Collaborator) commented Mar 8, 2024


Great!

The UT will cover every op, but real inference depends on the conditions.
Have you tested the new types with a Mistral model, and did they pass?

@abhilash1910 (Collaborator, Author)


  • Tested on Mistral 7B without errors, functionality-wise.

@ggerganov (Owner) left a comment

I can't give this a test, but it seems OK.

Btw, I might need some of your help in #5940 to move the tables with quantum constants to the new ggml-common.h header. I will make the change and ping you to give it a try and confirm that it works.

@qnixsynapse (Contributor) commented Mar 9, 2024

@abhilash1910 increasing the grid space seems to fix the regression. However, IQ3_S is still throwing a GGML_ASSERT:
ggml-sycl.cpp:14654: false

Probably around here:

inline void ggml_sycl_op_dequantize_mul_mat_vec(
    const ggml_tensor *src0, const ggml_tensor *src1, ggml_tensor *dst,
    const char *src0_dd_i, const float *src1_ddf_i, const char *src1_ddq_i,
    float *dst_dd_i, const int64_t row_low, const int64_t row_high,
    const int64_t src1_ncols, const int64_t src1_padded_row_size,
    const dpct::queue_ptr &stream) {

    const int64_t ne00 = src0->ne[0];
    const int64_t row_diff = row_high - row_low;

    GGML_ASSERT(src1->type == GGML_TYPE_F32);

    // on some GPUs it is faster to convert src1 to half and to use half precision intrinsics
#ifdef GGML_SYCL_F16
    sycl_pool_alloc<sycl::half> src1_dfloat_a;
    sycl::half *src1_dfloat = nullptr; // dfloat == half

    bool src1_convert_f16 =
        src0->type == GGML_TYPE_Q4_0 || src0->type == GGML_TYPE_Q4_1 ||
        src0->type == GGML_TYPE_Q5_0 || src0->type == GGML_TYPE_Q5_1 ||
        src0->type == GGML_TYPE_Q8_0 || src0->type == GGML_TYPE_F16;

    if (src1_convert_f16) {
        src1_dfloat = src1_dfloat_a.alloc(ne00);
        const to_fp16_sycl_t to_fp16_sycl = ggml_get_to_fp16_sycl(src1->type);
        GGML_ASSERT(to_fp16_sycl != nullptr);
        to_fp16_sycl(src1_ddf_i, src1_dfloat, ne00, stream);
    }
#else
    const dfloat * src1_dfloat = (const dfloat *) src1_ddf_i; // dfloat == float, no conversion
#endif // GGML_SYCL_F16

    switch (src0->type) {
        case GGML_TYPE_Q4_0:
            dequantize_mul_mat_vec_q4_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q4_1:
            dequantize_mul_mat_vec_q4_1_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_0:
            dequantize_mul_mat_vec_q5_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_1:
            dequantize_mul_mat_vec_q5_1_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q8_0:
            dequantize_mul_mat_vec_q8_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q2_K:
            dequantize_mul_mat_vec_q2_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q3_K:
            dequantize_mul_mat_vec_q3_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q4_K:
            dequantize_mul_mat_vec_q4_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_K:
            dequantize_mul_mat_vec_q5_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q6_K:
            dequantize_mul_mat_vec_q6_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_F16:
            convert_mul_mat_vec_f16_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        default:
            GGML_ASSERT(false);
            break;
    }

    (void) src1;
    (void) dst;
    (void) src1_ddq_i;
    (void) src1_ncols;
    (void) src1_padded_row_size;
}
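That is, src0->type == GGML_TYPE_IQ3_S (or GGML_TYPE_IQ1_S) falls through to the default branch and trips GGML_ASSERT(false). A sketch of the cases the switch would need; the *_sycl wrapper names follow the file's existing naming pattern and are assumptions, not code from the PR:

        case GGML_TYPE_IQ3_S:
            dequantize_mul_mat_vec_iq3_s_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_IQ1_S:
            dequantize_mul_mat_vec_iq1_s_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;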

@airMeng (Collaborator) commented Mar 11, 2024

  @abhilash1910 increasing the grid space seems to fix the regression. However, IQ3_S is still throwing a GGML_ASSERT: ggml-sycl.cpp:14654: false

Echoing this: it can't work at the model level.

@abhilash1910 merged commit ef3ced2 into master Mar 11, 2024
64 checks passed
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
* Add q3_s and q1_s

* fix compilation

* fix build

* fix build

* fix build

* enable ops

* rm macro

* increase grid space
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024 (same squashed commit message as above)
@NeoZhangJianyu (Collaborator)

  @abhilash1910 increasing the grid space seems to fix the regression. However, IQ3_S is still throwing a GGML_ASSERT: ggml-sycl.cpp:14654: false

  Echoing this: it can't work at the model level.

@abhilash1910 Could you check and fix this issue?

@abhilash1910 (Collaborator, Author)


Fix in progress at #6052.

@ikawrakow (Contributor) left a comment

Please remove this incorrect implementation.

const int ib = tid%8; // 0...7
dst_t * y = yy + i*QK_K + 32*ib + 8*il;
const uint8_t * qs = x[i].qs + 8*ib;
const uint8_t * grid1 = (const uint8_t *)(iq3s_grid + qs[2*il+0]);
Contributor

This is wrong. When you copy-paste my code without attribution, please make sure you are copy-pasting the correct code.

@abhilash1910 (Collaborator, Author)

I think we are currently reviewing this. In terms of "attribution", I would suggest that we follow the CUDA code to adapt it to our SYCL backend, and since some parts of the code base are almost similar, I do not find a reason to be defensive about it.
Like I said before, we are working on this because not all CUDA code is applicable for us. I hope this makes communication easier.
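For context on what the reviewer flags: in the corresponding CUDA kernel the iq3s_grid index is 9 bits wide, with the high bit folded in from the block's qh field, whereas the hunk above indexes the grid with qs alone. A sketch of the CUDA-style lookup, reconstructed from memory; treat the exact bit arithmetic as an assumption and see PR #6014 for the authoritative version:

    // The grid index combines 8 low bits from qs with a 9th bit from qh;
    // dropping the qh bit (as in the flagged hunk) selects the wrong grid entry.
    const uint8_t * grid1 = (const uint8_t *)(iq3s_grid + (qs[2*il+0] | ((x[i].qh[ib] << (8 - 2*il)) & 256)));
    const uint8_t * grid2 = (const uint8_t *)(iq3s_grid + (qs[2*il+1] | ((x[i].qh[ib] << (7 - 2*il)) & 256)));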

const block_iq1_s * x = (const block_iq1_s *) vx;

const int tid = item_ct1.get_local_id(2);
#if QK_K == 256
Contributor

This is wrong. Please see PR #6014 for the correct implementation.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024 (same squashed commit message as above)