Q2 and Q3 quantization #1004
Conversation
The build failures on macOS show that I messed up using AVX2 intrinsics in the AVX part, so this probably won't work on an AVX-only machine without modification. Edit: removed the AVX parts; I had tested them with the wrong compiler flags.
I'm playing with the smallest GPT-2 models and trying to make them work with 4-bit quantization. They keep breaking down completely even after #951. I guess the small number of parameters requires very high precision in the quantization. However, I just noticed that if I keep just the last tensor in F16, they suddenly become coherent. Maybe keeping the last tensor in high precision could further improve your Q2 and Q3 as well.
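As a minimal sketch of that idea, the filter in a quantization loop could look roughly like this; the helper and the tensor name "output.weight" are hypothetical and only illustrate the suggestion, they are not code from this PR or from any converter:

#include <stdbool.h>
#include <string.h>

// Hypothetical predicate: decide whether a tensor gets quantized or stays in F16.
// The "output.weight" name is an assumption for this sketch; the real converter
// may name the final tensor differently.
static bool should_quantize(const char * name, int n_dims) {
    // only quantize 2D weight matrices
    if (n_dims != 2) {
        return false;
    }
    // keep the last (output projection) tensor in high precision
    if (strcmp(name, "output.weight") == 0) {
        return false;
    }
    return true;
}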
Probably nothing new, but here is a quick test of the perplexity performance with Q2_0:
Here is an AVX2 implementation of ggml_vec_dot_q2_0_q8_0:
static inline __m256i bytesFromi2(uint32_t packed1, uint32_t packed2) {
    __m128i bx1 = _mm_set1_epi32(packed1);
    __m128i bx2 = _mm_set1_epi32(packed2);
    __m256i bx = _mm256_set_m128i(bx1, bx2);

    // shift counts to get all bit pairs in lowest position of each byte
    const __m256i shift256 = _mm256_set_epi32(6, 4, 2, 0,
                                              6, 4, 2, 0);
    bx = _mm256_srlv_epi32(bx, shift256);

    const __m256i shufmask = _mm256_set_epi8(15, 11, 7, 3,
                                             14, 10, 6, 2,
                                             13,  9, 5, 1,
                                             12,  8, 4, 0,
                                             15, 11, 7, 3,
                                             14, 10, 6, 2,
                                             13,  9, 5, 1,
                                             12,  8, 4, 0);
    bx = _mm256_shuffle_epi8(bx, shufmask);

    const __m256i mask = _mm256_set1_epi8(3);
    bx = _mm256_and_si256(mask, bx);
    return bx;
}
static void ggml_vec_dot_q2_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int nb = n / QK2_0;

    assert(n % QK2_0 == 0);
    assert(nb % 2 == 0);

    const block_q2_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0f;

#if defined(__AVX2__)
    // Initialize accumulator with zeros
    __m256 acc = _mm256_setzero_ps();

    for (int i = 0; i < nb; i += 2) {
        __m256i bx = bytesFromi2(x[i+1].qs, x[i].qs);

        // Compute combined scale for the block
        const __m128 scale1 = _mm_set1_ps(GGML_FP16_TO_FP32(x[i].d)   * y[i/2].d);
        const __m128 scale2 = _mm_set1_ps(GGML_FP16_TO_FP32(x[i+1].d) * y[i/2].d);
        const __m256 scale = _mm256_set_m128(scale2, scale1);

        const __m256i off = _mm256_set1_epi8(2);
        bx = _mm256_sub_epi8(bx, off);

        // Load y vector
        const __m256i by = _mm256_loadu_si256((const __m256i *)y[i/2].qs);

        // Get absolute values of x vectors
        const __m256i ax = _mm256_sign_epi8(bx, bx);
        // Sign the values of the y vectors
        const __m256i sy = _mm256_sign_epi8(by, bx);
        // Perform multiplication and create 16-bit values
        const __m256i dot = _mm256_maddubs_epi16(ax, sy);

        // Convert int16_t to int32_t by adding pairwise
        const __m256i ones = _mm256_set1_epi16(1);
        __m256i i32 = _mm256_madd_epi16(ones, dot);

        // Convert int32_t to float
        __m256 p = _mm256_cvtepi32_ps(i32);
        // Apply the scale, and accumulate
        acc = _mm256_fmadd_ps(scale, p, acc);
    }

    // Return horizontal sum of the acc vector
    __m128 res = _mm256_extractf128_ps(acc, 1);
    res = _mm_add_ps(res, _mm256_castps256_ps128(acc));
    res = _mm_add_ps(res, _mm_movehl_ps(res, res));
    res = _mm_add_ss(res, _mm_movehdup_ps(res));
    sumf = _mm_cvtss_f32(res);
#else
    ...
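For comparison, a plain scalar version of the same dot product might look like this. It is only a sketch: it assumes the block layout used in this PR (block_q2_0 with an FP16 scale d and 16 two-bit quants packed into a uint32_t qs, block_q8_0 with a float scale and 32 int8 quants, QK2_0 = 16, QK8_0 = 32) and the GGML_FP16_TO_FP32 helper from ggml.c.

// Scalar sketch of the Q2_0 x Q8_0 dot product (two Q2_0 blocks share one
// Q8_0 block). Relies on the block_q2_0/block_q8_0 definitions and the
// GGML_FP16_TO_FP32 macro from this PR's ggml.c; not part of the PR itself.
static void ggml_vec_dot_q2_0_q8_0_scalar(const int n, float * restrict s,
                                          const void * restrict vx, const void * restrict vy) {
    assert(n % QK2_0 == 0);
    const int nb = n / QK2_0;

    const block_q2_0 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0f;
    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i/2].d;
        const uint32_t qs = x[i].qs;

        int sumi = 0;
        for (int j = 0; j < QK2_0; j++) {
            const int q = (int)((qs >> (2*j)) & 3) - 2;   // quant j, offset by 2 -> [-2, 1]
            sumi += q * y[i/2].qs[(i % 2)*QK2_0 + j];     // matching half of the Q8 block
        }
        sumf += d * (float) sumi;
    }
    *s = sumf;
}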
Thanks @slaren, I just added this. Apparently 2 bits are called a crumb, so I went with that. I originally wrote it for 128 bits because I thought it would be AVX-compatible, but it turns out just avoiding ...
That does help, but with a block size of 6 bytes for 16 weights (i.e. 3 bits per weight), Q2 is already 3-bit in a sense, and this again makes the file bigger. Q3 might be changed to use QK=24, but that doesn't really match the Q8 block size.
Don't know who would want to use the smallest file, except for entertainment. Sometimes it actually talks about building websites, but sometimes it goes off the rails:
Changing the sampling parameters (temperature etc.) may help, but I haven't played with those.
In case this is useful for future comparisons, currently 7B q2_0 AVX2 has a perplexity of 12.6438.
I wrote about this at #456 (comment), but I think there's a pretty obvious win by using a different q3_0 representation. Downside being, it's not the GPTQ format. Code and pull request at sw#1
Thanks to @pubby, the Q3 code is now faster on AVX2 and should be more amenable to other SIMD optimizations. You'll have to re-quantize the model, though.
Here's an attempt at porting to ARM NEON:
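(As a rough illustration of that direction, a 2-bit unpack on AArch64 might look like the following; this sketch, including its byte ordering, is an assumption and not the snippet from this comment.)

#include <arm_neon.h>
#include <stdint.h>

// Sketch: unpack 16 x 2-bit quants from a uint32_t into signed bytes on
// AArch64, with the same offset of 2 as the AVX2 code above (values in [-2, 1]).
static inline int8x16_t bytes_from_crumbs_neon(uint32_t packed) {
    // broadcast the packed word to all four 32-bit lanes
    const uint32x4_t p = vdupq_n_u32(packed);
    // per-lane right shift by 0, 2, 4, 6 (vshlq shifts right for negative counts)
    const int32x4_t shifts = { 0, -2, -4, -6 };
    const uint8x16_t sh = vreinterpretq_u8_u32(vshlq_u32(p, shifts));
    // gather the bytes so that output byte j holds quant j in its low bits
    const uint8x16_t idx = { 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 };
    const uint8x16_t gathered = vqtbl1q_u8(sh, idx);
    // keep the low 2 bits and subtract the offset
    const int8x16_t q = vreinterpretq_s8_u8(vandq_u8(gathered, vdupq_n_u8(3)));
    return vsubq_s8(q, vdupq_n_s8(2));
}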
WASM:
At this point I'm wondering if we should target a specific model size. Is there any environment (WASM, for example) where the 4 GB 7B Q4_0 is too large? Q2 probably shouldn't be merged as it's not really usable.
Final perplexity for LLaMA 30B:
Rebased onto master, but I kept the tensor/ftype numbering, because @TheBloke has published Alpaca/LoRA model files for Q2. These should still work now, but I haven't tested that. On the other hand, Q4_2 and Q4_3 will not work on this branch. If and when this gets merged, you will have to re-quantize your Q2/Q3 models. As for perplexity, thanks to everyone providing numbers; my machine is too slow for that... But it looks like Q2 isn't really worth it unless you have some extreme file/RAM size restrictions:
Regarding WASM, there is indeed a 32-bit memory model for now, so sizeof(size_t) == 4 and large models cannot be allocated. In practice, on my trial WASM platform (iOS a-Shell, where the whole system is WASM), malloc() calls for a little over 1 GiB start returning 0 (and mmap is just a wrapper around malloc() and pread() here, so it doesn't help). 2-bit quantization of LLaMA 7B wouldn't be sufficient compression for the particular WASM runtime I've been trying without some additional structured pruning and/or ahead-of-time model compilation. In any case, data larger than 4 GB can't be referenced in memory all at once (or "mmap"'d from a file), because the pointers are 32-bit for now.
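A minimal sketch of that constraint, assuming a wasm32 target where size_t is 32-bit; the size used here is just an example, not a measurement from that platform:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // example: roughly the size of a 7B Q4_0 model file
    const uint64_t model_size = 4ull * 1024 * 1024 * 1024;

    // on wasm32, SIZE_MAX is 2^32 - 1, so a >4 GiB model cannot even be addressed
    if (model_size > (uint64_t) SIZE_MAX) {
        fprintf(stderr, "model does not fit in a 32-bit address space\n");
        return 1;
    }

    // even when it fits in the address space, the runtime may cap single
    // allocations much lower (around 1 GiB on the platform described above)
    void * buf = malloc((size_t) model_size);
    if (buf == NULL) {
        fprintf(stderr, "allocation refused by the runtime\n");
        return 1;
    }
    free(buf);
    return 0;
}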
Obsolete thanks to #1684
This adds support for 2-bit and 3-bit quantization with an FP16 shared scale and 16 quants per block.
I don't consider it ready to merge, as we might come up with a different block format. The struct definitions are not portable: they use a #pragma for one, and are unlikely to work on big-endian systems.

#951 really is a game-changer for 2-bit quantization. I have updated my code to use the Q8 intermediate quantization.
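For concreteness, the block layouts could look roughly like this; the field names, the use of #pragma pack, and the exact Q3_0 packing are assumptions based on the description and may differ from the code in this branch:

#include <stdint.h>

#define QK2_0 16
#define QK3_0 16

typedef uint16_t ggml_fp16_t;   // FP16 storage type, as in ggml.h

#pragma pack(push, 1)           // the non-portable packing mentioned above
typedef struct {
    ggml_fp16_t d;              // shared FP16 scale
    uint32_t    qs;             // 16 x 2-bit quants
} block_q2_0;                   // 6 bytes per 16 weights, i.e. 3 bits/weight on disk

typedef struct {
    ggml_fp16_t d;              // shared FP16 scale
    uint16_t    qs[3];          // 16 x 3-bit quants packed into 48 bits
} block_q3_0;                   // 8 bytes per 16 weights, i.e. 4 bits/weight on disk
#pragma pack(pop)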
Q2 sample output (we may need to implement a profanity filter, but swearing is appropriate if you have to use PHP):
Q3 is more sensible, but I haven't played with it a lot.
As mentioned, both new types use FP16 and QK=16. This was easiest to implement for two reasons: I can just use half of a Q8 block, and 16 3-bit values can be mangled in an AVX2 register.
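Purely to illustrate how 16 three-bit values fit in 48 bits, here is a scalar unpack; the bit order (quant j at bits 3j..3j+2) and the offset of 4 are assumptions for this sketch, not necessarily what the SIMD code does:

#include <stdint.h>

// Unpack 16 x 3-bit quants from a 48-bit field into signed bytes in [-4, 3].
// The packing order and offset are assumed for illustration only.
static void unpack_q3_0(const uint16_t qs[3], int8_t out[16]) {
    const uint64_t bits = (uint64_t) qs[0]
                        | ((uint64_t) qs[1] << 16)
                        | ((uint64_t) qs[2] << 32);
    for (int j = 0; j < 16; j++) {
        const int q = (int)((bits >> (3*j)) & 7) - 4;
        out[j] = (int8_t) q;
    }
}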
So I've essentially invented my own new formats, but I'm aware there's a 3-bit GPTQ format. I'd love to re-use what's already been done, but haven't been able to find a clear definition (or model files) of that.
Looking forward to anyone finding better SIMD optimizations, especially for Q3, which is a pain in the butt...
Model file sizes:
Perplexity for 7B: (not going to let this run for over a day, can someone with a faster machine help out?)