ggml : Load data into int8x16x4_t using vld4q_s8 on arm64 #1738
)" This reverts commit 8432d4d.
@lindeer |
commit b617f28
Merge: 73cc5b8 92f44ff
Author: Concedo
Date: Fri Jun 9 16:10:35 2023 +0800
    Merge branch 'master' into concedo_experimental

commit 73cc5b8
Author: Concedo
Date: Fri Jun 9 16:09:23 2023 +0800
    added warning message for unsupported K quants

commit 92f44ff
Author: AT
Date: Fri Jun 9 04:00:51 2023 -0400
    metal : add GELU implementation (ggerganov#1770)
    Co-authored-by: Adam Treat

commit 245fc3c
Author: Kawrakow
Date: Fri Jun 9 10:39:59 2023 +0300
    metal : faster q4_0 (ggerganov#1775)
    * metal : 8% faster q4_0
      Avoid copying into local uchar4 anf float4.
    * metal : 17% faster Q4_0
      Use 64 threads in a thread group.
    Co-authored-by: Iwan Kawrakow

commit 01dc509
Merge: 0833845 72ff528
Author: Concedo
Date: Fri Jun 9 14:53:35 2023 +0800
    Merge branch 'master' into concedo_experimental

commit 0833845
Author: Concedo
Date: Fri Jun 9 14:38:31 2023 +0800
    merged metal patch directly into the file

commit 72ff528
Author: Kawrakow
Date: Thu Jun 8 22:28:21 2023 +0300
    metal : add Q2_K implementation (ggerganov#1762)
    * metal : add Q2_K implementation
      27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0. Memory throughput is ~156 GB/s. The access pattern used in the Q2_K CUDA implementation resulted in significantly lower performance (~31 ms/token).
    * Fixing merge conflicts
    Co-authored-by: Iwan Kawrakow

commit 0bf7cf1
Author: Georgi Gerganov
Date: Thu Jun 8 20:48:14 2023 +0300
    Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (ggerganov#1738)"
    This reverts commit 8432d4d.

commit 8432d4d
Author: le.chang
Date: Fri Jun 9 00:47:56 2023 +0800
    ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (ggerganov#1738)

commit 6fa1613
Author: Hyun-joo KIM
Date: Fri Jun 9 01:47:36 2023 +0900
    Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment

commit 0f291e1
Author: Kawrakow
Date: Thu Jun 8 19:46:22 2023 +0300
    metal : Q6_K implementation (ggerganov#1752)
    * Metal implementation for Q4_K
      Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU.
    * Optimizing Q4_K on metal
      The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough.
    * Optimizing q4_K metal dot some more
      For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0.
    * Fix after merge with master
    * Metal implementation for Q6_K
      Similar to the CUDA implementation. No idea if this is the optimum for Metal, but the few alternative variants I tried all had a lower performance. We get 36.5 ms / token on M2 Max with 30 GPU cores. This corresponds to ~200 GB/second throughput.
    * clang-tidy : add config back
    * Much better Q6_K implementation for metal
      28.3 ms / token for 7B. Subtracting ~9 ms that is spent in other compute graph operations, we are left with ~19 ms for the matrix multiplications. The model is ~5.5 GB, so we are getting 1000 / 19 * 5.5 = 290 GB/s!
    Co-authored-by: Iwan Kawrakow

commit 7f18160
Author: Hyun-joo KIM
Date: Fri Jun 9 01:24:22 2023 +0900
    Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment

commit 8fc8179
Author: qingfengfenga
Date: Thu Jun 8 15:58:53 2023 +0800
    Add llama.cpp docker support for non-latin languages (ggerganov#1673)
    * Modify Dockerfile default character set to improve compatibility (ggerganov#1673)

commit b50b570
Author: Steven Roussey
Date: Thu Jun 8 00:12:28 2023 -0700
    ggml : fix fprintf warnings (ggerganov#1720)

commit 53aba3f
Author: Georgi Gerganov
Date: Thu Jun 8 10:09:08 2023 +0300
    clang-tidy : restore dot file from accidental deletion

commit 4161bdc
Author: Kawrakow
Date: Thu Jun 8 10:08:23 2023 +0300
    metal : add Q4_K implementation (ggerganov#1733)
    * Metal implementation for Q4_K
      Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU.
    * Optimizing Q4_K on metal
      The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough.
    * Optimizing q4_K metal dot some more
      For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0.
    * Fix after merge with master
    Co-authored-by: Iwan Kawrakow

commit 0035858
Author: johnson442
Date: Thu Jun 8 08:02:48 2023 +0100
    k-quants : add missing compile definition to CMakeLists (ggerganov#1748)
@ggerganov A Debian Linux system with a HUAWEI Kirin 990.
@lindeer No, you need something like this:
Thanks @ikawrakow, maybe there is something wrong with my usage.
#ifdef __LITTLE_ENDIAN__
#define vld1q_s8_x4(__p0) __extension__ ({ \
int8x16x4_t __ret; \
__builtin_neon_vld1q_x4_v(&__ret, __p0, 32); \
__ret; \
})
#else
#define vld1q_s8_x4(__p0) __extension__ ({ \
int8x16x4_t __ret; \
__builtin_neon_vld1q_x4_v(&__ret, __p0, 32); \
\
__ret.val[0] = ...
__ret; \
})
#endif
but an error is reported in this context, and I could not figure it out:
while
#ifdef __LITTLE_ENDIAN__
#define vld1q_u8_x2(__p0) __extension__ ({ \
uint8x16x2_t __ret; \
__builtin_neon_vld1q_x2_v(&__ret, __p0, 48); \
__ret; \
})
#ifdef __LITTLE_ENDIAN__
#define vld4q_s8(__p0) __extension__ ({ \
int8x16x4_t __ret; \
__builtin_neon_vld4q_v(&__ret, __p0, 32); \
__ret; \
})
$ gcc --version
gcc (Uos 8.3.0.3-3+rebuild) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
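For readers following the two intrinsics discussed above, here is a minimal sketch of how their memory layouts differ. It is not the code that was posted or merged in this thread. `vld1q_s8_x4` fills four registers with 64 contiguous bytes, while `vld4q_s8` de-interleaves the same 64 bytes with a stride of 4, so the two are not interchangeable. Whether that difference is the reason for the revert, and whether the GCC 8.3 toolchain reported above actually lacks `vld1q_s8_x4`, are assumptions. The names `lanes4_t`, `load_contiguous`, `load_deinterleaved`, and `USE_NEON` are hypothetical, and the non-NEON branches are scalar stand-ins so the sketch also builds off arm64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0
#endif

/* Hypothetical plain-C mirror of int8x16x4_t so the lanes can be inspected. */
typedef struct {
    int8_t val[4][16];
} lanes4_t;

/* Contiguous layout: val[r][i] = p[16*r + i].
 * Uses four single-register loads, so it does not depend on vld1q_s8_x4
 * being available in the toolchain's arm_neon.h. */
static lanes4_t load_contiguous(const int8_t *p) {
    lanes4_t out;
    assert(p != NULL);
    for (int r = 0; r < 4; r++) {
#if USE_NEON
        vst1q_s8(out.val[r], vld1q_s8(p + 16 * r));
#else
        memcpy(out.val[r], p + 16 * r, 16); /* scalar stand-in */
#endif
    }
    return out;
}

/* De-interleaved layout: val[r][i] = p[4*i + r], as produced by vld4q_s8. */
static lanes4_t load_deinterleaved(const int8_t *p) {
    lanes4_t out;
    assert(p != NULL);
#if USE_NEON
    int8x16x4_t v = vld4q_s8(p);
    for (int r = 0; r < 4; r++) {
        vst1q_s8(out.val[r], v.val[r]);
    }
#else
    for (int r = 0; r < 4; r++) {
        for (int i = 0; i < 16; i++) {
            out.val[r][i] = p[4 * i + r]; /* scalar stand-in */
        }
    }
#endif
    return out;
}
```

Calling both helpers on a 64-byte buffer holding 0..63 gives register 0 as 0, 1, 2, ... in the contiguous case and 0, 4, 8, ... in the de-interleaved case. A dot-product kernel written for one layout would therefore pair the wrong elements if handed the other.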
* Fix after merge with master --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 0035858273ebe0694926bf4414d279f3e1cd109d Author: johnson442 <[email protected]> Date: Thu Jun 8 08:02:48 2023 +0100 k-quants : add missing compile definition to CMakeLists (#1748) commit b617f2847b5914736ccf65bec22caaf49b39c0a8 Merge: 73cc5b8 92f44ff Author: Concedo <[email protected]> Date: Fri Jun 9 16:10:35 2023 +0800 Merge branch 'master' into concedo_experimental commit 73cc5b88fbed75d540346bfad11cc5c1e0678705 Author: Concedo <[email protected]> Date: Fri Jun 9 16:09:23 2023 +0800 added warning message for unsupported K quants commit 01dc509038d5288c9139c60005aba63c0565b379 Merge: 0833845 72ff528 Author: Concedo <[email protected]> Date: Fri Jun 9 14:53:35 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # .devops/full.Dockerfile # .devops/main.Dockerfile # CMakeLists.txt commit 0833845268339719a490269faefe66ac1d2d1dd5 Author: Concedo <[email protected]> Date: Fri Jun 9 14:38:31 2023 +0800 merged metal patch directly into the file commit 6fa1613f15c7b92fa1279426dc15eae541d0e7be Author: Hyun-joo KIM <[email protected]> Date: Fri Jun 9 01:47:36 2023 +0900 Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment commit dee692a63e0801c24f371f49bda83d4f0c1e95a1 Author: SammCheese <[email protected]> Date: Thu Jun 8 15:56:25 2023 +0200 compability with basic_api, change api path to /extra commit b4e9e185d34d476153de8d6389fc65dfffb51fc9 Author: SammCheese <[email protected]> Date: Thu Jun 8 15:21:00 2023 +0200 fix legacy streaming commit 9a8da35ec4a3d37f532a199c3244c0314ea28a61 Author: SammCheese <[email protected]> Date: Thu Jun 8 06:18:23 2023 +0200 working streaming. 
TODO: fix lite commit 97971291e9e5d05c428d4c0f0cb8f956e36a63c5 Author: SammCheese <[email protected]> Date: Wed Jun 7 00:48:00 2023 +0200 draft: token streaming commit 7f181600c77efb48a1b2a2e30ff0cd50c294ebea Author: Hyun-joo KIM <[email protected]> Date: Fri Jun 9 01:24:22 2023 +0900 Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment commit a6a0fa338a8fb390c47ca85e11ce54672ceed38b Author: Concedo <[email protected]> Date: Thu Jun 8 22:40:53 2023 +0800 cleanup indentation, fixing cublas build commit a979e71ddc6712e57736578e6218abacf431995f Author: Concedo <[email protected]> Date: Thu Jun 8 16:28:26 2023 +0800 add obj flags to all output make targets commit 6635f7efce3389a0b15d3a01cdc85c4e65c8bccc Author: Concedo <[email protected]> Date: Thu Jun 8 00:20:32 2023 +0800 updated lite commit 49a6be3d872b6f798e9bb6a469602aa65b0cdf0a Author: Concedo <[email protected]> Date: Wed Jun 7 22:29:38 2023 +0800 add llama metal compile flags as an option commit 7b0707ff264f1f1e983b972842cb9ddba68e1503 Merge: e78c675 5c64a09 Author: Concedo <[email protected]> Date: Wed Jun 7 17:06:56 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile commit e78c675a6eae84bfcd4b44f5a9918989ce836976 Merge: ed603dc 5b57a5b Author: Concedo <[email protected]> Date: Wed Jun 7 15:23:29 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # README.md # flake.lock # flake.nix # ggml-opencl.cpp commit ed603dcafc5224de5199f94b01c976c7d07fc87e Merge: c046db5 2d43387 Author: Concedo <[email protected]> Date: Tue Jun 6 23:12:01 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md # docs/BLIS.md # llama.cpp # tests/test-quantize-fns.cpp commit c046db51973bc5cc76e3146d30c3ca73340e1bd0 Author: Concedo <[email protected]> Date: Tue Jun 6 22:38:25 2023 +0800 lite bugfixes, buffer size changes, fixed a topk bug. 
commit 2e5edc80e003985411fac566a7938564b9424dd0 Author: Concedo <[email protected]> Date: Mon Jun 5 23:56:24 2023 +0800 updated lite commit 79df932d0a798d46b64c5dffc57ec89053522dc3 Author: Concedo <[email protected]> Date: Mon Jun 5 22:50:21 2023 +0800 added dropdown for blasbatch. added capability to build avx clblast but not in default build for now commit 54dc75ce73913293044ed49914458ed529eee554 Merge: c27f250 f6431de Author: Concedo <[email protected]> Date: Mon Jun 5 13:31:53 2023 +0800 Merge branch 'concedo-opencl-dev' into concedo_experimental commit f6431ded5d28433e158b1621f667871b084e413f Author: Concedo <[email protected]> Date: Mon Jun 5 13:31:37 2023 +0800 removed flags from the CL pool malloc, apply code tidying suggestions. commit c27f250b6f22f2f681120db6da50dd4b95b6539d Author: Concedo <[email protected]> Date: Mon Jun 5 13:24:53 2023 +0800 bigger scratch buffer for 3B llama commit 927005626907ce3531be885855212238123c3636 Author: Concedo <[email protected]> Date: Mon Jun 5 11:48:04 2023 +0800 fixed compile error in cmake VS commit b7fb1aa233e9feb7f211eb1ed33fe421b7efc57d Author: Concedo <[email protected]> Date: Sun Jun 4 22:34:27 2023 +0800 removed build info in cmake commit 6f66e4c4a5dd1307451a4d1dd4d66438eba1456f Author: Concedo <[email protected]> Date: Sun Jun 4 22:27:15 2023 +0800 updated lite commit 9aa2d8535b7e8a27e5a017769eefcd5b5c7505e6 Author: Concedo <[email protected]> Date: Sun Jun 4 21:47:17 2023 +0800 hide gpu input box when dropdown not selected, minor memory fix for neox and gptj commit 1ddbb9acd97954036044a005b79250a5d2d8b3c3 Merge: dd4b5c6 64e3e74 Author: Concedo <[email protected]> Date: Sun Jun 4 18:07:27 2023 +0800 Merge branch 'concedo-opencl-dev' into concedo_experimental # Conflicts: # ggml-opencl.cpp commit 64e3e74556f27247d444bcde4bc5873b27d75aae Author: Concedo <[email protected]> Date: Sun Jun 4 18:04:52 2023 +0800 change max value size_t to use limits commit 2b700749e5eb0d0e5eab43eb65ddece06bafada0 Merge: 59fe168 
dcb2ed4 Author: LostRuins <[email protected]> Date: Sun Jun 4 18:00:06 2023 +0800 Merge branch 'master' into concedo-opencl-dev commit dd4b5c64b839484f64d127317d287731f46e08ac Merge: 8891909 dcb2ed4 Author: Concedo <[email protected]> Date: Sun Jun 4 17:38:22 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # ggml-opencl.cpp commit 88919095b50fe1286c1d3fa88c2d61f09382e06f Author: Concedo <[email protected]> Date: Sun Jun 4 12:09:49 2023 +0800 edit readme commit c3c05fc33b56d507642a5531471758ff189475e3 Author: Concedo <[email protected]> Date: Sun Jun 4 11:57:46 2023 +0800 further cleanup, refactor renamemode to hordeconfig commit 2868fac676c5f266528895c0806659627c8dc39e Merge: 20803c2 d8bd001 Author: Concedo <[email protected]> Date: Sun Jun 4 11:07:07 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # .devops/tools.sh # README.md commit 20803c221ecf408e78dd434be818d636f1e17bab Author: Concedo <[email protected]> Date: Sun Jun 4 11:05:46 2023 +0800 cleaning up some old junk commit b62279cb39d1595f21ed9ebcb5c2bc07d2d3c5f3 Author: Concedo <[email protected]> Date: Sun Jun 4 00:41:08 2023 +0800 buf size for starcoder still not good commit c1b293d31ae0a2526539cce945c89843d7879baf Author: Concedo <[email protected]> Date: Sat Jun 3 18:37:13 2023 +0800 fixed MPT ooms commit 8bd9a3a48b0acb494acffdc81555853d720cc0b0 Author: Concedo <[email protected]> Date: Sat Jun 3 17:17:15 2023 +0800 updated readme, improved simple launcher commit 6f82e17b7ab02ee555607fa4c366c320c666d4c3 Author: Concedo <[email protected]> Date: Sat Jun 3 16:14:08 2023 +0800 added MPT support commit 9839259b63a2b0f1490ade1c701d7d412998f814 Author: Concedo <[email protected]> Date: Sat Jun 3 00:55:44 2023 +0800 allow specifying the horde limit as well commit 96b0e536b7035a96ed13c6b78e03a9c28d11ad14 Merge: 8d0c81e 59fe168 Author: Concedo <[email protected]> Date: Fri Jun 2 22:12:14 2023 +0800 Merge branch 'opencl-dev-concedo' into concedo_experimental commit 
59fe16877d0f40a88834f5f173f8b01fd7d99e4a Author: Concedo <[email protected]> Date: Fri Jun 2 22:10:49 2023 +0800 Clblast fixes + enhancements to save VRAM: 1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them. 2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer 3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it. commit 8d0c81e7ccb93f092f0fd7e7dfc21d83fddae404 Merge: 144d8a8 24239f0 Author: Concedo <[email protected]> Date: Fri Jun 2 12:19:59 2023 +0800 Merge remote-tracking branch 'occam/opencl-dev' into concedo_experimental commit 144d8a831280bbf6edc9e66cbe0eacc101c6ba29 Author: Concedo <[email protected]> Date: Fri Jun 2 12:19:51 2023 +0800 updated lite commit 24239f0df7e9f29cddeffd42b9b606dd89ebb819 Author: 0cc4m <[email protected]> Date: Thu Jun 1 18:57:08 2023 +0200 Improve implementation commit 37659d2c4e38adebf3811d0997c8a79e9719edff Author: Concedo <[email protected]> Date: Thu Jun 1 22:33:50 2023 +0800 allow blasbatchsize -1 which disables blas, but keeps benefits like gpu offloads. 
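Note on the pool malloc changes in commit 59fe168 above (items 2 and 3): the recycling policy described there can be sketched as below. This is only an illustration under stated assumptions, not the code from ggml-opencl.cpp. `PoolBuffer`, `BufferPool`, `acquire`, `release`, `size_of` and `count` are hypothetical names, and a plain size field stands in for a real `cl_mem` allocation. The pool hands out the smallest free buffer that fits. If every free buffer is too small, it resizes the largest free one rather than growing the pool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device buffer; a real pool would hold a cl_mem handle.
struct PoolBuffer {
    std::size_t size;
    bool in_use;
};

class BufferPool {
public:
    // Returns the index of a buffer of at least `size` bytes and marks it in use.
    std::size_t acquire(std::size_t size) {
        assert(size > 0);
        const std::size_t none = buffers_.size();
        std::size_t best = none;     // smallest free buffer that fits the request
        std::size_t largest = none;  // largest free buffer, whether it fits or not
        for (std::size_t i = 0; i < buffers_.size(); ++i) {
            const PoolBuffer &b = buffers_[i];
            if (b.in_use) continue;
            if (b.size >= size && (best == none || b.size < buffers_[best].size)) best = i;
            if (largest == none || b.size > buffers_[largest].size) largest = i;
        }
        if (best != none) {
            // Item 2: reuse the SMALLEST buffer that fits, not the first one found.
            buffers_[best].in_use = true;
            return best;
        }
        if (largest != none) {
            // Item 3: all free buffers are too small, so resize the largest one.
            // A real implementation would release and re-create the device buffer here.
            buffers_[largest].size = size;
            buffers_[largest].in_use = true;
            return largest;
        }
        // Nothing free at all: grow the pool.
        buffers_.push_back(PoolBuffer{size, true});
        return buffers_.size() - 1;
    }

    void release(std::size_t index) { buffers_.at(index).in_use = false; }
    std::size_t size_of(std::size_t index) const { return buffers_.at(index).size; }
    std::size_t count() const { return buffers_.size(); }

private:
    std::vector<PoolBuffer> buffers_;
};
```

Picking the smallest fitting buffer leaves large buffers free for large requests. Resizing the largest free buffer when nothing fits keeps the number of live allocations from growing, which is the VRAM saving the commit message is after.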
commit 49272e3c53455a55c0a96dcaebf08199d17861c6 Author: Concedo <[email protected]> Date: Thu Jun 1 20:03:44 2023 +0800 adjusted defaults commit 457aaf5badfe1914f02f8024246d7c5e27aa0ade Author: 0cc4m <[email protected]> Date: Thu Jun 1 07:33:32 2023 +0200 Reduce code duplication between cuda and opencl branches commit 234270bd83e4c07b55202b3be5d4bb94e81c6e7e Author: Concedo <[email protected]> Date: Thu Jun 1 00:14:22 2023 +0800 back to 32 block size, not better commit 446e42a8c6ac8085359cb3985df23a26b02e027f Author: Concedo <[email protected]> Date: Wed May 31 21:40:12 2023 +0800 change dmmv block size commit 077ee4e989a37bc3ef5a337351bfe206f8f834da Author: Concedo <[email protected]> Date: Wed May 31 18:00:52 2023 +0800 Revert "Revert "opencl : no need to allocate cl_mem on heap (#1612)"" This reverts commit 4afa38e7446f997d7034e239e718be26e524638f. commit 50c85bea4cb353b25287494a9ea46d82097a94c2 Merge: 32dada5 5e1eecf Author: Concedo <[email protected]> Date: Wed May 31 17:53:14 2023 +0800 Merge remote-tracking branch 'occam/opencl-dev' into concedo_experimental commit 32dada5e5f5ef96f34bec89c03b38a8bf8403e5c Author: Concedo <[email protected]> Date: Wed May 31 17:52:09 2023 +0800 updated lite commit 5e1eecfe12097c98d56f79236a7b1277be09eeaa Author: 0cc4m <[email protected]> Date: Wed May 31 07:07:47 2023 +0200 Adapt to #1612 cl_mem malloc changes commit 49aaf08387254f64f58660772944e69ceae5bbac Merge: ac6b49e ffb06a3 Author: 0cc4m <[email protected]> Date: Wed May 31 06:58:51 2023 +0200 Merge remote-tracking branch 'origin/master' into opencl-dev commit a5a85d68c654b873cb1c93f407ff758a5d7875d4 Merge: 85c9f7d ffb06a3 Author: Concedo <[email protected]> Date: Wed May 31 10:51:54 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # llama.cpp commit 85c9f7df4135f02f086fe73ea184813e43bf861f Merge: 4afa38e ac6b49e Author: Concedo <[email protected]> Date: Wed May 31 10:20:32 2023 +0800 Merge remote-tracking branch 'occam/opencl-dev' into 
concedo_experimental commit 4afa38e7446f997d7034e239e718be26e524638f Author: Concedo <[email protected]> Date: Wed May 31 10:20:23 2023 +0800 Revert "opencl : no need to allocate cl_mem on heap (#1612)" This reverts commit bb051d9723d628414b9e929e5264e23262a2f1b2. commit ac6b49ed45b959e75f6ec7432fb6a5a2dc88cc4e Author: 0cc4m <[email protected]> Date: Tue May 30 18:49:53 2023 +0200 Reduce queueing overhead for contiguous tensors by using single mul kernel call commit 56456797f447e6fc32fe3d450c2ce1554c99cce2 Merge: ea336bf 7552ac5 Author: Concedo <[email protected]> Date: Tue May 30 22:15:58 2023 +0800 Merge branch 'master' into concedo_experimental commit ea336bfa332301c9f520e08fa7a538cc60dc0c96 Author: Concedo <[email protected]> Date: Mon May 29 22:40:27 2023 +0800 rwkv eos commit 6b3373cb811435a678493467f81f2ec8e89f7a32 Author: Concedo <[email protected]> Date: Mon May 29 22:06:12 2023 +0800 revert bad fix commit ef16d09a51db8d4f1d298edecc11775768e19576 Author: Concedo <[email protected]> Date: Mon May 29 18:54:15 2023 +0800 fix for older gcc, updated lite commit 3a73ebe8d2172ba8a8234d7ecb34397deffabe55 Merge: 254a9ff 0e730dd Author: Concedo <[email protected]> Date: Mon May 29 16:47:32 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # .devops/full.Dockerfile # .devops/main.Dockerfile # Makefile commit 254a9ff12c6d140e225a900f02050f18bed4e87b Merge: 30ff113 ebc5d06 Author: Concedo <[email protected]> Date: Mon May 29 16:26:24 2023 +0800 Merge commit 'ebc5d0651a1af44a2aecf503c1ceecede1ef99c4' into concedo_experimental # Conflicts: # ggml-opencl.cpp commit 30ff1133f510370229b79e79668a55c5e30c1ff8 Author: Concedo <[email protected]> Date: Mon May 29 16:01:05 2023 +0800 allow users to rename models for use in horde commit 97b39f875c3593a8fe16400d259c0436edcd6737 Author: Concedo <[email protected]> Date: Mon May 29 15:50:07 2023 +0800 fixed fstat64 build error on mac commit 28f1196f65eb27a8f231fb83b38c91fb42a1a11a Author: Concedo <[email 
protected]> Date: Sun May 28 19:36:21 2023 +0800 adjust default rep pen range commit 7d159bacd7a6ac53c95b431f1d7bd5c4c4774a1d Author: Concedo <[email protected]> Date: Sun May 28 11:23:20 2023 +0800 updated kobold lite commit dcc426e2de9d40909b71f59bd6942ef68db6553d Merge: 5d9f5b2 0df7d63 Author: Concedo <[email protected]> Date: Sun May 28 01:08:39 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CMakeLists.txt # Makefile # README.md commit 5d9f5b28a6ec9c0e2c5271550ef999dcfb15b209 Author: Concedo <[email protected]> Date: Sun May 28 00:48:56 2023 +0800 rwkv integration completed commit 55e0fbf0247c285f494dbee75a713650fd186d71 Author: Concedo <[email protected]> Date: Sat May 27 22:45:28 2023 +0800 wip integrating new rwkv commit fe63bfdb0f5e7e14aeeb324d7849e4c529874c50 Author: Concedo <[email protected]> Date: Sat May 27 18:13:27 2023 +0800 Revert "allow 2048 blasbatchsize" This reverts commit 94dc5c2324100d13ce3ce0587f146f91cba8241e. 
commit 97c5cca4e5f2b24b631e8c08e8e872fb0864fd3e Author: 0cc4m <[email protected]> Date: Sat May 27 12:00:56 2023 +0200 OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel commit 94dc5c2324100d13ce3ce0587f146f91cba8241e Author: Concedo <[email protected]> Date: Sat May 27 17:47:18 2023 +0800 allow 2048 blasbatchsize commit 92a0d77712bd5ce9399b702fd18c9939dec78893 Merge: abfdfb7 bdbda1b Author: Concedo <[email protected]> Date: Sat May 27 17:44:14 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile commit abfdfb702e483c98d7644813c61ff77ee38b49c6 Author: Concedo <[email protected]> Date: Sat May 27 17:32:37 2023 +0800 added top_a sampler commit ebc5d0651a1af44a2aecf503c1ceecede1ef99c4 Author: 0cc4m <[email protected]> Date: Sat May 27 10:03:35 2023 +0200 Use events instead of clFinish, where possible commit 01a0f206dfaf47f22f40b3a231563b0677686e1f Author: Concedo <[email protected]> Date: Sat May 27 13:35:40 2023 +0800 added support for starcoder, which is basically gpt2 commit 6d7749c98f9ac23c4ffd42f7421b5ddbc916c2f2 Author: Concedo <[email protected]> Date: Sat May 27 12:42:19 2023 +0800 no difference commit bd4fe936f53ef39cdcc64413ea8772b89c8442f9 Author: Concedo <[email protected]> Date: Sat May 27 11:58:39 2023 +0800 cleanup sampling code commit 3c8f4042438b4b1c90b82775d8b0525019a5d90d Author: Concedo <[email protected]> Date: Fri May 26 16:40:26 2023 +0800 integrated token probability viewer in debugmode commit 8b8f2f4cf50416bb546a0946daca5f424b056a03 Author: Concedo <[email protected]> Date: Thu May 25 14:49:30 2023 +0800 up ver to 1.25.1 commit e6eeb234f1d48044bbb75915724772f65d98dcb7 Merge: d2da155 ac7876a Author: Concedo <[email protected]> Date: Thu May 25 10:34:43 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # README.md commit d2da155661d62154b98d8bd017729b8d3c353e9b Author: Concedo <[email protected]> Date: Thu May 25 10:18:12 2023 +0800 
upgraded clblast commit 37a34deaa099c0517bb9e6caeacd17ed8f419c1b Author: Concedo <[email protected]> Date: Wed May 24 23:34:11 2023 +0800 added a second pyinstaller for my own use that uses a different python version. don't use this. commit bf482d1786aff3312f8bfaea89da457defd10828 Author: Concedo <[email protected]> Date: Wed May 24 22:21:01 2023 +0800 revert klite newline bug, trying to add win7 support commit 844f92688a1c84fed4dc7751749b5023e0045988 Author: Concedo <[email protected]> Date: Wed May 24 16:48:39 2023 +0800 subpattern fix commit d04b3bbe5e663afc51683f4f4e0b01f81cb861a7 Author: Concedo <[email protected]> Date: Wed May 24 15:04:17 2023 +0800 disable mmap when failsafe mode selected from GUI commit b314cbfb6045a1790891770e6526fddd63eb85a4 Author: Concedo <[email protected]> Date: Wed May 24 11:28:35 2023 +0800 updated lite to support variable streaming lengths commit c97e10c50c287f55dd801042f94637321207c569 Merge: abb9ad7 7d87381 Author: Concedo <[email protected]> Date: Wed May 24 00:36:30 2023 +0800 Merge branch 'master' into concedo_experimental commit abb9ad789c06b1440dd4888aac63c862b4b3f674 Author: Concedo <[email protected]> Date: Wed May 24 00:20:43 2023 +0800 fixed other arch commit 0c0009e4b405177d854bc9abd3dca2976549b9cb Author: Concedo <[email protected]> Date: Tue May 23 23:18:52 2023 +0800 updated lite commit 355007b0194e19ab0aa85b3157fe824d1d2dee7b Author: Concedo <[email protected]> Date: Tue May 23 21:52:26 2023 +0800 added sampler seed commit cd4012c3ed7f9020cf4bae70a7d85f20334aeb33 Author: Concedo <[email protected]> Date: Tue May 23 21:31:42 2023 +0800 minor fixes to debug logging, fixed a typo, added a new failsafe mode commit 5bf9784381ebff31c97655cfe0cc4d5c9e47b803 Merge: 7894e85 2e6cd4b Author: Concedo <[email protected]> Date: Tue May 23 18:19:16 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # ggml-opencl.cpp # llama.cpp commit 7894e85788d43ed23fb0ee897c0aead89819b200 
Author: Concedo <[email protected]> Date: Mon May 22 21:54:24 2023 +0800 fixed a bug in previous klite commit a05da31fe7f4615d7f251792e866c21ffea23f92 Author: Concedo <[email protected]> Date: Mon May 22 20:58:54 2023 +0800 updated embedded lite commit e20e302e87a76cd884cc9f971ecfc13123206cb3 Merge: b9f06a7 7e4ea5b Author: Concedo <[email protected]> Date: Mon May 22 17:05:34 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile commit b9f06a7670a995f9fea98dd79cec290f45ac5469 Author: Concedo <[email protected]> Date: Mon May 22 16:48:55 2023 +0800 mavx only for windows by default, let them eat march native. commit 981d5ba866a7250ab9dfd16ebb4a1bf9724004a6 Merge: 169a26d 18e9dd8 Author: Concedo <[email protected]> Date: Mon May 22 16:16:48 2023 +0800 Merge remote-tracking branch 'occam/opencl-dev' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CMakeLists.txt # Makefile # README.md # ggml-opencl.cpp # llama.cpp # otherarch/ggml_v2-opencl-legacy.c commit 169a26d15fe7d56bdcddfa36f29cdd616ea35284 Author: Concedo <[email protected]> Date: Mon May 22 13:53:10 2023 +0800 removed unused build targets commit 587308a202024c81171e34e69c1936d490c36e3e Author: Concedo <[email protected]> Date: Mon May 22 12:18:42 2023 +0800 fixed some build errors on linux, changed icon resolution, added more error printing commit fea84c3cf5698ee6d65c9160b214bec494249c60 Author: Concedo <[email protected]> Date: Sun May 21 22:41:33 2023 +0800 fix for stupid msvc compiler commit 60e0c678746c44fb94da5924bac240926d6d0f6e Author: Concedo <[email protected]> Date: Sun May 21 21:13:17 2023 +0800 fix compile errors on cuda commit 33528f5b1d6513feb9a36423b7e7499f3d393f44 Author: Concedo <[email protected]> Date: Sun May 21 21:03:36 2023 +0800 fix for cublas commit 994be9a4db03e61b3e2d594b9d181589e1d13bb9 Author: Concedo <[email protected]> Date: Sun May 21 21:02:21 2023 +0800 fix for cublas commit 
24127ebf987dff67d7e1604bf768096f0364c30f Author: Concedo <[email protected]> Date: Sun May 21 17:29:00 2023 +0800 updated lite, fixed some encoding issues commit 18e9dd87da905b8dcb722d9b190998aebdfd8847 Author: 0cc4m <[email protected]> Date: Sun May 21 08:34:17 2023 +0200 Explicitely set GEMM type commit b6b39960c0ddb8d5289defff82a25dd78603f851 Author: 0cc4m <[email protected]> Date: Sun May 21 08:17:17 2023 +0200 Use compile args for preprocessing constants commit a1657d02330f37ba26c849e1d55ec559dd08f88f Author: 0cc4m <[email protected]> Date: Fri May 19 21:18:57 2023 +0200 Add OpenCL compile options commit e41a7ae40c338419177d2fc165cb2e96545111c8 Author: 0cc4m <[email protected]> Date: Thu May 18 08:05:19 2023 +0200 Fix convert_row_f16 kernel issue commit 457eff920e5a687883fb87e9a2a7dc8b29268895 Author: 0cc4m <[email protected]> Date: Thu May 18 07:35:40 2023 +0200 Deduplicate dequant kernels commit 42e1a2ba3de3c30f38bb8f72f73ce0f07b2d675a Author: 0cc4m <[email protected]> Date: Tue May 16 18:49:49 2023 +0200 Fix tensor load to device Co-authored-by: Johannes Gäßler <[email protected]> commit cda2d488f994da00bd4b069b897a8f62f82a5637 Author: 0cc4m <[email protected]> Date: Tue May 16 13:05:33 2023 +0200 Fix error in convert f16 to f32 kernel call commit 915d0d11689db2d23bc9dc41ccf94fc9f6f4c70a Author: 0cc4m <[email protected]> Date: Tue May 16 07:42:01 2023 +0200 Generate dequant_mul_mat kernels from simple templates commit 1968380373b6398ce76b6669b21ad5971361c33c Author: 0cc4m <[email protected]> Date: Mon May 15 19:51:23 2023 +0200 Fix CMakeLists.txt commit cb588e2aa46d3f17a1cfcd11b28c6cef2a9c1a81 Author: 0cc4m <[email protected]> Date: Sun May 14 22:19:54 2023 +0200 Add remaining dequant_mul_mat functions commit 8c7a7cea2eac4852e3d3fe0deeeafb688c013feb Author: 0cc4m <[email protected]> Date: Sun May 14 21:26:07 2023 +0200 Fix dequant_mul_mat kernel commit 5f610c90bfc12ea66c068608689203f759a932de Author: 0cc4m <[email protected]> Date: Sun May 14 21:14:05 2023 
+0200 Fix bugs in dequant_mul_mat code commit 17e53dbb7ec0badae9b32244041f08f4060ff8d1 Author: 0cc4m <[email protected]> Date: Sun May 14 17:01:46 2023 +0200 Refactor OpenCL code to work more like the CUDA code, add missing functions commit a7e3bee4cc5e101721f1aaf160590acd8af9f2c6 Author: 0cc4m <[email protected]> Date: Sun May 14 17:00:37 2023 +0200 Move back to C++ for OpenCL commit 75e4548821e9f0f0423626c590672326361f8452 Author: Concedo <[email protected]> Date: Sun May 21 01:44:47 2023 +0800 missed out gpt2 commit 2ead735f0871c18f1ca2414728d22fbd2f8b5a6c Author: Concedo <[email protected]> Date: Sun May 21 01:29:20 2023 +0800 initial integration completed commit d6123f738a14c274514631a74310dd5eab881e4a Merge: d418146 ea60007 Author: Concedo <[email protected]> Date: Sun May 21 01:27:27 2023 +0800 Merge commit 'ea600071cb005267e9e8f2629c1e406dd5fde083' into concedo_experimental # Conflicts: # examples/quantize/quantize.cpp commit d4181465358093d3db5d8cb8f7325ad750efc9ae Author: Concedo <[email protected]> Date: Sun May 21 00:53:20 2023 +0800 fixed a token decoding bug commit d1824f1e88bff79b54940d12a259fe2516b66e6b Merge: 5032e0f d2c59b8 Author: Concedo <[email protected]> Date: Sun May 21 00:30:06 2023 +0800 Merge branch 'master' into concedo_experimental commit 5032e0fd6440bf4cf029db14eaf1d10d3d6f42a4 Author: Concedo <[email protected]> Date: Sun May 21 00:29:50 2023 +0800 trying to fix ggjt v3 commit c048bcfec4f2f44305a9447c01561de333b37703 Author: Concedo <[email protected]> Date: Sat May 20 16:47:44 2023 +0800 remove old filever checks (+7 squashed commit) Squashed commit: [b72627a] new format not working [e568870] old ver works [7053b77] compile errors fixed, fixing linkers [4ae8889] add new ver [ff82dfd] file format checks [25b8aa8] refactoring type names [931063b] still merging commit 417302b226d57cc1dd506bdab6ff06072a54b270 Merge: bd1aa72 fb638fa Author: Concedo <[email protected]> Date: Sat May 20 16:16:48 2023 +0800 Merge remote-tracking branch 
'occam/opencl-dev' into concedo_experimental # Conflicts: # ggml-opencl.cpp commit bd1aa7212c58aa4e639e9b7cc8d22e9b794efb62 Author: Concedo <[email protected]> Date: Sat May 20 16:15:06 2023 +0800 wip2 commit d6f6b71478f559f837311aca5b1baa46335c5fc7 Author: Concedo <[email protected]> Date: Sat May 20 16:08:54 2023 +0800 wip commit a0cfed1e3052d610998fd9cab6ad5ec859de04e7 Author: Concedo <[email protected]> Date: Sat May 20 15:58:33 2023 +0800 still merging in process commit a8958f6b7612172972b5031f2c81e42df728f2ad Merge: 4e86a07 2d5db48 Author: Concedo <[email protected]> Date: Sat May 20 15:12:31 2023 +0800 merging, do not use commit fb638fa817c0ffef1e4f01b53525d9fa98bfa949 Merge: 0291469 2d5db48 Author: 0cc4m <[email protected]> Date: Sat May 20 07:55:02 2023 +0200 Merge remote-tracking branch 'origin/master' into opencl-dev commit 02914698f0c7083ecc69f344a356bc54ec405e61 Author: 0cc4m <[email protected]> Date: Sat May 20 07:45:56 2023 +0200 Update Q4_0, Q4_1 and Q8_0 to use half instead of float commit 285f8f990b412435405abc5e03140866eb00f658 Author: 0cc4m <[email protected]> Date: Sat May 20 07:26:38 2023 +0200 Explicitely set CLBlast GEMM type commit 4e86a07e57b6f61b983c1b751875c1263fa8dcc6 Author: Concedo <[email protected]> Date: Sat May 20 12:48:28 2023 +0800 wip cleanup before big merge commit 010b2753d909de4a97005737f76cf6bc6fca31bc Merge: 1225fab 6986c78 Author: Concedo <[email protected]> Date: Sat May 20 11:30:51 2023 +0800 Merge commit '6986c7835adc13ba3f9d933b95671bb1f3984dc6' into concedo_experimental # Conflicts: # README.md commit 1225fab2ecb342c264f39c03142beb9c4b3e06c0 Author: Concedo <39025047+LostRuins@user…
commit b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf Author: Howard Su <[email protected]> Date: Thu Jun 29 21:15:15 2023 +0800 Use unsigned for random seed (#2006) * Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <[email protected]> commit 96a712ca1b7f427e3bd7ffc0c70b2105cfc7fbf1 Author: LostRuins <[email protected]> Date: Thu Jun 29 11:56:43 2023 +0800 Porting the improved K-Quant CUDA kernels to OpenCL (#1966) * Added broken new q4k quant * xx + ib0 * Fix q2_k fast kernel * Use preprocessor for QK_K * Add q6_k fast matmul kernel * ported q3k speedup successfully * ported q2k and q5k speedups * remove old dot kernels and template * fixed global const struct types * fixing address spaces * fixed string too long CI issue --------- Co-authored-by: 0cc4m <[email protected]> commit d3494bb86bf7ad5b0b60aae0220ea576f273b5c0 Author: m3ndax <[email protected]> Date: Wed Jun 28 20:39:08 2023 +0200 llama : replacing auto &kv with const auto &kv (#2041) * Replacing auto &kv with const auto &kv * Create codacy.yml * Delete codacy.yml commit 5b351e94d041742cd50ffcf2d44718d63bab398a Author: Salvador E. Tropea <[email protected]> Date: Wed Jun 28 14:27:31 2023 -0300 cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028) - Not used commit 6432aabb6dc887436e4d57414b63116189c3b13b Author: Salvador E. Tropea <[email protected]> Date: Wed Jun 28 14:26:26 2023 -0300 cuda : fix missing const qualifier in casts (#2027) commit b922bc351b69770cec2d35d2aa50fa052b95ca93 Author: Howard Su <[email protected]> Date: Wed Jun 28 10:13:02 2023 -0700 llama : remove shards weight file support (#2000) * Remove multiple shards * Remove multiple file loaders * Remove llama_load_tensor_shard class * Simplify load logic * Remove dead code guess_n_parts function * Remove vocab_only from constructor of llama_model_loader * Remove alignment_prevents_mmap which is not more needed. 
* Remove useless check commit 7f9753fa1263c4eded9a3de19778562f0e1093d7 Author: Johannes Gäßler <[email protected]> Date: Wed Jun 28 18:35:54 2023 +0200 CUDA GPU acceleration for LoRAs + f16 models (#1970) commit cfa0750bc9dbc2d957a91b8ed09ab0035d8f3d4e Author: ningshanwutuobang <[email protected]> Date: Wed Jun 28 23:53:37 2023 +0800 llama : support input embeddings directly (#1910) * add interface for float input * fixed inpL shape and type * add examples of input floats * add test example for embd input * fixed sampling * add free for context * fixed add end condition for generating * add examples for llava.py * add READMD for llava.py * add READMD for llava.py * add example of PandaGPT * refactor the interface and fixed the styles * add cmake build for embd-input * add cmake build for embd-input * Add MiniGPT-4 example * change the order of the args of llama_eval_internal * fix ci error commit 9d23589d638dc74577d5ff880e6d4248b795f12e Author: Erik Scholz <[email protected]> Date: Tue Jun 27 19:06:33 2023 +0200 fix pthreads setaffinity usage on android (#2020) commit 0be54f75a6c3e9a09ea71bdfcdabf9a996a0549b Author: Howard Su <[email protected]> Date: Tue Jun 27 13:07:13 2023 +0800 baby-llama : fix build after ggml_rope change (#2016) commit 181e8d975528a4e27eabb8ae6e9865f9ceae4b37 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 27 00:37:13 2023 +0300 llama : fix rope usage after ChatGLM change commit d9779021bd59ed96daae75e820a5ac5da47ca8ff Author: Georgi Gerganov <[email protected]> Date: Tue Jun 27 00:06:51 2023 +0300 ggml : add support for ChatGLM RoPE commit d38e45157862b58a1824387e64860d68ca3533a7 Author: Roman Parykin <[email protected]> Date: Mon Jun 26 22:47:59 2023 +0300 readme : add Scala 3 bindings repo (#2010) commit eaa6ca5a61b8c9501df9ebe3d264f45b75a5f8aa Author: David Yang <[email protected]> Date: Tue Jun 27 03:45:32 2023 +0800 ggml : increase max tensor name + clean up compiler warnings in train-text (#1988) * Clean up compiler warnings 
in train-text Some brackets to disambiguate order of operations * Increase GGML_MAX_NAME Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues commit aa777abbb73655c4e1e9237b7c0ad66745e8e48c Author: Gustavo Rocha Dias <[email protected]> Date: Mon Jun 26 16:34:45 2023 -0300 readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007) * docs - Alternative way to build at Android, with CLBlast. * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux. * doc- fix typo commit c824d2e368d193d9f564ff29880a51cda9f90527 Author: Georgi Gerganov <[email protected]> Date: Mon Jun 26 21:03:59 2023 +0300 ggml : avoid conv 2d kernel round up commit b853d456018b10820686362af41b2f2f75f1eec6 Author: zrm <[email protected]> Date: Mon Jun 26 13:57:59 2023 -0400 ggml : add NUMA support (#1556) * detect NUMA systems and pin work threads to nodes (linux) * disable mmap prefetch/readahead for NUMA systems * avoid sending finalize op to thread pool if it does nothing * silence robot * fix args * make --numa a param * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement * lower synchronization overhead * statically allocate * move numa state to g_state * add description for --numa * ggml : minor style changes * ggml : minor style + try fix sanitizer build * llama : allow to initialize backend with NUMA support * llama : avoid ggml include in llama-util.h * ggml : style / formatting * ggml : fix handling of ops with n_threads > n_tasks > 1 * server : utilize numa parameter --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 9225baef71407d799a6f7f563b77fd7f82791416 Author: Georgi Gerganov <[email protected]> Date: Mon Jun 26 20:10:52 2023 +0300 k-quants : fix indentation commit a84ab1da8dc6a59a5b67420ae1322f09503ffc72 Author: katsu560 <[email protected]> Date: Tue Jun 27 01:47:02 2023 +0900 tests 
: fix quantize perf (#1990) * fix test quantize perf * avoid the global state commit 5743ca80928d8410754ec64a5673d5c2dd6cfbb7 Author: katsu560 <[email protected]> Date: Tue Jun 27 01:46:07 2023 +0900 k-quants : add AVX support to dot functions (#1916) * k_quants : add AVX support * k_quants : apply review comments commit 412c60e4739367144e51e59add5dc7749d084115 Author: Georgi Gerganov <[email protected]> Date: Mon Jun 26 19:45:09 2023 +0300 readme : add link to new k-quants for visibility commit 6769e944c727c63612dcafbef52009d21ae00fff Author: Kawrakow <[email protected]> Date: Mon Jun 26 19:43:07 2023 +0300 k-quants : support for super-block size of 64 (#2001) * k_quants: WIP super-blocks with 64 weights * k_quants: WIP super-blocks with 64 weights Q6_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q4_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower than the scalar implementation) * k_quants: WIP super-blocks with 64 weights Q3_K scalar and AVX2 works. * k_quants: WIP super-blocks with 64 weights Q5_K scalar and AVX2 works, and with that all k_quants are done on AVX2 and scalar * k_quants: WIP super-blocks with 64 weights Q6_K working on CUDA. Cannot make it run quite as fast as with super-blocks with 256 weights: 8% slower on 4080, 20% slower on the 1660 (but there we fit 1 less layer on the GPU because of the larger model size), so some fraction of these 20% is due to that. * k_quants: WIP super-blocks with 64 weights Q4_K working on CUDA. ~10% slower on GTX-1660, 16% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q2_K working on CUDA. ~3% slower on GTX-1660, 10% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q3_K working on CUDA. * k_quants: WIP super-blocks with 64 weights Q5_K working on CUDA, and with this CUDA is done. 
* k_quants: WIP super-blocks with 64 weights Q6_K working on ARM_NEON * k_quants: WIP super-blocks with 64 weights Q4_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q2_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q3_K working on ARM_NEON, but quite a bit slower than 256 weights. * k_quants: WIP super-blocks with 64 weights Q5_K working on ARM_NEON, but quite a bit slower than 256 weights. With that, we have full support for ARM_NEON, although performance is not quite there. * k_quants: WIP super-blocks with 64 weights Slightly more efficient Q3_K and Q5_K * k_quants: WIP super-blocks with 64 weights Another small improvement for Q3_K and Q5_K on ARM_NEON * k_quants: WIP super-blocks with 64 weights Yet another speedup for Q5_K on ARM_NEON. We are now within 10% of the QK_K = 256 version. * k_quants: WIP super-blocks with 64 weights * We are able to pass preprocessor macros to the Metal compiler * Q6_K works and is actually slightly more efficient than the QK_K = 256 version (25.2 ms vs 25.8 ms) * k_quants: WIP super-blocks with 64 weights Q4_K works on Metal and is actually slightly faster than QK_K = 256 (21.95 ms vs 24.0 ms). * k_quants: WIP super-blocks with 64 weights Q2_K works on Metal and is very slightly faster than QK_K = 256 (23.8 ms vs 24.2 ms). * k_quants: WIP super-blocks with 64 weights Q3_K works on Metal and is slightly faster than QK_K = 256 (26.6 ms vs 28.3 ms). * k_quants: WIP super-blocks with 64 weights Q5_K works on Metal and is slightly faster than QK_K = 256 (23.7 ms vs 26.3 ms). * k_quants: call them _K, not _k, also on Metal * k_quants: correctly define QK_K in llama.cpp * Fixed bug in q4_K quantization added with the 64-block addition * Simplify via lambda * k_quants: switch Q3_K to 4-bit scales when QK_K = 64 Otherwise there isn't much benefit from this quantization type. 
There is some very slight loss in accuracy, but we reduce size by ~7%. E.g., for OpenLLaMA-3B, Q3_K_S perplexity is 8.6131 with 8-bit scales and 8.6352 with 4-bit, while file size decreases from 1.53G to 1.44G. * k_quants: switch Q4_K to 4-bit scales when QK_K = 64 Here the loss in accuracy is greater than for Q3_K, but the Q4_K points still move further to the left on the perplexity vs size curve. * k_quants: forgot to add the Metal changes in last commit * k_quants: change Q5_K to be type 0 when QK_K = 64 Still needs AVX2 implementation * k_quants: AVX2 implementation for new 64-weight Q5_K * k_quants: 10% faster ARM_NEON Q5_K dot product * k_quants: fixed issue caused by merging with master --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit cbebf61ca7584e9709265395f0127ae7fc0f1882 Author: Howard Su <[email protected]> Date: Mon Jun 26 23:15:47 2023 +0800 Fix assert when free invalid cuda pointer (#2005) Fix assert via initializing extra structure always. CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument commit 447ccbe8c39332fcdd0d98a041b6e2ff6f06219d Author: Georgi Gerganov <[email protected]> Date: Sun Jun 25 16:08:12 2023 +0300 readme : add new roadmap + manifesto commit bd34cdde38f8fd661890ddd5f57ca30bf279877b Author: Georgi Gerganov <[email protected]> Date: Sun Jun 25 14:25:08 2023 +0300 ggml : sync latest ggml (custom operators) commit c2a08f87b8d180115d04b8688f383d1b2761b16d Author: anon998 <[email protected]> Date: Sun Jun 25 08:48:36 2023 +0000 fix server sampling: top k sampler first (#1977) Co-authored-by: anon <[email protected]> commit 66a2555ba6cab954c56d653b29c27bfbbacfbfb1 Author: Georgi Gerganov <[email protected]> Date: Sun Jun 25 09:07:03 2023 +0300 readme : add Azure CI discussion link commit e65ca7e14ac76c4046091da39d41a9017abaa9b3 Author: sjinzh <[email protected]> Date: Sun Jun 25 13:45:44 2023 +0800 zig : upgrade build system support (#1981) * upgrade zig build system support * zig : add new line at the 
end of the file --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 5ec8dd5a3c6a9a109351d2257bb9d53869bd0a94 Author: Robyn <[email protected]> Date: Sun Jun 25 04:10:29 2023 +1000 #1869 Fix null reference errors when training from scratch with CUDA (#1907) * #1869 Fix null reference errors when training from scratch with CUDA build Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly. * ggml : do not dereference src0 if NULL --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 65bdd52a867539691007f85c5508146d507f72c1 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 24 19:40:18 2023 +0300 tests : sync test-grad0 from ggml commit fdd18609113862dc6eb34dfc44a093d54c59ff1f Author: Rowan Hart <[email protected]> Date: Sat Jun 24 04:07:08 2023 -0700 flake : fix ggml-metal.metal path and run nixfmt (#1974) commit c943d823c14cef33092205ca3944de6fdf7abf99 Author: AN Long <[email protected]> Date: Sat Jun 24 19:02:06 2023 +0800 convert : fix invalid params in write_vocab_only (#1975) commit f2c754e1c38936fdde74e4848ac468a696eb73c6 Author: slaren <[email protected]> Date: Sat Jun 24 12:57:18 2023 +0200 ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978) * Improve ggml_graph_dump_dot, add ggml_format_name * add more automatic names to view ops * fix name of copies commit 11da1a85cd69af84b5861134738c7e9e20907470 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 24 13:38:18 2023 +0300 readme : fix whitespaces commit 235b610d650cbfed6dbd5d671f750d35fc18cd7d Author: Alberto <[email protected]> Date: Sat Jun 24 12:32:13 2023 +0200 readme : fixed termux instructions (#1973) commit b061ba9e2a7a2c335a200df8c11aed5e31e4ccbb Author: Alex Renda <[email protected]> Date: Sat Jun 24 03:15:01 2023 -0700 llama : fix top-p sampling to match the canonical definition (#1953) * Fix top-p sampling to match the standard definition (smallest set that has probability mass 
at least p, not largest set with probability mass less than p) * top-p: correct gt to gte * add test for correct top-p behavior commit 527b6fba1d237befb324fd846bda7418c0fa394d Author: Didzis Gosko <[email protected]> Date: Sat Jun 24 11:47:58 2023 +0300 llama : make model stateless and context stateful (llama_state) (#1797) * llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <[email protected]> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <[email protected]> commit d7b7484f74d486f77feb4c0b7af7e1718ed91651 Author: eiery <[email protected]> Date: Fri Jun 23 04:38:01 2023 -0400 Add OpenLLaMA instructions to the README (#1954) * add openllama to readme commit 7487137227eb32ed9b12156338b865cb29b2dfd1 Author: Erik Scholz <[email protected]> Date: Thu Jun 22 14:20:47 2023 +0200 rework convert.py to read hyper-parameters from config.json (#1958) * Read hyper-parameters from HuggingFace-transformer config.json, if they exist, and fall back to guessing, like before otherwise. This allows converting open_llama 3B and other non-standard model designs. 
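The top-p fix (#1953) listed above turns on one comparison: the kept set is the smallest group of highest-probability candidates whose cumulative mass is at least p. A minimal sketch of that rule follows. `top_p_keep` and its `min_keep` parameter are hypothetical names chosen for illustration, not the llama.cpp API, and the sketch assumes the input probabilities are already normalized.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not llama.cpp's function): returns how many of the
// highest-probability candidates to keep so that their cumulative mass is
// at least p. Assumes probs is non-empty and sums to roughly 1.
static size_t top_p_keep(std::vector<float> probs, float p, size_t min_keep = 1) {
    assert(!probs.empty());
    // Sort a local copy in descending order of probability.
    std::sort(probs.begin(), probs.end(), std::greater<float>());

    float cum = 0.0f;
    for (size_t i = 0; i < probs.size(); ++i) {
        cum += probs[i];
        // ">=" rather than ">": a set whose mass equals p exactly qualifies.
        if (cum >= p && i + 1 >= min_keep) {
            return i + 1;
        }
    }
    // The threshold was never reached (e.g. p > 1): keep everything.
    return probs.size();
}
```

With probabilities 0.5, 0.25, 0.125, 0.125 and p = 0.75, the first two candidates already reach the threshold, so only those two are kept; a strict "greater than" comparison would have kept three.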
commit bbca06e26949686d61a5126332680ba3cccf235c Author: Johannes Gäßler <[email protected]> Date: Wed Jun 21 23:49:25 2023 +0200 cmake: revert CUDA arch default to 52, 61 if f16 (#1959) commit fb98254f99d769fcbbf20966ef386abdb48ef601 Author: Rahul Vivek Nair <[email protected]> Date: Thu Jun 22 03:18:43 2023 +0530 Fix typo in README.md (#1961) commit 049aa16b8c5c6d086246e4e6b9feb18de4fbd663 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 20 19:05:54 2023 +0300 readme : add link to p1 commit 2322ec223a21625dfe9bd73ee677444a98a24ac9 Author: Xiake Sun <[email protected]> Date: Tue Jun 20 05:42:40 2023 -0700 Fix typo (#1949) commit aacdbd40562684665b6f7b8ba6695b7a2088bbb0 Author: Ettore Di Giacinto <[email protected]> Date: Tue Jun 20 03:24:39 2023 +0200 llama : fix params struct alignment (#1936) * Workaround struct misalignment during value-copy Signed-off-by: mudler <[email protected]> * Move booleans at the bottom of the structure Signed-off-by: mudler <[email protected]> * Add comment Signed-off-by: mudler <[email protected]> --------- Signed-off-by: mudler <[email protected]> commit 20568fe60f00155fa25e92eb3a7f6b911d557967 Author: Henri Vasserman <[email protected]> Date: Tue Jun 20 01:12:39 2023 +0300 [Fix] Reenable server embedding endpoint (#1937) * Add back embedding feature * Update README commit 18b35625c3c19c64b7818a12460ba5ddb006dfdc Author: Georgi Gerganov <[email protected]> Date: Mon Jun 19 20:43:30 2023 +0300 ggml : fix bug in LBFGS optimizer (found by ggml tests) commit ba4e85a8339b9dd7cdffad31838235f2fe45a8ea Author: l3utterfly <[email protected]> Date: Mon Jun 19 23:20:06 2023 +0800 llama : use aligned memory during ggml_init call from loading saved sessions (#1934) * fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions * - removed commented out old code from fix - updated another instance of same issue below original commit 23fc5c219a9aebd57c8af3fac454062cc4622980 Author: Georgi 
Gerganov <[email protected]> Date: Mon Jun 19 18:18:34 2023 +0300 cmake : fix trailing whitespaces commit cb40dfca694b5cb849837548fd69932117c78362 Author: Kawrakow <[email protected]> Date: Mon Jun 19 18:17:03 2023 +0300 llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932) * Only use Q6_K for output weights if tensor size is multiple of 256 * Fixed copy/paste mistake --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit ca7c3f4da5d144d4cd1dd44903552e6ba49b8ec8 Author: Kawrakow <[email protected]> Date: Mon Jun 19 18:14:09 2023 +0300 cuda : faster k-quants on older GPUs (#1930) * k_quants: hopefully much faster Q4_K on older GPUs On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 65.5 ms/tok to 41.5 ms/tok! * k_quants: hopefully much faster Q3_K on older GPUs On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 60.3 ms/tok to 41.0 ms/tok! * k_quants: faster Q2_K on older GPUs It looks like I didn't need to change anything compared to what we already had, so this is just adding clarifying comments. But I now measure 36.3 ms/tok on the GTX-1660, instead of the 47.2 ms/tok that I have written in the faster k-quants PR. * k_quants: faster Q5_K on older GPUs 68.5 ms/tok -> 62.0 ms/tok on GTX-1660. For some reason the same access pattern that leads to such resounding success for Q2_K to Q4_K did not work at all for Q5_K. It is also more difficult to measure because for Q5_K_S we only have 32 layers on the GTX-1660, so output, tok embeddings and kv cache are done on the CPU. 
--------- Co-authored-by: Iwan Kawrakow <[email protected]> commit b97ca431db35ec96a339a721acb1219c1dd78bed Author: Georgi Gerganov <[email protected]> Date: Mon Jun 19 18:12:33 2023 +0300 ggml : sync latest ggml repo (#1924) * ggml : sync latest ggml repo * ggml : remove unused comments * ggml : asserts commit 1e3abfcef073e73c2b31e8570cb06c5cb2fd1f55 Author: Howard Su <[email protected]> Date: Mon Jun 19 23:10:37 2023 +0800 cmake : fix build shared ggml when CUDA is enabled (#1929) Co-authored-by: Georgi Gerganov <[email protected]> commit 16b9cd193965769089881bb8ec012fccca7b37b6 Author: Johannes Gäßler <[email protected]> Date: Mon Jun 19 10:23:56 2023 +0200 Convert vector to f16 for dequantize mul mat vec (#1913) * Convert vector to f16 for dmmv * compile option * Added compilation option description to README * Changed cmake CUDA_ARCHITECTURES from "OFF" to "native" commit b24c3049d96557c24782e4d32feaae65f47277af Author: Johannes Gäßler <[email protected]> Date: Sun Jun 18 17:41:26 2023 +0200 Added tokens per second to info prints (#1928) commit 0ede372a51fd8160688e01b587582666c14e94e5 Author: Johannes Gäßler <[email protected]> Date: Sun Jun 18 16:07:09 2023 +0200 Fixed incorrectly applying RMS norm twice (#1925) commit 8596af427722775f0df4a7c90b9af067ba90d4ef Author: l3utterfly <[email protected]> Date: Sun Jun 18 19:19:16 2023 +0800 ggml : fix bug in ggml_compute_forward_add_q_f32 (#1918) commit e1886cf4fe0d0f31661dda52a4a9f34bd9b9009a Author: Mike <[email protected]> Date: Sun Jun 18 16:28:26 2023 +0800 readme : update Android build instructions (#1922) Add steps for using termux on android devices to prevent common errors. 
commit 8ab8ba62eb27cc340be2edf3418e051b1d967416 Author: Kawrakow <[email protected]> Date: Sun Jun 18 11:13:43 2023 +0300 llama : prevent usage of k-quants when tensor size is not a multiple of 256 (#1921) * Fix examples/metal * k-quants: prevent usage when tensor size is not divisible by 256 --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 90cc59d6ab1363a5c69c60c4b94db647d3a54a18 Author: Kawrakow <[email protected]> Date: Sun Jun 18 10:52:10 2023 +0300 examples : fix examples/metal (#1920) Co-authored-by: Iwan Kawrakow <[email protected]> commit ce2c7d72e2d06988b5ddec6811ab923254542077 Author: Georgi Gerganov <[email protected]> Date: Sun Jun 18 09:09:47 2023 +0300 metal : handle buffers larger than device's maxBufferLength (#1826) * metal : handle buffers larger than device's maxBufferLength * metal : print more verbose device info + handle errors * metal : fix prints for overlapping views * metal : minimize view overlap to try to utilize device memory better commit 57cd69460f736031a3fc54af1e97c03f80128478 Author: Howard Su <[email protected]> Date: Sun Jun 18 12:29:47 2023 +0800 cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917) commit b2416493ab3ab21686d47c96669da6d6c6af08a4 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 17 20:55:03 2023 +0300 make : do not print help for simple example commit 4f9c43e3bd488b7561119785485e1155dba338d7 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 17 20:24:11 2023 +0300 minor : warning fixes commit 2c9380dd2f77e41149340f3ecb09764d793b16db Author: Johannes Gäßler <[email protected]> Date: Sat Jun 17 19:15:02 2023 +0200 Only one CUDA stream per device for async compute (#1898) commit 051e1b0e6a6e3aee7d989b47760980e6fda5861c Author: Georgi Gerganov <[email protected]> Date: Sat Jun 17 19:30:22 2023 +0300 llama : fix kv_cache `n` init (close #1903) commit 86c7571864ff331f8cdb9e092f3abeb123729a56 Author: DaniAndTheWeb <[email protected]> Date: Sat Jun 17 18:17:22 2023 +0200 make : 
update for latest Arch (#1701) With the upcoming change to the openblas package in arch the Makefile workaround is no longer needed. commit 3d59ec5935ea1d33e9d51060a8dd737169b9b89b Author: Howard Su <[email protected]> Date: Sat Jun 17 23:46:15 2023 +0800 ggml : fix warnings under MSVC (#1908) commit 0711a5f6dce7f04c2a791b14bc47f7d4cb545408 Author: Aaron Miller <[email protected]> Date: Sat Jun 17 07:37:49 2023 -0700 metal : add norm, cpy f16->f16, alibi kernels (#1823) commit fc45a81bc642b9ef33d9004f2b363d558438a6c9 Author: Faez Shakil <[email protected]> Date: Sat Jun 17 17:13:05 2023 +0500 exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc (#1863) commit 794db3e7b982fee37e3995db9c3a216a57ff65e3 Author: Randall Fitzgerald <[email protected]> Date: Sat Jun 17 07:53:04 2023 -0400 Server Example Refactor and Improvements (#1570) A major rewrite for the server example. Note that if you have built something on the previous server API, it will probably be incompatible. Check out the examples for how a typical chat app could work. This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing. 
Summary of the changes: - adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos - applies missing top k sampler - removes interactive mode/terminal-like behavior, removes exclude parameter - moves threads and batch size to server command-line parameters - adds LoRA loading and matches command line parameters with main example - fixes stopping on EOS token and with the specified token amount with n_predict - adds server timeouts, host, and port settings - adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text - sets defaults for unspecified parameters between requests - removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming - adds CORS headers to responses - adds request logging, exception printing and optional verbose logging - adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string - adds printing an error when it can't bind to the host/port specified - fixes multi-byte character handling and replaces invalid UTF-8 characters on responses - prints timing and build info on startup - adds logit bias to request parameters - removes embedding mode - updates documentation; adds streaming Node.js and Bash examples - fixes code formatting - sets server threads to 1 since the current global state doesn't work well with simultaneous requests - adds truncation of the input prompt and better context reset - removes token limit from the input prompt - significantly simplified the logic and removed a lot of variables --------- Co-authored-by: anon998 <[email protected]> Co-authored-by: Henri Vasserman <[email protected]> Co-authored-by: Felix Hellmann <[email protected]> Co-authored-by: Johannes Gäßler <[email protected]> Co-authored-by: Lesaun Harvey <[email 
protected]> commit 5ddf7ea1fb42bac21026de2f77e0f9c069b92234 Author: Jiří Podivín <[email protected]> Date: Sat Jun 17 12:32:48 2023 +0200 hooks : setting up flake8 and pre-commit hooks (#1681) Small, non-functional changes were made to non-compliant files. These include breaking up long lines, whitespace sanitation and unused import removal. Maximum line length in python files was set to a generous 125 chars, in order to minimize number of changes needed in scripts and general annoyance. The "txt" prompts directory is excluded from the checks as it may contain oddly formatted files and strings for a good reason. Signed-off-by: Jiri Podivin <[email protected]> commit bac19927c302737465a1deb14ac0943a221863e8 Author: Gustavo Rocha Dias <[email protected]> Date: Sat Jun 17 06:01:06 2023 -0300 readme : alternative way to build for Android with CLBlast. (#1828) commit b4c6f46f17b6e02f1cd55a81339e7e64f3aaa688 Author: Kerfuffle <[email protected]> Date: Sat Jun 17 01:49:42 2023 -0600 Allow cmake to build ggml as a library (#1896) * Allow cmake to build ggml as a library * A ggml_static library will be created * When BUILD_SHARED_LIBS is enabled, ggml_shared will also be built commit 92f20d9942c86daeb78637bdad7296a572f4da28 Author: David Yang <[email protected]> Date: Sat Jun 17 14:51:54 2023 +0800 train : get raw text instead of page with html (#1905) We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work. 
commit d411968e990c37f51328849c96a743dd78f3c3dd Author: 0cc4m <[email protected]> Date: Fri Jun 16 20:59:49 2023 +0200 opencl : support k-quants (#1836) * Porting q2_k kernel to OpenCL * Set global and local sizes for kernel calls for dequantizing k-quants * Added q6_k kernel * Fix q4_k opencl struct order * Replace uchar with uint8_t * Finish dequant kernels * Added OpenCL DMMV kernels * Fix q2_k, improve code * Fix q3_k * Shorten switch statements * Improve code formatting --------- Co-authored-by: Concedo <[email protected]> commit b41b4cad6f956b5f501db0711dd7007c32b5eee5 Author: SuperUserNameMan <[email protected]> Date: Fri Jun 16 20:58:09 2023 +0200 examples : add "simple" (#1840) * Create `simple.cpp` * minimalist example `CMakeLists.txt` * Update Makefile for minimalist example * remove 273: Trailing whitespace * removed trailing white spaces simple.cpp * typo and comments simple.cpp --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 13fe9d2d84f30cab613c960bf66ac83916006694 Author: Zenix <[email protected]> Date: Sat Jun 17 03:53:04 2023 +0900 cmake : add auto detection of BLAS_INCLUDE_DIRS (#1886) commit ac3b8869538c7fbdb48ff141d78c4dea091789f0 Author: Johannes Gäßler <[email protected]> Date: Fri Jun 16 20:25:51 2023 +0200 llama : fix embd when offloading non-repeating layers (#1891) commit 5b9ccaf104cc1054d4f8f17bc8a4b8dc949e5527 Author: FrankHB <[email protected]> Date: Sat Jun 17 02:25:01 2023 +0800 Fixed possible macro redefinition (#1892) MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined. 
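The macro-redefinition fix (#1892) above comes down to guarding the definition so it is skipped when the toolchain has already provided it. A minimal sketch of that pattern, under the assumption of a Windows build where `<windows.h>` would otherwise define `min` and `max` macros; the `_WIN32` check and the `clamp_int` helper are illustrative additions, not code from the commit.

```cpp
#include <algorithm>
#include <cassert>

// Define NOMINMAX only when it is not already defined (MinGW libstdc++ may
// define it unconditionally), which avoids a redefinition diagnostic.
#ifndef NOMINMAX
#define NOMINMAX
#endif

// Assumption for illustration: only pull in the Windows header on Windows.
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: with NOMINMAX in effect, std::min and std::max are
// not shadowed by the min/max macros from <windows.h>.
static int clamp_int(int v, int lo, int hi) {
    assert(lo <= hi);
    return std::max(lo, std::min(v, hi));
}
```

The guard is harmless on other platforms, so the same source compiles unchanged whether or not the macro was predefined.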
commit 9cbf50c041a525d781c7764f493a5443924e4e38 Author: Borislav Stanimirov <[email protected]> Date: Fri Jun 16 21:23:53 2023 +0300 build : fix and ignore MSVC warnings (#1889) commit 3d0112261042b356621e93db3fa4c6798a5d098f Author: Kawrakow <[email protected]> Date: Fri Jun 16 20:08:44 2023 +0300 CUDA : faster k-quant dot kernels (#1862) * cuda : faster k-quant dot kernels * Improve Q2_K dot kernel on older GPUs We now have a K_QUANTS_PER_ITERATION macro, which should be set to 1 on older and to 2 on newer GPUs. With this, we preserve the performance of the original PR on RTX-4080, and are faster compared to master on GTX-1660. * Improve Q6_K dot kernel on older GPUs Using the same K_QUANTS_PER_ITERATION macro as last commit, we preserve performance on RTX-4080 and speed up Q6_K on a GTX-1660. * Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile Allowed values are 1 or 2. 2 gives the best performance on modern GPUs and is set as default. On older GPUs 1 may work better. * PR comments --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 602c748863e15270d80d74aa2c3bf86ab8139e07 Author: Borislav Stanimirov <[email protected]> Date: Fri Jun 16 09:58:11 2023 +0300 gitignore : add several entries specific to Visual Studio (#1888) commit a09f9195be39afb4b023b646c0a6ec8a86915174 Author: Johannes Gäßler <[email protected]> Date: Thu Jun 15 21:49:08 2023 +0200 Fixed CUDA runtime version check (#1879) commit bed92756172d4514b23aaf9744cf8e2dc892fc7b Author: Georgi Gerganov <[email protected]> Date: Thu Jun 15 21:56:50 2023 +0300 cmake : remove whitespaces commit c36e81da62ebfe09a768201cc44fa8d712dd00ed Author: yangli2 <[email protected]> Date: Thu Jun 15 11:05:53 2023 -0700 examples : add chat-vicuna.sh (#1854) Co-authored-by: Yang Li <[email protected]> commit 3559433fecedf365e7aba2fe3d5f89d9abb817c1 Author: Igor Okulist <[email protected]> Date: Thu Jun 15 12:51:26 2023 -0500 cmake : set include path for OpenBlas (#1830) commit 
69b34a0e80300bfb3e996983ac3ea075f5526675 Author: Frederik Vogel <[email protected]> Date: Fri Jun 16 02:47:04 2023 +0900 swift : Package compile breaks due to ggml-metal.metal (#1831) * Ignore metal file in spm * Add ggml.h to spm public Headers --------- Co-authored-by: Vogel Frederik <[email protected]> commit cf267d1c71a781700698f8518e903239c3bcc929 Author: daboe01 <[email protected]> Date: Thu Jun 15 19:42:48 2023 +0200 make : add train-text-from-scratch (#1850) * make finetuning example accessible * fixed: target was in wrong line * fixed: name of executable was wrong * fixed: naming of binary * fixed: model path was wrong * fixed clean target * Update examples/train-text-from-scratch/README.md --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 9dda13e5e1f70bdfc25fbc0f0378f27c8b67e983 Author: Srinivas Billa <[email protected]> Date: Thu Jun 15 18:36:38 2023 +0100 readme : server compile flag (#1874) Explicitly include the server make instructions for C++ noobs like me ;) commit 37e257c48e350cf03c353c10d31e777f8d00123d Author: sandyiscool <[email protected]> Date: Thu Jun 15 23:06:06 2023 +0530 make : clean *.so files (#1857) commit 64cc19b4fe3df03bc20e520aa111c30cff3a655e Author: Howard Su <[email protected]> Date: Fri Jun 16 01:29:59 2023 +0800 Fix the validation of main device (#1872) commit 4bfcc855abdb2c9fcc3c5a84747974521909fa41 Author: Georgi Gerganov <[email protected]> Date: Thu Jun 15 20:29:48 2023 +0300 metal : parallel command buffer encoding (#1860) * metal : parallel command buffer encoding * metal : determine number of command buffers based on gf->n_threads commit 6b8312e7979b852f6b6ac9d29cd51fda16c17948 Author: Johannes Gäßler <[email protected]> Date: Thu Jun 15 19:06:46 2023 +0200 Better error when using both LoRA + GPU layers (#1861) commit 254a7a7a5ff4c874ff8488f1f5cbdd7e9c89d682 Author: Johannes Gäßler <[email protected]> Date: Wed Jun 14 19:47:19 2023 +0200 CUDA full GPU acceleration, KV cache in VRAM (#1827) * Fixed 
CUDA RoPE * ggml_cuda_mul_mat_vec_p021 * ggml_cuda_scale * ggml_cuda_diag_mask_inf * ggml_is_permuted * ggml_cuda_cpy * flatten rows for ggml_cuda_op * Added a --low-vram option * Fixed Windows performance * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM commit 92549202659fc23ba9fec5e688227d0da9b06b40 Author: 0xspringtime <[email protected]> Date: Tue Jun 13 15:37:54 2023 -0400 baby-llama : fix operator!= (#1821) * Update baby-llama.cpp Seems to be an error in the implementation of the operator!= function. It attempts to compare the this pointer (a llama_hparams_lora object) with the other pointer (a llama_hparams object) using memcmp. This can lead to incorrect results because the sizes of the objects being compared (sizeof(llama_hparams) and sizeof(llama_hparams_lora)) are different, should now be able to compare two llama_hparams_lora objects for inequality. * Update baby-llama.cpp * Update baby-llama.cpp commit e32089b2c20b1b87b22912f4a8b93fe01647d5b9 Author: xaedes <[email protected]> Date: Tue Jun 13 21:04:40 2023 +0200 train : improved training-from-scratch example (#1652) * add python wrapper https://gist.github.com/abetlen/2b90e5f153f6efd00931d098de5c73ce * fix decoding error. adds errors=ignore parameter * add python bindings for functions to get and set the whole llama state (rng, logits, embedding and kv_cache) * update python bindings * add text generating baby-llama from scratch example * fix race condition bug in ggml_compute_forward_diag_mask_f32 * implement ggml_soft_max_back for more performant backward pass of soft_max avoids creating big intermediate matrices of size n_embd x n_embd for llama layers and n_vocab x n_vocab for cross entropy loss * improve softmax backward pass go from quadratic runtime to linear runtime by simplifying the formulas * fix race condition bug in non-inplace ggml_compute_forward_diag_mask_f32 memcpy needs to be synchronized across threads to avoid race conditions. 
=> do it in INIT phase * fix bug in ggml_compute_forward_soft_max_back_f32 on DEBUG build * improve performance of mul_mat backward pass avoid transpose by using mul_mat with swapped arguments * avoid printing too much newlines in baby-llama-text * activate threading in baby-llama-text * add ggml_out_prod and use it for mul_mat backward pass for improved performance performance stats report improvement from 37 seconds to 16 seconds runtime during my training tests * better weight initialization improves training convergence at start * better weight initialization improves training convergence at start * improve ggml_out_prod performance - change iteration order (>15s -> 10s runtime) - parallelize over one more dimension: over dst matrix rows (10s -> <5s runtime) * add llama sampler, shuffle samples and constrain sampling to tokens occurring in train data * fix get_samples call, add model tensor names, increase model size, start training samples after newline * save train trained model to checkpoint and load model to be trained from checkpoint * use inplace functions where possible * initialize rng with srand * use different arguments for input and output checkpoint * ggml fixes to support backward pass on inplace operations * remove duplicate include * fix cross entropy loss - add target probabilities for each sample which is then used in cross entropy loss * print used memory before and after optimization * sample with non-greedy sampling parameters at the end of training * add cmake target for baby-llama-text * add ggml_add1_inplace to header * enable gradient propagation for inplace add1 and scale operations those functions backward passes don't need the original src0, so they also work when forward is inplace * implement AdamW in ggml_opt_adam by adding weight decay parameter (default 0.001f) also add a schedule parameter (default 1.0f) that can be used to scale alpha and decay according to learning schedule. 
setting the decay parameter to zero disables AdamW resulting in normal Adam optimizer. since the difference between Adam and AdamW is minimal it is not implemented as another optimizer, but integrated into the existing Adam optimizer. * use inplace operations in cross_entropy_loss * fix random weight initialization scale * add missing default parameters for adam optimizer * add ggml_opt_context, so that we can properly resume training otherwise the optimizer states, tracking statistics about the error function and its derivates, will reset to zero each time ggml_opt is called, hindering convergence on resumed training. now the optimizer context and all its memory is stored in a separate struct. * fix bug in llama_sample_token_mirostat_v2 when all candidates are filtered out through mu threshold, the following soft_max operation will fail. so keep at least one. * add forward function without using cache, for more performant training during training on whole samples no cache is required. removing the cache and simplifying the remaining code results in performance and memory usage improvement. * print suppressed newline tokens as string "\n" printing too much actual newlines is suppressed to avoid flooding the console. * store optimizer state in training checkpoint and add learning schedule persistent optimizer state allows to resume training without resetting the optimizer learning schedule consists of linear warmup ramp followed by cosine decay with restarts * remove unused functions * fix bug in get_samples which corrupted training targets * save checkpoint only when it was trained * simplify code * remove trailing whitespace * simplify backward pass for SQRT * replace inefficient repeat backward pass with dedicated repeat_back operation * add ggml_cross_entropy_loss with backward pass for faster training cross entropy loss can also be implemented using softmax and log, but as dedicated operation it is faster and especially avoids unnecessary memory overhead. 
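To illustrate why a dedicated cross-entropy operation can avoid intermediate tensors, here is a minimal C++ sketch for a single row of logits. It is not the ggml implementation: the function name `cross_entropy_row` and its signature are hypothetical. It uses the standard log-sum-exp formulation, so the loss and its gradient (`softmax(logits) - target`, which holds when the target probabilities sum to 1) come out of one pass without storing a separate softmax or log result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ggml.
// Returns -sum_i target[i] * log(softmax(logits)[i]) and writes
// d(loss)/d(logits) into grad. Assumes target sums to 1.
float cross_entropy_row(const std::vector<float> & logits,
                        const std::vector<float> & target,
                        std::vector<float> & grad) {
    assert(!logits.empty() && logits.size() == target.size());

    // Subtract the maximum logit before exponentiating for numerical stability.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum_exp = 0.0f;
    for (float x : logits) {
        sum_exp += std::exp(x - max_logit);
    }
    const float log_z = max_logit + std::log(sum_exp);

    grad.resize(logits.size());
    float loss = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const float log_p = logits[i] - log_z;   // log-softmax of element i
        loss -= target[i] * log_p;
        grad[i] = std::exp(log_p) - target[i];   // softmax minus target
    }
    return loss;
}
```

The only buffer written is the gradient itself, whereas composing separate softmax and log operations would each produce a full-size intermediate result.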
* add tests for cross_entropy_loss backward pass finite differences regularly result in an estimated gradient of zero, despite the backward pass giving a non-zero gradient. _probably_ the finite differences fail due to numerical issues * use ggml_cross_entropy_loss in text training example * remove trailing whitespace * slightly improve how cross entropy loss is computed btw: directly implemented cross entropy loss seems to have way lower magnitudes than when implemented with softmax and log. probably the input to log gets closer to zero due to float numerics. maybe the multiplication by (1.0-eps)/sum is more accurate.. * add llama_get_vocab to get the vocabulary as output parameters * set default model.type for unknown models with few layers * add export of training checkpoint to llama compatible model file * get vocabulary for exporting training checkpoint to llama compatible model file * implement backward pass of flash attention * bugfixes for backward pass of flash attention * test flash attention backward pass need to set loose error bounds to pass. the finite differences are close to numeric limits and often return quite different values than the backward pass. reducing eps further lets the gradients vanish completely. likewise setting eps too big results in less accurate values. the softmax in the middle of the function is probably the most responsible for the numeric issues using finite differences. * add option to train with flash attention and move options to the top of the main function training from scratch also works with flash attention training convergence and generation results after a fixed number of iterations are worse than when not using flash attention. maybe there still lingers a bug in the flash attention backward pass? but training works, just with slower convergence.
flash attention is still worth to use, because it requires way less memory and is faster with high n_ctx * add train_params and command line option parser * remove unnecessary comments * add train params to specify memory size * remove python bindings * rename baby-llama-text to train-text-from-scratch * replace auto parameters in lambda function * add #include <climits> * add explicit cast to fix compile error "error: non-constant-expression cannot be narrowed from type 'int64_t' (aka 'long long') to 'uint32_t' (aka 'unsigned int') in initializer list [-Wc++11-narrowing]" * remove trailing whitespace * add ggml_opt_resume_g which accepts forward and backward cgraphs * fix formulas in comments * bug fix for ggml_compute_forward_get_rows_back_f32 the result should be set to zero, not to whatever data is in opt0 * improve training memory usage with scratch buffers instead of relying on the automatic backward pass, we manually create the graph for the backward pass. it turns out that all backward pass operations need only temporary memory which can be reused after each layer. will compute backward pass for ALL model parameters * add option to use scratch buffers in training or not make it configurable because currently training with scratch buffers implies flash attention and optimization over all parameters. * ci : disable temporary * store view offset and permute axes in opt[0] instead of storing it in padding use memcpy to store offset, because offset is of type size_t. when storing it as int32_t offset would have to be smaller than 2^31 which is not necessarily true. * minor : fix compile warnings + minor style changes * fix bug in threaded indices calculation of ggml_compute_forward_flash_attn_back_f32 * store view offset like in master branch * bug fix in forward_batch_wo_cache_flash_attn_train * scratch buffer bug fixes in forward_batch_wo_cache_flash_attn_train data of permute and reshape is the same as their input. 
if we want to preserve the output of permute/reshape, we also need to preserve their inputs. replace reshape(src0, src1) with reshape_nd calls so that we don't need src1. replace (temporary) t03 with ggml_repeat(ctx0, layer.attention_norm, t02). in the future we could also use the new broadcasting ggml_mul to avoid these repeat calls. for this we need backward pass of broadcasting ggml_mul. * remove unnecessary scratch buffer 0 buf 0 is persistent memory, so we can just disable scratch for this by using buf -1 * avoid creating unnecessary grad tensors previously we need to create grads for model parameters, so that expand(..) correctly populates cgraph->leafs & cgraph->grads this wasted memory, because unnecessary grad for each op were automatically created: the automatically generated grad was unnecessary because we later manually set the grad (e.g. t35->grad = expand(gb, ...) ). this discarded the automatically generated grad resulting in wasted memory. improved this by changing expand(..) to not use ggml_build_forward_expand. expand set cgraph->nodes but not the leafs. cgraph->leafs & cgraph->grads are set in another pass after the last expand call. 
* print used training seed * zero initialize gfbuf and gbbuf * ci : re-enable workflows + add README for training --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 2347e45e7bdb09c9a7d74b2c0bc86c2b65f0c343 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 13 20:20:07 2023 +0300 llama : do a warm-up eval at start for better timings (#1824) commit 74d4cfa3438cb58bd177eed30014e6588694aaa8 Author: Kerfuffle <[email protected]> Date: Tue Jun 13 04:23:23 2023 -0600 Allow "quantizing" to f16 and f32 (#1787) * Allow "quantizing" to f16 and f32 Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS Add brief help to the list of quantization types in the quantize tool Ignore case for quantization type arguments in the quantize tool commit 74a6d922f12ccfe16b0c265f43be8978c6f25e98 Author: Kawrakow <[email protected]> Date: Mon Jun 12 22:39:21 2023 +0300 Metal implementation for all k_quants (#1807) * metal : improve q4_K 28.3 -> 26.0 ms/token by avoiding a branch in the calculation of the scales. * metal : small improvement for Q4_K * metal : still optimizing Q4_K This commit pushes it down to 25.3 ms / token. The crazy idea of using 6 bits for the scales is really costly on Metal: if I remove the bit fiddling necessary to make the block scales, time goes almost to the Q4_0 23 ms/token. Before pushing the k-quants upstream I had a Q4_K variant that had used 8-bit scales. It wasn't more accurate, used 0.125 bits more per weight, was running slightly slower on the CPU (due to the larger model size and being memory bound there), and the difference was entirely negligible under CUDA. So, I decided to publish the version with 6-bit scales. Perhaps I should re-consider and change to 8-bit scales? * metal : some more optimizations Q2_K: 25.4 ms/token Q6_K: 27.3 ms/token Q4_0: 22.8 ms/token Q4_1: 23.1 ms/token * metal : Q3_K support Something is not quite right yet. 
* metal : Q5_K support Initial version achieves 31.2 ms/token, 210 GB/s * metal : still not able to figure out why q3_K does not work * Minor * metal : yet another failed attempt to make q3_K work * metal : optimize Q5_K 31.2 ms -> 27.8 ms. 250 GB/s. * metal : q3_K still not working Adding a heavily commented q3_K metal kernel to explain my obviously faulty logic. Perhaps someone could spot the issue? * metal : q3_K finally working Not optimized at all. What was the issue? The scales are not 4-byte aligned, and I was accessing them with a uint32_t pointer. When I tried that on CUDA, I got an error (illegal memory access) and added a memcpy to a local array of 3 uint32_t's. But on Metal it told me there is no memcpy, so I tried accessing directly. There is no error, just garbage results. At some point I did try accessing the scales with a uint16_t pointer (the scales are for sure 2-byte aligned), but was still getting garbage. I guess there must have been another bug. Now access to scales is via a uint16_t pointer and, after starting from scratch from the C dequantize function, it finally works.
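The alignment problem described in the q3_K note above can be illustrated in plain C++. This is a minimal sketch, not the actual Metal or CUDA kernel: the helper names `load_scales_memcpy` and `load_scales_u16` are hypothetical, and the sketch only assumes what the commit message states, namely that three 32-bit words of scales sit at an address that is 2-byte aligned but not necessarily 4-byte aligned. It shows the two safe ways to read them: copying with `memcpy` (the CUDA workaround mentioned in the message) and assembling each word from `uint16_t` halves (the option left when `memcpy` is unavailable). The second helper assumes the low half is stored first, i.e. little-endian order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy three 32-bit words out of storage that may not be
// 4-byte aligned. memcpy places no alignment requirement on its source.
std::array<uint32_t, 3> load_scales_memcpy(const void * src) {
    assert(src != nullptr);
    std::array<uint32_t, 3> out{};
    std::memcpy(out.data(), src, sizeof(uint32_t) * out.size());
    return out;
}

// Hypothetical helper: build each 32-bit word from two 16-bit halves, which
// needs only 2-byte alignment. Low half first (little-endian order).
std::array<uint32_t, 3> load_scales_u16(const uint16_t * src) {
    assert(src != nullptr);
    std::array<uint32_t, 3> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<uint32_t>(src[2 * i]) |
                 (static_cast<uint32_t>(src[2 * i + 1]) << 16);
    }
    return out;
}
```

On a little-endian host both helpers return the same three words for the same bytes, whereas dereferencing the same address through a `uint32_t` pointer would be a misaligned access.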
* metal : Q3_K 1st optimization pass * metal : Q3_K second optimization pass - 29.6 ms/token * metal : Q3_K cleanup * metal : fixed accidentally broken Q2_K --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit e4caa8da59c1c97dc23fa336f4d726984a20560f Author: slaren <[email protected]> Date: Mon Jun 12 19:12:47 2023 +0200 ci : run when changing only the CUDA sources (#1800) commit 58970a4c39124a647ac2a640d9e178ea6c961e65 Author: Howard Su <[email protected]> Date: Mon Jun 12 20:44:16 2023 +0800 Leverage mmap for offloading tensors to GPU (#1597) * Rebase to latest * Show progress * Add assert to make sure we only allocate temp buffer for non-CPU backend tensor Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]> commit 8c0a10e64dbf60fd9946c0cd5e6f59690800b123 Author: Kawrakow <[email protected]> Date: Mon Jun 12 14:31:36 2023 +0300 metal : fix failure to load model (#1817) The number of buffers in the ggml context was left uninitialized. This leads to sporadic failures to load the model on startup. It is actually strange that the failure occurred so infrequently. Co-authored-by: Iwan Kawrakow <[email protected]> commit fa84c4b3e80199a5683438f062009c031a06c4fa Author: Kerfuffle <[email protected]> Date: Sun Jun 11 08:19:17 2023 -0600 Fix issue where interactive mode crashes when input exceeds ctx size (#1789) * Fix issue where interactive mode in the main example crashes when input exceeds ctx size * Ensure the context size is at least 8 tokens in the main example. Closes #1768 commit 12b063f0ecf280e98028e444fc492ee6222cdcdc Author: Kyle Liang <[email protected]> Date: Sun Jun 11 21:20:52 2023 +0800 Fixed WSL cuda's OOM error (#1594) * In the function , add the cuda error bypass.
* remove excessive codes and prints --------- Co-authored-by: liang <[email protected]> commit 31d2b5f4a4bae081e59b36ab37c6ff6f5b5940ad Author: Ryan Landay <[email protected]> Date: Sun Jun 11 17:38:53 2023 +0800 Update SHA256SUMS with current hashes for models quantized using q4_0 (#1798) commit 4de0334f5cabf4696eced2e5d6e279fdfaa6c0f2 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 10 22:56:53 2023 +0300 cmake : fix Metal build (close #1791) commit 3f1223155a462477ac933474ebc4eab0ce3ca264 Author: Artyom Lebedev <[email protected]> Date: Sat Jun 10 22:51:36 2023 +0300 k-quants : GCC12 compilation fix (#1792) commit 303f5809f1b4ec49823dbe70cacd2124ec1d0df0 Author: Andrei <[email protected]> Date: Sat Jun 10 10:47:34 2023 -0400 metal : fix issue with ggml-metal.metal path. Closes #1769 (#1782) * Fix issue with ggml-metal.metal path * Add ggml-metal.metal as a resource for llama target * Update flake.nix metal kernel substitution commit 059e99066d95d73d1ca26c3375d47c0e35596229 Author: Aisuko <[email protected]> Date: Sun Jun 11 00:08:11 2023 +1000 doc : fix wrong address of BLIS.md (#1772) Signed-off-by: Aisuko <[email protected]> commit 17c10acfb44ecb7af25e37fb67b9501cbc0034d2 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 10 12:06:45 2023 +0300 ggml : force no_alloc == false when creating opt tensors (close #1699) This is needed to make operators like ggml_view() be able to store their parameters in the ggml context's memory and not get discarded when no_alloc is true commit e9b66ee9829039d4ab54550d6222e42a0b31e52a Author: Kawrakow <[email protected]> Date: Sat Jun 10 11:28:11 2023 +0300 metal : add Q4_1 implementation (#1785) 23.3 ms / token, so just ~1% slower than q4_0. Achieves 290 GB/s memory throughput. 
Co-authored-by: Iwan Kawrakow <[email protected]> commit 4f0154b0bad775ac4651bf73b5c216eb43c45cdc Author: Kerfuffle <[email protected]> Date: Sat Jun 10 01:59:17 2023 -0600 llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691) * Add support for quantizing already quantized models * Threaded dequantizing and f16 to f32 conversion * Clean up thread blocks with spares calculation a bit * Use std::runtime_error exceptions. commit ef3171d16241c18581d4d08374f0b9e396ade6b7 Author: Xingchen Song(宋星辰) <[email protected]> Date: Sat Jun 10 15:49:40 2023 +0800 ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) commit 555275a693843273759230547001f9ae07fb537e Author: rankaiyx <[email protected]> Date: Sat Jun 10 14:41:59 2023 +0800 make : add SSSE3 compilation use case (#1659) commit 98ed16557432d7a5179c57eddcc3a08a7ae6d54d Author: Robert Sung-wook Shin <[email protected]> Date: Sat Jun 10 01:24:40 2023 +0900 OpenCL: Add release memory (#1741) * Add opencl release memory * Rename function name commit ae9663f1887513e152839e91f61c513075a19422 Author: Johannes Gäßler <[email protected]> Date: Fri Jun 9 13:58:15 2023 +0200 Windows nvcc workaround (#1753) Fix gibberish output on Windows when using CUDA commit b33dee282f5d8032b5f780152732dc45cbf2d349 Author: Georgi Gerganov <[email protected]> Date: Fri Jun 9 11:11:04 2023 +0300 metal : fix build "tanhf" -> "tanh" commit 92f44ff7f778ef1b94028b2ba6d39943b5ca0ada Author: AT <[email protected]> Date: Fri Jun 9 04:00:51 2023 -0400 metal : add GELU implementation (#1770) Co-authored-by: Adam Treat <[email protected]> commit 245fc3c37da5ac5963f9f11a9f4f2ac08d96afc6 Author: Kawrakow <[email protected]> Date: Fri Jun 9 10:39:59 2023 +0300 metal : faster q4_0 (#1775) * metal : 8% faster q4_0 Avoid copying into local uchar4 and float4. * metal : 17% faster Q4_0 Use 64 threads in a thread group.
--------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 72ff5282bf0388c60821f504c4c8cc2b1f491aa6 Author: Kawrakow <[email protected]> Date: Thu Jun 8 22:28:21 2023 +0300 metal : add Q2_K implementation (#1762) * metal : add Q2_K implementation 27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0. Memory throughput is ~156 GB/s. The access pattern used in the Q2_K CUDA implementation resulted in significantly lower performance (~31 ms/token). * Fixing merge conflicts --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 0bf7cf1b296fc9fca05411b37afdf08a531487d2 Author: Georgi Gerganov <[email protected]> Date: Thu Jun 8 20:48:14 2023 +0300 Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)" This reverts commit 8432d4d9f716b25133e3ed671d91e21f6f3be867. commit 8432d4d9f716b25133e3ed671d91e21f6f3be867 Author: le.chang <[email protected]> Date: Fri Jun 9 00:47:56 2023 +0800 ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738) commit 0f291e1f65c1d68201e71ce99c89562a36686b6d Author: Kawrakow <[email protected]> Date: Thu Jun 8 19:46:22 2023 +0300 metal : Q6_K implementation (#1752) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. * Fix after merge with master * Metal implementation for Q6_K Similar to the CUDA implementation. No idea if this is the optimum for Metal, but the few alternative variants I tried all had a lower performance. We get 36.5 ms / token on M2 Max with 30 GPU cores. 
This corresponds to ~200 GB/second throughput. * clang-tidy : add config back * Much better Q6_K implementation for metal 28.3 ms / token for 7B. Subtracting ~9 ms that is spent in other compute graph operations, we are left with ~19 ms for the matrix multiplications. The model is ~5.5 GB, so we are getting 1000 / 19 * 5.5 = 290 GB/s! --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 8fc8179919a11738910db07a800f2b176f8adf09 Author: qingfengfenga <[email protected]> Date: Thu Jun 8 15:58:53 2023 +0800 Add llama.cpp docker support for non-latin languages (#1673) * Modify Dockerfile default character set to improve compatibility (#1673) commit b50b570ed9d699d3d126d72fc02de92926bcd937 Author: Steven Roussey <[email protected]> Date: Thu Jun 8 00:12:28 2023 -0700 ggml : fix fprintf warnings (#1720) commit 53aba3f393f2e02a78ddaba2e934893a8bbf3246 Author: Georgi Gerganov <[email protected]> Date: Thu Jun 8 10:09:08 2023 +0300 clang-tidy : restore dot file from accidental deletion commit 4161bdc04debb70bf5f275492b4d89fd9330087c Author: Kawrakow <[email protected]> Date: Thu Jun 8 10:08:23 2023 +0300 metal : add Q4_K implementation (#1733) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. 
* Fix after merge with master --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 0035858273ebe0694926bf4414d279f3e1cd109d Author: johnson442 <[email protected]> Date: Thu Jun 8 08:02:48 2023 +0100 k-quants : add missing compile definition to CMakeLists (#1748) commit 5c64a0952ee58b2d742ee84e8e3d43cce5d366db Author: Georgi Gerganov <[email protected]> Date: Wed Jun 7 10:59:52 2023 +0300 k-quants : allow to optionally disable at compile time (#1734) * k-quants : put behind optional compile flag LLAMA_K_QUANTS * build : enable k-quants by default commit 5b57a5b72676540b6a45a3f527126299969ad241 Author: jacobi petrucciani <[email protected]> Date: Wed Jun 7 00:15:31 2023 -0400 flake : update to support metal on m1/m2 (#1724) commit 4dc62c545df0af60635d579e9e4dd91bc5afff51 Author: Georgi Gerganov <[email protected]> Date: Wed Jun 7 07:15:08 2023 +0300 readme : add June roadmap commit 35a84916fb029905c44746127026079268216e7a Author: Willy Tarreau <[email protected]> Date: Wed Jun 7 04:10:17 2023 +0200 main: add the possibility to open the prompt cache read-only (#1640) The prompt cache constitutes a nice speed up when using the same prompt prefix across multiple evaluations, but when using it, it will also be updated, which is not always desirable. One use case is to have a large prompt containing some context and usage rules, and a second part containing variable data of the problem being studied. In this case it's desirable to be able to save the first part once, and to always reuse it as-is without updating it with the second part. The new argument --prompt-cache-ro enables this read-only mode on the prompt cache. The prompt's contents that match the cache are loaded from the cache but the rest is not modified. This allowed to reduce a total analysis time from 112s to 49.7s here, without having to backup and restore a copy of the prompt, which takes significant time at 500 MB. 
Signed-off-by: Willy Tarreau <[email protected]> commit 2d7bf110edd8c49209401a16132052cba706ffd0 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 6 22:54:39 2023 +0300 llama : fix vram_scratch var commit 2a4e41a086ce80da68c402457c75c77e52dcc698 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 6 22:41:53 2023 +0300 llama : fix compile warnings commit 17366df842e358768c0df7024484fffecfc7865b Author: Johannes Gäßler <[email protected]> Date: Tue Jun 6 21:33:23 2023 +0200 Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703) * CUDA multi GPU + scratch ggml_cuda_compute_forward Tensor parallelism ggml_cuda_add ggml_cuda_rms_norm ggml_cuda_silu CUDA scratch buffer --main-gpu CLI option commit 44f906e8537fcec965e312d621c80556d6aa9bec Author: Georgi Gerganov <[email protected]> Date: Tue Jun 6 20:16:57 2023 +0300 metal : add f16 support commit d5b111f53d14972669eb52055f9df2567663ad8b Author: LostRuins <[email protected]> Date: Wed Jun 7 01:00:01 2023 +0800 Clblast fixes + enhancements to save VRAM and offload more layers (#1675) * Use events instead of clFinish, where possible * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel * Reduce queueing overhead for contiguous tensors by using single mul kernel call * Adapt to #1612 cl_mem malloc changes * Reduce code duplication between cuda and opencl branches * Improve implementation * Clblast fixes + enhancements to save VRAM: 1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them. 2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer 3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it. * change max value size_t to use limits * removed flags from the CL pool malloc, apply code tidying suggestions. commit 2d43387dafe9c60f15f57aa23ee0b37864b98b32 Author: Georgi Gerganov <ggerga…
commit b8c8dda75fdf5fdea49c80af36818e7c30fe0ddf Author: Howard Su <[email protected]> Date: Thu Jun 29 21:15:15 2023 +0800 Use unsigned for random seed (#2006) * Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <[email protected]> commit 96a712ca1b7f427e3bd7ffc0c70b2105cfc7fbf1 Author: LostRuins <[email protected]> Date: Thu Jun 29 11:56:43 2023 +0800 Porting the improved K-Quant CUDA kernels to OpenCL (#1966) * Added broken new q4k quant * xx + ib0 * Fix q2_k fast kernel * Use preprocessor for QK_K * Add q6_k fast matmul kernel * ported q3k speedup successfully * ported q2k and q5k speedups * remove old dot kernels and template * fixed global const struct types * fixing address spaces * fixed string too long CI issue --------- Co-authored-by: 0cc4m <[email protected]> commit d3494bb86bf7ad5b0b60aae0220ea576f273b5c0 Author: m3ndax <[email protected]> Date: Wed Jun 28 20:39:08 2023 +0200 llama : replacing auto &kv with const auto &kv (#2041) * Replacing auto &kv with const auto &kv * Create codacy.yml * Delete codacy.yml commit 5b351e94d041742cd50ffcf2d44718d63bab398a Author: Salvador E. Tropea <[email protected]> Date: Wed Jun 28 14:27:31 2023 -0300 cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028) - Not used commit 6432aabb6dc887436e4d57414b63116189c3b13b Author: Salvador E. Tropea <[email protected]> Date: Wed Jun 28 14:26:26 2023 -0300 cuda : fix missing const qualifier in casts (#2027) commit b922bc351b69770cec2d35d2aa50fa052b95ca93 Author: Howard Su <[email protected]> Date: Wed Jun 28 10:13:02 2023 -0700 llama : remove shards weight file support (#2000) * Remove multiple shards * Remove multiple file loaders * Remove llama_load_tensor_shard class * Simplify load logic * Remove dead code guess_n_parts function * Remove vocab_only from constructor of llama_model_loader * Remove alignment_prevents_mmap which is not more needed. 
* Remove useless check commit 7f9753fa1263c4eded9a3de19778562f0e1093d7 Author: Johannes Gäßler <[email protected]> Date: Wed Jun 28 18:35:54 2023 +0200 CUDA GPU acceleration for LoRAs + f16 models (#1970) commit cfa0750bc9dbc2d957a91b8ed09ab0035d8f3d4e Author: ningshanwutuobang <[email protected]> Date: Wed Jun 28 23:53:37 2023 +0800 llama : support input embeddings directly (#1910) * add interface for float input * fixed inpL shape and type * add examples of input floats * add test example for embd input * fixed sampling * add free for context * fixed add end condition for generating * add examples for llava.py * add READMD for llava.py * add READMD for llava.py * add example of PandaGPT * refactor the interface and fixed the styles * add cmake build for embd-input * add cmake build for embd-input * Add MiniGPT-4 example * change the order of the args of llama_eval_internal * fix ci error commit 9d23589d638dc74577d5ff880e6d4248b795f12e Author: Erik Scholz <[email protected]> Date: Tue Jun 27 19:06:33 2023 +0200 fix pthreads setaffinity usage on android (#2020) commit 0be54f75a6c3e9a09ea71bdfcdabf9a996a0549b Author: Howard Su <[email protected]> Date: Tue Jun 27 13:07:13 2023 +0800 baby-llama : fix build after ggml_rope change (#2016) commit 181e8d975528a4e27eabb8ae6e9865f9ceae4b37 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 27 00:37:13 2023 +0300 llama : fix rope usage after ChatGLM change commit d9779021bd59ed96daae75e820a5ac5da47ca8ff Author: Georgi Gerganov <[email protected]> Date: Tue Jun 27 00:06:51 2023 +0300 ggml : add support for ChatGLM RoPE commit d38e45157862b58a1824387e64860d68ca3533a7 Author: Roman Parykin <[email protected]> Date: Mon Jun 26 22:47:59 2023 +0300 readme : add Scala 3 bindings repo (#2010) commit eaa6ca5a61b8c9501df9ebe3d264f45b75a5f8aa Author: David Yang <[email protected]> Date: Tue Jun 27 03:45:32 2023 +0800 ggml : increase max tensor name + clean up compiler warnings in train-text (#1988) * Clean up compiler warnings 
in train-text Some brackets to disambiguate order of operations * Increase GGML_MAX_NAME Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues commit aa777abbb73655c4e1e9237b7c0ad66745e8e48c Author: Gustavo Rocha Dias <[email protected]> Date: Mon Jun 26 16:34:45 2023 -0300 readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007) * docs - Alternative way to build at Android, with CLBlast. * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux. * doc- fix typo commit c824d2e368d193d9f564ff29880a51cda9f90527 Author: Georgi Gerganov <[email protected]> Date: Mon Jun 26 21:03:59 2023 +0300 ggml : avoid conv 2d kernel round up commit b853d456018b10820686362af41b2f2f75f1eec6 Author: zrm <[email protected]> Date: Mon Jun 26 13:57:59 2023 -0400 ggml : add NUMA support (#1556) * detect NUMA systems and pin work threads to nodes (linux) * disable mmap prefetch/readahead for NUMA systems * avoid sending finalize op to thread pool if it does nothing * silence robot * fix args * make --numa a param * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement * lower synchronization overhead * statically allocate * move numa state to g_state * add description for --numa * ggml : minor style changes * ggml : minor style + try fix sanitizer build * llama : allow to initialize backend with NUMA support * llama : avoid ggml include in llama-util.h * ggml : style / formatting * ggml : fix handling of ops with n_threads > n_tasks > 1 * server : utilize numa parameter --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 9225baef71407d799a6f7f563b77fd7f82791416 Author: Georgi Gerganov <[email protected]> Date: Mon Jun 26 20:10:52 2023 +0300 k-quants : fix indentation commit a84ab1da8dc6a59a5b67420ae1322f09503ffc72 Author: katsu560 <[email protected]> Date: Tue Jun 27 01:47:02 2023 +0900 tests 
: fix quantize perf (#1990) * fix test quantize perf * avoid the global state commit 5743ca80928d8410754ec64a5673d5c2dd6cfbb7 Author: katsu560 <[email protected]> Date: Tue Jun 27 01:46:07 2023 +0900 k-quants : add AVX support to dot functions (#1916) * k_quants : add AVX support * k_quants : apply review comments commit 412c60e4739367144e51e59add5dc7749d084115 Author: Georgi Gerganov <[email protected]> Date: Mon Jun 26 19:45:09 2023 +0300 readme : add link to new k-quants for visibility commit 6769e944c727c63612dcafbef52009d21ae00fff Author: Kawrakow <[email protected]> Date: Mon Jun 26 19:43:07 2023 +0300 k-quants : support for super-block size of 64 (#2001) * k_quants: WIP super-blocks with 64 weights * k_quants: WIP super-blocks with 64 weights Q6_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q4_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower than the scalar implementation) * k_quants: WIP super-blocks with 64 weights Q3_K scalar and AVX2 works. * k_quants: WIP super-blocks with 64 weights Q5_K scalar and AVX2 works, and with that all k_quants are done on AVX2 and scalar * k_quants: WIP super-blocks with 64 weights Q6_K working on CUDA. Cannot make it run quite as fast as with super-blocks with 256 weights: 8% slower on 4080, 20% slower on the 1660 (but there we fit 1 less layer on the GPU because of the larger model size), so some fraction of these 20% is due to that. * k_quants: WIP super-blocks with 64 weights Q4_K working on CUDA. ~10% slower on GTX-1660, 16% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q2_K working on CUDA. ~3% slower on GTX-1660, 10% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q3_K working on CUDA. * k_quants: WIP super-blocks with 64 weights Q5_K working on CUDA, and with this CUDA is done.
* k_quants: WIP super-blocks with 64 weights Q6_K working on ARM_NEON * k_quants: WIP super-blocks with 64 weights Q4_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q2_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q3_K working on ARM_NEON, but quite a bit slower than 256 weights. * k_quants: WIP super-blocks with 64 weights Q5_K working on ARM_NEON, but quite a bit slower than 256 weights. With that, we have full support for ARM_NEON, although performance is not quite there. * k_quants: WIP super-blocks with 64 weights Slightly more efficient Q3_K and Q5_K * k_quants: WIP super-blocks with 64 weights Another small improvement for Q3_K and Q5_K on ARM_NEON * k_quants: WIP super-blocks with 64 weights Yet another speedup for Q5_K on ARM_NEON. We are now within 10% of the QK_K = 256 version. * k_quants: WIP super-blocks with 64 weights * We are able to pass preprocessor macros to the Metal compiler * Q6_K works and is actually slightly more efficient than the QK_K = 256 version (25.2 ms vs 25.8 ms) * k_quants: WIP super-blocks with 64 weights Q4_K works on Metal and is actually slightly faster than QK_K = 256 (21.95 ms vs 24.0 ms). * k_quants: WIP super-blocks with 64 weights Q2_K works on Metal and is very slightly faster than QK_K = 256 (23.8 ms vs 24.2 ms). * k_quants: WIP super-blocks with 64 weights Q3_K works on Metal and is slightly faster than QK_K = 256 (26.6 ms vs 28.3 ms). * k_quants: WIP super-blocks with 64 weights Q5_K works on Metal and is slightly faster than QK_K = 256 (23.7 ms vs 26.3 ms). * k_quants: call them _K, not _k, also on Metal * k_quants: correctly define QK_K in llama.cpp * Fixed bug in q4_K quantization added with the 64-block addition * Simplify via lambda * k_quants: switch Q3_K to 4-bit scales when QK_K = 64 Otherwise there isn't much benefit from this quantization type.
There is some very slight loss in accuracy, but we reduce size by ~7%. E.g., for OpenLLaMA-3B, Q3_K_S perplexity is 8.6131 with 8-bit scales and 8.6352 with 4-bit, while file size decreases from 1.53G to 1.44G. * k_quants: switch Q4_K to 4-bit scales when QK_K = 64 Here the loss in accuracy is greater than for Q3_K, but the Q4_K points still move further to the left on the perplexity vs size curve. * k_quants: forgot to add the Metal changes in last commit * k_quants: change Q5_K to be type 0 when QK_K = 64 Still needs AVX2 implementation * k_quants: AVX2 implementation for new 64-weight Q5_K * k_quants: 10% faster ARM_NEON Q5_K dot product * k_quants: fixed issue caused by merging with master --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit cbebf61ca7584e9709265395f0127ae7fc0f1882 Author: Howard Su <[email protected]> Date: Mon Jun 26 23:15:47 2023 +0800 Fix assert when free invalid cuda pointer (#2005) Fix assert via initializing extra structure always. CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument commit 447ccbe8c39332fcdd0d98a041b6e2ff6f06219d Author: Georgi Gerganov <[email protected]> Date: Sun Jun 25 16:08:12 2023 +0300 readme : add new roadmap + manifesto commit bd34cdde38f8fd661890ddd5f57ca30bf279877b Author: Georgi Gerganov <[email protected]> Date: Sun Jun 25 14:25:08 2023 +0300 ggml : sync latest ggml (custom operators) commit c2a08f87b8d180115d04b8688f383d1b2761b16d Author: anon998 <[email protected]> Date: Sun Jun 25 08:48:36 2023 +0000 fix server sampling: top k sampler first (#1977) Co-authored-by: anon <[email protected]> commit 66a2555ba6cab954c56d653b29c27bfbbacfbfb1 Author: Georgi Gerganov <[email protected]> Date: Sun Jun 25 09:07:03 2023 +0300 readme : add Azure CI discussion link commit e65ca7e14ac76c4046091da39d41a9017abaa9b3 Author: sjinzh <[email protected]> Date: Sun Jun 25 13:45:44 2023 +0800 zig : upgrade build system support (#1981) * upgrade zig build system support * zig : add new line at the 
end of the file --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 5ec8dd5a3c6a9a109351d2257bb9d53869bd0a94 Author: Robyn <[email protected]> Date: Sun Jun 25 04:10:29 2023 +1000 #1869 Fix null reference errors when training from scratch with CUDA (#1907) * #1869 Fix null reference errors when training from scratch with CUDA build Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly. * ggml : do not dereference src0 if NULL --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 65bdd52a867539691007f85c5508146d507f72c1 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 24 19:40:18 2023 +0300 tests : sync test-grad0 from ggml commit fdd18609113862dc6eb34dfc44a093d54c59ff1f Author: Rowan Hart <[email protected]> Date: Sat Jun 24 04:07:08 2023 -0700 flake : fix ggml-metal.metal path and run nixfmt (#1974) commit c943d823c14cef33092205ca3944de6fdf7abf99 Author: AN Long <[email protected]> Date: Sat Jun 24 19:02:06 2023 +0800 convert : fix invalid params in write_vocab_only (#1975) commit f2c754e1c38936fdde74e4848ac468a696eb73c6 Author: slaren <[email protected]> Date: Sat Jun 24 12:57:18 2023 +0200 ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978) * Improve ggml_graph_dump_dot, add ggml_format_name * add more automatic names to view ops * fix name of copies commit 11da1a85cd69af84b5861134738c7e9e20907470 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 24 13:38:18 2023 +0300 readme : fix whitespaces commit 235b610d650cbfed6dbd5d671f750d35fc18cd7d Author: Alberto <[email protected]> Date: Sat Jun 24 12:32:13 2023 +0200 readme : fixed termux instructions (#1973) commit b061ba9e2a7a2c335a200df8c11aed5e31e4ccbb Author: Alex Renda <[email protected]> Date: Sat Jun 24 03:15:01 2023 -0700 llama : fix top-p sampling to match the canonical definition (#1953) * Fix top-p sampling to match the standard definition (smallest set that has probability mass 
at least p, not largest set with probability mass less than p) * top-p: correct gt to gte * add test for correct top-p behavior commit 527b6fba1d237befb324fd846bda7418c0fa394d Author: Didzis Gosko <[email protected]> Date: Sat Jun 24 11:47:58 2023 +0300 llama : make model stateless and context stateful (llama_state) (#1797) * llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <[email protected]> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <[email protected]> commit d7b7484f74d486f77feb4c0b7af7e1718ed91651 Author: eiery <[email protected]> Date: Fri Jun 23 04:38:01 2023 -0400 Add OpenLLaMA instructions to the README (#1954) * add openllama to readme commit 7487137227eb32ed9b12156338b865cb29b2dfd1 Author: Erik Scholz <[email protected]> Date: Thu Jun 22 14:20:47 2023 +0200 rework convert.py to read hyper-parameters from config.json (#1958) * Read hyper-parameters from HuggingFace-transformer config.json, if they exist, and fall back to guessing, like before otherwise. This allows converting open_llama 3B and other non-standard model designs. 
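For reference, the top-p fix listed above (#1953) changes the cutoff to the smallest set of tokens whose cumulative probability is at least p. A minimal sketch of that rule follows; `top_p_keep_count` is a hypothetical helper written for illustration, not the llama.cpp sampler, and it assumes `probs` is already a normalized distribution.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not part of the llama.cpp API): returns how many of
// the highest-probability tokens top-p sampling keeps.
// Assumption: `probs` sums to 1 and 0 <= p <= 1.
std::size_t top_p_keep_count(std::vector<float> probs, float p) {
    assert(p >= 0.0f && p <= 1.0f);

    // Consider tokens from most to least probable.
    std::sort(probs.begin(), probs.end(), std::greater<float>());

    float cumulative = 0.0f;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        cumulative += probs[i];
        // ">=" rather than ">": stop as soon as the mass reaches p, which
        // yields the smallest set with probability mass at least p.
        if (cumulative >= p) {
            return i + 1;
        }
    }
    // Rounding can leave the sum slightly below p; keep everything then.
    return probs.size();
}
```

With probabilities {0.5, 0.25, 0.125, 0.125} and p = 0.5, the comparison `>=` keeps only the first token, whereas a strict `>` would keep two; this is the "gt to gte" distinction the commit message mentions.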
commit bbca06e26949686d61a5126332680ba3cccf235c Author: Johannes Gäßler <[email protected]> Date: Wed Jun 21 23:49:25 2023 +0200 cmake: revert CUDA arch default to 52, 61 if f16 (#1959) commit fb98254f99d769fcbbf20966ef386abdb48ef601 Author: Rahul Vivek Nair <[email protected]> Date: Thu Jun 22 03:18:43 2023 +0530 Fix typo in README.md (#1961) commit 049aa16b8c5c6d086246e4e6b9feb18de4fbd663 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 20 19:05:54 2023 +0300 readme : add link to p1 commit 2322ec223a21625dfe9bd73ee677444a98a24ac9 Author: Xiake Sun <[email protected]> Date: Tue Jun 20 05:42:40 2023 -0700 Fix typo (#1949) commit aacdbd40562684665b6f7b8ba6695b7a2088bbb0 Author: Ettore Di Giacinto <[email protected]> Date: Tue Jun 20 03:24:39 2023 +0200 llama : fix params struct alignment (#1936) * Workaround struct misalignment during value-copy Signed-off-by: mudler <[email protected]> * Move booleans at the bottom of the structure Signed-off-by: mudler <[email protected]> * Add comment Signed-off-by: mudler <[email protected]> --------- Signed-off-by: mudler <[email protected]> commit 20568fe60f00155fa25e92eb3a7f6b911d557967 Author: Henri Vasserman <[email protected]> Date: Tue Jun 20 01:12:39 2023 +0300 [Fix] Reenable server embedding endpoint (#1937) * Add back embedding feature * Update README commit 18b35625c3c19c64b7818a12460ba5ddb006dfdc Author: Georgi Gerganov <[email protected]> Date: Mon Jun 19 20:43:30 2023 +0300 ggml : fix bug in LBFGS optimizer (found by ggml tests) commit ba4e85a8339b9dd7cdffad31838235f2fe45a8ea Author: l3utterfly <[email protected]> Date: Mon Jun 19 23:20:06 2023 +0800 llama : use aligned memory during ggml_init call from loading saved sessions (#1934) * fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions * - removed commented out old code from fix - updated another instance of same issue below original commit 23fc5c219a9aebd57c8af3fac454062cc4622980 Author: Georgi 
Gerganov <[email protected]> Date: Mon Jun 19 18:18:34 2023 +0300 cmake : fix trailing whitespaces commit cb40dfca694b5cb849837548fd69932117c78362 Author: Kawrakow <[email protected]> Date: Mon Jun 19 18:17:03 2023 +0300 llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932) * Only use Q6_K for output weights if tensor size is multiple of 256 * Fixed copy/paste mistake --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit ca7c3f4da5d144d4cd1dd44903552e6ba49b8ec8 Author: Kawrakow <[email protected]> Date: Mon Jun 19 18:14:09 2023 +0300 cuda : faster k-quants on older GPUs (#1930) * k_quants: hopefully much faster Q4_K on older GPUs On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 65.5 ms/tok to 41.5 ms/tok! * k_quants: hopefully much faster Q3_K on older GPUs On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 60.3 ms/tok to 41.0 ms/tok! * k_quants: faster Q2_K on older GPUs It looks like I didn't need to change anything compared to what we already had, so this is just adding clarifying comments. But I now measure 36.3 ms/tok on the GTX-1660, instead of the 47.2 ms/tok that I have written in the faster k-quants PR. * k_quants: faster Q5_K on older GPUs 68.5 ms/tok -> 62.0 ms/tok on GTX-1660. For some reason the same access pattern that leads to such resounding success for Q2_K to Q4_K did not work at all for Q5_K. It is also more difficult to measure because for Q5_K_S we only have 32 layers on the GTX-1660, so output, tok embeddings and kv cache are done on the CPU. 
--------- Co-authored-by: Iwan Kawrakow <[email protected]> commit b97ca431db35ec96a339a721acb1219c1dd78bed Author: Georgi Gerganov <[email protected]> Date: Mon Jun 19 18:12:33 2023 +0300 ggml : sync latest ggml repo (#1924) * ggml : sync latest ggml repo * ggml : remove unused comments * ggml : asserts commit 1e3abfcef073e73c2b31e8570cb06c5cb2fd1f55 Author: Howard Su <[email protected]> Date: Mon Jun 19 23:10:37 2023 +0800 cmake : fix build shared ggml when CUDA is enabled (#1929) Co-authored-by: Georgi Gerganov <[email protected]> commit 16b9cd193965769089881bb8ec012fccca7b37b6 Author: Johannes Gäßler <[email protected]> Date: Mon Jun 19 10:23:56 2023 +0200 Convert vector to f16 for dequantize mul mat vec (#1913) * Convert vector to f16 for dmmv * compile option * Added compilation option description to README * Changed cmake CUDA_ARCHITECTURES from "OFF" to "native" commit b24c3049d96557c24782e4d32feaae65f47277af Author: Johannes Gäßler <[email protected]> Date: Sun Jun 18 17:41:26 2023 +0200 Added tokens per second to info prints (#1928) commit 0ede372a51fd8160688e01b587582666c14e94e5 Author: Johannes Gäßler <[email protected]> Date: Sun Jun 18 16:07:09 2023 +0200 Fixed incorrectly applying RMS norm twice (#1925) commit 8596af427722775f0df4a7c90b9af067ba90d4ef Author: l3utterfly <[email protected]> Date: Sun Jun 18 19:19:16 2023 +0800 ggml : fix bug in ggml_compute_forward_add_q_f32 (#1918) commit e1886cf4fe0d0f31661dda52a4a9f34bd9b9009a Author: Mike <[email protected]> Date: Sun Jun 18 16:28:26 2023 +0800 readme : update Android build instructions (#1922) Add steps for using termux on android devices to prevent common errors. 
commit 8ab8ba62eb27cc340be2edf3418e051b1d967416 Author: Kawrakow <[email protected]> Date: Sun Jun 18 11:13:43 2023 +0300 llama : prevent usage of k-quants when tensor size is not a multiple of 256 (#1921) * Fix examples/metal * k-quants: prevent usage when tensor size is not divisible by 256 --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 90cc59d6ab1363a5c69c60c4b94db647d3a54a18 Author: Kawrakow <[email protected]> Date: Sun Jun 18 10:52:10 2023 +0300 examples : fix examples/metal (#1920) Co-authored-by: Iwan Kawrakow <[email protected]> commit ce2c7d72e2d06988b5ddec6811ab923254542077 Author: Georgi Gerganov <[email protected]> Date: Sun Jun 18 09:09:47 2023 +0300 metal : handle buffers larger than device's maxBufferLength (#1826) * metal : handle buffers larger than device's maxBufferLength * metal : print more verbose device info + handle errors * metal : fix prints for overlapping views * metal : minimize view overlap to try to utilize device memory better commit 57cd69460f736031a3fc54af1e97c03f80128478 Author: Howard Su <[email protected]> Date: Sun Jun 18 12:29:47 2023 +0800 cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917) commit b2416493ab3ab21686d47c96669da6d6c6af08a4 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 17 20:55:03 2023 +0300 make : do not print help for simple example commit 4f9c43e3bd488b7561119785485e1155dba338d7 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 17 20:24:11 2023 +0300 minor : warning fixes commit 2c9380dd2f77e41149340f3ecb09764d793b16db Author: Johannes Gäßler <[email protected]> Date: Sat Jun 17 19:15:02 2023 +0200 Only one CUDA stream per device for async compute (#1898) commit 051e1b0e6a6e3aee7d989b47760980e6fda5861c Author: Georgi Gerganov <[email protected]> Date: Sat Jun 17 19:30:22 2023 +0300 llama : fix kv_cache `n` init (close #1903) commit 86c7571864ff331f8cdb9e092f3abeb123729a56 Author: DaniAndTheWeb <[email protected]> Date: Sat Jun 17 18:17:22 2023 +0200 make : 
update for latest Arch (#1701) With the upcoming change to the openblas package in arch the Makefile workaround is no longer needed. commit 3d59ec5935ea1d33e9d51060a8dd737169b9b89b Author: Howard Su <[email protected]> Date: Sat Jun 17 23:46:15 2023 +0800 ggml : fix warnings under MSVC (#1908) commit 0711a5f6dce7f04c2a791b14bc47f7d4cb545408 Author: Aaron Miller <[email protected]> Date: Sat Jun 17 07:37:49 2023 -0700 metal : add norm, cpy f16->f16, alibi kernels (#1823) commit fc45a81bc642b9ef33d9004f2b363d558438a6c9 Author: Faez Shakil <[email protected]> Date: Sat Jun 17 17:13:05 2023 +0500 exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc (#1863) commit 794db3e7b982fee37e3995db9c3a216a57ff65e3 Author: Randall Fitzgerald <[email protected]> Date: Sat Jun 17 07:53:04 2023 -0400 Server Example Refactor and Improvements (#1570) A major rewrite for the server example. Note that if you have built something on the previous server API, it will probably be incompatible. Check out the examples for how a typical chat app could work. This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing. 
Summary of the changes: - adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos - applies missing top k sampler - removes interactive mode/terminal-like behavior, removes exclude parameter - moves threads and batch size to server command-line parameters - adds LoRA loading and matches command line parameters with main example - fixes stopping on EOS token and with the specified token amount with n_predict - adds server timeouts, host, and port settings - adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text - sets defaults for unspecified parameters between requests - removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming - adds CORS headers to responses - adds request logging, exception printing and optional verbose logging - adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string - adds printing an error when it can't bind to the host/port specified - fixes multi-byte character handling and replaces invalid UTF-8 characters on responses - prints timing and build info on startup - adds logit bias to request parameters - removes embedding mode - updates documentation; adds streaming Node.js and Bash examples - fixes code formatting - sets server threads to 1 since the current global state doesn't work well with simultaneous requests - adds truncation of the input prompt and better context reset - removes token limit from the input prompt - significantly simplified the logic and removed a lot of variables --------- Co-authored-by: anon998 <[email protected]> Co-authored-by: Henri Vasserman <[email protected]> Co-authored-by: Felix Hellmann <[email protected]> Co-authored-by: Johannes Gäßler <[email protected]> Co-authored-by: Lesaun Harvey <[email 
protected]> commit 5ddf7ea1fb42bac21026de2f77e0f9c069b92234 Author: Jiří Podivín <[email protected]> Date: Sat Jun 17 12:32:48 2023 +0200 hooks : setting up flake8 and pre-commit hooks (#1681) Small, non-functional changes were made to non-compliant files. These include breaking up long lines, whitespace sanitation and unused import removal. Maximum line length in python files was set to a generous 125 chars, in order to minimize number of changes needed in scripts and general annoyance. The "txt" prompts directory is excluded from the checks as it may contain oddly formatted files and strings for a good reason. Signed-off-by: Jiri Podivin <[email protected]> commit bac19927c302737465a1deb14ac0943a221863e8 Author: Gustavo Rocha Dias <[email protected]> Date: Sat Jun 17 06:01:06 2023 -0300 readme : alternative way to build for Android with CLBlast. (#1828) commit b4c6f46f17b6e02f1cd55a81339e7e64f3aaa688 Author: Kerfuffle <[email protected]> Date: Sat Jun 17 01:49:42 2023 -0600 Allow cmake to build ggml as a library (#1896) * Allow cmake to build ggml as a library * A ggml_static library will be created * When BUILD_SHARED_LIBS is enabled, ggml_shared will also be built commit 92f20d9942c86daeb78637bdad7296a572f4da28 Author: David Yang <[email protected]> Date: Sat Jun 17 14:51:54 2023 +0800 train : get raw text instead of page with html (#1905) We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work. 
commit d411968e990c37f51328849c96a743dd78f3c3dd Author: 0cc4m <[email protected]> Date: Fri Jun 16 20:59:49 2023 +0200 opencl : support k-quants (#1836) * Porting q2_k kernel to OpenCL * Set global and local sizes for kernel calls for dequantizing k-quants * Added q6_k kernel * Fix q4_k opencl struct order * Replace uchar with uint8_t * Finish dequant kernels * Added OpenCL DMMV kernels * Fix q2_k, improve code * Fix q3_k * Shorten switch statements * Improve code formatting --------- Co-authored-by: Concedo <[email protected]> commit b41b4cad6f956b5f501db0711dd7007c32b5eee5 Author: SuperUserNameMan <[email protected]> Date: Fri Jun 16 20:58:09 2023 +0200 examples : add "simple" (#1840) * Create `simple.cpp` * minimalist example `CMakeLists.txt` * Update Makefile for minimalist example * remove 273: Trailing whitespace * removed trailing white spaces simple.cpp * typo and comments simple.cpp --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 13fe9d2d84f30cab613c960bf66ac83916006694 Author: Zenix <[email protected]> Date: Sat Jun 17 03:53:04 2023 +0900 cmake : add auto detection of BLAS_INCLUDE_DIRS (#1886) commit ac3b8869538c7fbdb48ff141d78c4dea091789f0 Author: Johannes Gäßler <[email protected]> Date: Fri Jun 16 20:25:51 2023 +0200 llama : fix embd when offloading non-repeating layers (#1891) commit 5b9ccaf104cc1054d4f8f17bc8a4b8dc949e5527 Author: FrankHB <[email protected]> Date: Sat Jun 17 02:25:01 2023 +0800 Fixed possible macro redefinition (#1892) MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined. 
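The macro-redefinition fix above (#1892) comes down to defining `NOMINMAX` only when it is not already defined. A minimal sketch of that guard pattern is shown below; `clamp_to_range` is a hypothetical helper added only to show `std::min`/`std::max` being used after the guarded include, and the exact placement in the project's sources is not reproduced here.

```cpp
#include <cassert>

// Define NOMINMAX only if the toolchain has not done so already
// (the commit notes that MinGW libstdc++ may define it unconditionally).
#ifndef NOMINMAX
#define NOMINMAX
#endif

#if defined(_WIN32)
#include <windows.h>  // with NOMINMAX set, this does not define min/max macros
#endif

#include <algorithm>

// Hypothetical helper for illustration: clamps `value` into [lo, hi]
// using std::min/std::max, which the min/max macros would otherwise break.
int clamp_to_range(int value, int lo, int hi) {
    assert(lo <= hi);
    return std::max(lo, std::min(value, hi));
}
```

Without the `#ifndef` guard, a second plain `#define NOMINMAX` can trigger a redefinition warning or error on toolchains that predefine the macro.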
commit 9cbf50c041a525d781c7764f493a5443924e4e38 Author: Borislav Stanimirov <[email protected]> Date: Fri Jun 16 21:23:53 2023 +0300 build : fix and ignore MSVC warnings (#1889) commit 3d0112261042b356621e93db3fa4c6798a5d098f Author: Kawrakow <[email protected]> Date: Fri Jun 16 20:08:44 2023 +0300 CUDA : faster k-quant dot kernels (#1862) * cuda : faster k-quant dot kernels * Improve Q2_K dot kernel on older GPUs We now have a K_QUANTS_PER_ITERATION macro, which should be set to 1 on older and to 2 on newer GPUs. With this, we preserve the performance of the original PR on RTX-4080, and are faster compared to master on GTX-1660. * Improve Q6_K dot kernel on older GPUs Using the same K_QUANTS_PER_ITERATION macro as last commit, we preserve performance on RTX-4080 and speed up Q6_K on a GTX-1660. * Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile Allowed values are 1 or 2. 2 gives the best performance on modern GPUs and is set as default. On older GPUs 1 may work better. * PR comments --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 602c748863e15270d80d74aa2c3bf86ab8139e07 Author: Borislav Stanimirov <[email protected]> Date: Fri Jun 16 09:58:11 2023 +0300 gitignore : add several entries specific to Visual Studio (#1888) commit a09f9195be39afb4b023b646c0a6ec8a86915174 Author: Johannes Gäßler <[email protected]> Date: Thu Jun 15 21:49:08 2023 +0200 Fixed CUDA runtime version check (#1879) commit bed92756172d4514b23aaf9744cf8e2dc892fc7b Author: Georgi Gerganov <[email protected]> Date: Thu Jun 15 21:56:50 2023 +0300 cmake : remove whitespaces commit c36e81da62ebfe09a768201cc44fa8d712dd00ed Author: yangli2 <[email protected]> Date: Thu Jun 15 11:05:53 2023 -0700 examples : add chat-vicuna.sh (#1854) Co-authored-by: Yang Li <[email protected]> commit 3559433fecedf365e7aba2fe3d5f89d9abb817c1 Author: Igor Okulist <[email protected]> Date: Thu Jun 15 12:51:26 2023 -0500 cmake : set include path for OpenBlas (#1830) commit 
69b34a0e80300bfb3e996983ac3ea075f5526675 Author: Frederik Vogel <[email protected]> Date: Fri Jun 16 02:47:04 2023 +0900 swift : Package compile breaks due to ggml-metal.metal (#1831) * Ignore metal file in spm * Add ggml.h to spm public Headers --------- Co-authored-by: Vogel Frederik <[email protected]> commit cf267d1c71a781700698f8518e903239c3bcc929 Author: daboe01 <[email protected]> Date: Thu Jun 15 19:42:48 2023 +0200 make : add train-text-from-scratch (#1850) * make finetuning example accessible * fixed: target was in wrong line * fixed: name of executable was wrong * fixed: naming of binary * fixed: model path was wrong * fixed clean target * Update examples/train-text-from-scratch/README.md --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 9dda13e5e1f70bdfc25fbc0f0378f27c8b67e983 Author: Srinivas Billa <[email protected]> Date: Thu Jun 15 18:36:38 2023 +0100 readme : server compile flag (#1874) Explicitly include the server make instructions for C++ noobs like me ;) commit 37e257c48e350cf03c353c10d31e777f8d00123d Author: sandyiscool <[email protected]> Date: Thu Jun 15 23:06:06 2023 +0530 make : clean *.so files (#1857) commit 64cc19b4fe3df03bc20e520aa111c30cff3a655e Author: Howard Su <[email protected]> Date: Fri Jun 16 01:29:59 2023 +0800 Fix the validation of main device (#1872) commit 4bfcc855abdb2c9fcc3c5a84747974521909fa41 Author: Georgi Gerganov <[email protected]> Date: Thu Jun 15 20:29:48 2023 +0300 metal : parallel command buffer encoding (#1860) * metal : parallel command buffer encoding * metal : determine number of command buffers based on gf->n_threads commit 6b8312e7979b852f6b6ac9d29cd51fda16c17948 Author: Johannes Gäßler <[email protected]> Date: Thu Jun 15 19:06:46 2023 +0200 Better error when using both LoRA + GPU layers (#1861) commit 254a7a7a5ff4c874ff8488f1f5cbdd7e9c89d682 Author: Johannes Gäßler <[email protected]> Date: Wed Jun 14 19:47:19 2023 +0200 CUDA full GPU acceleration, KV cache in VRAM (#1827) * Fixed 
CUDA RoPE * ggml_cuda_mul_mat_vec_p021 * ggml_cuda_scale * ggml_cuda_diag_mask_inf * ggml_is_permuted * ggml_cuda_cpy * flatten rows for ggml_cuda_op * Added a --low-vram option * Fixed Windows performance * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM commit 92549202659fc23ba9fec5e688227d0da9b06b40 Author: 0xspringtime <[email protected]> Date: Tue Jun 13 15:37:54 2023 -0400 baby-llama : fix operator!= (#1821) * Update baby-llama.cpp Seems to be an error in the implementation of the operator!= function. It attempts to compare the this pointer (a llama_hparams_lora object) with the other pointer (a llama_hparams object) using memcmp. This can lead to incorrect results because the sizes of the objects being compared (sizeof(llama_hparams) and sizeof(llama_hparams_lora)) are different, should now be able to compare two llama_hparams_lora objects for inequality. * Update baby-llama.cpp * Update baby-llama.cpp commit e32089b2c20b1b87b22912f4a8b93fe01647d5b9 Author: xaedes <[email protected]> Date: Tue Jun 13 21:04:40 2023 +0200 train : improved training-from-scratch example (#1652) * add python wrapper https://gist.github.com/abetlen/2b90e5f153f6efd00931d098de5c73ce * fix decoding error. adds errors=ignore parameter * add python bindings for functions to get and set the whole llama state (rng, logits, embedding and kv_cache) * update python bindings * add text generating baby-llama from scratch example * fix race condition bug in ggml_compute_forward_diag_mask_f32 * implement ggml_soft_max_back for more performant backward pass of soft_max avoids creating big intermediate matrices of size n_embd x n_embd for llama layers and n_vocab x n_vocab for cross entropy loss * improve softmax backward pass go from quadratic runtime to linear runtime by simplifying the formulas * fix race condition bug in non-inplace ggml_compute_forward_diag_mask_f32 memcpy needs to be synchronized across threads to avoid race conditions. 
=> do it in INIT phase * fix bug in ggml_compute_forward_soft_max_back_f32 on DEBUG build * improve performance of mul_mat backward pass avoid transpose by using mul_mat with swapped arguments * avoid printing too much newlines in baby-llama-text * activate threading in baby-llama-text * add ggml_out_prod and use it for mul_mat backward pass for improved performance performance stats report improvement from 37 seconds to 16 seconds runtime during my training tests * better weight initialization improves training convergence at start * better weight initialization improves training convergence at start * improve ggml_out_prod performance - change iteration order (>15s -> 10s runtime) - parallelize over one more dimension: over dst matrix rows (10s -> <5s runtime) * add llama sampler, shuffle samples and constrain sampling to tokens occurring in train data * fix get_samples call, add model tensor names, increase model size, start training samples after newline * save train trained model to checkpoint and load model to be trained from checkpoint * use inplace functions where possible * initialize rng with srand * use different arguments for input and output checkpoint * ggml fixes to support backward pass on inplace operations * remove duplicate include * fix cross entropy loss - add target probabilities for each sample which is then used in cross entropy loss * print used memory before and after optimization * sample with non-greedy sampling parameters at the end of training * add cmake target for baby-llama-text * add ggml_add1_inplace to header * enable gradient propagation for inplace add1 and scale operations those functions backward passes don't need the original src0, so they also work when forward is inplace * implement AdamW in ggml_opt_adam by adding weight decay parameter (default 0.001f) also add a schedule parameter (default 1.0f) that can be used to scale alpha and decay according to learning schedule. 
setting the decay parameter to zero disables AdamW resulting in normal Adam optimizer. since the difference between Adam and AdamW is minimal it is not implemented as another optimizer, but integrated into the existing Adam optimizer. * use inplace operations in cross_entropy_loss * fix random weight initialization scale * add missing default parameters for adam optimizer * add ggml_opt_context, so that we can properly resume training otherwise the optimizer states, tracking statistics about the error function and its derivatives, will reset to zero each time ggml_opt is called, hindering convergence on resumed training. now the optimizer context and all its memory is stored in a separate struct. * fix bug in llama_sample_token_mirostat_v2 when all candidates are filtered out through mu threshold, the following soft_max operation will fail. so keep at least one. * add forward function without using cache, for more performant training during training on whole samples no cache is required. removing the cache and simplifying the remaining code results in performance and memory usage improvement. * print suppressed newline tokens as string "\n" printing too many actual newlines is suppressed to avoid flooding the console. * store optimizer state in training checkpoint and add learning schedule persistent optimizer state allows to resume training without resetting the optimizer learning schedule consists of linear warmup ramp followed by cosine decay with restarts * remove unused functions * fix bug in get_samples which corrupted training targets * save checkpoint only when it was trained * simplify code * remove trailing whitespace * simplify backward pass for SQRT * replace inefficient repeat backward pass with dedicated repeat_back operation * add ggml_cross_entropy_loss with backward pass for faster training cross entropy loss can also be implemented using softmax and log, but as dedicated operation it is faster and especially avoids unnecessary memory overhead. 
* add tests for cross_entropy_loss backward pass finite differences regularly results in estimated gradient of zero, despite the backward pass giving non zero gradient. _probably_ the finite differences fails due to numerical issues * use ggml_cross_entropy_loss in text training example * remove trailing whitespace * slightly improve how cross entropy loss is computed btw: directly implemented cross entropy loss seems to have way lower magnitudes than when implemented with softmax and log. probably the input to log gets closer to zero due to float numerics. maybe the multiplication by (1.0-eps)/sum is more accurate.. * add llama_get_vocab to get the vocabulary as output parameters * set default model.type for unknown models with few layers * add export of training checkpoint to llama compatible model file * get vocabulary for exporting training checkpoint to llama compatible model file * implement backward pass of flash attention * bugfixes for backward pass of flash attention * test flash attention backward pass need to set loose error bounds to pass. the finite differences are close to numeric limits and often return quite different values than the backward pass. reducing eps further lets the gradients vanish completely. likewise setting eps too big results in less accurate values. the softmax in the middle of the function is probably the most responsible for the numeric issues using finite differences. * add option to train with flash attention and move options to the top of the main function training from scratch also works with flash attention training convergence and generation results after a fixed number of iterations are worse than when not using flash attention. maybe there still lingers a bug in the flash attention backward pass? but training works, just with slower convergence. 
flash attention is still worth to use, because it requires way less memory and is faster with high n_ctx * add train_params and command line option parser * remove unnecessary comments * add train params to specify memory size * remove python bindings * rename baby-llama-text to train-text-from-scratch * replace auto parameters in lambda function * add #include <climits> * add explicit cast to fix compile error "error: non-constant-expression cannot be narrowed from type 'int64_t' (aka 'long long') to 'uint32_t' (aka 'unsigned int') in initializer list [-Wc++11-narrowing]" * remove trailing whitespace * add ggml_opt_resume_g which accepts forward and backward cgraphs * fix formulas in comments * bug fix for ggml_compute_forward_get_rows_back_f32 the result should be set to zero, not to whatever data is in opt0 * improve training memory usage with scratch buffers instead of relying on the automatic backward pass, we manually create the graph for the backward pass. it turns out that all backward pass operations need only temporary memory which can be reused after each layer. will compute backward pass for ALL model parameters * add option to use scratch buffers in training or not make it configurable because currently training with scratch buffers implies flash attention and optimization over all parameters. * ci : disable temporary * store view offset and permute axes in opt[0] instead of storing it in padding use memcpy to store offset, because offset is of type size_t. when storing it as int32_t offset would have to be smaller than 2^31 which is not necessarily true. * minor : fix compile warnings + minor style changes * fix bug in threaded indices calculation of ggml_compute_forward_flash_attn_back_f32 * store view offset like in master branch * bug fix in forward_batch_wo_cache_flash_attn_train * scratch buffer bug fixes in forward_batch_wo_cache_flash_attn_train data of permute and reshape is the same as their input. 
if we want to preserve the output of permute/reshape, we also need to preserve their inputs. replace reshape(src0, src1) with reshape_nd calls so that we don't need src1. replace (temporary) t03 with ggml_repeat(ctx0, layer.attention_norm, t02). in the future we could also use the new broadcasting ggml_mul to avoid these repeat calls. for this we need backward pass of broadcasting ggml_mul. * remove unnecessary scratch buffer 0 buf 0 is persistent memory, so we can just disable scratch for this by using buf -1 * avoid creating unnecessary grad tensors previously we need to create grads for model parameters, so that expand(..) correctly populates cgraph->leafs & cgraph->grads this wasted memory, because unnecessary grad for each op were automatically created: the automatically generated grad was unnecessary because we later manually set the grad (e.g. t35->grad = expand(gb, ...) ). this discarded the automatically generated grad resulting in wasted memory. improved this by changing expand(..) to not use ggml_build_forward_expand. expand set cgraph->nodes but not the leafs. cgraph->leafs & cgraph->grads are set in another pass after the last expand call. 
* print used training seed * zero initialize gfbuf and gbbuf * ci : re-enable workflows + add README for training --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 2347e45e7bdb09c9a7d74b2c0bc86c2b65f0c343 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 13 20:20:07 2023 +0300 llama : do a warm-up eval at start for better timings (#1824) commit 74d4cfa3438cb58bd177eed30014e6588694aaa8 Author: Kerfuffle <[email protected]> Date: Tue Jun 13 04:23:23 2023 -0600 Allow "quantizing" to f16 and f32 (#1787) * Allow "quantizing" to f16 and f32 Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS Add brief help to the list of quantization types in the quantize tool Ignore case for quantization type arguments in the quantize tool commit 74a6d922f12ccfe16b0c265f43be8978c6f25e98 Author: Kawrakow <[email protected]> Date: Mon Jun 12 22:39:21 2023 +0300 Metal implementation for all k_quants (#1807) * metal : improve q4_K 28.3 -> 26.0 ms/token by avoiding a branch in the calculation of the scales. * metal : small improvement for Q4_K * metal : still optimizing Q4_K This commit pushes it down to 25.3 ms / token. The crazy idea of using 6 bits for the scales is really costly on Metal: if I remove the bit fiddling necessary to make the block scales, time goes almost to the Q4_0 23 ms/token. Before pushing the k-quants upstream I had a Q4_K variant that had used 8-bit scales. It wasn't more accurate, used 0.125 bits more per weight, was running slightly slower on the CPU (due to the larger model size and being memory bound there), and the difference was entirely negligible under CUDA. So, I decided to publish the version with 6-bit scales. Perhaps I should re-consider and change to 8-bit scales? * metal : some more optimizations Q2_K: 25.4 ms/token Q6_K: 27.3 ms/token Q4_0: 22.8 ms/token Q4_1: 23.1 ms/token * metal : Q3_K support Something is not quite right yet. 
* metal : Q5_K support Initial version achieves 31.2 ms/token, 210 GB/s * metal : still not able to figure out why q3_K does not work * Minor * metal : yet another failed attempt to make q3_K work * metal : optimize Q5_K 31.2 ms -> 27.8 ms. 250 GB/s. * metal : q3_K still not working Adding a heavily commented q3_K metal kernel to explain my obviously faulty logic. Perhaps someone could spot the issue? * metal : q3_K finally working Not optimized at all. What was the issue? The scales are not 4-bytes aligned, and I was accessing them with a uint32_t pointer. When I tried that on CUDA, I got an error (illegal memory access) and added a memcpy to a local array of 3 uint32_t's. But on Metal it told me there is no memcpy, so I tried accessing directly. There is no error, just garbage results. At some point I did try accessing the scales with an uint16_t pointer (the scales are for sure 2-byte aligned), but was still getting garbage. I guess, there must have been another bug. No access to scales is via a uint16_t pointer and, after starting from scratch from the C dequantize function, it finally works. 
* metal : Q3_K 1st optimization pass * metal : Q3_K second optimization pass - 29.6 ms/token * metal : Q3_K cleanup * metal : fixed accidentally broken Q2_K --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit e4caa8da59c1c97dc23fa336f4d726984a20560f Author: slaren <[email protected]> Date: Mon Jun 12 19:12:47 2023 +0200 ci : run when changing only the CUDA sources (#1800) commit 58970a4c39124a647ac2a640d9e178ea6c961e65 Author: Howard Su <[email protected]> Date: Mon Jun 12 20:44:16 2023 +0800 Leverage mmap for offloading tensors to GPU (#1597) * Rebase to latest * Show progress * Add assert to make sure we only allocate temp buffer for non-CPU backend tensor Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]> commit 8c0a10e64dbf60fd9946c0cd5e6f59690800b123 Author: Kawrakow <[email protected]> Date: Mon Jun 12 14:31:36 2023 +0300 metal : fix failure to load model (#1817) The number of buffers in the ggml context was left unitialized. This leads to sporadic failures to load the model on startup. It is actually strange that the failure occurred so infrequantly. Co-authored-by: Iwan Kawrakow <[email protected]> commit fa84c4b3e80199a5683438f062009c031a06c4fa Author: Kerfuffle <[email protected]> Date: Sun Jun 11 08:19:17 2023 -0600 Fix issue where interactive mode crashes when input exceeds ctx size (#1789) * Fix issue where interactive mode in the main example crashes when input exceeds ctx size * Ensure the context size is at least 8 tokens in the main example. Closes #1768 commit 12b063f0ecf280e98028e444fc492ee6222cdcdc Author: Kyle Liang <[email protected]> Date: Sun Jun 11 21:20:52 2023 +0800 Fixed WSL cuda's OOM error (#1594) * In the function , add the cuda error bypass. 
* remove excessive codes and prints --------- Co-authored-by: liang <[email protected]> commit 31d2b5f4a4bae081e59b36ab37c6ff6f5b5940ad Author: Ryan Landay <[email protected]> Date: Sun Jun 11 17:38:53 2023 +0800 Update SHA256SUMS with current hashes for models quantized using q4_0 (#1798) commit 4de0334f5cabf4696eced2e5d6e279fdfaa6c0f2 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 10 22:56:53 2023 +0300 cmake : fix Metal build (close #1791) commit 3f1223155a462477ac933474ebc4eab0ce3ca264 Author: Artyom Lebedev <[email protected]> Date: Sat Jun 10 22:51:36 2023 +0300 k-quants : GCC12 compilation fix (#1792) commit 303f5809f1b4ec49823dbe70cacd2124ec1d0df0 Author: Andrei <[email protected]> Date: Sat Jun 10 10:47:34 2023 -0400 metal : fix issue with ggml-metal.metal path. Closes #1769 (#1782) * Fix issue with ggml-metal.metal path * Add ggml-metal.metal as a resource for llama target * Update flake.nix metal kernel substitution commit 059e99066d95d73d1ca26c3375d47c0e35596229 Author: Aisuko <[email protected]> Date: Sun Jun 11 00:08:11 2023 +1000 doc : fix wrong address of BLIS.md (#1772) Signed-off-by: Aisuko <[email protected]> commit 17c10acfb44ecb7af25e37fb67b9501cbc0034d2 Author: Georgi Gerganov <[email protected]> Date: Sat Jun 10 12:06:45 2023 +0300 ggml : force no_alloc == false when creating opt tensors (close #1699) This is needed to make operators like ggml_view() be able to store their parameters in the ggml context's memory and not get discarded when no_alloc is true commit e9b66ee9829039d4ab54550d6222e42a0b31e52a Author: Kawrakow <[email protected]> Date: Sat Jun 10 11:28:11 2023 +0300 metal : add Q4_1 implementation (#1785) 23.3 ms / token, so just ~1% slower than q4_0. Achieves 290 GB/s memory throughput. 
Co-authored-by: Iwan Kawrakow <[email protected]> commit 4f0154b0bad775ac4651bf73b5c216eb43c45cdc Author: Kerfuffle <[email protected]> Date: Sat Jun 10 01:59:17 2023 -0600 llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691) * Add support for quantizing already quantized models * Threaded dequantizing and f16 to f32 conversion * Clean up thread blocks with spares calculation a bit * Use std::runtime_error exceptions. commit ef3171d16241c18581d4d08374f0b9e396ade6b7 Author: Xingchen Song(宋星辰) <[email protected]> Date: Sat Jun 10 15:49:40 2023 +0800 ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) commit 555275a693843273759230547001f9ae07fb537e Author: rankaiyx <[email protected]> Date: Sat Jun 10 14:41:59 2023 +0800 make : add SSSE3 compilation use case (#1659) commit 98ed16557432d7a5179c57eddcc3a08a7ae6d54d Author: Robert Sung-wook Shin <[email protected]> Date: Sat Jun 10 01:24:40 2023 +0900 OpenCL: Add release memory (#1741) * Add opencl release memory * Rename function name commit ae9663f1887513e152839e91f61c513075a19422 Author: Johannes Gäßler <[email protected]> Date: Fri Jun 9 13:58:15 2023 +0200 Windows nvcc workaround (#1753) Fix gibberish output on Windows when using CUDA commit b33dee282f5d8032b5f780152732dc45cbf2d349 Author: Georgi Gerganov <[email protected]> Date: Fri Jun 9 11:11:04 2023 +0300 metal : fix build "tanhf" -> "tanh" commit 92f44ff7f778ef1b94028b2ba6d39943b5ca0ada Author: AT <[email protected]> Date: Fri Jun 9 04:00:51 2023 -0400 metal : add GELU implementation (#1770) Co-authored-by: Adam Treat <[email protected]> commit 245fc3c37da5ac5963f9f11a9f4f2ac08d96afc6 Author: Kawrakow <[email protected]> Date: Fri Jun 9 10:39:59 2023 +0300 metal : faster q4_0 (#1775) * metal : 8% faster q4_0 Avoid copying into local uchar4 anf float4. * metal : 17% faster Q4_0 Use 64 threads in a thread group. 
--------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 72ff5282bf0388c60821f504c4c8cc2b1f491aa6 Author: Kawrakow <[email protected]> Date: Thu Jun 8 22:28:21 2023 +0300 metal : add Q2_K implementation (#1762) * metal : add Q2_K implementation 27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0. Memory throughput is ~156 GB/s. The access pattern used in the Q2_K CUDA implementation resulted in significantly lower performance (~31 ms/token). * Fixing merge conflicts --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 0bf7cf1b296fc9fca05411b37afdf08a531487d2 Author: Georgi Gerganov <[email protected]> Date: Thu Jun 8 20:48:14 2023 +0300 Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)" This reverts commit 8432d4d9f716b25133e3ed671d91e21f6f3be867. commit 8432d4d9f716b25133e3ed671d91e21f6f3be867 Author: le.chang <[email protected]> Date: Fri Jun 9 00:47:56 2023 +0800 ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738) commit 0f291e1f65c1d68201e71ce99c89562a36686b6d Author: Kawrakow <[email protected]> Date: Thu Jun 8 19:46:22 2023 +0300 metal : Q6_K implementation (#1752) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. * Fix after merge with master * Metal implementation for Q6_K Similar to the CUDA implementation. No idea if this is the optimum for Metal, but the few alternative variants I tried all had a lower performance. We get 36.5 ms / token on M2 Max with 30 GPU cores. 
This corresponds to ~200 GB/second throughput. * clang-tidy : add config back * Much better Q6_K implementation for metal 28.3 ms / token for 7B. Subtracting ~9 ms that is spent in other compute graph operations, we are left with ~19 ms for the matrix multiplications. The model is ~5.5 GB, so we are getting 1000 / 19 * 5.5 = 290 GB/s! --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 8fc8179919a11738910db07a800f2b176f8adf09 Author: qingfengfenga <[email protected]> Date: Thu Jun 8 15:58:53 2023 +0800 Add llama.cpp docker support for non-latin languages (#1673) * Modify Dockerfile default character set to improve compatibility (#1673) commit b50b570ed9d699d3d126d72fc02de92926bcd937 Author: Steven Roussey <[email protected]> Date: Thu Jun 8 00:12:28 2023 -0700 ggml : fix fprintf warnings (#1720) commit 53aba3f393f2e02a78ddaba2e934893a8bbf3246 Author: Georgi Gerganov <[email protected]> Date: Thu Jun 8 10:09:08 2023 +0300 clang-tidy : restore dot file from accidental deletion commit 4161bdc04debb70bf5f275492b4d89fd9330087c Author: Kawrakow <[email protected]> Date: Thu Jun 8 10:08:23 2023 +0300 metal : add Q4_K implementation (#1733) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. 
* Fix after merge with master --------- Co-authored-by: Iwan Kawrakow <[email protected]> commit 0035858273ebe0694926bf4414d279f3e1cd109d Author: johnson442 <[email protected]> Date: Thu Jun 8 08:02:48 2023 +0100 k-quants : add missing compile definition to CMakeLists (#1748) commit 5c64a0952ee58b2d742ee84e8e3d43cce5d366db Author: Georgi Gerganov <[email protected]> Date: Wed Jun 7 10:59:52 2023 +0300 k-quants : allow to optionally disable at compile time (#1734) * k-quants : put behind optional compile flag LLAMA_K_QUANTS * build : enable k-quants by default commit 5b57a5b72676540b6a45a3f527126299969ad241 Author: jacobi petrucciani <[email protected]> Date: Wed Jun 7 00:15:31 2023 -0400 flake : update to support metal on m1/m2 (#1724) commit 4dc62c545df0af60635d579e9e4dd91bc5afff51 Author: Georgi Gerganov <[email protected]> Date: Wed Jun 7 07:15:08 2023 +0300 readme : add June roadmap commit 35a84916fb029905c44746127026079268216e7a Author: Willy Tarreau <[email protected]> Date: Wed Jun 7 04:10:17 2023 +0200 main: add the possibility to open the prompt cache read-only (#1640) The prompt cache constitutes a nice speed up when using the same prompt prefix across multiple evaluations, but when using it, it will also be updated, which is not always desirable. One use case is to have a large prompt containing some context and usage rules, and a second part containing variable data of the problem being studied. In this case it's desirable to be able to save the first part once, and to always reuse it as-is without updating it with the second part. The new argument --prompt-cache-ro enables this read-only mode on the prompt cache. The prompt's contents that match the cache are loaded from the cache but the rest is not modified. This allowed to reduce a total analysis time from 112s to 49.7s here, without having to backup and restore a copy of the prompt, which takes significant time at 500 MB. 
Signed-off-by: Willy Tarreau <[email protected]> commit 2d7bf110edd8c49209401a16132052cba706ffd0 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 6 22:54:39 2023 +0300 llama : fix vram_scratch var commit 2a4e41a086ce80da68c402457c75c77e52dcc698 Author: Georgi Gerganov <[email protected]> Date: Tue Jun 6 22:41:53 2023 +0300 llama : fix compile warnings commit 17366df842e358768c0df7024484fffecfc7865b Author: Johannes Gäßler <[email protected]> Date: Tue Jun 6 21:33:23 2023 +0200 Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703) * CUDA multi GPU + scratch ggml_cuda_compute_forward Tensor parallelism ggml_cuda_add ggml_cuda_rms_norm ggml_cuda_silu CUDA scratch buffer --main-gpu CLI option commit 44f906e8537fcec965e312d621c80556d6aa9bec Author: Georgi Gerganov <[email protected]> Date: Tue Jun 6 20:16:57 2023 +0300 metal : add f16 support commit d5b111f53d14972669eb52055f9df2567663ad8b Author: LostRuins <[email protected]> Date: Wed Jun 7 01:00:01 2023 +0800 Clblast fixes + enhancements to save VRAM and offload more layers (#1675) * Use events instead of clFinish, where possible * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel * Reduce queueing overhead for contiguous tensors by using single mul kernel call * Adapt to #1612 cl_mem malloc changes * Reduce code duplication between cuda and opencl branches * Improve implementation * Clblast fixes + enhancements to save VRAM: 1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them. 2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer 3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it. * change max value size_t to use limits * removed flags from the CL pool malloc, apply code tidying suggestions. commit 2d43387dafe9c60f15f57aa23ee0b37864b98b32 Author: Georgi Gerganov <ggerga…
Checked out the latest master (5c64a09); the compilation reports:
But everything is OK on x86_64, so maybe arm64 does not support this intrinsic?
Using `vld4q_s8` instead of `vld1q_u8_x4` seems to work on both x86_64 and arm64. However, not all tests passed, due to issue #1736.
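One thing to keep in mind when swapping these intrinsics: the `vld1q_*_x4` family loads four consecutive 16-byte vectors, whereas `vld4q_s8` performs a de-interleaving (structure) load, so the same 64 bytes end up in different lanes. The sketch below emulates both layouts in portable C so the difference can be checked on any machine. It is only an illustration: the `emu_*` names are hypothetical and are not part of ggml or `arm_neon.h`, and whether a given ggml kernel depends on the lane order is not something this PR description states.

```c
#include <stdint.h>
#include <string.h>

/* Plain-C stand-in for NEON's int8x16x4_t: four vectors of 16 lanes.
   Hypothetical type, used only for this illustration. */
typedef struct { int8_t val[4][16]; } emu_int8x16x4_t;

/* Emulates a contiguous x4 load (vld1q_s8_x4 style):
   val[k][i] = src[16*k + i]. */
static emu_int8x16x4_t emu_load_contiguous(const int8_t *src) {
    emu_int8x16x4_t r;
    memcpy(r.val, src, sizeof r.val);
    return r;
}

/* Emulates a de-interleaving load (vld4q_s8 style):
   val[k][i] = src[4*i + k]. */
static emu_int8x16x4_t emu_load_deinterleave(const int8_t *src) {
    emu_int8x16x4_t r;
    for (int i = 0; i < 16; ++i) {
        for (int k = 0; k < 4; ++k) {
            r.val[k][i] = src[4 * i + k];
        }
    }
    return r;
}

/* Sum of lane-wise products over all 64 lanes. */
static int32_t emu_dot(const emu_int8x16x4_t *a, const emu_int8x16x4_t *b) {
    int32_t sum = 0;
    for (int k = 0; k < 4; ++k) {
        for (int i = 0; i < 16; ++i) {
            sum += (int32_t)a->val[k][i] * (int32_t)b->val[k][i];
        }
    }
    return sum;
}
```

If both operands of a dot product are loaded with the same layout, the result is unchanged, because the same permutation is applied to both sides. The result can differ when one operand is loaded de-interleaved and the other is not, or when the code relies on a specific byte sitting in a specific lane.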