-
I made #1967 if anyone wants to try playing with this. edit: Hide evidence of my shame... Current version of that pull now reproduces GG's perplexity results at 4,096 context with scale 0.5.
-
I will prioritize VRAM optimizations; that's already useful on its own, but if the context can be extended, the extra VRAM will be especially valuable. My top llama.cpp priorities will be to try to do a dequantize + matrix multiplication kernel and to look into whether the KV cache can be quantized. I think patching the CUDA implementation of RoPE won't be too difficult; right now, though, the CUDA code does not support LoRAs. If this technique produces good results, we should also think about how to specify RoPE scaling. Since finetuning with the same scaling seems to be important, I think the ideal solution would be to specify the correct scaling in the model file. Still, if we want to support the user setting an arbitrary scaling at runtime, we will also need a CLI argument that can override whatever the model file says.
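As a rough sketch of that precedence (hypothetical names, not actual llama.cpp options or API): a CLI value, when set, would win over whatever scale the model file records.

```c
// Hypothetical resolution of the effective RoPE scale: the model file records the
// scale the model was fine-tuned with, and an optional CLI flag can override it.
#include <stdio.h>

struct rope_scale_cfg {
    float model_file_scale;  // scale stored in the model file (1.0f = no scaling)
    float cli_scale;         // value from a CLI flag, or 0.0f if the flag was not given
};

static float effective_rope_scale(const struct rope_scale_cfg * cfg) {
    return cfg->cli_scale > 0.0f ? cfg->cli_scale : cfg->model_file_scale;
}

int main(void) {
    struct rope_scale_cfg cfg = { 0.5f, 0.0f };
    printf("scale = %.2f\n", effective_rope_scale(&cfg));  // 0.50, taken from the model file
    cfg.cli_scale = 0.25f;
    printf("scale = %.2f\n", effective_rope_scale(&cfg));  // 0.25, the CLI override wins
    return 0;
}
```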
-
This is amazing! If the context is getting so long now, I have some concerns about the KV cache size. We already know that there is no difference in perplexity when it is stored in F16 as opposed to F32, but has anyone tested quantizing it? Even just Q8_0 would cut it down a lot.
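For a rough sense of scale, here is a back-of-the-envelope sketch assuming LLaMA-13B (n_layer = 40, n_embd = 5120) at an 8192-token context, and ggml's Q8_0 block layout of 32 int8 values plus one fp16 scale per block:

```c
// KV cache size estimate for LLaMA-13B at an 8192-token context.
// n_elem counts both K and V across all layers; Q8_0 costs 34 bytes per 32 values.
#include <stdio.h>

int main(void) {
    const long long n_layer = 40, n_embd = 5120, n_ctx = 8192;
    const long long n_elem  = 2 * n_layer * n_ctx * n_embd;   // K and V

    const double gib = 1024.0 * 1024.0 * 1024.0;
    printf("F32 : %.2f GiB\n", n_elem * 4.0          / gib);  // ~12.5 GiB
    printf("F16 : %.2f GiB\n", n_elem * 2.0          / gib);  // ~6.25 GiB
    printf("Q8_0: %.2f GiB\n", n_elem * (34.0 / 32.0) / gib); // ~3.3 GiB
    return 0;
}
```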
-
edit: I'm just going to delete this, since weird stuff might have been going on with CUDA BLAS. It doesn't seem like the cuBLAS RoPE operations actually check that the arguments are the right type/length the way the normal CPU ops do, so it might have been trying to use the new args format as if it were the old one.
-
I have added a version of the same dataset with no scaling, so you can compare the difference in ppl between the two versions and see what the effect is when finetuning with the scaling patch. This version is also trained with a 4096 cutoff, but it will quickly deteriorate past ~2400, even with the scaling applied during inference: https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test/tree/main/no_scaling
-
I made a PR that fixes LoRAs and CUDA acceleration not being usable at the same time: #1970. I then also merged in the PR by @KerfuffleV2 (#1967), which lets you set the RoPE scaling at compile time, and pushed the branch here.
-
I tested this as well, using my own GPTQ/CUDA-based implementation. Here are some preliminary results. All tests are on the same set of 40 8k-token sequences, truncated at different lengths along the x axis.

Base (red) is plain Llama-13B, 4-bit GPTQ. As is evident, average perplexity goes off the chart as soon as seq_len exceeds the base model's pretraining.

The previous experiment (yellow) is something I tried out a few months back, finetuning on 6k-token examples just to see what would happen. I was able to overcome the limit at 2048 tokens, but only barely, and perplexity still starts to climb after that point, if slower than before. It should be continuing to drop. My conclusion at the time was that (much) more tuning was needed for this approach to work.

SuperHOT (blue) is the interesting part. Positional embeddings are condensed by a factor of four, stretching the original 2048 positions evenly across the 8192 spaces. The SuperHOT LoRA is applied to give the model a chance with the new positional embeddings. The results are very encouraging, I'd say. The model is clearly taking advantage of the longer context provided, making better and better predictions the more context it has to work with. That doesn't mean it's better enough, of course. I think more testing is going to be needed, but I'm cautiously optimistic and anxious to try the 33B version soon.

For completeness, the command line to reproduce the results with ExLlama:
-
I did some testing here with @kaiokendev's 16k LoRA. It looks like PPL is lower on the 16K model even at 2K context, and it does better at higher contexts (16K) too. We might find benefit in scaling this further, but we might also reach a point where it starts hurting more than helping.
-
From Meta: https://arxiv.org/abs/2306.15595
-
Without finetuning, I find that you can get a little bit more extension without degradation by not doing a simple scaling... if you do a gradually increasing scaling instead: https://colab.research.google.com/drive/18Ou_Isi1HiqtWqkbKfBZ46Q2hHES3Jp8?authuser=2
-
Haven't read the ^ paper yet, but I found that this method works best with larger models, and there's probably a relationship between how well this method works and how many parameters you have.
-
That's a good idea. Maybe the most flexible way to handle this extended RoPE operation is to allow passing in a cached scale like that. It can just use the last scale item when the context size exceeds the scale length (that probably has the best chance of being reasonable). Even for absurd context lengths, this would only use a pretty small amount of memory (256 KiB for a context of 65,535, assuming 32-bit float scale values). This approach would also handle the default scale gracefully (1 scale item of 1.0).
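A minimal sketch of that lookup with clamping to the last entry (illustrative only, not ggml's actual API; the four-entry schedule is just an example):

```c
// Hypothetical per-position RoPE scale lookup: a small cached array of float
// scales accompanies the op, and positions past the end of the array reuse the
// last entry. A single-entry array {1.0f} reproduces the default behaviour.
#include <stdio.h>

static float rope_scale_at(const float * scales, int n_scales, int pos) {
    const int idx = pos < n_scales ? pos : n_scales - 1;   // clamp to the last entry
    return scales[idx];
}

int main(void) {
    const float scales[] = { 1.0f, 1.0f, 0.75f, 0.5f };     // example schedule
    for (int pos = 0; pos < 6; ++pos) {
        printf("pos %d -> scale %.2f\n", pos, rope_scale_at(scales, 4, pos));
    }
    return 0;
}
```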
-
I think people are missing the point of this a little bit. The idea is to retrain the model so that it works naturally at a different scale, not to make it tolerate multiple different scales while still preferring 1:1. Working at multiple scales is more demanding of the model than simply finetuning it on one new scaling factor.
-
A new method of interpolation has been proposed here. From what I could see, it indeed gives coherent output even without fine-tuning. The change is to replace `const float theta_scale = powf(10000.0, -2.0f/n_dims);` with `const float theta_scale = powf(10000.0 * powf(8.0, n_dims / (n_dims - 2.0)), -2.0f/n_dims);`.
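That line sets the per-dimension frequency decay used by RoPE. A hedged sketch of what the change amounts to in isolation (n_dims is the rotary/head dimension; names are illustrative, and the surrounding code in ggml.c may differ):

```c
// NTK-aware scaling: instead of compressing positions, raise the RoPE frequency
// base so that low-frequency dimensions get stretched while high-frequency ones
// stay largely intact. alpha = 8.0 targets roughly a 4x context extension.
#include <math.h>
#include <stdio.h>

static float ntk_theta_scale(int n_dims, float alpha) {
    const float base = 10000.0f * powf(alpha, n_dims / (n_dims - 2.0f));
    return powf(base, -2.0f / n_dims);   // replaces powf(10000.0, -2.0f/n_dims)
}

int main(void) {
    const int n_dims = 128;              // head dimension of the LLaMA models
    printf("plain theta_scale: %f\n", powf(10000.0f, -2.0f / n_dims));
    printf("NTK   theta_scale: %f\n", ntk_theta_scale(n_dims, 8.0f));
    return 0;
}
```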
-
Try #2054. I don't have a beefy enough system at $home to run any numbers, but a vanilla Vicuna 13B v1.3.0 q8_0 works alright with
-
Any news on this front?
-
Well, this just landed: https://arxiv.org/abs/2307.02486
I cannot see any source code in the unilm repository that the paper links to, so this is more of a heads-up about what is hopefully coming our way soon.
-
The advantage of RoPE scaling is that it's something that can work with existing models (possibly after some fine-tuning). Stuff like the linked paper is almost certainly going to require training completely new models from scratch, so even if it worked perfectly and 100% of the information was available, it still wouldn't really be "soon".
-
Just to confirm the current status of RoPE:
-
Yes. Anything that can extend the context size to 500K - 1M tokens would be great, as Gradient did with Mistral.
-
Quick question. I assume the when loading a model pretrained with a target context of 32k. I tested with a simple
-
Intro
This is a discussion about a recently proposed strategy of extending the context size of LLaMA models.
Make sure to first get familiar with the info in the links above, as there have already been ongoing discussions and results.
So far the discussion seems to focus on the coherency of the generated text when using large contexts. I think what we can do here in llama.cpp to support these investigations is to provide a more objective way of evaluating the proposed method by computing the perplexity at different context sizes, with and without fine-tuning. Very initial results already suggest that this idea might be viable, but we should carefully check that we are doing the computations correctly.

Preliminary tests with LLaMA 7B
Applied the following simple patch as proposed by Reddit user pseudonerv in this comment:
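A minimal sketch of the idea (not the verbatim patch; the structure follows a generic ggml-style RoPE loop and the exact lines in ggml.c may differ): the absolute token position is multiplied by a fixed factor before the rotation angles are computed.

```c
// Sketch of RoPE with position interpolation: the absolute token position p is
// scaled by 0.5 before the rotation angles are computed, so positions 0..4095
// land in the 0..2048 range the model was trained on. Illustrative only.
#include <math.h>
#include <stdio.h>

static void rope_scaled(float * x, int n_dims, int p, float rope_scale) {
    const float theta_scale = powf(10000.0f, -2.0f / n_dims);
    float theta = rope_scale * (float) p;          // <-- the whole "patch"
    for (int i0 = 0; i0 < n_dims; i0 += 2) {
        const float c = cosf(theta), s = sinf(theta);
        const float x0 = x[i0], x1 = x[i0 + 1];
        x[i0]     = x0 * c - x1 * s;               // rotate the (i0, i0+1) pair
        x[i0 + 1] = x0 * s + x1 * c;
        theta *= theta_scale;                      // next (lower) frequency
    }
}

int main(void) {
    float v[4] = { 1.0f, 0.0f, 1.0f, 0.0f };
    rope_scaled(v, 4, 3000, 0.5f);                 // position 3000 behaves like 1500
    printf("%f %f %f %f\n", v[0], v[1], v[2], v[3]);
    return 0;
}
```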
This patch "scales" the RoPE position by a factor of
0.5
which should correspond to extending the max context size from 2048 to 4096.Running the following perplexity calculation for 7B LLaMA Q4_0 with context of 4096 yields:
Final result: 5.8945

This is already looking very promising, since without applying the "RoPE scaling" patch the perplexity is extremely bad - it starts off above 110.0, which can be expected since the vanilla computation does not support context sizes beyond 2048.

Additional tests with context size of 2048:
- without scaling (factor 1.0): [163] 5.4708
- with scaling factor 0.5: [163] 6.0642
I'm currently running the computations on the CPU as I have more confidence in the changes being correct, but we should look into updating the GPU code to support the RoPE scaling and doing more calculations to determine how the perplexity behaves for different context sizes.
The author of this idea @kaiokendev suggests that this approach should work even better with fine-tuned models (https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/comment/jp2dchb/?utm_source=share&utm_medium=web2x&context=3), so we should also do some tests with those models.
Result summary (live updates)

All runs use wiki.test.raw.

| Model | Context | RoPE scale | Perplexity |
| --- | --- | --- | --- |
| Q4_0 | 2048 | 1.0 | 5.4708 |
| Q4_0 | 2048 | 0.5 | 6.0642 |
| Q4_0 | 4096 | 1.0 | inf |
| Q4_0 | 4096 | 0.5 | 5.8945 |